gaborgsomogyi commented on issue #26590: [SPARK-29953][SS] Don't clean up source files for FileStreamSource if the files belong to the output of FileStreamSink
URL: https://github.com/apache/spark/pull/26590#issuecomment-557149731

@HeartSaVioR Checking all the files in all the directories in each micro-batch is definitely overkill. Considering metadata, we can have the following cases:

1. Metadata doesn't exist => the files were created outside of Spark, and starting a new Spark query intersecting with this directory should be considered an error.
2. Metadata exists in the root => Spark created it, so we must use it, and we can rely on it not being deleted.
3. Metadata exists but not in the root => Spark created part or all of the files, and in such a case delete/archive can break metadata <=> files consistency.

Only the last case is questionable. Considering the complexity of a full solution (globbing through the whole tree to find the metadata), we could document this as a `configuration error`. Of course, if there is a relatively simple way to detect it, then it would be a good idea to stop the query in advance (but at first glance I can't find such an easy way).
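The three cases above could be sketched roughly as follows. This is purely an illustration on the local filesystem under assumed names (`MetadataCheck` and its members are hypothetical; Spark's actual check would go through Hadoop's FileSystem API and the sink's metadata log, and the tree walk below is exactly the globbing cost the comment calls out):

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Hypothetical sketch: classify a source directory by where the
// FileStreamSink metadata directory ("_spark_metadata") appears.
object MetadataCheck {
  val MetadataDir = "_spark_metadata"

  sealed trait MetadataLocation
  case object NoMetadata       extends MetadataLocation // case 1: created outside Spark
  case object MetadataAtRoot   extends MetadataLocation // case 2: Spark-created, safe to use
  case object MetadataInSubdir extends MetadataLocation // case 3: delete/archive may break consistency

  def classify(root: Path): MetadataLocation = {
    // Case 2 is cheap to detect: a single existence check at the root.
    if (Files.isDirectory(root.resolve(MetadataDir))) {
      MetadataAtRoot
    } else {
      // Distinguishing cases 1 and 3 requires walking the whole tree,
      // which is the expensive part the comment argues against doing
      // in every micro-batch.
      val stream = Files.walk(root)
      try {
        val found = stream.iterator().asScala.exists { p =>
          Files.isDirectory(p) && p.getFileName.toString == MetadataDir
        }
        if (found) MetadataInSubdir else NoMetadata
      } finally {
        stream.close()
      }
    }
  }
}
```

The sketch makes the cost asymmetry concrete: case 2 is a single lookup, while ruling out case 3 forces a full traversal, which is why documenting case 3 as a configuration error may be the pragmatic choice.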
