gaborgsomogyi commented on issue #26590: [SPARK-29953][SS] Don't clean up source files for FileStreamSource if the files belong to the output of FileStreamSink
URL: https://github.com/apache/spark/pull/26590#issuecomment-557149731
 
 
@HeartSaVioR Checking all the files in all the directories in each micro-batch is definitely overkill.
Considering metadata, we can have the following cases:
1. Metadata doesn't exist => the files were created outside of Spark, so starting a new Spark query that intersects with this directory should be considered an error.
2. Metadata exists in the root => Spark created it, so we must use it, and we can rely on it not being deleted (a cheap root-level probe, sketched below, is enough to tell this apart from case 1).
3. Metadata exists but not in the root => Spark created part or all of the files, and in that case delete/archive can break the metadata <=> files consistency.
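A minimal sketch of the root-level probe, assuming the Hadoop `FileSystem` API and the `_spark_metadata` directory name that `FileStreamSink` writes into its output root (`metadataExistsAtRoot` is an illustrative helper name, not an existing Spark function):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Distinguish case 1 from case 2 with a single cheap filesystem probe:
// FileStreamSink keeps its log under "_spark_metadata" in the output root.
def metadataExistsAtRoot(sourceRoot: Path, hadoopConf: Configuration): Boolean = {
  val fs = sourceRoot.getFileSystem(hadoopConf)
  val metadataPath = new Path(sourceRoot, "_spark_metadata")
  fs.exists(metadataPath) && fs.getFileStatus(metadataPath).isDirectory
}
```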
   
Only the last case is questionable in terms of what to do. Considering the complexity of the possible solutions (globbing through the whole tree to find metadata, as sketched below), we can document this as a `configuration error`. Of course, if there were a relatively simple way to detect it, it would be a good idea to stop the query in advance (but at first glance I can't find such an easy way).
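To illustrate why the detection is costly, here is a hedged sketch of what the full-tree scan would look like, again assuming the Hadoop `FileSystem` API (`findSinkMetadataInTree` is an illustrative name, not an existing Spark helper):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Recursively walk every directory under the source root and collect any
// "_spark_metadata" directories found along the way. The cost is one
// listStatus call per directory, which is what makes running this per
// micro-batch (or even once, on a large tree) prohibitive.
def findSinkMetadataInTree(root: Path, hadoopConf: Configuration): Seq[Path] = {
  val fs: FileSystem = root.getFileSystem(hadoopConf)
  def walk(dir: Path): Seq[Path] = {
    val subDirs = fs.listStatus(dir).filter(_.isDirectory).map(_.getPath).toSeq
    val (hits, rest) = subDirs.partition(_.getName == "_spark_metadata")
    hits ++ rest.flatMap(walk)
  }
  walk(root)
}
```

On object stores such as S3, each `listStatus` call translates into remote list requests, so the scan scales with the total number of directories rather than with the number of metadata hits.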
