HeartSaVioR commented on issue #26590: [SPARK-29953][SS] Don't clean up source 
files for FileStreamSource if the files belong to the output of FileStreamSink
URL: https://github.com/apache/spark/pull/26590#issuecomment-556964599
 
 
   @zsxwing 
   Ah OK, got it. That's a good point - reading files in the FileStreamSink 
output directory without metadata information is unsafe anyway.
   
   Btw, @gaborgsomogyi and I also considered the edge cases where the 
query reads `sub-directory(-ies)` or an `ancestor with recursive option` of 
the FileStreamSink output directory, because the actual impact here is a kind 
of "side-effect" which "affects" other queries. It would be less problematic 
if the query merely read the directory "incorrectly" and produced incorrect 
output. The thing is, the query will also mess up the output directory, 
since processed files will be cleaned up, which will break other queries 
as well.
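   
   To make the edge cases concrete, here is a minimal sketch (not Spark code) 
of how a source directory could be classified against a sink output directory. 
The function name `relation_to_sink_output` and its return labels are 
hypothetical, and the sketch assumes plain absolute POSIX-style paths with no 
glob patterns, URI schemes, or symlinks:

```python
from pathlib import PurePosixPath


def relation_to_sink_output(source_dir: str, sink_output_dir: str,
                            recursive: bool = False) -> str:
    """Classify how a source directory overlaps a sink output directory.

    Hypothetical helper for illustration only; it compares path components
    and never touches a file system.
    """
    source = PurePosixPath(source_dir)
    sink = PurePosixPath(sink_output_dir)

    if source == sink:
        # The query reads the sink output directory itself.
        return "same"
    if sink in source.parents:
        # The query reads a sub-directory of the sink output directory.
        return "sub-directory"
    if recursive and source in sink.parents:
        # The query reads an ancestor and descends into the sink output.
        return "ancestor"
    return "no-overlap"


print(relation_to_sink_output("/data/out/part=1", "/data/out"))
print(relation_to_sink_output("/data", "/data/out", recursive=True))
```

   Only the "same" case is straightforward to detect; the other two overlapping 
cases are the ones where cleaning up processed files would silently remove 
files that other queries still expect in the sink output.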
   
   So I feel we still have to make a decision that takes the possible 
side-effects into account: 1) try our best to prevent all known cases, at 
(high?) cost, or 2) treat these edge cases as bad input that we don't handle 
at all (maybe we could document it instead). What do you think?

