HeartSaVioR opened a new pull request #26590: [SPARK-29953][SS] Don't clean up 
source files for FileStreamSource if the files belong to the output of 
FileStreamSink
URL: https://github.com/apache/spark/pull/26590
 
 
   ### What changes were proposed in this pull request?
   
   This patch prevents FileStreamSource from cleaning up source files that 
belong to the output of FileStreamSink. This is needed because the output of 
FileStreamSink can be read by multiple Spark queries, and those queries read 
the files based on the metadata log, which won't reflect the cleanup.
   
   To simplify the condition, this patch assumes that if the source files 
belong to the FileStreamSink, the matched source path is the root of the output 
directory for FileStreamSink. For example, suppose we provide a glob path 
`/a/b/c/*/*` and FileStreamSource processes the file `/a/b/c/d/e/f/g/file`. 
Then we only check `/a/b/c/d/e` to see whether a FileStreamSink metadata 
log is available there.
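   
   The path check in the example above can be sketched roughly as follows. This is an illustration in Python, not the code in this patch: `matched_root` and `should_skip_cleanup` are hypothetical names, `fnmatch` only approximates Hadoop glob syntax (no `{a,b}` alternation), and the sketch assumes the sink's metadata log lives in a `_spark_metadata` directory directly under the matched root.

```python
import fnmatch
import os
import posixpath
from pathlib import PurePosixPath

# Directory name FileStreamSink uses for its metadata log.
SINK_METADATA_DIR = "_spark_metadata"


def matched_root(glob_path, file_path):
    """Return the prefix of file_path matched by glob_path, or None."""
    glob_parts = PurePosixPath(glob_path).parts
    file_parts = PurePosixPath(file_path).parts
    if len(file_parts) < len(glob_parts):
        return None
    prefix = file_parts[:len(glob_parts)]
    # Compare component by component so '*' never spans a '/'.
    for pattern, part in zip(glob_parts, prefix):
        if not fnmatch.fnmatchcase(part, pattern):
            return None
    return str(PurePosixPath(*prefix))


def should_skip_cleanup(glob_path, file_path, is_dir=os.path.isdir):
    """True if the matched root looks like a FileStreamSink output dir.

    is_dir is injectable (hypothetical hook) so the check can be
    exercised without touching a real file system.
    """
    root = matched_root(glob_path, file_path)
    if root is None:
        return False
    return is_dir(posixpath.join(root, SINK_METADATA_DIR))


print(matched_root("/a/b/c/*/*", "/a/b/c/d/e/f/g/file"))  # /a/b/c/d/e
```

   Only the matched root is inspected; deeper directories such as `/a/b/c/d/e/f` are never checked, which is the simplification described above.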
   
   ### Why are the changes needed?
   
   Without this patch, if end users turn on the cleanup option for a path that 
is the output of FileStreamSink, the metadata log and the files actually 
available may get out of sync, which may break other queries reading the path.
   
   ### Does this PR introduce any user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Added UT.
