zsxwing commented on issue #26590: [SPARK-29953][SS] Don't clean up source 
files for FileStreamSource if the files belong to the output of FileStreamSink
URL: https://github.com/apache/spark/pull/26590#issuecomment-557216576
 
 
   > Checking all the files in all the directories in each micro-batch is 
definitely an overkill.
   
   +1.
   
   I think the fundamental issue is that the `FileIndex` interface doesn't work 
for complicated cases. There are multiple issues here. Another example: if a 
user is using a glob path in `FileStreamSource`, we always go to 
`InMemoryFileIndex`, even if some of the matched paths were created by 
`FileStreamSink`. `InMemoryFileIndex` knows nothing about 
`MetadataLogFileIndex` and uses its own logic to list files.
   
   Ideally, the defensive checks should be added to the file listing itself if 
we would like to prevent such cases, because that would also prevent reading 
incorrect files. However, I think that's a pretty large change and probably not 
worth it (I have not yet figured out how to make Hadoop's glob pattern code 
understand `MetadataLogFileIndex`; it may be impossible).
   
   Hence I suggest we just block the `cleanSource` option when listing files 
using `MetadataLogFileIndex`.
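
   A rough sketch of the kind of check I have in mind, written in Python with 
stand-in classes rather than the real Spark types. The function name 
`check_clean_source`, the error message, and the empty class bodies are all 
hypothetical; only the index class names and the `cleanSource` values 
(`archive`, `delete`, `off`) come from Spark.

```python
# Stand-in types for illustration only; the real classes live in Spark
# and do far more than this.
class FileIndex:
    """Hypothetical minimal model of a file index."""


class InMemoryFileIndex(FileIndex):
    """Models an index that lists files with its own logic (e.g. glob paths)."""


class MetadataLogFileIndex(FileIndex):
    """Models an index backed by the metadata log written by FileStreamSink."""


def check_clean_source(file_index: FileIndex, clean_source: str) -> None:
    """Reject source cleanup when files are listed via the sink metadata log.

    Assumption: `clean_source` is one of "archive", "delete", or "off".
    """
    cleanup_enabled = clean_source.lower() != "off"
    if cleanup_enabled and isinstance(file_index, MetadataLogFileIndex):
        raise ValueError(
            "cleanSource is not supported when the source path is the "
            "output of FileStreamSink"
        )


# Cleanup stays allowed for a plain listing, and "off" is always accepted.
check_clean_source(InMemoryFileIndex(), "delete")
check_clean_source(MetadataLogFileIndex(), "off")
```

   Note that a check like this would not cover the glob path case above, since 
that case goes through `InMemoryFileIndex` and never sees the metadata log.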
