mikedias commented on issue #23782: [SPARK-26875][SQL] Add an option on FileStreamSource to include modified files
URL: https://github.com/apache/spark/pull/23782#issuecomment-463625508

Some producers do not care much about the uniqueness of filenames, which can, and often does, lead to files being overwritten. The motivation for this patch is exactly the case where you can't change the producer's behavior 😄

In my view, this option is a good complement to https://github.com/apache/spark/pull/22952, where we would be able to archive/delete processed files. Without this option, if we upload a file with the same name as a previously processed and deleted one, it would not get processed, which is non-intuitive behavior.

Addressing your concerns:

- No random exception will be introduced by the option. It only changes whether a file is considered for processing in each microbatch. The possible race condition that you mention can happen even for a brand new file that is still being written while it is processed; it is not related to the patch.
- Again, the patch does not change anything about how the files are processed. It just introduces another option to control which files count as already processed, in addition to the filename. When enabled, it basically treats an already processed file with a new timestamp as a new file again.
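
To make the last point concrete, here is a minimal sketch of the selection rule described above. It is not the code from the patch or from Spark's FileStreamSource: the names `SeenFiles`, `include_modified`, and `select_batch` are hypothetical, and the sketch assumes each file listing yields a path and a modification timestamp.

```python
class SeenFiles:
    """Hypothetical tracker of processed files (illustration only, not Spark code)."""

    def __init__(self, include_modified=False):
        # Assumed stand-in for the proposed option; the real option name may differ.
        self.include_modified = include_modified
        self._seen = {}  # path -> modification time recorded when it was processed

    def is_new(self, path, mtime):
        last = self._seen.get(path)
        if last is None:
            return True  # never processed: always eligible
        # Already processed: eligible again only if the option is on
        # and the file carries a newer timestamp.
        return self.include_modified and mtime > last

    def mark_processed(self, path, mtime):
        self._seen[path] = mtime


def select_batch(tracker, listing):
    """Return the paths from `listing` to process in this microbatch."""
    batch = [(path, mtime) for path, mtime in listing if tracker.is_new(path, mtime)]
    for path, mtime in batch:
        tracker.mark_processed(path, mtime)
    return [path for path, _ in batch]
```

With `include_modified=False`, a re-uploaded `a.csv` is skipped no matter what its timestamp is. With `include_modified=True`, it is selected again only when its timestamp is newer than the one recorded. How each selected file is then read is unchanged in both cases.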
