mikedias commented on issue #23782: [SPARK-26875][SQL] Add an option on 
FileStreamSource to include modified files
URL: https://github.com/apache/spark/pull/23782#issuecomment-463625508
 
 
   Some producers do not care much about the uniqueness of filenames, which 
often leads to files being overwritten. This patch is motivated by exactly the 
case where you can't change the producer's behavior 😄 
   
   In my view, this option is a good complement to 
https://github.com/apache/spark/pull/22952, where we would be able to 
archive/delete processed files. Without this option, if we upload a file with 
the same name as a previously processed and deleted one, it would not get 
processed, which is non-intuitive behavior.
   
   Addressing your concerns:
    - The option will not introduce any random exceptions. It only changes 
whether a file is considered for processing in each microbatch. The race 
condition that you mention can also happen for a brand new file that is still 
being written while it is processed, so it is not related to the patch.
    - Again, the patch does not change anything about how the files are 
processed. It just adds an option to control which files count as already 
processed, using more than the filename. When enabled, it basically treats an 
already processed file with a new timestamp as a new file again.
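   
   To make the second point concrete, here is a minimal sketch of the 
selection rule in plain Python. It is not the actual FileStreamSource code: 
the `SeenFiles` class, the `include_modified` flag and the `select_batch` 
helper are hypothetical names used only for illustration, and modification 
times are plain integers.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SeenFiles:
    """Hypothetical stand-in for the source's record of processed files."""

    # Hypothetical flag; the real option name in the patch may differ.
    include_modified: bool = False
    # Maps path -> modification time recorded when the file was last selected.
    _seen: Dict[str, int] = field(default_factory=dict)

    def is_new(self, path: str, mtime: int) -> bool:
        if path not in self._seen:
            return True
        if not self.include_modified:
            # Default behavior: a known path is never selected again.
            return False
        # Option enabled: a newer timestamp makes the file count as new.
        return mtime > self._seen[path]

    def select_batch(self, listing: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
        """Pick the files for one microbatch and record them as seen."""
        batch = [(path, mtime) for path, mtime in listing if self.is_new(path, mtime)]
        for path, mtime in batch:
            self._seen[path] = mtime
        return batch
```

   With `include_modified=False`, a second listing that contains the same 
path with a newer timestamp selects nothing, which is the surprising case 
after a delete and re-upload. With `include_modified=True`, that file is 
selected again, while a listing with an unchanged timestamp is still skipped. 
How the selected files are read is untouched in both cases.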
   
   
   
   
   
   

