mikedias commented on issue #23782: [SPARK-26875][SS] Add an option on 
FileStreamSource to include modified files
URL: https://github.com/apache/spark/pull/23782#issuecomment-553598174
 
 
   Hi @zsxwing, thanks for taking the time to share your thoughts. The idea of 
this configuration is to add another condition for deciding whether a file 
should be processed. It makes no assumptions about concurrently modified 
files or anything else. Everything else remains the same.
   
   The scenario that I'm trying to solve here is:
    - User uploads the file `lastest_sales.csv` to the source folder
    - Spark processes it
    - File `lastest_sales.csv` gets deleted (manually or via the new 
configuration https://github.com/apache/spark/pull/22952)
    - User uploads a new file, also named `lastest_sales.csv`, to the source folder
    - Spark does not process it because it has already processed a file with 
the `lastest_sales.csv` filename
    - User gets confused. Even if this is explained/documented, there is no way 
to tell which filenames were already processed.
   
   What this PR proposes is simple: if enabled, instead of checking only the 
`filename` to determine whether a file was already processed, check the file 
`timestamp` as well. Race conditions, file system specifics, stream semantics, 
and everything else remain the same.
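
   To make the scenario above concrete, here is a minimal sketch of the check 
in plain Python. It is not the code from this PR or from `FileStreamSource`: 
the `SeenFiles` class, the `include_modified` flag, and `run_scenario` are 
hypothetical names used only for illustration, and it assumes that a strictly 
newer modification timestamp is what marks a re-uploaded file as new.

```python
from dataclasses import dataclass, field


@dataclass
class SeenFiles:
    """Hypothetical tracker of processed files (illustration only)."""

    # Assumed stand-in for the proposed option; off by default.
    include_modified: bool = False
    # Maps filename -> modification timestamp recorded when it was processed.
    seen: dict = field(default_factory=dict)

    def is_new(self, path: str, timestamp: int) -> bool:
        if path not in self.seen:
            return True
        # With the option off, a known filename is never processed again.
        # With it on, a newer timestamp makes the file eligible again.
        return self.include_modified and timestamp > self.seen[path]

    def mark_processed(self, path: str, timestamp: int) -> None:
        self.seen[path] = timestamp


def run_scenario(include_modified: bool) -> list:
    """Replay the upload / delete / re-upload scenario from the list above."""
    tracker = SeenFiles(include_modified=include_modified)
    processed = []
    # Same filename uploaded twice, the second time with a newer timestamp.
    uploads = [("lastest_sales.csv", 100), ("lastest_sales.csv", 200)]
    for path, timestamp in uploads:
        if tracker.is_new(path, timestamp):
            processed.append((path, timestamp))
            tracker.mark_processed(path, timestamp)
    return processed
```

   With `run_scenario(False)` only the first upload is processed, which is the 
confusing behavior described above; with `run_scenario(True)` both uploads 
are processed.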
   

