mikedias commented on issue #23782: [SPARK-26875][SS] Add an option on FileStreamSource to include modified files
URL: https://github.com/apache/spark/pull/23782#issuecomment-553598174

Hi @zsxwing, thanks for taking the time to share your thoughts.

The idea of this configuration is to add another condition for deciding whether a file should be processed. It does not make any assumptions about concurrently modified files or anything else. Everything remains the same.

The scenario that I'm trying to solve here is:
- User uploads the file `lastest_sales.csv` to the source folder
- Spark processes it
- File `lastest_sales.csv` gets deleted (manually or via the new configuration https://github.com/apache/spark/pull/22952)
- User uploads the file `lastest_sales.csv` to the source folder again
- Spark does not process it because it already processed the `lastest_sales.csv` filename
- User gets confused. Even if explained/documented, there is no way to tell which filenames were already processed.

What this PR simply proposes is: if enabled, instead of only checking the `filename` to determine whether a file was already processed, check the file `timestamp` as well.

Race conditions, file system specifics, stream semantics, and everything else remain the same.
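
To make the proposed check concrete, here is a minimal sketch in plain Python. It is not Spark's actual `FileStreamSource` code: the class `SeenFiles`, the flag `include_modified`, and the helper `replay_scenario` are hypothetical names chosen for illustration, and the timestamps are made-up values in milliseconds.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SeenFiles:
    """Hypothetical tracker of processed files (illustration only, not Spark's API)."""

    # Assumed option name; when False, only the filename is compared.
    include_modified: bool = False
    # Maps a file path to the modification time (ms) recorded when it was processed.
    _seen: Dict[str, int] = field(default_factory=dict)

    def is_new(self, path: str, mtime_ms: int) -> bool:
        last = self._seen.get(path)
        if last is None:
            # Never processed under this name.
            return True
        # Same name seen before: reprocess only if the option is enabled
        # and the file carries a newer timestamp.
        return self.include_modified and mtime_ms > last

    def mark_processed(self, path: str, mtime_ms: int) -> None:
        self._seen[path] = mtime_ms


def replay_scenario(include_modified: bool) -> List[bool]:
    """Replays the upload, delete, re-upload scenario; returns whether each upload is processed."""
    tracker = SeenFiles(include_modified=include_modified)
    decisions = []
    # Two uploads of the same filename with different (made-up) timestamps.
    for mtime_ms in (1000, 2000):
        process = tracker.is_new("lastest_sales.csv", mtime_ms)
        decisions.append(process)
        if process:
            tracker.mark_processed("lastest_sales.csv", mtime_ms)
    return decisions
```

With the option disabled, `replay_scenario(False)` skips the second upload, which is the confusing behavior described above. With it enabled, `replay_scenario(True)` processes both uploads because the second one has a newer timestamp.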
