mikedias commented on issue #23782: [SPARK-26875][SS] Add an option on FileStreamSource to include modified files
URL: https://github.com/apache/spark/pull/23782#issuecomment-463881234

I think the archive/delete race condition can be addressed by checking the file's timestamp before archiving or deleting it. If it matches the timestamp of the version that was processed, proceed; if not, skip. This extra step would only need to run when `includeModifiedFiles` is enabled, since that option signals that files can be overwritten.

As for end users' expectations: if they upload a file and it gets deleted or archived, they probably expect a new file with the same name to be processed as well when it is uploaded again. Not processing that file is unintuitive, and it is also hard to debug, because users would have to know which file names were processed in the past. "Why is my file not getting processed?" could become a frequently asked question.

I totally understand the implications of files being unintentionally modified, as rightly pointed out by @HeartSaVioR, and that's why the option is `false` by default. But I do think we need to provide an option that covers more use cases and gives a solution to users who understand that their files can be overwritten.
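To make the proposed timestamp check concrete, here is a minimal sketch of the idea. It runs against a local file system in plain Python and is not Spark's `FileStreamSource` code; the function name `cleanup_if_unchanged` and its parameters are hypothetical and exist only for illustration.

```python
import os
import shutil


def cleanup_if_unchanged(path, processed_mtime_ns, archive_dir=None):
    """Archive (or delete) `path` only if its modification time still matches
    the one recorded when the file was processed.

    Returns True if the file was cleaned up, False if it was skipped.
    """
    try:
        current_mtime_ns = os.stat(path).st_mtime_ns
    except FileNotFoundError:
        return False  # already gone, nothing to clean up

    if current_mtime_ns != processed_mtime_ns:
        # The file was overwritten after it was processed: skip it so the
        # new version can still be picked up by the source.
        return False

    if archive_dir is None:
        os.remove(path)
    else:
        os.makedirs(archive_dir, exist_ok=True)
        shutil.move(path, os.path.join(archive_dir, os.path.basename(path)))
    return True
```

Note that this only narrows the race window: a file could still be overwritten between the `os.stat` call and the delete or move, so the check reduces the problem without fully eliminating it.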
