mikedias commented on issue #23782: [SPARK-26875][SS] Add an option on FileStreamSource to include modified files
URL: https://github.com/apache/spark/pull/23782#issuecomment-555509325

Thank you @zsxwing @HeartSaVioR for taking the time to review this PR. Glad to see activity here :)

@zsxwing The files are uploaded directly to the source folder. I don't see a need for an intermediate step that moves files around.

@HeartSaVioR A common use case for deleting files outside Spark is to remove old files that sit in the source folder and degrade the performance of the ListObjects operation. We use S3 lifecycle policies to delete the files after 15 days, which gives Spark plenty of time to process them.

I agree that having an option with `modified file` in its name might suggest that Spark provides extra guarantees against race conditions. But what if we rename it to something like `fileUniqueness: filename` (default) and `fileUniqueness: filename+timestamp`? That would more accurately describe what the PR is trying to achieve and would not set wrong expectations.
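
To make the proposed distinction concrete, here is a minimal Python sketch of how the two modes could decide whether a listed file counts as new. It is an illustration only, not the code in this PR or Spark's actual seen-files tracking: `FileEntry`, `uniqueness_key` and `SeenFiles` are hypothetical names, and `fileUniqueness` is still just the option name suggested above.

```python
from dataclasses import dataclass
from typing import Hashable, Set


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical listing entry: a path plus its last-modified time."""
    path: str
    modified_ms: int  # last-modified timestamp in milliseconds


def uniqueness_key(entry: FileEntry, mode: str) -> Hashable:
    """Build the key that identifies a file under the given mode."""
    if mode == "filename":
        # A path is processed once, however often it is overwritten.
        return entry.path
    if mode == "filename+timestamp":
        # An overwritten path with a new timestamp counts as a new file.
        return (entry.path, entry.modified_ms)
    raise ValueError(f"unknown fileUniqueness mode: {mode!r}")


class SeenFiles:
    """Tracks which files have already been handed to the stream."""

    def __init__(self, mode: str = "filename") -> None:
        uniqueness_key(FileEntry("", 0), mode)  # fail fast on a bad mode
        self.mode = mode
        self._seen: Set[Hashable] = set()

    def is_new(self, entry: FileEntry) -> bool:
        """Return True the first time a key is seen, False afterwards."""
        key = uniqueness_key(entry, self.mode)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

With `filename`, a second upload to the same path is ignored. With `filename+timestamp`, it is picked up again only if its modification time differs, and neither mode adds protection against a file being modified while it is read.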
