HeartSaVioR commented on issue #23782: [SPARK-26875][SS] Add an option on FileStreamSource to include modified files
URL: https://github.com/apache/spark/pull/23782#issuecomment-555556037

@mikedias

> A common use case for deleting files outside Spark is to remove the old files sitting in the source folder impacting the performance of the ListObjects operation. We use S3 lifecycle policies to delete the files after 15 days (giving plenty of time for Spark to process them).

I agree that some manual operations have to be taken for such a case, but since Spark has no idea about the current status of files once they can be modified, that is unfortunately at your own risk. There's an option `spark.files.ignoreMissingFiles` which helps tolerate file deletion, but there's no option that helps tolerate "overwrite", and @zsxwing already explained in depth how hard it is to do that right.

SPARK-20568 provides an official way to remove/archive processed files safely. So I would agree there's a valid case where end users look at the folder, confirm a file doesn't exist, and place a new file at that path, while another file with the same path was actually already processed and removed/archived. I guess that might be considered if we all agree this is a valid use case.
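For context, a minimal sketch (not code from this PR) of the SPARK-20568 approach mentioned above, assuming the `cleanSource` / `sourceArchiveDir` options as they were eventually added to FileStreamSource; the paths, schema, and formats here are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object FileStreamCleanupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("file-stream-cleanup-sketch")
      .getOrCreate()

    // File source with Spark-managed cleanup of processed files,
    // instead of deleting/overwriting them outside Spark.
    val input = spark.readStream
      .format("json")
      .schema("id LONG, payload STRING")                    // file sources need an explicit schema
      .option("cleanSource", "archive")                     // or "delete"; "off" is the default
      .option("sourceArchiveDir", "s3a://bucket/archive")   // hypothetical archive location
      .load("s3a://bucket/input")

    val query = input.writeStream
      .format("parquet")
      .option("path", "s3a://bucket/output")
      .option("checkpointLocation", "s3a://bucket/checkpoints/file-stream")
      .start()

    query.awaitTermination()
  }
}
```

With this, Spark itself archives (or deletes) a source file only after it has been committed to the log, which avoids the race described above where a file is removed or replaced while its path is still tracked as "seen".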
