HeartSaVioR edited a comment on issue #23782: [SPARK-26875][SS] Add an option 
on FileStreamSource to include modified files
URL: https://github.com/apache/spark/pull/23782#issuecomment-555556037
 
 
   @mikedias 
   
   > A common use case for deleting files outside Spark is to remove the old 
files sitting in the source folder impacting the performance of the ListObjects 
operation. We use S3 lifecycle policies to delete the files after 15 days 
(giving plenty of time to Spark process them).
   
   I agree that such cases call for some manual operations, but Spark cannot 
know the current state of files once they can be modified outside of it, so 
unfortunately that has to be at your own risk. The 
`spark.files.ignoreMissingFiles` option helps tolerate file deletion, but there 
is no comparable option for tolerating an "overwrite", and @zsxwing has already 
explained in depth how hard it is to do that right.
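
To illustrate the difference between the two cases, here is a minimal sketch in 
plain Python. It is a toy model, not Spark's implementation: the 
`PathKeyedSource` class, its `ignore_missing_files` flag, and the file paths are 
all hypothetical, and the model assumes processed files are remembered by path 
only.

```python
from dataclasses import dataclass, field


@dataclass
class PathKeyedSource:
    """Toy model of a file source that remembers processed files by path only."""

    ignore_missing_files: bool = False  # hypothetical stand-in for the option
    seen: set = field(default_factory=set)

    def next_batch(self, listing, storage):
        """Return the contents of files that were not processed before.

        listing: paths returned by an earlier directory listing
        storage: dict of path -> content at read time (may differ from listing)
        """
        batch = []
        for path in listing:
            if path in self.seen:
                # Same path counts as "already processed", whatever the content.
                continue
            if path not in storage:
                if self.ignore_missing_files:
                    # Tolerate a file deleted between listing and reading.
                    continue
                raise FileNotFoundError(path)
            self.seen.add(path)
            batch.append(storage[path])
        return batch


source = PathKeyedSource(ignore_missing_files=True)
storage = {"in/a.json": "v1", "in/b.json": "v1"}
listing = sorted(storage)
del storage["in/b.json"]  # deleted after listing, before reading
first = source.next_batch(listing, storage)

storage["in/a.json"] = "v2"  # overwritten in place after being processed
second = source.next_batch(sorted(storage), storage)
```

In this model the deleted file is skipped without an error, while the 
overwritten file is silently ignored: `second` is empty even though the content 
changed.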
   
   SPARK-20568 officially provides a safe way to remove or archive processed 
files, so I would agree there is a valid case: end users look at the folder, 
confirm the file doesn't exist (so they are NOT overwriting an existing file), 
and put a file there, while another file with the same path was actually 
processed and then removed/archived. I guess that could be considered if we all 
agree that this is a valid use case.
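
The scenario can be sketched with a small plain-Python model. Everything here is 
an assumption made for illustration: the `new_files` helper, the paths, and the 
timestamps are hypothetical, and keying on path plus modification time is only 
one conceivable way to recognize a re-added file, not a description of what 
Spark or this PR does.

```python
def new_files(listing, seen, key):
    """Return entries whose key was not seen yet, and record them as seen.

    listing: list of (path, modification_time) pairs
    seen: set of keys already processed (updated in place)
    key: function mapping an entry to the identity used for deduplication
    """
    fresh = [entry for entry in listing if key(entry) not in seen]
    seen.update(key(entry) for entry in fresh)
    return fresh


def by_path(entry):
    """Identify a file by its path only."""
    return entry[0]


def by_path_and_mtime(entry):
    """Identify a file by its path together with its modification time."""
    return entry


seen_path, seen_both = set(), set()

# Batch 1: the file arrives and is processed, then archived out of the folder.
arrival = [("in/data.json", 100)]
new_files(arrival, seen_path, by_path)
new_files(arrival, seen_both, by_path_and_mtime)

# Batch 2: the user sees an empty folder and puts a new file at the same path.
re_added = [("in/data.json", 200)]
skipped = new_files(re_added, seen_path, by_path)
picked_up = new_files(re_added, seen_both, by_path_and_mtime)
```

With path-only tracking the re-added file is never processed (`skipped` is 
empty), while the path-and-time key treats it as new.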
