HeartSaVioR commented on issue #23782: [SPARK-26875][SS] Add an option on 
FileStreamSource to include modified files
URL: https://github.com/apache/spark/pull/23782#issuecomment-555556037
 
 
   @mikedias 
   
   > A common use case for deleting files outside Spark is to remove the old 
files sitting in the source folder that impact the performance of the ListObjects 
operation. We use S3 lifecycle policies to delete the files after 15 days 
(giving Spark plenty of time to process them).
   
   I agree that some manual operations have to be taken in such cases, but since 
Spark has no idea about the current status of files once they can be modified 
externally, unfortunately that has to be at your own risk. There's an option, 
`spark.files.ignoreMissingFiles`, which helps tolerate file deletion, but for 
"overwrite" there's no option that helps tolerate it, and @zsxwing has already 
explained in depth how hard it is to do this right.
   
   SPARK-20568 officially provides a way to remove/archive processed files in a 
safe manner, so I would agree there's a valid case where end users look at the 
folder, confirm a file doesn't exist, and put a file there while another file 
with the same path was actually already processed and removed/archived. I guess 
that might be considered if we all agree this is a valid use case.
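   
   For reference, a hedged sketch of how the SPARK-20568 clean-up could be 
configured on the source side, assuming the `cleanSource` / `sourceArchiveDir` 
options introduced by that ticket (Spark 3.0+); bucket paths and schema are 
hypothetical:
   
   ```scala
   // Hedged sketch, assuming the cleanSource / sourceArchiveDir options added by
   // SPARK-20568: let the source archive completed files itself instead of
   // relying on an external lifecycle policy. Paths are hypothetical.
   val cleaned = spark.readStream
     .format("json")
     .schema("id INT, payload STRING")               // hypothetical schema
     .option("cleanSource", "archive")               // or "delete" / "off"
     .option("sourceArchiveDir", "s3a://my-bucket/archive/")
     .load("s3a://my-bucket/incoming/")
   ```
   
   With that in place, the remaining tricky part is exactly what this thread 
discusses: a new file reappearing under a path that was already processed and 
removed/archived.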
