mikedias commented on issue #23782: [SPARK-26875][SS] Add an option on 
FileStreamSource to include modified files
URL: https://github.com/apache/spark/pull/23782#issuecomment-555509325
 
 
   Thank you @zsxwing and @HeartSaVioR for taking the time to review this PR. 
Glad to see activity here :) 
   
   @zsxwing The files are uploaded directly to the source folder. I don't see a 
need for an intermediate step that moves files around. 
   
   @HeartSaVioR A common use case for deleting files outside Spark is to remove 
old files that sit in the source folder and slow down the ListObjects 
operation. We use S3 lifecycle policies to delete the files after 15 days 
(giving Spark plenty of time to process them). 
   
   I agree that having an option with `modified file` in its name might 
suggest that Spark is providing extra guarantees against race conditions. But 
what if we rename it to something like `fileUniqueness: filename` (default) 
and `fileUniqueness: filename+timestamp`? That would describe more accurately 
what the PR is trying to achieve and would not set wrong expectations.
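   
   To make the proposed semantics concrete, here is a minimal sketch in plain 
Python. It is not Spark code: `file_key` and `SeenFiles` are hypothetical 
names used only for illustration, and `fileUniqueness` is just the option name 
suggested above, not an existing Spark option. The only point is that the 
"already seen" check keys on the path alone in one mode and on the path plus 
modification time in the other.

```python
# Hypothetical mode names taken from the proposal above; not a real Spark option.
FILENAME = "filename"
FILENAME_TIMESTAMP = "filename+timestamp"


def file_key(path, modified_ms, mode=FILENAME):
    """Return the identity used to decide whether a file was already processed."""
    if mode == FILENAME:
        # Default: a file is identified by its path only.
        return (path,)
    if mode == FILENAME_TIMESTAMP:
        # A rewritten file (same path, newer timestamp) counts as a new file.
        return (path, modified_ms)
    raise ValueError(f"unknown fileUniqueness mode: {mode!r}")


class SeenFiles:
    """Toy stand-in for the source's record of processed files."""

    def __init__(self, mode=FILENAME):
        self.mode = mode
        self._seen = set()

    def is_new(self, path, modified_ms):
        """Record the file and report whether it should be processed."""
        key = file_key(path, modified_ms, self.mode)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


# Same path uploaded twice with different modification times.
by_name = SeenFiles(FILENAME)
by_name.is_new("in/a.json", 1000)
by_name.is_new("in/a.json", 2000)       # skipped: path already seen

by_name_and_ts = SeenFiles(FILENAME_TIMESTAMP)
by_name_and_ts.is_new("in/a.json", 1000)
by_name_and_ts.is_new("in/a.json", 2000)  # processed again: timestamp changed
```

   Neither mode says anything about files that are modified while they are 
being read, which is why a name describing uniqueness may set expectations 
better than one describing modification.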
   
   

