mikedias commented on issue #23782: [SPARK-26875][SS] Add an option on 
FileStreamSource to include modified files
URL: https://github.com/apache/spark/pull/23782#issuecomment-463881234
 
 
   I think the archive/delete race condition can be addressed by checking the 
file's timestamp before archiving or deleting it. If it matches the timestamp 
recorded when the file was processed, proceed; if not, skip. This extra check 
only needs to run when `includeModifiedFiles` is enabled, since that option 
signals that files can be overwritten. 
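
   A minimal sketch of that check, written against the local file system with 
`java.nio.file` rather than Hadoop's `FileSystem` API to keep it self-contained. 
`GuardedCleanup` and `deleteIfUnchanged` are hypothetical names for illustration, 
not part of Spark or of this PR:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class GuardedCleanup {
    /**
     * Deletes the file only if its modification time still equals the one
     * recorded when the file was processed. Returns true if it was deleted.
     */
    public static boolean deleteIfUnchanged(Path file, long processedModTimeMs)
            throws IOException {
        if (!Files.exists(file)) {
            return false; // already gone, nothing to clean up
        }
        long currentModTimeMs = Files.getLastModifiedTime(file).toMillis();
        if (currentModTimeMs != processedModTimeMs) {
            // Overwritten since it was processed: skip cleanup so the new
            // version can still be picked up.
            return false;
        }
        Files.delete(file);
        return true;
    }
}
```

   Note that this only narrows the window: a file overwritten between the 
timestamp check and the delete would still be lost.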
   
   As for end users' expectations: if they upload a file and it gets 
deleted/archived, they probably expect a new file with the same name to be 
processed as well when it is uploaded again. Not processing that file is 
unintuitive, and it is also hard to find out which file names were processed 
in the past. "Why is my file not getting processed?" could easily become a 
frequently asked question.
   
   I totally understand the implications of files being unintentionally 
modified, as @HeartSaVioR rightly pointed out, and that's why the option is 
`false` by default. But I do think we need to provide an option that covers 
more use cases and gives a solution to users who understand that their files 
can be overwritten.
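
   To illustrate the behavior difference, here is a simplified sketch of 
seen-file tracking keyed by path. It is an assumption about the general idea 
and not the actual `FileStreamSource` code; `SeenFilesSketch` and 
`shouldProcess` are hypothetical names:

```java
import java.util.HashMap;
import java.util.Map;

public class SeenFilesSketch {
    // Maps each processed path to the modification time seen at processing.
    private final Map<String, Long> seen = new HashMap<>();
    private final boolean includeModifiedFiles;

    public SeenFilesSketch(boolean includeModifiedFiles) {
        this.includeModifiedFiles = includeModifiedFiles;
    }

    /** Returns true if the file should be processed, and records it as seen. */
    public boolean shouldProcess(String path, long modTimeMs) {
        Long previous = seen.get(path);
        // Default: a path is processed once. With the option enabled, a newer
        // modification time makes the same path eligible again.
        boolean isNew = previous == null
                || (includeModifiedFiles && modTimeMs > previous);
        if (isNew) {
            seen.put(path, modTimeMs);
        }
        return isNew;
    }
}
```

   With the option off, a re-uploaded `data.csv` is silently ignored; with it 
on, the newer timestamp makes it eligible again.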
   

