[GitHub] [spark] zsxwing edited a comment on issue #23782: [SPARK-26875][SS] Add an option on FileStreamSource to include modified files

GitBox Wed, 13 Nov 2019 11:11:50 -0800

zsxwing edited a comment on issue #23782: [SPARK-26875][SS] Add an option on 
FileStreamSource to include modified files
URL: https://github.com/apache/spark/pull/23782#issuecomment-553548527
 
 
   Thanks a lot for your contribution. However, I think overwriting an existing 
file is an anti-pattern. Most storage systems cannot handle this properly:
   
   - S3: overwrites are eventually consistent. There is no guarantee which 
version we will get.
   - Azure blob storage and data lake: reading a file that's being modified may 
throw a conflict error.
   - HDFS: I don't know whether it supports reading a file that's being 
modified.
   
   Generally, the file stream source requires that files appear in the 
directory atomically, so that we don't need to handle the case where Spark 
reads a file that is still being written. Overwriting an existing file breaks 
this assumption.
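
For illustration, here is a minimal sketch of how a producer could make files appear atomically, assuming the staging directory and the watched directory are on the same POSIX-style local filesystem where rename is atomic. The function name `publish_atomically` and the `_staging` directory are hypothetical and not part of Spark. Object stores such as S3 do not offer an atomic rename, so this sketch does not carry over to them as-is.

```python
import os
import tempfile


def publish_atomically(data: bytes, watched_dir: str, name: str) -> str:
    """Write data to a staging file, then rename it into watched_dir.

    Hypothetical helper: readers of watched_dir either see the complete
    file or no file at all, never a partially written one.
    """
    # Stage next to the watched directory so the rename stays on one
    # filesystem (an assumption; a cross-device rename is not atomic).
    parent = os.path.dirname(os.path.abspath(watched_dir))
    staging_dir = os.path.join(parent, "_staging")
    os.makedirs(staging_dir, exist_ok=True)
    os.makedirs(watched_dir, exist_ok=True)

    final_path = os.path.join(watched_dir, name)
    # Refuse to overwrite: every published file gets a new name.
    # This check is best-effort and assumes a single producer per name.
    if os.path.exists(final_path):
        raise FileExistsError(final_path)

    fd, tmp_path = tempfile.mkstemp(dir=staging_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes are on disk first
        os.rename(tmp_path, final_path)  # the file becomes visible here
    except BaseException:
        # Clean up the staging file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

With this pattern, updated data is published under a new file name instead of overwriting an existing file, so the source never has to read a file that is still changing.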

