zsxwing edited a comment on issue #23782: [SPARK-26875][SS] Add an option on FileStreamSource to include modified files
URL: https://github.com/apache/spark/pull/23782#issuecomment-553548527

Thanks a lot for your contribution. However, I think overwriting an existing file is an anti-pattern. Most storage systems cannot handle this properly:

- S3: overwrites are eventually consistent. There is no guarantee which version we will get.
- Azure Blob Storage and Data Lake: reading a file that's being modified may throw a conflict error.
- HDFS: I don't know whether it supports reading a file that's being modified.

Generally, the file stream source requires that files appear in the directory atomically, so that we don't need to handle the case where Spark reads a file that is still being written. Overwriting an existing file breaks this assumption.
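
For illustration, one way a producer can satisfy the "files appear atomically" requirement is to stage each file outside the watched directory and then rename it in under a fresh name, so nothing is ever overwritten. The sketch below is not part of this PR or of Spark; `publish_atomically`, `watched_dir`, `staging_dir`, and the `part-` naming are hypothetical. It assumes a local POSIX filesystem where the staging and watched directories are on the same volume, since a rename is only atomic in that case. Object stores such as S3 behave differently.

```python
import os
import tempfile
import uuid


def publish_atomically(data: bytes, watched_dir: str, staging_dir: str,
                       suffix: str = ".json") -> str:
    """Stage data in staging_dir, then rename it into watched_dir.

    Hypothetical helper: assumes both directories live on the same
    POSIX filesystem, so os.rename is a single atomic step.
    """
    os.makedirs(watched_dir, exist_ok=True)
    os.makedirs(staging_dir, exist_ok=True)

    # Write the full contents where the stream source is not looking.
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    # Use a fresh name per file so an existing file is never replaced.
    final_path = os.path.join(watched_dir, f"part-{uuid.uuid4().hex}{suffix}")

    # The file becomes visible in watched_dir only once it is complete.
    os.rename(tmp_path, final_path)
    return final_path
```

With this pattern, updated data arrives as a new file rather than as a modification of an old one, so a reader never sees a partially written file.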
