Github user frreiss commented on the issue:

https://github.com/apache/spark/pull/15258

This change allows FileInputStream to consume partial outputs of a system such as Hadoop or another copy of Spark, provided that the producing system adheres rigidly to the write policy of recent versions of Hadoop, namely:

1. Write the data to a temporary file.
2. Flush and close the temporary file.
3. Rename the temporary file into place, using one of the newer, atomic HDFS rename APIs.

I worry that users might write data in a subtly different way that does not follow this procedure exactly, which could occasionally cause Spark to read incorrect data. I recommend documenting precisely the conditions under which the "ignore temporary files" option guarantees correct behavior.

It would also be a good idea to include a mode in which FileInputStream ignores a directory of files until the special file _SUCCESS appears, indicating that the directory is complete. Otherwise, Spark could end up consuming partial results from failed jobs.
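To make the protocol concrete, here is a minimal sketch of the write policy and the _SUCCESS-marker convention described above. This is a hypothetical illustration on a local POSIX filesystem, not Spark's or Hadoop's actual implementation; the function names (`publish_file`, `mark_success`, `directory_is_complete`) are invented for this example, and on HDFS the equivalent steps would use the Hadoop FileSystem API rather than `os.replace`.

```python
# Illustrative sketch (NOT Spark/Hadoop code) of the temp-write,
# flush, atomic-rename protocol plus a _SUCCESS completion marker.
import os
import tempfile


def publish_file(out_dir: str, name: str, data: bytes) -> str:
    """Write to a temp file, flush + fsync, then atomically rename it."""
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, prefix="._tmp_")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make the bytes durable before renaming
        final_path = os.path.join(out_dir, name)
        # Atomic on POSIX: readers see either no file or the complete file,
        # never a partially written one.
        os.replace(tmp_path, final_path)
        return final_path
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # never leave a stray temp file behind
        raise


def mark_success(out_dir: str) -> None:
    """Drop the _SUCCESS marker once every file has been published."""
    open(os.path.join(out_dir, "_SUCCESS"), "w").close()


def directory_is_complete(out_dir: str) -> bool:
    """A reader should skip the directory until _SUCCESS exists."""
    return os.path.exists(os.path.join(out_dir, "_SUCCESS"))
```

A careful reader would call `directory_is_complete` before listing the directory, which is exactly the mode suggested above: a failed job that never writes _SUCCESS leaves its partial output invisible to consumers.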