Github user frreiss commented on the issue:

https://github.com/apache/spark/pull/15258

This change allows FileInputStream to consume partial outputs of a system such as Hadoop or another copy of Spark, provided that the producing system adheres rigidly to the write policy of recent versions of Hadoop, namely:

1. Write the data to a temporary file.
2. Flush and close the temporary file.
3. Rename the temporary file into place, using one of the newer, atomic HDFS rename APIs.

I worry that users might write data in a subtly different way that does not follow this procedure exactly, which could occasionally cause Spark to read incorrect data. I recommend documenting precisely the conditions under which the "ignore temporary files" option guarantees correct behavior.

It would also be a good idea to include a mode in which FileInputStream ignores a directory of files until the special file _SUCCESS appears, indicating that the directory is complete. Otherwise, Spark could end up consuming partial results from failed jobs.
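To make the protocol concrete, here is a minimal sketch of the write policy and the _SUCCESS-marker convention described above. This is a hypothetical illustration on a local POSIX filesystem, not Spark's or Hadoop's actual implementation; the function names (`publish_file`, `mark_success`, `directory_is_complete`) are invented for this example, and on HDFS the equivalent steps would use the Hadoop FileSystem API rather than `os.replace`.

```python
# Illustrative sketch (NOT Spark/Hadoop code) of the temp-write,
# flush, atomic-rename protocol plus a _SUCCESS completion marker.
import os
import tempfile


def publish_file(out_dir: str, name: str, data: bytes) -> str:
    """Write to a temp file, flush + fsync, then atomically rename it."""
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, prefix="._tmp_")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make the bytes durable before renaming
        final_path = os.path.join(out_dir, name)
        # Atomic on POSIX: readers see either no file or the complete file,
        # never a partially written one.
        os.replace(tmp_path, final_path)
        return final_path
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # never leave a stray temp file behind
        raise


def mark_success(out_dir: str) -> None:
    """Drop the _SUCCESS marker once every file has been published."""
    open(os.path.join(out_dir, "_SUCCESS"), "w").close()


def directory_is_complete(out_dir: str) -> bool:
    """A reader should skip the directory until _SUCCESS exists."""
    return os.path.exists(os.path.join(out_dir, "_SUCCESS"))
```

A careful reader would call `directory_is_complete` before listing the directory, which is exactly the mode suggested above: a failed job that never writes _SUCCESS leaves its partial output invisible to consumers.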