Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20490#discussion_r167011220
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceWriter.java ---
    @@ -78,10 +78,11 @@ default void onDataWriterCommit(WriterCommitMessage message) {}
        * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
        * is undefined and {@link #abort(WriterCommitMessage[])} may not be able to deal with it.
        *
    -   * Note that, one partition may have multiple committed data writers because of speculative tasks.
    -   * Spark will pick the first successful one and get its commit message. Implementations should be
    -   * aware of this and handle it correctly, e.g., have a coordinator to make sure only one data
    -   * writer can commit, or have a way to clean up the data of already-committed writers.
    +   * Note that speculative execution may cause multiple tasks to run for a partition. By default,
    +   * Spark uses the OutputCommitCoordinator to allow only one attempt to commit.
    +   * {@link DataWriterFactory} implementations can disable this behavior. If disabled, multiple
    --- End diff --
    
    I clarified this and added a note about how to do it.
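
    To illustrate the behavior the new Javadoc describes, here is a minimal, self-contained sketch of first-attempt-wins commit arbitration. The `Coordinator` class below is a hypothetical stand-in written for this example, not Spark's actual `OutputCommitCoordinator`; it only mirrors the idea that, when speculative attempts race for the same partition, exactly one attempt is authorized to commit.
    
    ```java
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    
    // Illustrative sketch only: a simplified stand-in for the commit-arbitration
    // idea behind Spark's OutputCommitCoordinator, not the real implementation.
    public class CommitCoordinatorSketch {
    
        // Grants commit permission to the first attempt that asks for each
        // partition; later (speculative) attempts for that partition are denied.
        static final class Coordinator {
            private final Map<Integer, Integer> authorized = new ConcurrentHashMap<>();
    
            boolean canCommit(int partitionId, int attemptNumber) {
                // putIfAbsent returns null only for the first caller per key,
                // so exactly one attempt per partition gets permission.
                return authorized.putIfAbsent(partitionId, attemptNumber) == null;
            }
        }
    
        public static void main(String[] args) {
            Coordinator coordinator = new Coordinator();
            // Two speculative attempts race to commit partition 0.
            System.out.println("attempt 0 may commit: " + coordinator.canCommit(0, 0)); // true
            System.out.println("attempt 1 may commit: " + coordinator.canCommit(0, 1)); // false
        }
    }
    ```
    
    A writer that opts out of this coordination (as the revised doc allows) must instead tolerate multiple committed attempts per partition, e.g. by making commits idempotent or cleaning up duplicates in `commit`.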


---
