[GitHub] spark pull request #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer....

cloud-fan Tue, 30 Jan 2018 17:22:24 -0800

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20386#discussion_r164930522
  
    --- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceWriter.java
 ---
    @@ -63,32 +68,42 @@
       DataWriterFactory<Row> createWriterFactory();
     
       /**
    -   * Commits this writing job with a list of commit messages. The commit 
messages are collected from
    -   * successful data writers and are produced by {@link 
DataWriter#commit()}.
    +   * Handles a commit message which is collected from a successful data 
writer.
    +   *
    +   * Note that, implementations might need to cache all commit messages 
before calling
    +   * {@link #commit()} or {@link #abort()}.
        *
        * If this method fails (by throwing an exception), this writing job is 
considered to to have been
    -   * failed, and {@link #abort(WriterCommitMessage[])} would be called. 
The state of the destination
    -   * is undefined and @{@link #abort(WriterCommitMessage[])} may not be 
able to deal with it.
    +   * failed, and {@link #abort()} would be called. The state of the 
destination
    +   * is undefined and @{@link #abort()} may not be able to deal with it.
    +   */
    +  void add(WriterCommitMessage message);
    +
    +  /**
    +   * Commits this writing job.
    +   * When this method is called, the number of commit messages added by
    +   * {@link #add(WriterCommitMessage)} equals to the number of input data 
partitions.
        *
    -   * Note that, one partition may have multiple committed data writers 
because of speculative tasks.
    -   * Spark will pick the first successful one and get its commit message. 
Implementations should be
    -   * aware of this and handle it correctly, e.g., have a coordinator to 
make sure only one data
    -   * writer can commit, or have a way to clean up the data of 
already-committed writers.
    +   * If this method fails (by throwing an exception), this writing job is 
considered to to have been
    +   * failed, and {@link #abort()} would be called. The state of the 
destination
    +   * is undefined and @{@link #abort()} may not be able to deal with it.
        */
    -  void commit(WriterCommitMessage[] messages);
    +  void commit();
    --- End diff --
    
    This is something we wanna improve at the API level. I think the 
implementation should be free to decide how to store the messages, in case each 
message is big and there are a lot of them. If this is not a problem at all, we 
can follow `FileCommitProtocol`.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer....

Reply via email to