Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148875607
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
    @@ -50,28 +53,34 @@
     
       /**
        * Creates a writer factory which will be serialized and sent to executors.
    +   *
    +   * If this method fails (by throwing an exception), the action will fail and no Spark job will be
    +   * submitted.
        */
       DataWriterFactory<Row> createWriterFactory();
     
       /**
        * Commits this writing job with a list of commit messages. The commit messages are collected from
    -   * successful data writers and are produced by {@link DataWriter#commit()}. If this method
    -   * fails(throw exception), this writing job is considered to be failed, and
    -   * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible
    -   * to data source readers if this method succeeds.
    +   * successful data writers and are produced by {@link DataWriter#commit()}.
    +   *
    +   * If this method fails (by throwing an exception), this writing job is considered to have failed,
    +   * and {@link #abort(WriterCommitMessage[])} will be called. The state of the destination is
    +   * undefined and {@link #abort(WriterCommitMessage[])} may not be able to deal with it.
        *
        * Note that, one partition may have multiple committed data writers because of speculative tasks.
        * Spark will pick the first successful one and get its commit message. Implementations should be
    --- End diff ---
    
    I haven't read Steve's points here entirely, but I agree that Spark should be primarily responsible for task commit coordination. Most implementations would be fine using the current [output commit coordinator](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala), which does a good job of balancing the trade-offs that you've been discussing. It ensures that only one task is authorized to commit, and it has well-defined failure cases (when a network partition prevents the authorized committer from responding before its commit authorization times out).
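    
    To make that arbitration rule concrete, here is a minimal sketch of the pattern the coordinator implements (illustrative only, not the actual `OutputCommitCoordinator` class or its method signatures): for each partition, the first task attempt that asks for permission to commit is authorized, and every later attempt, e.g. a speculative duplicate, is denied.
    
    ```java
    // Illustrative sketch of driver-side commit arbitration -- NOT the real
    // OutputCommitCoordinator API. The first attempt to ask for a partition
    // wins; later attempts (e.g. speculative duplicates) are denied.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    
    class CommitArbiter {
      // partition id -> attempt number of the authorized committer
      private final Map<Integer, Integer> authorized = new ConcurrentHashMap<>();
    
      /** Returns true only for the attempt that holds the commit authorization. */
      boolean canCommit(int partition, int attemptNumber) {
        Integer winner = authorized.putIfAbsent(partition, attemptNumber);
        return winner == null || winner == attemptNumber;
      }
    
      /** If the authorized attempt fails, free the slot so another attempt can commit. */
      void attemptFailed(int partition, int attemptNumber) {
        authorized.remove(partition, attemptNumber);
      }
    }
    ```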
    
    I think that Spark should use the current commit coordinator unless an implementation explicitly opts out of using it (and I'm not sure that opting out is a use case we care to support at this point). It's fine for Spark to document how its coordinator works and acknowledge that it has some drawbacks, but expecting implementations to handle their own commit coordination (which, for Spark, requires RPC) is, I think, unreasonable. Let's use the one we have by default, however imperfect.
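    
    For comparison, here is a hedged sketch of what the task/job split can look like when a source relies on Spark for coordination: task-side writers only stage their output and describe it in a commit message, and the job-level commit/abort publishes or discards the staged results. The `StagedFileMessage`, `publish`, and `cleanup` names below are hypothetical, not part of the DataSourceV2 API.
    
    ```java
    import java.util.Arrays;
    
    // Hypothetical commit message: just records where a task staged its output.
    // In a real source this would implement WriterCommitMessage.
    class StagedFileMessage {
      final String stagedPath;
      StagedFileMessage(String stagedPath) { this.stagedPath = stagedPath; }
    }
    
    class ExampleJobCommitter {
      // Job-level commit: Spark has already picked one successful message per
      // partition, so publishing every staged file is enough -- no extra
      // task-level arbitration inside the data source.
      void commit(StagedFileMessage[] messages) {
        Arrays.stream(messages).forEach(m -> publish(m.stagedPath));
      }
    
      // Job-level abort: drop whatever was staged; nothing was ever visible.
      void abort(StagedFileMessage[] messages) {
        Arrays.stream(messages).forEach(m -> cleanup(m.stagedPath));
      }
    
      private void publish(String path) { /* e.g. rename into the final location */ }
      private void cleanup(String path) { /* e.g. delete the staged file */ }
    }
    ```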

