Github user jose-torres commented on a diff in the pull request:
https://github.com/apache/spark/pull/20386#discussion_r164810169
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceWriter.java
---
@@ -63,32 +68,42 @@
DataWriterFactory<Row> createWriterFactory();
/**
- * Commits this writing job with a list of commit messages. The commit
messages are collected from
- * successful data writers and are produced by {@link
DataWriter#commit()}.
+ * Handles a commit message which is collected from a successful data
writer.
+ *
+ * Note that, implementations might need to cache all commit messages
before calling
+ * {@link #commit()} or {@link #abort()}.
*
* If this method fails (by throwing an exception), this writing job is
considered to to have been
- * failed, and {@link #abort(WriterCommitMessage[])} would be called.
The state of the destination
- * is undefined and @{@link #abort(WriterCommitMessage[])} may not be
able to deal with it.
+ * failed, and {@link #abort()} would be called. The state of the
destination
+ * is undefined and @{@link #abort()} may not be able to deal with it.
+ */
+ void add(WriterCommitMessage message);
+
+ /**
+ * Commits this writing job.
+ * When this method is called, the number of commit messages added by
+ * {@link #add(WriterCommitMessage)} equals to the number of input data
partitions.
*
- * Note that, one partition may have multiple committed data writers
because of speculative tasks.
- * Spark will pick the first successful one and get its commit message.
Implementations should be
- * aware of this and handle it correctly, e.g., have a coordinator to
make sure only one data
- * writer can commit, or have a way to clean up the data of
already-committed writers.
+ * If this method fails (by throwing an exception), this writing job is
considered to to have been
+ * failed, and {@link #abort()} would be called. The state of the
destination
+ * is undefined and @{@link #abort()} may not be able to deal with it.
*/
- void commit(WriterCommitMessage[] messages);
+ void commit();
--- End diff --
WDYT of using the same API as FileCommitProtocol, where the engine both
calls add() for each message but also passes them in to commit() at the end? It
seems like most writers will have to keep an array of the messages they
received.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]