Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/20386#discussion_r164930522
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceWriter.java
---
@@ -63,32 +68,42 @@
DataWriterFactory<Row> createWriterFactory();
/**
- * Commits this writing job with a list of commit messages. The commit
messages are collected from
- * successful data writers and are produced by {@link
DataWriter#commit()}.
+ * Handles a commit message which is collected from a successful data
writer.
+ *
+ * Note that, implementations might need to cache all commit messages
before calling
+ * {@link #commit()} or {@link #abort()}.
*
* If this method fails (by throwing an exception), this writing job is
considered to to have been
- * failed, and {@link #abort(WriterCommitMessage[])} would be called.
The state of the destination
- * is undefined and @{@link #abort(WriterCommitMessage[])} may not be
able to deal with it.
+ * failed, and {@link #abort()} would be called. The state of the
destination
+ * is undefined and @{@link #abort()} may not be able to deal with it.
+ */
+ void add(WriterCommitMessage message);
+
+ /**
+ * Commits this writing job.
+ * When this method is called, the number of commit messages added by
+ * {@link #add(WriterCommitMessage)} equals to the number of input data
partitions.
*
- * Note that, one partition may have multiple committed data writers
because of speculative tasks.
- * Spark will pick the first successful one and get its commit message.
Implementations should be
- * aware of this and handle it correctly, e.g., have a coordinator to
make sure only one data
- * writer can commit, or have a way to clean up the data of
already-committed writers.
+ * If this method fails (by throwing an exception), this writing job is
considered to to have been
+ * failed, and {@link #abort()} would be called. The state of the
destination
+ * is undefined and @{@link #abort()} may not be able to deal with it.
*/
- void commit(WriterCommitMessage[] messages);
+ void commit();
--- End diff --
This is something we wanna improve at the API level. I think the
implementation should be free to decide how to store the messages, in case each
message is big and there are a lot of them. If this is not a problem at all, we
can follow `FileCommitProtocol`.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]