Github user steveloughran commented on a diff in the pull request:
https://github.com/apache/spark/pull/19623#discussion_r148507385
--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
@@ -50,28 +53,34 @@
/**
* Creates a writer factory which will be serialized and sent to executors.
+ *
+ * If this method fails (by throwing an exception), the action would fail and no Spark job was
+ * submitted.
*/
DataWriterFactory<Row> createWriterFactory();
/**
* Commits this writing job with a list of commit messages. The commit messages are collected from
- * successful data writers and are produced by {@link DataWriter#commit()}. If this method
- * fails(throw exception), this writing job is considered to be failed, and
- * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible
- * to data source readers if this method succeeds.
+ * successful data writers and are produced by {@link DataWriter#commit()}.
+ *
+ * If this method fails (by throwing an exception), this writing job is considered to to have been
+ * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
+ * is undefined and @{@link #abort(WriterCommitMessage[])} may not be able to deal with it.
*
* Note that, one partition may have multiple committed data writers because of speculative tasks.
* Spark will pick the first successful one and get its commit message. Implementations should be
- * aware of this and handle it correctly, e.g., have a mechanism to make sure only one data writer
- * can commit successfully, or have a way to clean up the data of already-committed writers.
+ * aware of this and handle it correctly, e.g., have a coordinator to make sure only one data
+ * writer can commit, or have a way to clean up the data of already-committed writers.
*/
void commit(WriterCommitMessage[] messages);
/**
* Aborts this writing job because some data writers are failed to write the records and aborted,
* or the Spark job fails with some unknown reasons, or {@link #commit(WriterCommitMessage[])}
- * fails. If this method fails(throw exception), the underlying data source may have garbage that
- * need to be cleaned manually, but these garbage should not be visible to data source readers.
+ * fails.
+ *
+ * If this method fails (by throwing an exception), the underlying data source may have garbage
+ * that need to be cleaned manually.
--- End diff --
Suggest "may require manual cleanup". What is left behind could be more than
just "garbage", which implies filesystem temp data; it could be tables in a
database or similar.
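
To make the point concrete, here is a minimal sketch of an abort that leaves behind something other than temp files. Everything in it is hypothetical (the class `StagingTableAbortSketch`, its methods, and the `staging_N` table names are illustrative stand-ins, not part of Spark or of this PR); it only models a store where each task creates a staging table and abort cannot always drop them.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical in-memory stand-in for a database-backed sink; not a Spark API. */
public class StagingTableAbortSketch {
    // Staging tables that currently exist in the imaginary database.
    private final Set<String> tables = new LinkedHashSet<>();
    // Tables that cannot be dropped (e.g. held by another session); an assumption for illustration.
    private final Set<String> locked = new HashSet<>();

    /** A task "writes" its partition by creating a staging table. */
    public void write(int partition) {
        tables.add("staging_" + partition);
    }

    /** Marks a partition's staging table as undroppable. */
    public void lock(int partition) {
        locked.add("staging_" + partition);
    }

    /** Drops every staging table it can and returns the ones that need manual cleanup. */
    public List<String> abort() {
        List<String> leftovers = new ArrayList<>();
        Iterator<String> it = tables.iterator();
        while (it.hasNext()) {
            String table = it.next();
            if (locked.contains(table)) {
                leftovers.add(table); // stays in the database after abort
            } else {
                it.remove(); // dropped successfully
            }
        }
        return leftovers;
    }

    public static void main(String[] args) {
        StagingTableAbortSketch db = new StagingTableAbortSketch();
        db.write(0);
        db.write(1);
        db.lock(1);
        System.out.println("needs manual cleanup: " + db.abort());
    }
}
```

In this model the leftover is a table, not a file, which is why "may require manual cleanup" reads as the more general wording.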
---