Github user steveloughran commented on a diff in the pull request:
https://github.com/apache/spark/pull/19623#discussion_r148226526
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java
---
@@ -75,8 +82,10 @@
/**
* Aborts this writing job because some data writers are failed to write
the records and aborted,
* or the Spark job fails with some unknown reasons, or {@link
#commit(WriterCommitMessage[])}
- * fails. If this method fails(throw exception), the underlying data
source may have garbage that
- * need to be cleaned manually, but these garbage should not be visible
to data source readers.
+ * fails.
+ *
+ * If an exception was thrown, the underlying data source may have
garbage that need to be
+ * cleaned manually, but these garbage should not be visible to data
source readers.
--- End diff --
bit optimistic on the "should". Maybe: "state of the destination is
undefined".
The issue here is what if there's a network failure and abort() fails for
that. it hasn't made any changes to the destination. Thus its state is whatever
the previous state of the dest was. If no task was committed, nothing should be
directly observable. 1+ task committed and there *may* be output. If job commit
failed and abort() was called then, nobody is going to have any idea of the
final state.
Maybe: "Implementations are required to perform best-effort cleanup
irrespective of the current state of the destination".
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]