Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/19269
> The only contract Spark needs is: data written/committed by tasks should
> not be visible to data source readers until the job-level commit. But it
> can be visible to others, like other writing tasks, so it's possible for
> data sources to implement "abort the output of the other writer".
I'm not following what you mean here.
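The first half of that contract is easy to picture with a file-based source, though: tasks write under a staging directory that readers never list, and the driver publishes everything at job commit. A minimal sketch of that idea (the `StagedFileWriter` class and `.staging` layout are illustrative assumptions, not part of the API in this PR):

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

class StagedFileWriter(table: Path) {
  private val staging = table.resolve(".staging")

  // Runs on an executor: task output lands under .staging/, which readers
  // never list, so uncommitted data stays invisible to them. Other writers
  // can still see these files, which is what would make "abort the output
  // of the other writer" implementable.
  def writeAndCommitTask(taskId: Long, rows: Seq[String]): Path = {
    Files.createDirectories(staging)
    val file = staging.resolve(s"part-$taskId")
    Files.write(file, rows.mkString("\n").getBytes("UTF-8"))
    file // this path plays the role of the task's commit message
  }

  // Runs on the driver at job commit: move each staged file into the table
  // directory. (A real source would need an atomic metadata swap to make
  // the whole job visible all at once, not a per-file rename loop.)
  def commitJob(stagedFiles: Seq[Path]): Unit = {
    stagedFiles.foreach { f =>
      Files.move(f, table.resolve(f.getFileName), StandardCopyOption.ATOMIC_MOVE)
    }
  }
}
```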
> making DataSourceV2Writer.abort take commit messages is still a
> "best-effort" way to clean up the data
Agreed. We should state something like this in the job-level abort docs: "Commit
messages passed to abort are the messages for all task commits that succeeded
and sent their commit message to the driver. It is possible, though unlikely,
for an executor to successfully commit data to a data source but fail before
sending the commit message to the driver."
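To make that concrete, a sketch of what a best-effort abort could look like with the proposed signature; `FileCommitMessage` and the staging sweep are hypothetical details for illustration, not the actual API:

```scala
import java.nio.file.{Files, Path}
import scala.collection.JavaConverters._

object BestEffortAbort {
  // Assumed message type for this sketch: each task reports the file it
  // committed.
  case class FileCommitMessage(committedFile: Path)

  // Driver-side abort: clean up everything the driver knows about, then
  // sweep staging for output from tasks that committed but died before
  // their commit message reached the driver.
  def abortJob(staging: Path, messages: Seq[FileCommitMessage]): Unit = {
    // Only tasks whose commit message reached the driver appear here, so
    // this list can be incomplete; that is what makes abort "best-effort".
    messages.foreach(m => Files.deleteIfExists(m.committedFile))

    // Defensive sweep for unreported task output.
    if (Files.isDirectory(staging)) {
      Files.list(staging).iterator().asScala.foreach(p => Files.deleteIfExists(p))
    }
  }
}
```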