Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/20710#discussion_r172319605
--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriterFactory.java ---
@@ -48,6 +48,9 @@
 *                      same task id but different attempt number, which means there are multiple
 *                      tasks with the same task id running at the same time. Implementations can
 *                      use this attempt number to distinguish writers of different task attempts.
+ * @param epochId A monotonically increasing id for streaming queries that are split in to
+ *                discrete periods of execution. For non-streaming queries,
+ *                this ID will always be 0.
 */
- DataWriter<T> createDataWriter(int partitionId, int attemptNumber);
+ DataWriter<T> createDataWriter(int partitionId, int attemptNumber, long epochId);
--- End diff ---
Why are we using the same interface for streaming and batch here? Is there
a compelling reason to do so instead of adding `StreamingWriterFactory`? Are
the guarantees for an epoch identical to those of a single batch job?
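For illustration, the alternative raised above could look roughly like the sketch below: the batch factory keeps its current two-argument signature, and a separate streaming factory carries `epochId`. All names here other than `DataWriter<T>` and `createDataWriter` (e.g. `StreamingDataWriterFactory`) are hypothetical, not part of the actual Spark API under review.

```java
import java.io.Serializable;

// Simplified stand-in for Spark's DataWriter interface, for the sketch only.
interface DataWriter<T> {
    void write(T record);
}

// Batch factory: signature stays as in the current interface, no epochId.
interface DataWriterFactory<T> extends Serializable {
    DataWriter<T> createDataWriter(int partitionId, int attemptNumber);
}

// Hypothetical streaming factory: epochId appears only where it is meaningful.
interface StreamingDataWriterFactory<T> extends Serializable {
    DataWriter<T> createDataWriter(int partitionId, int attemptNumber, long epochId);
}

public class FactorySketch {
    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        // A trivial streaming factory that tags each record with its epoch.
        StreamingDataWriterFactory<String> factory =
            (partitionId, attemptNumber, epochId) ->
                record -> log.append(epochId).append(':').append(record);
        factory.createDataWriter(0, 0, 7L).write("row");
        System.out.println(log); // prints "7:row"
    }
}
```

With this split, a batch-only implementation never sees an epoch, while a streaming sink can give epochs their own commit semantics if they differ from a batch job's.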