Github user vanzin commented on a diff in the pull request:
https://github.com/apache/spark/pull/21606#discussion_r197291875
--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriterFactory.java ---
@@ -42,15 +42,12 @@
   *                    Usually Spark processes many RDD partitions at the same time,
   *                    implementations should use the partition id to distinguish writers for
   *                    different partitions.
-  * @param attemptNumber Spark may launch multiple tasks with the same task id. For example, a task
-  *                      failed, Spark launches a new task wth the same task id but different
-  *                      attempt number. Or a task is too slow, Spark launches new tasks wth the
-  *                      same task id but different attempt number, which means there are multiple
-  *                      tasks with the same task id running at the same time. Implementations can
-  *                      use this attempt number to distinguish writers of different task attempts.
+  * @param taskId A unique identifier for a task that is performing the write of the partition
+  *               data. Spark may run multiple tasks for the same partition (due to speculation
+  *               or task failures, for example).
   * @param epochId A monotonically increasing id for streaming queries that are split in to
   *                discrete periods of execution. For non-streaming queries,
   *                this ID will always be 0.
   */
-  DataWriter<T> createDataWriter(int partitionId, int attemptNumber, long epochId);
+  DataWriter<T> createDataWriter(int partitionId, int taskId, long epochId);
--- End diff ---
Hmm, interesting. But there is an API in 2.3:
```
DataWriter<T> createDataWriter(int partitionId, int attemptNumber);
```
which I guess would still suffer from the problem Ryan describes in the bug. In any case, that makes it impossible to cleanly backport this, so we can make the type change here.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]