Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/21606#discussion_r197290427
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriterFactory.java
---
@@ -42,15 +42,12 @@
    *                    Usually Spark processes many RDD partitions at the same time,
    *                    implementations should use the partition id to distinguish writers for
    *                    different partitions.
-   * @param attemptNumber Spark may launch multiple tasks with the same task id. For example, a task
-   *                      failed, Spark launches a new task wth the same task id but different
-   *                      attempt number. Or a task is too slow, Spark launches new tasks wth the
-   *                      same task id but different attempt number, which means there are multiple
-   *                      tasks with the same task id running at the same time. Implementations can
-   *                      use this attempt number to distinguish writers of different task attempts.
+   * @param taskId A unique identifier for a task that is performing the write of the partition
+   *               data. Spark may run multiple tasks for the same partition (due to speculation
+   *               or task failures, for example).
    * @param epochId A monotonically increasing id for streaming queries that are split in to
    *                discrete periods of execution. For non-streaming queries,
    *                this ID will always be 0.
    */
-  DataWriter<T> createDataWriter(int partitionId, int attemptNumber, long epochId);
+  DataWriter<T> createDataWriter(int partitionId, int taskId, long epochId);
--- End diff ---
If this patch targets 2.3, I'd say we should not change any API or
documentation; just pass `taskId.toInt` as `attemptNumber` and add a comment
explaining this hacky workaround.
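
As a rough sketch of the suggested workaround (not the actual Spark code; the
interfaces below are simplified stand-ins for the DataSource V2 classes, and
`createWriter` is a hypothetical caller on the execution side), the idea is to
keep the 2.3 signature and smuggle the unique task id through the existing
`attemptNumber` slot:

```java
// Illustrative sketch only: simplified stand-ins for the DataSource V2
// interfaces discussed above, not the real Spark classes.
interface DataWriter<T> {
    void write(T record);
}

interface DataWriterFactory<T> {
    // 2.3 public signature kept unchanged: second parameter stays an int.
    DataWriter<T> createDataWriter(int partitionId, int attemptNumber, long epochId);
}

public class TaskIdWorkaround {
    // The execution side has a unique long taskId per task attempt. Instead
    // of changing the public API, truncate it to int and pass it through the
    // existing attemptNumber parameter (`taskId.toInt` in Scala, a cast in
    // Java). A code comment at the call site would document this hack.
    static <T> DataWriter<T> createWriter(DataWriterFactory<T> factory,
                                          int partitionId, long taskId, long epochId) {
        return factory.createDataWriter(partitionId, (int) taskId, epochId);
    }

    public static void main(String[] args) {
        DataWriterFactory<String> factory =
            (partitionId, attemptNumber, epochId) -> record ->
                System.out.println(partitionId + "/" + attemptNumber + ": " + record);
        DataWriter<String> writer = createWriter(factory, 0, 12345L, 0L);
        writer.write("row");  // prints "0/12345: row"
    }
}
```

Note that the narrowing cast keeps the low 32 bits of the task id, which is
still enough to distinguish concurrently running attempts in practice.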