GitHub user vanzin commented on a diff in the pull request:
https://github.com/apache/spark/pull/21606#discussion_r197291190
--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriterFactory.java ---
@@ -42,15 +42,12 @@
    *                    Usually Spark processes many RDD partitions at the same time,
    *                    implementations should use the partition id to distinguish writers for
    *                    different partitions.
-   * @param attemptNumber Spark may launch multiple tasks with the same task id. For example, a task
-   *                      failed, Spark launches a new task wth the same task id but different
-   *                      attempt number. Or a task is too slow, Spark launches new tasks wth the
-   *                      same task id but different attempt number, which means there are multiple
-   *                      tasks with the same task id running at the same time. Implementations can
-   *                      use this attempt number to distinguish writers of different task attempts.
+   * @param taskId A unique identifier for a task that is performing the write of the partition
+   *               data. Spark may run multiple tasks for the same partition (due to speculation
+   *               or task failures, for example).
    * @param epochId A monotonically increasing id for streaming queries that are split in to
    *                discrete periods of execution. For non-streaming queries,
    *                this ID will always be 0.
    */
-  DataWriter<T> createDataWriter(int partitionId, int attemptNumber, long epochId);
+  DataWriter<T> createDataWriter(int partitionId, int taskId, long epochId);
--- End diff --
Just so I understand, what's the reason for not changing the parameter name
and API docs? The name is not a public API in Java, so it doesn't break
anything.
And regardless of the parameter name, the API documentation is wrong (since
it says you can have multiple tasks with the same ID, but different attempts,
which does not happen).
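For illustration, here's a rough sketch (not from this PR) of how an implementation might use the renamed `taskId` parameter; the class names, output paths, and the stubbed-out I/O are all made up:

```java
import java.io.IOException;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.sources.v2.writer.DataWriter;
import org.apache.spark.sql.sources.v2.writer.DataWriterFactory;
import org.apache.spark.sql.sources.v2.writer.WriterCommitMessage;

// Hypothetical commit message carrying the path this attempt wrote, so the
// driver can decide which attempt's output to keep.
class PathCommitMessage implements WriterCommitMessage {
  final String path;
  PathCommitMessage(String path) { this.path = path; }
}

// Hypothetical factory: bakes taskId into the output path, so speculative or
// retried attempts for the same partition never write to the same file.
class UniquePathWriterFactory implements DataWriterFactory<InternalRow> {
  @Override
  public DataWriter<InternalRow> createDataWriter(int partitionId, int taskId, long epochId) {
    final String path = String.format(
        "/tmp/out/part-%05d-task-%d-epoch-%d", partitionId, taskId, epochId);
    return new DataWriter<InternalRow>() {
      @Override
      public void write(InternalRow record) throws IOException {
        // Append `record` to the file at `path` (I/O omitted for brevity).
      }
      @Override
      public WriterCommitMessage commit() throws IOException {
        // Report the file this attempt produced so the driver can commit it.
        return new PathCommitMessage(path);
      }
      @Override
      public void abort() throws IOException {
        // Remove any partial output at `path`.
      }
    };
  }
}
```

Since `taskId` is already unique across attempts, it's enough on its own to keep concurrent writers apart, which is why the signature no longer needs a separate attempt number.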
---