GitHub user vanzin commented on a diff in the pull request:
https://github.com/apache/spark/pull/21606#discussion_r197291190
--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriterFactory.java ---
@@ -42,15 +42,12 @@
    *                    Usually Spark processes many RDD partitions at the same time,
    *                    implementations should use the partition id to distinguish writers for
    *                    different partitions.
-   * @param attemptNumber Spark may launch multiple tasks with the same task id. For example, a task
-   *                      failed, Spark launches a new task wth the same task id but different
-   *                      attempt number. Or a task is too slow, Spark launches new tasks wth the
-   *                      same task id but different attempt number, which means there are multiple
-   *                      tasks with the same task id running at the same time. Implementations can
-   *                      use this attempt number to distinguish writers of different task attempts.
+   * @param taskId A unique identifier for a task that is performing the write of the partition
+   *               data. Spark may run multiple tasks for the same partition (due to speculation
+   *               or task failures, for example).
    * @param epochId A monotonically increasing id for streaming queries that are split in to
    *                discrete periods of execution. For non-streaming queries,
    *                this ID will always be 0.
    */
-  DataWriter<T> createDataWriter(int partitionId, int attemptNumber, long epochId);
+  DataWriter<T> createDataWriter(int partitionId, int taskId, long epochId);
--- End diff --
Just so I understand, what's the reason for not changing the parameter name
and API docs? The name is not a public API in Java, so it doesn't break
anything.
And regardless of the parameter name, the API documentation is wrong (since
it says you can have multiple tasks with the same ID, but different attempts,
which does not happen).
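For illustration, here's a rough sketch (not from this PR) of how an implementation might use the renamed `taskId` parameter; the class names, output paths, and the stubbed-out I/O are all made up:

```java
import java.io.IOException;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.sources.v2.writer.DataWriter;
import org.apache.spark.sql.sources.v2.writer.DataWriterFactory;
import org.apache.spark.sql.sources.v2.writer.WriterCommitMessage;

// Hypothetical commit message carrying the path this attempt wrote, so the
// driver can decide which attempt's output to keep.
class PathCommitMessage implements WriterCommitMessage {
  final String path;
  PathCommitMessage(String path) { this.path = path; }
}

// Hypothetical factory: bakes taskId into the output path, so speculative or
// retried attempts for the same partition never write to the same file.
class UniquePathWriterFactory implements DataWriterFactory<InternalRow> {
  @Override
  public DataWriter<InternalRow> createDataWriter(int partitionId, int taskId, long epochId) {
    final String path = String.format(
        "/tmp/out/part-%05d-task-%d-epoch-%d", partitionId, taskId, epochId);
    return new DataWriter<InternalRow>() {
      @Override
      public void write(InternalRow record) throws IOException {
        // Append `record` to the file at `path` (I/O omitted for brevity).
      }
      @Override
      public WriterCommitMessage commit() throws IOException {
        // Report the file this attempt produced so the driver can commit it.
        return new PathCommitMessage(path);
      }
      @Override
      public void abort() throws IOException {
        // Remove any partial output at `path`.
      }
    };
  }
}
```

Since `taskId` is already unique across attempts, it's enough on its own to keep concurrent writers apart, which is why the signature no longer needs a separate attempt number.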
---