GitHub user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21606#discussion_r197286890
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala ---
    @@ -104,12 +104,12 @@ object SparkHadoopWriter extends Logging {
           jobTrackerId: String,
           commitJobId: Int,
           sparkPartitionId: Int,
    -      sparkAttemptNumber: Int,
    +      sparkTaskId: Long,
           committer: FileCommitProtocol,
           iterator: Iterator[(K, V)]): TaskCommitMessage = {
         // Set up a task.
         val taskContext = config.createTaskAttemptContext(
    -      jobTrackerId, commitJobId, sparkPartitionId, sparkAttemptNumber)
    +      jobTrackerId, commitJobId, sparkPartitionId, sparkTaskId.toInt)
    --- End diff ---
    
    But what does "run out" mean?
    
    If your task ID goes past `Int.MaxValue`, you'll start getting negative values here. Eventually you'll reach a long value that wraps back around to `0` when cast to an integer:
    
    ```
    scala> (2L + Int.MaxValue + Int.MaxValue).toInt
    res2: Int = 0
    ```
    
    So for this to "not work", meaning two tasks generate the same output file name based on all these values (stage, task, partition, etc.), you'd need two task IDs in the same stage whose `Int` casts collide. Since the cast only discards the upper 32 bits, that takes about 4 billion tasks in the same stage before it becomes a problem.
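    To make the collision condition concrete (a minimal sketch in plain Scala, not code from this PR): two long IDs map to the same `Int` exactly when they differ by a multiple of 2^32, i.e. about 4.29 billion:
    
    ```
    scala> val wrap = 1L << 32   // size of the Int range: 4294967296
    wrap: Long = 4294967296
    
    scala> 5L.toInt == (5L + wrap).toInt
    res1: Boolean = true
    
    scala> 5L.toInt == (5L + 1L).toInt
    res2: Boolean = false
    ```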
    
    In other situations, you may get weird values because of the cast, but it 
should still work.

