Github user vanzin commented on a diff in the pull request:
https://github.com/apache/spark/pull/21606#discussion_r197316565
--- Diff: core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala ---
@@ -76,13 +76,29 @@ object SparkHadoopWriter extends Logging {
// Try to write all RDD partitions as a Hadoop OutputFormat.
try {
val ret = sparkContext.runJob(rdd, (context: TaskContext, iter: Iterator[(K, V)]) => {
+ // Generate a positive integer task ID that is unique for the current stage. This makes a
+ // few assumptions:
+ // - the task ID is always positive
+ // - stages cannot have more than Int.MaxValue tasks
+ // - the sum of task counts of all active stages doesn't exceed Int.MaxValue
+ //
+ // The first two are currently the case in Spark, while the last one is very unlikely to
+ // occur. If it does, two task IDs on a single stage could have a clashing integer value,
+ // which could lead to code that generates clashing file names for different tasks. Still,
+ // if the commit coordinator is enabled, only one task would be allowed to commit.
--- End diff --
Ok, I'll use that. I think Spark might fail everything before you even go that high in attempt numbers anyway...
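
For reference, since the suggestion being accepted is not quoted in this message, here is a minimal sketch of the kind of derivation under discussion, assuming it packs the stage attempt number and the task attempt number into a single non-negative Int. The object name and the 16-bit split are illustrative assumptions, not the exact code from the PR:

import org.apache.spark.TaskContext

object TaskAttemptIdSketch {
  // Pack the stage attempt number into the high bits and the task attempt
  // number into the low 16 bits, giving an Int that is unique per
  // (stage attempt, task attempt) pair and stays non-negative for any
  // realistic stage attempt count.
  def uniqueAttemptId(context: TaskContext): Int = {
    // If a task were ever retried 2^16 times the two fields would overlap;
    // in practice Spark fails the job long before attempt numbers get that high.
    require(context.attemptNumber < (1 << 16),
      s"attempt number ${context.attemptNumber} does not fit in 16 bits")
    (context.stageAttemptNumber << 16) | context.attemptNumber
  }
}

Whatever the exact layout, the value only has to be unique within the job, and as the quoted comment notes, even a clash would be contained because the output commit coordinator lets only one task attempt commit.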