Github user vanzin commented on a diff in the pull request:
https://github.com/apache/spark/pull/21606#discussion_r197316565
--- Diff: core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala ---
@@ -76,13 +76,29 @@ object SparkHadoopWriter extends Logging {
// Try to write all RDD partitions as a Hadoop OutputFormat.
try {
val ret = sparkContext.runJob(rdd, (context: TaskContext, iter: Iterator[(K, V)]) => {
+ // Generate a positive integer task ID that is unique for the current stage. This makes a
+ // few assumptions:
+ // - the task ID is always positive
+ // - stages cannot have more than Int.MaxValue tasks
+ // - the sum of task counts of all active stages doesn't exceed Int.MaxValue
+ //
+ // The first two are currently the case in Spark, while the last one is very unlikely to
+ // occur. If it does, two task IDs on a single stage could have a clashing integer value,
+ // which could lead to code that generates clashing file names for different tasks. Still,
+ // if the commit coordinator is enabled, only one task would be allowed to commit.
--- End diff --
Ok, I'll use that. I think Spark might fail everything before you even go that high in attempt numbers anyway...
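
For reference, since the suggestion being accepted is not quoted in this message, here is a minimal sketch of the kind of derivation under discussion, assuming it packs the stage attempt number and the task attempt number into a single non-negative Int. The object name and the 16-bit split are illustrative assumptions, not the exact code from the PR:

import org.apache.spark.TaskContext

object TaskAttemptIdSketch {
  // Pack the stage attempt number into the high bits and the task attempt
  // number into the low 16 bits, giving an Int that is unique per
  // (stage attempt, task attempt) pair and stays non-negative for any
  // realistic stage attempt count.
  def uniqueAttemptId(context: TaskContext): Int = {
    // If a task were ever retried 2^16 times the two fields would overlap;
    // in practice Spark fails the job long before attempt numbers get that high.
    require(context.attemptNumber < (1 << 16),
      s"attempt number ${context.attemptNumber} does not fit in 16 bits")
    (context.stageAttemptNumber << 16) | context.attemptNumber
  }
}

Whatever the exact layout, the value only has to be unique within the job, and as the quoted comment notes, even a clash would be contained because the output commit coordinator lets only one task attempt commit.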