Github user vanzin commented on a diff in the pull request:
https://github.com/apache/spark/pull/21606#discussion_r197286890
--- Diff: core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala ---
@@ -104,12 +104,12 @@ object SparkHadoopWriter extends Logging {
       jobTrackerId: String,
       commitJobId: Int,
       sparkPartitionId: Int,
-      sparkAttemptNumber: Int,
+      sparkTaskId: Long,
       committer: FileCommitProtocol,
       iterator: Iterator[(K, V)]): TaskCommitMessage = {
     // Set up a task.
     val taskContext = config.createTaskAttemptContext(
-      jobTrackerId, commitJobId, sparkPartitionId, sparkAttemptNumber)
+      jobTrackerId, commitJobId, sparkPartitionId, sparkTaskId.toInt)
--- End diff ---
But what does "run out" mean?
If your task ID goes past `Int.MaxValue`, you'll start getting negative
values here. Eventually you'll reach a long value that wraps back around to `0`
when cast to an integer:
```
(2L + Int.MaxValue + Int.MaxValue).toInt
res2: Int = 0
```
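For reference, the negative range starts one past `Int.MaxValue` (a quick
Scala REPL check, not from the original comment):
```
(1L + Int.MaxValue).toInt
res1: Int = -2147483648
```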
So for this to "not work", meaning you'd get a conflict where two tasks
generate the same output file name based on all these values (stage, task,
partition, etc.), two task IDs in the same stage would have to collide after
the cast, which means you'd need about 4 billion tasks in the same stage
before it becomes a problem.
In other situations you may get weird values because of the cast, but it
should still work.
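To make the collision concrete, here is a minimal sketch with two made-up
task IDs exactly `1L << 32` apart, which is the spacing at which the cast
stops telling them apart:
```
val a = 5L               // some task ID (hypothetical value)
val b = a + (1L << 32)   // ~4.3 billion task IDs later
a.toInt == b.toInt       // true: both cast to 5, so names derived from the
                         // casted value would collide
```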