cloud-fan commented on a change in pull request #23608:
URL: https://github.com/apache/spark/pull/23608#discussion_r655446703
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
##########
@@ -170,7 +170,7 @@ object FileFormatWriter extends Logging {
description = description,
sparkStageId = taskContext.stageId(),
sparkPartitionId = taskContext.partitionId(),
- sparkAttemptNumber = taskContext.attemptNumber(),
+ sparkAttemptNumber = taskContext.taskAttemptId().toInt & Integer.MAX_VALUE,
Review comment:
After more than 2 years, I revisited this code path and realized
that this is not the best fix.
The original motivation is still correct: Spark violates the contract of
`TaskAttemptID`, because it resets the task attempt number after a stage retry,
making `TaskAttemptID` not unique.
The root cause is that a Spark job has stages while a Hadoop job directly has
tasks (no DAG). We map the Spark stage id to the Hadoop job id, which is
inaccurate because it doesn't account for the stage attempt number.
I think a better fix is to generate the Hadoop job id from both the stage id
and the stage attempt number, or to generate the Hadoop task attempt number
from both the Spark task attempt number and the stage attempt number. However,
the current fix also works, as this only decides the intermediate staging
directory name, which we don't care about as long as it's unique.
I'm leaving this comment here just for future reference.