Hi,

I found that ShuffleMapStage keeps an (apparently superfluous) pendingPartitions registry [1] for the DAGScheduler. Its description says:
  /**
   * Partitions that either haven't yet been computed, or that were computed on an executor
   * that has since been lost, so should be re-computed. This variable is used by the
   * DAGScheduler to determine when a stage has completed. Task successes in both the active
   * attempt for the stage or in earlier attempts for this stage can cause partition ids to get
   * removed from pendingPartitions. As a result, this variable may be inconsistent with the pending
   * tasks in the TaskSetManager for the active attempt for the stage (the partitions stored here
   * will always be a subset of the partitions that the TaskSetManager thinks are pending).
   */

I'm curious why pendingPartitions is needed at all, since isAvailable and findMissingPartitions (both backed by MapOutputTrackerMaster) already know which partitions are missing and, I think, are even more up to date. Why is there this extra registry?

[1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapStage.scala#L60

Regards,
Jacek Laskowski
----
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
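
P.S. To make the question concrete, here is a toy Scala model of the two bookkeeping paths as I understand them. It is only a sketch and not Spark's code: ToyShuffleMapStage, submitMissing, taskSucceeded and registeredOutputs are hypothetical names, and registeredOutputs is merely my stand-in for what MapOutputTrackerMaster records. It also assumes a task success updates both registries at the same moment, which may well be exactly where my understanding is incomplete.

```scala
import scala.collection.mutable

// Toy model only: the names mirror the Spark ones, but this is NOT Spark's code.
final class ToyShuffleMapStage(val numPartitions: Int) {
  // Stand-in for ShuffleMapStage.pendingPartitions.
  val pendingPartitions: mutable.HashSet[Int] = mutable.HashSet.empty[Int]

  // Hypothetical stand-in for the map outputs known to MapOutputTrackerMaster.
  private val registeredOutputs = mutable.HashSet.empty[Int]

  // Hypothetical hook: a task set for all currently missing partitions is submitted.
  def submitMissing(): Unit = pendingPartitions ++= findMissingPartitions()

  // Hypothetical hook: a shuffle map task for `partition` succeeded.
  // Assumption: both registries are updated together.
  def taskSucceeded(partition: Int): Unit = {
    pendingPartitions -= partition
    registeredOutputs += partition
  }

  // Partitions with no registered map output, in ascending order.
  def findMissingPartitions(): Seq[Int] =
    (0 until numPartitions).filterNot(registeredOutputs.contains)

  // The stage is available once every partition has a registered output.
  def isAvailable: Boolean = registeredOutputs.size == numPartitions
}
```

In this model, after submitMissing() and successes for some partitions, pendingPartitions and findMissingPartitions() always hold the same ids, which is why the extra registry looks redundant to me. I would be glad to learn which real scheduler event makes the two diverge.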
