Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/3009#issuecomment-63588578
I just pushed fixes for a number of UI issues, including several sorting
problems. The biggest change is the addition of a "pending" state on the job
details page (@pwendell, the implementation here is much simpler than the
JobProgressListener hacking that I mentioned earlier; this shouldn't have any
GC issues).
@kayousterhout:
> The good news is that the JobProgressListener already tracks the stages
> that have already completed, so when we get the JobStartEvent, we can
> subtract the stage IDs that are already done.
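Concretely, I read that suggestion as something like the sketch below
(hypothetical names; `completedStageIds` stands in for the completed-stage
bookkeeping that JobProgressListener already does, not its actual field):
```
import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart, SparkListenerStageCompleted}

class PendingStagesListener extends SparkListener {
  // Stand-in for the completed-stage tracking JobProgressListener already does.
  private val completedStageIds = mutable.Set[Int]()

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    completedStageIds += stageCompleted.stageInfo.stageId
  }

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    // The hoped-for pending set: the job's submitted stage IDs minus the
    // IDs of stages that have already completed.
    val pending = jobStart.stageIds.toSet -- completedStageIds
    println(s"Job ${jobStart.jobId} pending stages: ${pending.mkString(", ")}")
  }
}
```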
I don't think that this will work, since it seems that the skipped stages
are assigned new stage IDs. For instance, try
```
val rdd = sc.parallelize(Seq(1, 2, 3))
  .map(identity).groupBy(identity)
  .map(identity).groupBy(identity)
  .map(identity)
rdd.count()
rdd.count()
```
In this case, both jobs will be submitted with three stage IDs, but none of
those stage IDs will be shared between the two jobs. You can see this by
looking at the "all stages" page: there are gaps in the stage number sequence
because the extra stage IDs are assigned to stages that are never actually run.
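If you want to check this without the UI, here's a quick sketch you can paste
into the shell (the exact IDs printed depend on what else has run;
`sc.addSparkListener` is just the hook I'm using to observe job submissions):
```
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Print the stage IDs submitted with each job.
sc.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    println(s"Job ${jobStart.jobId}: stage IDs ${jobStart.stageIds.mkString(", ")}")
  }
})

val rdd = sc.parallelize(Seq(1, 2, 3))
  .map(identity).groupBy(identity)
  .map(identity).groupBy(identity)
  .map(identity)
rdd.count()  // e.g. "Job 0: stage IDs 0, 1, 2"
rdd.count()  // e.g. "Job 1: stage IDs 3, 4, 5" -- fresh IDs, even though two stages are skipped
```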
Given this, is there an easy way to figure out which stages will be run?