Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/3009#issuecomment-63566220
> The good news is that the JobProgressListener already tracks the stages
> that have already completed, so when we get the JobStartEvent, we can subtract
> the stage IDs that are already done. This is imperfect for many of the reasons
> you mentioned above (e.g., a job could hang with "0" stages running if a fetch
> failed happens for a stage that we thought was complete when the job was
> submitted) but I think is most intuitive in the general case.
I think that this is still subject to weird anomalies even if failures
don't occur, but I don't think that's going to be a problem due to the reasons
that Patrick discussed upthread. Imagine that I have a stage DAG that looks
something like
```
A ----\
       C --~ some shuffle ~---> D ---> result
B-----/
```
where `C` is cached. In this scenario, it's possible that stages `A` and
`B` are missing and `C` is available. However, the job only needs to compute
`D` to compute the result. In this case, I think we'll end up over-estimating
the amount of work that needs to be performed. (I realize that I've kind of
conflated RDDs and stages in this example, but I hope that my point is still
clear).
So:
- If failures occur, we end up undercounting the number of tasks.
- If cached RDDs are present, we end up overcounting the number of tasks.
As Patrick said, it's better to err in favor of under-promising and
over-delivering in the common, failure-free cases. The suggestion of only
counting uncompleted stages towards the total number of tasks is a net
improvement in the sense that it reduces our total error.
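
To make the overcounting case concrete, here is a rough sketch of the `A`/`B`/`C`/`D` example above. It is only an illustration: `StageInfo`, `estimatedTasks`, and `requiredTasks` are hypothetical names that I made up for this comment, not the listener's or scheduler's actual API, and the task counts are arbitrary.

```scala
// Hypothetical model of a stage; this is NOT one of Spark's real classes.
final case class StageInfo(
    id: Int,
    name: String,
    numTasks: Int,
    parents: Seq[Int],
    outputCached: Boolean)

object TaskCountEstimate {
  // The example DAG: A and B feed C, and C shuffles into D. Task counts are made up.
  val stages: Map[Int, StageInfo] = Seq(
    StageInfo(0, "A", 4, Nil, outputCached = false),
    StageInfo(1, "B", 4, Nil, outputCached = false),
    StageInfo(2, "C", 4, Seq(0, 1), outputCached = true),
    StageInfo(3, "D", 2, Seq(2), outputCached = false)
  ).map(s => s.id -> s).toMap

  // Proposed estimate: all of the job's stages minus the ones already seen to complete.
  def estimatedTasks(jobStageIds: Seq[Int], completed: Set[Int]): Int =
    jobStageIds.filterNot(completed).map(id => stages(id).numTasks).sum

  // Work that is actually needed: walk back from the final stage and
  // stop at any parent whose output is cached.
  def requiredTasks(finalStage: Int): Int = {
    def visit(id: Int, seen: Set[Int]): Set[Int] =
      if (seen(id)) seen
      else {
        val uncachedParents = stages(id).parents.filterNot(p => stages(p).outputCached)
        uncachedParents.foldLeft(seen + id)((acc, p) => visit(p, acc))
      }
    visit(finalStage, Set.empty).toSeq.map(id => stages(id).numTasks).sum
  }
}
```

With these made-up numbers, `estimatedTasks(Seq(0, 1, 2, 3), Set(2))` counts the tasks of `A`, `B`, and `D`, while `requiredTasks(3)` counts only the tasks of `D`. The gap between the two is the overcount described above.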