GitHub user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/3009#issuecomment-63566220
  
    > The good news is that the JobProgressListener already tracks the stages 
that have already completed, so when we get the JobStartEvent, we can subtract 
the stage IDs that are already done. This is imperfect for many of the reasons 
you mentioned above (e.g., a job could hang with "0" stages running if a fetch 
failed happens for a stage that we thought was complete when the job was 
submitted) but I think is most intuitive in the general case.
    
    I think that this is still subject to weird anomalies even if failures 
don't occur, but I don't think that's going to be a problem due to the reasons 
that Patrick discussed upthread.  Imagine that I have a stage DAG that looks 
something like
    
    ```
    A ----\
           C --~ some shuffle ~---> D  ---> result
    B-----/
    ```
    
    where `C` is cached.  In this scenario, it's possible that stages `A` and 
`B` are missing and `C` is available.  However, the job only needs to compute 
`D` to compute the result.  In this case, I think we'll end up over-estimating 
the amount of work that needs to be performed.  (I realize that I've kind of 
conflated RDDs and stages in this example, but I hope that my point is still 
clear).
    
    So:
    
    - If failures occur, we end up undercounting the number of tasks.
    - If cached RDDs are present, we end up overcounting the number of tasks.
    
    As Patrick said, it's better to err in favor of under-promising and 
over-delivering in the common, failure-free cases.  The suggestion of only 
counting uncompleted stages towards the total number of tasks is a net 
improvement in the sense that it reduces our total error.
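
The two error modes above can be illustrated with a toy model of the example DAG. This is only a sketch in Python, not Spark's scheduler or `JobProgressListener` code: the `PARENTS` map, the `TASKS` counts, and both helper functions are hypothetical names and numbers made up for illustration. It compares the "all stages minus completed stages" estimate with the stages that a walk back from the result stage would actually have to run.

```python
from typing import Dict, List, Set

# Hypothetical stage DAG from the example: stage -> parent stages.
PARENTS: Dict[str, List[str]] = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
# Hypothetical task counts per stage, chosen only for illustration.
TASKS: Dict[str, int] = {"A": 4, "B": 4, "C": 8, "D": 2}


def estimated_stages(parents: Dict[str, List[str]], completed: Set[str]) -> Set[str]:
    """Estimate from the thread: every stage in the job minus the stages
    the listener has already seen complete."""
    return set(parents) - completed


def stages_to_run(final: str, parents: Dict[str, List[str]], available: Set[str]) -> Set[str]:
    """Walk back from the final stage, stopping at any stage whose output
    is still available (for example, cached or already shuffled)."""
    needed: Set[str] = set()
    stack = [final]
    while stack:
        stage = stack.pop()
        if stage in needed or stage in available:
            continue
        needed.add(stage)
        stack.extend(parents[stage])
    return needed


def task_count(stages: Set[str]) -> int:
    return sum(TASKS[s] for s in stages)


# Cached case: C is available, A and B are missing.
cached_estimate = estimated_stages(PARENTS, completed={"C"})  # A, B, D
cached_actual = stages_to_run("D", PARENTS, available={"C"})  # D only

# Failure case: the listener saw A, B, C complete, but C's output was lost.
failed_estimate = estimated_stages(PARENTS, completed={"A", "B", "C"})  # D only
failed_actual = stages_to_run("D", PARENTS, available={"A", "B"})       # C, D

print(task_count(cached_estimate), task_count(cached_actual))  # overcount
print(task_count(failed_estimate), task_count(failed_actual))  # undercount
```

With these made-up task counts, the cached case estimates 10 tasks where only 2 run, and the failure case estimates 2 tasks where 10 run.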


