GitHub user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/4029#issuecomment-69976817
This is a nice patch, but I wonder whether there's a smaller fix that
doesn't require changing SparkListener events; that would make it easier to
backport the patch to `branch-1.2`. The job page already knows the last stage
in the job (the result stage), so I think we might be able to use the final
stage's completion time as the job completion time and the first stage's
submission time as the job start time. However, there are a couple of
corner cases that this approximation might miss: a job could spend a lot of
time queued behind other jobs before its first stage starts running, in which
case it would be helpful to be able to distinguish between scheduler delay and
stage durations. Similarly, the job completion time could be off for a job
that spends a lot of time fetching results back to the driver after completed
tasks have stored them in the block manager.
So, the approach here seems like the right fix. For `branch-1.2`, I'd guess we
might be able to do a separate fix that uses the first/last stage time
approximations.
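To make the approximation concrete, here is a minimal sketch of the fallback described above. The `StageInfo` record and method names are hypothetical stand-ins for the per-stage fields the job page already tracks, not Spark's actual API:

```java
import java.util.List;
import java.util.Optional;

public class JobTimeApprox {
    // Hypothetical stand-in for the submission/completion times the job
    // page already tracks for each stage of a job.
    public record StageInfo(Optional<Long> submissionTime,
                            Optional<Long> completionTime) {}

    // Job start ~= earliest stage submission time. This misses any time the
    // job spent queued behind other jobs before its first stage ran.
    public static Optional<Long> jobStart(List<StageInfo> stages) {
        return stages.stream()
                .flatMap(s -> s.submissionTime().stream())
                .min(Long::compare);
    }

    // Job end ~= latest stage completion time. This misses time spent
    // fetching results back to the driver after the last tasks completed.
    public static Optional<Long> jobEnd(List<StageInfo> stages) {
        return stages.stream()
                .flatMap(s -> s.completionTime().stream())
                .max(Long::compare);
    }
}
```

Both methods return `Optional.empty()` when no stage has reported the relevant time yet, which matches the corner cases above: the approximation is only as good as the stage timestamps it is derived from.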
I have a couple of comments on the code, which I'll leave inline.