GitHub user nongli opened a pull request:
https://github.com/apache/spark/pull/10116
[SPARK-12113] [SQL] Add some timing metrics for blocking pipelines.
This patch adds timing metrics to portions of the execution. Because
operators are pipelined a row at a time, these metrics can only be recorded
at the end of blocking phases. The metrics added in this patch should be
interpreted as an event timeline where 0 marks the beginning of the
execution.
Each metric records the time at which the last task of the blocking operator
finishes its blocking computation. This makes sense when the metrics are read
as a timeline, and it gives a measure of the wall-clock latency of that phase.
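As a rough illustration of that definition (this is not the patch's code,
and the function and parameter names are hypothetical): assuming every task
of a blocking operator reports the wall-clock timestamp at which it finished,
the metric would be the latest of those timestamps minus the query start.

```python
def blocking_phase_metric(query_start_ms, task_finish_ms):
    """Elapsed time from query start until the last task of a blocking
    operator finishes. Both arguments are hypothetical inputs in
    milliseconds of wall-clock time; clock skew is ignored."""
    if not task_finish_ms:
        raise ValueError("need at least one task finish time")
    # The slowest task determines when the blocking phase is done.
    return max(task_finish_ms) - query_start_ms


# Three made-up tasks finishing 8.2s, 10.0s and 9.1s after the query started.
elapsed_ms = blocking_phase_metric(1_000, [9_200, 11_000, 10_100])
print(elapsed_ms)  # 10000
```

Taking the maximum rather than the sum or the mean is what makes the value
a point on a timeline instead of a measure of total CPU work.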
For example, consider a plan with
Scan -> Agg -> Exchange -> Agg -> Project. There are two blocking phases:
the end of the first agg (when it has finished aggregating but has not yet
returned any results) and, similarly, the end of the second agg.
This patch adds a timing to each, measured as the time since the beginning.
For example, it might contain
Agg1: 10 seconds
Agg2: 11 seconds
This captures a timeline: Scan + Agg1 took 10 seconds, and Agg1 returning
results + Exchange + Agg2 took 1 second. We can post-process this timeline
to get the time spent entirely in one pipeline.
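A minimal sketch of that post-processing, assuming a linear plan in which
each blocking phase starts when the previous one ends. The function name and
the input shape are hypothetical and are not part of this patch; a plan with
branches (for example a join) would need the plan structure as well.

```python
def per_phase_durations(timeline):
    """Convert (name, seconds since query start) entries into
    (name, seconds spent in that phase) by differencing consecutive
    points on the timeline."""
    ordered = sorted(timeline, key=lambda entry: entry[1])
    durations = []
    previous = 0  # 0 marks the beginning of the execution
    for name, finished_at in ordered:
        durations.append((name, finished_at - previous))
        previous = finished_at
    return durations


print(per_phase_durations([("Agg1", 10), ("Agg2", 11)]))
# [('Agg1', 10), ('Agg2', 1)]
```

Here the 10 seconds attributed to Agg1 cover Scan + Agg1, and the 1 second
attributed to Agg2 covers Agg1 returning results + Exchange + Agg2.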
This adds the metrics to Agg and Sort, but we should add more in subsequent
patches. The patch also does not account for clock skew between the
different machines. If this is a problem in practice (for example, when NTP
is not used), we can adjust for that as well.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/nongli/spark metrics
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/10116.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #10116
----
commit c386e56beec70577fced050be67d4c8af9205326
Author: Nong Li <[email protected]>
Date: 2015-11-30T22:46:55Z
[SPARK-12113] [SQL] Add some timing metrics for blocking pipelines.
----