GitHub user nongli opened a pull request:

    https://github.com/apache/spark/pull/10116

    [SPARK-12113] [SQL] Add some timing metrics for blocking pipelines.

    This patch adds timing metrics to portions of the execution. Because rows are
    pipelined one at a time, these metrics can only be recorded at the end of
    blocking phases. The metrics added in this patch should be interpreted as an
    event timeline, where 0 marks the beginning of execution.
    
    Each metric is the time at which the last task of the blocking operator
    completes. This makes sense when the metrics are read as a timeline, and it
    gives a measure of the wall-clock latency of that phase.
    
    For example, in the plan
      Scan -> Agg -> Exchange -> Agg -> Project
    there are two blocking phases: the end of the first agg (when it has
    finished aggregating but has not yet returned any results) and, similarly,
    the end of the second agg. This patch adds a timing to each, measured as the
    time since the beginning of execution. For example, the metrics might contain
      Agg1: 10 seconds
      Agg2: 11 seconds
    Read as a timeline, this means that Scan + Agg1 took 10 seconds, and that
    Agg1 returning results + Exchange + Agg2 took 1 second. We can post-process
    this timeline to get the time spent entirely in one pipeline.
    
    This adds the metrics to Agg and Sort, but we should add more in subsequent
    patches. The patch also does not account for clock skew between machines. If
    this turns out to be a problem in practice (for example, when NTP is not
    used), we can adjust for it as well.
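
The timeline interpretation described above can be illustrated with a small sketch. It is not code from the patch: the function names `operator_finish_time` and `phase_durations`, the millisecond inputs, and the assumption of a single linear chain of blocking operators are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def operator_finish_time(task_finish_ms: List[int], query_start_ms: int) -> int:
    """Timeline value for one blocking operator.

    Hypothetical helper: the value is when the *last* task of the operator
    finished, relative to the start of execution (time 0 on the timeline).
    """
    return max(task_finish_ms) - query_start_ms


def phase_durations(timeline: List[Tuple[str, float]]) -> Dict[str, float]:
    """Convert cumulative finish times into per-phase durations.

    Assumes a linear plan, so ordering the blocking operators by finish
    time gives the order in which the phases ran.
    """
    durations: Dict[str, float] = {}
    previous = 0.0  # time 0 is the beginning of execution
    for name, finished_at in sorted(timeline, key=lambda entry: entry[1]):
        durations[name] = finished_at - previous
        previous = finished_at
    return durations


# The example from the description: Agg1 ends at 10 s, Agg2 at 11 s.
timeline = [("Agg1", 10.0), ("Agg2", 11.0)]
print(phase_durations(timeline))  # {'Agg1': 10.0, 'Agg2': 1.0}
```

Here the 10.0 for Agg1 covers Scan + Agg1, and the 1.0 for Agg2 covers Agg1 returning results + Exchange + Agg2. The sketch ignores clock skew between machines, as the patch does.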

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nongli/spark metrics

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10116.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10116
    
----
commit c386e56beec70577fced050be67d4c8af9205326
Author: Nong Li <[email protected]>
Date:   2015-11-30T22:46:55Z

    [SPARK-12113] [SQL] Add some timing metrics for blocking pipelines.
    

----

