GitHub user ksakellis opened a pull request:

    https://github.com/apache/spark/pull/4067

    [SPARK-4874] [CORE] Collect record count metrics

    Collects record counts for both Input/Output and Shuffle Metrics. For the
    input/output metrics, it simply increments the counter every time the
    iterators are accessed.
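
As a rough illustration of counting records on iterator access, here is a minimal Python sketch. It is not Spark's Scala implementation; the names `InputMetrics`, `records_read`, and `counting_iterator` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


@dataclass
class InputMetrics:
    # Hypothetical stand-in for a per-task metrics holder.
    records_read: int = 0


def counting_iterator(source: Iterable[T], metrics: InputMetrics) -> Iterator[T]:
    """Yield items from source, bumping the counter once per record handed out."""
    for item in source:
        metrics.records_read += 1
        yield item


metrics = InputMetrics()
consumed = list(counting_iterator(["a", "b", "c"], metrics))
```

Because the wrapper is lazy, the counter only reflects records the consumer actually pulled, which is the behaviour the description above suggests.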
        
    For shuffle, on the write side we count records post-aggregation (after
    a map-side combine), and on the read side we count records pre-aggregation.
    This allows the bytes read/written metrics and the records read/written
    metrics to line up.
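
The following Python sketch shows why counting at those two points makes the written and read totals agree. It is a toy model under stated assumptions, not Spark code: `map_side_combine`, `shuffle_write`, and `shuffle_read` are hypothetical helpers, and a "block" is just a list of key/value pairs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, int]


def map_side_combine(records: Iterable[Pair]) -> List[Pair]:
    """Sum values per key, as a map-side combiner might."""
    combined: Dict[str, int] = defaultdict(int)
    for key, value in records:
        combined[key] += value
    return list(combined.items())


def shuffle_write(records: Iterable[Pair]) -> Tuple[List[Pair], int]:
    """Return the combined block and the record count taken AFTER combining."""
    block = map_side_combine(records)
    return block, len(block)


def shuffle_read(blocks: Iterable[List[Pair]]) -> Tuple[Dict[str, int], int]:
    """Merge fetched blocks, counting each record BEFORE reduce-side aggregation."""
    records_read = 0
    totals: Dict[str, int] = defaultdict(int)
    for block in blocks:
        for key, value in block:
            records_read += 1
            totals[key] += value
    return dict(totals), records_read


map_outputs = [
    [("a", 1), ("a", 2), ("b", 3)],
    [("a", 4), ("c", 5), ("c", 6)],
]
written = [shuffle_write(records) for records in map_outputs]
records_written = sum(count for _, count in written)
totals, records_read = shuffle_read(block for block, _ in written)
```

In this toy run both sides count the same records that cross the shuffle boundary, so `records_written` and `records_read` match even though the raw input had more records and the final result has fewer keys.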
    
    For backwards compatibility, if we deserialize an older event that doesn't
    have record metrics, we set the metric to -1.
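
A minimal Python sketch of that fallback, assuming events are stored as JSON. The field names `"Bytes Read"` and `"Records Read"` and the function `input_metrics_from_json` are assumptions made for this example, not taken from the patch.

```python
import json

# Sentinel meaning "the event predates record metrics".
RECORDS_UNKNOWN = -1


def input_metrics_from_json(payload: str) -> dict:
    """Parse an input-metrics event, tolerating older events without a record count."""
    raw = json.loads(payload)
    return {
        "bytes_read": raw["Bytes Read"],
        # Older events lack this key, so fall back to the sentinel.
        "records_read": raw.get("Records Read", RECORDS_UNKNOWN),
    }


old_event = input_metrics_from_json('{"Bytes Read": 2048}')
new_event = input_metrics_from_json('{"Bytes Read": 2048, "Records Read": 17}')
```

Using -1 rather than 0 lets a reader of the event log distinguish "not recorded" from "zero records".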

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ksakellis/spark kostas-spark-4874

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4067.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4067
    
----
commit 3fe42a778381b80125895ed001a074614f821ab9
Author: Kostas Sakellis <[email protected]>
Date:   2014-11-04T01:59:18Z

    [SPARK-4092] [CORE] Fix InputMetrics for coalesced RDDs
    
    When calculating the input metrics there was an assumption
    that one task only reads from one block. This is not true
    for some operations, including coalesce. This patch simply
    increments the task's input metrics if previous ones with
    the same read method exist.
    
    A limitation of this patch is that if a task reads from
    two blocks with different read methods, one will override
    the other.

commit 0f2f7d430f44ae0488dcd86a65ff910e99fe07e0
Author: Kostas Sakellis <[email protected]>
Date:   2014-11-11T18:41:27Z

    CR feedback

commit a8b5626a574cd023914917cedf2b288c12e47a41
Author: Kostas Sakellis <[email protected]>
Date:   2014-12-13T00:54:04Z

    Add bytesReadCallback to InputMetrics
    
    Also added a test for interleaving reads.

commit 66257f7c14c4e013ae897d5b7a0be108afc07a89
Author: Kostas Sakellis <[email protected]>
Date:   2015-01-14T02:03:29Z

    Drops metrics if conflicting read methods exist
    
    Tasks now only store/accumulate input metrics from
    the same read method. If a task has interleaved reads
    from more than one block with different read methods, we
    choose to store the first read method's metrics.
    
    https://issues.apache.org/jira/browse/SPARK-5225
    addresses keeping track of all input metrics.
    
    This change also centralizes this logic in TaskMetrics
    and gates how inputMetrics can be added to TaskMetrics.
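
The "first read method wins" gating described in this commit message can be sketched as follows. This is a hedged Python illustration only: the class and method names (`TaskMetrics`, `InputMetrics`, `input_metrics_for`) and the read-method strings are hypothetical and do not mirror the actual Scala signatures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InputMetrics:
    read_method: str
    bytes_read: int = 0


class TaskMetrics:
    """Keeps input metrics for one read method only: the first one seen."""

    def __init__(self) -> None:
        self.input_metrics: Optional[InputMetrics] = None

    def input_metrics_for(self, read_method: str) -> InputMetrics:
        if self.input_metrics is None:
            # First read in this task: its method becomes the stored one.
            self.input_metrics = InputMetrics(read_method)
            return self.input_metrics
        if self.input_metrics.read_method == read_method:
            # Same method as before: keep accumulating into the stored metrics.
            return self.input_metrics
        # Conflicting method: hand back a throwaway object so updates are dropped.
        return InputMetrics(read_method)


task = TaskMetrics()
task.input_metrics_for("Hadoop").bytes_read += 100
task.input_metrics_for("Memory").bytes_read += 50   # dropped
task.input_metrics_for("Hadoop").bytes_read += 25
```

Routing every update through one accessor is what makes the policy easy to change later, for example to track all read methods as SPARK-5225 proposes.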

commit 571cb69f694b7f05e829c7237a07f68fa058e118
Author: Kostas Sakellis <[email protected]>
Date:   2015-01-15T04:45:35Z

    [SPARK-4874] [CORE] Collect record count metrics
    
    Collects record counts for both Input/Output and Shuffle
    Metrics. For the input/output metrics, it simply increments
    the counter every time the iterators are accessed.
    
    For shuffle, on the write side we count records post-
    aggregation (after a map-side combine), and on the read
    side we count records pre-aggregation. This allows
    the bytes read/written metrics and the
    records read/written metrics to line up.
    
    For backwards compatibility, if we deserialize an older
    event that doesn't have record metrics, we set the
    metric to -1.

----

