GitHub user ksakellis opened a pull request:
https://github.com/apache/spark/pull/4067
[SPARK-4874] [CORE] Collect record count metrics
Collects record counts for both Input/Output and Shuffle Metrics. For the
input/output metrics, it simply increments the counter every time the iterators
are accessed.
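As a rough illustration of counting on iterator access, here is a minimal Scala sketch. `RecordCounter` and `CountingIterator` are hypothetical names made up for this example and are not the classes touched by the patch; the sketch only assumes that each call to `next()` corresponds to one record.

```scala
// Hypothetical stand-in for a task's record metric; not Spark's actual class.
final class RecordCounter {
  private var count = 0L
  def inc(): Unit = { count += 1 }
  def recordsRead: Long = count
}

// Wraps an iterator and bumps the counter once per record handed to the caller.
final class CountingIterator[T](underlying: Iterator[T], counter: RecordCounter)
    extends Iterator[T] {
  override def hasNext: Boolean = underlying.hasNext
  override def next(): T = {
    val record = underlying.next()
    counter.inc() // one increment per record actually consumed
    record
  }
}

object CountingIteratorExample {
  // Consumes the wrapped iterator and returns (sum of records, records counted).
  def countedSum(records: Seq[Int]): (Int, Long) = {
    val counter = new RecordCounter
    val total = new CountingIterator(records.iterator, counter).sum
    (total, counter.recordsRead)
  }
}
```

Because the count is taken inside `next()`, records that are never consumed (for example after an early `take`) are not counted in this sketch.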
For shuffle, on the write side we count records post-aggregation (after a
map-side combine), and on the read side we count records pre-aggregation.
This allows the bytes read/written metrics and the records read/written
metrics to line up.
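The following Scala sketch illustrates why counting at these two points makes the numbers agree. It is a toy model with one map task and one reducer; `mapSideCombine` and `counts` are hypothetical helpers written for this example, not Spark APIs.

```scala
object ShuffleRecordCounts {
  // Hypothetical map-side combine: merge values that share a key by summing them.
  def mapSideCombine(records: Seq[(String, Int)]): Seq[(String, Int)] =
    records.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }.toSeq

  // Returns (recordsWritten, recordsRead) for a single map task and a single reducer.
  def counts(mapInput: Seq[(String, Int)]): (Long, Long) = {
    val combined = mapSideCombine(mapInput)

    // Write side: count what is actually written, i.e. after the combine.
    val recordsWritten = combined.size.toLong

    // Read side: count each fetched record before any reduce-side aggregation.
    var recordsRead = 0L
    combined.iterator.foreach { _ => recordsRead += 1 }

    (recordsWritten, recordsRead)
  }
}
```

For the input `Seq("a" -> 1, "a" -> 2, "b" -> 3)`, the combine leaves two records, so both counts are 2. Counting the three pre-combine records on the write side would not match what the reducer fetches.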
For backwards compatibility, if we deserialize an older event that doesn't
have record metrics, we set the metric to -1.
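A minimal Scala sketch of this fallback, assuming the deserialized event can be viewed as a map of field names to values. The field name `"Records Read"` and the `RecordMetricCompat` object are assumptions made for illustration, not the names used in the patch.

```scala
object RecordMetricCompat {
  // Sentinel for events logged before record counts existed.
  val Missing: Long = -1L

  // `event` is a hypothetical flattened view of a deserialized event's fields.
  // Older events have no record-count field, so the lookup falls back to -1.
  def recordsRead(event: Map[String, Long]): Long =
    event.getOrElse("Records Read", Missing)
}
```

With this shape, `RecordMetricCompat.recordsRead(Map("Bytes Read" -> 1024L))` yields the sentinel, while an event that carries the field returns its stored value. Consumers can then distinguish "no data" from a genuine count of zero.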
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ksakellis/spark kostas-spark-4874
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/4067.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4067
----
commit 3fe42a778381b80125895ed001a074614f821ab9
Author: Kostas Sakellis <[email protected]>
Date: 2014-11-04T01:59:18Z
[SPARK-4092] [CORE] Fix InputMetrics for coalesce'd Rdds
When calculating the input metrics, there was an assumption
that one task reads from only one block. This is not true
for some operations, including coalesce. This patch simply
increments the task's input metrics if previous ones with
the same read method exist.
A limitation of this patch is that if a task reads from
two different blocks with different read methods, one will override
the other.
commit 0f2f7d430f44ae0488dcd86a65ff910e99fe07e0
Author: Kostas Sakellis <[email protected]>
Date: 2014-11-11T18:41:27Z
CR feedback
commit a8b5626a574cd023914917cedf2b288c12e47a41
Author: Kostas Sakellis <[email protected]>
Date: 2014-12-13T00:54:04Z
Add bytesReadCallback to InputMetrics
Also added a test for interleaving reads.
commit 66257f7c14c4e013ae897d5b7a0be108afc07a89
Author: Kostas Sakellis <[email protected]>
Date: 2015-01-14T02:03:29Z
Drops metrics if conflicting read methods exist
Tasks now only store/accumulate input metrics from
the same read method. If a task has interleaved reads
from more than one block with different read methods, we
choose to store the first read method's metrics.
https://issues.apache.org/jira/browse/SPARK-5225
addresses keeping track of all input metrics.
This change also centralizes this logic in TaskMetrics
and gates how inputMetrics can be added to TaskMetrics.
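The gating described in this commit message could be sketched in Scala roughly as follows. `ReadMethod`, `InputMetricsSketch`, and `TaskMetricsSketch` are hypothetical simplifications for illustration and do not reproduce the real TaskMetrics code.

```scala
// Hypothetical read methods; a small subset chosen for illustration.
sealed trait ReadMethod
case object Memory extends ReadMethod
case object Hadoop extends ReadMethod

final case class InputMetricsSketch(method: ReadMethod, bytesRead: Long)

final class TaskMetricsSketch {
  private var input: Option[InputMetricsSketch] = None

  // Single entry point for input metrics, so the gating lives in one place.
  def addInput(method: ReadMethod, bytes: Long): Unit = {
    input = input match {
      // First read: remember its read method.
      case None => Some(InputMetricsSketch(method, bytes))
      // Same read method as before: accumulate.
      case Some(m) if m.method == method =>
        Some(m.copy(bytesRead = m.bytesRead + bytes))
      // Conflicting read method: keep the first method's metrics, drop this one.
      case existing => existing
    }
  }

  def inputMetrics: Option[InputMetricsSketch] = input
}
```

For example, after `addInput(Memory, 10)`, `addInput(Memory, 5)`, and `addInput(Hadoop, 100)`, the sketch holds 15 bytes attributed to `Memory`, and the `Hadoop` read is dropped.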
commit 571cb69f694b7f05e829c7237a07f68fa058e118
Author: Kostas Sakellis <[email protected]>
Date: 2015-01-15T04:45:35Z
[SPARK-4874] [CORE] Collect record count metrics
Collects record counts for both Input/Output and Shuffle
Metrics. For the input/output metrics, it simply increments
the counter every time the iterators are accessed.
For shuffle, on the write side we count records
post-aggregation (after a map-side combine), and on the read
side we count records pre-aggregation. This allows
the bytes read/written metrics and the
records read/written metrics to line up.
For backwards compatibility, if we deserialize an older
event that doesn't have record metrics, we set the
metric to -1.
----