GitHub user kayousterhout opened a pull request:
https://github.com/apache/spark/pull/962
[SPARK-1683] Track task read metrics.
This commit adds a new metric in TaskMetrics to record
the input data size and displays this information in the UI.
An earlier version of this commit also recorded the read time,
which can be useful for diagnosing stragglers, but that change
introduced a significant performance regression for jobs that
do little computation. To track read time without that cost,
we'll need to use sampling.
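
For illustration only, here is a rough sketch of how sampled read timing could look. It is not code from this pull request, which only records input size. The class name `InputMetricsSketch`, the `sampleEvery` parameter, and the extrapolation in `estimatedReadNanos` are all hypothetical. The idea is to count bytes on every read but call `System.nanoTime()` only on every N-th read, so jobs that do little computation pay the timing cost rarely.

```scala
// Hypothetical sketch: none of these names come from Spark's TaskMetrics.
final class InputMetricsSketch(sampleEvery: Int) {
  require(sampleEvery > 0, "sampleEvery must be positive")

  private var records = 0L       // total reads observed
  private var sampled = 0L       // reads that were actually timed
  private var sampledNanos = 0L  // time accumulated over the timed reads
  private var bytes = 0L         // input size, counted on every read

  /** Runs `read`, adds `size(result)` to the byte count, and times
    * only every `sampleEvery`-th call. */
  def recordRead[T](read: => T)(size: T => Long): T = {
    val timeThis = records % sampleEvery == 0
    records += 1
    val start = if (timeThis) System.nanoTime() else 0L
    val result = read
    if (timeThis) {
      sampledNanos += System.nanoTime() - start
      sampled += 1
    }
    bytes += size(result)
    result
  }

  def bytesRead: Long = bytes
  def recordsRead: Long = records
  def sampledReads: Long = sampled

  /** Extrapolates total read time from the timed reads (an estimate,
    * assuming timed reads are representative of the rest). */
  def estimatedReadNanos: Long =
    if (sampled == 0) 0L
    else (sampledNanos.toDouble / sampled * records).toLong
}
```

With `sampleEvery = 100`, for example, only one read in a hundred is timed, while the input size is still exact. Whether such an estimate is accurate enough for diagnosing stragglers would need to be measured.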
The screenshots below show the UI with the new "Input" field,
which I added to the stage summary page, the executor summary page,
and the per-stage page.



You can merge this pull request into a Git repository by running:
$ git pull https://github.com/kayousterhout/spark-1 read_metrics
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/962.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #962
----
commit 40a028bf32197e360eaebc0927a9e8cbd9f61792
Author: Kay Ousterhout <[email protected]>
Date: 2014-04-10T20:55:44Z
[SPARK-1683] Track task read metrics.
----