GitHub user sryza opened a pull request:
https://github.com/apache/spark/pull/2087
SPARK-2621. Update task InputMetrics incrementally
The patch takes advantage of an API provided in Hadoop 2.5 that allows getting
accurate data on the bytes read from a Hadoop FileSystem. It eliminates the old
method, which naively reported the split size as the input bytes. An impact of
this change is that input metrics go away when running against Hadoop versions
earlier than 2.5. I can add the old behavior back in, but my opinion is that no
metrics are better than inaccurate metrics.
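For reference, here is a rough sketch of the approach (illustrative only: the
metrics bookkeeping is simplified, and the real code has to look the method up
reflectively since Spark doesn't build against Hadoop 2.5):

    import scala.collection.JavaConverters._
    import org.apache.hadoop.fs.FileSystem

    // Hadoop 2.5 adds FileSystem.Statistics.getThreadStatistics(), which
    // exposes per-thread counters. Summing getBytesRead() across all
    // registered FileSystems gives the bytes actually read by this thread.
    def bytesReadByThisThread(): Long =
      FileSystem.getAllStatistics.asScala
        .map(_.getThreadStatistics.getBytesRead)
        .sum

    // Record a baseline before the task starts consuming records, then
    // periodically set the task's InputMetrics to the delta, rather than
    // assuming the whole split was read.
    val baseline = bytesReadByThisThread()
    // ... as records are read ...
    // inputMetrics.bytesRead = bytesReadByThisThread() - baseline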
This is difficult to write an automated test for because we don't usually
build against a version of Hadoop that contains the method we need. I've
tested the change manually on a pseudo-distributed cluster.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sryza/spark sandy-spark-2621
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2087.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2087
----
commit b5f4c6c5d0be646798bc8188f610b00fb4be83fa
Author: Sandy Ryza <[email protected]>
Date: 2014-07-22T20:42:28Z
SPARK-2621. Update task InputMetrics incrementally
----