GitHub user sryza opened a pull request:

    https://github.com/apache/spark/pull/2087

    SPARK-2621. Update task InputMetrics incrementally

    The patch takes advantage of an API introduced in Hadoop 2.5 that exposes 
accurate data on the bytes read from a Hadoop FileSystem.  It eliminates the 
old method, which naively reported the split size as the input bytes.  One 
consequence of this change is that input metrics go away when running against 
Hadoop versions earlier than 2.5.  I can add the old behavior back in, but my 
opinion is that no metrics are better than inaccurate metrics.
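    
    For illustration, here is a minimal Scala sketch (not the code in this 
patch; the helper name threadBytesReadCallback is hypothetical) of how the 
thread-level FileSystem statistics introduced in Hadoop 2.5 (HADOOP-10688) 
can be polled to update metrics incrementally:
    
        import scala.collection.JavaConverters._
        import org.apache.hadoop.fs.FileSystem
        
        // Hypothetical helper: returns a callback that reports the bytes read
        // so far on the calling thread, across all FileSystem instances with
        // the given scheme.  Statistics.getThreadStatistics() is the Hadoop
        // 2.5+ API; it does not exist in earlier Hadoop versions.  Because the
        // statistics are thread-local, the callback must be invoked from the
        // same thread that performs the reads.
        def threadBytesReadCallback(scheme: String): () => Long =
          () => FileSystem.getAllStatistics.asScala
            .filter(_.getScheme == scheme)
            .map(_.getThreadStatistics.getBytesRead)
            .sum
        
        // A record reader can then refresh the task's InputMetrics
        // periodically (e.g. every few records) rather than assuming the
        // whole split size up front.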
    
    It is difficult to write a test for this because we don't usually build 
against a version of Hadoop that contains the API we need.  I've tested the 
change manually on a pseudo-distributed cluster.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sryza/spark sandy-spark-2621

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2087.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2087
    
----
commit b5f4c6c5d0be646798bc8188f610b00fb4be83fa
Author: Sandy Ryza <[email protected]>
Date:   2014-07-22T20:42:28Z

    SPARK-2621. Update task InputMetrics incrementally

----

