GitHub user jerryshao opened a pull request:
https://github.com/apache/spark/pull/17617
[SPARK-20244][Core] Handle get bytesRead from different thread in Hadoop RDD
## What changes were proposed in this pull request?
Hadoop FileSystem's statistics are based on thread-local variables. This is
fine if the whole RDD computation chain runs in the same thread. But if a child RDD
creates another thread to consume the iterator obtained from a Hadoop RDD, the
`bytesRead` computation will be wrong, because the iterator's `next()` and
`close()` may now run in different threads. This can happen when using
PySpark with PythonRDD.
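
To illustrate the failure mode, here is a minimal Python sketch. It is not Spark or Hadoop code: `_stats`, `record_read` and `bytes_read_on_this_thread` are hypothetical stand-ins for a per-thread statistics counter. It only shows that bytes counted on one thread are invisible to a thread that later asks for the total.

```python
import threading

# Hypothetical stand-in for per-thread FileSystem statistics (assumption,
# not the Hadoop API): each thread sees only its own counter.
_stats = threading.local()


def record_read(n):
    """Add n bytes to the calling thread's counter."""
    _stats.bytes_read = getattr(_stats, "bytes_read", 0) + n


def bytes_read_on_this_thread():
    """Return the bytes counted on the calling thread only."""
    return getattr(_stats, "bytes_read", 0)


def run_reader():
    # Plays the role of the thread that calls the iterator's next().
    for _ in range(3):
        record_read(100)


reader = threading.Thread(target=run_reader)
reader.start()
reader.join()

# The main thread plays the role of the thread that updates the metrics.
# It never read anything itself, so the 300 bytes read above are missed.
seen_from_main = bytes_read_on_this_thread()
print(seen_from_main)
```

In this sketch the reader thread counted 300 bytes, but the thread that queries the counter sees none of them.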
This change builds a map that tracks `bytesRead` per thread and
adds the values together. The method is used in three RDDs: `HadoopRDD`,
`NewHadoopRDD` and `FileScanRDD`. I assume `FileScanRDD` cannot be called
directly, so I only fixed `HadoopRDD` and `NewHadoopRDD`.
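
The per-thread map idea can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the code in this patch: `make_bytes_read_callback` and the counter helpers are hypothetical names, and the baseline handling (subtracting what the creating thread had already read) is one plausible way to do it.

```python
import threading

# Hypothetical stand-in for per-thread FileSystem statistics (assumption).
_stats = threading.local()


def record_read(n):
    _stats.bytes_read = getattr(_stats, "bytes_read", 0) + n


def bytes_read_on_this_thread():
    return getattr(_stats, "bytes_read", 0)


def make_bytes_read_callback():
    """Return a callback that sums bytes read across every thread that calls it."""
    creator_id = threading.get_ident()
    baseline = bytes_read_on_this_thread()  # bytes read before tracking began
    per_thread = {}                         # thread id -> latest counter value
    lock = threading.Lock()

    def callback():
        tid = threading.get_ident()
        current = bytes_read_on_this_thread()
        with lock:
            per_thread[tid] = current
            # Only the creating thread has a non-zero baseline to subtract.
            return sum(
                value - (baseline if t == creator_id else 0)
                for t, value in per_thread.items()
            )

    return callback


record_read(50)                        # read by the main thread before tracking
get_bytes_read = make_bytes_read_callback()


def run_reader():
    record_read(300)
    get_bytes_read()                   # registers this thread's counter


reader = threading.Thread(target=run_reader)
reader.start()
reader.join()

record_read(20)                        # main thread reads a little more
total = get_bytes_read()               # 300 from the reader + 20 from main
print(total)
```

Each thread must invoke the callback at least once for its bytes to be counted, since a thread-local value can only be read from the thread that owns it.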
## How was this patch tested?
Unit test and local cluster verification.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jerryshao/apache-spark SPARK-20244
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17617.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17617
----
commit d6f3c42c74ab38b0b6becc80a80b5aeda4459c40
Author: jerryshao <[email protected]>
Date: 2017-04-12T06:22:15Z
Handle get bytesRead from different thread
Change-Id: I8e64393151ef3eef22b868f6ae47a48ecb8694d3
----