GitHub user sitalkedia opened a pull request:
https://github.com/apache/spark/pull/12309
[SPARK-14542][CORE] PipeRDD should allow configurable buffer size for…
## What changes were proposed in this pull request?
Currently, PipedRDD uses a PrintWriter to write data to the stdin of the
piped process. By default, that PrintWriter wraps a BufferedWriter with an
8k buffer. In our experiments, an 8k buffer proved too small: the job spends
a significant amount of CPU time in system calls that copy the data. We
should provide a way to configure the buffer size of the writer.
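For illustration only, here is a minimal Java sketch of the idea, not the patch itself. It assumes the writer is built by wrapping the process's stdin stream in a BufferedWriter whose size is passed in explicitly. The class name `BufferedPipeSketch`, the helpers `stdinWriter` and `render`, and the 64k value are hypothetical, and an in-memory stream stands in for the piped process.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: none of these names come from the Spark code base.
class BufferedPipeSketch {

    // Build a PrintWriter over `stdin` with an explicit buffer size (in chars)
    // instead of relying on the default chosen by PrintWriter(OutputStream).
    static PrintWriter stdinWriter(OutputStream stdin, int bufferSize) {
        return new PrintWriter(
            new BufferedWriter(
                new OutputStreamWriter(stdin, StandardCharsets.UTF_8),
                bufferSize));
    }

    // Write `lines` records through a writer of the given buffer size and
    // return everything that reached the underlying stream. A
    // ByteArrayOutputStream stands in for the child process's stdin.
    static String render(int bufferSize, int lines) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintWriter out = stdinWriter(sink, bufferSize)) {
            for (int i = 0; i < lines; i++) {
                out.println("record-" + i);
            }
        } // close() flushes whatever is still buffered
        return new String(sink.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String small = render(8 * 1024, 1000);   // 8k buffer
        String large = render(64 * 1024, 1000);  // assumed larger setting
        // The buffer size changes how the data is batched, not what is written.
        System.out.println("same output: " + small.equals(large));
    }
}
```

The sketch only shows where the size parameter plugs in and that the data written is unaffected by it. Any change in system-call overhead would have to be measured against a real piped process.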
## How was this patch tested?
Ran PipedRDDSuite tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sitalkedia/spark bufferedPipedRDD
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12309.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12309
----
commit 697433d49fde2b5f76ab2a7b986e133c435efdc3
Author: Sital Kedia <[email protected]>
Date: 2016-04-11T22:43:04Z
[SPARK-14542][CORE] PipeRDD should allow configurable buffer size for the
stdin writer
----