Github user sitalkedia commented on a diff in the pull request:
https://github.com/apache/spark/pull/12309#discussion_r60100537
--- Diff: core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala ---
@@ -144,7 +142,8 @@ private[spark] class PipedRDD[T: ClassTag](
new Thread(s"stdin writer for $command") {
override def run(): Unit = {
TaskContext.setTaskContext(context)
- val out = new PrintWriter(proc.getOutputStream)
+ val out = new PrintWriter(new BufferedWriter(
--- End diff ---
@srowen - Thanks for taking a look. In our testing we found that using
a large buffer (1 MB) gives us a CPU savings of around 15%. It makes
sense to be able to increase the buffer size when piping a large amount
of data. If changing a public API is not too much trouble, a configurable
buffer size would be quite useful for us.
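To illustrate, here is a minimal sketch of the kind of wrapping the diff
does, with the buffer size lifted into a parameter. The helper name
`bufferedPrintWriter` and the 1 MB default are illustrative assumptions for
this sketch, not the actual PipedRDD code or a proposed API.

```scala
import java.io.{BufferedWriter, ByteArrayOutputStream, OutputStream,
  OutputStreamWriter, PrintWriter}
import java.nio.charset.StandardCharsets

// Sketch only: wrap the raw output stream in a BufferedWriter with a large,
// configurable buffer so each write is batched in memory instead of hitting
// the underlying stream directly. The 1 MB default is the size we tested with.
def bufferedPrintWriter(out: OutputStream,
                        bufferSize: Int = 1024 * 1024): PrintWriter =
  new PrintWriter(new BufferedWriter(
    new OutputStreamWriter(out, StandardCharsets.UTF_8), bufferSize))

// Usage: lines accumulate in the buffer until flush() or close().
val sink = new ByteArrayOutputStream()
val writer = bufferedPrintWriter(sink)
writer.println("hello")
writer.flush()
```

In the real code the `OutputStream` would be `proc.getOutputStream`; the
`ByteArrayOutputStream` here is just a stand-in so the sketch is runnable.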
Regarding your second point, I am not sure I understand you correctly.
My change does not alter the behavior of the PrintWriter at all. Do you
mean that the issue with UTF-8 encoding already exists and that I should
fix it in this diff?