Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/12309#discussion_r59609962
--- Diff: core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala ---
@@ -144,7 +142,8 @@ private[spark] class PipedRDD[T: ClassTag](
new Thread(s"stdin writer for $command") {
override def run(): Unit = {
TaskContext.setTaskContext(context)
- val out = new PrintWriter(proc.getOutputStream)
+ val out = new PrintWriter(new BufferedWriter(
--- End diff --
Buffering here is probably a decent idea, with a small buffer. Is it even
necessary to make it configurable? 8K is pretty standard; you've found that a
larger buffer (32K?) is better. Would you ever want to turn it off, or make it
much larger than that? I ask because making it configurable means changing a
public API, and that's going to require additional steps.
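Roughly what I have in mind, as an illustrative sketch only (the fixed 8 KB
size and the `cat`/`proc` stand-ins are examples, not the actual patch):

    import java.io.{BufferedWriter, OutputStreamWriter, PrintWriter}

    // Stand-in for the process PipedRDD launches; "cat" is just an example command.
    val proc = new ProcessBuilder("cat").start()

    // Buffer the child's stdin with a fixed 8 KB buffer; no new public API needed.
    val out = new PrintWriter(new BufferedWriter(
      new OutputStreamWriter(proc.getOutputStream), 8192))

    out.println("hello")
    out.close()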
Separately, this needs to specify UTF-8 encoding. Actually, we have the same
problem in the stderr and stdout readers above: they rely on the platform
encoding. I can sort of see an argument that using the platform encoding makes
sense when dealing with platform binaries, but there's still no particular
reason to expect the JVM default to match whatever encoding a given binary is
using.
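As a sketch of the encoding side, again with illustrative stand-ins rather
than the actual PipedRDD code, both the writer and the readers can be pinned
to UTF-8 explicitly:

    import java.io.{BufferedWriter, OutputStreamWriter, PrintWriter}
    import java.nio.charset.StandardCharsets
    import scala.io.{Codec, Source}

    val proc = new ProcessBuilder("cat").start()

    // stdin writer pinned to UTF-8 instead of the platform default
    val out = new PrintWriter(new BufferedWriter(
      new OutputStreamWriter(proc.getOutputStream, StandardCharsets.UTF_8)))

    // stdout and stderr readers pinned to UTF-8 as well
    val stdoutLines = Source.fromInputStream(proc.getInputStream)(Codec.UTF8).getLines()
    val stderrLines = Source.fromInputStream(proc.getErrorStream)(Codec.UTF8).getLines()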