Github user sitalkedia commented on a diff in the pull request:
https://github.com/apache/spark/pull/12309#discussion_r60100537
--- Diff: core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala ---
@@ -144,7 +142,8 @@ private[spark] class PipedRDD[T: ClassTag](
new Thread(s"stdin writer for $command") {
override def run(): Unit = {
TaskContext.setTaskContext(context)
- val out = new PrintWriter(proc.getOutputStream)
+ val out = new PrintWriter(new BufferedWriter(
--- End diff ---
@srowen - Thanks for taking a look. In our testing we found that using
a large buffer (1 MB) gives us a CPU savings of around 15%. It makes
sense to be able to increase the buffer size when piping a large amount
of data. If changing a public API is not too much trouble, a configurable
buffer size would be quite useful for us.
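To illustrate, here is a minimal sketch of the kind of wrapping the diff
does, with the buffer size lifted into a parameter. The helper name
`bufferedPrintWriter` and the 1 MB default are illustrative assumptions for
this sketch, not the actual PipedRDD code or a proposed API.

```scala
import java.io.{BufferedWriter, ByteArrayOutputStream, OutputStream,
  OutputStreamWriter, PrintWriter}
import java.nio.charset.StandardCharsets

// Sketch only: wrap the raw output stream in a BufferedWriter with a large,
// configurable buffer so each write is batched in memory instead of hitting
// the underlying stream directly. The 1 MB default is the size we tested with.
def bufferedPrintWriter(out: OutputStream,
                        bufferSize: Int = 1024 * 1024): PrintWriter =
  new PrintWriter(new BufferedWriter(
    new OutputStreamWriter(out, StandardCharsets.UTF_8), bufferSize))

// Usage: lines accumulate in the buffer until flush() or close().
val sink = new ByteArrayOutputStream()
val writer = bufferedPrintWriter(sink)
writer.println("hello")
writer.flush()
```

In the real code the `OutputStream` would be `proc.getOutputStream`; the
`ByteArrayOutputStream` here is just a stand-in so the sketch is runnable.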
Regarding your second point, I am not sure I understand you correctly.
My change does not alter the behavior of the PrintWriter at all. Do you
mean that the issue with UTF-8 encoding already exists and that I should
fix it in this diff?