viirya commented on pull request #31296: URL: https://github.com/apache/spark/pull/31296#issuecomment-766152881
From the performance perspective, we are not changing any currently existing API, so this is not a performance regression. Users presumably understand the performance implications of using pipe. In practice, pipe is the way to go when there is no other choice, e.g. when they are unable to replace the forked process. I'm not sure anyone has ever complained about it.

For a batch Dataset, it is easy to convert to an RDD and run RDD.pipe. For a streaming one, that is simply impossible. Maybe this is also related to the Structured Streaming adoption rate?

A UDF or expression does not work here because pipe is not a 1-to-1 relation between input and output. For example, piping through `wc -l` produces a single output per partition.

There will also be issues if one just invokes the forked process in the mapPartitions API, e.g. interrupting the stdin writer when the task is finished. We did some work in PipedRDD to make pipe work well with Spark. I think it is better to reuse that work instead of having every user re-invent it all.
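To make the 1-to-1 point concrete, here is a minimal Python sketch (plain `subprocess`, not Spark code) that mimics what RDD.pipe does per partition: each element is written to the child process's stdin as a line, and the child's stdout lines become the output elements. With `wc -l`, a partition of N elements collapses to a single output line, which is why an expression or UDF contract (one output per input row) cannot model it. The `pipe_partition` helper name and the sample partitions are illustrative, not from Spark.

```python
import subprocess

def pipe_partition(lines, command):
    """Feed each partition element to the command's stdin, one per
    line, and return the command's stdout split into lines. This
    mimics (but is not) Spark's RDD.pipe semantics for one partition."""
    proc = subprocess.run(
        command,
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
    )
    return proc.stdout.splitlines()

# Two partitions, of three and two elements respectively:
partitions = [["a", "b", "c"], ["d", "e"]]

# Piping each partition through `wc -l` yields ONE output line per
# partition, so input and output cardinalities differ: 3 -> 1, 2 -> 1.
out = [pipe_partition(p, ["wc", "-l"]) for p in partitions]
```

Note that the real PipedRDD does considerably more than this sketch, e.g. writing stdin from a separate thread and tearing that writer down when the task is interrupted, which is exactly the machinery worth reusing.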
