viirya commented on pull request #31296: URL: https://github.com/apache/spark/pull/31296#issuecomment-766152881
From the performance perspective, we are not changing any currently existing API, so this is not a performance regression. Users presumably understand the performance implications of using pipe. In practice, pipe is the way to go when there is no other choice, e.g. when they are unable to replace the forked process. I'm not sure anyone has ever complained about it.

For a batch Dataset, it is easy to convert to an RDD and run RDD.pipe. For a streaming one, that is simply impossible. Maybe this is also related to the Structured Streaming adoption rate?

A UDF or expression does not work here because pipe is not a 1-to-1 relation between input and output. For example, piping through `wc -l` produces a single output per partition.

There will also be issues if one just invokes the forked process in the mapPartitions API, e.g. interrupting the stdin writer when the task is finished. We did some work in PipedRDD to make pipe work well with Spark. I think it is better to reuse that work instead of having every user re-invent it all.
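To make the 1-to-1 point concrete, here is a minimal Python sketch (plain `subprocess`, not Spark code) that mimics what RDD.pipe does per partition: each element is written to the child process's stdin as a line, and the child's stdout lines become the output elements. With `wc -l`, a partition of N elements collapses to a single output line, which is why an expression or UDF contract (one output per input row) cannot model it. The `pipe_partition` helper name and the sample partitions are illustrative, not from Spark.

```python
import subprocess

def pipe_partition(lines, command):
    """Feed each partition element to the command's stdin, one per
    line, and return the command's stdout split into lines. This
    mimics (but is not) Spark's RDD.pipe semantics for one partition."""
    proc = subprocess.run(
        command,
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
    )
    return proc.stdout.splitlines()

# Two partitions, of three and two elements respectively:
partitions = [["a", "b", "c"], ["d", "e"]]

# Piping each partition through `wc -l` yields ONE output line per
# partition, so input and output cardinalities differ: 3 -> 1, 2 -> 1.
out = [pipe_partition(p, ["wc", "-l"]) for p in partitions]
```

Note that the real PipedRDD does considerably more than this sketch, e.g. writing stdin from a separate thread and tearing that writer down when the task is interrupted, which is exactly the machinery worth reusing.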
