viirya commented on pull request #31296: URL: https://github.com/apache/spark/pull/31296#issuecomment-765873925
> Could you please describe the actual use case? I would like to confirm this works with complicated schema like array/map/nested struct with some binary columns, and for the case how forked process can deserialize inputs properly and applies operations and serialize again (that wouldn't matter much as the return is `Dataset[String]` though).

Our internal client needs to pipe a streaming read from Kafka through a forked process, and currently users cannot do that with Structured Streaming. The deserialization question applies equally to `RDD.pipe`: it is the user's responsibility to make sure the forked process can understand the input format.

> In addition, this would incur non-trivial serde cost on communication between Spark process and external process. Probably we also need to revisit which benefits this gives to us compared to what Spark provides now (UDF, or some others if I miss?).

I believe there are cases where users have no choice but to use pipe, and they should weigh this cost before choosing the pipe API. The serde overhead is a well-known trade-off of pipe. The point is that when users need to pipe streaming data, pipe is available on RDDs and batch Datasets, but streaming Datasets cannot support it.