viirya commented on pull request #31296:
URL: https://github.com/apache/spark/pull/31296#issuecomment-765873925


   > Could you please describe the actual use case? I would like to confirm this works with a complicated schema like array/map/nested struct with some binary columns, and how the forked process can deserialize the inputs properly, apply operations, and serialize them again (that wouldn't matter much as the return is `Dataset[String]`, though).
   
   Our internal client needs to pipe streaming data read from Kafka through a forked process, and currently users cannot do this with Structured Streaming. I think the above question also applies to RDD.pipe: it is the user's responsibility to make sure the external process can understand the input.
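
   For reference, here is a minimal batch sketch of the existing `RDD.pipe` contract, using plain `cat` as the external command (the app name and master are placeholders, not from this PR):

   ```scala
   import org.apache.spark.sql.SparkSession

   object RddPipeExample {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .appName("rdd-pipe-example")
         .master("local[*]")
         .getOrCreate()

       // Each element is written to the external command's stdin as one line of
       // text; each line the command prints to stdout becomes one output element.
       // It is the user's job to make sure the command can parse that input.
       val rdd = spark.sparkContext.parallelize(Seq("a", "b", "c"))
       val piped = rdd.pipe("cat") // `cat` simply echoes its input back

       piped.collect().foreach(println)
       spark.stop()
     }
   }
   ```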
   
   > In addition, this would incur non-trivial serde cost on the communication between the Spark process and the external process. Probably we also need to revisit which benefits this gives us compared to what Spark provides now (UDF, or some others if I miss?).
   
   I believe there are cases where users have to use pipe, and the serde cost is a well-known issue for it; users should weigh that cost before choosing the pipe API. The point is that users may need to pipe streaming data just as they already can with RDD and batch Dataset, but streaming Dataset cannot support it. A hypothetical sketch of what that could look like follows.
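
   To make the gap concrete, here is a hypothetical sketch of piping a streaming Dataset read from Kafka. The `pipe(command)` method name and signature, the broker address, topic name, and script path are all assumptions for illustration; the quoted comment above only confirms the return type is `Dataset[String]`:

   ```scala
   import org.apache.spark.sql.SparkSession

   object StreamingPipeSketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .appName("streaming-pipe-sketch")
         .getOrCreate()
       import spark.implicits._

       // Read records from Kafka and expose each value as a line-oriented string.
       val kafkaLines = spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "host:9092")
         .option("subscribe", "events")
         .load()
         .selectExpr("CAST(value AS STRING)")
         .as[String]

       // Hypothetical call: pipe each record through a forked process, getting
       // back a Dataset[String], analogous to RDD.pipe. The method name and the
       // script path are assumptions, not the PR's confirmed API surface.
       val piped = kafkaLines.pipe("/usr/local/bin/my_filter.sh")

       piped.writeStream
         .format("console")
         .start()
         .awaitTermination()
     }
   }
   ```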
   
   

