viirya commented on pull request #31296: URL: https://github.com/apache/spark/pull/31296#issuecomment-765877154
> Yes, the question applies to `RDD.pipe` as well, but there the serialization is done via `OutputStreamWriter.println`, which is relatively "known": `String.valueOf(T)`, printed out. Easy to reason about, though it shows bad performance and is tricky to deserialize if `toString` is implemented to be human-friendly.
>
> Here we are serializing via `Encoder`, which is like a black box to others (have we documented how an encoder encodes the object?), so that can't be the users' responsibility. And once we do this, the way we encode also becomes a kind of public API.

`Encoder` is how we serialize external data into a `Dataset`. Why is this an issue? This doesn't invent anything new, and `Encoder` is not specific to how we pipe here. This follows the approach of all the other typed Dataset APIs (`map`, `foreach`, ...): use the `Encoder` to deserialize the internal row into a domain object `T`. It then follows the RDD pipe API to print `T` out to the forked process.

----------------------------------------------------------------

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
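For context on the mechanism being compared against: a minimal JVM sketch (not Spark's actual code) of what `RDD.pipe`-style serialization amounts to, assuming a Unix `cat` as the external command. The `Point` record and the `pipe` helper are hypothetical names for illustration; the point is that each element reaches the forked process only as its `toString`, which is easy to reason about but lossy to deserialize.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

public class PipeSketch {
    // Hypothetical element type T with a human-friendly toString.
    record Point(int x, int y) {}

    // Mimics the essence of RDD.pipe: each element is written via
    // String.valueOf(t) (i.e. its toString), one per line, to the
    // forked process's stdin; the process's stdout lines come back.
    static <T> List<String> pipe(List<T> elems, List<String> command)
            throws IOException {
        Process proc = new ProcessBuilder(command).start();
        try (PrintWriter out = new PrintWriter(proc.getOutputStream())) {
            for (T t : elems) {
                out.println(String.valueOf(t));
            }
        }
        List<String> result = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(proc.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        List<String> result = pipe(
                List.of(new Point(1, 2), new Point(3, 4)),
                List.of("cat"));
        // The forked process only ever saw toString text such as
        // "Point[x=1, y=2]"; turning that back into a Point is on the caller.
        System.out.println(String.join("|", result));
    }
}
```

The Dataset-side proposal under discussion keeps this last step identical (print `T` to the forked process); the `Encoder` is only used upstream, to turn the internal row into `T`, exactly as `map` and `foreach` already do.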
