viirya commented on pull request #31296: URL: https://github.com/apache/spark/pull/31296#issuecomment-765877154
> Yes, the question applies to `RDD.pipe` as well, but there the serialization is done via `OutputStreamWriter.println`, which is relatively "known": `String.valueOf(T)`, printed out. Easy to reason about, though it shows bad performance and is tricky to deserialize if `toString` is implemented to be human-friendly.
>
> Here we are serializing via `Encoder`, which is like a black box to others (have we documented how an encoder encodes the object?), so that can't be the users' responsibility. And once we do this, the way we encode also becomes a kind of public API.

`Encoder` is how we serialize external data into a `Dataset`. Why is this an issue? This doesn't invent anything new, and `Encoder` is not specific to how we pipe here. This follows the approach of all the other typed Dataset APIs (`map`, `foreach`, ...): use the `Encoder` to deserialize the internal row into a domain object `T`. It then follows the RDD pipe API to print `T` out to the forked process.

----------------------------------------------------------------

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
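For context on the mechanism being compared against: a minimal JVM sketch (not Spark's actual code) of what `RDD.pipe`-style serialization amounts to, assuming a Unix `cat` as the external command. The `Point` record and the `pipe` helper are hypothetical names for illustration; the point is that each element reaches the forked process only as its `toString`, which is easy to reason about but lossy to deserialize.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

public class PipeSketch {
    // Hypothetical element type T with a human-friendly toString.
    record Point(int x, int y) {}

    // Mimics the essence of RDD.pipe: each element is written via
    // String.valueOf(t) (i.e. its toString), one per line, to the
    // forked process's stdin; the process's stdout lines come back.
    static <T> List<String> pipe(List<T> elems, List<String> command)
            throws IOException {
        Process proc = new ProcessBuilder(command).start();
        try (PrintWriter out = new PrintWriter(proc.getOutputStream())) {
            for (T t : elems) {
                out.println(String.valueOf(t));
            }
        }
        List<String> result = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(proc.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        List<String> result = pipe(
                List.of(new Point(1, 2), new Point(3, 4)),
                List.of("cat"));
        // The forked process only ever saw toString text such as
        // "Point[x=1, y=2]"; turning that back into a Point is on the caller.
        System.out.println(String.join("|", result));
    }
}
```

The Dataset-side proposal under discussion keeps this last step identical (print `T` to the forked process); the `Encoder` is only used upstream, to turn the internal row into `T`, exactly as `map` and `foreach` already do.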
