HeartSaVioR commented on pull request #31296: URL: https://github.com/apache/spark/pull/31296#issuecomment-766535285
> I'm not sure how much details you'd like to see?

I'm quoting my comment:

> That's why I want to hear the actual use case, what is the type of T Dataset, which task the external process does, what is the output of external process, should they need to break down the output to multiple columns after that.

There's no "detail" on business logic here. If the type T is a Java bean, a case class, or something similar, just say that it's a Java bean/case class. If it's typed but bound to a non-primitive type like a tuple, please mention it. If it's untyped, you need to mention that too, including whether it has complex column(s). Fair enough for you?

> For DataFrame, we know it is actually Dataset[Row]. If users need custom print-out, printRDDElement will take Row as input type.

What is the output of `Row.toString` then? Does it stay consistent if we change the implementation of Row? Even more, should end users have to know about that? I already pointed out earlier that the default serializer only makes sense if the type T is a Java/Scala primitive, for which the output of `T.toString` is relatively intuitive for end users. For other types, including "untyped", the default serializer won't work (or, even if end users are able to infer the format, it is still fragile).

> This is how other typed functions (map, foreach...) work with untyped Dataset.

The problem is that other typed functions receive the Row simply as a `Row` and are able to call `Row.getString` or something like that, even knowing which columns the Row instance has. The Row doesn't need to be serialized into some other form. Does that apply to the external process? No. Spark has to serialize the Row instance to send it to the external process, and the serialized form of the Row instance is "unknown" to end users unless they craft a serializer by hand using `Row.getString` and so on. That's why I said the default serializer doesn't work with an untyped Dataset.
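To make the serialization concern concrete, here is a minimal sketch that uses the existing `RDD.pipe` API rather than the `Dataset.pipe` proposed in this PR. The column names, the helper `serializeRow`, the tab-separated wire format, and the use of `cat` as the external process are all assumptions made for illustration. It contrasts relying on `Row.toString` with an explicit serializer built from `Row.getString` and `Row.getInt`.

```scala
import org.apache.spark.sql.{Row, SparkSession}

object PipeRowSketch {
  // Hypothetical helper: an explicit wire format chosen by the user.
  // It assumes column 0 is a string and column 1 is an int.
  def serializeRow(row: Row): String =
    s"${row.getString(0)}\t${row.getInt(1)}"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("pipe-row-sketch")
      .getOrCreate()
    import spark.implicits._

    // An untyped Dataset (DataFrame) with assumed columns "name" and "age".
    val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

    // Without printRDDElement, each element is written to the process using
    // its toString, so the external process sees whatever Row.toString emits.
    val viaToString = df.rdd.pipe(Seq("cat")).collect()

    // With printRDDElement, the user decides the serialized form explicitly.
    val viaSerializer = df.rdd.pipe(
      command = Seq("cat"),
      printRDDElement = (row: Row, out: String => Unit) => out(serializeRow(row))
    ).collect()

    viaToString.foreach(line => println(s"toString form:   $line"))
    viaSerializer.foreach(line => println(s"serializer form: $line"))

    spark.stop()
  }
}
```

The sketch assumes a Unix-like environment where `cat` is on the PATH. Only the second call gives the external process a format that is defined by the user and independent of how `Row` happens to implement `toString`.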
