[GitHub] [spark] HeartSaVioR commented on pull request #31296: [SPARK-34205][SQL][SS] Add pipe to Dataset to enable Streaming Dataset pipe

GitBox Sun, 24 Jan 2021 20:41:21 -0800


HeartSaVioR commented on pull request #31296:
URL: https://github.com/apache/spark/pull/31296#issuecomment-766535285



   > I'm not sure how much details you'd like to see?
   
   I'm quoting my comment:
   
   > That's why I want to hear the actual use case, what is the type of T 
Dataset, which task the external process does, what is the output of external 
process, should they need to break down the output to multiple columns after 
that.
   
   I'm not asking for any "detail" of the business logic. If the type T is a 
Java bean, a case class, or something similar, just say that it's a Java 
bean/case class. If it's typed but bound to a non-primitive type such as a 
tuple, please mention that. If it's untyped, please say so, including whether 
it has complex column(s).
   
   Fair enough for you?
   
   > For DataFrame, we know it is actually Dataset[Row]. If users need custom 
print-out, printRDDElement will take Row as input type.
   
   What is the output of `Row.toString`, then? Does it stay consistent if we 
change the implementation of Row? More importantly, should end users even need 
to know about it?
   
   I already pointed out earlier that the default serializer only makes sense 
if the type T is a Java/Scala primitive, where the output of `T.toString` is 
relatively intuitive for end users. For other types, including "untyped", the 
default serializer won't work (or, even if end users are able to infer its 
output, relying on it is still fragile).
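
   To illustrate the `toString` point, here is a minimal plain-Scala sketch 
(no Spark involved). `Person` and `PersonBean` are made-up types for 
illustration only; they are not from this PR.

```scala
// Hypothetical types for illustration only; none of this comes from the PR.
final case class Person(name: String, age: Int)

// A Java-bean-like class that does not override toString.
final class PersonBean(var name: String, var age: Int)

object DefaultToStringDemo {
  def main(args: Array[String]): Unit = {
    // Primitive-like values: the text form is what an end user would expect.
    println(42.toString)                  // 42
    println("alice".toString)             // alice

    // Case class and tuple: readable, but the format is a language convention
    // that the external process would have to parse.
    println(Person("alice", 30).toString) // Person(alice,30)
    println(("alice", 30).toString)       // (alice,30)

    // Bean-like class: falls back to Object.toString (class name plus hash),
    // which carries none of the field values.
    println(new PersonBean("alice", 30).toString)
  }
}
```

   Only the first two lines are something I'd expect end users to rely on 
without reading the implementation.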
   
   > This is how other typed functions (map, foreach...) work with untyped 
Dataset.
   
   The difference is that other typed functions receive the Row simply as a 
`Row`, and can call `Row.getString` or similar, knowing which columns the Row 
instance has. The Row doesn't need to be serialized into some other form. Does 
that apply to the external process? No. Spark has to serialize the Row instance 
to send it to the external process, and the serialized form of a Row instance 
is "unknown" to end users unless they craft a serializer by hand using 
`Row.getString` and so on. That's why I said the default serializer doesn't 
work with an untyped Dataset.
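
   A rough sketch of what "crafting a serializer by hand" could look like. It 
assumes a hypothetical two-column layout (a String name at index 0, an Int age 
at index 1) and borrows the function shape of the `printRDDElement` parameter 
of the existing `RDD.pipe`; the exact signature proposed in this PR may differ.

```scala
import org.apache.spark.sql.Row

object RowSerializerSketch {
  // Assumed (hypothetical) layout: column 0 is a String, column 1 is an Int.
  // Shape mirrors RDD.pipe's printRDDElement:
  // (element, function writing one line to the process's stdin) => Unit
  val printRow: (Row, String => Unit) => Unit = (row, writeLine) =>
    writeLine(s"${row.getString(0)}\t${row.getInt(1)}")

  def main(args: Array[String]): Unit = {
    val row = Row("alice", 30)

    // Default form: whatever the Row implementation's toString produces.
    println(row.toString)

    // Hand-crafted form: a tab-separated line whose format the caller controls.
    printRow(row, println)
  }
}
```

   With the explicit function, the wire format is decided by the end user 
rather than by whatever `Row.toString` happens to produce.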


