[GitHub] [spark] viirya edited a comment on pull request #31296: [SPARK-34205][SQL][SS] Add pipe to Dataset to enable Streaming Dataset pipe

GitBox Sun, 24 Jan 2021 20:25:12 -0800


viirya edited a comment on pull request #31296:
URL: https://github.com/apache/spark/pull/31296#issuecomment-766530579



   > Is it too hard a requirement to explain the actual use case, especially 
since you've said you have an internal customer requesting this feature? I 
don't think my request requires anything that needs redaction. (If there is 
something sensitive, you can abstract the details or redact it yourself.) My 
first comment asked about the actual use case, and I have been asking consistently.
   
   I'm not sure how much detail you'd like to see. The script we will call? 
The data input it accepts? The general use case is to pipe data through an 
external process. But if you insist, I will need to get more information from 
the client. I just wonder whether it is necessary to provide details beyond the 
general use case of pipe. Would knowing what input/script the client uses be 
helpful? We are not providing a client-specific API.
   
   Okay, let me try to get more details tomorrow to make everyone happy here.
   
   > I don't think `RDD.pipe` and `Dataset.pipe` are exactly the same, at least 
in the usability of the default `printRDDElement`. Lots of users use the 
"untyped" Dataset (DataFrame), for which the default `printRDDElement` would 
depend on the internal implementation (Row is just an interface). The default 
serializer implementation only works if the Dataset has a single column whose 
type matches a Java/Scala type; otherwise users will always want to provide 
their own serializer implementation. Based on this, I wonder whether we should 
allow a default serializer at all - we probably want to require end users to 
provide a serializer so that they know what they are doing.
   
   A DataFrame is actually a Dataset[Row]. If users need a custom print-out, 
`printRDDElement` takes Row as its input type. I don't see a problem here. 
This is how other typed functions (map, foreach...) work with an untyped 
Dataset. I don't really get the serializer point you just mentioned.
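   
   As a rough illustration of the two positions above, here is a Spark-free 
sketch of pipe-style semantics. It is not the API proposed in this PR: 
`pipe_rows`, `print_element`, the tuple-based `Row`, and the upper-casing child 
process are all hypothetical names chosen for the example. Each element is 
rendered to one line by a caller-supplied printer, the lines are written to an 
external process, and the process's output lines are the result.

```python
import subprocess
import sys
from typing import Callable, Iterable, List, Sequence, Tuple

# Hypothetical stand-in for a Row: a plain tuple of column values.
Row = Tuple[object, ...]


def pipe_rows(
    rows: Iterable[Row],
    command: Sequence[str],
    print_element: Callable[[Row], str],
) -> List[str]:
    """Render each row to one line, feed the lines to `command`,
    and return the lines the command writes to stdout."""
    stdin_text = "".join(print_element(row) + "\n" for row in rows)
    result = subprocess.run(
        list(command),
        input=stdin_text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


# Assumed external process: a one-line script that upper-cases its stdin.
UPPER = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())",
]

rows = [("alice", 30), ("bob", 25)]

# Custom printer: the caller decides the wire format (here, comma-separated).
custom = pipe_rows(rows, UPPER, lambda row: ",".join(str(v) for v in row))

# Default-style printer: falls back to the element's string form, so the
# output depends on how the row type happens to render itself.
default = pipe_rows(rows, UPPER, str)

print(custom)   # ['ALICE,30', 'BOB,25']
print(default)  # ["('ALICE', 30)", "('BOB', 25)"]
```

   The custom printer receives the whole row, as a typed function would. The 
default printer also works, but what the external process receives is tied to 
the row's string representation, which is the usability concern raised in the 
quoted comment.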
   
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



