viirya edited a comment on pull request #31296:
URL: https://github.com/apache/spark/pull/31296#issuecomment-766291831


   > I understand the functionality is lacking on SS. There's a workaround like foreachBatch -> toRDD -> pipe, but streaming operations can't be added after calling pipe. So I'd agree that it'd be better to address the gap in some way.
   > 
   > I feel the default serialization logic in PipedRDD is also fragile and not well documented. (This actually makes me wonder: is PipedRDD widely adopted?) Is there any documentation mentioning that T.toString is used as the serialization, and that it doesn't escape line breaks, so a multi-line string will be printed as multiple lines without any guard? The default implementation is too naive, and even for primitive types it's not hard to find a hole. There's a parameter to customize the serialization, and we can add it as well, which makes me less concerned, but the default should still be reasonable, with its limitations (if any) well explained.
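   
   (For context, the foreachBatch -> toRDD -> pipe workaround mentioned above looks roughly like the sketch below. This is only a rough illustration on the current APIs; the socket source, port, and output path are placeholders, not anything from this PR. As noted, once pipe is called the result is a plain RDD, so no further streaming operators can be chained.)
   
   ```scala
   import org.apache.spark.sql.{Dataset, SparkSession}
   
   val spark = SparkSession.builder().appName("foreachBatch-pipe-sketch").getOrCreate()
   import spark.implicits._
   
   // Illustrative streaming source; any streaming source works the same way here.
   val lines: Dataset[String] = spark.readStream
     .format("socket")
     .option("host", "localhost")
     .option("port", 9999)
     .load()
     .as[String]
   
   // Per micro-batch: drop to the RDD API, pipe through an external command, and
   // write the result out manually. The piped result is a plain RDD[String], so
   // no streaming operators can be applied after this point.
   val writeBatch: (Dataset[String], Long) => Unit = (batch, batchId) => {
     val piped = batch.rdd.pipe("wc -l")
     piped.saveAsTextFile(s"/tmp/pipe-out/batch-$batchId") // placeholder sink path
   }
   
   val query = lines.writeStream.foreachBatch(writeBatch).start()
   query.awaitTermination()
   ```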
   
   The current RDD.pipe documentation doesn't explicitly mention that we output the string form of T. This is what it says: "All elements of each input partition are written to a process's stdin as lines of input separated by a newline." If you think that is not enough, we can improve the API documentation. About the parameter, do you mean `printRDDElement`?
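   
   To make the two behaviours concrete, here is a minimal sketch against the existing RDD API (the `Event` class, its fields, and the `wc -l` command are placeholders): the default path writes `T.toString` per element, so an embedded newline silently becomes an extra input line, while `printRDDElement` lets the caller decide exactly what is written.
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   // Placeholder domain type for the sketch; the fields are illustrative only.
   case class Event(id: Long, payload: String)
   
   val spark = SparkSession.builder().appName("pipe-serialization-sketch").master("local[*]").getOrCreate()
   // Single partition so that `wc -l` sees all elements in one invocation.
   val events = spark.sparkContext.parallelize(Seq(
     Event(1L, "ok"),
     Event(2L, "payload with\nan embedded newline")
   ), 1)
   
   // Default behaviour: each element is written as T.toString followed by a newline,
   // so the embedded newline above makes the child process see 3 input lines.
   val defaultCounts = events.pipe("wc -l")
   
   // printRDDElement puts the caller in charge of the serialized form; here newlines
   // are escaped so one element always maps to exactly one line on the child's stdin,
   // and the child process sees only 2 input lines.
   val guardedCounts = events.pipe(
     command = Seq("wc", "-l"),
     printRDDElement = (e: Event, printFunc: String => Unit) =>
       printFunc(s"${e.id}\t${e.payload.replace("\n", "\\n")}")
   )
   
   println(defaultCounts.collect().mkString(","))
   println(guardedCounts.collect().mkString(","))
   ```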
   
   > And as I said above, I don't think they'll be able to understand the serialized form for multiple columns or complicated column types. They'll end up using a custom class for type T which overrides toString. (And the output of toString shouldn't span multiple lines.)
   
   As we discussed before, users don't need to understand how Spark serializes object T to InternalRow; this is still hidden from them. With the `printRDDElement` parameter, users only deal with the domain object T and only need to decide what part of T should be output to the forked process.
   
   I think some of the questions are beyond the scope of the pipe concept, for example the one about piping only one column but retaining the other nine columns for the next operation. Users also cannot pipe only one field of object T and retain all the others afterwards when using RDD's pipe (see the sketch below).
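   
   To illustrate that point on the existing RDD API (again just a sketch; the `Record` type and the `tr` command are placeholders): `pipe` returns an `RDD[String]`, so any field that is not printed is gone unless the caller re-attaches it, for example by zipping, and that only works when the external command emits exactly one output line per input element.
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   // Hypothetical two-field record; pipe only sees the field we choose to print.
   case class Record(id: Long, text: String)
   
   val spark = SparkSession.builder().appName("pipe-retain-sketch").master("local[*]").getOrCreate()
   val records = spark.sparkContext.parallelize(Seq(Record(1L, "alpha"), Record(2L, "beta")), 2)
   
   // pipe returns RDD[String]; the `id` field is not carried through the child process.
   val piped = records.pipe(
     command = Seq("tr", "a-z", "A-Z"), // stand-in for a real per-line transformation
     printRDDElement = (r: Record, printFunc: String => Unit) => printFunc(r.text)
   )
   
   // One ad-hoc way to retain the other field is to zip the output back onto the input.
   // zip requires identical partitioning and element counts, so this is only safe when
   // the external command emits exactly one output line per input element.
   val rejoined = records.zip(piped) // RDD[(Record, String)]
   rejoined.collect().foreach { case (r, out) => println(s"${r.id} -> $out") }
   ```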
   

