HeartSaVioR commented on pull request #31296:
URL: https://github.com/apache/spark/pull/31296#issuecomment-766477734
> The current RDD.pipe doc doesn't explicitly mention that we output the string representation of T. This is what it says: "All elements of each input partition are written to a process's stdin as lines of input separated by a newline." If you think that is not enough, we can improve the API documentation. About the parameter, do you mean printRDDElement?
Yes, this is definitely not enough. It is only intuitive when T is a primitive type such as Int, Long, or String. If T is a Java bean and you override toString with an IDE's toString generator, the format depends on the IDE; for a case class it depends on Scala, and I don't think the toString representation is something Scala should guarantee compatibility for. Makes sense?
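
To illustrate, here is a minimal sketch of how a caller can pin the wire format down explicitly with RDD.pipe's printRDDElement parameter instead of relying on toString; the Record case class and its field layout are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical element type; its default toString, e.g. "Record(1,foo)",
// is a Scala implementation detail, not a stable wire format.
case class Record(id: Long, name: String)

object PipeFormatSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val rdd = spark.sparkContext.parallelize(Seq(Record(1L, "foo"), Record(2L, "bar")))

    // printRDDElement lets the caller decide exactly what each element
    // looks like on the external process's stdin, instead of toString.
    val piped = rdd.pipe(
      command = Seq("cat"), // stand-in for a real external tool
      printRDDElement = (r: Record, printLine: String => Unit) =>
        printLine(s"${r.id}\t${r.name}")
    )
    piped.collect().foreach(println)
    spark.stop()
  }
}
```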
> I think some of the questions are beyond the scope of the pipe concept. For example, the one about piping only one column but retaining the other nine columns for the next operation. The user also cannot pipe only one field of object T and retain all the others afterwards when using RDD's pipe.
Once you add pipe as one of the DataFrame operations, 'pipe' should be evaluated as a DataFrame operation. End users of pipe wouldn't use a trivial external process like "cat" or "wc -l" which completely ignores the structure of the input, but I can't find any example beyond such things.
(I don't think something is reasonable "just because" the previous RDD API works like that.)
That's why I want to hear about an actual use case: what is the type T of the Dataset, which task does the external process perform, what is the output of the external process, and do users need to break that output down into multiple columns afterwards?
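
For context, here is a minimal sketch of the workaround such a use case needs today with RDD.pipe, assuming a hypothetical two-column input where only one column is meant for the external process; note the id column only survives because we encode it into the stdin line by hand:

```scala
import org.apache.spark.sql.SparkSession

object RddPipeWorkaroundSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical input: (id, text); only `text` is meant for the process.
    val df = Seq((1L, "hello world"), (2L, "spark pipe")).toDF("id", "text")

    // With RDD.pipe the whole row is flattened to one stdin line, and the
    // process's stdout must be re-parsed into columns by hand afterwards.
    val piped = df.rdd
      .map(row => s"${row.getLong(0)}\t${row.getString(1)}")
      .pipe(Seq("cat")) // stand-in for a real external tool
      .map { line =>
        val Array(id, text) = line.split("\t", 2)
        (id.toLong, text)
      }

    piped.toDF("id", "text").show()
    spark.stop()
  }
}
```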
> Hmm, I'd say not to think of it as aggregation; they are different things. Pipe is widely used on the Unix/Linux command line, and I don't think we should mix it with aggregation.
The thing is that the output can differ between batch and streaming, and that depends entirely on the external process. Any external process that does aggregation ("wc -l" still performs an aggregation in effect, whatever you call it) breaks the concept, and the result varies with how the input stream is split into micro-batches.
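
To make that concrete, here is a minimal batch-only sketch using the existing RDD.pipe: an aggregating process like "wc -l" is launched once per partition, so its output already depends on how the data is split; the same effect would show up across micro-batch boundaries in streaming:

```scala
import org.apache.spark.sql.SparkSession

object PipeAggregationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // "wc -l" runs once per partition, so the "total" is per split:
    // 4 partitions -> four counts of 25; 2 partitions -> two counts of 50.
    println(sc.parallelize(1 to 100, 4).pipe(Seq("wc", "-l")).collect().mkString(","))
    println(sc.parallelize(1 to 100, 2).pipe(Seq("wc", "-l")).collect().mkString(","))
    spark.stop()
  }
}
```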
I see that existing APIs can break the same thing (like mapPartitions/flatMap with a user function that doesn't account for this), so I'd be OK with it if nobody else minds. I still think restricting the relation to 1-to-1 / N-to-1 would be ideal, but that requires the external process to be implemented to Spark's requirements, which might not be possible, so...
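
For comparison, the same partition-dependence is reachable through an existing API; a hypothetical user function that counts elements per partition via mapPartitions produces different results under different splits, just like an aggregating piped process:

```scala
import org.apache.spark.sql.SparkSession

object MapPartitionsCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // A per-partition count is not a stable aggregate: its result depends
    // on how the data happens to be split, mirroring the pipe concern.
    val countPerPartition = (it: Iterator[Int]) => Iterator(it.size)
    println(sc.parallelize(1 to 100, 4).mapPartitions(countPerPartition).collect().mkString(","))
    println(sc.parallelize(1 to 100, 2).mapPartitions(countPerPartition).collect().mkString(","))
    spark.stop()
  }
}
```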