viirya commented on pull request #31296: URL: https://github.com/apache/spark/pull/31296#issuecomment-766510792
> Yes. This is definitely not enough. This is only intuitive if the type T is primitive like integer, long, String, etc. If you have type T as Java bean and override toString with IDE toString generator, the format is depending on the IDE. case class is depending on Scala, and I don't think the representation of toString is something Scala should guarantee compatibility. Makes sense?

That is what `printRDDElement` should do. For a complex type T, users can provide a custom function that produces whatever output their external process needs.

> Once you're adding the pipe to the one of DataFrame operations, the operation 'pipe' should be evaluated as a DataFrame operation. End users using pipe wouldn't use the trivial external process like "cat" or "wc -l" which completely ignore the structure of input, but I can't find any example beyond such thing.
> (I don't think something is reasonable "just because" previous RDD works like so.)
>
> That's why I want to hear the actual use case, what is the type of T Dataset, which task the external process does, what is the output of external process, should they need to break down the output to multiple columns after that.

I think these questions mainly focus on how the object T is handled. I said it is no different from how the RDD pipe works because both operate on an object T. The custom function provided by the user should take responsibility for producing the necessary output for the forked process. (Note that I have not added the parameter yet. Maybe I will do it tomorrow.)

> I see the possibility existing APIs can also break such thing (like mapPartitions/flatMap with user function which doesn't consider the fact) so I'd be OK if everyone doesn't mind. I still think restricting the relation to 1-to-1 / N-to-1 would be ideal, but that requires external process to be implemented as Spark's requirement which might not be possible, so...

"pipe" was not invented by Spark. I don't think we should provide a half-baked pipe function; that would be worse than nothing, not to mention the technical point you raised. IMHO, it brings more inconsistency between RDD pipe, Dataset pipe, and streaming Dataset pipe. I think what we can do is explicitly clarify that the effect of pipe on micro-batch streaming is only per micro-batch, not across the entire stream.
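For context on the `printRDDElement` point, here is a minimal sketch against the existing `RDD.pipe` API. The `Event` case class and the `cat` command are placeholder assumptions for illustration; the point is that the user-supplied function, not `toString`, decides exactly what each element looks like on the child process's stdin:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class standing in for a complex element type T.
case class Event(id: Long, tags: Seq[String])

object PipeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pipe-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val events = sc.parallelize(Seq(
      Event(1L, Seq("a", "b")),
      Event(2L, Seq("c"))))

    // Instead of relying on Event.toString (whose format is not a stable
    // contract), printRDDElement controls the serialization: here, one
    // tab-separated line per tag is written to the external process.
    val piped = events.pipe(
      command = Seq("cat"),  // placeholder external command
      printRDDElement = (e: Event, printFn: String => Unit) => {
        e.tags.foreach(tag => printFn(s"${e.id}\t$tag"))
      })

    piped.collect().foreach(println)
    spark.stop()
  }
}
```

Note that this only shows the RDD-level hook the comment refers to; how an equivalent parameter would look on the proposed Dataset pipe is exactly what is still under discussion in this thread.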
