viirya commented on pull request #31296: URL: https://github.com/apache/spark/pull/31296#issuecomment-766510792
> Yes. This is definitely not enough. This is only intuitive if the type T is primitive like integer, long, String, etc. If you have type T as Java bean and override toString with IDE toString generator, the format is depending on the IDE. case class is depending on Scala, and I don't think the representation of toString is something Scala should guarantee compatibility. Makes sense?

That is what `printRDDElement` should do. For a complex type T, users can provide a custom function that produces whatever output their external process needs.

> Once you're adding the pipe to the one of DataFrame operations, the operation 'pipe' should be evaluated as a DataFrame operation. End users using pipe wouldn't use the trivial external process like "cat" or "wc -l" which completely ignore the structure of input, but I can't find any example beyond such thing.
> (I don't think something is reasonable "just because" previous RDD works like so.)
>
> That's why I want to hear the actual use case, what is the type of T Dataset, which task the external process does, what is the output of external process, should they need to break down the output to multiple columns after that.

I think these questions mainly focus on how the object T is handled. I said it is no different from how the RDD pipe works because both operate on an object T. The custom function provided by the user should take responsibility for producing the necessary output for the forked process. (Note that I have not added the parameter yet. Maybe I will do it tomorrow.)

> I see the possibility existing APIs can also break such thing (like mapPartitions/flatMap with user function which doesn't consider the fact) so I'd be OK if everyone doesn't mind. I still think restricting the relation to 1-to-1 / N-to-1 would be ideal, but that requires external process to be implemented as Spark's requirement which might not be possible, so...

"pipe" was not invented by Spark. I don't think we should provide a half-baked pipe function; that would be worse than nothing, not to mention the technical point you raised. IMHO, it brings more inconsistency between RDD pipe, Dataset pipe, and streaming Dataset pipe. I think what we can do is explicitly clarify that the effect of pipe on micro-batch streaming is only per micro-batch, not across the entire stream.
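For context on the `printRDDElement` point, here is a minimal sketch against the existing `RDD.pipe` API. The `Event` case class and the `cat` command are placeholder assumptions for illustration; the point is that the user-supplied function, not `toString`, decides exactly what each element looks like on the child process's stdin:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class standing in for a complex element type T.
case class Event(id: Long, tags: Seq[String])

object PipeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pipe-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val events = sc.parallelize(Seq(
      Event(1L, Seq("a", "b")),
      Event(2L, Seq("c"))))

    // Instead of relying on Event.toString (whose format is not a stable
    // contract), printRDDElement controls the serialization: here, one
    // tab-separated line per tag is written to the external process.
    val piped = events.pipe(
      command = Seq("cat"),  // placeholder external command
      printRDDElement = (e: Event, printFn: String => Unit) => {
        e.tags.foreach(tag => printFn(s"${e.id}\t$tag"))
      })

    piped.collect().foreach(println)
    spark.stop()
  }
}
```

Note that this only shows the RDD-level hook the comment refers to; how an equivalent parameter would look on the proposed Dataset pipe is exactly what is still under discussion in this thread.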
