HeartSaVioR commented on pull request #31296:
URL: https://github.com/apache/spark/pull/31296#issuecomment-766477734
> The current RDD.pipe doc doesn't explicitly mention that we output the string representation of T. This is what it says: "All elements of each input partition are written to a process's stdin as lines of input separated by a newline." If you think that is not enough, we can improve the API documentation. About the parameter, do you mean printRDDElement?
Yes, this is definitely not enough. It is only intuitive when T is a primitive type such as Int, Long, or String. If T is a Java bean and you override toString with an IDE's toString generator, the format depends on the IDE; for a case class it depends on Scala, and I don't think the toString representation is something Scala should guarantee compatibility for. Makes sense?
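
To illustrate, here is a minimal sketch of how a caller can pin the wire format down explicitly with RDD.pipe's printRDDElement parameter instead of relying on toString; the Record case class and its field layout are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical element type; its default toString, e.g. "Record(1,foo)",
// is a Scala implementation detail, not a stable wire format.
case class Record(id: Long, name: String)

object PipeFormatSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val rdd = spark.sparkContext.parallelize(Seq(Record(1L, "foo"), Record(2L, "bar")))

    // printRDDElement lets the caller decide exactly what each element
    // looks like on the external process's stdin, instead of toString.
    val piped = rdd.pipe(
      command = Seq("cat"), // stand-in for a real external tool
      printRDDElement = (r: Record, printLine: String => Unit) =>
        printLine(s"${r.id}\t${r.name}")
    )
    piped.collect().foreach(println)
    spark.stop()
  }
}
```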
> I think some of the questions are beyond the scope of the pipe concept. For example, the one about piping only one column but retaining the other nine columns for the next operation. The user also cannot pipe only one field of object T and retain all the others afterwards when using RDD's pipe.
Once you add pipe as one of the DataFrame operations, 'pipe' should be evaluated as a DataFrame operation. End users of pipe wouldn't use a trivial external process like "cat" or "wc -l" which completely ignores the structure of the input, but I can't find any example beyond such things.
(I don't think something is reasonable "just because" the previous RDD API works like that.)
That's why I want to hear about an actual use case: what is the type T of the Dataset, which task does the external process perform, what is the output of the external process, and do users need to break that output down into multiple columns afterwards?
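
For context, here is a minimal sketch of the workaround such a use case needs today with RDD.pipe, assuming a hypothetical two-column input where only one column is meant for the external process; note the id column only survives because we encode it into the stdin line by hand:

```scala
import org.apache.spark.sql.SparkSession

object RddPipeWorkaroundSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical input: (id, text); only `text` is meant for the process.
    val df = Seq((1L, "hello world"), (2L, "spark pipe")).toDF("id", "text")

    // With RDD.pipe the whole row is flattened to one stdin line, and the
    // process's stdout must be re-parsed into columns by hand afterwards.
    val piped = df.rdd
      .map(row => s"${row.getLong(0)}\t${row.getString(1)}")
      .pipe(Seq("cat")) // stand-in for a real external tool
      .map { line =>
        val Array(id, text) = line.split("\t", 2)
        (id.toLong, text)
      }

    piped.toDF("id", "text").show()
    spark.stop()
  }
}
```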
> Hmm, I'd say not to think of it as aggregation; they are different things. Pipe is widely used on the Unix/Linux command line, and I don't think we should mix it with aggregation.
The thing is that the output can differ between batch and streaming, and that depends entirely on the external process. Any external process that does aggregation ("wc -l" still performs an aggregation in effect, whatever you call it) breaks the concept, and the result varies with how the input stream is split into micro-batches.
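
To make that concrete, here is a minimal batch-only sketch using the existing RDD.pipe: an aggregating process like "wc -l" is launched once per partition, so its output already depends on how the data is split; the same effect would show up across micro-batch boundaries in streaming:

```scala
import org.apache.spark.sql.SparkSession

object PipeAggregationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // "wc -l" runs once per partition, so the "total" is per split:
    // 4 partitions -> four counts of 25; 2 partitions -> two counts of 50.
    println(sc.parallelize(1 to 100, 4).pipe(Seq("wc", "-l")).collect().mkString(","))
    println(sc.parallelize(1 to 100, 2).pipe(Seq("wc", "-l")).collect().mkString(","))
    spark.stop()
  }
}
```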
I see that existing APIs can break the same thing (like mapPartitions/flatMap with a user function that doesn't account for this), so I'd be OK with it if nobody else minds. I still think restricting the relation to 1-to-1 / N-to-1 would be ideal, but that requires the external process to be implemented to Spark's requirements, which might not be possible, so...
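
For comparison, the same partition-dependence is reachable through an existing API; a hypothetical user function that counts elements per partition via mapPartitions produces different results under different splits, just like an aggregating piped process:

```scala
import org.apache.spark.sql.SparkSession

object MapPartitionsCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // A per-partition count is not a stable aggregate: its result depends
    // on how the data happens to be split, mirroring the pipe concern.
    val countPerPartition = (it: Iterator[Int]) => Iterator(it.size)
    println(sc.parallelize(1 to 100, 4).mapPartitions(countPerPartition).collect().mkString(","))
    println(sc.parallelize(1 to 100, 2).mapPartitions(countPerPartition).collect().mkString(","))
    spark.stop()
  }
}
```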