viirya commented on a change in pull request #31296:
URL: https://github.com/apache/spark/pull/31296#discussion_r564324572
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2897,14 +2897,22 @@ class Dataset[T] private[sql](
* each line of stdout resulting in one element of the output partition. A process is invoked
* even for empty partitions.
*
- * @param command command to run in forked process.
+ * Note that for micro-batch streaming Dataset, the effect of pipe is only per micro-batch, not
+ * cross entire stream.
*
+ * @param command command to run in forked process.
+ * @param printElement Use this function to customize how to pipe elements. This function
+ *                     will be called with each Dataset element as the 1st parameter, and the
+ *                     print line function (like out.println()) as the 2nd parameter.
* @group typedrel
* @since 3.2.0
*/
- def pipe(command: String): Dataset[String] = {
+ def pipe(command: String, printElement: (T, String => Unit) => Unit): Dataset[String] = {
Review comment:
I think it is okay. In most cases there should be no difference. The only
difference might be when we want to print multiple lines per object:
```scala
// Each printFunc call emits one line to the forked process's stdin,
// so a single element can produce multiple output lines.
def printElement(obj: T, printFunc: String => Unit): Unit = {
  printFunc(obj.a)
  printFunc(obj.b)
  ...
}
```
```scala
// Here the whole element is rendered into one multi-line string instead.
def serializeFn(obj: T): String = {
  s"${obj.a}\n${obj.b}\n..."
}
```
I'm fine with either one, as they achieve the same effect although they take
different forms.
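To make the equivalence concrete, a minimal sketch of an adapter that lifts a
`serializeFn` into the `printElement` shape, assuming the overload in the diff
above lands as shown; `fromSerializeFn` and `Record` are hypothetical names I
made up for illustration, not part of the PR:
```scala
// Hypothetical adapter: any serializeFn can be expressed as a printElement,
// so the two forms have the same expressive power.
def fromSerializeFn[T](serializeFn: T => String): (T, String => Unit) => Unit =
  (obj, printFunc) => printFunc(serializeFn(obj))

// Usage sketch with the proposed overload (command and fields illustrative):
//   ds.pipe("cat", fromSerializeFn((r: Record) => s"${r.a}\n${r.b}"))
```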