viirya commented on a change in pull request #31296:
URL: https://github.com/apache/spark/pull/31296#discussion_r564324572
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2897,14 +2897,22 @@ class Dataset[T] private[sql](
* each line of stdout resulting in one element of the output partition. A process is invoked
* even for empty partitions.
*
- * @param command command to run in forked process.
+ * Note that for micro-batch streaming Dataset, the effect of pipe is only per micro-batch, not
+ * cross entire stream.
*
+ * @param command command to run in forked process.
+ * @param printElement Use this function to customize how to pipe elements. This function
+ *                     will be called with each Dataset element as the 1st parameter, and the
+ *                     print line function (like out.println()) as the 2nd parameter.
* @group typedrel
* @since 3.2.0
*/
- def pipe(command: String): Dataset[String] = {
+ def pipe(command: String, printElement: (T, String => Unit) => Unit): Dataset[String] = {
Review comment:
I think it is okay. In most cases there should be no difference. The only
difference might be when we want to print multiple lines per object:
```scala
// Each printFunc call emits one line to the forked process's stdin,
// so a single element can produce multiple output lines.
def printElement(obj: T, printFunc: String => Unit): Unit = {
  printFunc(obj.a)
  printFunc(obj.b)
  ...
}
```
```scala
// Here the whole element is rendered into one multi-line string instead.
def serializeFn(obj: T): String = {
  s"${obj.a}\n${obj.b}\n..."
}
```
I'm fine with either one, as they achieve the same effect although they take
different forms.
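To make the equivalence concrete, a minimal sketch of an adapter that lifts a
`serializeFn` into the `printElement` shape, assuming the overload in the diff
above lands as shown; `fromSerializeFn` and `Record` are hypothetical names I
made up for illustration, not part of the PR:
```scala
// Hypothetical adapter: any serializeFn can be expressed as a printElement,
// so the two forms have the same expressive power.
def fromSerializeFn[T](serializeFn: T => String): (T, String => Unit) => Unit =
  (obj, printFunc) => printFunc(serializeFn(obj))

// Usage sketch with the proposed overload (command and fields illustrative):
//   ds.pipe("cat", fromSerializeFn((r: Record) => s"${r.a}\n${r.b}"))
```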