xuanyuanking commented on a change in pull request #31296:
URL: https://github.com/apache/spark/pull/31296#discussion_r564507427
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
##########
@@ -2007,6 +2007,54 @@ class DatasetSuite extends QueryTest
checkAnswer(withUDF, Row(Row(1), null, null) :: Row(Row(1), null, null) ::
Nil)
}
+
+ test("SPARK-34205: Pipe Dataset") {
+ assume(TestUtils.testCommandAvailable("cat"))
+
+ val nums = spark.range(4)
+ val piped = nums.pipe("cat", (l, printFunc) => printFunc(l.toString)).toDF
Review comment:
```
What's the top-level API, you mean Plan node like CollectSet or other thing?
```
@AngersZhuuuu The top-level API here means the new API added in Dataset.
```
Can you share how to make transformation as an expression? I don't think it
is an expression at all.
```
@viirya Sure. I followed the comment "`I have thought this problem too,
first I want to add transform as a DSL function, in this way, we need to make
an equivalent ScriptTransformation expression first. We can think that this is
just a new expression, or a new function`" from @AngersZhuuuu. The idea is to
add a new expression `ScriptTransformationExpression` for
`ScriptTransformation`, which would then be planned into
`ScriptTransformationExec`.
Two limitations here might need more discussion:
- A script transformation can produce more than one output row for a single
input row, so it cannot be used together with other expressions.
- Hive's script transformation is partition-based, but making it an
expression turns it row-based.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]