HeartSaVioR commented on a change in pull request #31296:
URL: https://github.com/apache/spark/pull/31296#discussion_r564447933
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
##########
@@ -2007,6 +2007,54 @@ class DatasetSuite extends QueryTest
checkAnswer(withUDF, Row(Row(1), null, null) :: Row(Row(1), null, null) ::
Nil)
}
+
+ test("SPARK-34205: Pipe Dataset") {
+ assume(TestUtils.testCommandAvailable("cat"))
+
+ val nums = spark.range(4)
+ val piped = nums.pipe("cat", (l, printFunc) => printFunc(l.toString)).toDF
Review comment:
I see what @viirya said. I'd agree that `transform` looks to behave as an
operation (not sure whether that is intended, but it looks that way at least
for now), and `transform` would also require a top-level API to cover it,
like we did for `mapPartitions`.
If we are OK with adding the top-level API (again, not yet decided, so just
my two cents), then which one? I'd rather say `transform` is what we'd like
to be consistent with, instead of `pipe`. It has been exposed as a SQL
statement (`TRANSFORM`), it is probably widely used by Spark SQL users, and
even Hive users. If we want feature parity, then my vote goes to `transform`.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]