[GitHub] [spark] ueshin commented on a diff in pull request #42595: [SPARK-44901][SQL] Add API in Python UDTF 'analyze' method to return partitioning/ordering expressions

via GitHub Mon, 28 Aug 2023 15:51:09 -0700


ueshin commented on code in PR #42595:
URL: https://github.com/apache/spark/pull/42595#discussion_r1308015934



##########
python/pyspark/sql/udtf.py:
##########
@@ -70,9 +90,25 @@ class AnalyzeResult:
     ----------
     schema : :class:`StructType`
         The schema that the Python UDTF will return.
+    with_single_partition : bool
+        If true, the UDTF is specifying for Catalyst to repartition all rows of the input TABLE
+        argument to one collection for consumption by exactly one instance of the corresponding
+        UDTF class.
+    partition_by : Sequence[PartitioningColumn]
+        If non-empty, this is a sequence of columns that the UDTF is specifying for Catalyst to
+        partition the input TABLE argument by. In this case, calls to the UDTF may not include any
+        explicit PARTITION BY clause, in which case Catalyst will return an error. This option is
+        mutually exclusive with 'with_single_partition'.
+    order_by: Sequence[OrderingColumn]
+        If non-empty, this is a sequence of columns that the UDTF is specifying for Catalyst to
+        sort the input TABLE argument by. Note that the 'partition_by' list must also be non-empty
+        in this case.
     """
 
     schema: StructType
+    with_single_partition: bool = False
+    partition_by: Sequence[PartitioningColumn] = ()
+    order_by: Sequence[OrderingColumn] = ()

Review Comment:
   `()` shouldn't be used as a default value of the fields. 
   
   ```py
   from dataclasses import field
   ```
   
   then
   
   ```py
   partition_by: Sequence[PartitioningColumn] = field(default_factory=tuple)  # tuple or list
   order_by: Sequence[OrderingColumn] = field(default_factory=tuple)
   ```
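
For reference, the suggestion above can be tried in isolation. The following is only a minimal sketch: `PartitioningColumn` and `OrderingColumn` are simplified stand-ins (their fields here are assumptions, not the PR's actual definitions), and `schema` is typed as a plain string instead of `StructType` so the snippet runs without Spark.

```python
from dataclasses import dataclass, field
from typing import Sequence


# Hypothetical stand-ins for the classes added in the PR; the real
# definitions live in python/pyspark/sql/udtf.py and may differ.
@dataclass(frozen=True)
class PartitioningColumn:
    name: str


@dataclass(frozen=True)
class OrderingColumn:
    name: str
    ascending: bool = True


@dataclass
class AnalyzeResult:
    # Placeholder type: the real field is a StructType.
    schema: str
    with_single_partition: bool = False
    # default_factory builds the default when an instance is created,
    # instead of storing one shared literal on the class.
    partition_by: Sequence[PartitioningColumn] = field(default_factory=tuple)
    order_by: Sequence[OrderingColumn] = field(default_factory=tuple)


# Omitted fields fall back to an empty tuple.
default_result = AnalyzeResult(schema="a: int")

# Callers can still pass any sequence explicitly.
partitioned_result = AnalyzeResult(
    schema="a: int",
    partition_by=[PartitioningColumn("key")],
    order_by=[OrderingColumn("ts", ascending=False)],
)
```

If a list is preferred over a tuple, `field(default_factory=list)` is the only option, since `dataclasses` raises a `ValueError` for a literal `[]` default.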



##########
sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala:
##########
@@ -643,11 +643,17 @@ class SQLQueryTestSuite extends QueryTest with SharedSparkSession with SQLHelper
             s"$testCaseName - ${udf.prettyName}", absPath, resultFile, udf)
         }
       } else if (file.getAbsolutePath.startsWith(s"$inputFilePath${File.separator}udtf")) {
-        Seq(TestPythonUDTF("udtf")).map { udtf =>
-          UDTFTestCase(
-            s"$testCaseName - ${udtf.prettyName}", absPath, resultFile, udtf
-          )
-        }
+        val udtfs = Seq(
+          TestPythonUDTF("udtf"),
+          TestPythonUDTFCountSumLast,
+          TestPythonUDTFWithSinglePartition,
+          TestPythonUDTFPartitionBy,
+          TestPythonUDTFInvalidPartitionByAndWithSinglePartition,
+          TestPythonUDTFInvalidOrderByWithoutPartitionBy
+        )
+        Seq(UDTFTestCase(
+          s"$testCaseName - Python UDTFs", absPath, resultFile, udtfs
+        ))

Review Comment:
   `UDTFTestCase` is supposed to be for one test class I guess?
   
   ```scala
   Seq(TestPythonUDTF("udtf"), ...).map { udtf =>
     UDTFTestCase(
       s"$testCaseName - ${udtf.prettyName}", absPath, resultFile, udtf
     )
   }
   ```


