allisonwang-db commented on code in PR #43204:
URL: https://github.com/apache/spark/pull/43204#discussion_r1353474035


##########
sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala:
##########
@@ -524,27 +524,42 @@ object IntegratedUDFTestUtils extends SQLHelper {
     val name: String = "UDTFWithSinglePartition"
     val pythonScript: String =
       s"""
+        |import json
+        |from dataclasses import dataclass
        |from pyspark.sql.functions import AnalyzeResult, OrderingColumn, PartitioningColumn
         |from pyspark.sql.types import IntegerType, Row, StructType
+        |
+        |@dataclass
+        |class AnalyzeResultWithBuffer(AnalyzeResult):
+        |    buffer: str = ""
+        |
         |class $name:
         |    def __init__(self):
         |        self._count = 0
+        |        # self._count = json.loads(buffer)["initial_count"]

Review Comment:
   Shall we remove this?



##########
python/pyspark/worker.py:
##########
@@ -693,6 +699,21 @@ def read_udtf(pickleSer, infile, eval_type):
            f"The return type of a UDTF must be a struct type, but got {type(return_type)}."
         )
 
+    # Update the handler that creates a new UDTF instance to first try calling the UDTF
+    # constructor with one argument containing the previous AnalyzeResult. If that fails,
+    # then try a constructor with no arguments. In this way each UDTF class instance can
+    # decide if it wants to inspect the AnalyzeResult.
+    if has_pickled_analyze_result:
+        prev_handler = handler
+
+        def construct_udtf():
+            try:
+                return prev_handler(dataclasses.replace(pickled_analyze_result))
+            except TypeError:
+                return prev_handler()

Review Comment:
   This means the UDTF handler does not accept an analyzeResult object in its `__init__` method? The try...except block will be invoked every time we call `construct_udtf`, which can be expensive. I wonder if we can only call this once.
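   One way to call it only once (a hedged sketch, not the PR's actual code; `make_constructor`, `StatefulUDTF`, and `StatelessUDTF` are hypothetical stand-ins, and the `AnalyzeResult` here is a local substitute for `pyspark.sql.udtf.AnalyzeResult`): inspect the handler's constructor signature once at setup time and choose the zero-arg or one-arg path up front, so the repeated construction path never pays for a raised `TypeError`:

   ```python
   import dataclasses
   import inspect

   @dataclasses.dataclass
   class AnalyzeResult:          # local stand-in for pyspark.sql.udtf.AnalyzeResult
       buffer: str = ""

   def make_constructor(handler, pickled_analyze_result):
       # Decide once whether the UDTF's __init__ accepts the AnalyzeResult.
       # For a class, inspect.signature() reflects __init__ without 'self'.
       takes_result = len(inspect.signature(handler).parameters) >= 1
       if takes_result:
           # Pass a fresh copy each time so one instance cannot mutate
           # the state seen by the next instance.
           return lambda: handler(dataclasses.replace(pickled_analyze_result))
       return lambda: handler()

   class StatefulUDTF:           # hypothetical handler that wants the result
       def __init__(self, analyze_result):
           self.buffer = analyze_result.buffer

   class StatelessUDTF:          # hypothetical handler that does not
       def __init__(self):
           self.buffer = None

   construct = make_constructor(StatefulUDTF, AnalyzeResult(buffer="42"))
   assert construct().buffer == "42"

   construct = make_constructor(StatelessUDTF, AnalyzeResult(buffer="42"))
   assert construct().buffer is None
   ```

   Signature inspection is only a heuristic (it mis-handles `*args`-style constructors), so caching the outcome of the first try/except instead would be another option.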



##########
python/pyspark/sql/udtf.py:
##########
@@ -85,7 +85,7 @@ class OrderingColumn:
     overrideNullsFirst: Optional[bool] = None
 
 
-@dataclass(frozen=True)
+@dataclass

Review Comment:
   It would be really good to add this as a comment :) 
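   For context, one likely motivation for dropping `frozen=True` (a hedged guess worth confirming in that comment): Python forbids a non-frozen dataclass from inheriting a frozen one, so user-defined subclasses such as the `AnalyzeResultWithBuffer` in the test above could not be declared with a plain `@dataclass`. A minimal reproduction, with `FrozenBase` standing in for a frozen `AnalyzeResult`:

   ```python
   from dataclasses import dataclass

   @dataclass(frozen=True)
   class FrozenBase:             # stand-in for AnalyzeResult with frozen=True
       partition_by: str = ""

   def define_non_frozen_subclass():
       # CPython raises at class-definition time:
       # "cannot inherit non-frozen dataclass from a frozen one"
       @dataclass
       class WithBuffer(FrozenBase):
           buffer: str = ""
       return WithBuffer

   try:
       define_non_frozen_subclass()
       inherits_ok = True
   except TypeError as e:
       inherits_ok = False
       print(e)

   assert not inherits_ok
   ```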



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

