dtenedor commented on code in PR #45375:
URL: https://github.com/apache/spark/pull/45375#discussion_r1511980443
##########
python/docs/source/user_guide/sql/python_udtf.rst:
##########
@@ -63,6 +63,7 @@ To implement a Python UDTF, you first need to define a class implementing the me
"""
...
+ @staticmethod
def analyze(self, *args: Any) -> AnalyzeResult:
Review Comment:
You're right, good catch. Updated this.
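To make the fix above concrete: once `analyze` is a `@staticmethod`, the `self` parameter is dropped and the method sees only the call's arguments. This is a minimal sketch; `AnalyzeResult` is replaced by a hypothetical stand-in dataclass (the real class comes from pyspark) so it runs without a Spark installation, and `WordSplitter` is an invented example name:

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical stand-in for pyspark's AnalyzeResult so this sketch runs
# without Spark; the real class carries a StructType, not a DDL string.
@dataclass
class AnalyzeResult:
    schema: str

class WordSplitter:
    @staticmethod
    def analyze(*args: Any) -> AnalyzeResult:
        # Static method: no `self` parameter, only the UDTF call's arguments.
        return AnalyzeResult(schema="word: string")

    def eval(self, text: str):
        # Emit one output row per whitespace-separated word.
        for word in text.split():
            yield (word,)

print(WordSplitter.analyze().schema)     # -> word: string
print(list(WordSplitter().eval("a b")))  # -> [('a',), ('b',)]
```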
##########
python/docs/source/user_guide/sql/python_udtf.rst:
##########
@@ -285,10 +327,39 @@ To implement a Python UDTF, you first need to define a class implementing the me
"""
...
+Emitting output rows
+--------------------
+
+The return type of the UDTF defines the schema of the table it outputs. It must be either a
+``StructType``, for example ``StructType().add("c1", StringType())``, or a DDL string representing a
+struct type, for example ``c1: string``. The `eval` and `terminate` methods then emit zero or more
+output rows conforming to this schema by yielding tuples, lists, or pyspark.sql.Row objects. For
+example:
+
+```
+def eval(self, x, y, z):
+ # Here we return a row by providing a tuple of three elements.
Review Comment:
Sure, this is done.
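A filled-in version of the `eval` sketch in that hunk might look as follows. This is a hedged illustration, not the PR's final text: the class name is invented, its assumed declared output schema is `c1: int, c2: int, c3: int`, and the `pyspark.sql.Row` variant is left as a comment so the snippet runs without pyspark installed:

```python
class ThreeColumnUDTF:
    """Hypothetical UDTF whose declared output schema is 'c1: int, c2: int, c3: int'."""

    def eval(self, x, y, z):
        # A row may be emitted as a tuple of three elements...
        yield (x, y, z)
        # ...or as a list with the same number of elements.
        yield [x * 10, y * 10, z * 10]
        # A pyspark.sql.Row works too, e.g. `yield Row(c1=x, c2=y, c3=z)`;
        # omitted here so the sketch has no pyspark dependency.

rows = list(ThreeColumnUDTF().eval(1, 2, 3))
print(rows)  # -> [(1, 2, 3), [10, 20, 30]]
```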
##########
python/docs/source/user_guide/sql/python_udtf.rst:
##########
@@ -163,6 +185,28 @@ To implement a Python UDTF, you first need to define a class implementing the me
... num_articles=len((
... word for word in words
... if word == 'a' or word == 'an' or word == 'the')))
+
+    An `analyze` implementation that returns a constant output schema, and also requests
+    to select a subset of columns from the input table and for the input table to be partitioned
+    across several UDTF calls based on the values of the `date` column:
+
+ >>> @staticmethod
+ ... def analyze(*args) -> AnalyzeResult:
Review Comment:
I added some more explanation here.
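For readers following along, a complete `analyze` of that shape could look like the sketch below. The `AnalyzeResult` fields and helper classes are modeled by hypothetical stand-in dataclasses so this runs without Spark (in real code they are imported from pyspark), and `CountPerDate` is an invented name:

```python
from dataclasses import dataclass, field
from typing import Any, List

# Hypothetical stand-ins for pyspark's AnalyzeResult, PartitioningColumn,
# and SelectedColumn, kept minimal so the sketch is self-contained.
@dataclass
class PartitioningColumn:
    name: str

@dataclass
class SelectedColumn:
    name: str

@dataclass
class AnalyzeResult:
    schema: str
    partitionBy: List[PartitioningColumn] = field(default_factory=list)
    select: List[SelectedColumn] = field(default_factory=list)

class CountPerDate:
    @staticmethod
    def analyze(*args: Any) -> AnalyzeResult:
        # Constant output schema, plus two requests to the engine:
        # - project only the 'date' column from the input table, and
        # - partition the input rows by 'date', so each UDTF call
        #   processes the rows for exactly one date value.
        return AnalyzeResult(
            schema="date: string, total: int",
            partitionBy=[PartitioningColumn("date")],
            select=[SelectedColumn("date")],
        )

result = CountPerDate.analyze()
print(result.schema)                         # -> date: string, total: int
print([c.name for c in result.partitionBy])  # -> ['date']
```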
##########
python/docs/source/user_guide/sql/python_udtf.rst:
##########
@@ -75,31 +76,52 @@ To implement a Python UDTF, you first need to define a class implementing the me
        This method accepts zero or more parameters mapping 1:1 with the arguments provided to
        the particular UDTF call under consideration. Each parameter is an instance of the
-       `AnalyzeArgument` class, which contains fields including the provided argument's data
-       type and value (in the case of literal scalar arguments only). For table arguments, the
-       `isTable` field is set to true and the `dataType` field is a StructType representing
-       the table's column types:
-
- dataType: DataType
- value: Optional[Any]
- isTable: bool
+ `AnalyzeArgument` class.
+
+    `AnalyzeArgument` fields
+    ------------------------
+    dataType: DataType
+        Indicates the type of the provided input argument to this particular UDTF call.
+        For input table arguments, this is a StructType representing the table's columns.
+    value: Optional[Any]
+        The value of the provided input argument to this particular UDTF call. This is
+        `None` for table arguments, or for scalar arguments that are not constant expressions.
+    isTable: bool
+        This is true if the provided input argument to this particular UDTF call is a
+        table argument.
+    isConstantExpression: bool
+        This is true if the provided input argument to this particular UDTF call is a
+        constant scalar expression.
Review Comment:
Yes. Updated this to explicitly say "either a literal or other constant-foldable scalar expression."
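To illustrate how code might branch on the fields documented above, here is a hedged sketch. `AnalyzeArgument` is replaced by a hypothetical stand-in dataclass (real instances are constructed by Spark and passed into `analyze`), and `describe` is an invented helper:

```python
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical stand-in for pyspark's AnalyzeArgument; the real dataType
# field is a pyspark DataType, shown here as a DDL-style string.
@dataclass
class AnalyzeArgument:
    dataType: str
    value: Optional[Any] = None        # None for tables or non-constant scalars
    isTable: bool = False
    isConstantExpression: bool = False

def describe(arg: AnalyzeArgument) -> str:
    if arg.isTable:
        # For table arguments, dataType describes the table's columns.
        return f"table with columns {arg.dataType}"
    if arg.isConstantExpression:
        # Either a literal or another constant-foldable scalar expression.
        return f"constant {arg.dataType} = {arg.value!r}"
    return f"non-constant {arg.dataType}"

print(describe(AnalyzeArgument("int", value=42, isConstantExpression=True)))
# -> constant int = 42
print(describe(AnalyzeArgument("c1: string", isTable=True)))
# -> table with columns c1: string
```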
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
For additional commands, e-mail: [email protected]