zhengruifeng commented on code in PR #52440: URL: https://github.com/apache/spark/pull/52440#discussion_r2377853044
########## python/pyspark/sql/pandas/group_ops.py: ##########

```diff
@@ -703,27 +704,30 @@ def applyInArrow(
         Maps each group of the current :class:`DataFrame` using an Arrow udf and returns
         the result as a `DataFrame`.

-        The function should take a `pyarrow.Table` and return another
-        `pyarrow.Table`. Alternatively, the user can pass a function that takes
-        a tuple of `pyarrow.Scalar` grouping key(s) and a `pyarrow.Table`.
-        For each group, all columns are passed together as a `pyarrow.Table`
-        to the user-function and the returned `pyarrow.Table` are combined as a
-        :class:`DataFrame`.
+        The function can take one of two forms: It can take a `pyarrow.Table` and return a
+        `pyarrow.Table`, or it can take an iterator of `pyarrow.RecordBatch` and yield
+        `pyarrow.RecordBatch`. Alternatively each form can take a tuple of `pyarrow.Scalar`
+        as the first argument in addition to the input type above. For each group, all columns
+        are passed together in the `pyarrow.Table` or `pyarrow.RecordBatch`, and the returned
+        `pyarrow.Table` or iterator of `pyarrow.RecordBatch` are combined as a :class:`DataFrame`.

         The `schema` should be a :class:`StructType` describing the schema of the returned
-        `pyarrow.Table`. The column labels of the returned `pyarrow.Table` must either match
-        the field names in the defined schema if specified as strings, or match the
-        field data types by position if not strings, e.g. integer indices.
-        The length of the returned `pyarrow.Table` can be arbitrary.
+        `pyarrow.Table` or `pyarrow.RecordBatch`. The column labels of the returned `pyarrow.Table`
+        or `pyarrow.RecordBatch` must either match the field names in the defined schema if
+        specified as strings, or match the field data types by position if not strings, e.g.
+        integer indices. The length of the returned `pyarrow.Table` or iterator of
+        `pyarrow.RecordBatch` can be arbitrary.

         .. versionadded:: 4.0.0

         Parameters
         ----------
         func : function
-            a Python native function that takes a `pyarrow.Table` and outputs a
-            `pyarrow.Table`, or that takes one tuple (grouping keys) and a
-            `pyarrow.Table` and outputs a `pyarrow.Table`.
+            a Python native function that either takes a `pyarrow.Table` and outputs a
+            `pyarrow.Table` or takes an iterator of `pyarrow.RecordBatch` and yields
+            `pyarrow.RecordBatch`. Additionally, each form can take a tuple of grouping keys
+            as the first argument, with the `pyarrow.Table` or iterator of `pyarrow.RecordBatch`
+            as the second argument.
```

Review Comment:
let's add
```
.. versionchanged:: 4.1.0
    Supports iterator API
...
```

Review Comment:
let's also add two simple examples (w/o `key`) in the `Examples` section

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org