zhengruifeng commented on code in PR #52440: URL: https://github.com/apache/spark/pull/52440#discussion_r2377853044
########## python/pyspark/sql/pandas/group_ops.py: ##########

```diff
@@ -703,27 +704,30 @@ def applyInArrow(
         Maps each group of the current :class:`DataFrame` using an Arrow udf and returns
         the result as a `DataFrame`.

-        The function should take a `pyarrow.Table` and return another
-        `pyarrow.Table`. Alternatively, the user can pass a function that takes
-        a tuple of `pyarrow.Scalar` grouping key(s) and a `pyarrow.Table`.
-        For each group, all columns are passed together as a `pyarrow.Table`
-        to the user-function and the returned `pyarrow.Table` are combined as a
-        :class:`DataFrame`.
+        The function can take one of two forms: It can take a `pyarrow.Table` and return a
+        `pyarrow.Table`, or it can take an iterator of `pyarrow.RecordBatch` and yield
+        `pyarrow.RecordBatch`. Alternatively each form can take a tuple of `pyarrow.Scalar`
+        as the first argument in addition to the input type above. For each group, all columns
+        are passed together in the `pyarrow.Table` or `pyarrow.RecordBatch`, and the returned
+        `pyarrow.Table` or iterator of `pyarrow.RecordBatch` are combined as a :class:`DataFrame`.

         The `schema` should be a :class:`StructType` describing the schema of the returned
-        `pyarrow.Table`. The column labels of the returned `pyarrow.Table` must either match
-        the field names in the defined schema if specified as strings, or match the
-        field data types by position if not strings, e.g. integer indices.
-        The length of the returned `pyarrow.Table` can be arbitrary.
+        `pyarrow.Table` or `pyarrow.RecordBatch`. The column labels of the returned `pyarrow.Table`
+        or `pyarrow.RecordBatch` must either match the field names in the defined schema if
+        specified as strings, or match the field data types by position if not strings, e.g.
+        integer indices. The length of the returned `pyarrow.Table` or iterator of
+        `pyarrow.RecordBatch` can be arbitrary.

         .. versionadded:: 4.0.0

         Parameters
         ----------
         func : function
-            a Python native function that takes a `pyarrow.Table` and outputs a
-            `pyarrow.Table`, or that takes one tuple (grouping keys) and a
-            `pyarrow.Table` and outputs a `pyarrow.Table`.
+            a Python native function that either takes a `pyarrow.Table` and outputs a
+            `pyarrow.Table` or takes an iterator of `pyarrow.RecordBatch` and yields
+            `pyarrow.RecordBatch`. Additionally, each form can take a tuple of grouping keys
+            as the first argument, with the `pyarrow.Table` or iterator of `pyarrow.RecordBatch`
+            as the second argument.
```

Review Comment:
let's add
```
.. versionchanged:: 4.1.0
    Supports iterator API
...
```

Review Comment:
let's also add two simple examples (w/o `key`) in the `Examples` section

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org