HyukjinKwon commented on a change in pull request #26110: [SPARK-29126][PYSPARK][DOC] Pandas Cogroup udf usage guide
URL: https://github.com/apache/spark/pull/26110#discussion_r337981832
##########
File path: docs/sql-pyspark-pandas-with-arrow.md
##########
@@ -178,6 +178,41 @@ For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/p
 [`pyspark.sql.DataFrame.mapsInPandas`](api/python/pyspark.sql.html#pyspark.sql.DataFrame.mapInPandas).

+### Cogrouped Map
+
+CoGrouped map Pandas UDFs allow two DataFrames to be cogrouped a by a common key and then a python function applied to
+each cogroup. They are used with `groupBy().cogroup().apply()` which consists of the following steps:
+
+* Shuffle the data such that the groups of each dataframe which share a key are cogrouped together.
+* Apply a function to each cogroup. The input of of the function is two `pandas.DataFrame` (with an optional Tuple
+representing the key). The output of the function is a `pandas.DataFrame`.
+* Combine the results into a new `DataFrame`.
+
+To use `groupBy().cogroup().apply()`, the user needs to define the following:
+* A Python function that defines the computation for each cogroup.
+* A `StructType` object or a string that defines the schema of the output `DataFrame`.
+
+The column labels of the returned `pandas.DataFrame` must either match the field names in the
+defined output schema if specified as strings, or match the field data types by position if not
+strings, e.g. integer indices. See [pandas.DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html#pandas.DataFrame)
+on how to label columns when constructing a `pandas.DataFrame`.
+
+Note that all data for a cogroup will be loaded into memory before the function is applied. This can lead to out of
+memory exceptions, especially if the group sizes are skewed. The configuration for[maxRecordsPerBatch](#setting-arrow-batch-size)

Review comment:
typo: `for[maxRecordsPerBatch]` -> `for [maxRecordsPerBatch]`
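For readers following this thread without the full patch, the three steps listed in the quoted section (cogroup by key, apply a function per cogroup, combine the results) can be sketched in plain Python. This is only an illustration of the semantics and not the PySpark API: `cogroup_apply`, `count_sides`, and the sample rows are hypothetical names invented here, rows are modeled as dicts rather than `pandas.DataFrame` objects, and the handling of a key that appears on only one side (the other side receives an empty group) is an assumption.

```python
from collections import defaultdict


def cogroup_apply(left_rows, right_rows, key, func):
    """Plain-Python sketch of cogroup-then-apply; NOT the Spark API."""
    # Step 1: bucket the rows of each input by the shared key
    # (this stands in for the shuffle).
    left_groups = defaultdict(list)
    right_groups = defaultdict(list)
    for row in left_rows:
        left_groups[row[key]].append(row)
    for row in right_rows:
        right_groups[row[key]].append(row)

    # Steps 2 and 3: call func once per key with both groups,
    # then concatenate whatever rows each call returns.
    # Assumption: a key missing on one side yields an empty group there.
    result = []
    for k in sorted(set(left_groups) | set(right_groups)):
        result.extend(func(k, left_groups.get(k, []), right_groups.get(k, [])))
    return result


def count_sides(k, left, right):
    # Hypothetical per-cogroup function: one output row per key.
    return [{"id": k, "n_left": len(left), "n_right": len(right)}]


left = [{"id": 1, "v": 1.0}, {"id": 1, "v": 2.0}, {"id": 2, "v": 3.0}]
right = [{"id": 1, "w": "a"}, {"id": 3, "w": "b"}]
out = cogroup_apply(left, right, "id", count_sides)
print(out)
```

Each call to `func` receives the complete group from both sides at once, which is the reason the quoted note warns about memory pressure when group sizes are skewed.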
