[GitHub] [spark] HyukjinKwon commented on a change in pull request #26110: [SPARK-29126][PYSPARK][DOC] Pandas Cogroup udf usage guide

GitBox Wed, 23 Oct 2019 04:02:47 -0700

HyukjinKwon commented on a change in pull request #26110: [SPARK-29126][PYSPARK][DOC] Pandas Cogroup udf usage guide
URL: https://github.com/apache/spark/pull/26110#discussion_r337981832


 ##########
 File path: docs/sql-pyspark-pandas-with-arrow.md
 ##########
 @@ -178,6 +178,41 @@ For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/p
 
[`pyspark.sql.DataFrame.mapInPandas`](api/python/pyspark.sql.html#pyspark.sql.DataFrame.mapInPandas).
 
 
+### Cogrouped Map
+
+Cogrouped map Pandas UDFs allow two DataFrames to be cogrouped by a common key and then a Python function to be applied to
+each cogroup. They are used with `groupBy().cogroup().apply()`, which consists of the following steps:
+
+* Shuffle the data such that the groups of each DataFrame which share a key are cogrouped together.
+* Apply a function to each cogroup. The input of the function is two `pandas.DataFrame` (with an optional tuple
+representing the key). The output of the function is a `pandas.DataFrame`.
+* Combine the results into a new `DataFrame`.
+
+To use `groupBy().cogroup().apply()`, the user needs to define the following:
+* A Python function that defines the computation for each cogroup.
+* A `StructType` object or a string that defines the schema of the output `DataFrame`.
+
+The column labels of the returned `pandas.DataFrame` must either match the field names in the
+defined output schema if specified as strings, or match the field data types by position if not
+strings, e.g. integer indices. See [pandas.DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html#pandas.DataFrame)
+on how to label columns when constructing a `pandas.DataFrame`.
+
+Note that all data for a cogroup will be loaded into memory before the function is applied. This can lead to out of
+memory exceptions, especially if the group sizes are skewed. The configuration for[maxRecordsPerBatch](#setting-arrow-batch-size)
 
 Review comment:
   typo: `for[maxRecordsPerBatch]` -> `for [maxRecordsPerBatch]`
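
   For readers following the hunk above, the three steps it lists (cogroup by key, apply a function to each pair of `pandas.DataFrame`, combine the results) can be emulated locally with plain pandas. This is only a rough sketch of the semantics, not Spark's implementation and not the PySpark API: the `cogroup_apply` helper, the `summarize` function, and the sample data are hypothetical, and it assumes that a key present on only one side is paired with an empty frame on the other side.

```python
import pandas as pd


def cogroup_apply(left, right, key, func):
    """Emulate cogroup-then-apply locally (hypothetical helper, not a Spark API)."""
    # Step 1: bring together the groups of each frame that share a key.
    left_groups = {k: g for k, g in left.groupby(key)}
    right_groups = {k: g for k, g in right.groupby(key)}
    all_keys = sorted(set(left_groups) | set(right_groups))

    results = []
    for k in all_keys:
        # Assumption: a key missing on one side gets an empty frame
        # that still carries that side's columns.
        l = left_groups.get(k, left.iloc[0:0])
        r = right_groups.get(k, right.iloc[0:0])
        # Step 2: apply the user function to the pair of pandas.DataFrames.
        results.append(func(l, r))

    # Step 3: combine the per-cogroup results into one DataFrame.
    return pd.concat(results, ignore_index=True)


def summarize(l, r):
    """Return one row per cogroup; column labels match the intended output schema by name."""
    key_value = l["id"].iloc[0] if len(l) else r["id"].iloc[0]
    return pd.DataFrame({
        "id": [key_value],
        "v1_sum": [l["v1"].sum()],
        "right_rows": [len(r)],
    })


df1 = pd.DataFrame({"id": [1, 1, 2], "v1": [1.0, 2.0, 3.0]})
df2 = pd.DataFrame({"id": [1, 3], "v2": ["a", "b"]})

out = cogroup_apply(df1, df2, "id", summarize)
print(out)
```

   Note that each call to `summarize` receives the whole cogroup at once, which mirrors the memory caveat in the quoted text: a heavily skewed key means one very large pair of frames held in memory.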

