HyukjinKwon commented on a change in pull request #24997: [SPARK-28198][PYTHON] Add mapPartitionsInPandas to allow an iterator of DataFrames
URL: https://github.com/apache/spark/pull/24997#discussion_r299030810
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2192,6 +2193,51 @@ def toPandas(self):
                         _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
             return pdf
 
 +    def mapPartitionsInPandas(self, udf):
 +        """
 +        Maps each partition of the current :class:`DataFrame` using a pandas udf and returns
 +        the result as a :class:`DataFrame`.
 +
 +        The user-defined function should take an iterator of `pandas.DataFrame`s and return
 +        another iterator of `pandas.DataFrame`s. For each partition, all columns are passed
 +        together as an iterator of `pandas.DataFrame`s to the user-defined function, and the
 +        returned iterator of `pandas.DataFrame`s is combined into a :class:`DataFrame`.
 +        The size of each `pandas.DataFrame` can be controlled by
 +        `spark.sql.execution.arrow.maxRecordsPerBatch`.
 +        The schema of the returned `pandas.DataFrame`s must match the returnType of the
 +        pandas udf.
 +
 +        :param udf: A function object returned by :meth:`pyspark.sql.functions.pandas_udf`
 +
 +        >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
 +        >>> df = spark.createDataFrame([(1, 21), (2, 30)],
 +        ...                            ("id", "age"))  # doctest: +SKIP
 +        >>> @pandas_udf(df.schema, PandasUDFType.SCALAR_ITER)  # doctest: +SKIP
 
 Review comment:
   Yea .. I will do that in a separate PR. I need to make another PR to document (and/or rename) anyway. But out of curiosity, do you guys have some preferences? @mengxr, @icexelloss, @BryanCutler, @viirya? We can:
   
   1. Rename `SCALAR_ITER` -> `ITER` (or something else?)
   2. Introduce `DATAFRAME_ITER` here.
   
   My impression is that the UDF type defines the execution type, and the existing `SCALAR_ITER` and the current one share virtually the same execution path (like we share `GROUPED_AGG` for both Grouped Agg and Window functions).
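  For reference, the iterator-of-`pandas.DataFrame`s contract the docstring describes can be sketched with plain pandas, no Spark session needed. `filter_adults` is a hypothetical user function standing in for the pandas udf body; the two small DataFrames simulate the Arrow batches Spark would feed to one partition:

  ```python
  import pandas as pd

  def filter_adults(batches):
      # Matches the proposed contract: takes an iterator of
      # pandas.DataFrames and yields pandas.DataFrames.
      for pdf in batches:
          yield pdf[pdf.age >= 21]

  # Simulate one partition arriving as two Arrow record batches.
  batches = iter([
      pd.DataFrame({"id": [1, 2], "age": [21, 30]}),
      pd.DataFrame({"id": [3], "age": [18]}),
  ])

  # Spark would combine the yielded DataFrames into the result
  # DataFrame; pd.concat plays that role here.
  result = pd.concat(filter_adults(batches), ignore_index=True)
  ```

  Because the function receives an iterator rather than one concatenated DataFrame, it can stream over batches without materializing the whole partition in memory, which is the point of bounding batch size via `spark.sql.execution.arrow.maxRecordsPerBatch`.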

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
