[GitHub] [spark] HyukjinKwon opened a new pull request #24997: [SPARK-28198][PYTHON] Add mapPartitionsInPandas to allow an iterator of DataFrames

GitBox Fri, 28 Jun 2019 01:08:02 -0700

HyukjinKwon opened a new pull request #24997: [SPARK-28198][PYTHON] Add 
mapPartitionsInPandas to allow an iterator of DataFrames
URL: https://github.com/apache/spark/pull/24997
 
 
   ## What changes were proposed in this pull request?
   
   This PR proposes adding a `mapPartitionsInPandas` API to DataFrame, built on 
the existing `SCALAR_ITER` Pandas UDF type, as shown below:
   
   ```python
   from pyspark.sql.functions import pandas_udf, PandasUDFType
   
   df = spark.createDataFrame([(1, 20), (3, 40)], ["a", "b"])
   
   @pandas_udf(df.schema, PandasUDFType.SCALAR_ITER)
   def func(pdfs):
       for pdf in pdfs:
           print(pdf)
           yield pdf
   
   df.mapPartitionsInPandas(func).show()
   ```
   
   ```
   +---+---+
   |  a|  b|
   +---+---+
   |  1| 20|
   |  3| 40|
   +---+---+
   ```
   
   The current limitation of `SCALAR_ITER` is that the result must have the 
same length as the input, which is a serious restriction in practice. For 
instance, we cannot simply filter rows using Pandas APIs; we can only map 1 to 1.
   
   This API mimics `mapPartitions` but keeps the API shape of `SCALAR_ITER`, 
while allowing the result to have a different length from the input.
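
To make the length restriction concrete, here is a minimal pandas-only sketch (no Spark session involved) of the kind of iterator function this API is meant to accept. The name `filter_func` and the sample batches are illustrative assumptions, not code from this PR.

```python
import pandas as pd


def filter_func(pdfs):
    """Yield only the rows whose column ``a`` equals 1 from each batch."""
    for pdf in pdfs:
        # The output batch may be shorter than the input batch,
        # which a strict 1-to-1 mapping cannot express.
        yield pdf[pdf.a == 1]


# Two hypothetical input batches standing in for one partition's data.
batches = [
    pd.DataFrame({"a": [1, 3], "b": [20, 40]}),
    pd.DataFrame({"a": [1, 1], "b": [50, 60]}),
]

# Concatenate the filtered batches to inspect the combined result.
result = pd.concat(filter_func(iter(batches)), ignore_index=True)
print(result)
```

Under the proposal, such a function would presumably be decorated with `@pandas_udf(df.schema, PandasUDFType.SCALAR_ITER)` and passed to `df.mapPartitionsInPandas`, in the same way as `func` in the example above.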
   
   ### How does this PR implement it?
   
   This PR mimics both `dapply` with Arrow optimization and the Grouped Map 
Pandas UDF. On the Python execution side, it reuses the existing 
`SCALAR_ITER` code path.
   
   Therefore, externally, we don't introduce any new type of Pandas UDF, but 
internally we use another evaluation type code, `205` 
(`SQL_MAP_PANDAS_ITER_UDF`).
   
   This approach is similar to how window functions are implemented with 
Grouped Aggregation Pandas UDFs: internally we have `203` 
(`SQL_WINDOW_AGG_PANDAS_UDF`), but externally we just share the same 
`GROUPED_AGG` type.
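
The relationship between the user-facing UDF type and the internal evaluation type can be pictured as follows. This is a simplified sketch, not Spark's actual code: only the codes `203` and `205` and their names come from this description, while the `EvalType` enum, the `resolve_eval_type` helper, and the `context` strings are hypothetical.

```python
from enum import IntEnum


class EvalType(IntEnum):
    # Internal evaluation type codes quoted in this PR description.
    SQL_WINDOW_AGG_PANDAS_UDF = 203
    SQL_MAP_PANDAS_ITER_UDF = 205


def resolve_eval_type(external_type, context):
    """Hypothetical helper: choose the internal code from where the UDF is used."""
    table = {
        # GROUPED_AGG applied over a window runs as the window aggregation type.
        ("GROUPED_AGG", "window"): EvalType.SQL_WINDOW_AGG_PANDAS_UDF,
        # SCALAR_ITER passed to mapPartitionsInPandas runs as the map iterator type.
        ("SCALAR_ITER", "mapPartitionsInPandas"): EvalType.SQL_MAP_PANDAS_ITER_UDF,
    }
    return table[(external_type, context)]


print(int(resolve_eval_type("SCALAR_ITER", "mapPartitionsInPandas")))
```

The point of the sketch is that the user only ever names `SCALAR_ITER` or `GROUPED_AGG`; the internal code is chosen by the call site.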
   
   ## How was this patch tested?
   
   Manually tested, and unit tests were added.

