[GitHub] [spark] HyukjinKwon opened a new pull request #27165: [SPARK-28264][PYTHON][SQL] Support type hints in pandas UDF and rename/move inconsistent pandas UDF types

GitBox Thu, 09 Jan 2020 23:55:27 -0800

HyukjinKwon opened a new pull request #27165: [SPARK-28264][PYTHON][SQL] 
Support type hints in pandas UDF and rename/move inconsistent pandas UDF types
URL: https://github.com/apache/spark/pull/27165
 
 
   ### What changes were proposed in this pull request?
   
   This PR proposes to redesign pandas UDFs as described in [the 
proposal](https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit?usp=sharing).
   
   Note that this PR also addresses one of the future improvements described [here](https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit#heading=h.h3ncjpk6ujqu), "A couple of less-intuitive pandas UDF types" (by @zero323).
   
   In short,
   
   - A new, experimental way to define pandas UDFs with type hints, offered as an alternative to the existing one.
       ```python
       @pandas_udf(schema='...')
       def func(c1: Series, c2: Series) -> DataFrame:
           pass
       ```
   
   - Remove the three types below from pandas UDFs and make them separate standalone APIs, so that `pandas_udf` is consistent with regular `udf`s and other expressions.
   
       - `df.mapInPandas(udf)` -> `df.mapInPandas(func, schema)`
       - `df.groupby.apply(udf)` -> `df.groupby.applyInPandas(func, schema)`
       - `df.groupby.cogroup.apply(udf)` -> `df.groupby.cogroup.applyInPandas(func, schema)`
   
   - The existing ways are not deprecated for now.
       ```python
       @pandas_udf(schema='...', functionType=PandasUDFType.SCALAR)
       def func(c1, c2):
           pass
       ```
   If users are happy with this, I plan to deprecate the existing way and declare that type hints are no longer experimental.
   
   One design goal of this PR was to avoid touching the internals (since the old ways are not deprecated for now) and to support type hints with minimal changes, only at the interface.
   
   - Once we deprecate or remove the old ways, I think the internals will need another refactoring. At the very least, we should rename the internal pandas evaluation types.
   - If users find these experimental type hints unhelpful, we should simply revert the changes at the interface level.
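
To make the interface-level idea concrete, here is a minimal sketch of how a UDF type could be inferred from type hints alone. It is an illustration, not the inference code in this PR: `infer_udf_type` is a hypothetical helper, `Series` and `DataFrame` are stand-in classes so that the sketch runs without pandas, and the rules only cover the three hinted forms listed in this description. It assumes Python 3.8+ for `typing.get_origin`.

```python
import collections.abc
import inspect
from typing import Iterator, Tuple, get_origin, get_type_hints


class Series:
    """Hypothetical stand-in for pandas.Series."""


class DataFrame:
    """Hypothetical stand-in for pandas.DataFrame."""


PANDAS_TYPES = (Series, DataFrame)


def infer_udf_type(func):
    """Guess a pandas UDF type name from a function's type hints (illustrative only)."""
    hints = get_type_hints(func)
    params = [hints.get(name) for name in inspect.signature(func).parameters]
    ret = hints.get("return")
    if not params or None in params or ret is None:
        raise TypeError("every parameter and the return value need a type hint")

    def is_iterator(hint):
        # typing.Iterator[X] reports collections.abc.Iterator as its origin.
        return get_origin(hint) is collections.abc.Iterator

    # Iterator in, Iterator out: treated as SCALAR_ITER.
    if len(params) == 1 and is_iterator(params[0]) and is_iterator(ret):
        return "SCALAR_ITER"
    # Series/DataFrame in: SCALAR if a Series/DataFrame comes out,
    # otherwise assume an aggregation that returns a plain scalar.
    if all(p in PANDAS_TYPES for p in params):
        return "SCALAR" if ret in PANDAS_TYPES else "GROUPED_AGG"
    raise TypeError("unsupported type hints: %r" % (hints,))


def scalar(c1: Series, c2: DataFrame) -> Series:
    pass


def scalar_iter(batches: Iterator[Tuple[Series, DataFrame]]) -> Iterator[Series]:
    pass


def grouped_agg(c1: Series, c2: DataFrame) -> int:
    pass


for f in (scalar, scalar_iter, grouped_agg):
    print(f.__name__, "->", infer_udf_type(f))
```

Because the inference only reads the signature, the evaluation path behind each UDF type can stay untouched, which is the point of limiting the change to the interface.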
   
   ### Why are the changes needed?
   
   These changes address old design issues in pandas UDFs. Please see [the proposal](https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit?usp=sharing).
   
   ### Does this PR introduce any user-facing change?
   
   There are no behaviour changes.
   
   This PR adds new ways to define pandas UDFs with type hints. See below.
   
   **SCALAR**:
   
   ```python
   @pandas_udf(schema='...')
   def func(c1: Series, c2: DataFrame) -> Series:
       pass  # DataFrame represents a struct column
   ```
   
   **SCALAR_ITER**:
   
   ```python
   @pandas_udf(schema='...')
   def func(iter: Iterator[Tuple[Series, DataFrame, ...]]) -> Iterator[Series]:
       pass  # Same as SCALAR but wrapped by Iterator
   ```
   
   **GROUPED_AGG**:
   
   ```python
   @pandas_udf(schema='...')
   def func(c1: Series, c2: DataFrame) -> int:
       pass  # DataFrame represents a struct column
   ```
   
   **GROUPED_MAP**:
   
   This was added in Spark 2.3 by SPARK-20396. As described above, it keeps the existing behaviour. In addition, there is a new alias `groupby.applyInPandas` for `groupby.apply`. See the example below:
   
   
   ```python
   def func(pdf):
       return pdf
   
   df.groupby("...").applyInPandas(func, schema=df.schema)
   ```
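
Since `applyInPandas` takes a plain Python function rather than a UDF object, the function can be exercised with pandas alone. The sketch below assumes a hypothetical `subtract_mean` function and columns `id` and `v`; none of these names come from this PR.

```python
import pandas as pd


def subtract_mean(pdf):
    # Receives all rows of one group as a pandas DataFrame and returns
    # a DataFrame whose columns match the schema passed to applyInPandas.
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())


# Local check on a single group, no Spark session required.
group = pd.DataFrame({"id": [1, 1], "v": [1.0, 3.0]})
result = subtract_mean(group)
print(result)

# With Spark, assuming a DataFrame `df` with columns `id` and `v`:
# df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double")
```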
   
   
   **MAP_ITER**:
   
   This was added in Spark 3.0 by SPARK-28198, and this PR replaces its usages. See the example below:
   
   ```python
   def func(iter):
       for df in iter:
           yield df
   
   df.mapInPandas(func, df.schema)
   ```
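
A slightly less trivial function than the identity above might filter each batch. This is a sketch under assumptions: `keep_adults` and the `age` column are hypothetical, and the batches are built by hand here, whereas Spark would supply them.

```python
import pandas as pd


def keep_adults(batches):
    # `batches` is an iterator of pandas DataFrames; yield one
    # (possibly smaller) DataFrame per input batch.
    for pdf in batches:
        yield pdf[pdf["age"] >= 18]


# Hand-made batches standing in for what Spark would pass in.
batches = iter([pd.DataFrame({"age": [15, 30]}), pd.DataFrame({"age": [42, 7]})])
out = pd.concat(keep_adults(batches), ignore_index=True)
print(out)

# With Spark, assuming a DataFrame `df` with an integer `age` column:
# df.mapInPandas(keep_adults, df.schema)
```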
   
   
   **COGROUPED_MAP**:
   
   This was added in Spark 3.0 by SPARK-27463, and this PR replaces its usages. See the example below:
   
   ```python
   import pandas as pd

   def asof_join(left, right):
       # Both arguments are pandas DataFrames holding one cogroup each.
       return pd.merge_asof(left, right, on="...", by="...")
   
   df1.groupby("...").cogroup(df2.groupby("...")).applyInPandas(asof_join, schema="...")
   ```
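
To show what such a function receives and returns, here is the same as-of join with concrete columns, run on two hand-made pandas DataFrames. The column names `time`, `id`, `v1` and `v2` are assumptions made for illustration; the elided `"..."` values above are left as they are.

```python
import pandas as pd


def asof_join(left, right):
    # For each left row, pick the latest right row with the same `id`
    # whose `time` is not after the left row's `time`.
    return pd.merge_asof(left, right, on="time", by="id")


left = pd.DataFrame({"time": [1, 5, 10], "id": [1, 1, 1], "v1": [1.0, 2.0, 3.0]})
right = pd.DataFrame({"time": [2, 6], "id": [1, 1], "v2": ["a", "b"]})
joined = asof_join(left, right)
print(joined)
```

The first left row has no earlier match on the right, so its `v2` is missing. `pd.merge_asof` requires both inputs to be sorted by the `on` column.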
   
   ### How was this patch tested?
   
   Unit tests were added and run against Python 2.7, 3.6 and 3.7.


