EnricoMi opened a new pull request, #36120:
URL: https://github.com/apache/spark/pull/36120

   ### What changes were proposed in this pull request?
   Methods `wrap_cogrouped_map_pandas_udf` and `wrap_grouped_map_pandas_udf` in 
`python/pyspark/worker.py` do not need to reject a `pd.DataFrame` with no 
columns returned by the UDF when that DataFrame is empty (zero rows). This allows 
returning empty DataFrames without the need to define columns. The DataFrame is 
empty after all!
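
The relaxed rule can be sketched in plain pandas. This is only an illustration, not the code in `worker.py`: the function name `verify_udf_result`, its parameters, and the error messages are hypothetical. It also assumes that an empty DataFrame with a non-zero but wrong number of columns is still rejected, which the description above does not state explicitly.

```python
import pandas as pd


def verify_udf_result(result, expected_columns):
    """Hypothetical stand-in for the result check done by the wrapper methods.

    `expected_columns` is the number of fields in the declared schema.
    """
    if not isinstance(result, pd.DataFrame):
        raise TypeError(
            "UDF should return a pandas.DataFrame, got {}".format(type(result))
        )

    # Relaxed rule: zero rows and zero columns is accepted as "no output".
    empty_without_columns = len(result) == 0 and len(result.columns) == 0

    # Assumption: every other result must still match the schema width.
    if not empty_without_columns and len(result.columns) != expected_columns:
        raise RuntimeError(
            "Expected {} columns, got {}".format(expected_columns, len(result.columns))
        )
    return result


# Accepted: an empty frame without columns, and a frame matching the schema.
verify_udf_result(pd.DataFrame([]), expected_columns=2)
verify_udf_result(pd.DataFrame([(1, 2.0)]), expected_columns=2)
```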
   
   ### Why are the changes needed?
   Returning an empty DataFrame from the lambda given to `applyInPandas` should 
be as easy as this:
   
   ```python
   return pd.DataFrame([])
   ```
   
   However, PySpark requires that empty DataFrame to have the right _number_ of 
columns. This seems redundant as the schema is already defined in the 
`applyInPandas` call. Returning a non-empty DataFrame does not require defining 
columns.
   
   Here is an example to reproduce:
   ```python
   import pandas as pd
   
   # Assumes an active SparkSession named `spark`, e.g. in the pyspark shell.
   df = spark.createDataFrame(
       [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
       ("id", "v"))
   
   def mean_func(key, pdf):
       # `key` is a tuple holding the grouping values of the current group
       if key == (1,):
           # empty result without any columns
           return pd.DataFrame([])
       else:
           return pd.DataFrame([key + (pdf.v.mean(),)])
   
   df.groupby("id").applyInPandas(mean_func, schema="id long, v double").show()
   ```
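
For comparison, the difference between the two kinds of empty DataFrame can be seen in plain pandas, without a Spark session. The variable names are illustrative, and using the schema's column names for the explicitly declared frame is an assumption about how one would satisfy the column-count requirement.

```python
import pandas as pd

# What the example above returns for key (1,): no rows and no columns.
bare = pd.DataFrame([])

# An empty frame that spells out the columns by hand. The names repeat
# what schema="id long, v double" already states.
declared = pd.DataFrame(columns=["id", "v"])

print(bare.shape)      # (0, 0)
print(declared.shape)  # (0, 2)
```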
   
   ### Does this PR introduce _any_ user-facing change?
   It changes the behaviour of the following calls to allow returning an empty 
`pd.DataFrame` without defining columns. The PySpark DataFrame returned by 
`applyInPandas` is unchanged:
   
   - `df.groupby(…).applyInPandas(…)`
   - `df.groupby(…).cogroup(…).applyInPandas(…)`
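
For the cogrouped case, a function of roughly the following shape would benefit from the change. This is a hedged sketch: `join_or_empty`, the column names `id`, `v`, `w`, and the sample data are made up for illustration, and both branches are exercised here in plain pandas only.

```python
import pandas as pd


def join_or_empty(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cogrouped-map function: join both groups on `id`,
    or produce no output when one side has no rows for the key."""
    if left.empty or right.empty:
        # No columns declared here; the schema is given to applyInPandas.
        return pd.DataFrame([])
    return pd.merge(left, right, on="id")


# Plain-pandas check of both branches (no Spark session needed).
left = pd.DataFrame({"id": [1, 1], "v": [1.0, 2.0]})
right = pd.DataFrame({"id": [1], "w": [10.0]})

joined = join_or_empty(left, right)              # two rows: id, v, w
nothing = join_or_empty(left, right.iloc[0:0])   # right side has no rows
```

In PySpark, such a function would be passed along the lines of `df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(join_or_empty, schema="id long, v double, w double")`, where `df1` and `df2` are assumed Spark DataFrames with matching columns.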
   
   ### How was this patch tested?
   Tests are added that cover `applyInPandas` returning:
   
   - an empty DataFrame with no columns
   - an empty DataFrame with the wrong number of columns
   - a non-empty DataFrame with the wrong number of columns
   - something other than a `pd.DataFrame`
   
   TODO:
   - test cogroup
   - check mapInPandas
   - look for other methods returning pd.DataFrames


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

