EnricoMi opened a new pull request, #36120:
URL: https://github.com/apache/spark/pull/36120
### What changes were proposed in this pull request?
The methods `wrap_cogrouped_map_pandas_udf` and `wrap_grouped_map_pandas_udf` in
`python/pyspark/worker.py` do not need to reject a `pd.DataFrame` with no
columns returned by the UDF when that DataFrame is empty (zero rows). This allows
returning empty DataFrames without having to define columns. The DataFrame is
empty after all!
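The relaxed rule can be illustrated with a minimal pandas-only sketch. `check_udf_result` and `is_accepted` are hypothetical names used for illustration, not the actual code in `worker.py`, and the sketch assumes that an empty DataFrame that does define columns is still checked against the expected column count:

```python
import pandas as pd


def check_udf_result(result, expected_num_columns):
    """Hypothetical stand-in for the result check, not the code in worker.py."""
    if not isinstance(result, pd.DataFrame):
        raise TypeError(
            f"Expected pandas.DataFrame, got {type(result).__name__}"
        )
    # Relaxed rule: a result with zero rows and zero columns is accepted as-is.
    if len(result) == 0 and len(result.columns) == 0:
        return result
    # Assumption: every other result must match the expected column count.
    if len(result.columns) != expected_num_columns:
        raise ValueError(
            f"Expected {expected_num_columns} columns, got {len(result.columns)}"
        )
    return result


def is_accepted(result, expected_num_columns=2):
    """Return True if the sketch check passes, False if it raises."""
    try:
        check_udf_result(result, expected_num_columns)
        return True
    except (TypeError, ValueError):
        return False


outcomes = {
    "empty, no columns": is_accepted(pd.DataFrame([])),
    "empty, wrong column count": is_accepted(pd.DataFrame(columns=["id"])),
    "non-empty, wrong column count": is_accepted(pd.DataFrame([(1,)])),
    "non-empty, right column count": is_accepted(pd.DataFrame([(1, 1.5)])),
    "not a DataFrame": is_accepted([1, 1.5]),
}
print(outcomes)
```

Under these assumptions only the first and fourth cases are accepted.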
### Why are the changes needed?
Returning an empty DataFrame from the lambda given to `applyInPandas` should
be as easy as this:
```python
return pd.DataFrame([])
```
However, PySpark requires that empty DataFrame to have the right _number_ of
columns. This seems redundant as the schema is already defined in the
`applyInPandas` call. Returning a non-empty DataFrame does not require defining
columns.
Here is an example to reproduce:
```python
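import pandas as pd
from pyspark.sql.functions import pandas_udf, ceil

# Assumes an active SparkSession named `spark`, e.g. in the pyspark shell.
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

def mean_func(key, pdf):
    if key == (1,):
        # Empty result without columns: rejected before this change.
        return pd.DataFrame([])
    else:
        # One row: the group key followed by the mean of column v.
        return pd.DataFrame([key + (pdf.v.mean(),)])

df.groupby("id").applyInPandas(mean_func, schema="id long, v double").show()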
```
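For comparison, here is a small pandas-only sketch of the two kinds of empty DataFrame involved. The column names are taken from the schema in the example above. Given the column-count requirement described earlier, the second form is presumably the one accepted without this change:

```python
import pandas as pd

# Empty DataFrame with no columns, as returned in the example above.
no_columns = pd.DataFrame([])

# Empty DataFrame that spells out the columns of the schema "id long, v double".
with_columns = pd.DataFrame(columns=["id", "v"])

print(no_columns.shape)    # (0, 0)
print(with_columns.shape)  # (0, 2)
```

Both frames have zero rows; they differ only in the number of columns they declare.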
### Does this PR introduce _any_ user-facing change?
It changes the behaviour of the following calls so that the UDF may return an
empty `pd.DataFrame` without defining columns. The PySpark DataFrame returned by
`applyInPandas` is unchanged:
- `df.groupby(…).applyInPandas(…)`
- `df.groupby(…).cogroup(…).applyInPandas(…)`
### How was this patch tested?
Tests are added for `applyInPandas` returning:
- an empty DataFrame with no columns
- an empty DataFrame with the wrong number of columns
- a non-empty DataFrame with the wrong number of columns
- something other than a `pd.DataFrame`
TODO:
- test cogroup
- check mapInPandas
- look for other methods returning pd.DataFrames