EnricoMi opened a new pull request, #42316:
URL: https://github.com/apache/spark/pull/42316
### What changes were proposed in this pull request?
This merges #39952 into the 3.5 branch.
Similar to #38223, this improves the error messages raised when a Python function provided
to `DataFrame.mapInPandas` returns a pandas DataFrame that does not match the
expected schema.
With
```Python
df = spark.range(2).withColumn("v", col("id"))
```
**Mismatching column names:**
```Python
df.mapInPandas(lambda it: it, "id long, val long").show()
# was: KeyError: 'val'
# now: RuntimeError: Column names of the returned pandas.DataFrame do not match specified schema.
#      Missing: val  Unexpected: v
```
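Not part of the PR itself, but for illustration: one way a user could resolve the column-name mismatch above is to rename the returned columns to match the declared schema `"id long, val long"`. A minimal pandas-only sketch of such a mapped function (the `rename_to_schema` name is hypothetical; in practice this generator would be passed to `mapInPandas`):

```python
import pandas as pd

# mapInPandas passes an iterator of pandas DataFrames; renaming "v"
# to "val" makes each yielded batch match the declared schema.
def rename_to_schema(batches):
    for pdf in batches:
        yield pdf.rename(columns={"v": "val"})

# Pandas-only illustration of the per-batch transformation:
batch = pd.DataFrame({"id": [0, 1], "v": [0, 1]})
out = next(rename_to_schema(iter([batch])))
print(list(out.columns))  # column names now match the schema
```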
**Python function not returning iterator:**
```Python
df.mapInPandas(lambda it: 1, "id long").show()
# was: TypeError: 'int' object is not iterable
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is <class 'int'>
```
**Python function not returning iterator of pandas.DataFrame:**
```Python
df.mapInPandas(lambda it: [1], "id long").show()
# was: TypeError: Return type of the user-defined function should be Pandas.DataFrame, but is <class 'int'>
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is iterator of <class 'int'>
# sometimes: ValueError: A field of type StructType expects a pandas.DataFrame, but got: <class 'list'>
# now: TypeError: Return type of the user-defined function should be iterator of pandas.DataFrame, but is iterator of <class 'list'>
```
**Mismatching types (ValueError and TypeError):**
```Python
df.mapInPandas(lambda it: it, "id int, v string").show()
# was: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
# now: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
#      The above exception was the direct cause of the following exception:
#      TypeError: Exception thrown when converting pandas.Series (int64) with name 'v' to Arrow Array (string).

df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in it], "id int, v double").show()
# was: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
# now: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
#      The above exception was the direct cause of the following exception:
#      ValueError: Exception thrown when converting pandas.Series (object) with name 'v' to Arrow Array (double).

with self.sql_conf({"spark.sql.execution.pandas.convertToArrowArraySafely": True}):
    df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in it], "id int, v double").show()
# was: ValueError: Exception thrown when converting pandas.Series (object) to Arrow Array (double).
#      It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check
#      can be disabled by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.
# now: ValueError: Exception thrown when converting pandas.Series (object) with name 'v' to Arrow Array (double).
#      It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check
#      can be disabled by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.
```
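Again for illustration only (not part of the PR): the type-mismatch errors above would typically be fixed by casting the offending column to the declared schema type before yielding each batch. A pandas-only sketch, assuming a schema of `"id int, v double"` (the `cast_to_schema` name is hypothetical):

```python
import pandas as pd

# Cast the "v" column to float64 so each yielded batch converts
# cleanly to the declared Arrow type (double).
def cast_to_schema(batches):
    for pdf in batches:
        yield pdf.assign(v=pdf["v"].astype("float64"))

# Strings in "v" would previously trigger the ArrowInvalid error;
# after the cast they convert to double without complaint.
batch = pd.DataFrame({"id": [0, 1], "v": ["0", "1"]})
out = next(cast_to_schema(iter([batch])))
print(out["v"].dtype)  # float64
```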
### Why are the changes needed?
Existing errors are either generic (`KeyError`) or unhelpful (`'int' object is
not iterable`). The errors should help users spot the mismatching
columns by naming them.
The schema of the returned pandas DataFrames can only be checked while
processing the DataFrame, so such errors are expensive to hit. Therefore, they
should be expressive.
### Does this PR introduce _any_ user-facing change?
This only changes error messages, not behaviour.
### How was this patch tested?
Tests all cases of schema mismatch for `DataFrame.mapInPandas`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]