EnricoMi opened a new pull request, #42316:
URL: https://github.com/apache/spark/pull/42316

   ### What changes were proposed in this pull request?
   This merges #39952 into 3.5 branch.
   
   Similar to #38223, improve the error messages when a Python method provided 
to `DataFrame.mapInPandas` returns a Pandas DataFrame that does not match the 
expected schema.
   
   With
   ```Python
   df = spark.range(2).withColumn("v", col("id"))
   ```
   
   **Mismatching column names:**
   ```Python
   df.mapInPandas(lambda it: it, "id long, val long").show()
   # was: KeyError: 'val'
   # now: RuntimeError: Column names of the returned pandas.DataFrame do not 
match specified schema.
   #      Missing: val  Unexpected: v
   ```
   
   **Python function not returning iterator:**
   ```Python
   df.mapInPandas(lambda it: 1, "id long").show()
   # was: TypeError: 'int' object is not iterable
   # now: TypeError: Return type of the user-defined function should be 
iterator of pandas.DataFrame, but is <class 'int'>
   ```
   
   **Python function not returning iterator of pandas.DataFrame:**
   ```Python
   df.mapInPandas(lambda it: [1], "id long").show()
   # was: TypeError: Return type of the user-defined function should be 
Pandas.DataFrame, but is <class 'int'>
   # now: TypeError: Return type of the user-defined function should be 
iterator of pandas.DataFrame, but is iterator of <class 'int'>
   # sometimes: ValueError: A field of type StructType expects a 
pandas.DataFrame, but got: <class 'list'>
   # now: TypeError: Return type of the user-defined function should be 
iterator of pandas.DataFrame, but is iterator of <class 'list'>
   ```
   
   **Mismatching types (ValueError and TypeError):**
   ```Python
   df.mapInPandas(lambda it: it, "id int, v string").show()
   # was: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got 
int64
   # now: pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got 
int64
   #      The above exception was the direct cause of the following exception:
   #      TypeError: Exception thrown when converting pandas.Series (int64) 
with name 'v' to Arrow Array (string).
   
   df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in it], 
"id int, v double").show()
   # was: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried 
to convert to double
   # now: pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried 
to convert to double
   #      The above exception was the direct cause of the following exception:
   #      ValueError: Exception thrown when converting pandas.Series (object) 
with name 'v' to Arrow Array (double).
   
   with self.sql_conf({"spark.sql.execution.pandas.convertToArrowArraySafely": 
True}):
     df.mapInPandas(lambda it: [pdf.assign(v=pdf["v"].apply(str)) for pdf in 
it], "id int, v double").show()
   # was: ValueError: Exception thrown when converting pandas.Series (object) 
to Arrow Array (double).
   #      It can be caused by overflows or other unsafe conversions warned by 
Arrow. Arrow safe type check can be disabled
   #      by using SQL config 
`spark.sql.execution.pandas.convertToArrowArraySafely`.
   # now: ValueError: Exception thrown when converting pandas.Series (object) 
with name 'v' to Arrow Array (double).
   #      It can be caused by overflows or other unsafe conversions warned by 
Arrow. Arrow safe type check can be disabled
   #      by using SQL config 
`spark.sql.execution.pandas.convertToArrowArraySafely`.
   ```
   
   ### Why are the changes needed?
   Existing errors are generic (`KeyError`) or meaningless (`'int' object is 
not iterable`). The errors should help users in spotting the mismatching 
columns by naming them.
   
   The schema of the returned Pandas DataFrames can only be checked during 
processing the DataFrame, so such errors are very expensive. Therefore, they 
should be expressive.
   
   ### Does this PR introduce _any_ user-facing change?
   This only changes error messages, not behaviour.
   
   ### How was this patch tested?
   Tests all cases of schema mismatch for `DataFrame.mapInPandas`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to