Ian Markowitz created SPARK-53507:
-------------------------------------

             Summary: Add Breaking Change info to Spark error classes
                 Key: SPARK-53507
                 URL: https://issues.apache.org/jira/browse/SPARK-53507
             Project: Spark
          Issue Type: Task
          Components: Spark Core
    Affects Versions: 4.1.0
            Reporter: Ian Markowitz


Users of Apache Spark often have their jobs break when upgrading to a new 
version. We'd like to make such breakages easier to diagnose and mitigate 
using config flags and a concept called "Breaking Change Info".

This is an example of a breaking change:
- Since Spark 4.1, `mapInPandas` and `mapInArrow` enforce strict validation of 
the result against the schema. Column names must match exactly, and types 
must match with compatible nullability. To restore the previous behavior, set 
`spark.sql.execution.arrow.pyspark.validateSchema.enabled` to `false`.
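For example, the opt-out could be applied at submission time (hypothetical job script; the config key is the one named in the release note above):

```shell
# Restore the pre-4.1 behavior by disabling strict result validation
spark-submit \
  --conf spark.sql.execution.arrow.pyspark.validateSchema.enabled=false \
  my_job.py
```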


This can be mitigated as follows:
* When the breaking change is introduced, we define an error class with a 
`breakingChangeInfo` object. This includes a message, a Spark config, and a 
flag indicating whether the mitigation can be applied automatically.
Example:
```
"MAP_VALIDATION_ERROR": {
  "message": [
    "Result validation failed: The schema does not match the expected schema."
  ],
  "breakingChangeInfo": {
    "migrationMessage": [
      "To disable strict result validation, set `spark.sql.execution.arrow.pyspark.validateSchema.enabled` to `false`."
    ],
    "mitigationSparkConfig": {
      "key": "spark.sql.execution.arrow.pyspark.validateSchema.enabled",
      "value": "false"
    },
    "autoMitigation": true
  }
}
```
* In the Spark code, whenever this particular breaking change is hit, we 
throw an error with the matching error class.
* A platform running the Spark job can handle this error by re-running the 
job with the specified config applied. This enables an automatic, successful 
retry with the breaking change mitigated.
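The platform-side retry described above could look roughly like the sketch below. All names here are hypothetical and simply mirror the JSON example in the description; `run_job` stands in for actually launching a Spark job:

```python
# Sketch of a platform-side auto-mitigation retry loop (hypothetical API).

class BreakingChangeError(Exception):
    """Simulates a Spark error that carries its error class and breakingChangeInfo."""

    def __init__(self, error_class, breaking_change_info):
        super().__init__(error_class)
        self.error_class = error_class
        self.breaking_change_info = breaking_change_info


def run_job(conf):
    # Stand-in for launching a Spark job: fails until the mitigation
    # config from the example error class is applied.
    key = "spark.sql.execution.arrow.pyspark.validateSchema.enabled"
    if conf.get(key) != "false":
        raise BreakingChangeError(
            "MAP_VALIDATION_ERROR",
            {
                "mitigationSparkConfig": {"key": key, "value": "false"},
                "autoMitigation": True,
            },
        )
    return "success"


def run_with_auto_mitigation(conf):
    try:
        return run_job(conf)
    except BreakingChangeError as e:
        info = e.breaking_change_info
        if not info.get("autoMitigation"):
            raise  # the mitigation must not be applied automatically
        mitigation = info["mitigationSparkConfig"]
        retried_conf = {**conf, mitigation["key"]: mitigation["value"]}
        return run_job(retried_conf)


print(run_with_auto_mitigation({}))  # retry applies the config; prints "success"
```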



