[ https://issues.apache.org/jira/browse/SPARK-53507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ian Markowitz updated SPARK-53507:
----------------------------------
    Description: 
Users of Apache Spark often have their jobs break when upgrading to a new 
version. We'd like to improve this using config flags and a concept called 
"Breaking Change Info".

This is an example of a breaking change:
{quote}Since Spark 4.1, `mapInPandas` and `mapInArrow` enforce strict validation of the result against the schema. Column names must match exactly, and types must match with compatible nullability. To restore the previous behavior, set `spark.sql.execution.arrow.pyspark.validateSchema.enabled` to `false`.{quote}

This can be mitigated as follows:
 * When the breaking change is created, we define an error class with a `breakingChangeInfo` object. This includes a message, a Spark config, and a flag indicating whether the mitigation can be applied automatically.

Example:
{code:java}
"MAP_VALIDATION_ERROR": {
  "message": [
    "Result validation failed: The schema does not match the expected schema.",
  ],
  "breakingChangeInfo": {
    "migrationMessage": [
      "To disable strict result validation, set set 
`spark.sql.execution.arrow.pyspark.validateSchema.enabled` to `false`"
    ],
    "mitigationSparkConfig": {
      "key": "spark.sql.execution.arrow.pyspark.validateSchema.enabled",
      "value": "false"
    },
    "autoMitigation": true
  }
}
{code}
 * In the Spark code, when this particular breaking change is hit, we always throw an error with the matching error class (a sketch of the raising side follows this list).
 * A platform running the Spark job can handle this error by re-running the job with the specified config applied. This lets the platform automatically retry the job with the breaking change mitigated (see the retry sketch below).
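
For illustration, here is a minimal sketch of the raising side in Python. The `BreakingChangeError` class and the in-line registry are hypothetical stand-ins for Spark's error-conditions JSON and its SparkThrowable machinery, not actual Spark APIs:
{code:python}
import json

# Hypothetical in-memory registry standing in for the error-conditions
# JSON file that Spark ships with.
ERROR_CLASSES = json.loads("""
{
  "MAP_VALIDATION_ERROR": {
    "message": ["Result validation failed: The schema does not match the expected schema."],
    "breakingChangeInfo": {
      "migrationMessage": ["To disable strict result validation, set `spark.sql.execution.arrow.pyspark.validateSchema.enabled` to `false`"],
      "mitigationSparkConfig": {
        "key": "spark.sql.execution.arrow.pyspark.validateSchema.enabled",
        "value": "false"
      },
      "autoMitigation": true
    }
  }
}
""")

class BreakingChangeError(Exception):
    """Hypothetical exception carrying its error class and breakingChangeInfo."""

    def __init__(self, error_class: str):
        entry = ERROR_CLASSES[error_class]
        super().__init__(" ".join(entry["message"]))
        self.error_class = error_class
        self.breaking_change_info = entry.get("breakingChangeInfo")

def validate_result_schema(result_matches_schema: bool) -> None:
    # At the point where the breaking change is hit, always raise the
    # error with the matching error class.
    if not result_matches_schema:
        raise BreakingChangeError("MAP_VALIDATION_ERROR")
{code}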
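Continuing the sketch, a platform-side handler could catch the error, read `breakingChangeInfo`, and retry once with the mitigation config applied. `run_job` is a hypothetical callable that accepts a dict of extra Spark confs:
{code:python}
def run_with_auto_mitigation(run_job, confs=None):
    # Reuses BreakingChangeError from the sketch above.
    confs = dict(confs or {})
    try:
        return run_job(confs)
    except BreakingChangeError as e:
        info = e.breaking_change_info
        if not (info and info.get("autoMitigation")):
            raise  # no safe automatic mitigation; surface the error as-is
        mitigation = info["mitigationSparkConfig"]
        confs[mitigation["key"]] = mitigation["value"]
        # Retry exactly once with the breaking change mitigated.
        return run_job(confs)
{code}
When the mitigation is applied, the platform would typically also surface `migrationMessage` to the job owner, so the code can be migrated before the legacy config is eventually removed.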

> Add Breaking Change info to Spark error classes
> -----------------------------------------------
>
>                 Key: SPARK-53507
>                 URL: https://issues.apache.org/jira/browse/SPARK-53507
>             Project: Spark
>          Issue Type: Task
>          Components: Spark Core
>    Affects Versions: 4.1.0
>            Reporter: Ian Markowitz
>            Priority: Major
>