[ https://issues.apache.org/jira/browse/SPARK-53507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ian Markowitz updated SPARK-53507:
----------------------------------
Description:

Users of Apache Spark often have their jobs break when upgrading to a new version. We'd like to improve this using config flags and a concept called "Breaking Change Info".

This is an example of a breaking change:

{quote}Since Spark 4.1, `mapInPandas` and `mapInArrow` enforce strict validation of the result against the schema. The column names must match exactly, and the types must match with compatible nullability. To restore the previous behavior, set `spark.sql.execution.arrow.pyspark.validateSchema.enabled` to `false`.{quote}

(A sketch of applying this mitigation by hand appears after the list below.)

This can be mitigated as follows:
* When the breaking change is created, we define an error class with a `breakingChangeInfo` object. This includes a message, a Spark config, and a flag indicating whether the mitigation can be applied automatically. Example:
{code:java}
"MAP_VALIDATION_ERROR": {
  "message": [
    "Result validation failed: The schema does not match the expected schema."
  ],
  "breakingChangeInfo": {
    "migrationMessage": [
      "To disable strict result validation, set `spark.sql.execution.arrow.pyspark.validateSchema.enabled` to `false`"
    ],
    "mitigationSparkConfig": {
      "key": "spark.sql.execution.arrow.pyspark.validateSchema.enabled",
      "value": "false"
    },
    "autoMitigation": true
  }
}
{code}
* In the Spark code, when this particular breaking change is hit, we always throw an error with the matching error class (see the second sketch below).
* A platform running the Spark job can handle this error by re-running the job with the specified config applied. This enables the platform to automatically retry the job with the breaking change mitigated (see the third sketch below).
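As a concrete illustration of the mitigation above, here is a minimal PySpark sketch of applying the config by hand; the config key comes from the migration note, while the application name is made up for the example:

{code:python}
from pyspark.sql import SparkSession

# Restore the pre-4.1 behavior by disabling strict result validation
# for mapInPandas / mapInArrow (config key from the migration note above).
spark = (
    SparkSession.builder
    .appName("breaking-change-mitigation-example")  # illustrative name
    .config("spark.sql.execution.arrow.pyspark.validateSchema.enabled", "false")
    .getOrCreate()
)
{code}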
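Second sketch: throwing with the matching error class. This assumes the error class above has been registered in PySpark's error-conditions file; the validation helper and its arguments are hypothetical stand-ins for the real mapInPandas/mapInArrow code path, not the actual Spark call site:

{code:python}
from pyspark.errors import PySparkRuntimeError

def validate_result_schema(actual_schema, expected_schema):
    # Hypothetical validation hook, not the actual Spark code path.
    if actual_schema != expected_schema:
        # Throw with the matching error class so a caller (or platform)
        # can look up its breakingChangeInfo and apply the mitigation.
        raise PySparkRuntimeError(
            error_class="MAP_VALIDATION_ERROR",
            message_parameters={},
        )
{code}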
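Third sketch: the platform-side retry. Everything platform-specific here is hypothetical (`submit_job`, `JobFailedError`); the only real inputs are the parsed error-class definitions and the `breakingChangeInfo` fields shown above:

{code:python}
class JobFailedError(Exception):
    """Hypothetical failure surfaced by the platform, carrying the error class."""
    def __init__(self, error_class):
        super().__init__(error_class)
        self.error_class = error_class

def submit_job(job, extra_confs):
    """Hypothetical hook: launch the Spark job with extra --conf entries."""
    raise NotImplementedError("platform-specific")

def run_with_auto_mitigation(job, error_classes):
    # error_classes: dict parsed (e.g. via json.load) from the error-class
    # definitions, keyed by error class name as in the example above.
    try:
        return submit_job(job, extra_confs={})
    except JobFailedError as e:
        info = error_classes.get(e.error_class, {}).get("breakingChangeInfo")
        if info and info.get("autoMitigation"):
            conf = info["mitigationSparkConfig"]
            # Retry once with the documented mitigation config applied.
            return submit_job(job, extra_confs={conf["key"]: conf["value"]})
        raise
{code}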
> Add Breaking Change info to Spark error classes
> -----------------------------------------------
>
> Key: SPARK-53507
> URL: https://issues.apache.org/jira/browse/SPARK-53507
> Project: Spark
> Issue Type: Task
> Components: Spark Core
> Affects Versions: 4.1.0
> Reporter: Ian Markowitz
> Priority: Major