[ 
https://issues.apache.org/jira/browse/SPARK-57452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-57452:
-----------------------------------

    Assignee: Wenchen Fan

> Auditing the migration guide
> ----------------------------
>
>                 Key: SPARK-57452
>                 URL: https://issues.apache.org/jira/browse/SPARK-57452
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Kubernetes, MLlib, PySpark, Spark Core, SQL, Structured 
> Streaming, Web UI
>    Affects Versions: 4.2.0
>            Reporter: Xiao Li
>            Assignee: Wenchen Fan
>            Priority: Blocker
>              Labels: pull-request-available
>
> I used AI to analyze more than 1,900 commits from the Spark 4.2.0 release and 
> identified 38 changes that appear to be missing from the migration guide.
> The JIRAs listed below were identified through this analysis. However, this 
> may not be a complete list, so please also review the remaining commits for 
> any additional migration guide updates that may be required.
>  
>  * *SPARK-55314[CONNECT] Propagate observed metrics errors to client*
> *Component:* CONNECT
> *Why no migration-guide note needed:* Should be documented: Observation.get 
> (Scala and Connect) now raises the underlying exception when metric 
> collection fails instead of returning an empty map; code that tolerated an 
> empty result on failure now sees a thrown exception.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, 
> {{Observation.get}} (Scala and Spark Connect) raises the underlying exception 
> when observed-metric collection fails instead of silently returning an empty 
> map. Code that tolerated an empty result on failure now sees a thrown 
> exception. There is no opt-out.
>  * *SPARK-55655[MLLIB] Make {{CountVectorizer}} vocabulary deterministic when 
> counts are equal*
> *Component:* MLLIB
> *Why no migration-guide note needed:* Should be documented: CountVectorizer 
> now breaks ties between equal-count terms lexicographically, making the 
> vocabulary deterministic; this can change vocabulary term order and feature 
> indices versus prior (non-deterministic) output. No opt-out.
> *Proposed migration-guide message:* [Core] Since Spark 4.2, 
> {{CountVectorizer}} breaks ties between equal-count terms lexicographically 
> so the vocabulary is deterministic. This can change vocabulary term order and 
> feature indices compared with prior (non-deterministic) output. There is no 
> opt-out.
>  * *SPARK-47997[PS] Add errors parameter to DataFrame.drop and Series.drop*
> *Component:* PS
> *Why no migration-guide note needed:* Should be documented: ps 
> DataFrame.drop/Series.drop now raise KeyError if ANY label is missing 
> (previously only if all missing), for all pandas versions; pass 
> errors='ignore' to skip missing labels.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark 
> DataFrame.drop/Series.drop add an {{errors}} parameter defaulting to 
> {{'raise'}} and now raise KeyError if any requested label is missing 
> (previously only if all were missing), across all pandas versions. To skip 
> missing labels, pass {{errors='ignore'}}.
>  * *SPARK-56219[PS] Align groupby idxmax and idxmin skipna=False behavior 
> with pandas 2/3*
> *Component:* PS
> *Why no migration-guide note needed:* Should be documented: pandas-on-Spark 
> groupby idxmax/idxmin with skipna=False now returns null for NA groups 
> (pandas 2) or raises on NA inputs (pandas 3) instead of a label; no opt-out, 
> results change for existing skipna=False users.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark 
> {{GroupBy.idxmax}} and {{GroupBy.idxmin}} with {{skipna=False}} now follow 
> pandas semantics for NA: with pandas 2 they return null for groups containing 
> NA values, and with pandas 3 they raise on NA-containing inputs, instead of 
> returning an index label. There is no opt-out.
>  * *SPARK-55977[PS] Fix isin() to use strict type matching like pandas*
> *Component:* PS
> *Why no migration-guide note needed:* Should be documented: ps 
> Series/DataFrame.isin() now uses strict Python-type matching (e.g. 1 no 
> longer matches '1'), changing results for all pandas versions; this is 
> pandas-parity, no opt-out config.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark 
> Series/DataFrame {{isin()}} uses strict Python-type matching like pandas, so 
> values of incompatible types no longer match (for example integer 1 no longer 
> matches string '1'). Results change across all pandas versions. There is no 
> opt-out.
>  * *SPARK-54568[PYTHON] Avoid unnecessary pandas conversion in create 
> dataframe from ndarray*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: createDataFrame 
> from a numpy ndarray now requires pyarrow and converts ndarray to Arrow 
> directly, dropping np.dtype-based StructType inference so inferred schema can 
> differ.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, createDataFrame 
> from a NumPy ndarray converts the array directly to an Arrow Table and now 
> requires PyArrow. The previous np.dtype-based StructType inference is 
> dropped, so the inferred schema may differ. To control the schema, pass an 
> explicit schema.
>  * *SPARK-56186[PYTHON] Retire pypy*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: PyPy is no longer 
> officially supported in PySpark (CI, docker image, classifier, and 
> PyPy-specific code removed); PyPy users should migrate to CPython.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, PyPy is no longer 
> officially supported in PySpark: PyPy CI, the PyPy docker image, the setup.py 
> classifier, and PyPy-specific code/test skips have been removed. PyPy users 
> should migrate to CPython. There is no opt-out.
>  * *SPARK-55096[PYTHON] Update pandas minimum version in {{connect/setup.py}}*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: minimum pandas 
> raised to 2.2.0 for Spark Connect (was 2.0.0); pandas <2.2 is no longer 
> supported on Connect.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, the minimum 
> supported version for pandas on Spark Connect has been raised from 2.0.0 to 
> 2.2.0, matching the minimum already required by PySpark.
>  * *SPARK-54962[PYTHON] Fix nullable integers handling in Pandas UDF*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: Pandas UDFs on 
> nullable integer columns containing nulls now use a nullable Int extension 
> dtype instead of float64, so values/dtype inside the UDF change (fixing 
> precision loss for large integers); no opt-out.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, Pandas UDFs on a 
> nullable integer column that contains nulls receive a pandas nullable integer 
> extension dtype (e.g. Int64) instead of float64, fixing precision loss for 
> large integers. The dtype and values seen inside the UDF change accordingly. 
> There is no opt-out configuration.
>  * *SPARK-55583[PYTHON] Validate Arrow schema types in Python data source*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: a Python Data 
> Source read returning a pa.RecordBatch whose Arrow types differ from the 
> declared schema now fails with DATA_SOURCE_RETURN_SCHEMA_MISMATCH; 
> type-mismatched batches that previously loaded by coincidence now error.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, a Python data 
> source read that returns a {{pa.RecordBatch}} whose Arrow types differ from 
> the declared schema now fails with 
> {{DATA_SOURCE_RETURN_SCHEMA_MISMATCH}}. Type-mismatched batches that 
> previously loaded by coincidence now error. There is no opt-out.
>  * *SPARK-55416[SS][PYTHON] Streaming Python Data Source memory leak when 
> end-offset is not updated*
> *Component:* SS,PYTHON
> *Why no migration-guide note needed:* Should be documented: a Streaming 
> Python Data Source SimpleDataSourceStreamReader whose read() returns a 
> non-empty batch with end==start now fails with 
> STREAM_READER_OFFSET_DID_NOT_ADVANCE instead of leaking memory; affects 
> existing (buggy) reader impls.
> *Proposed migration-guide message:* [SS] Since Spark 4.2, a streaming Python 
> data source SimpleDataSourceStreamReader whose {{read()}} returns a non-empty 
> batch with end offset equal to start now fails with 
> {{SIMPLE_STREAM_READER_OFFSET_DID_NOT_ADVANCE}} instead of leaking driver 
> memory. Empty batches with end == start are still allowed. There is no 
> opt-out.
>  * *SPARK-56206[SQL] Fix case-insensitive duplicate CTE name detection*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: duplicate CTE 
> names differing only in case (e.g. WITH cte1, CTE1) now raise 
> DUPLICATED_CTE_NAMES instead of silently overwriting; previously-accepted 
> queries now fail. No opt-out.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, duplicate CTE name 
> detection is case-insensitive. CTE definitions whose names differ only in 
> case (e.g. {{WITH cte AS (...), CTE AS (...)}}) now raise 
> {{DUPLICATED_CTE_NAMES}} instead of silently overwriting the earlier 
> definition. There is no opt-out; rename the conflicting CTEs.
>  * *SPARK-56652[SQL] Always emit RELY/NORELY in DESCRIBE EXTENDED constraint 
> output*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: DESCRIBE EXTENDED 
> now always prints RELY/NORELY for table constraints (previously omitted the 
> default NORELY), changing the command output text for tools parsing it.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, {{DESCRIBE 
> EXTENDED}} always emits the {{RELY}}/{{NORELY}} token for table 
> constraints, including {{NORELY}} for the default state which was previously 
> omitted. This matches {{SHOW CREATE TABLE}} output and changes the command's 
> constraint output text for tools parsing it.
>  * *SPARK-55019[SQL] Allow DROP TABLE to drop VIEW*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: DROP TABLE on a 
> view now drops the view by default instead of raising 
> WRONG_COMMAND_FOR_OBJECT_TYPE; restore via 
> spark.sql.dropTableOnView.enabled=false.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, DROP TABLE on a 
> view drops the view by default instead of raising 
> {{WRONG_COMMAND_FOR_OBJECT_TYPE}}. To restore the previous behavior, set 
> {{spark.sql.dropTableOnView.enabled}} to {{false}}.
>  * *SPARK-54853[SQL] Always check {{hive.exec.max.dynamic.partitions}} on the 
> spark side*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: 
> hive.exec.max.dynamic.partitions is now always enforced Spark-side and the 
> session-level value is honored, changing when the limit error fires; error 
> renamed to DYNAMIC_PARTITION_WRITE_PARTITION_NUM_LIMIT_EXCEEDED.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the 
> {{hive.exec.max.dynamic.partitions}} limit for dynamic partition writes to 
> Hive SerDe tables is always enforced on the Spark side and honors the 
> session-level value, changing when the limit is checked. The error is now 
> reported as {{DYNAMIC_PARTITION_WRITE_PARTITION_NUM_LIMIT_EXCEEDED}}.
>  * *SPARK-55372[SQL] Fix {{SHOW CREATE TABLE}} for tables / views with 
> default collation*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: typeName/toString 
> of an explicitly UTF8_BINARY-collated StringType/CharType now render 'string 
> collate UTF8_BINARY' not 'string' (default non-collated unchanged), changing 
> SHOW CREATE TABLE and schema output for such columns.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, a 
> StringType/CharType/VarcharType with an explicit {{UTF8_BINARY}} collation 
> renders its collation in {{typeName}}/{{toString}} (for example 
> {{string collate UTF8_BINARY}}), changing SHOW CREATE TABLE and schema 
> output for such columns. Default non-collated strings are unchanged. No 
> opt-out.
>  * *SPARK-54918[SQL] Normalize floating numbers in array set operations*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: 
> array_distinct/union/intersect/except and arrays_overlap now normalize floats 
> so 0.0/-0.0 and differently-bit NaNs are treated as equal, changing results 
> of these array set operations. No opt-out.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the array set 
> functions {{array_distinct}}, {{array_union}}, 
> {{array_intersect}}, {{array_except}}, and {{arrays_overlap}} 
> normalize floating-point values, so {{0.0}} and {{-0.0}} and differently-bit 
> NaN values are treated as equal. This changes the results of these functions; 
> there is no opt-out.
>  * *SPARK-54777[SQL] Changed dropTable error handling in 
> JDBCTableCatalog.dropTable(...)*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: JDBC DROP TABLE 
> now only swallows object-not-found errors; other failures (permission, etc.) 
> propagate instead of silently returning, so a drop that previously appeared 
> to succeed now throws.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the JDBC table 
> catalog only swallows object-not-found errors when running DROP TABLE; other 
> failures such as permission-denied or constraint violations now propagate 
> instead of silently returning success. A DROP TABLE that previously appeared 
> to succeed may now throw.
>  * *SPARK-57040[SQL] JDBC connector supports pushdown TABLESAMPLE SYSTEM*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: V2 JDBC 
> TABLESAMPLE with withReplacement=true is no longer pushed down (correctness 
> fix; pushdown default-on), so .sample(withReplacement=true) on JDBC tables 
> now returns different (correct) results.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the JDBC connector 
> no longer pushes down {{TABLESAMPLE}} when {{withReplacement=true}} (a 
> correctness fix, as no mainstream RDBMS supports sampling with replacement), 
> and adds {{TABLESAMPLE SYSTEM}} pushdown for PostgreSQL. Results of 
> sample-with-replacement on JDBC tables change accordingly.
>  * *SPARK-56031[SQL] Make Natural Join column matching respect case 
> sensitivity conf*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: NATURAL JOIN now 
> respects spark.sql.caseSensitive (default false), so it joins on case-differing 
> common columns instead of degrading to CROSS JOIN, changing results; set 
> spark.sql.caseSensitive=true to match case-sensitively.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, NATURAL JOIN 
> respects {{spark.sql.caseSensitive}} (default {{false}}), so common 
> columns that differ only in case are joined instead of degrading to a CROSS 
> JOIN, changing results. To match columns case-sensitively, set 
> {{spark.sql.caseSensitive}} to {{true}}.
>  * *SPARK-31561[SQL] Add QUALIFY Clause*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: QUALIFY is now a 
> (non-reserved) clause keyword, so a query using unquoted QUALIFY as a 
> trailing table alias (FROM t QUALIFY) now parses as a QUALIFY clause; quote 
> the identifier to restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the {{QUALIFY}} 
> clause is supported, and {{QUALIFY}} becomes a (non-reserved) clause keyword. 
> A query using unquoted {{QUALIFY}} as a trailing table alias (e.g. {{FROM t 
> QUALIFY}}) is now parsed as a {{QUALIFY}} clause. To restore the previous 
> behavior, quote the alias (e.g. {{`QUALIFY`}}).
>  * *SPARK-57188[SQL] Parameterless function takes precedence over UDF 
> parameter*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: a parameterless 
> built-in (current_user, current_date, etc.) now takes precedence over a 
> same-named SQL UDF parameter, changing UDF body results. Set 
> spark.sql.legacy.allowUdfParameterToShadowParameterlessFunction=true to 
> restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, a parameterless 
> built-in function ({{current_user}}, {{current_date}}, 
> {{session_user}}, etc.) takes precedence over a same-named SQL UDF 
> parameter in the function body. To restore the previous behavior, set 
> {{spark.sql.legacy.allowUdfParameterToShadowParameterlessFunction}} to 
> {{true}}.
>  * *SPARK-56045[SQL] Add flag for ignoring Parquet UNKNOWN type annotation 
> and revert to old behavior*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: reading Parquet 
> files with UNKNOWN logical-type annotation now infers physical type (e.g. 
> IntegerType) instead of NullType shipped in v4.1.0; opt back into NullType 
> via spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled=true.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, reading Parquet 
> files with the {{UNKNOWN}} logical-type annotation infers the physical type 
> (for example IntegerType) instead of the NullType used in 4.1.0. To restore 
> the 4.1.0 behavior of inferring NullType, set 
> {{spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled}} to 
> {{true}}.
>  * *SPARK-56414[SQL] Per-write options should take precedence over session 
> config in file source writes*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: per-write options 
> (e.g. parquet.outputTimestampType) now override the matching session SQLConf 
> in Parquet/Avro writes; previously such options were silently ignored, so 
> written file format can change when both are set.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, per-write options 
> take precedence over session config for several Parquet/Avro write keys (e.g. 
> {{spark.sql.parquet.outputTimestampType}}, 
> {{spark.sql.parquet.writeLegacyFormat}}). Previously such options were 
> silently ignored, so the written file format can change when both are set.
>  * *SPARK-56251[SQL] Add default fetchSize for postgres to avoid loading all 
> data in memory*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: Postgres JDBC 
> reads now default fetchSize to 1000 (was 0/all-in-memory), enabling cursor 
> fetch with autoCommit=false; changes default read behavior. Set fetchsize=0 
> to restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the PostgreSQL 
> JDBC dialect defaults the read {{fetchsize}} to {{1000}} (was {{0}}), 
> enabling cursor-based fetching with {{autoCommit=false}} to avoid loading the 
> whole table into memory. To restore the previous behavior, set the 
> {{fetchsize}} option to {{0}}.
>  * *SPARK-55155[SQL] Support foldable expressions in {{SET CATALOG}} 
> statement*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: SET CATALOG with 
> a bare name now resolves to a session variable of that name first (if one 
> exists) before treating it as a catalog name; there is no opt-out config. 
> Edge case but a default behavior change.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, SET CATALOG 
> accepts foldable expressions and a bare name is first resolved as a session 
> variable of that name (if one exists) before being treated as a catalog name. 
> There is no opt-out; a session variable that shadows a catalog name changes 
> which catalog is set.
>  * *SPARK-51518[SQL] Support | as an alternative to |> for the SQL pipe 
> operator token*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: single-char '|' 
> is now a SQL pipe-operator token by default, so a query using a pipe keyword 
> as a column name after bitwise-OR may parse differently; restore via 
> spark.sql.parser.singleCharacterPipeOperator.enabled=false.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the SQL parser 
> accepts single-character {{|}} as an alternative to {{|>}} for the pipe 
> operator token by default. A pipe keyword used as a column name after a 
> bitwise-OR {{|}} may now parse differently. To restore the previous behavior, set 
> {{spark.sql.parser.singleCharacterPipeOperator.enabled}} to {{false}}.
>  * *SPARK-52812[SQL] Make Spark Connect Catalog.createTable eager*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: Spark Connect 
> Catalog.createTable now executes eagerly instead of lazily, so the table is 
> created (and errors like already-exists surface) immediately at the call 
> rather than on a later action.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, Spark Connect 
> {{Catalog.createTable}} executes eagerly: the table is created (and errors 
> such as table-already-exists surface) immediately at the call rather than 
> lazily on a later action. Code relying on the previous lazy behavior is 
> affected.
>  * *SPARK-55198[SQL] spark-sql should skip comment line with leading 
> whitespaces*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: spark-sql CLI now 
> skips comment lines that have leading whitespace before -- (line.trim 
> startsWith --), matching Hive/beeline; such lines were previously sent as 
> SQL. CLI-only, no opt-out.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the spark-sql CLI 
> skips comment lines whose first non-whitespace characters are {{--}} 
> (i.e. {{line.trim}} starts with {{--}}), aligning with Hive and beeline. 
> Previously such leading-whitespace comment lines were sent as SQL. There is 
> no opt-out.
>  * *SPARK-49110[SQL] Simplify SubqueryAlias.metadataOutput to always 
> propagate metadata columns*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: SubqueryAlias now 
> always propagates metadata columns by default; queries that failed with 
> AnalysisException may now succeed and joins may newly raise ambiguous-column 
> errors. Restore via 
> spark.sql.analyzer.subqueryAliasAlwaysPropagateMetadataColumns=false.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, {{SubqueryAlias}} 
> always propagates metadata columns from its child, so some queries that 
> previously failed with AnalysisException now succeed and joins may raise new 
> ambiguous-column errors. To restore the legacy behavior, set 
> {{spark.sql.analyzer.subqueryAliasAlwaysPropagateMetadataColumns=false}}.
>  * *SPARK-56678[SQL] Use structured Catalog/Namespace/Table rows in DESCRIBE 
> TABLE EXTENDED for v2 tables and views*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: DESCRIBE TABLE 
> EXTENDED for v2 tables/views now emits structured 
> Catalog/Namespace/Database/Table rows instead of a single Name/Identifier 
> row; consumers parsing the output are affected.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, {{DESCRIBE TABLE 
> EXTENDED}} for v2 tables and views emits structured {{Catalog}}, 
> {{Namespace}}, {{Database}}, and {{Table}}/{{View}} rows 
> instead of a single {{Name}}/{{Identifier}} row. Consumers that parse 
> the command output may be affected.
>  * *SPARK-56654[SQL] Reject unpaired UTF-16 surrogates in Variant JSON 
> parsing*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: 
> parse_json/try_parse_json/from_json('variant') now reject unpaired UTF-16 
> surrogates (error/NULL) instead of substituting U+FFFD; previously-accepted 
> JSON now fails. Set spark.sql.variant.validateUnicodeInJsonParsing=false to 
> restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, 
> {{parse_json}}, {{try_parse_json}}, and {{from_json}} to variant 
> reject unpaired UTF-16 surrogates (raising an error or returning NULL) 
> instead of silently substituting U+FFFD. To restore the previous permissive 
> behavior, set {{spark.sql.variant.validateUnicodeInJsonParsing}} to 
> {{false}}.
>  * *SPARK-56554[SQL] Respect inferSchema option when parsing XML as variant*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: reading XML as 
> Variant now honors inferSchema=false (leaf/attribute text kept as strings, 
> not inferred boolean/long/decimal), changing results. Set 
> spark.sql.xml.variant.respectInferSchema=false to restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, reading XML as 
> Variant honors {{inferSchema=false}}, keeping leaf text and attribute 
> values as strings instead of inferring boolean/long/decimal. To restore the 
> previous behavior of always inferring types, set 
> {{spark.sql.xml.variant.respectInferSchema}} to {{false}}.
>  * *SPARK-54718[SQL] Preserve attributes names during CTE newInstance()*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: CTE newInstance() 
> now preserves attribute name casing by default, changing output column-name 
> casing for self-joins on a CTE with case-differing duplicate columns; restore 
> via spark.sql.legacy.cteDuplicateAttributeNames=true.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, CTE relation 
> references preserve attribute name casing when re-instantiated, so self-joins 
> on a CTE with case-differing duplicate columns keep the original column-name 
> casing in the output. To restore the previous behavior, set 
> {{spark.sql.legacy.cteDuplicateAttributeNames}} to {{true}}.
>  * *SPARK-56280[SS] normalize NaN and +/-0.0 in streaming dedupe node*
> *Component:* SS
> *Why no migration-guide note needed:* Should be documented: streaming 
> dropDuplicates on float/double keys now normalizes NaN and +/-0.0, so 
> differently-bit NaNs and signed zeros are treated as duplicates; dedup 
> results change with no opt-out.
> *Proposed migration-guide message:* [SS] Since Spark 4.2, streaming 
> {{dropDuplicates}}/{{dropDuplicatesWithinWatermark}} on float or 
> double key columns normalize NaN and signed zero, so differently-bit NaN 
> values and {{+0.0}}/{{-0.0}} are treated as duplicates. This changes 
> deduplication results for queries with floating-point keys; there is no 
> opt-out.
>  * *SPARK-55058[SS] Throw error on inconsistent checkpoint metadata*
> *Component:* SS
> *Why no migration-guide note needed:* Should be documented: restarting a 
> streaming query whose checkpoint has offset/commit logs but no metadata file 
> now fails with MISSING_METADATA_FILE by default instead of starting a new 
> query id; disable by setting 
> spark.sql.streaming.checkpoint.verifyMetadataExists.enabled=false.
> *Proposed migration-guide message:* [SS] Since Spark 4.2, restarting a 
> streaming query whose checkpoint has offset and commit logs but no metadata 
> file fails with {{STREAMING_CHECKPOINT_MISSING_METADATA_FILE}} instead of 
> silently generating a new query id. To restore the previous behavior, set 
> {{spark.sql.streaming.checkpoint.verifyMetadataExists.enabled}} to 
> {{false}}.
>  * *SPARK-56239[UI] Fix SQL tab DataTables: API default limit, date format, 
> and appId resolution*
> *Component:* UI
> *Why no migration-guide note needed:* Should be documented: the long-existing 
> /applications/{appId}/sql REST endpoint default length changed from 20 to 
> -1, so it now returns all SQL executions by default; clients relying on the 
> 20-row default get more rows.
> *Proposed migration-guide message:* [Core] Since Spark 4.2, the 
> {{/applications/{appId}/sql}} REST endpoint defaults the {{length}} 
> parameter to {{-1}} (was {{20}}), so it returns all SQL executions by 
> default. To restore the previous behavior, pass {{length=20}} (any {{length 
> <= 0}} returns all executions).
>  * *SPARK-55075[K8S] Track executor pod creation errors with 
> ExecutorFailureTracker*
> *Component:* K8S
> *Why no migration-guide note needed:* Should be documented: on K8s, executor 
> pod-creation failures are now caught, logged and counted by 
> ExecutorFailureTracker (continue until max failures) instead of being 
> rethrown immediately, changing default failure semantics for K8s deployments.
> *Proposed migration-guide message:* [Core] Since Spark 4.2, on Kubernetes, 
> executor pod-creation failures are caught, logged, and counted by 
> ExecutorFailureTracker (allocation continues until the max-failures 
> threshold) instead of being rethrown immediately. There is no opt-out.
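
The two floating-point normalization entries above (SPARK-54918 for the array set functions and SPARK-56280 for streaming dropDuplicates) both describe treating 0.0/-0.0 and differently-bit NaN values as equal. The plain-Python sketch below illustrates that idea only. It is not Spark's implementation, and the helper names (double_bits, normalize_double, distinct_doubles) are hypothetical, chosen for illustration.

```python
import math
import struct


def double_bits(x: float) -> int:
    """Return the raw IEEE 754 bit pattern of a double as an integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def normalize_double(x: float) -> float:
    """Map every NaN to one canonical NaN and -0.0 to +0.0 (illustrative only)."""
    if math.isnan(x):
        return float("nan")
    if x == 0.0:  # true for both +0.0 and -0.0
        return 0.0
    return x


def distinct_doubles(values, normalize=True):
    """De-duplicate doubles by bit pattern, optionally normalizing first."""
    seen = set()
    out = []
    for v in values:
        key = double_bits(normalize_double(v) if normalize else v)
        if key not in seen:
            seen.add(key)
            out.append(v)
    return out


# A quiet NaN whose payload differs from the default NaN (assumed sample data).
odd_nan = struct.unpack(">d", struct.pack(">Q", 0x7FF8000000000001))[0]
data = [0.0, -0.0, float("nan"), odd_nan, 1.5]

print(len(distinct_doubles(data, normalize=False)))  # bitwise comparison
print(len(distinct_doubles(data, normalize=True)))   # normalized comparison
```

With normalization on, the five sample values collapse to three (one zero, one NaN, and 1.5). With it off, signed zeros and NaN payloads remain distinct keys, which is the kind of result difference those two entries warn about.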



