[
https://issues.apache.org/jira/browse/SPARK-57452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-57452:
-----------------------------------
Labels: pull-request-available (was: )
> Auditing the migration guide
> ----------------------------
>
> Key: SPARK-57452
> URL: https://issues.apache.org/jira/browse/SPARK-57452
> Project: Spark
> Issue Type: Sub-task
> Components: Kubernetes, MLlib, PySpark, Spark Core, SQL, Structured
> Streaming, Web UI
> Affects Versions: 4.2.0
> Reporter: Xiao Li
> Priority: Blocker
> Labels: pull-request-available
>
> I used AI to analyze more than 1,900 commits from the Spark 4.2.0 release and
> identified 38 changes that appear to be missing from the migration guide.
> The JIRAs listed below were identified through this analysis. However, this
> may not be a complete list, so please also review the remaining commits for
> any additional migration guide updates that may be required.
>
> * *SPARK-55314[CONNECT] Propagate observed metrics errors to client*
> *Component:* CONNECT
> *Why no migration-guide note needed:* Should be documented: Observation.get
> (Scala and Connect) now raises the underlying exception when metric
> collection fails instead of returning an empty map; code that tolerated an
> empty result on failure now sees a thrown exception.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2,
> {{Observation.get}} (Scala and Spark Connect) raises the underlying exception
> when observed-metric collection fails instead of silently returning an empty
> map. Code that tolerated an empty result on failure now sees a thrown
> exception. There is no opt-out.
> * *SPARK-55655[MLLIB] Make {{CountVectorizer}} vocabulary deterministic when
> counts are equal*
> *Component:* MLLIB
> *Why no migration-guide note needed:* Should be documented: CountVectorizer
> now breaks ties between equal-count terms lexicographically, making the
> vocabulary deterministic; this can change vocabulary term order and feature
> indices versus prior (non-deterministic) output. No opt-out.
> *Proposed migration-guide message:* [MLlib] Since Spark 4.2,
> {{CountVectorizer}} breaks ties between equal-count terms lexicographically
> so the vocabulary is deterministic. This can change vocabulary term order and
> feature indices compared with prior (non-deterministic) output. There is no
> opt-out.
> * *SPARK-47997[PS] Add errors parameter to DataFrame.drop and Series.drop*
> *Component:* PS
> *Why no migration-guide note needed:* Should be documented: ps
> DataFrame.drop/Series.drop now raise KeyError if ANY label is missing
> (previously only if all missing), for all pandas versions; pass
> errors='ignore' to skip missing labels.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark
> DataFrame.drop/Series.drop add an {{errors}} parameter defaulting to
> {{'raise'}} and now raise KeyError if any requested label is missing
> (previously only if all were missing), across all pandas versions. To skip
> missing labels, pass {{errors='ignore'}}.
> * *SPARK-56219[PS] Align groupby idxmax and idxmin skipna=False behavior
> with pandas 2/3*
> *Component:* PS
> *Why no migration-guide note needed:* Should be documented: pandas-on-Spark
> groupby idxmax/idxmin with skipna=False now returns null for NA groups
> (pandas 2) or raises on NA inputs (pandas 3) instead of a label; no opt-out,
> results change for existing skipna=False users.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark
> {{GroupBy.idxmax}} and {{GroupBy.idxmin}} with {{skipna=False}} now follow
> pandas semantics for NA: with pandas 2 they return null for groups containing
> NA values, and with pandas 3 they raise on NA-containing inputs, instead of
> returning an index label. There is no opt-out.
> * *SPARK-55977[PS] Fix isin() to use strict type matching like pandas*
> *Component:* PS
> *Why no migration-guide note needed:* Should be documented: ps
> Series/DataFrame.isin() now uses strict Python-type matching (e.g. 1 no
> longer matches '1'), changing results for all pandas versions; this is
> pandas-parity, no opt-out config.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark
> Series/DataFrame {{isin()}} uses strict Python-type matching like pandas, so
> values of incompatible types no longer match (for example integer 1 no longer
> matches string '1'). Results change across all pandas versions. There is no
> opt-out.
> * *SPARK-54568[PYTHON] Avoid unnecessary pandas conversion in create
> dataframe from ndarray*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: createDataFrame
> from a numpy ndarray now requires pyarrow and converts ndarray to Arrow
> directly, dropping np.dtype-based StructType inference so inferred schema can
> differ.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, createDataFrame
> from a NumPy ndarray converts the array directly to an Arrow Table and now
> requires PyArrow. The previous np.dtype-based StructType inference is
> dropped, so the inferred schema may differ. To control the schema, pass an
> explicit schema.
> * *SPARK-56186[PYTHON] Retire pypy*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: PyPy is no longer
> officially supported in PySpark (CI, docker image, classifier, and
> PyPy-specific code removed); PyPy users should migrate to CPython.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, PyPy is no longer
> officially supported in PySpark: PyPy CI, the PyPy docker image, the setup.py
> classifier, and PyPy-specific code/test skips have been removed. PyPy users
> should migrate to CPython. There is no opt-out.
> * *SPARK-55096[PYTHON] Update pandas minimum version in {{connect/setup.py}}*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: minimum pandas
> raised to 2.2.0 for Spark Connect (was 2.0.0); pandas <2.2 is no longer
> supported on Connect.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, the minimum
> supported version for pandas on Spark Connect has been raised from 2.0.0 to
> 2.2.0, matching the minimum already required by PySpark.
> * *SPARK-54962[PYTHON] Fix nullable integers handling in Pandas UDF*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: Pandas UDFs on
> nullable integer columns containing nulls now use a nullable Int extension
> dtype instead of float64, so values/dtype inside the UDF change (fixing
> precision loss for large integers); no opt-out.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, Pandas UDFs on a
> nullable integer column that contains nulls receive a pandas nullable integer
> extension dtype (e.g. Int64) instead of float64, fixing precision loss for
> large integers. The dtype and values seen inside the UDF change accordingly.
> There is no opt-out configuration.
> * *SPARK-55583[PYTHON] Validate Arrow schema types in Python data source*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: a Python Data
> Source read returning a pa.RecordBatch whose Arrow types differ from the
> declared schema now fails with DATA_SOURCE_RETURN_SCHEMA_MISMATCH;
> type-mismatched batches that previously loaded by coincidence now error.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, a Python data
> source read that returns a {{pa.RecordBatch}} whose Arrow types differ from
> the declared schema now fails with
> {{DATA_SOURCE_RETURN_SCHEMA_MISMATCH}}. Type-mismatched batches that
> previously loaded by coincidence now error. There is no opt-out.
> * *SPARK-55416[SS][PYTHON] Streaming Python Data Source memory leak when
> end-offset is not updated*
> *Component:* SS,PYTHON
> *Why no migration-guide note needed:* Should be documented: a Streaming
> Python Data Source SimpleDataSourceStreamReader whose read() returns a
> non-empty batch with end==start now fails with
> SIMPLE_STREAM_READER_OFFSET_DID_NOT_ADVANCE instead of leaking memory;
> affects existing (buggy) reader implementations.
> *Proposed migration-guide message:* [SS] Since Spark 4.2, a streaming Python
> data source SimpleDataSourceStreamReader whose {{read()}} returns a non-empty
> batch with end offset equal to start now fails with
> {{SIMPLE_STREAM_READER_OFFSET_DID_NOT_ADVANCE}} instead of leaking driver
> memory. Empty batches with end == start are still allowed. There is no
> opt-out.
> * *SPARK-56206[SQL] Fix case-insensitive duplicate CTE name detection*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: duplicate CTE
> names differing only in case (e.g. WITH cte1, CTE1) now raise
> DUPLICATED_CTE_NAMES instead of silently overwriting; previously-accepted
> queries now fail. No opt-out.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, duplicate CTE name
> detection is case-insensitive. CTE definitions whose names differ only in
> case (e.g. {{WITH cte AS (...), CTE AS (...)}}) now raise
> {{DUPLICATED_CTE_NAMES}} instead of silently overwriting the earlier
> definition. There is no opt-out; rename the conflicting CTEs.
> * *SPARK-56652[SQL] Always emit RELY/NORELY in DESCRIBE EXTENDED constraint
> output*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: DESCRIBE EXTENDED
> now always prints RELY/NORELY for table constraints (previously omitted the
> default NORELY), changing the command output text for tools parsing it.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, {{DESCRIBE
> EXTENDED}} always emits the {{RELY}}/{{NORELY}} token for table
> constraints, including {{NORELY}} for the default state which was previously
> omitted. This matches {{SHOW CREATE TABLE}} output and changes the command's
> constraint output text for tools parsing it.
> * *SPARK-55019[SQL] Allow DROP TABLE to drop VIEW*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: DROP TABLE on a
> view now drops the view by default instead of raising
> WRONG_COMMAND_FOR_OBJECT_TYPE; restore via
> spark.sql.dropTableOnView.enabled=false.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, DROP TABLE on a
> view drops the view by default instead of raising
> {{WRONG_COMMAND_FOR_OBJECT_TYPE}}. To restore the previous behavior, set
> {{spark.sql.dropTableOnView.enabled}} to {{false}}.
> * *SPARK-54853[SQL] Always check {{hive.exec.max.dynamic.partitions}} on the
> spark side*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented:
> hive.exec.max.dynamic.partitions is now always enforced Spark-side and the
> session-level value is honored, changing when the limit error fires; error
> renamed to DYNAMIC_PARTITION_WRITE_PARTITION_NUM_LIMIT_EXCEEDED.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the
> {{hive.exec.max.dynamic.partitions}} limit for dynamic partition writes to
> Hive SerDe tables is always enforced on the Spark side and honors the
> session-level value, changing when the limit is checked. The error is now
> reported as {{DYNAMIC_PARTITION_WRITE_PARTITION_NUM_LIMIT_EXCEEDED}}.
> * *SPARK-55372[SQL] Fix {{SHOW CREATE TABLE}} for tables / views with
> default collation*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: typeName/toString
> of an explicitly UTF8_BINARY-collated StringType/CharType now render 'string
> collate UTF8_BINARY' not 'string' (default non-collated unchanged), changing
> SHOW CREATE TABLE and schema output for such columns.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, a
> StringType/CharType/VarcharType with an explicit {{UTF8_BINARY}} collation
> renders its collation in {{typeName}}/{{toString}} (for example
> {{string collate UTF8_BINARY}}), changing SHOW CREATE TABLE and schema
> output for such columns. Default non-collated strings are unchanged. No
> opt-out.
> * *SPARK-54918[SQL] Normalize floating numbers in array set operations*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented:
> array_distinct/union/intersect/except and arrays_overlap now normalize floats
> so 0.0/-0.0 and differently-bit NaNs are treated as equal, changing results
> of these array set operations. No opt-out.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the array set
> functions {{array_distinct}}, {{array_union}},
> {{array_intersect}}, {{array_except}}, and {{arrays_overlap}}
> normalize floating-point values, so {{0.0}} and {{-0.0}} and differently-bit
> NaN values are treated as equal. This changes the results of these functions;
> there is no opt-out.
> * *SPARK-54777[SQL] Changed dropTable error handling in
> JDBCTableCatalog.dropTable(...)*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: JDBC DROP TABLE
> now only swallows object-not-found errors; other failures (permission, etc.)
> propagate instead of silently returning, so a drop that previously appeared
> to succeed now throws.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the JDBC table
> catalog only swallows object-not-found errors when running DROP TABLE; other
> failures such as permission-denied or constraint violations now propagate
> instead of silently returning success. A DROP TABLE that previously appeared
> to succeed may now throw.
> * *SPARK-57040[SQL] JDBC connector supports pushdown TABLESAMPLE SYSTEM*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: V2 JDBC
> TABLESAMPLE with withReplacement=true is no longer pushed down (correctness
> fix; pushdown default-on), so .sample(withReplacement=true) on JDBC tables
> now returns different (correct) results.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the JDBC connector
> no longer pushes down {{TABLESAMPLE}} when {{withReplacement=true}} (a
> correctness fix, as no mainstream RDBMS supports sampling with replacement),
> and adds {{TABLESAMPLE SYSTEM}} pushdown for PostgreSQL. Results of
> sample-with-replacement on JDBC tables change accordingly.
> * *SPARK-56031[SQL] Make Natural Join column matching respect case
> sensitivity conf*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: NATURAL JOIN now
> respects spark.sql.caseSensitive (default false), so it joins on common
> columns that differ only in case instead of degrading to a CROSS JOIN,
> changing results; set spark.sql.caseSensitive=true to match case-sensitively.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, NATURAL JOIN
> respects {{spark.sql.caseSensitive}} (default {{false}}), so common
> columns that differ only in case are joined instead of degrading to a CROSS
> JOIN, changing results. To match columns case-sensitively, set
> {{spark.sql.caseSensitive}} to {{true}}.
> * *SPARK-31561[SQL] Add QUALIFY Clause*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: QUALIFY is now a
> (non-reserved) clause keyword, so a query using unquoted QUALIFY as a
> trailing table alias (FROM t QUALIFY) now parses as a QUALIFY clause; quote
> the identifier to restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the {{QUALIFY}}
> clause is supported, and {{QUALIFY}} becomes a (non-reserved) clause keyword.
> A query using unquoted {{QUALIFY}} as a trailing table alias (e.g. {{FROM t
> QUALIFY}}) is now parsed as a {{QUALIFY}} clause. To restore the previous
> behavior, quote the alias (e.g. {{`QUALIFY`}}).
> * *SPARK-57188[SQL] Parameterless function takes precedence over UDF
> parameter*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: a parameterless
> built-in (current_user, current_date, etc.) now takes precedence over a
> same-named SQL UDF parameter, changing UDF body results. Set
> spark.sql.legacy.allowUdfParameterToShadowParameterlessFunction=true to
> restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, a parameterless
> built-in function ({{current_user}}, {{current_date}},
> {{session_user}}, etc.) takes precedence over a same-named SQL UDF
> parameter in the function body. To restore the previous behavior, set
> {{spark.sql.legacy.allowUdfParameterToShadowParameterlessFunction}} to
> {{true}}.
> * *SPARK-56045[SQL] Add flag for ignoring Parquet UNKNOWN type annotation
> and revert to old behavior*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: reading Parquet
> files with UNKNOWN logical-type annotation now infers physical type (e.g.
> IntegerType) instead of NullType shipped in v4.1.0; opt back into NullType
> via spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled=true.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, reading Parquet
> files with the {{UNKNOWN}} logical-type annotation infers the physical type
> (for example IntegerType) instead of the NullType used in 4.1.0. To restore
> the 4.1.0 behavior of inferring NullType, set
> {{spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled}} to
> {{true}}.
> * *SPARK-56414[SQL] Per-write options should take precedence over session
> config in file source writes*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: per-write options
> (e.g. parquet.outputTimestampType) now override the matching session SQLConf
> in Parquet/Avro writes; previously such options were silently ignored, so
> written file format can change when both are set.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, per-write options
> take precedence over session config for several Parquet/Avro write keys (e.g.
> {{spark.sql.parquet.outputTimestampType}},
> {{spark.sql.parquet.writeLegacyFormat}}). Previously such options were
> silently ignored, so the written file format can change when both are set.
> * *SPARK-56251[SQL] Add default fetchSize for postgres to avoid loading all
> data in memory*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: Postgres JDBC
> reads now default fetchSize to 1000 (was 0/all-in-memory), enabling cursor
> fetch with autoCommit=false; changes default read behavior. Set fetchsize=0
> to restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the PostgreSQL
> JDBC dialect defaults the read {{fetchsize}} to {{1000}} (was {{0}}),
> enabling cursor-based fetching with {{autoCommit=false}} to avoid loading the
> whole table into memory. To restore the previous behavior, set the
> {{fetchsize}} option to {{0}}.
> * *SPARK-55155[SQL] Support foldable expressions in {{SET CATALOG}}
> statement*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: SET CATALOG with
> a bare name now resolves to a session variable of that name first (if one
> exists) before treating it as a catalog name; there is no opt-out config.
> Edge case but a default behavior change.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, SET CATALOG
> accepts foldable expressions and a bare name is first resolved as a session
> variable of that name (if one exists) before being treated as a catalog name.
> There is no opt-out; a session variable that shadows a catalog name changes
> which catalog is set.
> * *SPARK-51518[SQL] Support | as an alternative to |> for the SQL pipe
> operator token*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: single-char '|'
> is now a SQL pipe-operator token by default, so a query using a pipe keyword
> as a column name after bitwise-OR may be parsed differently; restore via
> spark.sql.parser.singleCharacterPipeOperator.enabled=false.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the SQL parser
> accepts single-character {{|}} as an alternative to {{|>}} for the pipe
> operator token by default. A pipe keyword used as a column name after a
> bitwise-OR {{|}} may now be parsed differently. To restore the previous
> behavior, set {{spark.sql.parser.singleCharacterPipeOperator.enabled}} to
> {{false}}.
> * *SPARK-52812[SQL] Make Spark Connect Catalog.createTable eager*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: Spark Connect
> Catalog.createTable now executes eagerly instead of lazily, so the table is
> created (and errors like already-exists surface) immediately at the call
> rather than on a later action.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, Spark Connect
> {{Catalog.createTable}} executes eagerly: the table is created (and errors
> such as table-already-exists surface) immediately at the call rather than
> lazily on a later action. Code relying on the previous lazy behavior is
> affected.
> * *SPARK-55198[SQL] spark-sql should skip comment line with leading
> whitespaces*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: spark-sql CLI now
> skips comment lines that have leading whitespace before -- (line.trim
> startsWith --), matching Hive/beeline; such lines were previously sent as
> SQL. CLI-only, no opt-out.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the spark-sql CLI
> skips comment lines whose first non-whitespace characters are {{--}}
> (i.e. {{line.trim}} starts with {{--}}), aligning with Hive and beeline.
> Previously such leading-whitespace comment lines were sent as SQL. There is
> no opt-out.
> * *SPARK-49110[SQL] Simplify SubqueryAlias.metadataOutput to always
> propagate metadata columns*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: SubqueryAlias now
> always propagates metadata columns by default; queries that failed with
> AnalysisException may now succeed and joins may newly raise ambiguous-column
> errors. Restore via subqueryAliasAlwaysPropagateMetadataColumns=false.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, {{SubqueryAlias}}
> always propagates metadata columns from its child, so some queries that
> previously failed with AnalysisException now succeed and joins may raise new
> ambiguous-column errors. To restore the legacy behavior, set
> {{spark.sql.analyzer.subqueryAliasAlwaysPropagateMetadataColumns=false}}.
> * *SPARK-56678[SQL] Use structured Catalog/Namespace/Table rows in DESCRIBE
> TABLE EXTENDED for v2 tables and views*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: DESCRIBE TABLE
> EXTENDED for v2 tables/views now emits structured
> Catalog/Namespace/Database/Table rows instead of a single Name/Identifier
> row; consumers parsing the output are affected.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, {{DESCRIBE TABLE
> EXTENDED}} for v2 tables and views emits structured {{Catalog}},
> {{Namespace}}, {{Database}}, and {{Table}}/{{View}} rows
> instead of a single {{Name}}/{{Identifier}} row. Consumers that parse
> the command output may be affected.
> * *SPARK-56654[SQL] Reject unpaired UTF-16 surrogates in Variant JSON
> parsing*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented:
> parse_json/try_parse_json/from_json('variant') now reject unpaired UTF-16
> surrogates (error/NULL) instead of substituting U+FFFD; previously-accepted
> JSON now fails. Set spark.sql.variant.validateUnicodeInJsonParsing=false to
> restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2,
> {{{}parse_json{}}}, {{{}try_parse_json{}}}, and {{from_json}} to variant
> reject unpaired UTF-16 surrogates (raising an error or returning NULL)
> instead of silently substituting U+FFFD. To restore the previous permissive
> behavior, set {{spark.sql.variant.validateUnicodeInJsonParsing}} to
> {{false}}.
> * *SPARK-56554[SQL] Respect inferSchema option when parsing XML as variant*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: reading XML as
> Variant now honors inferSchema=false (leaf/attribute text kept as strings,
> not inferred boolean/long/decimal), changing results. Set
> spark.sql.xml.variant.respectInferSchema=false to restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, reading XML as
> Variant honors {{inferSchema=false}}, keeping leaf text and attribute
> values as strings instead of inferring boolean/long/decimal. To restore the
> previous behavior of always inferring types, set
> {{spark.sql.xml.variant.respectInferSchema}} to {{false}}.
> * *SPARK-54718[SQL] Preserve attributes names during CTE newInstance()*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: CTE newInstance()
> now preserves attribute name casing by default, changing output column-name
> casing for self-joins on a CTE with case-differing duplicate columns; restore
> via spark.sql.legacy.cteDuplicateAttributeNames.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, CTE relation
> references preserve attribute name casing when re-instantiated, so self-joins
> on a CTE with case-differing duplicate columns keep the original column-name
> casing in the output. To restore the previous behavior, set
> {{spark.sql.legacy.cteDuplicateAttributeNames}} to {{true}}.
> * *SPARK-56280[SS] normalize NaN and +/-0.0 in streaming dedupe node*
> *Component:* SS
> *Why no migration-guide note needed:* Should be documented: streaming
> dropDuplicates on float/double keys now normalizes NaN and +/-0.0, so
> differently-bit NaNs and signed zeros are treated as duplicates; dedup
> results change with no opt-out.
> *Proposed migration-guide message:* [SS] Since Spark 4.2, streaming
> {{dropDuplicates}}/{{dropDuplicatesWithinWatermark}} on float or
> double key columns normalize NaN and signed zero, so differently-bit NaN
> values and {{+0.0}}/{{-0.0}} are treated as duplicates. This changes
> deduplication results for queries with floating-point keys; there is no
> opt-out.
> * *SPARK-55058[SS] Throw error on inconsistent checkpoint metadata*
> *Component:* SS
> *Why no migration-guide note needed:* Should be documented: restarting a
> streaming query whose checkpoint has offset/commit logs but no metadata file
> now fails with STREAMING_CHECKPOINT_MISSING_METADATA_FILE by default instead
> of starting a new query id; disable by setting
> spark.sql.streaming.checkpoint.verifyMetadataExists.enabled=false.
> *Proposed migration-guide message:* [SS] Since Spark 4.2, restarting a
> streaming query whose checkpoint has offset and commit logs but no metadata
> file fails with {{STREAMING_CHECKPOINT_MISSING_METADATA_FILE}} instead of
> silently generating a new query id. To restore the previous behavior, set
> {{spark.sql.streaming.checkpoint.verifyMetadataExists.enabled}} to
> {{false}}.
> * *SPARK-56239[UI] Fix SQL tab DataTables: API default limit, date format,
> and appId resolution*
> *Component:* UI
> *Why no migration-guide note needed:* Should be documented: the long-existing
> /applications/{appId}/sql REST endpoint default length changed from 20 to
> -1, so it now returns all SQL executions by default; clients relying on the
> 20-row default get more rows.
> *Proposed migration-guide message:* [Core] Since Spark 4.2, the
> {{/applications/{appId}/sql}} REST endpoint defaults the {{length}}
> parameter to {{-1}} (was {{20}}), so it returns all SQL executions by
> default. To restore the previous behavior, pass {{length=20}} (any {{length
> <= 0}} returns all executions).
> * *SPARK-55075[K8S] Track executor pod creation errors with
> ExecutorFailureTracker*
> *Component:* K8S
> *Why no migration-guide note needed:* Should be documented: on K8s, executor
> pod-creation failures are now caught, logged and counted by
> ExecutorFailureTracker (continue until max failures) instead of being
> rethrown immediately, changing default failure semantics for K8s deployments.
> *Proposed migration-guide message:* [Core] Since Spark 4.2, on Kubernetes
> executor pod-creation failures are caught, logged, and counted by
> ExecutorFailureTracker (allocation continues until the max-failures
> threshold) instead of being rethrown immediately. There is no opt-out.
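
The floating-point normalization described above for SPARK-54918 (array set functions) and SPARK-56280 (streaming deduplication) can be illustrated outside Spark. The following is a minimal plain-Python sketch, not Spark's implementation: the helper names (bits, normalize, dedupe_by_bits) are hypothetical, and it only shows why values that compare by raw bit pattern stop being distinct once NaN and signed zero are normalized.

```python
import math
import struct


def bits(x: float) -> int:
    """Return the raw 64-bit IEEE 754 pattern of a double as an integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def normalize(x: float) -> float:
    """Hypothetical helper: map every NaN to one NaN value and -0.0 to 0.0."""
    if math.isnan(x):
        return float("nan")
    if x == 0.0:  # true for both 0.0 and -0.0
        return 0.0
    return x


def dedupe_by_bits(values, normalizer=None):
    """Deduplicate doubles by bit pattern, optionally normalizing first."""
    seen = set()
    out = []
    for v in values:
        if normalizer is not None:
            v = normalizer(v)
        key = bits(v)
        if key not in seen:
            seen.add(key)
            out.append(v)
    return out


# Two quiet NaNs built from different bit patterns (assumed to survive the
# round trip through struct on common IEEE 754 platforms).
nan_a = struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000000))[0]
nan_b = struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000001))[0]

values = [0.0, -0.0, nan_a, nan_b]

# Without normalization, 0.0 and -0.0 (and possibly the two NaNs) stay distinct.
raw = dedupe_by_bits(values)

# With normalization, only one zero and one NaN remain.
normalized = dedupe_by_bits(values, normalizer=normalize)

print(len(raw), len(normalized))
```

Queries that depended on the un-normalized result (for example, counting 0.0 and -0.0 as two distinct keys) are the ones whose output changes in 4.2.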