[jira] [Updated] (SPARK-57452) Auditing the migration guide

Xiao Li (Jira) Sun, 14 Jun 2026 23:19:05 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-57452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Xiao Li updated SPARK-57452:
----------------------------
    Description: 
* *[SPARK-55314][CONNECT] Propagate observed metrics errors to client*
*Component:* CONNECT
*Why no migration-guide note needed:* Should be documented: Observation.get 
(Scala and Connect) now raises the underlying exception when metric collection 
fails instead of returning an empty map; code that tolerated an empty result on 
failure now sees a thrown exception.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{Observation.get}} 
(Scala and Spark Connect) raises the underlying exception when observed-metric 
collection fails instead of silently returning an empty map. Code that 
tolerated an empty result on failure now sees a thrown exception. There is no 
opt-out.
 * *[SPARK-55655][MLLIB] Make {{CountVectorizer}} vocabulary deterministic when 
counts are equal*
*Component:* MLLIB
*Why no migration-guide note needed:* Should be documented: CountVectorizer now 
breaks ties between equal-count terms lexicographically, making the vocabulary 
deterministic; this can change vocabulary term order and feature indices versus 
prior (non-deterministic) output. No opt-out.
*Proposed migration-guide message:* [Core] Since Spark 4.2, {{CountVectorizer}} 
breaks ties between equal-count terms lexicographically so the vocabulary is 
deterministic. This can change vocabulary term order and feature indices 
compared with prior (non-deterministic) output. There is no opt-out.
 * *[SPARK-47997][PS] Add errors parameter to DataFrame.drop and Series.drop*
*Component:* PS
*Why no migration-guide note needed:* Should be documented: ps 
DataFrame.drop/Series.drop now raise KeyError if ANY label is missing 
(previously only if all missing), for all pandas versions; pass errors='ignore' 
to skip missing labels.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark 
DataFrame.drop/Series.drop add an {{errors}} parameter defaulting to 
{{'raise'}} and now raise KeyError if any requested label is missing 
(previously only if all were missing), across all pandas versions. To skip 
missing labels, pass {{{}errors='ignore'{}}}.
 * *[SPARK-56219][PS] Align groupby idxmax and idxmin skipna=False behavior 
with pandas 2/3*
*Component:* PS
*Why no migration-guide note needed:* Should be documented: pandas-on-Spark 
groupby idxmax/idxmin with skipna=False now returns null for NA groups (pandas 
2) or raises on NA inputs (pandas 3) instead of a label; no opt-out, results 
change for existing skipna=False users.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark 
{{GroupBy.idxmax}} and {{GroupBy.idxmin}} with {{skipna=False}} now follow 
pandas semantics for NA: with pandas 2 they return null for groups containing 
NA values, and with pandas 3 they raise on NA-containing inputs, instead of 
returning an index label. There is no opt-out.
 * *[SPARK-55977][PS] Fix isin() to use strict type matching like pandas*
*Component:* PS
*Why no migration-guide note needed:* Should be documented: ps 
Series/DataFrame.isin() now uses strict Python-type matching (e.g. 1 no longer 
matches '1'), changing results for all pandas versions; this is pandas-parity, 
no opt-out config.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark 
Series/DataFrame {{isin()}} uses strict Python-type matching like pandas, so 
values of incompatible types no longer match (for example integer 1 no longer 
matches string '1'). Results change across all pandas versions. There is no 
opt-out.
 * *[SPARK-54568][PYTHON] Avoid unnecessary pandas conversion in create 
dataframe from ndarray*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: createDataFrame 
from a numpy ndarray now requires pyarrow and converts ndarray to Arrow 
directly, dropping np.dtype-based StructType inference so inferred schema can 
differ.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, createDataFrame 
from a NumPy ndarray converts the array directly to an Arrow Table and now 
requires PyArrow. The previous np.dtype-based StructType inference is dropped, 
so the inferred schema may differ. To control the schema, pass an explicit 
schema.
 * *[SPARK-56186][PYTHON] Retire pypy*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: PyPy is no longer 
officially supported in PySpark (CI, docker image, classifier, and 
PyPy-specific code removed); PyPy users should migrate to CPython.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, PyPy is no longer 
officially supported in PySpark: PyPy CI, the PyPy docker image, the setup.py 
classifier, and PyPy-specific code/test skips have been removed. PyPy users 
should migrate to CPython. There is no opt-out.
 * *[SPARK-55096][PYTHON] Update pandas minimum version in {{connect/setup.py}}*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: minimum pandas 
raised to 2.2.0 for Spark Connect (was 2.0.0); pandas <2.2 is no longer 
supported on Connect.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, the minimum 
supported version for pandas on Spark Connect has been raised from 2.0.0 to 
2.2.0, matching the minimum already required by PySpark.
 * *[SPARK-54962][PYTHON] Fix nullable integers handling in Pandas UDF*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: Pandas UDFs on 
nullable integer columns containing nulls now use a nullable Int extension 
dtype instead of float64, so values/dtype inside the UDF change (fixing 
precision loss for large integers); no opt-out.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, Pandas UDFs on a 
nullable integer column that contains nulls receive a pandas nullable integer 
extension dtype (e.g. Int64) instead of float64, fixing precision loss for 
large integers. The dtype and values seen inside the UDF change accordingly. 
There is no opt-out configuration.
 * *[SPARK-55583][PYTHON] Validate Arrow schema types in Python data source*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: a Python Data 
Source read returning a pa.RecordBatch whose Arrow types differ from the 
declared schema now fails with DATA_SOURCE_RETURN_SCHEMA_MISMATCH; 
type-mismatched batches that previously loaded by coincidence now error.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, a Python data 
source read that returns a {{pa.RecordBatch}} whose Arrow types differ from the 
declared schema now fails with {{{}DATA_SOURCE_RETURN_SCHEMA_MISMATCH{}}}. 
Type-mismatched batches that previously loaded by coincidence now error. There 
is no opt-out.
 * *[SPARK-55416][SS][PYTHON] Streaming Python Data Source memory leak when 
end-offset is not updated*
*Component:* SS,PYTHON
*Why no migration-guide note needed:* Should be documented: a Streaming Python 
Data Source SimpleDataSourceStreamReader whose read() returns a non-empty batch 
with end==start now fails with SIMPLE_STREAM_READER_OFFSET_DID_NOT_ADVANCE 
instead of leaking memory; affects existing (buggy) reader implementations.
*Proposed migration-guide message:* [SS] Since Spark 4.2, a streaming Python 
data source SimpleDataSourceStreamReader whose {{read()}} returns a non-empty 
batch with end offset equal to start now fails with 
{{SIMPLE_STREAM_READER_OFFSET_DID_NOT_ADVANCE}} instead of leaking driver 
memory. Empty batches with end == start are still allowed. There is no opt-out.
 * *[SPARK-56206][SQL] Fix case-insensitive duplicate CTE name detection*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: duplicate CTE names 
differing only in case (e.g. WITH cte1, CTE1) now raise DUPLICATED_CTE_NAMES 
instead of silently overwriting; previously-accepted queries now fail. No 
opt-out.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, duplicate CTE name 
detection is case-insensitive. CTE definitions whose names differ only in case 
(e.g. {{{}WITH cte AS (...), CTE AS (...){}}}) now raise 
{{DUPLICATED_CTE_NAMES}} instead of silently overwriting the earlier 
definition. There is no opt-out; rename the conflicting CTEs.
 * *[SPARK-56652][SQL] Always emit RELY/NORELY in DESCRIBE EXTENDED constraint 
output*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: DESCRIBE EXTENDED 
now always prints RELY/NORELY for table constraints (previously omitted the 
default NORELY), changing the command output text for tools parsing it.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{DESCRIBE 
EXTENDED}} always emits the {{{}RELY{}}}/{{{}NORELY{}}} token for table 
constraints, including {{NORELY}} for the default state which was previously 
omitted. This matches {{SHOW CREATE TABLE}} output and changes the command's 
constraint output text for tools parsing it.
 * *[SPARK-55019][SQL] Allow DROP TABLE to drop VIEW*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: DROP TABLE on a 
view now drops the view by default instead of raising 
WRONG_COMMAND_FOR_OBJECT_TYPE; restore via 
spark.sql.dropTableOnView.enabled=false.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, DROP TABLE on a view 
drops the view by default instead of raising 
{{{}WRONG_COMMAND_FOR_OBJECT_TYPE{}}}. To restore the previous behavior, set 
{{spark.sql.dropTableOnView.enabled}} to {{{}false{}}}.
 * *[SPARK-54853][SQL] Always check {{hive.exec.max.dynamic.partitions}} on the 
spark side*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: 
hive.exec.max.dynamic.partitions is now always enforced Spark-side and the 
session-level value is honored, changing when the limit error fires; error 
renamed to DYNAMIC_PARTITION_WRITE_PARTITION_NUM_LIMIT_EXCEEDED.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the 
{{hive.exec.max.dynamic.partitions}} limit for dynamic partition writes to Hive 
SerDe tables is always enforced on the Spark side and honors the session-level 
value, changing when the limit is checked. The error is now reported as 
{{{}DYNAMIC_PARTITION_WRITE_PARTITION_NUM_LIMIT_EXCEEDED{}}}.
 * *[SPARK-55372][SQL] Fix {{SHOW CREATE TABLE}} for tables / views with 
default collation*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: typeName/toString 
of an explicitly UTF8_BINARY-collated StringType/CharType now render 'string 
collate UTF8_BINARY' not 'string' (default non-collated unchanged), changing 
SHOW CREATE TABLE and schema output for such columns.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, a 
StringType/CharType/VarcharType with an explicit {{UTF8_BINARY}} collation 
renders its collation in {{{}typeName{}}}/{{{}toString{}}} (for example 
{{{}string collate UTF8_BINARY{}}}), changing SHOW CREATE TABLE and schema 
output for such columns. Default non-collated strings are unchanged. No opt-out.
 * *[SPARK-54918][SQL] Normalize floating numbers in array set operations*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: 
array_distinct/union/intersect/except and arrays_overlap now normalize floats 
so 0.0/-0.0 and differently-bit NaNs are treated as equal, changing results of 
these array set operations. No opt-out.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the array set 
functions {{{}array_distinct{}}}, {{{}array_union{}}}, {{{}array_intersect{}}}, 
{{{}array_except{}}}, and {{arrays_overlap}} normalize floating-point values, 
so {{0.0}} and {{-0.0}} and differently-bit NaN values are treated as equal. 
This changes the results of these functions; there is no opt-out.
 * *[SPARK-54777][SQL] Changed dropTable error handling in 
JDBCTableCatalog.dropTable(...)*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: JDBC DROP TABLE now 
only swallows object-not-found errors; other failures (permission, etc.) 
propagate instead of silently returning, so a drop that previously appeared to 
succeed now throws.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the JDBC table 
catalog only swallows object-not-found errors when running DROP TABLE; other 
failures such as permission-denied or constraint violations now propagate 
instead of silently returning success. A DROP TABLE that previously appeared to 
succeed may now throw.
 * *[SPARK-57040][SQL] JDBC connector supports pushdown TABLESAMPLE SYSTEM*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: V2 JDBC TABLESAMPLE 
with withReplacement=true is no longer pushed down (correctness fix; pushdown 
default-on), so .sample(withReplacement=true) on JDBC tables now returns 
different (correct) results.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the JDBC connector 
no longer pushes down {{TABLESAMPLE}} when {{withReplacement=true}} (a 
correctness fix, as no mainstream RDBMS supports sampling with replacement), 
and adds {{TABLESAMPLE SYSTEM}} pushdown for PostgreSQL. Results of 
sample-with-replacement on JDBC tables change accordingly.
 * *[SPARK-56031][SQL] Make Natural Join column matching respect case 
sensitivity conf*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: NATURAL JOIN now 
respects spark.sql.caseSensitive (default false), so it joins on common columns 
that differ only in case instead of degrading to CROSS JOIN, changing results; 
set spark.sql.caseSensitive=true to match case-sensitively.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, NATURAL JOIN 
respects {{spark.sql.caseSensitive}} (default {{{}false{}}}), so common columns 
that differ only in case are joined instead of degrading to a CROSS JOIN, 
changing results. To match columns case-sensitively, set 
{{spark.sql.caseSensitive}} to {{{}true{}}}.
 * *[SPARK-31561][SQL] Add QUALIFY Clause*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: QUALIFY is now a 
(non-reserved) clause keyword, so a query using unquoted QUALIFY as a trailing 
table alias (FROM t QUALIFY) now parses as a QUALIFY clause; quote the 
identifier to restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the {{QUALIFY}} 
clause is supported, and {{QUALIFY}} becomes a (non-reserved) clause keyword. A 
query using unquoted {{QUALIFY}} as a trailing table alias (e.g. {{{}FROM t 
QUALIFY{}}}) is now parsed as a {{QUALIFY}} clause. To restore the previous 
behavior, quote the alias (e.g. {{{}`QUALIFY`{}}}).
 * *[SPARK-57188][SQL] Parameterless function takes precedence over UDF 
parameter*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: a parameterless 
built-in (current_user, current_date, etc.) now takes precedence over a 
same-named SQL UDF parameter, changing UDF body results. Set 
spark.sql.legacy.allowUdfParameterToShadowParameterlessFunction=true to restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, a parameterless 
built-in function ({{{}current_user{}}}, {{{}current_date{}}}, 
{{{}session_user{}}}, etc.) takes precedence over a same-named SQL UDF 
parameter in the function body. To restore the previous behavior, set 
{{spark.sql.legacy.allowUdfParameterToShadowParameterlessFunction}} to 
{{{}true{}}}.
 * *[SPARK-56045][SQL] Add flag for ignoring Parquet UNKNOWN type annotation 
and revert to old behavior*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: reading Parquet 
files with UNKNOWN logical-type annotation now infers the physical type (e.g. 
IntegerType) instead of NullType (the behavior shipped in 4.1.0); opt back into 
NullType via spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled=true.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, reading Parquet 
files with the {{UNKNOWN}} logical-type annotation infers the physical type 
(for example IntegerType) instead of the NullType used in 4.1.0. To restore the 
4.1.0 behavior of inferring NullType, set 
{{spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled}} to 
{{{}true{}}}.
 * *[SPARK-56414][SQL] Per-write options should take precedence over session 
config in file source writes*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: per-write options 
(e.g. parquet.outputTimestampType) now override the matching session SQLConf in 
Parquet/Avro writes; previously such options were silently ignored, so written 
file format can change when both are set.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, per-write options 
take precedence over session config for several Parquet/Avro write keys (e.g. 
{{{}spark.sql.parquet.outputTimestampType{}}}, 
{{{}spark.sql.parquet.writeLegacyFormat{}}}). Previously such options were 
silently ignored, so the written file format can change when both are set.
 * *[SPARK-56251][SQL] Add default fetchSize for postgres to avoid loading all 
data in memory*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: Postgres JDBC reads 
now default fetchSize to 1000 (was 0/all-in-memory), enabling cursor fetch with 
autoCommit=false; changes default read behavior. Set fetchsize=0 to restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the PostgreSQL JDBC 
dialect defaults the read {{fetchsize}} to {{1000}} (was {{{}0{}}}), enabling 
cursor-based fetching with {{autoCommit=false}} to avoid loading the whole 
table into memory. To restore the previous behavior, set the {{fetchsize}} 
option to {{{}0{}}}.
 * *[SPARK-55155][SQL] Support foldable expressions in {{SET CATALOG}} 
statement*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: SET CATALOG with a 
bare name now resolves to a session variable of that name first (if one exists) 
before treating it as a catalog name; there is no opt-out config. Edge case but 
a default behavior change.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, SET CATALOG accepts 
foldable expressions and a bare name is first resolved as a session variable of 
that name (if one exists) before being treated as a catalog name. There is no 
opt-out; a session variable that shadows a catalog name changes which catalog 
is set.
 * *[SPARK-51518][SQL] Support | as an alternative to |> for the SQL pipe 
operator token*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: single-char '|' is 
now a SQL pipe-operator token by default, so a query using a pipe keyword as a 
column name after bitwise-OR may now parse differently (as a pipe operator 
rather than a bitwise OR); restore via 
spark.sql.parser.singleCharacterPipeOperator.enabled=false.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the SQL parser 
accepts single-character {{|}} as an alternative to {{|>}} for the pipe 
operator token by default. A query that uses a pipe keyword as a column name 
after a bitwise-OR {{|}} may now parse differently, as a pipe operator rather 
than a bitwise OR. To restore the previous behavior, set 
{{spark.sql.parser.singleCharacterPipeOperator.enabled}} to {{{}false{}}}.
 * *[SPARK-52812][SQL] Make Spark Connect Catalog.createTable eager*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: Spark Connect 
Catalog.createTable now executes eagerly instead of lazily, so the table is 
created (and errors like already-exists surface) immediately at the call rather 
than on a later action.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, Spark Connect 
{{Catalog.createTable}} executes eagerly: the table is created (and errors such 
as table-already-exists surface) immediately at the call rather than lazily on 
a later action. Code relying on the previous lazy behavior is affected.
 * *[SPARK-55198][SQL] spark-sql should skip comment line with leading 
whitespaces*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: spark-sql CLI now 
skips comment lines that have leading whitespace before -- (line.trim 
startsWith --), matching Hive/beeline; such lines were previously sent as SQL. 
CLI-only, no opt-out.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the spark-sql CLI 
skips comment lines whose first non-whitespace characters are {{--}} (i.e. 
{{line.trim}} starts with {{{}--{}}}), aligning with Hive and beeline. 
Previously such leading-whitespace comment lines were sent as SQL. There is no 
opt-out.
 * *[SPARK-49110][SQL] Simplify SubqueryAlias.metadataOutput to always 
propagate metadata columns*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: SubqueryAlias now 
always propagates metadata columns by default; queries that failed with 
AnalysisException may now succeed and joins may newly raise ambiguous-column 
errors. Restore via 
spark.sql.analyzer.subqueryAliasAlwaysPropagateMetadataColumns=false.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{SubqueryAlias}} 
always propagates metadata columns from its child, so some queries that 
previously failed with AnalysisException now succeed and joins may raise new 
ambiguous-column errors. To restore the legacy behavior, set 
{{{}spark.sql.analyzer.subqueryAliasAlwaysPropagateMetadataColumns=false{}}}.
 * *[SPARK-56678][SQL] Use structured Catalog/Namespace/Table rows in DESCRIBE 
TABLE EXTENDED for v2 tables and views*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: DESCRIBE TABLE 
EXTENDED for v2 tables/views now emits structured 
Catalog/Namespace/Database/Table rows instead of a single Name/Identifier row; 
consumers parsing the output are affected.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{DESCRIBE TABLE 
EXTENDED}} for v2 tables and views emits structured {{{}Catalog{}}}, 
{{{}Namespace{}}}, {{{}Database{}}}, and {{{}Table{}}}/{{{}View{}}} rows 
instead of a single {{{}Name{}}}/{{{}Identifier{}}} row. Consumers that parse 
the command output may be affected.
 * *[SPARK-56654][SQL] Reject unpaired UTF-16 surrogates in Variant JSON 
parsing*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: 
parse_json/try_parse_json/from_json('variant') now reject unpaired UTF-16 
surrogates (error/NULL) instead of substituting U+FFFD; previously-accepted 
JSON now fails. Set spark.sql.variant.validateUnicodeInJsonParsing=false to 
restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{{}parse_json{}}}, 
{{{}try_parse_json{}}}, and {{from_json}} to variant reject unpaired UTF-16 
surrogates (raising an error or returning NULL) instead of silently 
substituting U+FFFD. To restore the previous permissive behavior, set 
{{spark.sql.variant.validateUnicodeInJsonParsing}} to {{{}false{}}}.
 * *[SPARK-56554][SQL] Respect inferSchema option when parsing XML as variant*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: reading XML as 
Variant now honors inferSchema=false (leaf/attribute text kept as strings, not 
inferred boolean/long/decimal), changing results. Set 
spark.sql.xml.variant.respectInferSchema=false to restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, reading XML as 
Variant honors {{{}inferSchema=false{}}}, keeping leaf text and attribute 
values as strings instead of inferring boolean/long/decimal. To restore the 
previous behavior of always inferring types, set 
{{spark.sql.xml.variant.respectInferSchema}} to {{{}false{}}}.
 * *[SPARK-54718][SQL] Preserve attributes names during CTE newInstance()*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: CTE newInstance() 
now preserves attribute name casing by default, changing output column-name 
casing for self-joins on a CTE with case-differing duplicate columns; restore 
via spark.sql.legacy.cteDuplicateAttributeNames=true.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, CTE relation 
references preserve attribute name casing when re-instantiated, so self-joins 
on a CTE with case-differing duplicate columns keep the original column-name 
casing in the output. To restore the previous behavior, set 
{{spark.sql.legacy.cteDuplicateAttributeNames}} to {{{}true{}}}.
 * *[SPARK-56280][SS] normalize NaN and +/-0.0 in streaming dedupe node*
*Component:* SS
*Why no migration-guide note needed:* Should be documented: streaming 
dropDuplicates on float/double keys now normalizes NaN and +/-0.0, so 
differently-bit NaNs and signed zeros are treated as duplicates; dedup results 
change with no opt-out.
*Proposed migration-guide message:* [SS] Since Spark 4.2, streaming 
{{{}dropDuplicates{}}}/{{{}dropDuplicatesWithinWatermark{}}} on float or double 
key columns normalize NaN and signed zero, so differently-bit NaN values and 
{{{}+0.0{}}}/{{{}-0.0{}}} are treated as duplicates. This changes deduplication 
results for queries with floating-point keys; there is no opt-out.
 * *[SPARK-55058][SS] Throw error on inconsistent checkpoint metadata*
*Component:* SS
*Why no migration-guide note needed:* Should be documented: restarting a 
streaming query whose checkpoint has offset/commit logs but no metadata file 
now fails with STREAMING_CHECKPOINT_MISSING_METADATA_FILE by default instead of 
starting a new query id; disable by setting 
spark.sql.streaming.checkpoint.verifyMetadataExists.enabled=false.
*Proposed migration-guide message:* [SS] Since Spark 4.2, restarting a 
streaming query whose checkpoint has offset and commit logs but no metadata 
file fails with {{STREAMING_CHECKPOINT_MISSING_METADATA_FILE}} instead of 
silently generating a new query id. To restore the previous behavior, set 
{{spark.sql.streaming.checkpoint.verifyMetadataExists.enabled}} to 
{{{}false{}}}.
 * *[SPARK-56239][UI] Fix SQL tab DataTables: API default limit, date format, 
and appId resolution*
*Component:* UI
*Why no migration-guide note needed:* Should be documented: the long-existing 
/applications/\{appId}/sql REST endpoint default length changed from 20 to -1, 
so it now returns all SQL executions by default; clients relying on the 20-row 
default get more rows.
*Proposed migration-guide message:* [Core] Since Spark 4.2, the 
{{/applications/\{appId}/sql}} REST endpoint defaults the {{length}} parameter 
to {{-1}} (was {{{}20{}}}), so it returns all SQL executions by default. To 
restore the previous behavior, pass {{length=20}} (any {{length <= 0}} returns 
all executions).
 * *[SPARK-55075][K8S] Track executor pod creation errors with 
ExecutorFailureTracker*
*Component:* K8S
*Why no migration-guide note needed:* Should be documented: on K8s, executor 
pod-creation failures are now caught, logged and counted by 
ExecutorFailureTracker (continue until max failures) instead of being rethrown 
immediately, changing default failure semantics for K8s deployments.
*Proposed migration-guide message:* [Core] Since Spark 4.2, on Kubernetes 
executor pod-creation failures are caught, logged, and counted by 
ExecutorFailureTracker (allocation continues until the max-failures threshold) 
instead of being rethrown immediately. There is no opt-out.
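
For reviewers of the SPARK-54918 and SPARK-56280 notes, the effect of normalizing floating-point values can be sketched in plain Python. This is only an illustration of the idea, not Spark's implementation: {{double_bits}}, {{normalize}}, and {{distinct_by_bits}} are hypothetical helpers, and bit-pattern deduplication is assumed here purely as a stand-in for comparing values without normalization.

```python
import math
import struct


def double_bits(x: float) -> int:
    """Return the raw IEEE 754 bit pattern of a 64-bit float."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def normalize(x: float) -> float:
    """Hypothetical helper: collapse every NaN to one NaN and -0.0 to 0.0."""
    if math.isnan(x):
        return float("nan")
    if x == 0.0:  # true for both 0.0 and -0.0
        return 0.0
    return x


def distinct_by_bits(values, normalized):
    """Deduplicate floats by bit pattern, optionally normalizing first."""
    seen, out = set(), []
    for v in values:
        v = normalize(v) if normalized else v
        bits = double_bits(v)
        if bits not in seen:
            seen.add(bits)
            out.append(v)
    return out


# A NaN with a non-default payload, built from raw bits.
odd_nan = struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000001))[0]
values = [0.0, -0.0, float("nan"), odd_nan]

without_normalization = distinct_by_bits(values, normalized=False)
with_normalization = distinct_by_bits(values, normalized=True)
```

With normalization, the four inputs collapse to one zero and one NaN; without it, {{0.0}} and {{-0.0}} remain separate entries because their bit patterns differ.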

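The SPARK-55655 note (deterministic {{CountVectorizer}} vocabulary) can be illustrated with a small plain-Python sketch. It assumes the vocabulary is ordered by descending term count with ties broken in ascending lexicographic order; {{build_vocabulary}} is a hypothetical helper for illustration and does not reproduce the MLlib code.

```python
from collections import Counter


def build_vocabulary(docs):
    """Order terms by descending count, breaking ties lexicographically.

    Assumption: ties are broken in ascending order of the term string.
    """
    counts = Counter(term for doc in docs for term in doc)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ordered]


# "c" appears 3 times; "a" and "b" are tied at 2 occurrences each.
docs = [["b", "a", "c"], ["c", "b", "a"], ["c"]]
vocab = build_vocabulary(docs)

# Feature index of each term is its position in the vocabulary.
index = {term: i for i, term in enumerate(vocab)}
```

Because the tie between "a" and "b" is now resolved by name rather than by arrival order, the feature index assigned to each tied term is stable across runs, but it may differ from the index a previous fit happened to produce.
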
> Auditing the migration guide
> ----------------------------
>
>                 Key: SPARK-57452
>                 URL: https://issues.apache.org/jira/browse/SPARK-57452
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Kubernetes, MLlib, PySpark, Spark Core, SQL, Structured 
> Streaming, Web UI
>    Affects Versions: 4.2.0
>            Reporter: Xiao Li
>            Priority: Blocker
>
> * *[SPARK-55314][CONNECT] Propagate observed metrics errors to client*
> *Component:* CONNECT
> *Why no migration-guide note needed:* Should be documented: Observation.get 
> (Scala and Connect) now raises the underlying exception when metric 
> collection fails instead of returning an empty map; code that tolerated an 
> empty result on failure now sees a thrown exception.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, 
> {{Observation.get}} (Scala and Spark Connect) raises the underlying exception 
> when observed-metric collection fails instead of silently returning an empty 
> map. Code that tolerated an empty result on failure now sees a thrown 
> exception. There is no opt-out.
>  * *[SPARK-55655][MLLIB] Make {{CountVectorizer}} vocabulary deterministic 
> when counts are equal*
> *Component:* MLLIB
> *Why no migration-guide note needed:* Should be documented: CountVectorizer 
> now breaks ties between equal-count terms lexicographically, making the 
> vocabulary deterministic; this can change vocabulary term order and feature 
> indices versus prior (non-deterministic) output. No opt-out.
> *Proposed migration-guide message:* [Core] Since Spark 4.2, 
> {{CountVectorizer}} breaks ties between equal-count terms lexicographically 
> so the vocabulary is deterministic. This can change vocabulary term order and 
> feature indices compared with prior (non-deterministic) output. There is no 
> opt-out.
>  * *[SPARK-47997][PS] Add errors parameter to DataFrame.drop and Series.drop*
> *Component:* PS
> *Why no migration-guide note needed:* Should be documented: ps 
> DataFrame.drop/Series.drop now raise KeyError if ANY label is missing 
> (previously only if all missing), for all pandas versions; pass 
> errors='ignore' to skip missing labels.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark 
> DataFrame.drop/Series.drop add an {{errors}} parameter defaulting to 
> {{'raise'}} and now raise KeyError if any requested label is missing 
> (previously only if all were missing), across all pandas versions. To skip 
> missing labels, pass {{{}errors='ignore'{}}}.
>  * *[SPARK-56219][PS] Align groupby idxmax and idxmin skipna=False behavior 
> with pandas 2/3*
> *Component:* PS
> *Why no migration-guide note needed:* Should be documented: pandas-on-Spark 
> groupby idxmax/idxmin with skipna=False now returns null for NA groups 
> (pandas 2) or raises on NA inputs (pandas 3) instead of a label; no opt-out, 
> results change for existing skipna=False users.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark 
> {{GroupBy.idxmax}} and {{GroupBy.idxmin}} with {{skipna=False}} now follow 
> pandas semantics for NA: with pandas 2 they return null for groups containing 
> NA values, and with pandas 3 they raise on NA-containing inputs, instead of 
> returning an index label. There is no opt-out.
>  * *[SPARK-55977][PS] Fix isin() to use strict type matching like pandas*
> *Component:* PS
> *Why no migration-guide note needed:* Should be documented: ps 
> Series/DataFrame.isin() now uses strict Python-type matching (e.g. 1 no 
> longer matches '1'), changing results for all pandas versions; this is 
> pandas-parity, no opt-out config.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark 
> Series/DataFrame {{isin()}} uses strict Python-type matching like pandas, so 
> values of incompatible types no longer match (for example integer 1 no longer 
> matches string '1'). Results change across all pandas versions. There is no 
> opt-out.
>  * *[SPARK-54568][PYTHON] Avoid unnecessary pandas conversion in create 
> dataframe from ndarray*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: createDataFrame 
> from a numpy ndarray now requires pyarrow and converts ndarray to Arrow 
> directly, dropping np.dtype-based StructType inference so inferred schema can 
> differ.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, createDataFrame 
> from a NumPy ndarray converts the array directly to an Arrow Table and now 
> requires PyArrow. The previous np.dtype-based StructType inference is 
> dropped, so the inferred schema may differ. To control the schema, pass an 
> explicit schema.
>  * *[SPARK-56186][PYTHON] Retire pypy*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: PyPy is no longer 
> officially supported in PySpark (CI, docker image, classifier, and 
> PyPy-specific code removed); PyPy users should migrate to CPython.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, PyPy is no longer 
> officially supported in PySpark: PyPy CI, the PyPy docker image, the setup.py 
> classifier, and PyPy-specific code/test skips have been removed. PyPy users 
> should migrate to CPython. There is no opt-out.
>  * *[SPARK-55096][PYTHON] Update pandas minimum version in 
> {{connect/setup.py}}*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: minimum pandas 
> raised to 2.2.0 for Spark Connect (was 2.0.0); pandas <2.2 is no longer 
> supported on Connect.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, the minimum 
> supported pandas version for Spark Connect has been raised from 2.0.0 to 
> 2.2.0, matching the minimum already required by PySpark.
>  * *[SPARK-54962][PYTHON] Fix nullable integers handling in Pandas UDF*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: Pandas UDFs on 
> nullable integer columns containing nulls now use a nullable Int extension 
> dtype instead of float64, so values/dtype inside the UDF change (fixing 
> precision loss for large integers); no opt-out.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, Pandas UDFs on a 
> nullable integer column that contains nulls receive a pandas nullable integer 
> extension dtype (e.g. Int64) instead of float64, fixing precision loss for 
> large integers. The dtype and values seen inside the UDF change accordingly. 
> There is no opt-out configuration.
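The precision loss mentioned above can be reproduced without Spark or pandas, since a Python float is a float64. The following is a minimal sketch; the helper name is hypothetical and it only shows why a float64 column cannot carry large integers exactly.

```python
def survives_float64(n: int) -> bool:
    """Return True if the integer n is unchanged after a float64 round trip."""
    return int(float(n)) == n


exact = 2**53        # still exactly representable as float64
lossy = 2**53 + 1    # falls between two adjacent float64 values

print(survives_float64(exact))
print(survives_float64(lossy))
```

A UDF that previously received float64 values and checked for NaN may need to check for the pandas NA value instead once it receives a nullable integer dtype.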
>  * *[SPARK-55583][PYTHON] Validate Arrow schema types in Python data source*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: a Python Data 
> Source read returning a pa.RecordBatch whose Arrow types differ from the 
> declared schema now fails with DATA_SOURCE_RETURN_SCHEMA_MISMATCH; 
> type-mismatched batches that previously loaded by coincidence now error.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, a Python data 
> source read that returns a {{pa.RecordBatch}} whose Arrow types differ from 
> the declared schema now fails with 
> {{{}DATA_SOURCE_RETURN_SCHEMA_MISMATCH{}}}. Type-mismatched batches that 
> previously loaded by coincidence now error. There is no opt-out.
>  * *[SPARK-55416][SS][PYTHON] Streaming Python Data Source memory leak when 
> end-offset is not updated*
> *Component:* SS,PYTHON
> *Why no migration-guide note needed:* Should be documented: a Streaming 
> Python Data Source SimpleDataSourceStreamReader whose read() returns a 
> non-empty batch with end==start now fails with 
> STREAM_READER_OFFSET_DID_NOT_ADVANCE instead of leaking memory; affects 
> existing (buggy) reader impls.
> *Proposed migration-guide message:* [SS] Since Spark 4.2, a streaming Python 
> data source SimpleDataSourceStreamReader whose {{read()}} returns a non-empty 
> batch with end offset equal to start now fails with 
> {{SIMPLE_STREAM_READER_OFFSET_DID_NOT_ADVANCE}} instead of leaking driver 
> memory. Empty batches with end == start are still allowed. There is no 
> opt-out.
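The rule described above can be sketched as a small stand-alone check. This is not Spark's implementation: the function name is hypothetical, offsets are modelled as plain dicts, and {{ValueError}} stands in for the Spark error.

```python
def check_offset_advanced(start, end, rows):
    """Raise if a non-empty batch is returned without advancing the offset."""
    if rows and end == start:
        raise ValueError(
            f"read() returned {len(rows)} row(s) but the end offset "
            f"{end!r} equals the start offset"
        )
    return end


# An empty batch with end == start is still allowed.
print(check_offset_advanced({"offset": 5}, {"offset": 5}, []))

# A non-empty batch must move the offset forward.
print(check_offset_advanced({"offset": 5}, {"offset": 7}, [("a",), ("b",)]))
```

A reader that trips this check usually needs to return an end offset that reflects the rows it just produced.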
>  * *[SPARK-56206][SQL] Fix case-insensitive duplicate CTE name detection*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: duplicate CTE 
> names differing only in case (e.g. WITH cte1, CTE1) now raise 
> DUPLICATED_CTE_NAMES instead of silently overwriting; previously-accepted 
> queries now fail. No opt-out.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, duplicate CTE name 
> detection is case-insensitive. CTE definitions whose names differ only in 
> case (e.g. {{{}WITH cte AS (...), CTE AS (...){}}}) now raise 
> {{DUPLICATED_CTE_NAMES}} instead of silently overwriting the earlier 
> definition. There is no opt-out; rename the conflicting CTEs.
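To find queries affected by the CTE change above, a case-insensitive duplicate check along these lines may help. It is a minimal sketch: the function name is hypothetical, the CTE names are assumed to have been extracted already, and lower-casing is used here as a stand-in for Spark's own name comparison.

```python
def duplicated_cte_names(names):
    """Return names that repeat an earlier name when case is ignored."""
    seen = set()
    duplicates = []
    for name in names:
        key = name.lower()
        if key in seen:
            duplicates.append(name)
        seen.add(key)
    return duplicates


print(duplicated_cte_names(["cte", "CTE", "other"]))
```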
>  * *[SPARK-56652][SQL] Always emit RELY/NORELY in DESCRIBE EXTENDED 
> constraint output*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: DESCRIBE EXTENDED 
> now always prints RELY/NORELY for table constraints (previously omitted the 
> default NORELY), changing the command output text for tools parsing it.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, {{DESCRIBE 
> EXTENDED}} always emits the {{{}RELY{}}}/{{{}NORELY{}}} token for table 
> constraints, including {{NORELY}} for the default state which was previously 
> omitted. This matches {{SHOW CREATE TABLE}} output and changes the command's 
> constraint output text for tools parsing it.
>  * *[SPARK-55019][SQL] Allow DROP TABLE to drop VIEW*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: DROP TABLE on a 
> view now drops the view by default instead of raising 
> WRONG_COMMAND_FOR_OBJECT_TYPE; restore via 
> spark.sql.dropTableOnView.enabled=false.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, DROP TABLE on a 
> view drops the view by default instead of raising 
> {{{}WRONG_COMMAND_FOR_OBJECT_TYPE{}}}. To restore the previous behavior, set 
> {{spark.sql.dropTableOnView.enabled}} to {{{}false{}}}.
>  * *[SPARK-54853][SQL] Always check {{hive.exec.max.dynamic.partitions}} on 
> the spark side*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: 
> hive.exec.max.dynamic.partitions is now always enforced Spark-side and the 
> session-level value is honored, changing when the limit error fires; error 
> renamed to DYNAMIC_PARTITION_WRITE_PARTITION_NUM_LIMIT_EXCEEDED.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the 
> {{hive.exec.max.dynamic.partitions}} limit for dynamic partition writes to 
> Hive SerDe tables is always enforced on the Spark side and honors the 
> session-level value, changing when the limit is checked. The error is now 
> reported as {{{}DYNAMIC_PARTITION_WRITE_PARTITION_NUM_LIMIT_EXCEEDED{}}}.
>  * *[SPARK-55372][SQL] Fix {{SHOW CREATE TABLE}} for tables / views with 
> default collation*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: typeName/toString 
> of an explicitly UTF8_BINARY-collated StringType/CharType now render 'string 
> collate UTF8_BINARY' not 'string' (default non-collated unchanged), changing 
> SHOW CREATE TABLE and schema output for such columns.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, a 
> StringType/CharType/VarcharType with an explicit {{UTF8_BINARY}} collation 
> renders its collation in {{{}typeName{}}}/{{{}toString{}}} (for example 
> {{{}string collate UTF8_BINARY{}}}), changing SHOW CREATE TABLE and schema 
> output for such columns. Default non-collated strings are unchanged. No 
> opt-out.
>  * *[SPARK-54918][SQL] Normalize floating numbers in array set operations*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: 
> array_distinct/union/intersect/except and arrays_overlap now normalize floats 
> so 0.0/-0.0 and differently-bit NaNs are treated as equal, changing results 
> of these array set operations. No opt-out.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the array set 
> functions {{{}array_distinct{}}}, {{{}array_union{}}}, 
> {{{}array_intersect{}}}, {{{}array_except{}}}, and {{arrays_overlap}} 
> normalize floating-point values, so {{0.0}} and {{-0.0}} and differently-bit 
> NaN values are treated as equal. This changes the results of these functions; 
> there is no opt-out.
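The effect of float normalization described above can be sketched in plain Python. The helper names are hypothetical, and comparing by bit pattern is an assumption chosen to make the difference visible; it is not taken from Spark's code.

```python
import math
import struct


def bits(x: float) -> bytes:
    """Return the raw 8-byte IEEE 754 encoding of x."""
    return struct.pack(">d", x)


def normalize(x: float) -> float:
    """Map every NaN to one canonical NaN and -0.0 to 0.0."""
    if math.isnan(x):
        return math.nan
    if x == 0.0:
        return 0.0
    return x


def distinct_by_bits(values, normalized):
    """Keep the first value for each bit pattern, optionally normalizing first."""
    seen = set()
    out = []
    for v in values:
        key = bits(normalize(v)) if normalized else bits(v)
        if key not in seen:
            seen.add(key)
            out.append(v)
    return out


data = [0.0, -0.0, math.nan, 1.0]
print(len(distinct_by_bits(data, normalized=False)))  # signed zeros kept apart
print(len(distinct_by_bits(data, normalized=True)))   # signed zeros merged
```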
>  * *[SPARK-54777][SQL] Changed dropTable error handling in 
> JDBCTableCatalog.dropTable(...)*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: JDBC DROP TABLE 
> now only swallows object-not-found errors; other failures (permission, etc.) 
> propagate instead of silently returning, so a drop that previously appeared 
> to succeed now throws.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the JDBC table 
> catalog only swallows object-not-found errors when running DROP TABLE; other 
> failures such as permission-denied or constraint violations now propagate 
> instead of silently returning success. A DROP TABLE that previously appeared 
> to succeed may now throw.
>  * *[SPARK-57040][SQL] JDBC connector supports pushdown TABLESAMPLE SYSTEM*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: V2 JDBC 
> TABLESAMPLE with withReplacement=true is no longer pushed down (correctness 
> fix; pushdown default-on), so .sample(withReplacement=true) on JDBC tables 
> now returns different (correct) results.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the JDBC connector 
> no longer pushes down {{TABLESAMPLE}} when {{withReplacement=true}} (a 
> correctness fix, as no mainstream RDBMS supports sampling with replacement), 
> and adds {{TABLESAMPLE SYSTEM}} pushdown for PostgreSQL. Results of 
> sample-with-replacement on JDBC tables change accordingly.
>  * *[SPARK-56031][SQL] Make Natural Join column matching respect case 
> sensitivity conf*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: NATURAL JOIN now 
> respects spark.sql.caseSensitive (default false), so it joins on common 
> columns that differ only in case instead of degrading to a CROSS JOIN, 
> changing results; set spark.sql.caseSensitive=true to match case-sensitively.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, NATURAL JOIN 
> respects {{spark.sql.caseSensitive}} (default {{{}false{}}}), so common 
> columns that differ only in case are joined instead of degrading to a CROSS 
> JOIN, changing results. To match columns case-sensitively, set 
> {{spark.sql.caseSensitive}} to {{{}true{}}}.
>  * *[SPARK-31561][SQL] Add QUALIFY Clause*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: QUALIFY is now a 
> (non-reserved) clause keyword, so a query using unquoted QUALIFY as a 
> trailing table alias (FROM t QUALIFY) now parses as a QUALIFY clause; quote 
> the identifier to restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the {{QUALIFY}} 
> clause is supported, and {{QUALIFY}} becomes a (non-reserved) clause keyword. 
> A query using unquoted {{QUALIFY}} as a trailing table alias (e.g. {{{}FROM t 
> QUALIFY{}}}) is now parsed as a {{QUALIFY}} clause. To restore the previous 
> behavior, quote the alias (e.g. {{{}`QUALIFY`{}}}).
>  * *[SPARK-57188][SQL] Parameterless function takes precedence over UDF 
> parameter*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: a parameterless 
> built-in (current_user, current_date, etc.) now takes precedence over a 
> same-named SQL UDF parameter, changing UDF body results. Set 
> spark.sql.legacy.allowUdfParameterToShadowParameterlessFunction=true to 
> restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, a parameterless 
> built-in function ({{{}current_user{}}}, {{{}current_date{}}}, 
> {{{}session_user{}}}, etc.) takes precedence over a same-named SQL UDF 
> parameter in the function body. To restore the previous behavior, set 
> {{spark.sql.legacy.allowUdfParameterToShadowParameterlessFunction}} to 
> {{{}true{}}}.
>  * *[SPARK-56045][SQL] Add flag for ignoring Parquet UNKNOWN type annotation 
> and revert to old behavior*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: reading Parquet 
> files with UNKNOWN logical-type annotation now infers physical type (e.g. 
> IntegerType) instead of NullType shipped in v4.1.0; opt back into NullType 
> via spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled=true.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, reading Parquet 
> files with the {{UNKNOWN}} logical-type annotation infers the physical type 
> (for example IntegerType) instead of the NullType used in 4.1.0. To restore 
> the 4.1.0 behavior of inferring NullType, set 
> {{spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled}} to 
> {{{}true{}}}.
>  * *[SPARK-56414][SQL] Per-write options should take precedence over session 
> config in file source writes*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: per-write options 
> (e.g. parquet.outputTimestampType) now override the matching session SQLConf 
> in Parquet/Avro writes; previously such options were silently ignored, so 
> written file format can change when both are set.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, per-write options 
> take precedence over session config for several Parquet/Avro write keys (e.g. 
> {{{}spark.sql.parquet.outputTimestampType{}}}, 
> {{{}spark.sql.parquet.writeLegacyFormat{}}}). Previously such options were 
> silently ignored, so the written file format can change when both are set.
>  * *[SPARK-56251][SQL] Add default fetchSize for postgres to avoid loading 
> all data in memory*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: Postgres JDBC 
> reads now default fetchSize to 1000 (was 0/all-in-memory), enabling cursor 
> fetch with autoCommit=false; changes default read behavior. Set fetchsize=0 
> to restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the PostgreSQL 
> JDBC dialect defaults the read {{fetchsize}} to {{1000}} (was {{{}0{}}}), 
> enabling cursor-based fetching with {{autoCommit=false}} to avoid loading the 
> whole table into memory. To restore the previous behavior, set the 
> {{fetchsize}} option to {{{}0{}}}.
>  * *[SPARK-55155][SQL] Support foldable expressions in {{SET CATALOG}} 
> statement*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: SET CATALOG with 
> a bare name now resolves to a session variable of that name first (if one 
> exists) before treating it as a catalog name; there is no opt-out config. 
> Edge case but a default behavior change.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, SET CATALOG 
> accepts foldable expressions and a bare name is first resolved as a session 
> variable of that name (if one exists) before being treated as a catalog name. 
> There is no opt-out; a session variable that shadows a catalog name changes 
> which catalog is set.
>  * *[SPARK-51518][SQL] Support | as an alternative to |> for the SQL pipe 
> operator token*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: single-char '|' 
> is now a SQL pipe-operator token by default, so a query using a pipe keyword 
> as a column name after bitwise-OR may reparse; restore via 
> spark.sql.parser.singleCharacterPipeOperator.enabled=false.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the SQL parser 
> accepts single-character {{|}} as an alternative to {{|>}} for the pipe 
> operator token by default. A query that uses a pipe-operator keyword as a 
> column name after a bitwise-OR {{|}} may now be parsed differently. To 
> restore the previous behavior, set 
> {{spark.sql.parser.singleCharacterPipeOperator.enabled}} to {{{}false{}}}.
>  * *[SPARK-52812][SQL] Make Spark Connect Catalog.createTable eager*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: Spark Connect 
> Catalog.createTable now executes eagerly instead of lazily, so the table is 
> created (and errors like already-exists surface) immediately at the call 
> rather than on a later action.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, Spark Connect 
> {{Catalog.createTable}} executes eagerly: the table is created (and errors 
> such as table-already-exists surface) immediately at the call rather than 
> lazily on a later action. Code relying on the previous lazy behavior is 
> affected.
>  * *[SPARK-55198][SQL] spark-sql should skip comment line with leading 
> whitespaces*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: spark-sql CLI now 
> skips comment lines that have leading whitespace before -- (line.trim 
> startsWith --), matching Hive/beeline; such lines were previously sent as 
> SQL. CLI-only, no opt-out.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the spark-sql CLI 
> skips comment lines whose first non-whitespace characters are {{--}} (i.e. 
> {{line.trim}} starts with {{{}--{}}}), aligning with Hive and beeline. 
> Previously such leading-whitespace comment lines were sent as SQL. There is 
> no opt-out.
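The comment-line rule above translates to a one-line check. The sketch below is a Python analogue of the {{line.trim}} test. Both function names are hypothetical, and the "old" check is an assumption inferred from the statement that leading-whitespace comment lines used to be sent as SQL.

```python
def is_comment_old(line: str) -> bool:
    """Assumed old check: only lines beginning with -- count as comments."""
    return line.startswith("--")


def is_comment_new(line: str) -> bool:
    """New check: ignore surrounding whitespace before testing for --."""
    return line.strip().startswith("--")


indented = "    -- drop the staging table"
print(is_comment_old(indented))
print(is_comment_new(indented))
```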
>  * *[SPARK-49110][SQL] Simplify SubqueryAlias.metadataOutput to always 
> propagate metadata columns*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: SubqueryAlias now 
> always propagates metadata columns by default; queries that failed with 
> AnalysisException may now succeed and joins may newly raise ambiguous-column 
> errors. Restore via 
> spark.sql.analyzer.subqueryAliasAlwaysPropagateMetadataColumns=false.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, {{SubqueryAlias}} 
> always propagates metadata columns from its child, so some queries that 
> previously failed with AnalysisException now succeed and joins may raise new 
> ambiguous-column errors. To restore the legacy behavior, set 
> {{{}spark.sql.analyzer.subqueryAliasAlwaysPropagateMetadataColumns=false{}}}.
>  * *[SPARK-56678][SQL] Use structured Catalog/Namespace/Table rows in 
> DESCRIBE TABLE EXTENDED for v2 tables and views*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: DESCRIBE TABLE 
> EXTENDED for v2 tables/views now emits structured 
> Catalog/Namespace/Database/Table rows instead of a single Name/Identifier 
> row; consumers parsing the output are affected.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, {{DESCRIBE TABLE 
> EXTENDED}} for v2 tables and views emits structured {{{}Catalog{}}}, 
> {{{}Namespace{}}}, {{{}Database{}}}, and {{{}Table{}}}/{{{}View{}}} rows 
> instead of a single {{{}Name{}}}/{{{}Identifier{}}} row. Consumers that parse 
> the command output may be affected.
>  * *[SPARK-56654][SQL] Reject unpaired UTF-16 surrogates in Variant JSON 
> parsing*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: 
> parse_json/try_parse_json/from_json('variant') now reject unpaired UTF-16 
> surrogates (error/NULL) instead of substituting U+FFFD; previously-accepted 
> JSON now fails. Set spark.sql.variant.validateUnicodeInJsonParsing=false to 
> restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, 
> {{{}parse_json{}}}, {{{}try_parse_json{}}}, and {{from_json}} to variant 
> reject unpaired UTF-16 surrogates (raising an error or returning NULL) 
> instead of silently substituting U+FFFD. To restore the previous permissive 
> behavior, set {{spark.sql.variant.validateUnicodeInJsonParsing}} to 
> {{{}false{}}}.
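To check ahead of an upgrade whether existing JSON text contains the kind of input that is now rejected, something like the following could be used. It is a sketch in plain Python rather than Spark: the helper name is hypothetical, and it relies on the standard {{json}} module combining valid surrogate pairs into one code point while leaving a lone surrogate in place.

```python
import json


def has_unpaired_surrogate(text: str) -> bool:
    """Return True if text contains a surrogate code point left unpaired."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


paired = json.loads('"\\ud83d\\ude00"')  # a valid high/low surrogate pair
lonely = json.loads('"\\ud800"')         # a high surrogate with no partner

print(has_unpaired_surrogate(paired))
print(has_unpaired_surrogate(lonely))
```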
>  * *[SPARK-56554][SQL] Respect inferSchema option when parsing XML as variant*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: reading XML as 
> Variant now honors inferSchema=false (leaf/attribute text kept as strings, 
> not inferred boolean/long/decimal), changing results. Set 
> spark.sql.xml.variant.respectInferSchema=false to restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, reading XML as 
> Variant honors {{{}inferSchema=false{}}}, keeping leaf text and attribute 
> values as strings instead of inferring boolean/long/decimal. To restore the 
> previous behavior of always inferring types, set 
> {{spark.sql.xml.variant.respectInferSchema}} to {{{}false{}}}.
>  * *[SPARK-54718][SQL] Preserve attributes names during CTE newInstance()*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: CTE newInstance() 
> now preserves attribute name casing by default, changing output column-name 
> casing for self-joins on a CTE with case-differing duplicate columns; restore 
> via spark.sql.legacy.cteDuplicateAttributeNames.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, CTE relation 
> references preserve attribute name casing when re-instantiated, so self-joins 
> on a CTE with case-differing duplicate columns keep the original column-name 
> casing in the output. To restore the previous behavior, set 
> {{spark.sql.legacy.cteDuplicateAttributeNames}} to {{{}true{}}}.
>  * *[SPARK-56280][SS] normalize NaN and +/-0.0 in streaming dedupe node*
> *Component:* SS
> *Why no migration-guide note needed:* Should be documented: streaming 
> dropDuplicates on float/double keys now normalizes NaN and +/-0.0, so 
> differently-bit NaNs and signed zeros are treated as duplicates; dedup 
> results change with no opt-out.
> *Proposed migration-guide message:* [SS] Since Spark 4.2, streaming 
> {{{}dropDuplicates{}}}/{{{}dropDuplicatesWithinWatermark{}}} on float or 
> double key columns normalize NaN and signed zero, so differently-bit NaN 
> values and {{{}+0.0{}}}/{{{}-0.0{}}} are treated as duplicates. This changes 
> deduplication results for queries with floating-point keys; there is no 
> opt-out.
>  * *[SPARK-55058][SS] Throw error on inconsistent checkpoint metadata*
> *Component:* SS
> *Why no migration-guide note needed:* Should be documented: restarting a 
> streaming query whose checkpoint has offset/commit logs but no metadata file 
> now fails with STREAMING_CHECKPOINT_MISSING_METADATA_FILE by default instead 
> of starting a new query id; disable via 
> spark.sql.streaming.checkpoint.verifyMetadataExists.enabled=false.
> *Proposed migration-guide message:* [SS] Since Spark 4.2, restarting a 
> streaming query whose checkpoint has offset and commit logs but no metadata 
> file fails with {{STREAMING_CHECKPOINT_MISSING_METADATA_FILE}} instead of 
> silently generating a new query id. To restore the previous behavior, set 
> {{spark.sql.streaming.checkpoint.verifyMetadataExists.enabled}} to 
> {{{}false{}}}.
>  * *[SPARK-56239][UI] Fix SQL tab DataTables: API default limit, date format, 
> and appId resolution*
> *Component:* UI
> *Why no migration-guide note needed:* Should be documented: the long-existing 
> /applications/\{appId}/sql REST endpoint default length changed from 20 to 
> -1, so it now returns all SQL executions by default; clients relying on the 
> 20-row default get more rows.
> *Proposed migration-guide message:* [Core] Since Spark 4.2, the 
> {{/applications/\{appId}/sql}} REST endpoint defaults the {{length}} 
> parameter to {{-1}} (was {{{}20{}}}), so it returns all SQL executions by 
> default. To restore the previous behavior, pass {{length=20}} (any {{length 
> <= 0}} returns all executions).
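A client that wants to keep the 20-row page size can pass {{length}} explicitly, roughly as sketched below. The helper name, host, port, and application id are placeholders chosen for illustration, and the {{/api/v1}} prefix is assumed to be where the REST API is served.

```python
from urllib.parse import urlencode


def sql_executions_url(base_url: str, app_id: str, length: int = 20) -> str:
    """Build the SQL executions URL with an explicit length parameter."""
    query = urlencode({"length": length})
    return f"{base_url}/applications/{app_id}/sql?{query}"


# Placeholder host and application id.
print(sql_executions_url("http://localhost:4040/api/v1", "app-0001"))
```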
>  * *[SPARK-55075][K8S] Track executor pod creation errors with 
> ExecutorFailureTracker*
> *Component:* K8S
> *Why no migration-guide note needed:* Should be documented: on K8s, executor 
> pod-creation failures are now caught, logged and counted by 
> ExecutorFailureTracker (continue until max failures) instead of being 
> rethrown immediately, changing default failure semantics for K8s deployments.
> *Proposed migration-guide message:* [Core] Since Spark 4.2, on Kubernetes, 
> executor pod-creation failures are caught, logged, and counted by 
> ExecutorFailureTracker (allocation continues until the max-failures 
> threshold) instead of being rethrown immediately. There is no opt-out.
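Several entries above name a configuration that restores the previous behavior. As a convenience, the sketch below collects those keys and values exactly as quoted in the entries and renders them as {{--conf key=value}} arguments. The dictionary and function names are hypothetical, and the list covers only the settings quoted in this part of the description.

```python
# Opt-out settings as quoted in the entries above (key -> restoring value).
RESTORE_PREVIOUS_BEHAVIOR = {
    "spark.sql.dropTableOnView.enabled": "false",
    "spark.sql.legacy.allowUdfParameterToShadowParameterlessFunction": "true",
    "spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled": "true",
    "spark.sql.parser.singleCharacterPipeOperator.enabled": "false",
    "spark.sql.analyzer.subqueryAliasAlwaysPropagateMetadataColumns": "false",
    "spark.sql.variant.validateUnicodeInJsonParsing": "false",
    "spark.sql.xml.variant.respectInferSchema": "false",
    "spark.sql.legacy.cteDuplicateAttributeNames": "true",
    "spark.sql.streaming.checkpoint.verifyMetadataExists.enabled": "false",
}


def to_conf_args(settings):
    """Render settings as a flat list of --conf key=value arguments."""
    args = []
    for key, value in sorted(settings.items()):
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_conf_args(RESTORE_PREVIOUS_BEHAVIOR)))
```

Setting all of them at once is rarely what a deployment wants; picking only the entries that match an observed regression keeps the rest of the new defaults in effect.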



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
