[jira] [Updated] (SPARK-57452) Auditing the migration guide

Xiao Li (Jira) Sun, 14 Jun 2026 23:26:07 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-57452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Xiao Li updated SPARK-57452:
----------------------------
    Description: 
I used AI to analyze more than 1,900 commits from the Spark 4.2.0 release and 
identified 38 changes that appear to be missing from the migration guide.

The JIRAs listed below were identified through this analysis. However, this may 
not be a complete list, so please also review the remaining commits for any 
additional migration guide updates that may be required.

 
 * *SPARK-55314[CONNECT] Propagate observed metrics errors to client*
*Component:* CONNECT
*Why no migration-guide note needed:* Should be documented: Observation.get 
(Scala and Connect) now raises the underlying exception when metric collection 
fails instead of returning an empty map; code that tolerated an empty result on 
failure now sees a thrown exception.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{Observation.get}} 
(Scala and Spark Connect) raises the underlying exception when observed-metric 
collection fails instead of silently returning an empty map. Code that 
tolerated an empty result on failure now sees a thrown exception. There is no 
opt-out.
 * *SPARK-55655[MLLIB] Make {{CountVectorizer}} vocabulary deterministic when 
counts are equal*
*Component:* MLLIB
*Why no migration-guide note needed:* Should be documented: CountVectorizer now 
breaks ties between equal-count terms lexicographically, making the vocabulary 
deterministic; this can change vocabulary term order and feature indices versus 
prior (non-deterministic) output. No opt-out.
*Proposed migration-guide message:* [Core] Since Spark 4.2, {{CountVectorizer}} 
breaks ties between equal-count terms lexicographically so the vocabulary is 
deterministic. This can change vocabulary term order and feature indices 
compared with prior (non-deterministic) output. There is no opt-out.
 * *SPARK-47997[PS] Add errors parameter to DataFrame.drop and Series.drop*
*Component:* PS
*Why no migration-guide note needed:* Should be documented: ps 
DataFrame.drop/Series.drop now raise KeyError if ANY label is missing 
(previously only if all missing), for all pandas versions; pass errors='ignore' 
to skip missing labels.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark 
DataFrame.drop/Series.drop add an {{errors}} parameter defaulting to 
{{'raise'}} and now raise KeyError if any requested label is missing 
(previously only if all were missing), across all pandas versions. To skip 
missing labels, pass {{{}errors='ignore'{}}}.
 * *SPARK-56219[PS] Align groupby idxmax and idxmin skipna=False behavior with 
pandas 2/3*
*Component:* PS
*Why no migration-guide note needed:* Should be documented: pandas-on-Spark 
groupby idxmax/idxmin with skipna=False now returns null for NA groups (pandas 
2) or raises on NA inputs (pandas 3) instead of a label; no opt-out, results 
change for existing skipna=False users.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark 
{{GroupBy.idxmax}} and {{GroupBy.idxmin}} with {{skipna=False}} now follow 
pandas semantics for NA: with pandas 2 they return null for groups containing 
NA values, and with pandas 3 they raise on NA-containing inputs, instead of 
returning an index label. There is no opt-out.
 * *SPARK-55977[PS] Fix isin() to use strict type matching like pandas*
*Component:* PS
*Why no migration-guide note needed:* Should be documented: ps 
Series/DataFrame.isin() now uses strict Python-type matching (e.g. 1 no longer 
matches '1'), changing results for all pandas versions; this is pandas-parity, 
no opt-out config.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark 
Series/DataFrame {{isin()}} uses strict Python-type matching like pandas, so 
values of incompatible types no longer match (for example integer 1 no longer 
matches string '1'). Results change across all pandas versions. There is no 
opt-out.
 * *SPARK-54568[PYTHON] Avoid unnecessary pandas conversion in create dataframe 
from ndarray*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: createDataFrame 
from a numpy ndarray now requires pyarrow and converts ndarray to Arrow 
directly, dropping np.dtype-based StructType inference so inferred schema can 
differ.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, createDataFrame 
from a NumPy ndarray converts the array directly to an Arrow Table and now 
requires PyArrow. The previous np.dtype-based StructType inference is dropped, 
so the inferred schema may differ. To control the schema, pass an explicit 
schema.
 * *SPARK-56186[PYTHON] Retire pypy*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: PyPy is no longer 
officially supported in PySpark (CI, docker image, classifier, and 
PyPy-specific code removed); PyPy users should migrate to CPython.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, PyPy is no longer 
officially supported in PySpark: PyPy CI, the PyPy docker image, the setup.py 
classifier, and PyPy-specific code/test skips have been removed. PyPy users 
should migrate to CPython. There is no opt-out.
 * *SPARK-55096[PYTHON] Update pandas minimum version in {{connect/setup.py}}*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: minimum pandas 
raised to 2.2.0 for Spark Connect (was 2.0.0); pandas <2.2 is no longer 
supported on Connect.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, the minimum 
supported version for pandas on Spark Connect has been raised from 2.0.0 to 
2.2.0, matching the minimum already required by PySpark.
 * *SPARK-54962[PYTHON] Fix nullable integers handling in Pandas UDF*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: Pandas UDFs on 
nullable integer columns containing nulls now use a nullable Int extension 
dtype instead of float64, so values/dtype inside the UDF change (fixing 
precision loss for large integers); no opt-out.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, Pandas UDFs on a 
nullable integer column that contains nulls receive a pandas nullable integer 
extension dtype (e.g. Int64) instead of float64, fixing precision loss for 
large integers. The dtype and values seen inside the UDF change accordingly. 
There is no opt-out configuration.
 * *SPARK-55583[PYTHON] Validate Arrow schema types in Python data source*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: a Python Data 
Source read returning a pa.RecordBatch whose Arrow types differ from the 
declared schema now fails with DATA_SOURCE_RETURN_SCHEMA_MISMATCH; 
type-mismatched batches that previously loaded by coincidence now error.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, a Python data 
source read that returns a {{pa.RecordBatch}} whose Arrow types differ from the 
declared schema now fails with {{{}DATA_SOURCE_RETURN_SCHEMA_MISMATCH{}}}. 
Type-mismatched batches that previously loaded by coincidence now error. There 
is no opt-out.
 * *SPARK-55416[SS][PYTHON] Streaming Python Data Source memory leak when 
end-offset is not updated*
*Component:* SS,PYTHON
*Why no migration-guide note needed:* Should be documented: a Streaming Python 
Data Source SimpleDataSourceStreamReader whose read() returns a non-empty batch 
with end==start now fails with STREAM_READER_OFFSET_DID_NOT_ADVANCE instead of 
leaking memory; affects existing (buggy) reader impls.
*Proposed migration-guide message:* [SS] Since Spark 4.2, a streaming Python 
data source SimpleDataSourceStreamReader whose {{read()}} returns a non-empty 
batch with end offset equal to start now fails with 
{{SIMPLE_STREAM_READER_OFFSET_DID_NOT_ADVANCE}} instead of leaking driver 
memory. Empty batches with end == start are still allowed. There is no opt-out.
 * *SPARK-56206[SQL] Fix case-insensitive duplicate CTE name detection*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: duplicate CTE names 
differing only in case (e.g. WITH cte1, CTE1) now raise DUPLICATED_CTE_NAMES 
instead of silently overwriting; previously-accepted queries now fail. No 
opt-out.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, duplicate CTE name 
detection is case-insensitive. CTE definitions whose names differ only in case 
(e.g. {{{}WITH cte AS (...), CTE AS (...){}}}) now raise 
{{DUPLICATED_CTE_NAMES}} instead of silently overwriting the earlier 
definition. There is no opt-out; rename the conflicting CTEs.
 * *SPARK-56652[SQL] Always emit RELY/NORELY in DESCRIBE EXTENDED constraint 
output*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: DESCRIBE EXTENDED 
now always prints RELY/NORELY for table constraints (previously omitted the 
default NORELY), changing the command output text for tools parsing it.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{DESCRIBE 
EXTENDED}} always emits the {{{}RELY{}}}/{{{}NORELY{}}} token for table 
constraints, including {{NORELY}} for the default state which was previously 
omitted. This matches {{SHOW CREATE TABLE}} output and changes the command's 
constraint output text for tools parsing it.
 * *SPARK-55019[SQL] Allow DROP TABLE to drop VIEW*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: DROP TABLE on a 
view now drops the view by default instead of raising 
WRONG_COMMAND_FOR_OBJECT_TYPE; restore via 
spark.sql.dropTableOnView.enabled=false.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, DROP TABLE on a view 
drops the view by default instead of raising 
{{{}WRONG_COMMAND_FOR_OBJECT_TYPE{}}}. To restore the previous behavior, set 
{{spark.sql.dropTableOnView.enabled}} to {{{}false{}}}.
 * *SPARK-54853[SQL] Always check {{hive.exec.max.dynamic.partitions}} on the 
spark side*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: 
hive.exec.max.dynamic.partitions is now always enforced Spark-side and the 
session-level value is honored, changing when the limit error fires; error 
renamed to DYNAMIC_PARTITION_WRITE_PARTITION_NUM_LIMIT_EXCEEDED.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the 
{{hive.exec.max.dynamic.partitions}} limit for dynamic partition writes to Hive 
SerDe tables is always enforced on the Spark side and honors the session-level 
value, changing when the limit is checked. The error is now reported as 
{{{}DYNAMIC_PARTITION_WRITE_PARTITION_NUM_LIMIT_EXCEEDED{}}}.
 * *SPARK-55372[SQL] Fix {{SHOW CREATE TABLE}} for tables / views with default 
collation*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: typeName/toString 
of an explicitly UTF8_BINARY-collated StringType/CharType now render 'string 
collate UTF8_BINARY' not 'string' (default non-collated unchanged), changing 
SHOW CREATE TABLE and schema output for such columns.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, a 
StringType/CharType/VarcharType with an explicit {{UTF8_BINARY}} collation 
renders its collation in {{{}typeName{}}}/{{{}toString{}}} (for example 
{{{}string collate UTF8_BINARY{}}}), changing SHOW CREATE TABLE and schema 
output for such columns. Default non-collated strings are unchanged. No opt-out.
 * *SPARK-54918[SQL] Normalize floating numbers in array set operations*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: 
array_distinct/union/intersect/except and arrays_overlap now normalize floats 
so 0.0/-0.0 and differently-bit NaNs are treated as equal, changing results of 
these array set operations. No opt-out.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the array set 
functions {{{}array_distinct{}}}, {{{}array_union{}}}, {{{}array_intersect{}}}, 
{{{}array_except{}}}, and {{arrays_overlap}} normalize floating-point values, 
so {{0.0}} and {{-0.0}} and differently-bit NaN values are treated as equal. 
This changes the results of these functions; there is no opt-out.
 * *SPARK-54777[SQL] Changed dropTable error handling in 
JDBCTableCatalog.dropTable(...)*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: JDBC DROP TABLE now 
only swallows object-not-found errors; other failures (permission, etc.) 
propagate instead of silently returning, so a drop that previously appeared to 
succeed now throws.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the JDBC table 
catalog only swallows object-not-found errors when running DROP TABLE; other 
failures such as permission-denied or constraint violations now propagate 
instead of silently returning success. A DROP TABLE that previously appeared to 
succeed may now throw.
 * *SPARK-57040[SQL] JDBC connector supports pushdown TABLESAMPLE SYSTEM*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: V2 JDBC TABLESAMPLE 
with withReplacement=true is no longer pushed down (correctness fix; pushdown 
default-on), so .sample(withReplacement=true) on JDBC tables now returns 
different (correct) results.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the JDBC connector 
no longer pushes down {{TABLESAMPLE}} when {{withReplacement=true}} (a 
correctness fix, as no mainstream RDBMS supports sampling with replacement), 
and adds {{TABLESAMPLE SYSTEM}} pushdown for PostgreSQL. Results of 
sample-with-replacement on JDBC tables change accordingly.
 * *SPARK-56031[SQL] Make Natural Join column matching respect case sensitivity 
conf*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: NATURAL JOIN now 
respects spark.sql.caseSensitive (default false), so joins on case-differing 
common columns instead of degrading to CROSS JOIN, changing results; set 
spark.sql.caseSensitive=true to match case-sensitively.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, NATURAL JOIN 
respects {{spark.sql.caseSensitive}} (default {{{}false{}}}), so common columns 
that differ only in case are joined instead of degrading to a CROSS JOIN, 
changing results. To match columns case-sensitively, set 
{{spark.sql.caseSensitive}} to {{{}true{}}}.
 * *SPARK-31561[SQL] Add QUALIFY Clause*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: QUALIFY is now a 
(non-reserved) clause keyword, so a query using unquoted QUALIFY as a trailing 
table alias (FROM t QUALIFY) now parses as a QUALIFY clause; quote the 
identifier to restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the {{QUALIFY}} 
clause is supported, and {{QUALIFY}} becomes a (non-reserved) clause keyword. A 
query using unquoted {{QUALIFY}} as a trailing table alias (e.g. {{{}FROM t 
QUALIFY{}}}) is now parsed as a {{QUALIFY}} clause. To restore the previous 
behavior, quote the alias (e.g. {{{}`QUALIFY`{}}}).
 * *SPARK-57188[SQL] Parameterless function takes precedence over UDF parameter*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: a parameterless 
built-in (current_user, current_date, etc.) now takes precedence over a 
same-named SQL UDF parameter, changing UDF body results. Set 
spark.sql.legacy.allowUdfParameterToShadowParameterlessFunction=true to restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, a parameterless 
built-in function ({{{}current_user{}}}, {{{}current_date{}}}, 
{{{}session_user{}}}, etc.) takes precedence over a same-named SQL UDF 
parameter in the function body. To restore the previous behavior, set 
{{spark.sql.legacy.allowUdfParameterToShadowParameterlessFunction}} to 
{{{}true{}}}.
 * *SPARK-56045[SQL] Add flag for ignoring Parquet UNKNOWN type annotation and 
revert to old behavior*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: reading Parquet 
files with UNKNOWN logical-type annotation now infers physical type (e.g. 
IntegerType) instead of NullType shipped in v4.1.0; opt back into NullType via 
spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled=true.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, reading Parquet 
files with the {{UNKNOWN}} logical-type annotation infers the physical type 
(for example IntegerType) instead of the NullType used in 4.1.0. To restore the 
4.1.0 behavior of inferring NullType, set 
{{spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled}} to 
{{{}true{}}}.
 * *SPARK-56414[SQL] Per-write options should take precedence over session 
config in file source writes*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: per-write options 
(e.g. parquet.outputTimestampType) now override the matching session SQLConf in 
Parquet/Avro writes; previously such options were silently ignored, so written 
file format can change when both are set.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, per-write options 
take precedence over session config for several Parquet/Avro write keys (e.g. 
{{{}spark.sql.parquet.outputTimestampType{}}}, 
{{{}spark.sql.parquet.writeLegacyFormat{}}}). Previously such options were 
silently ignored, so the written file format can change when both are set.
 * *SPARK-56251[SQL] Add default fetchSize for postgres to avoid loading all 
data in memory*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: Postgres JDBC reads 
now default fetchSize to 1000 (was 0/all-in-memory), enabling cursor fetch with 
autoCommit=false; changes default read behavior. Set fetchsize=0 to restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the PostgreSQL JDBC 
dialect defaults the read {{fetchsize}} to {{1000}} (was {{{}0{}}}), enabling 
cursor-based fetching with {{autoCommit=false}} to avoid loading the whole 
table into memory. To restore the previous behavior, set the {{fetchsize}} 
option to {{{}0{}}}.
 * *SPARK-55155[SQL] Support foldable expressions in {{SET CATALOG}} statement*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: SET CATALOG with a 
bare name now resolves to a session variable of that name first (if one exists) 
before treating it as a catalog name; there is no opt-out config. Edge case but 
a default behavior change.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, SET CATALOG accepts 
foldable expressions and a bare name is first resolved as a session variable of 
that name (if one exists) before being treated as a catalog name. There is no 
opt-out; a session variable that shadows a catalog name changes which catalog 
is set.
 * *SPARK-51518[SQL] Support | as an alternative to |> for the SQL pipe 
operator token*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: single-char '|' is 
now a SQL pipe-operator token by default, so a query using a pipe keyword as a 
column name after bitwise-OR may reparse; restore via 
spark.sql.parser.singleCharacterPipeOperator.enabled=false.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the SQL parser 
accepts single-character {{|}} as an alternative to {{|>}} for the pipe 
operator token by default. A pipe keyword used as a column name after a 
bitwise-OR {{|}} may now reparse. To restore the previous behavior, set 
{{spark.sql.parser.singleCharacterPipeOperator.enabled}} to {{{}false{}}}.
 * *SPARK-52812[SQL] Make Spark Connect Catalog.createTable eager*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: Spark Connect 
Catalog.createTable now executes eagerly instead of lazily, so the table is 
created (and errors like already-exists surface) immediately at the call rather 
than on a later action.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, Spark Connect 
{{Catalog.createTable}} executes eagerly: the table is created (and errors such 
as table-already-exists surface) immediately at the call rather than lazily on 
a later action. Code relying on the previous lazy behavior is affected.
 * *SPARK-55198[SQL] spark-sql should skip comment line with leading 
whitespaces*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: spark-sql CLI now 
skips comment lines that have leading whitespace before – (line.trim startsWith 
--), matching Hive/beeline; such lines were previously sent as SQL. CLI-only, 
no opt-out.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the spark-sql CLI 
skips comment lines whose first non-whitespace characters are {{-{-}{-}}} (i.e. 
{{line.trim}} starts with {{{}-{}}}), aligning with Hive and beeline. 
Previously such leading-whitespace comment lines were sent as SQL. There is no 
opt-out.
 * *SPARK-49110[SQL] Simplify SubqueryAlias.metadataOutput to always propagate 
metadata columns*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: SubqueryAlias now 
always propagates metadata columns by default; queries that failed with 
AnalysisException may now succeed and joins may newly raise ambiguous-column 
errors. Restore via subqueryAliasAlwaysPropagateMetadataColumns=false.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{SubqueryAlias}} 
always propagates metadata columns from its child, so some queries that 
previously failed with AnalysisException now succeed and joins may raise new 
ambiguous-column errors. To restore the legacy behavior, set 
{{{}spark.sql.analyzer.subqueryAliasAlwaysPropagateMetadataColumns=false{}}}.
 * *SPARK-56678[SQL] Use structured Catalog/Namespace/Table rows in DESCRIBE 
TABLE EXTENDED for v2 tables and views*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: DESCRIBE TABLE 
EXTENDED for v2 tables/views now emits structured 
Catalog/Namespace/Database/Table rows instead of a single Name/Identifier row; 
consumers parsing the output are affected.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{DESCRIBE TABLE 
EXTENDED}} for v2 tables and views emits structured {{{}Catalog{}}}, 
{{{}Namespace{}}}, {{{}Database{}}}, and {{{}Table{}}}/{{{}View{}}} rows 
instead of a single {{{}Name{}}}/{{{}Identifier{}}} row. Consumers that parse 
the command output may be affected.
 * *SPARK-56654[SQL] Reject unpaired UTF-16 surrogates in Variant JSON parsing*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: 
parse_json/try_parse_json/from_json('variant') now reject unpaired UTF-16 
surrogates (error/NULL) instead of substituting U+FFFD; previously-accepted 
JSON now fails. Set spark.sql.variant.validateUnicodeInJsonParsing=false to 
restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{{}parse_json{}}}, 
{{{}try_parse_json{}}}, and {{from_json}} to variant reject unpaired UTF-16 
surrogates (raising an error or returning NULL) instead of silently 
substituting U+FFFD. To restore the previous permissive behavior, set 
{{spark.sql.variant.validateUnicodeInJsonParsing}} to {{{}false{}}}.
 * *SPARK-56554[SQL] Respect inferSchema option when parsing XML as variant*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: reading XML as 
Variant now honors inferSchema=false (leaf/attribute text kept as strings, not 
inferred boolean/long/decimal), changing results. Set 
spark.sql.xml.variant.respectInferSchema=false to restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, reading XML as 
Variant honors {{{}inferSchema=false{}}}, keeping leaf text and attribute 
values as strings instead of inferring boolean/long/decimal. To restore the 
previous behavior of always inferring types, set 
{{spark.sql.xml.variant.respectInferSchema}} to {{{}false{}}}.
 * *SPARK-54718[SQL] Preserve attributes names during CTE newInstance()*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: CTE newInstance() 
now preserves attribute name casing by default, changing output column-name 
casing for self-joins on a CTE with case-differing duplicate columns; restore 
via spark.sql.legacy.cteDuplicateAttributeNames.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, CTE relation 
references preserve attribute name casing when re-instantiated, so self-joins 
on a CTE with case-differing duplicate columns keep the original column-name 
casing in the output. To restore the previous behavior, set 
{{spark.sql.legacy.cteDuplicateAttributeNames}} to {{{}true{}}}.
 * *SPARK-56280[SS] normalize NaN and +/-0.0 in streaming dedupe node*
*Component:* SS
*Why no migration-guide note needed:* Should be documented: streaming 
dropDuplicates on float/double keys now normalizes NaN and +/-0.0, so 
differently-bit NaNs and signed zeros are treated as duplicates; dedup results 
change with no opt-out.
*Proposed migration-guide message:* [SS] Since Spark 4.2, streaming 
{{{}dropDuplicates{}}}/{{{}dropDuplicatesWithinWatermark{}}} on float or double 
key columns normalize NaN and signed zero, so differently-bit NaN values and 
{{{}+0.0{}}}/{{{}-0.0{}}} are treated as duplicates. This changes deduplication 
results for queries with floating-point keys; there is no opt-out.
 * *SPARK-55058[SS] Throw error on inconsistent checkpoint metadata*
*Component:* SS
*Why no migration-guide note needed:* Should be documented: restarting a 
streaming query whose checkpoint has offset/commit logs but no metadata file 
now fails with MISSING_METADATA_FILE by default instead of starting a new query 
id; disable via the verifyMetadataExists.enabled config=false.
*Proposed migration-guide message:* [SS] Since Spark 4.2, restarting a 
streaming query whose checkpoint has offset and commit logs but no metadata 
file fails with {{STREAMING_CHECKPOINT_MISSING_METADATA_FILE}} instead of 
silently generating a new query id. To restore the previous behavior, set 
{{spark.sql.streaming.checkpoint.verifyMetadataExists.enabled}} to 
{{{}false{}}}.
 * *SPARK-56239[UI] Fix SQL tab DataTables: API default limit, date format, and 
appId resolution*
*Component:* UI
*Why no migration-guide note needed:* Should be documented: the long-existing 
/applications/\{appId}/sql REST endpoint default length changed from 20 to -1, 
so it now returns all SQL executions by default; clients relying on the 20-row 
default get more rows.
*Proposed migration-guide message:* [Core] Since Spark 4.2, the 
{{/applications/\{appId}/sql}} REST endpoint defaults the {{length}} parameter 
to {{-1}} (was {{{}20{}}}), so it returns all SQL executions by default. To 
restore the previous behavior, pass {{length=20}} (any {{length <= 0}} returns 
all executions).
 * *SPARK-55075[K8S] Track executor pod creation errors with 
ExecutorFailureTracker*
*Component:* K8S
*Why no migration-guide note needed:* Should be documented: on K8s, executor 
pod-creation failures are now caught, logged and counted by 
ExecutorFailureTracker (continue until max failures) instead of being rethrown 
immediately, changing default failure semantics for K8s deployments.
*Proposed migration-guide message:* [Core] Since Spark 4.2, on Kubernetes 
executor pod-creation failures are caught, logged, and counted by 
ExecutorFailureTracker (allocation continues until the max-failures threshold) 
instead of being rethrown immediately. There is no opt-out.

  was:
* *[SPARK-55314][CONNECT] Propagate observed metrics errors to client*
*Component:* CONNECT
*Why no migration-guide note needed:* Should be documented: Observation.get 
(Scala and Connect) now raises the underlying exception when metric collection 
fails instead of returning an empty map; code that tolerated an empty result on 
failure now sees a thrown exception.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{Observation.get}} 
(Scala and Spark Connect) raises the underlying exception when observed-metric 
collection fails instead of silently returning an empty map. Code that 
tolerated an empty result on failure now sees a thrown exception. There is no 
opt-out.
 * *[SPARK-55655][MLLIB] Make {{CountVectorizer}} vocabulary deterministic when 
counts are equal*
*Component:* MLLIB
*Why no migration-guide note needed:* Should be documented: CountVectorizer now 
breaks ties between equal-count terms lexicographically, making the vocabulary 
deterministic; this can change vocabulary term order and feature indices versus 
prior (non-deterministic) output. No opt-out.
*Proposed migration-guide message:* [Core] Since Spark 4.2, {{CountVectorizer}} 
breaks ties between equal-count terms lexicographically so the vocabulary is 
deterministic. This can change vocabulary term order and feature indices 
compared with prior (non-deterministic) output. There is no opt-out.
 * *[SPARK-47997][PS] Add errors parameter to DataFrame.drop and Series.drop*
*Component:* PS
*Why no migration-guide note needed:* Should be documented: ps 
DataFrame.drop/Series.drop now raise KeyError if ANY label is missing 
(previously only if all missing), for all pandas versions; pass errors='ignore' 
to skip missing labels.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark 
DataFrame.drop/Series.drop add an {{errors}} parameter defaulting to 
{{'raise'}} and now raise KeyError if any requested label is missing 
(previously only if all were missing), across all pandas versions. To skip 
missing labels, pass {{{}errors='ignore'{}}}.
 * *[SPARK-56219][PS] Align groupby idxmax and idxmin skipna=False behavior 
with pandas 2/3*
*Component:* PS
*Why no migration-guide note needed:* Should be documented: pandas-on-Spark 
groupby idxmax/idxmin with skipna=False now returns null for NA groups (pandas 
2) or raises on NA inputs (pandas 3) instead of a label; no opt-out, results 
change for existing skipna=False users.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark 
{{GroupBy.idxmax}} and {{GroupBy.idxmin}} with {{skipna=False}} now follow 
pandas semantics for NA: with pandas 2 they return null for groups containing 
NA values, and with pandas 3 they raise on NA-containing inputs, instead of 
returning an index label. There is no opt-out.
 * *[SPARK-55977][PS] Fix isin() to use strict type matching like pandas*
*Component:* PS
*Why no migration-guide note needed:* Should be documented: ps 
Series/DataFrame.isin() now uses strict Python-type matching (e.g. 1 no longer 
matches '1'), changing results for all pandas versions; this is pandas-parity, 
no opt-out config.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark 
Series/DataFrame {{isin()}} uses strict Python-type matching like pandas, so 
values of incompatible types no longer match (for example integer 1 no longer 
matches string '1'). Results change across all pandas versions. There is no 
opt-out.
 * *[SPARK-54568][PYTHON] Avoid unnecessary pandas conversion in create 
dataframe from ndarray*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: createDataFrame 
from a numpy ndarray now requires pyarrow and converts ndarray to Arrow 
directly, dropping np.dtype-based StructType inference so inferred schema can 
differ.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, createDataFrame 
from a NumPy ndarray converts the array directly to an Arrow Table and now 
requires PyArrow. The previous np.dtype-based StructType inference is dropped, 
so the inferred schema may differ. To control the schema, pass an explicit 
schema.
 * *[SPARK-56186][PYTHON] Retire pypy*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: PyPy is no longer 
officially supported in PySpark (CI, docker image, classifier, and 
PyPy-specific code removed); PyPy users should migrate to CPython.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, PyPy is no longer 
officially supported in PySpark: PyPy CI, the PyPy docker image, the setup.py 
classifier, and PyPy-specific code/test skips have been removed. PyPy users 
should migrate to CPython. There is no opt-out.
 * *[SPARK-55096][PYTHON] Update pandas minimum version in {{connect/setup.py}}*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: minimum pandas 
raised to 2.2.0 for Spark Connect (was 2.0.0); pandas <2.2 is no longer 
supported on Connect.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, the minimum 
supported version for pandas on Spark Connect has been raised from 2.0.0 to 
2.2.0, matching the minimum already required by PySpark.
 * *[SPARK-54962][PYTHON] Fix nullable integers handling in Pandas UDF*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: Pandas UDFs on 
nullable integer columns containing nulls now use a nullable Int extension 
dtype instead of float64, so values/dtype inside the UDF change (fixing 
precision loss for large integers); no opt-out.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, Pandas UDFs on a 
nullable integer column that contains nulls receive a pandas nullable integer 
extension dtype (e.g. Int64) instead of float64, fixing precision loss for 
large integers. The dtype and values seen inside the UDF change accordingly. 
There is no opt-out configuration.
 * *[SPARK-55583][PYTHON] Validate Arrow schema types in Python data source*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: a Python Data 
Source read returning a pa.RecordBatch whose Arrow types differ from the 
declared schema now fails with DATA_SOURCE_RETURN_SCHEMA_MISMATCH; 
type-mismatched batches that previously loaded by coincidence now error.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, a Python data 
source read that returns a {{pa.RecordBatch}} whose Arrow types differ from the 
declared schema now fails with {{{}DATA_SOURCE_RETURN_SCHEMA_MISMATCH{}}}. 
Type-mismatched batches that previously loaded by coincidence now error. There 
is no opt-out.
 * *[SPARK-55416][SS][PYTHON] Streaming Python Data Source memory leak when 
end-offset is not updated*
*Component:* SS,PYTHON
*Why no migration-guide note needed:* Should be documented: a Streaming Python 
Data Source SimpleDataSourceStreamReader whose read() returns a non-empty batch 
with end==start now fails with STREAM_READER_OFFSET_DID_NOT_ADVANCE instead of 
leaking memory; affects existing (buggy) reader impls.
*Proposed migration-guide message:* [SS] Since Spark 4.2, a streaming Python 
data source SimpleDataSourceStreamReader whose {{read()}} returns a non-empty 
batch with end offset equal to start now fails with 
{{SIMPLE_STREAM_READER_OFFSET_DID_NOT_ADVANCE}} instead of leaking driver 
memory. Empty batches with end == start are still allowed. There is no opt-out.
 * *[SPARK-56206][SQL] Fix case-insensitive duplicate CTE name detection*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: duplicate CTE names 
differing only in case (e.g. WITH cte1, CTE1) now raise DUPLICATED_CTE_NAMES 
instead of silently overwriting; previously-accepted queries now fail. No 
opt-out.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, duplicate CTE name 
detection is case-insensitive. CTE definitions whose names differ only in case 
(e.g. {{{}WITH cte AS (...), CTE AS (...){}}}) now raise 
{{DUPLICATED_CTE_NAMES}} instead of silently overwriting the earlier 
definition. There is no opt-out; rename the conflicting CTEs.
 * *[SPARK-56652][SQL] Always emit RELY/NORELY in DESCRIBE EXTENDED constraint 
output*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: DESCRIBE EXTENDED 
now always prints RELY/NORELY for table constraints (previously omitted the 
default NORELY), changing the command output text for tools parsing it.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{DESCRIBE 
EXTENDED}} always emits the {{{}RELY{}}}/{{{}NORELY{}}} token for table 
constraints, including {{NORELY}} for the default state which was previously 
omitted. This matches {{SHOW CREATE TABLE}} output and changes the command's 
constraint output text for tools parsing it.
 * *[SPARK-55019][SQL] Allow DROP TABLE to drop VIEW*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: DROP TABLE on a 
view now drops the view by default instead of raising 
WRONG_COMMAND_FOR_OBJECT_TYPE; restore via 
spark.sql.dropTableOnView.enabled=false.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, DROP TABLE on a view 
drops the view by default instead of raising 
{{{}WRONG_COMMAND_FOR_OBJECT_TYPE{}}}. To restore the previous behavior, set 
{{spark.sql.dropTableOnView.enabled}} to {{{}false{}}}.
 * *[SPARK-54853][SQL] Always check {{hive.exec.max.dynamic.partitions}} on the 
spark side*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: 
hive.exec.max.dynamic.partitions is now always enforced Spark-side and the 
session-level value is honored, changing when the limit error fires; error 
renamed to DYNAMIC_PARTITION_WRITE_PARTITION_NUM_LIMIT_EXCEEDED.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the 
{{hive.exec.max.dynamic.partitions}} limit for dynamic partition writes to Hive 
SerDe tables is always enforced on the Spark side and honors the session-level 
value, changing when the limit is checked. The error is now reported as 
{{{}DYNAMIC_PARTITION_WRITE_PARTITION_NUM_LIMIT_EXCEEDED{}}}.
 * *[SPARK-55372][SQL] Fix {{SHOW CREATE TABLE}} for tables / views with 
default collation*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: typeName/toString 
of an explicitly UTF8_BINARY-collated StringType/CharType now render 'string 
collate UTF8_BINARY' not 'string' (default non-collated unchanged), changing 
SHOW CREATE TABLE and schema output for such columns.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, a 
StringType/CharType/VarcharType with an explicit {{UTF8_BINARY}} collation 
renders its collation in {{{}typeName{}}}/{{{}toString{}}} (for example 
{{{}string collate UTF8_BINARY{}}}), changing SHOW CREATE TABLE and schema 
output for such columns. Default non-collated strings are unchanged. No opt-out.
 * *[SPARK-54918][SQL] Normalize floating numbers in array set operations*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: 
array_distinct/union/intersect/except and arrays_overlap now normalize floats 
so 0.0/-0.0 and differently-bit NaNs are treated as equal, changing results of 
these array set operations. No opt-out.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the array set 
functions {{{}array_distinct{}}}, {{{}array_union{}}}, {{{}array_intersect{}}}, 
{{{}array_except{}}}, and {{arrays_overlap}} normalize floating-point values, 
so {{0.0}} and {{-0.0}} and differently-bit NaN values are treated as equal. 
This changes the results of these functions; there is no opt-out.
 * *[SPARK-54777][SQL] Changed dropTable error handling in 
JDBCTableCatalog.dropTable(...)*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: JDBC DROP TABLE now 
only swallows object-not-found errors; other failures (permission, etc.) 
propagate instead of silently returning, so a drop that previously appeared to 
succeed now throws.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the JDBC table 
catalog only swallows object-not-found errors when running DROP TABLE; other 
failures such as permission-denied or constraint violations now propagate 
instead of silently returning success. A DROP TABLE that previously appeared to 
succeed may now throw.
 * *[SPARK-57040][SQL] JDBC connector supports pushdown TABLESAMPLE SYSTEM*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: V2 JDBC TABLESAMPLE 
with withReplacement=true is no longer pushed down (correctness fix; pushdown 
default-on), so .sample(withReplacement=true) on JDBC tables now returns 
different (correct) results.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the JDBC connector 
no longer pushes down {{TABLESAMPLE}} when {{withReplacement=true}} (a 
correctness fix, as no mainstream RDBMS supports sampling with replacement), 
and adds {{TABLESAMPLE SYSTEM}} pushdown for PostgreSQL. Results of 
sample-with-replacement on JDBC tables change accordingly.
 * *[SPARK-56031][SQL] Make Natural Join column matching respect case 
sensitivity conf*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: NATURAL JOIN now 
respects spark.sql.caseSensitive (default false), so joins on case-differing 
common columns instead of degrading to CROSS JOIN, changing results; set 
spark.sql.caseSensitive=true to match case-sensitively.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, NATURAL JOIN 
respects {{spark.sql.caseSensitive}} (default {{{}false{}}}), so common columns 
that differ only in case are joined instead of degrading to a CROSS JOIN, 
changing results. To match columns case-sensitively, set 
{{spark.sql.caseSensitive}} to {{{}true{}}}.
 * *[SPARK-31561][SQL] Add QUALIFY Clause*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: QUALIFY is now a 
(non-reserved) clause keyword, so a query using unquoted QUALIFY as a trailing 
table alias (FROM t QUALIFY) now parses as a QUALIFY clause; quote the 
identifier to restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the {{QUALIFY}} 
clause is supported, and {{QUALIFY}} becomes a (non-reserved) clause keyword. A 
query using unquoted {{QUALIFY}} as a trailing table alias (e.g. {{{}FROM t 
QUALIFY{}}}) is now parsed as a {{QUALIFY}} clause. To restore the previous 
behavior, quote the alias (e.g. {{{}`QUALIFY`{}}}).
 * *[SPARK-57188][SQL] Parameterless function takes precedence over UDF 
parameter*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: a parameterless 
built-in (current_user, current_date, etc.) now takes precedence over a 
same-named SQL UDF parameter, changing UDF body results. Set 
spark.sql.legacy.allowUdfParameterToShadowParameterlessFunction=true to restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, a parameterless 
built-in function ({{{}current_user{}}}, {{{}current_date{}}}, 
{{{}session_user{}}}, etc.) takes precedence over a same-named SQL UDF 
parameter in the function body. To restore the previous behavior, set 
{{spark.sql.legacy.allowUdfParameterToShadowParameterlessFunction}} to 
{{{}true{}}}.
 * *[SPARK-56045][SQL] Add flag for ignoring Parquet UNKNOWN type annotation 
and revert to old behavior*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: reading Parquet 
files with UNKNOWN logical-type annotation now infers physical type (e.g. 
IntegerType) instead of NullType shipped in v4.1.0; opt back into NullType via 
spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled=true.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, reading Parquet 
files with the {{UNKNOWN}} logical-type annotation infers the physical type 
(for example IntegerType) instead of the NullType used in 4.1.0. To restore the 
4.1.0 behavior of inferring NullType, set 
{{spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled}} to 
{{{}true{}}}.
 * *[SPARK-56414][SQL] Per-write options should take precedence over session 
config in file source writes*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: per-write options 
(e.g. parquet.outputTimestampType) now override the matching session SQLConf in 
Parquet/Avro writes; previously such options were silently ignored, so written 
file format can change when both are set.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, per-write options 
take precedence over session config for several Parquet/Avro write keys (e.g. 
{{{}spark.sql.parquet.outputTimestampType{}}}, 
{{{}spark.sql.parquet.writeLegacyFormat{}}}). Previously such options were 
silently ignored, so the written file format can change when both are set.
 * *[SPARK-56251][SQL] Add default fetchSize for postgres to avoid loading all 
data in memory*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: Postgres JDBC reads 
now default fetchSize to 1000 (was 0/all-in-memory), enabling cursor fetch with 
autoCommit=false; changes default read behavior. Set fetchsize=0 to restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the PostgreSQL JDBC 
dialect defaults the read {{fetchsize}} to {{1000}} (was {{{}0{}}}), enabling 
cursor-based fetching with {{autoCommit=false}} to avoid loading the whole 
table into memory. To restore the previous behavior, set the {{fetchsize}} 
option to {{{}0{}}}.
 * *[SPARK-55155][SQL] Support foldable expressions in {{SET CATALOG}} 
statement*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: SET CATALOG with a 
bare name now resolves to a session variable of that name first (if one exists) 
before treating it as a catalog name; there is no opt-out config. Edge case but 
a default behavior change.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, SET CATALOG accepts 
foldable expressions and a bare name is first resolved as a session variable of 
that name (if one exists) before being treated as a catalog name. There is no 
opt-out; a session variable that shadows a catalog name changes which catalog 
is set.
 * *[SPARK-51518][SQL] Support | as an alternative to |> for the SQL pipe 
operator token*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: single-char '|' is 
now a SQL pipe-operator token by default, so a query using a pipe keyword as a 
column name after bitwise-OR may reparse; restore via 
spark.sql.parser.singleCharacterPipeOperator.enabled=false.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the SQL parser 
accepts single-character {{|}} as an alternative to {{|>}} for the pipe 
operator token by default. A pipe keyword used as a column name after a 
bitwise-OR {{|}} may now reparse. To restore the previous behavior, set 
{{spark.sql.parser.singleCharacterPipeOperator.enabled}} to {{{}false{}}}.
 * *[SPARK-52812][SQL] Make Spark Connect Catalog.createTable eager*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: Spark Connect 
Catalog.createTable now executes eagerly instead of lazily, so the table is 
created (and errors like already-exists surface) immediately at the call rather 
than on a later action.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, Spark Connect 
{{Catalog.createTable}} executes eagerly: the table is created (and errors such 
as table-already-exists surface) immediately at the call rather than lazily on 
a later action. Code relying on the previous lazy behavior is affected.
 * *[SPARK-55198][SQL] spark-sql should skip comment line with leading 
whitespaces*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: spark-sql CLI now 
skips comment lines that have leading whitespace before -- (line.trim 
startsWith --), matching Hive/beeline; such lines were previously sent as SQL. 
CLI-only, no opt-out.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the spark-sql CLI 
skips comment lines whose first non-whitespace characters are {{--}} (i.e. 
{{line.trim}} starts with {{{}--{}}}), aligning with Hive and beeline. 
Previously such leading-whitespace comment lines were sent as SQL. There is no 
opt-out.
 * *[SPARK-49110][SQL] Simplify SubqueryAlias.metadataOutput to always 
propagate metadata columns*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: SubqueryAlias now 
always propagates metadata columns by default; queries that failed with 
AnalysisException may now succeed and joins may newly raise ambiguous-column 
errors. Restore via subqueryAliasAlwaysPropagateMetadataColumns=false.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{SubqueryAlias}} 
always propagates metadata columns from its child, so some queries that 
previously failed with AnalysisException now succeed and joins may raise new 
ambiguous-column errors. To restore the legacy behavior, set 
{{{}spark.sql.analyzer.subqueryAliasAlwaysPropagateMetadataColumns=false{}}}.
 * *[SPARK-56678][SQL] Use structured Catalog/Namespace/Table rows in DESCRIBE 
TABLE EXTENDED for v2 tables and views*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: DESCRIBE TABLE 
EXTENDED for v2 tables/views now emits structured 
Catalog/Namespace/Database/Table rows instead of a single Name/Identifier row; 
consumers parsing the output are affected.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{DESCRIBE TABLE 
EXTENDED}} for v2 tables and views emits structured {{{}Catalog{}}}, 
{{{}Namespace{}}}, {{{}Database{}}}, and {{{}Table{}}}/{{{}View{}}} rows 
instead of a single {{{}Name{}}}/{{{}Identifier{}}} row. Consumers that parse 
the command output may be affected.
 * *[SPARK-56654][SQL] Reject unpaired UTF-16 surrogates in Variant JSON 
parsing*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: 
parse_json/try_parse_json/from_json('variant') now reject unpaired UTF-16 
surrogates (error/NULL) instead of substituting U+FFFD; previously-accepted 
JSON now fails. Set spark.sql.variant.validateUnicodeInJsonParsing=false to 
restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{{}parse_json{}}}, 
{{{}try_parse_json{}}}, and {{from_json}} to variant reject unpaired UTF-16 
surrogates (raising an error or returning NULL) instead of silently 
substituting U+FFFD. To restore the previous permissive behavior, set 
{{spark.sql.variant.validateUnicodeInJsonParsing}} to {{{}false{}}}.
 * *[SPARK-56554][SQL] Respect inferSchema option when parsing XML as variant*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: reading XML as 
Variant now honors inferSchema=false (leaf/attribute text kept as strings, not 
inferred boolean/long/decimal), changing results. Set 
spark.sql.xml.variant.respectInferSchema=false to restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, reading XML as 
Variant honors {{{}inferSchema=false{}}}, keeping leaf text and attribute 
values as strings instead of inferring boolean/long/decimal. To restore the 
previous behavior of always inferring types, set 
{{spark.sql.xml.variant.respectInferSchema}} to {{{}false{}}}.
 * *[SPARK-54718][SQL] Preserve attributes names during CTE newInstance()*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: CTE newInstance() 
now preserves attribute name casing by default, changing output column-name 
casing for self-joins on a CTE with case-differing duplicate columns; restore 
via spark.sql.legacy.cteDuplicateAttributeNames.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, CTE relation 
references preserve attribute name casing when re-instantiated, so self-joins 
on a CTE with case-differing duplicate columns keep the original column-name 
casing in the output. To restore the previous behavior, set 
{{spark.sql.legacy.cteDuplicateAttributeNames}} to {{{}true{}}}.
 * *[SPARK-56280][SS] normalize NaN and +/-0.0 in streaming dedupe node*
*Component:* SS
*Why no migration-guide note needed:* Should be documented: streaming 
dropDuplicates on float/double keys now normalizes NaN and +/-0.0, so 
differently-bit NaNs and signed zeros are treated as duplicates; dedup results 
change with no opt-out.
*Proposed migration-guide message:* [SS] Since Spark 4.2, streaming 
{{{}dropDuplicates{}}}/{{{}dropDuplicatesWithinWatermark{}}} on float or double 
key columns normalize NaN and signed zero, so differently-bit NaN values and 
{{{}+0.0{}}}/{{{}-0.0{}}} are treated as duplicates. This changes deduplication 
results for queries with floating-point keys; there is no opt-out.
 * *[SPARK-55058][SS] Throw error on inconsistent checkpoint metadata*
*Component:* SS
*Why no migration-guide note needed:* Should be documented: restarting a 
streaming query whose checkpoint has offset/commit logs but no metadata file 
now fails with MISSING_METADATA_FILE by default instead of starting a new query 
id; disable via the verifyMetadataExists.enabled config=false.
*Proposed migration-guide message:* [SS] Since Spark 4.2, restarting a 
streaming query whose checkpoint has offset and commit logs but no metadata 
file fails with {{STREAMING_CHECKPOINT_MISSING_METADATA_FILE}} instead of 
silently generating a new query id. To restore the previous behavior, set 
{{spark.sql.streaming.checkpoint.verifyMetadataExists.enabled}} to 
{{{}false{}}}.
 * *[SPARK-56239][UI] Fix SQL tab DataTables: API default limit, date format, 
and appId resolution*
*Component:* UI
*Why no migration-guide note needed:* Should be documented: the long-existing 
/applications/\{appId}/sql REST endpoint default length changed from 20 to -1, 
so it now returns all SQL executions by default; clients relying on the 20-row 
default get more rows.
*Proposed migration-guide message:* [Core] Since Spark 4.2, the 
{{/applications/\{appId}/sql}} REST endpoint defaults the {{length}} parameter 
to {{-1}} (was {{{}20{}}}), so it returns all SQL executions by default. To 
restore the previous behavior, pass {{length=20}} (any {{length <= 0}} returns 
all executions).
 * *[SPARK-55075][K8S] Track executor pod creation errors with 
ExecutorFailureTracker*
*Component:* K8S
*Why no migration-guide note needed:* Should be documented: on K8s, executor 
pod-creation failures are now caught, logged and counted by 
ExecutorFailureTracker (continue until max failures) instead of being rethrown 
immediately, changing default failure semantics for K8s deployments.
*Proposed migration-guide message:* [Core] Since Spark 4.2, on Kubernetes 
executor pod-creation failures are caught, logged, and counted by 
ExecutorFailureTracker (allocation continues until the max-failures threshold) 
instead of being rethrown immediately. There is no opt-out.


> Auditing the migration guide
> ----------------------------
>
>                 Key: SPARK-57452
>                 URL: https://issues.apache.org/jira/browse/SPARK-57452
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Kubernetes, MLlib, PySpark, Spark Core, SQL, Structured 
> Streaming, Web UI
>    Affects Versions: 4.2.0
>            Reporter: Xiao Li
>            Priority: Blocker
>
> I used AI to analyze more than 1,900 commits from the Spark 4.2.0 release and 
> identified 38 changes that appear to be missing from the migration guide.
> The JIRAs listed below were identified through this analysis. However, this 
> may not be a complete list, so please also review the remaining commits for 
> any additional migration guide updates that may be required.
>  
>  * *SPARK-55314[CONNECT] Propagate observed metrics errors to client*
> *Component:* CONNECT
> *Why no migration-guide note needed:* Should be documented: Observation.get 
> (Scala and Connect) now raises the underlying exception when metric 
> collection fails instead of returning an empty map; code that tolerated an 
> empty result on failure now sees a thrown exception.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, 
> {{Observation.get}} (Scala and Spark Connect) raises the underlying exception 
> when observed-metric collection fails instead of silently returning an empty 
> map. Code that tolerated an empty result on failure now sees a thrown 
> exception. There is no opt-out.
>  * *SPARK-55655[MLLIB] Make {{CountVectorizer}} vocabulary deterministic when 
> counts are equal*
> *Component:* MLLIB
> *Why no migration-guide note needed:* Should be documented: CountVectorizer 
> now breaks ties between equal-count terms lexicographically, making the 
> vocabulary deterministic; this can change vocabulary term order and feature 
> indices versus prior (non-deterministic) output. No opt-out.
> *Proposed migration-guide message:* [Core] Since Spark 4.2, 
> {{CountVectorizer}} breaks ties between equal-count terms lexicographically 
> so the vocabulary is deterministic. This can change vocabulary term order and 
> feature indices compared with prior (non-deterministic) output. There is no 
> opt-out.
>  * *SPARK-47997[PS] Add errors parameter to DataFrame.drop and Series.drop*
> *Component:* PS
> *Why no migration-guide note needed:* Should be documented: ps 
> DataFrame.drop/Series.drop now raise KeyError if ANY label is missing 
> (previously only if all missing), for all pandas versions; pass 
> errors='ignore' to skip missing labels.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark 
> DataFrame.drop/Series.drop add an {{errors}} parameter defaulting to 
> {{'raise'}} and now raise KeyError if any requested label is missing 
> (previously only if all were missing), across all pandas versions. To skip 
> missing labels, pass {{{}errors='ignore'{}}}.
>  * *SPARK-56219[PS] Align groupby idxmax and idxmin skipna=False behavior 
> with pandas 2/3*
> *Component:* PS
> *Why no migration-guide note needed:* Should be documented: pandas-on-Spark 
> groupby idxmax/idxmin with skipna=False now returns null for NA groups 
> (pandas 2) or raises on NA inputs (pandas 3) instead of a label; no opt-out, 
> results change for existing skipna=False users.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark 
> {{GroupBy.idxmax}} and {{GroupBy.idxmin}} with {{skipna=False}} now follow 
> pandas semantics for NA: with pandas 2 they return null for groups containing 
> NA values, and with pandas 3 they raise on NA-containing inputs, instead of 
> returning an index label. There is no opt-out.
>  * *SPARK-55977[PS] Fix isin() to use strict type matching like pandas*
> *Component:* PS
> *Why no migration-guide note needed:* Should be documented: ps 
> Series/DataFrame.isin() now uses strict Python-type matching (e.g. 1 no 
> longer matches '1'), changing results for all pandas versions; this is 
> pandas-parity, no opt-out config.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark 
> Series/DataFrame {{isin()}} uses strict Python-type matching like pandas, so 
> values of incompatible types no longer match (for example integer 1 no longer 
> matches string '1'). Results change across all pandas versions. There is no 
> opt-out.
>  * *SPARK-54568[PYTHON] Avoid unnecessary pandas conversion in create 
> dataframe from ndarray*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: createDataFrame 
> from a numpy ndarray now requires pyarrow and converts ndarray to Arrow 
> directly, dropping np.dtype-based StructType inference so inferred schema can 
> differ.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, createDataFrame 
> from a NumPy ndarray converts the array directly to an Arrow Table and now 
> requires PyArrow. The previous np.dtype-based StructType inference is 
> dropped, so the inferred schema may differ. To control the schema, pass an 
> explicit schema.
>  * *SPARK-56186[PYTHON] Retire pypy*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: PyPy is no longer 
> officially supported in PySpark (CI, docker image, classifier, and 
> PyPy-specific code removed); PyPy users should migrate to CPython.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, PyPy is no longer 
> officially supported in PySpark: PyPy CI, the PyPy docker image, the setup.py 
> classifier, and PyPy-specific code/test skips have been removed. PyPy users 
> should migrate to CPython. There is no opt-out.
>  * *SPARK-55096[PYTHON] Update pandas minimum version in {{connect/setup.py}}*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: minimum pandas 
> raised to 2.2.0 for Spark Connect (was 2.0.0); pandas <2.2 is no longer 
> supported on Connect.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, the minimum 
> supported version for pandas on Spark Connect has been raised from 2.0.0 to 
> 2.2.0, matching the minimum already required by PySpark.
>  * *SPARK-54962[PYTHON] Fix nullable integers handling in Pandas UDF*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: Pandas UDFs on 
> nullable integer columns containing nulls now use a nullable Int extension 
> dtype instead of float64, so values/dtype inside the UDF change (fixing 
> precision loss for large integers); no opt-out.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, Pandas UDFs on a 
> nullable integer column that contains nulls receive a pandas nullable integer 
> extension dtype (e.g. Int64) instead of float64, fixing precision loss for 
> large integers. The dtype and values seen inside the UDF change accordingly. 
> There is no opt-out configuration.
>  * *SPARK-55583[PYTHON] Validate Arrow schema types in Python data source*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: a Python Data 
> Source read returning a pa.RecordBatch whose Arrow types differ from the 
> declared schema now fails with DATA_SOURCE_RETURN_SCHEMA_MISMATCH; 
> type-mismatched batches that previously loaded by coincidence now error.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, a Python data 
> source read that returns a {{pa.RecordBatch}} whose Arrow types differ from 
> the declared schema now fails with 
> {{{}DATA_SOURCE_RETURN_SCHEMA_MISMATCH{}}}. Type-mismatched batches that 
> previously loaded by coincidence now error. There is no opt-out.
>  * *SPARK-55416[SS][PYTHON] Streaming Python Data Source memory leak when 
> end-offset is not updated*
> *Component:* SS,PYTHON
> *Why no migration-guide note needed:* Should be documented: a Streaming 
> Python Data Source SimpleDataSourceStreamReader whose read() returns a 
> non-empty batch with end==start now fails with 
> STREAM_READER_OFFSET_DID_NOT_ADVANCE instead of leaking memory; affects 
> existing (buggy) reader impls.
> *Proposed migration-guide message:* [SS] Since Spark 4.2, a streaming Python 
> data source SimpleDataSourceStreamReader whose {{read()}} returns a non-empty 
> batch with end offset equal to start now fails with 
> {{SIMPLE_STREAM_READER_OFFSET_DID_NOT_ADVANCE}} instead of leaking driver 
> memory. Empty batches with end == start are still allowed. There is no 
> opt-out.
>  * *SPARK-56206[SQL] Fix case-insensitive duplicate CTE name detection*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: duplicate CTE 
> names differing only in case (e.g. WITH cte1, CTE1) now raise 
> DUPLICATED_CTE_NAMES instead of silently overwriting; previously-accepted 
> queries now fail. No opt-out.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, duplicate CTE name 
> detection is case-insensitive. CTE definitions whose names differ only in 
> case (e.g. {{{}WITH cte AS (...), CTE AS (...){}}}) now raise 
> {{DUPLICATED_CTE_NAMES}} instead of silently overwriting the earlier 
> definition. There is no opt-out; rename the conflicting CTEs.
>  * *SPARK-56652[SQL] Always emit RELY/NORELY in DESCRIBE EXTENDED constraint 
> output*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: DESCRIBE EXTENDED 
> now always prints RELY/NORELY for table constraints (previously omitted the 
> default NORELY), changing the command output text for tools parsing it.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, {{DESCRIBE 
> EXTENDED}} always emits the {{{}RELY{}}}/{{{}NORELY{}}} token for table 
> constraints, including {{NORELY}} for the default state which was previously 
> omitted. This matches {{SHOW CREATE TABLE}} output and changes the command's 
> constraint output text for tools parsing it.
>  * *SPARK-55019[SQL] Allow DROP TABLE to drop VIEW*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: DROP TABLE on a 
> view now drops the view by default instead of raising 
> WRONG_COMMAND_FOR_OBJECT_TYPE; restore via 
> spark.sql.dropTableOnView.enabled=false.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, DROP TABLE on a 
> view drops the view by default instead of raising 
> {{{}WRONG_COMMAND_FOR_OBJECT_TYPE{}}}. To restore the previous behavior, set 
> {{spark.sql.dropTableOnView.enabled}} to {{{}false{}}}.
>  * *SPARK-54853[SQL] Always check {{hive.exec.max.dynamic.partitions}} on the 
> spark side*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: 
> hive.exec.max.dynamic.partitions is now always enforced Spark-side and the 
> session-level value is honored, changing when the limit error fires; error 
> renamed to DYNAMIC_PARTITION_WRITE_PARTITION_NUM_LIMIT_EXCEEDED.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the 
> {{hive.exec.max.dynamic.partitions}} limit for dynamic partition writes to 
> Hive SerDe tables is always enforced on the Spark side and honors the 
> session-level value, changing when the limit is checked. The error is now 
> reported as {{{}DYNAMIC_PARTITION_WRITE_PARTITION_NUM_LIMIT_EXCEEDED{}}}.
>  * *SPARK-55372[SQL] Fix {{SHOW CREATE TABLE}} for tables / views with 
> default collation*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: typeName/toString 
> of an explicitly UTF8_BINARY-collated StringType/CharType now render 'string 
> collate UTF8_BINARY' not 'string' (default non-collated unchanged), changing 
> SHOW CREATE TABLE and schema output for such columns.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, a 
> StringType/CharType/VarcharType with an explicit {{UTF8_BINARY}} collation 
> renders its collation in {{{}typeName{}}}/{{{}toString{}}} (for example 
> {{{}string collate UTF8_BINARY{}}}), changing SHOW CREATE TABLE and schema 
> output for such columns. Default non-collated strings are unchanged. No 
> opt-out.
>  * *SPARK-54918[SQL] Normalize floating numbers in array set operations*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: 
> array_distinct/union/intersect/except and arrays_overlap now normalize floats 
> so 0.0/-0.0 and differently-bit NaNs are treated as equal, changing results 
> of these array set operations. No opt-out.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the array set 
> functions {{{}array_distinct{}}}, {{{}array_union{}}}, 
> {{{}array_intersect{}}}, {{{}array_except{}}}, and {{arrays_overlap}} 
> normalize floating-point values, so {{0.0}} and {{-0.0}} and differently-bit 
> NaN values are treated as equal. This changes the results of these functions; 
> there is no opt-out.
>  * *SPARK-54777[SQL] Changed dropTable error handling in 
> JDBCTableCatalog.dropTable(...)*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: JDBC DROP TABLE 
> now only swallows object-not-found errors; other failures (permission, etc.) 
> propagate instead of silently returning, so a drop that previously appeared 
> to succeed now throws.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the JDBC table 
> catalog only swallows object-not-found errors when running DROP TABLE; other 
> failures such as permission-denied or constraint violations now propagate 
> instead of silently returning success. A DROP TABLE that previously appeared 
> to succeed may now throw.
>  * *SPARK-57040[SQL] JDBC connector supports pushdown TABLESAMPLE SYSTEM*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: V2 JDBC 
> TABLESAMPLE with withReplacement=true is no longer pushed down (correctness 
> fix; pushdown default-on), so .sample(withReplacement=true) on JDBC tables 
> now returns different (correct) results.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the JDBC connector 
> no longer pushes down {{TABLESAMPLE}} when {{withReplacement=true}} (a 
> correctness fix, as no mainstream RDBMS supports sampling with replacement), 
> and adds {{TABLESAMPLE SYSTEM}} pushdown for PostgreSQL. Results of 
> sample-with-replacement on JDBC tables change accordingly.
>  * *SPARK-56031[SQL] Make Natural Join column matching respect case 
> sensitivity conf*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: NATURAL JOIN now 
> respects spark.sql.caseSensitive (default false), so joins on case-differing 
> common columns instead of degrading to CROSS JOIN, changing results; set 
> spark.sql.caseSensitive=true to match case-sensitively.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, NATURAL JOIN 
> respects {{spark.sql.caseSensitive}} (default {{{}false{}}}), so common 
> columns that differ only in case are joined instead of degrading to a CROSS 
> JOIN, changing results. To match columns case-sensitively, set 
> {{spark.sql.caseSensitive}} to {{{}true{}}}.
>  * *SPARK-31561[SQL] Add QUALIFY Clause*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: QUALIFY is now a 
> (non-reserved) clause keyword, so a query using unquoted QUALIFY as a 
> trailing table alias (FROM t QUALIFY) now parses as a QUALIFY clause; quote 
> the identifier to restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the {{QUALIFY}} 
> clause is supported, and {{QUALIFY}} becomes a (non-reserved) clause keyword. 
> A query using unquoted {{QUALIFY}} as a trailing table alias (e.g. {{{}FROM t 
> QUALIFY{}}}) is now parsed as a {{QUALIFY}} clause. To restore the previous 
> behavior, quote the alias (e.g. {{{}`QUALIFY`{}}}).
>  * *SPARK-57188[SQL] Parameterless function takes precedence over UDF 
> parameter*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: a parameterless 
> built-in (current_user, current_date, etc.) now takes precedence over a 
> same-named SQL UDF parameter, changing UDF body results. Set 
> spark.sql.legacy.allowUdfParameterToShadowParameterlessFunction=true to 
> restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, a parameterless 
> built-in function ({{{}current_user{}}}, {{{}current_date{}}}, 
> {{{}session_user{}}}, etc.) takes precedence over a same-named SQL UDF 
> parameter in the function body. To restore the previous behavior, set 
> {{spark.sql.legacy.allowUdfParameterToShadowParameterlessFunction}} to 
> {{{}true{}}}.
>  * *SPARK-56045[SQL] Add flag for ignoring Parquet UNKNOWN type annotation 
> and revert to old behavior*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: reading Parquet 
> files with UNKNOWN logical-type annotation now infers physical type (e.g. 
> IntegerType) instead of NullType shipped in v4.1.0; opt back into NullType 
> via spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled=true.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, reading Parquet 
> files with the {{UNKNOWN}} logical-type annotation infers the physical type 
> (for example IntegerType) instead of the NullType used in 4.1.0. To restore 
> the 4.1.0 behavior of inferring NullType, set 
> {{spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled}} to 
> {{{}true{}}}.
>  * *SPARK-56414[SQL] Per-write options should take precedence over session 
> config in file source writes*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: per-write options 
> (e.g. parquet.outputTimestampType) now override the matching session SQLConf 
> in Parquet/Avro writes; previously such options were silently ignored, so 
> written file format can change when both are set.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, per-write options 
> take precedence over session config for several Parquet/Avro write keys (e.g. 
> {{{}spark.sql.parquet.outputTimestampType{}}}, 
> {{{}spark.sql.parquet.writeLegacyFormat{}}}). Previously such options were 
> silently ignored, so the written file format can change when both are set.
>  * *SPARK-56251[SQL] Add default fetchSize for postgres to avoid loading all 
> data in memory*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: Postgres JDBC 
> reads now default fetchSize to 1000 (was 0/all-in-memory), enabling cursor 
> fetch with autoCommit=false; changes default read behavior. Set fetchsize=0 
> to restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the PostgreSQL 
> JDBC dialect defaults the read {{fetchsize}} to {{1000}} (was {{{}0{}}}), 
> enabling cursor-based fetching with {{autoCommit=false}} to avoid loading the 
> whole table into memory. To restore the previous behavior, set the 
> {{fetchsize}} option to {{{}0{}}}.
>  * *SPARK-55155[SQL] Support foldable expressions in {{SET CATALOG}} 
> statement*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: SET CATALOG with 
> a bare name now resolves to a session variable of that name first (if one 
> exists) before treating it as a catalog name; there is no opt-out config. 
> Edge case but a default behavior change.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, SET CATALOG 
> accepts foldable expressions and a bare name is first resolved as a session 
> variable of that name (if one exists) before being treated as a catalog name. 
> There is no opt-out; a session variable that shadows a catalog name changes 
> which catalog is set.
>  * *SPARK-51518[SQL] Support | as an alternative to |> for the SQL pipe 
> operator token*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: single-char '|' 
> is now a SQL pipe-operator token by default, so a query using a pipe keyword 
> as a column name after bitwise-OR may reparse; restore via 
> spark.sql.parser.singleCharacterPipeOperator.enabled=false.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the SQL parser 
> accepts single-character {{|}} as an alternative to {{|>}} for the pipe 
> operator token by default. A pipe keyword used as a column name after a 
> bitwise-OR {{|}} may now reparse. To restore the previous behavior, set 
> {{spark.sql.parser.singleCharacterPipeOperator.enabled}} to {{{}false{}}}.
>  * *SPARK-52812[SQL] Make Spark Connect Catalog.createTable eager*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: Spark Connect 
> Catalog.createTable now executes eagerly instead of lazily, so the table is 
> created (and errors like already-exists surface) immediately at the call 
> rather than on a later action.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, Spark Connect 
> {{Catalog.createTable}} executes eagerly: the table is created (and errors 
> such as table-already-exists surface) immediately at the call rather than 
> lazily on a later action. Code relying on the previous lazy behavior is 
> affected.
>  * *SPARK-55198[SQL] spark-sql should skip comment line with leading 
> whitespaces*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: spark-sql CLI now 
> skips comment lines that have leading whitespace before – (line.trim 
> startsWith --), matching Hive/beeline; such lines were previously sent as 
> SQL. CLI-only, no opt-out.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the spark-sql CLI 
> skips comment lines whose first non-whitespace characters are {{-{-}{-}}} 
> (i.e. {{line.trim}} starts with {{{}-{}}}), aligning with Hive and beeline. 
> Previously such leading-whitespace comment lines were sent as SQL. There is 
> no opt-out.
>  * *SPARK-49110[SQL] Simplify SubqueryAlias.metadataOutput to always 
> propagate metadata columns*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: SubqueryAlias now 
> always propagates metadata columns by default; queries that failed with 
> AnalysisException may now succeed and joins may newly raise ambiguous-column 
> errors. Restore via subqueryAliasAlwaysPropagateMetadataColumns=false.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, {{SubqueryAlias}} 
> always propagates metadata columns from its child, so some queries that 
> previously failed with AnalysisException now succeed and joins may raise new 
> ambiguous-column errors. To restore the legacy behavior, set 
> {{{}spark.sql.analyzer.subqueryAliasAlwaysPropagateMetadataColumns=false{}}}.
>  * *SPARK-56678[SQL] Use structured Catalog/Namespace/Table rows in DESCRIBE 
> TABLE EXTENDED for v2 tables and views*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: DESCRIBE TABLE 
> EXTENDED for v2 tables/views now emits structured 
> Catalog/Namespace/Database/Table rows instead of a single Name/Identifier 
> row; consumers parsing the output are affected.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, {{DESCRIBE TABLE 
> EXTENDED}} for v2 tables and views emits structured {{{}Catalog{}}}, 
> {{{}Namespace{}}}, {{{}Database{}}}, and {{{}Table{}}}/{{{}View{}}} rows 
> instead of a single {{{}Name{}}}/{{{}Identifier{}}} row. Consumers that parse 
> the command output may be affected.
>  * *SPARK-56654[SQL] Reject unpaired UTF-16 surrogates in Variant JSON 
> parsing*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: 
> parse_json/try_parse_json/from_json('variant') now reject unpaired UTF-16 
> surrogates (error/NULL) instead of substituting U+FFFD; previously-accepted 
> JSON now fails. Set spark.sql.variant.validateUnicodeInJsonParsing=false to 
> restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, 
> {{{}parse_json{}}}, {{{}try_parse_json{}}}, and {{from_json}} to variant 
> reject unpaired UTF-16 surrogates (raising an error or returning NULL) 
> instead of silently substituting U+FFFD. To restore the previous permissive 
> behavior, set {{spark.sql.variant.validateUnicodeInJsonParsing}} to 
> {{{}false{}}}.
>  * *SPARK-56554[SQL] Respect inferSchema option when parsing XML as variant*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: reading XML as 
> Variant now honors inferSchema=false (leaf/attribute text kept as strings, 
> not inferred boolean/long/decimal), changing results. Set 
> spark.sql.xml.variant.respectInferSchema=false to restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, reading XML as 
> Variant honors {{{}inferSchema=false{}}}, keeping leaf text and attribute 
> values as strings instead of inferring boolean/long/decimal. To restore the 
> previous behavior of always inferring types, set 
> {{spark.sql.xml.variant.respectInferSchema}} to {{{}false{}}}.
>  * *SPARK-54718[SQL] Preserve attributes names during CTE newInstance()*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: CTE newInstance() 
> now preserves attribute name casing by default, changing output column-name 
> casing for self-joins on a CTE with case-differing duplicate columns; restore 
> via spark.sql.legacy.cteDuplicateAttributeNames.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, CTE relation 
> references preserve attribute name casing when re-instantiated, so self-joins 
> on a CTE with case-differing duplicate columns keep the original column-name 
> casing in the output. To restore the previous behavior, set 
> {{spark.sql.legacy.cteDuplicateAttributeNames}} to {{{}true{}}}.
>  * *SPARK-56280[SS] normalize NaN and +/-0.0 in streaming dedupe node*
> *Component:* SS
> *Why no migration-guide note needed:* Should be documented: streaming 
> dropDuplicates on float/double keys now normalizes NaN and +/-0.0, so 
> differently-bit NaNs and signed zeros are treated as duplicates; dedup 
> results change with no opt-out.
> *Proposed migration-guide message:* [SS] Since Spark 4.2, streaming 
> {{{}dropDuplicates{}}}/{{{}dropDuplicatesWithinWatermark{}}} on float or 
> double key columns normalize NaN and signed zero, so differently-bit NaN 
> values and {{{}+0.0{}}}/{{{}-0.0{}}} are treated as duplicates. This changes 
> deduplication results for queries with floating-point keys; there is no 
> opt-out.
>  * *SPARK-55058[SS] Throw error on inconsistent checkpoint metadata*
> *Component:* SS
> *Why no migration-guide note needed:* Should be documented: restarting a 
> streaming query whose checkpoint has offset/commit logs but no metadata file 
> now fails with MISSING_METADATA_FILE by default instead of starting a new 
> query id; disable via the verifyMetadataExists.enabled config=false.
> *Proposed migration-guide message:* [SS] Since Spark 4.2, restarting a 
> streaming query whose checkpoint has offset and commit logs but no metadata 
> file fails with {{STREAMING_CHECKPOINT_MISSING_METADATA_FILE}} instead of 
> silently generating a new query id. To restore the previous behavior, set 
> {{spark.sql.streaming.checkpoint.verifyMetadataExists.enabled}} to 
> {{{}false{}}}.
>  * *SPARK-56239[UI] Fix SQL tab DataTables: API default limit, date format, 
> and appId resolution*
> *Component:* UI
> *Why no migration-guide note needed:* Should be documented: the long-existing 
> /applications/\{appId}/sql REST endpoint default length changed from 20 to 
> -1, so it now returns all SQL executions by default; clients relying on the 
> 20-row default get more rows.
> *Proposed migration-guide message:* [Core] Since Spark 4.2, the 
> {{/applications/\{appId}/sql}} REST endpoint defaults the {{length}} 
> parameter to {{-1}} (was {{{}20{}}}), so it returns all SQL executions by 
> default. To restore the previous behavior, pass {{length=20}} (any {{length 
> <= 0}} returns all executions).
>  * *SPARK-55075[K8S] Track executor pod creation errors with 
> ExecutorFailureTracker*
> *Component:* K8S
> *Why no migration-guide note needed:* Should be documented: on K8s, executor 
> pod-creation failures are now caught, logged and counted by 
> ExecutorFailureTracker (continue until max failures) instead of being 
> rethrown immediately, changing default failure semantics for K8s deployments.
> *Proposed migration-guide message:* [Core] Since Spark 4.2, on Kubernetes 
> executor pod-creation failures are caught, logged, and counted by 
> ExecutorFailureTracker (allocation continues until the max-failures 
> threshold) instead of being rethrown immediately. There is no opt-out.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SPARK-57452) Auditing the migration guide

Reply via email to