[
https://issues.apache.org/jira/browse/SPARK-57452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiao Li updated SPARK-57452:
----------------------------
Description:
I used AI to analyze more than 1,900 commits from the Spark 4.2.0 release and
identified 38 changes that appear to be missing from the migration guide.
The JIRAs listed below were identified through this analysis. However, this may
not be a complete list, so please also review the remaining commits for any
additional migration guide updates that may be required.
* *SPARK-55314[CONNECT] Propagate observed metrics errors to client*
*Component:* CONNECT
*Why no migration-guide note needed:* Should be documented: Observation.get
(Scala and Connect) now raises the underlying exception when metric collection
fails instead of returning an empty map; code that tolerated an empty result on
failure now sees a thrown exception.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{Observation.get}}
(Scala and Spark Connect) raises the underlying exception when observed-metric
collection fails instead of silently returning an empty map. Code that
tolerated an empty result on failure now sees a thrown exception. There is no
opt-out.
* *SPARK-55655[MLLIB] Make {{CountVectorizer}} vocabulary deterministic when
counts are equal*
*Component:* MLLIB
*Why no migration-guide note needed:* Should be documented: CountVectorizer now
breaks ties between equal-count terms lexicographically, making the vocabulary
deterministic; this can change vocabulary term order and feature indices versus
prior (non-deterministic) output. No opt-out.
*Proposed migration-guide message:* [Core] Since Spark 4.2, {{CountVectorizer}}
breaks ties between equal-count terms lexicographically so the vocabulary is
deterministic. This can change vocabulary term order and feature indices
compared with prior (non-deterministic) output. There is no opt-out.
* *SPARK-47997[PS] Add errors parameter to DataFrame.drop and Series.drop*
*Component:* PS
*Why no migration-guide note needed:* Should be documented: ps
DataFrame.drop/Series.drop now raise KeyError if ANY label is missing
(previously only if all missing), for all pandas versions; pass errors='ignore'
to skip missing labels.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark
DataFrame.drop/Series.drop add an {{errors}} parameter defaulting to
{{'raise'}} and now raise KeyError if any requested label is missing
(previously only if all were missing), across all pandas versions. To skip
missing labels, pass {{{}errors='ignore'{}}}.
* *SPARK-56219[PS] Align groupby idxmax and idxmin skipna=False behavior with
pandas 2/3*
*Component:* PS
*Why no migration-guide note needed:* Should be documented: pandas-on-Spark
groupby idxmax/idxmin with skipna=False now returns null for NA groups (pandas
2) or raises on NA inputs (pandas 3) instead of a label; no opt-out, results
change for existing skipna=False users.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark
{{GroupBy.idxmax}} and {{GroupBy.idxmin}} with {{skipna=False}} now follow
pandas semantics for NA: with pandas 2 they return null for groups containing
NA values, and with pandas 3 they raise on NA-containing inputs, instead of
returning an index label. There is no opt-out.
* *SPARK-55977[PS] Fix isin() to use strict type matching like pandas*
*Component:* PS
*Why no migration-guide note needed:* Should be documented: ps
Series/DataFrame.isin() now uses strict Python-type matching (e.g. 1 no longer
matches '1'), changing results for all pandas versions; this is pandas-parity,
no opt-out config.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark
Series/DataFrame {{isin()}} uses strict Python-type matching like pandas, so
values of incompatible types no longer match (for example integer 1 no longer
matches string '1'). Results change across all pandas versions. There is no
opt-out.
* *SPARK-54568[PYTHON] Avoid unnecessary pandas conversion in create dataframe
from ndarray*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: createDataFrame
from a numpy ndarray now requires pyarrow and converts ndarray to Arrow
directly, dropping np.dtype-based StructType inference so inferred schema can
differ.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, createDataFrame
from a NumPy ndarray converts the array directly to an Arrow Table and now
requires PyArrow. The previous np.dtype-based StructType inference is dropped,
so the inferred schema may differ. To control the schema, pass an explicit
schema.
* *SPARK-56186[PYTHON] Retire pypy*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: PyPy is no longer
officially supported in PySpark (CI, docker image, classifier, and
PyPy-specific code removed); PyPy users should migrate to CPython.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, PyPy is no longer
officially supported in PySpark: PyPy CI, the PyPy docker image, the setup.py
classifier, and PyPy-specific code/test skips have been removed. PyPy users
should migrate to CPython. There is no opt-out.
* *SPARK-55096[PYTHON] Update pandas minimum version in {{connect/setup.py}}*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: minimum pandas
raised to 2.2.0 for Spark Connect (was 2.0.0); pandas <2.2 is no longer
supported on Connect.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, the minimum
supported version for pandas on Spark Connect has been raised from 2.0.0 to
2.2.0, matching the minimum already required by PySpark.
* *SPARK-54962[PYTHON] Fix nullable integers handling in Pandas UDF*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: Pandas UDFs on
nullable integer columns containing nulls now use a nullable Int extension
dtype instead of float64, so values/dtype inside the UDF change (fixing
precision loss for large integers); no opt-out.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, Pandas UDFs on a
nullable integer column that contains nulls receive a pandas nullable integer
extension dtype (e.g. Int64) instead of float64, fixing precision loss for
large integers. The dtype and values seen inside the UDF change accordingly.
There is no opt-out configuration.
* *SPARK-55583[PYTHON] Validate Arrow schema types in Python data source*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: a Python Data
Source read returning a pa.RecordBatch whose Arrow types differ from the
declared schema now fails with DATA_SOURCE_RETURN_SCHEMA_MISMATCH;
type-mismatched batches that previously loaded by coincidence now error.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, a Python data
source read that returns a {{pa.RecordBatch}} whose Arrow types differ from the
declared schema now fails with {{{}DATA_SOURCE_RETURN_SCHEMA_MISMATCH{}}}.
Type-mismatched batches that previously loaded by coincidence now error. There
is no opt-out.
* *SPARK-55416[SS][PYTHON] Streaming Python Data Source memory leak when
end-offset is not updated*
*Component:* SS,PYTHON
*Why no migration-guide note needed:* Should be documented: a Streaming Python
Data Source SimpleDataSourceStreamReader whose read() returns a non-empty batch
with end==start now fails with STREAM_READER_OFFSET_DID_NOT_ADVANCE instead of
leaking memory; affects existing (buggy) reader impls.
*Proposed migration-guide message:* [SS] Since Spark 4.2, a streaming Python
data source SimpleDataSourceStreamReader whose {{read()}} returns a non-empty
batch with end offset equal to start now fails with
{{SIMPLE_STREAM_READER_OFFSET_DID_NOT_ADVANCE}} instead of leaking driver
memory. Empty batches with end == start are still allowed. There is no opt-out.
* *SPARK-56206[SQL] Fix case-insensitive duplicate CTE name detection*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: duplicate CTE names
differing only in case (e.g. WITH cte1, CTE1) now raise DUPLICATED_CTE_NAMES
instead of silently overwriting; previously-accepted queries now fail. No
opt-out.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, duplicate CTE name
detection is case-insensitive. CTE definitions whose names differ only in case
(e.g. {{{}WITH cte AS (...), CTE AS (...){}}}) now raise
{{DUPLICATED_CTE_NAMES}} instead of silently overwriting the earlier
definition. There is no opt-out; rename the conflicting CTEs.
* *SPARK-56652[SQL] Always emit RELY/NORELY in DESCRIBE EXTENDED constraint
output*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: DESCRIBE EXTENDED
now always prints RELY/NORELY for table constraints (previously omitted the
default NORELY), changing the command output text for tools parsing it.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{DESCRIBE
EXTENDED}} always emits the {{{}RELY{}}}/{{{}NORELY{}}} token for table
constraints, including {{NORELY}} for the default state which was previously
omitted. This matches {{SHOW CREATE TABLE}} output and changes the command's
constraint output text for tools parsing it.
* *SPARK-55019[SQL] Allow DROP TABLE to drop VIEW*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: DROP TABLE on a
view now drops the view by default instead of raising
WRONG_COMMAND_FOR_OBJECT_TYPE; restore via
spark.sql.dropTableOnView.enabled=false.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, DROP TABLE on a view
drops the view by default instead of raising
{{{}WRONG_COMMAND_FOR_OBJECT_TYPE{}}}. To restore the previous behavior, set
{{spark.sql.dropTableOnView.enabled}} to {{{}false{}}}.
* *SPARK-54853[SQL] Always check {{hive.exec.max.dynamic.partitions}} on the
spark side*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented:
hive.exec.max.dynamic.partitions is now always enforced Spark-side and the
session-level value is honored, changing when the limit error fires; error
renamed to DYNAMIC_PARTITION_WRITE_PARTITION_NUM_LIMIT_EXCEEDED.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the
{{hive.exec.max.dynamic.partitions}} limit for dynamic partition writes to Hive
SerDe tables is always enforced on the Spark side and honors the session-level
value, changing when the limit is checked. The error is now reported as
{{{}DYNAMIC_PARTITION_WRITE_PARTITION_NUM_LIMIT_EXCEEDED{}}}.
* *SPARK-55372[SQL] Fix {{SHOW CREATE TABLE}} for tables / views with default
collation*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: typeName/toString
of an explicitly UTF8_BINARY-collated StringType/CharType now render 'string
collate UTF8_BINARY' not 'string' (default non-collated unchanged), changing
SHOW CREATE TABLE and schema output for such columns.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, a
StringType/CharType/VarcharType with an explicit {{UTF8_BINARY}} collation
renders its collation in {{{}typeName{}}}/{{{}toString{}}} (for example
{{{}string collate UTF8_BINARY{}}}), changing SHOW CREATE TABLE and schema
output for such columns. Default non-collated strings are unchanged. No opt-out.
* *SPARK-54918[SQL] Normalize floating numbers in array set operations*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented:
array_distinct/union/intersect/except and arrays_overlap now normalize floats
so 0.0/-0.0 and differently-bit NaNs are treated as equal, changing results of
these array set operations. No opt-out.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the array set
functions {{{}array_distinct{}}}, {{{}array_union{}}}, {{{}array_intersect{}}},
{{{}array_except{}}}, and {{arrays_overlap}} normalize floating-point values,
so {{0.0}} and {{-0.0}} and differently-bit NaN values are treated as equal.
This changes the results of these functions; there is no opt-out.
* *SPARK-54777[SQL] Changed dropTable error handling in
JDBCTableCatalog.dropTable(...)*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: JDBC DROP TABLE now
only swallows object-not-found errors; other failures (permission, etc.)
propagate instead of silently returning, so a drop that previously appeared to
succeed now throws.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the JDBC table
catalog only swallows object-not-found errors when running DROP TABLE; other
failures such as permission-denied or constraint violations now propagate
instead of silently returning success. A DROP TABLE that previously appeared to
succeed may now throw.
* *SPARK-57040[SQL] JDBC connector supports pushdown TABLESAMPLE SYSTEM*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: V2 JDBC TABLESAMPLE
with withReplacement=true is no longer pushed down (correctness fix; pushdown
default-on), so .sample(withReplacement=true) on JDBC tables now returns
different (correct) results.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the JDBC connector
no longer pushes down {{TABLESAMPLE}} when {{withReplacement=true}} (a
correctness fix, as no mainstream RDBMS supports sampling with replacement),
and adds {{TABLESAMPLE SYSTEM}} pushdown for PostgreSQL. Results of
sample-with-replacement on JDBC tables change accordingly.
* *SPARK-56031[SQL] Make Natural Join column matching respect case sensitivity
conf*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: NATURAL JOIN now
respects spark.sql.caseSensitive (default false), so joins on case-differing
common columns instead of degrading to CROSS JOIN, changing results; set
spark.sql.caseSensitive=true to match case-sensitively.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, NATURAL JOIN
respects {{spark.sql.caseSensitive}} (default {{{}false{}}}), so common columns
that differ only in case are joined instead of degrading to a CROSS JOIN,
changing results. To match columns case-sensitively, set
{{spark.sql.caseSensitive}} to {{{}true{}}}.
* *SPARK-31561[SQL] Add QUALIFY Clause*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: QUALIFY is now a
(non-reserved) clause keyword, so a query using unquoted QUALIFY as a trailing
table alias (FROM t QUALIFY) now parses as a QUALIFY clause; quote the
identifier to restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the {{QUALIFY}}
clause is supported, and {{QUALIFY}} becomes a (non-reserved) clause keyword. A
query using unquoted {{QUALIFY}} as a trailing table alias (e.g. {{{}FROM t
QUALIFY{}}}) is now parsed as a {{QUALIFY}} clause. To restore the previous
behavior, quote the alias (e.g. {{{}`QUALIFY`{}}}).
* *SPARK-57188[SQL] Parameterless function takes precedence over UDF parameter*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: a parameterless
built-in (current_user, current_date, etc.) now takes precedence over a
same-named SQL UDF parameter, changing UDF body results. Set
spark.sql.legacy.allowUdfParameterToShadowParameterlessFunction=true to restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, a parameterless
built-in function ({{{}current_user{}}}, {{{}current_date{}}},
{{{}session_user{}}}, etc.) takes precedence over a same-named SQL UDF
parameter in the function body. To restore the previous behavior, set
{{spark.sql.legacy.allowUdfParameterToShadowParameterlessFunction}} to
{{{}true{}}}.
* *SPARK-56045[SQL] Add flag for ignoring Parquet UNKNOWN type annotation and
revert to old behavior*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: reading Parquet
files with UNKNOWN logical-type annotation now infers physical type (e.g.
IntegerType) instead of NullType shipped in v4.1.0; opt back into NullType via
spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled=true.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, reading Parquet
files with the {{UNKNOWN}} logical-type annotation infers the physical type
(for example IntegerType) instead of the NullType used in 4.1.0. To restore the
4.1.0 behavior of inferring NullType, set
{{spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled}} to
{{{}true{}}}.
* *SPARK-56414[SQL] Per-write options should take precedence over session
config in file source writes*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: per-write options
(e.g. parquet.outputTimestampType) now override the matching session SQLConf in
Parquet/Avro writes; previously such options were silently ignored, so written
file format can change when both are set.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, per-write options
take precedence over session config for several Parquet/Avro write keys (e.g.
{{{}spark.sql.parquet.outputTimestampType{}}},
{{{}spark.sql.parquet.writeLegacyFormat{}}}). Previously such options were
silently ignored, so the written file format can change when both are set.
* *SPARK-56251[SQL] Add default fetchSize for postgres to avoid loading all
data in memory*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: Postgres JDBC reads
now default fetchSize to 1000 (was 0/all-in-memory), enabling cursor fetch with
autoCommit=false; changes default read behavior. Set fetchsize=0 to restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the PostgreSQL JDBC
dialect defaults the read {{fetchsize}} to {{1000}} (was {{{}0{}}}), enabling
cursor-based fetching with {{autoCommit=false}} to avoid loading the whole
table into memory. To restore the previous behavior, set the {{fetchsize}}
option to {{{}0{}}}.
* *SPARK-55155[SQL] Support foldable expressions in {{SET CATALOG}} statement*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: SET CATALOG with a
bare name now resolves to a session variable of that name first (if one exists)
before treating it as a catalog name; there is no opt-out config. Edge case but
a default behavior change.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, SET CATALOG accepts
foldable expressions and a bare name is first resolved as a session variable of
that name (if one exists) before being treated as a catalog name. There is no
opt-out; a session variable that shadows a catalog name changes which catalog
is set.
* *SPARK-51518[SQL] Support | as an alternative to |> for the SQL pipe
operator token*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: single-char '|' is
now a SQL pipe-operator token by default, so a query using a pipe keyword as a
column name after bitwise-OR may reparse; restore via
spark.sql.parser.singleCharacterPipeOperator.enabled=false.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the SQL parser
accepts single-character {{|}} as an alternative to {{|>}} for the pipe
operator token by default. A pipe keyword used as a column name after a
bitwise-OR {{|}} may now reparse. To restore the previous behavior, set
{{spark.sql.parser.singleCharacterPipeOperator.enabled}} to {{{}false{}}}.
* *SPARK-52812[SQL] Make Spark Connect Catalog.createTable eager*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: Spark Connect
Catalog.createTable now executes eagerly instead of lazily, so the table is
created (and errors like already-exists surface) immediately at the call rather
than on a later action.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, Spark Connect
{{Catalog.createTable}} executes eagerly: the table is created (and errors such
as table-already-exists surface) immediately at the call rather than lazily on
a later action. Code relying on the previous lazy behavior is affected.
* *SPARK-55198[SQL] spark-sql should skip comment line with leading
whitespaces*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: spark-sql CLI now
skips comment lines that have leading whitespace before – (line.trim startsWith
--), matching Hive/beeline; such lines were previously sent as SQL. CLI-only,
no opt-out.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the spark-sql CLI
skips comment lines whose first non-whitespace characters are {{-{-}{-}}} (i.e.
{{line.trim}} starts with {{{}-{}}}), aligning with Hive and beeline.
Previously such leading-whitespace comment lines were sent as SQL. There is no
opt-out.
* *SPARK-49110[SQL] Simplify SubqueryAlias.metadataOutput to always propagate
metadata columns*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: SubqueryAlias now
always propagates metadata columns by default; queries that failed with
AnalysisException may now succeed and joins may newly raise ambiguous-column
errors. Restore via subqueryAliasAlwaysPropagateMetadataColumns=false.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{SubqueryAlias}}
always propagates metadata columns from its child, so some queries that
previously failed with AnalysisException now succeed and joins may raise new
ambiguous-column errors. To restore the legacy behavior, set
{{{}spark.sql.analyzer.subqueryAliasAlwaysPropagateMetadataColumns=false{}}}.
* *SPARK-56678[SQL] Use structured Catalog/Namespace/Table rows in DESCRIBE
TABLE EXTENDED for v2 tables and views*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: DESCRIBE TABLE
EXTENDED for v2 tables/views now emits structured
Catalog/Namespace/Database/Table rows instead of a single Name/Identifier row;
consumers parsing the output are affected.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{DESCRIBE TABLE
EXTENDED}} for v2 tables and views emits structured {{{}Catalog{}}},
{{{}Namespace{}}}, {{{}Database{}}}, and {{{}Table{}}}/{{{}View{}}} rows
instead of a single {{{}Name{}}}/{{{}Identifier{}}} row. Consumers that parse
the command output may be affected.
* *SPARK-56654[SQL] Reject unpaired UTF-16 surrogates in Variant JSON parsing*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented:
parse_json/try_parse_json/from_json('variant') now reject unpaired UTF-16
surrogates (error/NULL) instead of substituting U+FFFD; previously-accepted
JSON now fails. Set spark.sql.variant.validateUnicodeInJsonParsing=false to
restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{{}parse_json{}}},
{{{}try_parse_json{}}}, and {{from_json}} to variant reject unpaired UTF-16
surrogates (raising an error or returning NULL) instead of silently
substituting U+FFFD. To restore the previous permissive behavior, set
{{spark.sql.variant.validateUnicodeInJsonParsing}} to {{{}false{}}}.
* *SPARK-56554[SQL] Respect inferSchema option when parsing XML as variant*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: reading XML as
Variant now honors inferSchema=false (leaf/attribute text kept as strings, not
inferred boolean/long/decimal), changing results. Set
spark.sql.xml.variant.respectInferSchema=false to restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, reading XML as
Variant honors {{{}inferSchema=false{}}}, keeping leaf text and attribute
values as strings instead of inferring boolean/long/decimal. To restore the
previous behavior of always inferring types, set
{{spark.sql.xml.variant.respectInferSchema}} to {{{}false{}}}.
* *SPARK-54718[SQL] Preserve attributes names during CTE newInstance()*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: CTE newInstance()
now preserves attribute name casing by default, changing output column-name
casing for self-joins on a CTE with case-differing duplicate columns; restore
via spark.sql.legacy.cteDuplicateAttributeNames.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, CTE relation
references preserve attribute name casing when re-instantiated, so self-joins
on a CTE with case-differing duplicate columns keep the original column-name
casing in the output. To restore the previous behavior, set
{{spark.sql.legacy.cteDuplicateAttributeNames}} to {{{}true{}}}.
* *SPARK-56280[SS] normalize NaN and +/-0.0 in streaming dedupe node*
*Component:* SS
*Why no migration-guide note needed:* Should be documented: streaming
dropDuplicates on float/double keys now normalizes NaN and +/-0.0, so
differently-bit NaNs and signed zeros are treated as duplicates; dedup results
change with no opt-out.
*Proposed migration-guide message:* [SS] Since Spark 4.2, streaming
{{{}dropDuplicates{}}}/{{{}dropDuplicatesWithinWatermark{}}} on float or double
key columns normalize NaN and signed zero, so differently-bit NaN values and
{{{}+0.0{}}}/{{{}-0.0{}}} are treated as duplicates. This changes deduplication
results for queries with floating-point keys; there is no opt-out.
* *SPARK-55058[SS] Throw error on inconsistent checkpoint metadata*
*Component:* SS
*Why no migration-guide note needed:* Should be documented: restarting a
streaming query whose checkpoint has offset/commit logs but no metadata file
now fails with MISSING_METADATA_FILE by default instead of starting a new query
id; disable via the verifyMetadataExists.enabled config=false.
*Proposed migration-guide message:* [SS] Since Spark 4.2, restarting a
streaming query whose checkpoint has offset and commit logs but no metadata
file fails with {{STREAMING_CHECKPOINT_MISSING_METADATA_FILE}} instead of
silently generating a new query id. To restore the previous behavior, set
{{spark.sql.streaming.checkpoint.verifyMetadataExists.enabled}} to
{{{}false{}}}.
* *SPARK-56239[UI] Fix SQL tab DataTables: API default limit, date format, and
appId resolution*
*Component:* UI
*Why no migration-guide note needed:* Should be documented: the long-existing
/applications/\{appId}/sql REST endpoint default length changed from 20 to -1,
so it now returns all SQL executions by default; clients relying on the 20-row
default get more rows.
*Proposed migration-guide message:* [Core] Since Spark 4.2, the
{{/applications/\{appId}/sql}} REST endpoint defaults the {{length}} parameter
to {{-1}} (was {{{}20{}}}), so it returns all SQL executions by default. To
restore the previous behavior, pass {{length=20}} (any {{length <= 0}} returns
all executions).
* *SPARK-55075[K8S] Track executor pod creation errors with
ExecutorFailureTracker*
*Component:* K8S
*Why no migration-guide note needed:* Should be documented: on K8s, executor
pod-creation failures are now caught, logged and counted by
ExecutorFailureTracker (continue until max failures) instead of being rethrown
immediately, changing default failure semantics for K8s deployments.
*Proposed migration-guide message:* [Core] Since Spark 4.2, on Kubernetes
executor pod-creation failures are caught, logged, and counted by
ExecutorFailureTracker (allocation continues until the max-failures threshold)
instead of being rethrown immediately. There is no opt-out.
was:
* *[SPARK-55314][CONNECT] Propagate observed metrics errors to client*
*Component:* CONNECT
*Why no migration-guide note needed:* Should be documented: Observation.get
(Scala and Connect) now raises the underlying exception when metric collection
fails instead of returning an empty map; code that tolerated an empty result on
failure now sees a thrown exception.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{Observation.get}}
(Scala and Spark Connect) raises the underlying exception when observed-metric
collection fails instead of silently returning an empty map. Code that
tolerated an empty result on failure now sees a thrown exception. There is no
opt-out.
* *[SPARK-55655][MLLIB] Make {{CountVectorizer}} vocabulary deterministic when
counts are equal*
*Component:* MLLIB
*Why no migration-guide note needed:* Should be documented: CountVectorizer now
breaks ties between equal-count terms lexicographically, making the vocabulary
deterministic; this can change vocabulary term order and feature indices versus
prior (non-deterministic) output. No opt-out.
*Proposed migration-guide message:* [Core] Since Spark 4.2, {{CountVectorizer}}
breaks ties between equal-count terms lexicographically so the vocabulary is
deterministic. This can change vocabulary term order and feature indices
compared with prior (non-deterministic) output. There is no opt-out.
* *[SPARK-47997][PS] Add errors parameter to DataFrame.drop and Series.drop*
*Component:* PS
*Why no migration-guide note needed:* Should be documented: ps
DataFrame.drop/Series.drop now raise KeyError if ANY label is missing
(previously only if all missing), for all pandas versions; pass errors='ignore'
to skip missing labels.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark
DataFrame.drop/Series.drop add an {{errors}} parameter defaulting to
{{'raise'}} and now raise KeyError if any requested label is missing
(previously only if all were missing), across all pandas versions. To skip
missing labels, pass {{{}errors='ignore'{}}}.
* *[SPARK-56219][PS] Align groupby idxmax and idxmin skipna=False behavior
with pandas 2/3*
*Component:* PS
*Why no migration-guide note needed:* Should be documented: pandas-on-Spark
groupby idxmax/idxmin with skipna=False now returns null for NA groups (pandas
2) or raises on NA inputs (pandas 3) instead of a label; no opt-out, results
change for existing skipna=False users.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark
{{GroupBy.idxmax}} and {{GroupBy.idxmin}} with {{skipna=False}} now follow
pandas semantics for NA: with pandas 2 they return null for groups containing
NA values, and with pandas 3 they raise on NA-containing inputs, instead of
returning an index label. There is no opt-out.
* *[SPARK-55977][PS] Fix isin() to use strict type matching like pandas*
*Component:* PS
*Why no migration-guide note needed:* Should be documented: ps
Series/DataFrame.isin() now uses strict Python-type matching (e.g. 1 no longer
matches '1'), changing results for all pandas versions; this is pandas-parity,
no opt-out config.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark
Series/DataFrame {{isin()}} uses strict Python-type matching like pandas, so
values of incompatible types no longer match (for example integer 1 no longer
matches string '1'). Results change across all pandas versions. There is no
opt-out.
* *[SPARK-54568][PYTHON] Avoid unnecessary pandas conversion in create
dataframe from ndarray*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: createDataFrame
from a numpy ndarray now requires pyarrow and converts ndarray to Arrow
directly, dropping np.dtype-based StructType inference so inferred schema can
differ.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, createDataFrame
from a NumPy ndarray converts the array directly to an Arrow Table and now
requires PyArrow. The previous np.dtype-based StructType inference is dropped,
so the inferred schema may differ. To control the schema, pass an explicit
schema.
* *[SPARK-56186][PYTHON] Retire pypy*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: PyPy is no longer
officially supported in PySpark (CI, docker image, classifier, and
PyPy-specific code removed); PyPy users should migrate to CPython.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, PyPy is no longer
officially supported in PySpark: PyPy CI, the PyPy docker image, the setup.py
classifier, and PyPy-specific code/test skips have been removed. PyPy users
should migrate to CPython. There is no opt-out.
* *[SPARK-55096][PYTHON] Update pandas minimum version in {{connect/setup.py}}*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: minimum pandas
raised to 2.2.0 for Spark Connect (was 2.0.0); pandas <2.2 is no longer
supported on Connect.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, the minimum
supported version for pandas on Spark Connect has been raised from 2.0.0 to
2.2.0, matching the minimum already required by PySpark.
* *[SPARK-54962][PYTHON] Fix nullable integers handling in Pandas UDF*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: Pandas UDFs on
nullable integer columns containing nulls now use a nullable Int extension
dtype instead of float64, so values/dtype inside the UDF change (fixing
precision loss for large integers); no opt-out.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, Pandas UDFs on a
nullable integer column that contains nulls receive a pandas nullable integer
extension dtype (e.g. Int64) instead of float64, fixing precision loss for
large integers. The dtype and values seen inside the UDF change accordingly.
There is no opt-out configuration.
* *[SPARK-55583][PYTHON] Validate Arrow schema types in Python data source*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: a Python Data
Source read returning a pa.RecordBatch whose Arrow types differ from the
declared schema now fails with DATA_SOURCE_RETURN_SCHEMA_MISMATCH;
type-mismatched batches that previously loaded by coincidence now error.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, a Python data
source read that returns a {{pa.RecordBatch}} whose Arrow types differ from the
declared schema now fails with {{{}DATA_SOURCE_RETURN_SCHEMA_MISMATCH{}}}.
Type-mismatched batches that previously loaded by coincidence now error. There
is no opt-out.
* *[SPARK-55416][SS][PYTHON] Streaming Python Data Source memory leak when
end-offset is not updated*
*Component:* SS,PYTHON
*Why no migration-guide note needed:* Should be documented: a Streaming Python
Data Source SimpleDataSourceStreamReader whose read() returns a non-empty batch
with end==start now fails with STREAM_READER_OFFSET_DID_NOT_ADVANCE instead of
leaking memory; affects existing (buggy) reader impls.
*Proposed migration-guide message:* [SS] Since Spark 4.2, a streaming Python
data source SimpleDataSourceStreamReader whose {{read()}} returns a non-empty
batch with end offset equal to start now fails with
{{SIMPLE_STREAM_READER_OFFSET_DID_NOT_ADVANCE}} instead of leaking driver
memory. Empty batches with end == start are still allowed. There is no opt-out.
* *[SPARK-56206][SQL] Fix case-insensitive duplicate CTE name detection*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: duplicate CTE names
differing only in case (e.g. WITH cte1, CTE1) now raise DUPLICATED_CTE_NAMES
instead of silently overwriting; previously-accepted queries now fail. No
opt-out.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, duplicate CTE name
detection is case-insensitive. CTE definitions whose names differ only in case
(e.g. {{{}WITH cte AS (...), CTE AS (...){}}}) now raise
{{DUPLICATED_CTE_NAMES}} instead of silently overwriting the earlier
definition. There is no opt-out; rename the conflicting CTEs.
* *[SPARK-56652][SQL] Always emit RELY/NORELY in DESCRIBE EXTENDED constraint
output*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: DESCRIBE EXTENDED
now always prints RELY/NORELY for table constraints (previously omitted the
default NORELY), changing the command output text for tools parsing it.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{DESCRIBE
EXTENDED}} always emits the {{{}RELY{}}}/{{{}NORELY{}}} token for table
constraints, including {{NORELY}} for the default state which was previously
omitted. This matches {{SHOW CREATE TABLE}} output and changes the command's
constraint output text for tools parsing it.
* *[SPARK-55019][SQL] Allow DROP TABLE to drop VIEW*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: DROP TABLE on a
view now drops the view by default instead of raising
WRONG_COMMAND_FOR_OBJECT_TYPE; restore via
spark.sql.dropTableOnView.enabled=false.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, DROP TABLE on a view
drops the view by default instead of raising
{{{}WRONG_COMMAND_FOR_OBJECT_TYPE{}}}. To restore the previous behavior, set
{{spark.sql.dropTableOnView.enabled}} to {{{}false{}}}.
* *[SPARK-54853][SQL] Always check {{hive.exec.max.dynamic.partitions}} on the
spark side*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented:
hive.exec.max.dynamic.partitions is now always enforced Spark-side and the
session-level value is honored, changing when the limit error fires; error
renamed to DYNAMIC_PARTITION_WRITE_PARTITION_NUM_LIMIT_EXCEEDED.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the
{{hive.exec.max.dynamic.partitions}} limit for dynamic partition writes to Hive
SerDe tables is always enforced on the Spark side and honors the session-level
value, changing when the limit is checked. The error is now reported as
{{{}DYNAMIC_PARTITION_WRITE_PARTITION_NUM_LIMIT_EXCEEDED{}}}.
* *[SPARK-55372][SQL] Fix {{SHOW CREATE TABLE}} for tables / views with
default collation*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: typeName/toString
of an explicitly UTF8_BINARY-collated StringType/CharType now render 'string
collate UTF8_BINARY' not 'string' (default non-collated unchanged), changing
SHOW CREATE TABLE and schema output for such columns.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, a
StringType/CharType/VarcharType with an explicit {{UTF8_BINARY}} collation
renders its collation in {{{}typeName{}}}/{{{}toString{}}} (for example
{{{}string collate UTF8_BINARY{}}}), changing SHOW CREATE TABLE and schema
output for such columns. Default non-collated strings are unchanged. No opt-out.
* *[SPARK-54918][SQL] Normalize floating numbers in array set operations*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented:
array_distinct/union/intersect/except and arrays_overlap now normalize floats
so 0.0/-0.0 and differently-bit NaNs are treated as equal, changing results of
these array set operations. No opt-out.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the array set
functions {{{}array_distinct{}}}, {{{}array_union{}}}, {{{}array_intersect{}}},
{{{}array_except{}}}, and {{arrays_overlap}} normalize floating-point values,
so {{0.0}} and {{-0.0}} and differently-bit NaN values are treated as equal.
This changes the results of these functions; there is no opt-out.
* *[SPARK-54777][SQL] Changed dropTable error handling in
JDBCTableCatalog.dropTable(...)*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: JDBC DROP TABLE now
only swallows object-not-found errors; other failures (permission, etc.)
propagate instead of silently returning, so a drop that previously appeared to
succeed now throws.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the JDBC table
catalog only swallows object-not-found errors when running DROP TABLE; other
failures such as permission-denied or constraint violations now propagate
instead of silently returning success. A DROP TABLE that previously appeared to
succeed may now throw.
* *[SPARK-57040][SQL] JDBC connector supports pushdown TABLESAMPLE SYSTEM*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: V2 JDBC TABLESAMPLE
with withReplacement=true is no longer pushed down (correctness fix; pushdown
default-on), so .sample(withReplacement=true) on JDBC tables now returns
different (correct) results.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the JDBC connector
no longer pushes down {{TABLESAMPLE}} when {{withReplacement=true}} (a
correctness fix, as no mainstream RDBMS supports sampling with replacement),
and adds {{TABLESAMPLE SYSTEM}} pushdown for PostgreSQL. Results of
sample-with-replacement on JDBC tables change accordingly.
* *[SPARK-56031][SQL] Make Natural Join column matching respect case
sensitivity conf*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: NATURAL JOIN now
respects spark.sql.caseSensitive (default false), so joins on case-differing
common columns instead of degrading to CROSS JOIN, changing results; set
spark.sql.caseSensitive=true to match case-sensitively.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, NATURAL JOIN
respects {{spark.sql.caseSensitive}} (default {{{}false{}}}), so common columns
that differ only in case are joined instead of degrading to a CROSS JOIN,
changing results. To match columns case-sensitively, set
{{spark.sql.caseSensitive}} to {{{}true{}}}.
* *[SPARK-31561][SQL] Add QUALIFY Clause*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: QUALIFY is now a
(non-reserved) clause keyword, so a query using unquoted QUALIFY as a trailing
table alias (FROM t QUALIFY) now parses as a QUALIFY clause; quote the
identifier to restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the {{QUALIFY}}
clause is supported, and {{QUALIFY}} becomes a (non-reserved) clause keyword. A
query using unquoted {{QUALIFY}} as a trailing table alias (e.g. {{{}FROM t
QUALIFY{}}}) is now parsed as a {{QUALIFY}} clause. To restore the previous
behavior, quote the alias (e.g. {{{}`QUALIFY`{}}}).
* *[SPARK-57188][SQL] Parameterless function takes precedence over UDF
parameter*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: a parameterless
built-in (current_user, current_date, etc.) now takes precedence over a
same-named SQL UDF parameter, changing UDF body results. Set
spark.sql.legacy.allowUdfParameterToShadowParameterlessFunction=true to restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, a parameterless
built-in function ({{{}current_user{}}}, {{{}current_date{}}},
{{{}session_user{}}}, etc.) takes precedence over a same-named SQL UDF
parameter in the function body. To restore the previous behavior, set
{{spark.sql.legacy.allowUdfParameterToShadowParameterlessFunction}} to
{{{}true{}}}.
* *[SPARK-56045][SQL] Add flag for ignoring Parquet UNKNOWN type annotation
and revert to old behavior*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: reading Parquet
files with UNKNOWN logical-type annotation now infers physical type (e.g.
IntegerType) instead of NullType shipped in v4.1.0; opt back into NullType via
spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled=true.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, reading Parquet
files with the {{UNKNOWN}} logical-type annotation infers the physical type
(for example IntegerType) instead of the NullType used in 4.1.0. To restore the
4.1.0 behavior of inferring NullType, set
{{spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled}} to
{{{}true{}}}.
* *[SPARK-56414][SQL] Per-write options should take precedence over session
config in file source writes*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: per-write options
(e.g. parquet.outputTimestampType) now override the matching session SQLConf in
Parquet/Avro writes; previously such options were silently ignored, so written
file format can change when both are set.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, per-write options
take precedence over session config for several Parquet/Avro write keys (e.g.
{{{}spark.sql.parquet.outputTimestampType{}}},
{{{}spark.sql.parquet.writeLegacyFormat{}}}). Previously such options were
silently ignored, so the written file format can change when both are set.
* *[SPARK-56251][SQL] Add default fetchSize for postgres to avoid loading all
data in memory*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: Postgres JDBC reads
now default fetchSize to 1000 (was 0/all-in-memory), enabling cursor fetch with
autoCommit=false; changes default read behavior. Set fetchsize=0 to restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the PostgreSQL JDBC
dialect defaults the read {{fetchsize}} to {{1000}} (was {{{}0{}}}), enabling
cursor-based fetching with {{autoCommit=false}} to avoid loading the whole
table into memory. To restore the previous behavior, set the {{fetchsize}}
option to {{{}0{}}}.
* *[SPARK-55155][SQL] Support foldable expressions in {{SET CATALOG}}
statement*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: SET CATALOG with a
bare name now resolves to a session variable of that name first (if one exists)
before treating it as a catalog name; there is no opt-out config. Edge case but
a default behavior change.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, SET CATALOG accepts
foldable expressions and a bare name is first resolved as a session variable of
that name (if one exists) before being treated as a catalog name. There is no
opt-out; a session variable that shadows a catalog name changes which catalog
is set.
* *[SPARK-51518][SQL] Support | as an alternative to |> for the SQL pipe
operator token*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: single-char '|' is
now a SQL pipe-operator token by default, so a query using a pipe keyword as a
column name after bitwise-OR may reparse; restore via
spark.sql.parser.singleCharacterPipeOperator.enabled=false.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the SQL parser
accepts single-character {{|}} as an alternative to {{|>}} for the pipe
operator token by default. A pipe keyword used as a column name after a
bitwise-OR {{|}} may now reparse. To restore the previous behavior, set
{{spark.sql.parser.singleCharacterPipeOperator.enabled}} to {{{}false{}}}.
* *[SPARK-52812][SQL] Make Spark Connect Catalog.createTable eager*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: Spark Connect
Catalog.createTable now executes eagerly instead of lazily, so the table is
created (and errors like already-exists surface) immediately at the call rather
than on a later action.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, Spark Connect
{{Catalog.createTable}} executes eagerly: the table is created (and errors such
as table-already-exists surface) immediately at the call rather than lazily on
a later action. Code relying on the previous lazy behavior is affected.
* *[SPARK-55198][SQL] spark-sql should skip comment line with leading
whitespaces*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: spark-sql CLI now
skips comment lines that have leading whitespace before -- (line.trim
startsWith --), matching Hive/beeline; such lines were previously sent as SQL.
CLI-only, no opt-out.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the spark-sql CLI
skips comment lines whose first non-whitespace characters are {{--}} (i.e.
{{line.trim}} starts with {{{}--{}}}), aligning with Hive and beeline.
Previously such leading-whitespace comment lines were sent as SQL. There is no
opt-out.
* *[SPARK-49110][SQL] Simplify SubqueryAlias.metadataOutput to always
propagate metadata columns*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: SubqueryAlias now
always propagates metadata columns by default; queries that failed with
AnalysisException may now succeed and joins may newly raise ambiguous-column
errors. Restore via subqueryAliasAlwaysPropagateMetadataColumns=false.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{SubqueryAlias}}
always propagates metadata columns from its child, so some queries that
previously failed with AnalysisException now succeed and joins may raise new
ambiguous-column errors. To restore the legacy behavior, set
{{{}spark.sql.analyzer.subqueryAliasAlwaysPropagateMetadataColumns=false{}}}.
* *[SPARK-56678][SQL] Use structured Catalog/Namespace/Table rows in DESCRIBE
TABLE EXTENDED for v2 tables and views*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: DESCRIBE TABLE
EXTENDED for v2 tables/views now emits structured
Catalog/Namespace/Database/Table rows instead of a single Name/Identifier row;
consumers parsing the output are affected.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{DESCRIBE TABLE
EXTENDED}} for v2 tables and views emits structured {{{}Catalog{}}},
{{{}Namespace{}}}, {{{}Database{}}}, and {{{}Table{}}}/{{{}View{}}} rows
instead of a single {{{}Name{}}}/{{{}Identifier{}}} row. Consumers that parse
the command output may be affected.
* *[SPARK-56654][SQL] Reject unpaired UTF-16 surrogates in Variant JSON
parsing*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented:
parse_json/try_parse_json/from_json('variant') now reject unpaired UTF-16
surrogates (error/NULL) instead of substituting U+FFFD; previously-accepted
JSON now fails. Set spark.sql.variant.validateUnicodeInJsonParsing=false to
restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{{}parse_json{}}},
{{{}try_parse_json{}}}, and {{from_json}} to variant reject unpaired UTF-16
surrogates (raising an error or returning NULL) instead of silently
substituting U+FFFD. To restore the previous permissive behavior, set
{{spark.sql.variant.validateUnicodeInJsonParsing}} to {{{}false{}}}.
* *[SPARK-56554][SQL] Respect inferSchema option when parsing XML as variant*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: reading XML as
Variant now honors inferSchema=false (leaf/attribute text kept as strings, not
inferred boolean/long/decimal), changing results. Set
spark.sql.xml.variant.respectInferSchema=false to restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, reading XML as
Variant honors {{{}inferSchema=false{}}}, keeping leaf text and attribute
values as strings instead of inferring boolean/long/decimal. To restore the
previous behavior of always inferring types, set
{{spark.sql.xml.variant.respectInferSchema}} to {{{}false{}}}.
* *[SPARK-54718][SQL] Preserve attributes names during CTE newInstance()*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: CTE newInstance()
now preserves attribute name casing by default, changing output column-name
casing for self-joins on a CTE with case-differing duplicate columns; restore
via spark.sql.legacy.cteDuplicateAttributeNames.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, CTE relation
references preserve attribute name casing when re-instantiated, so self-joins
on a CTE with case-differing duplicate columns keep the original column-name
casing in the output. To restore the previous behavior, set
{{spark.sql.legacy.cteDuplicateAttributeNames}} to {{{}true{}}}.
* *[SPARK-56280][SS] normalize NaN and +/-0.0 in streaming dedupe node*
*Component:* SS
*Why no migration-guide note needed:* Should be documented: streaming
dropDuplicates on float/double keys now normalizes NaN and +/-0.0, so
differently-bit NaNs and signed zeros are treated as duplicates; dedup results
change with no opt-out.
*Proposed migration-guide message:* [SS] Since Spark 4.2, streaming
{{{}dropDuplicates{}}}/{{{}dropDuplicatesWithinWatermark{}}} on float or double
key columns normalize NaN and signed zero, so differently-bit NaN values and
{{{}+0.0{}}}/{{{}-0.0{}}} are treated as duplicates. This changes deduplication
results for queries with floating-point keys; there is no opt-out.
* *[SPARK-55058][SS] Throw error on inconsistent checkpoint metadata*
*Component:* SS
*Why no migration-guide note needed:* Should be documented: restarting a
streaming query whose checkpoint has offset/commit logs but no metadata file
now fails with MISSING_METADATA_FILE by default instead of starting a new query
id; disable via the verifyMetadataExists.enabled config=false.
*Proposed migration-guide message:* [SS] Since Spark 4.2, restarting a
streaming query whose checkpoint has offset and commit logs but no metadata
file fails with {{STREAMING_CHECKPOINT_MISSING_METADATA_FILE}} instead of
silently generating a new query id. To restore the previous behavior, set
{{spark.sql.streaming.checkpoint.verifyMetadataExists.enabled}} to
{{{}false{}}}.
* *[SPARK-56239][UI] Fix SQL tab DataTables: API default limit, date format,
and appId resolution*
*Component:* UI
*Why no migration-guide note needed:* Should be documented: the long-existing
/applications/\{appId}/sql REST endpoint default length changed from 20 to -1,
so it now returns all SQL executions by default; clients relying on the 20-row
default get more rows.
*Proposed migration-guide message:* [Core] Since Spark 4.2, the
{{/applications/\{appId}/sql}} REST endpoint defaults the {{length}} parameter
to {{-1}} (was {{{}20{}}}), so it returns all SQL executions by default. To
restore the previous behavior, pass {{length=20}} (any {{length <= 0}} returns
all executions).
* *[SPARK-55075][K8S] Track executor pod creation errors with
ExecutorFailureTracker*
*Component:* K8S
*Why no migration-guide note needed:* Should be documented: on K8s, executor
pod-creation failures are now caught, logged and counted by
ExecutorFailureTracker (continue until max failures) instead of being rethrown
immediately, changing default failure semantics for K8s deployments.
*Proposed migration-guide message:* [Core] Since Spark 4.2, on Kubernetes
executor pod-creation failures are caught, logged, and counted by
ExecutorFailureTracker (allocation continues until the max-failures threshold)
instead of being rethrown immediately. There is no opt-out.
> Auditing the migration guide
> ----------------------------
>
> Key: SPARK-57452
> URL: https://issues.apache.org/jira/browse/SPARK-57452
> Project: Spark
> Issue Type: Sub-task
> Components: Kubernetes, MLlib, PySpark, Spark Core, SQL, Structured
> Streaming, Web UI
> Affects Versions: 4.2.0
> Reporter: Xiao Li
> Priority: Blocker
>
> I used AI to analyze more than 1,900 commits from the Spark 4.2.0 release and
> identified 38 changes that appear to be missing from the migration guide.
> The JIRAs listed below were identified through this analysis. However, this
> may not be a complete list, so please also review the remaining commits for
> any additional migration guide updates that may be required.
>
> * *SPARK-55314[CONNECT] Propagate observed metrics errors to client*
> *Component:* CONNECT
> *Why no migration-guide note needed:* Should be documented: Observation.get
> (Scala and Connect) now raises the underlying exception when metric
> collection fails instead of returning an empty map; code that tolerated an
> empty result on failure now sees a thrown exception.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2,
> {{Observation.get}} (Scala and Spark Connect) raises the underlying exception
> when observed-metric collection fails instead of silently returning an empty
> map. Code that tolerated an empty result on failure now sees a thrown
> exception. There is no opt-out.
> * *SPARK-55655[MLLIB] Make {{CountVectorizer}} vocabulary deterministic when
> counts are equal*
> *Component:* MLLIB
> *Why no migration-guide note needed:* Should be documented: CountVectorizer
> now breaks ties between equal-count terms lexicographically, making the
> vocabulary deterministic; this can change vocabulary term order and feature
> indices versus prior (non-deterministic) output. No opt-out.
> *Proposed migration-guide message:* [Core] Since Spark 4.2,
> {{CountVectorizer}} breaks ties between equal-count terms lexicographically
> so the vocabulary is deterministic. This can change vocabulary term order and
> feature indices compared with prior (non-deterministic) output. There is no
> opt-out.
> * *SPARK-47997[PS] Add errors parameter to DataFrame.drop and Series.drop*
> *Component:* PS
> *Why no migration-guide note needed:* Should be documented: ps
> DataFrame.drop/Series.drop now raise KeyError if ANY label is missing
> (previously only if all missing), for all pandas versions; pass
> errors='ignore' to skip missing labels.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark
> DataFrame.drop/Series.drop add an {{errors}} parameter defaulting to
> {{'raise'}} and now raise KeyError if any requested label is missing
> (previously only if all were missing), across all pandas versions. To skip
> missing labels, pass {{{}errors='ignore'{}}}.
> * *SPARK-56219[PS] Align groupby idxmax and idxmin skipna=False behavior
> with pandas 2/3*
> *Component:* PS
> *Why no migration-guide note needed:* Should be documented: pandas-on-Spark
> groupby idxmax/idxmin with skipna=False now returns null for NA groups
> (pandas 2) or raises on NA inputs (pandas 3) instead of a label; no opt-out,
> results change for existing skipna=False users.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark
> {{GroupBy.idxmax}} and {{GroupBy.idxmin}} with {{skipna=False}} now follow
> pandas semantics for NA: with pandas 2 they return null for groups containing
> NA values, and with pandas 3 they raise on NA-containing inputs, instead of
> returning an index label. There is no opt-out.
> * *SPARK-55977[PS] Fix isin() to use strict type matching like pandas*
> *Component:* PS
> *Why no migration-guide note needed:* Should be documented: ps
> Series/DataFrame.isin() now uses strict Python-type matching (e.g. 1 no
> longer matches '1'), changing results for all pandas versions; this is
> pandas-parity, no opt-out config.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark
> Series/DataFrame {{isin()}} uses strict Python-type matching like pandas, so
> values of incompatible types no longer match (for example integer 1 no longer
> matches string '1'). Results change across all pandas versions. There is no
> opt-out.
> * *SPARK-54568[PYTHON] Avoid unnecessary pandas conversion in create
> dataframe from ndarray*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: createDataFrame
> from a numpy ndarray now requires pyarrow and converts ndarray to Arrow
> directly, dropping np.dtype-based StructType inference so inferred schema can
> differ.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, createDataFrame
> from a NumPy ndarray converts the array directly to an Arrow Table and now
> requires PyArrow. The previous np.dtype-based StructType inference is
> dropped, so the inferred schema may differ. To control the schema, pass an
> explicit schema.
> * *SPARK-56186[PYTHON] Retire pypy*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: PyPy is no longer
> officially supported in PySpark (CI, docker image, classifier, and
> PyPy-specific code removed); PyPy users should migrate to CPython.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, PyPy is no longer
> officially supported in PySpark: PyPy CI, the PyPy docker image, the setup.py
> classifier, and PyPy-specific code/test skips have been removed. PyPy users
> should migrate to CPython. There is no opt-out.
> * *SPARK-55096[PYTHON] Update pandas minimum version in {{connect/setup.py}}*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: minimum pandas
> raised to 2.2.0 for Spark Connect (was 2.0.0); pandas <2.2 is no longer
> supported on Connect.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, the minimum
> supported version for pandas on Spark Connect has been raised from 2.0.0 to
> 2.2.0, matching the minimum already required by PySpark.
> * *SPARK-54962[PYTHON] Fix nullable integers handling in Pandas UDF*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: Pandas UDFs on
> nullable integer columns containing nulls now use a nullable Int extension
> dtype instead of float64, so values/dtype inside the UDF change (fixing
> precision loss for large integers); no opt-out.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, Pandas UDFs on a
> nullable integer column that contains nulls receive a pandas nullable integer
> extension dtype (e.g. Int64) instead of float64, fixing precision loss for
> large integers. The dtype and values seen inside the UDF change accordingly.
> There is no opt-out configuration.
> * *SPARK-55583[PYTHON] Validate Arrow schema types in Python data source*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: a Python Data
> Source read returning a pa.RecordBatch whose Arrow types differ from the
> declared schema now fails with DATA_SOURCE_RETURN_SCHEMA_MISMATCH;
> type-mismatched batches that previously loaded by coincidence now error.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, a Python data
> source read that returns a {{pa.RecordBatch}} whose Arrow types differ from
> the declared schema now fails with
> {{{}DATA_SOURCE_RETURN_SCHEMA_MISMATCH{}}}. Type-mismatched batches that
> previously loaded by coincidence now error. There is no opt-out.
> * *SPARK-55416[SS][PYTHON] Streaming Python Data Source memory leak when
> end-offset is not updated*
> *Component:* SS,PYTHON
> *Why no migration-guide note needed:* Should be documented: a Streaming
> Python Data Source SimpleDataSourceStreamReader whose read() returns a
> non-empty batch with end==start now fails with
> STREAM_READER_OFFSET_DID_NOT_ADVANCE instead of leaking memory; affects
> existing (buggy) reader impls.
> *Proposed migration-guide message:* [SS] Since Spark 4.2, a streaming Python
> data source SimpleDataSourceStreamReader whose {{read()}} returns a non-empty
> batch with end offset equal to start now fails with
> {{SIMPLE_STREAM_READER_OFFSET_DID_NOT_ADVANCE}} instead of leaking driver
> memory. Empty batches with end == start are still allowed. There is no
> opt-out.
> * *SPARK-56206[SQL] Fix case-insensitive duplicate CTE name detection*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: duplicate CTE
> names differing only in case (e.g. WITH cte1, CTE1) now raise
> DUPLICATED_CTE_NAMES instead of silently overwriting; previously-accepted
> queries now fail. No opt-out.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, duplicate CTE name
> detection is case-insensitive. CTE definitions whose names differ only in
> case (e.g. {{{}WITH cte AS (...), CTE AS (...){}}}) now raise
> {{DUPLICATED_CTE_NAMES}} instead of silently overwriting the earlier
> definition. There is no opt-out; rename the conflicting CTEs.
> * *SPARK-56652[SQL] Always emit RELY/NORELY in DESCRIBE EXTENDED constraint
> output*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: DESCRIBE EXTENDED
> now always prints RELY/NORELY for table constraints (previously omitted the
> default NORELY), changing the command output text for tools parsing it.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, {{DESCRIBE
> EXTENDED}} always emits the {{{}RELY{}}}/{{{}NORELY{}}} token for table
> constraints, including {{NORELY}} for the default state which was previously
> omitted. This matches {{SHOW CREATE TABLE}} output and changes the command's
> constraint output text for tools parsing it.
> * *SPARK-55019[SQL] Allow DROP TABLE to drop VIEW*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: DROP TABLE on a
> view now drops the view by default instead of raising
> WRONG_COMMAND_FOR_OBJECT_TYPE; restore via
> spark.sql.dropTableOnView.enabled=false.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, DROP TABLE on a
> view drops the view by default instead of raising
> {{{}WRONG_COMMAND_FOR_OBJECT_TYPE{}}}. To restore the previous behavior, set
> {{spark.sql.dropTableOnView.enabled}} to {{{}false{}}}.
> * *SPARK-54853[SQL] Always check {{hive.exec.max.dynamic.partitions}} on the
> spark side*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented:
> hive.exec.max.dynamic.partitions is now always enforced Spark-side and the
> session-level value is honored, changing when the limit error fires; error
> renamed to DYNAMIC_PARTITION_WRITE_PARTITION_NUM_LIMIT_EXCEEDED.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the
> {{hive.exec.max.dynamic.partitions}} limit for dynamic partition writes to
> Hive SerDe tables is always enforced on the Spark side and honors the
> session-level value, changing when the limit is checked. The error is now
> reported as {{{}DYNAMIC_PARTITION_WRITE_PARTITION_NUM_LIMIT_EXCEEDED{}}}.
> * *SPARK-55372[SQL] Fix {{SHOW CREATE TABLE}} for tables / views with
> default collation*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: typeName/toString
> of an explicitly UTF8_BINARY-collated StringType/CharType now render 'string
> collate UTF8_BINARY' not 'string' (default non-collated unchanged), changing
> SHOW CREATE TABLE and schema output for such columns.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, a
> StringType/CharType/VarcharType with an explicit {{UTF8_BINARY}} collation
> renders its collation in {{{}typeName{}}}/{{{}toString{}}} (for example
> {{{}string collate UTF8_BINARY{}}}), changing SHOW CREATE TABLE and schema
> output for such columns. Default non-collated strings are unchanged. No
> opt-out.
> * *SPARK-54918[SQL] Normalize floating numbers in array set operations*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented:
> array_distinct/union/intersect/except and arrays_overlap now normalize floats
> so 0.0/-0.0 and differently-bit NaNs are treated as equal, changing results
> of these array set operations. No opt-out.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the array set
> functions {{{}array_distinct{}}}, {{{}array_union{}}},
> {{{}array_intersect{}}}, {{{}array_except{}}}, and {{arrays_overlap}}
> normalize floating-point values, so {{0.0}} and {{-0.0}} and differently-bit
> NaN values are treated as equal. This changes the results of these functions;
> there is no opt-out.
> * *SPARK-54777[SQL] Changed dropTable error handling in
> JDBCTableCatalog.dropTable(...)*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: JDBC DROP TABLE
> now only swallows object-not-found errors; other failures (permission, etc.)
> propagate instead of silently returning, so a drop that previously appeared
> to succeed now throws.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the JDBC table
> catalog only swallows object-not-found errors when running DROP TABLE; other
> failures such as permission-denied or constraint violations now propagate
> instead of silently returning success. A DROP TABLE that previously appeared
> to succeed may now throw.
> * *SPARK-57040[SQL] JDBC connector supports pushdown TABLESAMPLE SYSTEM*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: V2 JDBC
> TABLESAMPLE with withReplacement=true is no longer pushed down (correctness
> fix; pushdown default-on), so .sample(withReplacement=true) on JDBC tables
> now returns different (correct) results.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the JDBC connector
> no longer pushes down {{TABLESAMPLE}} when {{withReplacement=true}} (a
> correctness fix, as no mainstream RDBMS supports sampling with replacement),
> and adds {{TABLESAMPLE SYSTEM}} pushdown for PostgreSQL. Results of
> sample-with-replacement on JDBC tables change accordingly.
> * *SPARK-56031[SQL] Make Natural Join column matching respect case
> sensitivity conf*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: NATURAL JOIN now
> respects spark.sql.caseSensitive (default false), so joins on case-differing
> common columns instead of degrading to CROSS JOIN, changing results; set
> spark.sql.caseSensitive=true to match case-sensitively.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, NATURAL JOIN
> respects {{spark.sql.caseSensitive}} (default {{{}false{}}}), so common
> columns that differ only in case are joined instead of degrading to a CROSS
> JOIN, changing results. To match columns case-sensitively, set
> {{spark.sql.caseSensitive}} to {{{}true{}}}.
> * *SPARK-31561[SQL] Add QUALIFY Clause*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: QUALIFY is now a
> (non-reserved) clause keyword, so a query using unquoted QUALIFY as a
> trailing table alias (FROM t QUALIFY) now parses as a QUALIFY clause; quote
> the identifier to restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the {{QUALIFY}}
> clause is supported, and {{QUALIFY}} becomes a (non-reserved) clause keyword.
> A query using unquoted {{QUALIFY}} as a trailing table alias (e.g. {{{}FROM t
> QUALIFY{}}}) is now parsed as a {{QUALIFY}} clause. To restore the previous
> behavior, quote the alias (e.g. {{{}`QUALIFY`{}}}).
> * *SPARK-57188[SQL] Parameterless function takes precedence over UDF
> parameter*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: a parameterless
> built-in (current_user, current_date, etc.) now takes precedence over a
> same-named SQL UDF parameter, changing UDF body results. Set
> spark.sql.legacy.allowUdfParameterToShadowParameterlessFunction=true to
> restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, a parameterless
> built-in function ({{{}current_user{}}}, {{{}current_date{}}},
> {{{}session_user{}}}, etc.) takes precedence over a same-named SQL UDF
> parameter in the function body. To restore the previous behavior, set
> {{spark.sql.legacy.allowUdfParameterToShadowParameterlessFunction}} to
> {{{}true{}}}.
> * *SPARK-56045[SQL] Add flag for ignoring Parquet UNKNOWN type annotation
> and revert to old behavior*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: reading Parquet
> files with UNKNOWN logical-type annotation now infers physical type (e.g.
> IntegerType) instead of NullType shipped in v4.1.0; opt back into NullType
> via spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled=true.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, reading Parquet
> files with the {{UNKNOWN}} logical-type annotation infers the physical type
> (for example IntegerType) instead of the NullType used in 4.1.0. To restore
> the 4.1.0 behavior of inferring NullType, set
> {{spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled}} to
> {{{}true{}}}.
> * *SPARK-56414[SQL] Per-write options should take precedence over session
> config in file source writes*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: per-write options
> (e.g. parquet.outputTimestampType) now override the matching session SQLConf
> in Parquet/Avro writes; previously such options were silently ignored, so
> written file format can change when both are set.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, per-write options
> take precedence over session config for several Parquet/Avro write keys (e.g.
> {{{}spark.sql.parquet.outputTimestampType{}}},
> {{{}spark.sql.parquet.writeLegacyFormat{}}}). Previously such options were
> silently ignored, so the written file format can change when both are set.
> * *SPARK-56251[SQL] Add default fetchSize for postgres to avoid loading all
> data in memory*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: Postgres JDBC
> reads now default fetchSize to 1000 (was 0/all-in-memory), enabling cursor
> fetch with autoCommit=false; changes default read behavior. Set fetchsize=0
> to restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the PostgreSQL
> JDBC dialect defaults the read {{fetchsize}} to {{1000}} (was {{{}0{}}}),
> enabling cursor-based fetching with {{autoCommit=false}} to avoid loading the
> whole table into memory. To restore the previous behavior, set the
> {{fetchsize}} option to {{{}0{}}}.
> * *SPARK-55155[SQL] Support foldable expressions in {{SET CATALOG}}
> statement*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: SET CATALOG with
> a bare name now resolves to a session variable of that name first (if one
> exists) before treating it as a catalog name; there is no opt-out config.
> Edge case but a default behavior change.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, SET CATALOG
> accepts foldable expressions and a bare name is first resolved as a session
> variable of that name (if one exists) before being treated as a catalog name.
> There is no opt-out; a session variable that shadows a catalog name changes
> which catalog is set.
> * *SPARK-51518[SQL] Support | as an alternative to |> for the SQL pipe
> operator token*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: single-char '|'
> is now a SQL pipe-operator token by default, so a query using a pipe keyword
> as a column name after bitwise-OR may reparse; restore via
> spark.sql.parser.singleCharacterPipeOperator.enabled=false.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the SQL parser
> accepts single-character {{|}} as an alternative to {{|>}} for the pipe
> operator token by default. A pipe keyword used as a column name after a
> bitwise-OR {{|}} may now reparse. To restore the previous behavior, set
> {{spark.sql.parser.singleCharacterPipeOperator.enabled}} to {{{}false{}}}.
> * *SPARK-52812[SQL] Make Spark Connect Catalog.createTable eager*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: Spark Connect
> Catalog.createTable now executes eagerly instead of lazily, so the table is
> created (and errors like already-exists surface) immediately at the call
> rather than on a later action.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, Spark Connect
> {{Catalog.createTable}} executes eagerly: the table is created (and errors
> such as table-already-exists surface) immediately at the call rather than
> lazily on a later action. Code relying on the previous lazy behavior is
> affected.
> * *SPARK-55198[SQL] spark-sql should skip comment line with leading
> whitespaces*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: spark-sql CLI now
> skips comment lines that have leading whitespace before – (line.trim
> startsWith --), matching Hive/beeline; such lines were previously sent as
> SQL. CLI-only, no opt-out.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the spark-sql CLI
> skips comment lines whose first non-whitespace characters are {{-{-}{-}}}
> (i.e. {{line.trim}} starts with {{{}-{}}}), aligning with Hive and beeline.
> Previously such leading-whitespace comment lines were sent as SQL. There is
> no opt-out.
> * *SPARK-49110[SQL] Simplify SubqueryAlias.metadataOutput to always
> propagate metadata columns*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: SubqueryAlias now
> always propagates metadata columns by default; queries that failed with
> AnalysisException may now succeed and joins may newly raise ambiguous-column
> errors. Restore via subqueryAliasAlwaysPropagateMetadataColumns=false.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, {{SubqueryAlias}}
> always propagates metadata columns from its child, so some queries that
> previously failed with AnalysisException now succeed and joins may raise new
> ambiguous-column errors. To restore the legacy behavior, set
> {{{}spark.sql.analyzer.subqueryAliasAlwaysPropagateMetadataColumns=false{}}}.
> * *SPARK-56678[SQL] Use structured Catalog/Namespace/Table rows in DESCRIBE
> TABLE EXTENDED for v2 tables and views*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: DESCRIBE TABLE
> EXTENDED for v2 tables/views now emits structured
> Catalog/Namespace/Database/Table rows instead of a single Name/Identifier
> row; consumers parsing the output are affected.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, {{DESCRIBE TABLE
> EXTENDED}} for v2 tables and views emits structured {{{}Catalog{}}},
> {{{}Namespace{}}}, {{{}Database{}}}, and {{{}Table{}}}/{{{}View{}}} rows
> instead of a single {{{}Name{}}}/{{{}Identifier{}}} row. Consumers that parse
> the command output may be affected.
> * *SPARK-56654[SQL] Reject unpaired UTF-16 surrogates in Variant JSON
> parsing*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented:
> parse_json/try_parse_json/from_json('variant') now reject unpaired UTF-16
> surrogates (error/NULL) instead of substituting U+FFFD; previously-accepted
> JSON now fails. Set spark.sql.variant.validateUnicodeInJsonParsing=false to
> restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2,
> {{{}parse_json{}}}, {{{}try_parse_json{}}}, and {{from_json}} to variant
> reject unpaired UTF-16 surrogates (raising an error or returning NULL)
> instead of silently substituting U+FFFD. To restore the previous permissive
> behavior, set {{spark.sql.variant.validateUnicodeInJsonParsing}} to
> {{{}false{}}}.
> * *SPARK-56554[SQL] Respect inferSchema option when parsing XML as variant*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: reading XML as
> Variant now honors inferSchema=false (leaf/attribute text kept as strings,
> not inferred boolean/long/decimal), changing results. Set
> spark.sql.xml.variant.respectInferSchema=false to restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, reading XML as
> Variant honors {{{}inferSchema=false{}}}, keeping leaf text and attribute
> values as strings instead of inferring boolean/long/decimal. To restore the
> previous behavior of always inferring types, set
> {{spark.sql.xml.variant.respectInferSchema}} to {{{}false{}}}.
> * *SPARK-54718[SQL] Preserve attributes names during CTE newInstance()*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: CTE newInstance()
> now preserves attribute name casing by default, changing output column-name
> casing for self-joins on a CTE with case-differing duplicate columns; restore
> via spark.sql.legacy.cteDuplicateAttributeNames.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, CTE relation
> references preserve attribute name casing when re-instantiated, so self-joins
> on a CTE with case-differing duplicate columns keep the original column-name
> casing in the output. To restore the previous behavior, set
> {{spark.sql.legacy.cteDuplicateAttributeNames}} to {{{}true{}}}.
> * *SPARK-56280[SS] normalize NaN and +/-0.0 in streaming dedupe node*
> *Component:* SS
> *Why no migration-guide note needed:* Should be documented: streaming
> dropDuplicates on float/double keys now normalizes NaN and +/-0.0, so
> differently-bit NaNs and signed zeros are treated as duplicates; dedup
> results change with no opt-out.
> *Proposed migration-guide message:* [SS] Since Spark 4.2, streaming
> {{{}dropDuplicates{}}}/{{{}dropDuplicatesWithinWatermark{}}} on float or
> double key columns normalize NaN and signed zero, so differently-bit NaN
> values and {{{}+0.0{}}}/{{{}-0.0{}}} are treated as duplicates. This changes
> deduplication results for queries with floating-point keys; there is no
> opt-out.
> * *SPARK-55058[SS] Throw error on inconsistent checkpoint metadata*
> *Component:* SS
> *Why no migration-guide note needed:* Should be documented: restarting a
> streaming query whose checkpoint has offset/commit logs but no metadata file
> now fails with MISSING_METADATA_FILE by default instead of starting a new
> query id; disable via the verifyMetadataExists.enabled config=false.
> *Proposed migration-guide message:* [SS] Since Spark 4.2, restarting a
> streaming query whose checkpoint has offset and commit logs but no metadata
> file fails with {{STREAMING_CHECKPOINT_MISSING_METADATA_FILE}} instead of
> silently generating a new query id. To restore the previous behavior, set
> {{spark.sql.streaming.checkpoint.verifyMetadataExists.enabled}} to
> {{{}false{}}}.
> * *SPARK-56239[UI] Fix SQL tab DataTables: API default limit, date format,
> and appId resolution*
> *Component:* UI
> *Why no migration-guide note needed:* Should be documented: the long-existing
> /applications/\{appId}/sql REST endpoint default length changed from 20 to
> -1, so it now returns all SQL executions by default; clients relying on the
> 20-row default get more rows.
> *Proposed migration-guide message:* [Core] Since Spark 4.2, the
> {{/applications/\{appId}/sql}} REST endpoint defaults the {{length}}
> parameter to {{-1}} (was {{{}20{}}}), so it returns all SQL executions by
> default. To restore the previous behavior, pass {{length=20}} (any {{length
> <= 0}} returns all executions).
> * *SPARK-55075[K8S] Track executor pod creation errors with
> ExecutorFailureTracker*
> *Component:* K8S
> *Why no migration-guide note needed:* Should be documented: on K8s, executor
> pod-creation failures are now caught, logged and counted by
> ExecutorFailureTracker (continue until max failures) instead of being
> rethrown immediately, changing default failure semantics for K8s deployments.
> *Proposed migration-guide message:* [Core] Since Spark 4.2, on Kubernetes
> executor pod-creation failures are caught, logged, and counted by
> ExecutorFailureTracker (allocation continues until the max-failures
> threshold) instead of being rethrown immediately. There is no opt-out.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]