[
https://issues.apache.org/jira/browse/SPARK-57452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiao Li updated SPARK-57452:
----------------------------
Description:
* *[SPARK-55314][CONNECT] Propagate observed metrics errors to client*
*Component:* CONNECT
*Why no migration-guide note needed:* Should be documented: Observation.get
(Scala and Connect) now raises the underlying exception when metric collection
fails instead of returning an empty map; code that tolerated an empty result on
failure now sees a thrown exception.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{Observation.get}}
(Scala and Spark Connect) raises the underlying exception when observed-metric
collection fails instead of silently returning an empty map. Code that
tolerated an empty result on failure now sees a thrown exception. There is no
opt-out.
* *[SPARK-55655][MLLIB] Make {{CountVectorizer}} vocabulary deterministic when
counts are equal*
*Component:* MLLIB
*Why no migration-guide note needed:* Should be documented: CountVectorizer now
breaks ties between equal-count terms lexicographically, making the vocabulary
deterministic; this can change vocabulary term order and feature indices versus
prior (non-deterministic) output. No opt-out.
*Proposed migration-guide message:* [Core] Since Spark 4.2, {{CountVectorizer}}
breaks ties between equal-count terms lexicographically so the vocabulary is
deterministic. This can change vocabulary term order and feature indices
compared with prior (non-deterministic) output. There is no opt-out.
* *[SPARK-47997][PS] Add errors parameter to DataFrame.drop and Series.drop*
*Component:* PS
*Why no migration-guide note needed:* Should be documented: ps
DataFrame.drop/Series.drop now raise KeyError if ANY label is missing
(previously only if all missing), for all pandas versions; pass errors='ignore'
to skip missing labels.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark
DataFrame.drop/Series.drop add an {{errors}} parameter defaulting to
{{'raise'}} and now raise KeyError if any requested label is missing
(previously only if all were missing), across all pandas versions. To skip
missing labels, pass {{{}errors='ignore'{}}}.
* *[SPARK-56219][PS] Align groupby idxmax and idxmin skipna=False behavior
with pandas 2/3*
*Component:* PS
*Why no migration-guide note needed:* Should be documented: pandas-on-Spark
groupby idxmax/idxmin with skipna=False now returns null for NA groups (pandas
2) or raises on NA inputs (pandas 3) instead of a label; no opt-out, results
change for existing skipna=False users.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark
{{GroupBy.idxmax}} and {{GroupBy.idxmin}} with {{skipna=False}} now follow
pandas semantics for NA: with pandas 2 they return null for groups containing
NA values, and with pandas 3 they raise on NA-containing inputs, instead of
returning an index label. There is no opt-out.
* *[SPARK-55977][PS] Fix isin() to use strict type matching like pandas*
*Component:* PS
*Why no migration-guide note needed:* Should be documented: ps
Series/DataFrame.isin() now uses strict Python-type matching (e.g. 1 no longer
matches '1'), changing results for all pandas versions; this is pandas-parity,
no opt-out config.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark
Series/DataFrame {{isin()}} uses strict Python-type matching like pandas, so
values of incompatible types no longer match (for example integer 1 no longer
matches string '1'). Results change across all pandas versions. There is no
opt-out.
* *[SPARK-54568][PYTHON] Avoid unnecessary pandas conversion in create
dataframe from ndarray*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: createDataFrame
from a numpy ndarray now requires pyarrow and converts ndarray to Arrow
directly, dropping np.dtype-based StructType inference so inferred schema can
differ.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, createDataFrame
from a NumPy ndarray converts the array directly to an Arrow Table and now
requires PyArrow. The previous np.dtype-based StructType inference is dropped,
so the inferred schema may differ. To control the schema, pass an explicit
schema.
* *[SPARK-56186][PYTHON] Retire pypy*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: PyPy is no longer
officially supported in PySpark (CI, docker image, classifier, and
PyPy-specific code removed); PyPy users should migrate to CPython.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, PyPy is no longer
officially supported in PySpark: PyPy CI, the PyPy docker image, the setup.py
classifier, and PyPy-specific code/test skips have been removed. PyPy users
should migrate to CPython. There is no opt-out.
* *[SPARK-55096][PYTHON] Update pandas minimum version in {{connect/setup.py}}*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: minimum pandas
raised to 2.2.0 for Spark Connect (was 2.0.0); pandas <2.2 is no longer
supported on Connect.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, the minimum
supported version for pandas on Spark Connect has been raised from 2.0.0 to
2.2.0, matching the minimum already required by PySpark.
* *[SPARK-54962][PYTHON] Fix nullable integers handling in Pandas UDF*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: Pandas UDFs on
nullable integer columns containing nulls now use a nullable Int extension
dtype instead of float64, so values/dtype inside the UDF change (fixing
precision loss for large integers); no opt-out.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, Pandas UDFs on a
nullable integer column that contains nulls receive a pandas nullable integer
extension dtype (e.g. Int64) instead of float64, fixing precision loss for
large integers. The dtype and values seen inside the UDF change accordingly.
There is no opt-out configuration.
* *[SPARK-55583][PYTHON] Validate Arrow schema types in Python data source*
*Component:* PYTHON
*Why no migration-guide note needed:* Should be documented: a Python Data
Source read returning a pa.RecordBatch whose Arrow types differ from the
declared schema now fails with DATA_SOURCE_RETURN_SCHEMA_MISMATCH;
type-mismatched batches that previously loaded by coincidence now error.
*Proposed migration-guide message:* [PySpark] In Spark 4.2, a Python data
source read that returns a {{pa.RecordBatch}} whose Arrow types differ from the
declared schema now fails with {{{}DATA_SOURCE_RETURN_SCHEMA_MISMATCH{}}}.
Type-mismatched batches that previously loaded by coincidence now error. There
is no opt-out.
* *[SPARK-55416][SS][PYTHON] Streaming Python Data Source memory leak when
end-offset is not updated*
*Component:* SS,PYTHON
*Why no migration-guide note needed:* Should be documented: a Streaming Python
Data Source SimpleDataSourceStreamReader whose read() returns a non-empty batch
with end==start now fails with STREAM_READER_OFFSET_DID_NOT_ADVANCE instead of
leaking memory; affects existing (buggy) reader impls.
*Proposed migration-guide message:* [SS] Since Spark 4.2, a streaming Python
data source SimpleDataSourceStreamReader whose {{read()}} returns a non-empty
batch with end offset equal to start now fails with
{{SIMPLE_STREAM_READER_OFFSET_DID_NOT_ADVANCE}} instead of leaking driver
memory. Empty batches with end == start are still allowed. There is no opt-out.
* *[SPARK-56206][SQL] Fix case-insensitive duplicate CTE name detection*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: duplicate CTE names
differing only in case (e.g. WITH cte1, CTE1) now raise DUPLICATED_CTE_NAMES
instead of silently overwriting; previously-accepted queries now fail. No
opt-out.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, duplicate CTE name
detection is case-insensitive. CTE definitions whose names differ only in case
(e.g. {{{}WITH cte AS (...), CTE AS (...){}}}) now raise
{{DUPLICATED_CTE_NAMES}} instead of silently overwriting the earlier
definition. There is no opt-out; rename the conflicting CTEs.
* *[SPARK-56652][SQL] Always emit RELY/NORELY in DESCRIBE EXTENDED constraint
output*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: DESCRIBE EXTENDED
now always prints RELY/NORELY for table constraints (previously omitted the
default NORELY), changing the command output text for tools parsing it.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{DESCRIBE
EXTENDED}} always emits the {{{}RELY{}}}/{{{}NORELY{}}} token for table
constraints, including {{NORELY}} for the default state which was previously
omitted. This matches {{SHOW CREATE TABLE}} output and changes the command's
constraint output text for tools parsing it.
* *[SPARK-55019][SQL] Allow DROP TABLE to drop VIEW*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: DROP TABLE on a
view now drops the view by default instead of raising
WRONG_COMMAND_FOR_OBJECT_TYPE; restore via
spark.sql.dropTableOnView.enabled=false.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, DROP TABLE on a view
drops the view by default instead of raising
{{{}WRONG_COMMAND_FOR_OBJECT_TYPE{}}}. To restore the previous behavior, set
{{spark.sql.dropTableOnView.enabled}} to {{{}false{}}}.
* *[SPARK-54853][SQL] Always check {{hive.exec.max.dynamic.partitions}} on the
spark side*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented:
hive.exec.max.dynamic.partitions is now always enforced Spark-side and the
session-level value is honored, changing when the limit error fires; error
renamed to DYNAMIC_PARTITION_WRITE_PARTITION_NUM_LIMIT_EXCEEDED.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the
{{hive.exec.max.dynamic.partitions}} limit for dynamic partition writes to Hive
SerDe tables is always enforced on the Spark side and honors the session-level
value, changing when the limit is checked. The error is now reported as
{{{}DYNAMIC_PARTITION_WRITE_PARTITION_NUM_LIMIT_EXCEEDED{}}}.
* *[SPARK-55372][SQL] Fix {{SHOW CREATE TABLE}} for tables / views with
default collation*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: typeName/toString
of an explicitly UTF8_BINARY-collated StringType/CharType now render 'string
collate UTF8_BINARY' not 'string' (default non-collated unchanged), changing
SHOW CREATE TABLE and schema output for such columns.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, a
StringType/CharType/VarcharType with an explicit {{UTF8_BINARY}} collation
renders its collation in {{{}typeName{}}}/{{{}toString{}}} (for example
{{{}string collate UTF8_BINARY{}}}), changing SHOW CREATE TABLE and schema
output for such columns. Default non-collated strings are unchanged. No opt-out.
* *[SPARK-54918][SQL] Normalize floating numbers in array set operations*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented:
array_distinct/union/intersect/except and arrays_overlap now normalize floats
so 0.0/-0.0 and differently-bit NaNs are treated as equal, changing results of
these array set operations. No opt-out.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the array set
functions {{{}array_distinct{}}}, {{{}array_union{}}}, {{{}array_intersect{}}},
{{{}array_except{}}}, and {{arrays_overlap}} normalize floating-point values,
so {{0.0}} and {{-0.0}} and differently-bit NaN values are treated as equal.
This changes the results of these functions; there is no opt-out.
* *[SPARK-54777][SQL] Changed dropTable error handling in
JDBCTableCatalog.dropTable(...)*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: JDBC DROP TABLE now
only swallows object-not-found errors; other failures (permission, etc.)
propagate instead of silently returning, so a drop that previously appeared to
succeed now throws.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the JDBC table
catalog only swallows object-not-found errors when running DROP TABLE; other
failures such as permission-denied or constraint violations now propagate
instead of silently returning success. A DROP TABLE that previously appeared to
succeed may now throw.
* *[SPARK-57040][SQL] JDBC connector supports pushdown TABLESAMPLE SYSTEM*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: V2 JDBC TABLESAMPLE
with withReplacement=true is no longer pushed down (correctness fix; pushdown
default-on), so .sample(withReplacement=true) on JDBC tables now returns
different (correct) results.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the JDBC connector
no longer pushes down {{TABLESAMPLE}} when {{withReplacement=true}} (a
correctness fix, as no mainstream RDBMS supports sampling with replacement),
and adds {{TABLESAMPLE SYSTEM}} pushdown for PostgreSQL. Results of
sample-with-replacement on JDBC tables change accordingly.
* *[SPARK-56031][SQL] Make Natural Join column matching respect case
sensitivity conf*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: NATURAL JOIN now
respects spark.sql.caseSensitive (default false), so joins on case-differing
common columns instead of degrading to CROSS JOIN, changing results; set
spark.sql.caseSensitive=true to match case-sensitively.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, NATURAL JOIN
respects {{spark.sql.caseSensitive}} (default {{{}false{}}}), so common columns
that differ only in case are joined instead of degrading to a CROSS JOIN,
changing results. To match columns case-sensitively, set
{{spark.sql.caseSensitive}} to {{{}true{}}}.
* *[SPARK-31561][SQL] Add QUALIFY Clause*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: QUALIFY is now a
(non-reserved) clause keyword, so a query using unquoted QUALIFY as a trailing
table alias (FROM t QUALIFY) now parses as a QUALIFY clause; quote the
identifier to restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the {{QUALIFY}}
clause is supported, and {{QUALIFY}} becomes a (non-reserved) clause keyword. A
query using unquoted {{QUALIFY}} as a trailing table alias (e.g. {{{}FROM t
QUALIFY{}}}) is now parsed as a {{QUALIFY}} clause. To restore the previous
behavior, quote the alias (e.g. {{{}`QUALIFY`{}}}).
* *[SPARK-57188][SQL] Parameterless function takes precedence over UDF
parameter*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: a parameterless
built-in (current_user, current_date, etc.) now takes precedence over a
same-named SQL UDF parameter, changing UDF body results. Set
spark.sql.legacy.allowUdfParameterToShadowParameterlessFunction=true to restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, a parameterless
built-in function ({{{}current_user{}}}, {{{}current_date{}}},
{{{}session_user{}}}, etc.) takes precedence over a same-named SQL UDF
parameter in the function body. To restore the previous behavior, set
{{spark.sql.legacy.allowUdfParameterToShadowParameterlessFunction}} to
{{{}true{}}}.
* *[SPARK-56045][SQL] Add flag for ignoring Parquet UNKNOWN type annotation
and revert to old behavior*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: reading Parquet
files with UNKNOWN logical-type annotation now infers physical type (e.g.
IntegerType) instead of NullType shipped in v4.1.0; opt back into NullType via
spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled=true.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, reading Parquet
files with the {{UNKNOWN}} logical-type annotation infers the physical type
(for example IntegerType) instead of the NullType used in 4.1.0. To restore the
4.1.0 behavior of inferring NullType, set
{{spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled}} to
{{{}true{}}}.
* *[SPARK-56414][SQL] Per-write options should take precedence over session
config in file source writes*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: per-write options
(e.g. parquet.outputTimestampType) now override the matching session SQLConf in
Parquet/Avro writes; previously such options were silently ignored, so written
file format can change when both are set.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, per-write options
take precedence over session config for several Parquet/Avro write keys (e.g.
{{{}spark.sql.parquet.outputTimestampType{}}},
{{{}spark.sql.parquet.writeLegacyFormat{}}}). Previously such options were
silently ignored, so the written file format can change when both are set.
* *[SPARK-56251][SQL] Add default fetchSize for postgres to avoid loading all
data in memory*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: Postgres JDBC reads
now default fetchSize to 1000 (was 0/all-in-memory), enabling cursor fetch with
autoCommit=false; changes default read behavior. Set fetchsize=0 to restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the PostgreSQL JDBC
dialect defaults the read {{fetchsize}} to {{1000}} (was {{{}0{}}}), enabling
cursor-based fetching with {{autoCommit=false}} to avoid loading the whole
table into memory. To restore the previous behavior, set the {{fetchsize}}
option to {{{}0{}}}.
* *[SPARK-55155][SQL] Support foldable expressions in {{SET CATALOG}}
statement*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: SET CATALOG with a
bare name now resolves to a session variable of that name first (if one exists)
before treating it as a catalog name; there is no opt-out config. Edge case but
a default behavior change.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, SET CATALOG accepts
foldable expressions and a bare name is first resolved as a session variable of
that name (if one exists) before being treated as a catalog name. There is no
opt-out; a session variable that shadows a catalog name changes which catalog
is set.
* *[SPARK-51518][SQL] Support | as an alternative to |> for the SQL pipe
operator token*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: single-char '|' is
now a SQL pipe-operator token by default, so a query using a pipe keyword as a
column name after bitwise-OR may reparse; restore via
spark.sql.parser.singleCharacterPipeOperator.enabled=false.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the SQL parser
accepts single-character {{|}} as an alternative to {{|>}} for the pipe
operator token by default. A pipe keyword used as a column name after a
bitwise-OR {{|}} may now reparse. To restore the previous behavior, set
{{spark.sql.parser.singleCharacterPipeOperator.enabled}} to {{{}false{}}}.
* *[SPARK-52812][SQL] Make Spark Connect Catalog.createTable eager*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: Spark Connect
Catalog.createTable now executes eagerly instead of lazily, so the table is
created (and errors like already-exists surface) immediately at the call rather
than on a later action.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, Spark Connect
{{Catalog.createTable}} executes eagerly: the table is created (and errors such
as table-already-exists surface) immediately at the call rather than lazily on
a later action. Code relying on the previous lazy behavior is affected.
* *[SPARK-55198][SQL] spark-sql should skip comment line with leading
whitespaces*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: spark-sql CLI now
skips comment lines that have leading whitespace before -- (line.trim
startsWith --), matching Hive/beeline; such lines were previously sent as SQL.
CLI-only, no opt-out.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, the spark-sql CLI
skips comment lines whose first non-whitespace characters are {{--}} (i.e.
{{line.trim}} starts with {{{}--{}}}), aligning with Hive and beeline.
Previously such leading-whitespace comment lines were sent as SQL. There is no
opt-out.
* *[SPARK-49110][SQL] Simplify SubqueryAlias.metadataOutput to always
propagate metadata columns*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: SubqueryAlias now
always propagates metadata columns by default; queries that failed with
AnalysisException may now succeed and joins may newly raise ambiguous-column
errors. Restore via subqueryAliasAlwaysPropagateMetadataColumns=false.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{SubqueryAlias}}
always propagates metadata columns from its child, so some queries that
previously failed with AnalysisException now succeed and joins may raise new
ambiguous-column errors. To restore the legacy behavior, set
{{{}spark.sql.analyzer.subqueryAliasAlwaysPropagateMetadataColumns=false{}}}.
* *[SPARK-56678][SQL] Use structured Catalog/Namespace/Table rows in DESCRIBE
TABLE EXTENDED for v2 tables and views*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: DESCRIBE TABLE
EXTENDED for v2 tables/views now emits structured
Catalog/Namespace/Database/Table rows instead of a single Name/Identifier row;
consumers parsing the output are affected.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{DESCRIBE TABLE
EXTENDED}} for v2 tables and views emits structured {{{}Catalog{}}},
{{{}Namespace{}}}, {{{}Database{}}}, and {{{}Table{}}}/{{{}View{}}} rows
instead of a single {{{}Name{}}}/{{{}Identifier{}}} row. Consumers that parse
the command output may be affected.
* *[SPARK-56654][SQL] Reject unpaired UTF-16 surrogates in Variant JSON
parsing*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented:
parse_json/try_parse_json/from_json('variant') now reject unpaired UTF-16
surrogates (error/NULL) instead of substituting U+FFFD; previously-accepted
JSON now fails. Set spark.sql.variant.validateUnicodeInJsonParsing=false to
restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, {{{}parse_json{}}},
{{{}try_parse_json{}}}, and {{from_json}} to variant reject unpaired UTF-16
surrogates (raising an error or returning NULL) instead of silently
substituting U+FFFD. To restore the previous permissive behavior, set
{{spark.sql.variant.validateUnicodeInJsonParsing}} to {{{}false{}}}.
* *[SPARK-56554][SQL] Respect inferSchema option when parsing XML as variant*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: reading XML as
Variant now honors inferSchema=false (leaf/attribute text kept as strings, not
inferred boolean/long/decimal), changing results. Set
spark.sql.xml.variant.respectInferSchema=false to restore.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, reading XML as
Variant honors {{{}inferSchema=false{}}}, keeping leaf text and attribute
values as strings instead of inferring boolean/long/decimal. To restore the
previous behavior of always inferring types, set
{{spark.sql.xml.variant.respectInferSchema}} to {{{}false{}}}.
* *[SPARK-54718][SQL] Preserve attributes names during CTE newInstance()*
*Component:* SQL
*Why no migration-guide note needed:* Should be documented: CTE newInstance()
now preserves attribute name casing by default, changing output column-name
casing for self-joins on a CTE with case-differing duplicate columns; restore
via spark.sql.legacy.cteDuplicateAttributeNames.
*Proposed migration-guide message:* [SQL] Since Spark 4.2, CTE relation
references preserve attribute name casing when re-instantiated, so self-joins
on a CTE with case-differing duplicate columns keep the original column-name
casing in the output. To restore the previous behavior, set
{{spark.sql.legacy.cteDuplicateAttributeNames}} to {{{}true{}}}.
* *[SPARK-56280][SS] normalize NaN and +/-0.0 in streaming dedupe node*
*Component:* SS
*Why no migration-guide note needed:* Should be documented: streaming
dropDuplicates on float/double keys now normalizes NaN and +/-0.0, so
differently-bit NaNs and signed zeros are treated as duplicates; dedup results
change with no opt-out.
*Proposed migration-guide message:* [SS] Since Spark 4.2, streaming
{{{}dropDuplicates{}}}/{{{}dropDuplicatesWithinWatermark{}}} on float or double
key columns normalize NaN and signed zero, so differently-bit NaN values and
{{{}+0.0{}}}/{{{}-0.0{}}} are treated as duplicates. This changes deduplication
results for queries with floating-point keys; there is no opt-out.
* *[SPARK-55058][SS] Throw error on inconsistent checkpoint metadata*
*Component:* SS
*Why no migration-guide note needed:* Should be documented: restarting a
streaming query whose checkpoint has offset/commit logs but no metadata file
now fails with MISSING_METADATA_FILE by default instead of starting a new query
id; disable via the verifyMetadataExists.enabled config=false.
*Proposed migration-guide message:* [SS] Since Spark 4.2, restarting a
streaming query whose checkpoint has offset and commit logs but no metadata
file fails with {{STREAMING_CHECKPOINT_MISSING_METADATA_FILE}} instead of
silently generating a new query id. To restore the previous behavior, set
{{spark.sql.streaming.checkpoint.verifyMetadataExists.enabled}} to
{{{}false{}}}.
* *[SPARK-56239][UI] Fix SQL tab DataTables: API default limit, date format,
and appId resolution*
*Component:* UI
*Why no migration-guide note needed:* Should be documented: the long-existing
/applications/\{appId}/sql REST endpoint default length changed from 20 to -1,
so it now returns all SQL executions by default; clients relying on the 20-row
default get more rows.
*Proposed migration-guide message:* [Core] Since Spark 4.2, the
{{/applications/\{appId}/sql}} REST endpoint defaults the {{length}} parameter
to {{-1}} (was {{{}20{}}}), so it returns all SQL executions by default. To
restore the previous behavior, pass {{length=20}} (any {{length <= 0}} returns
all executions).
* *[SPARK-55075][K8S] Track executor pod creation errors with
ExecutorFailureTracker*
*Component:* K8S
*Why no migration-guide note needed:* Should be documented: on K8s, executor
pod-creation failures are now caught, logged and counted by
ExecutorFailureTracker (continue until max failures) instead of being rethrown
immediately, changing default failure semantics for K8s deployments.
*Proposed migration-guide message:* [Core] Since Spark 4.2, on Kubernetes
executor pod-creation failures are caught, logged, and counted by
ExecutorFailureTracker (allocation continues until the max-failures threshold)
instead of being rethrown immediately. There is no opt-out.
> Auditing the migration guide
> ----------------------------
>
> Key: SPARK-57452
> URL: https://issues.apache.org/jira/browse/SPARK-57452
> Project: Spark
> Issue Type: Sub-task
> Components: Kubernetes, MLlib, PySpark, Spark Core, SQL, Structured
> Streaming, Web UI
> Affects Versions: 4.2.0
> Reporter: Xiao Li
> Priority: Blocker
>
> * *[SPARK-55314][CONNECT] Propagate observed metrics errors to client*
> *Component:* CONNECT
> *Why no migration-guide note needed:* Should be documented: Observation.get
> (Scala and Connect) now raises the underlying exception when metric
> collection fails instead of returning an empty map; code that tolerated an
> empty result on failure now sees a thrown exception.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2,
> {{Observation.get}} (Scala and Spark Connect) raises the underlying exception
> when observed-metric collection fails instead of silently returning an empty
> map. Code that tolerated an empty result on failure now sees a thrown
> exception. There is no opt-out.
> * *[SPARK-55655][MLLIB] Make {{CountVectorizer}} vocabulary deterministic
> when counts are equal*
> *Component:* MLLIB
> *Why no migration-guide note needed:* Should be documented: CountVectorizer
> now breaks ties between equal-count terms lexicographically, making the
> vocabulary deterministic; this can change vocabulary term order and feature
> indices versus prior (non-deterministic) output. No opt-out.
> *Proposed migration-guide message:* [Core] Since Spark 4.2,
> {{CountVectorizer}} breaks ties between equal-count terms lexicographically
> so the vocabulary is deterministic. This can change vocabulary term order and
> feature indices compared with prior (non-deterministic) output. There is no
> opt-out.
> * *[SPARK-47997][PS] Add errors parameter to DataFrame.drop and Series.drop*
> *Component:* PS
> *Why no migration-guide note needed:* Should be documented: ps
> DataFrame.drop/Series.drop now raise KeyError if ANY label is missing
> (previously only if all missing), for all pandas versions; pass
> errors='ignore' to skip missing labels.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark
> DataFrame.drop/Series.drop add an {{errors}} parameter defaulting to
> {{'raise'}} and now raise KeyError if any requested label is missing
> (previously only if all were missing), across all pandas versions. To skip
> missing labels, pass {{{}errors='ignore'{}}}.
> * *[SPARK-56219][PS] Align groupby idxmax and idxmin skipna=False behavior
> with pandas 2/3*
> *Component:* PS
> *Why no migration-guide note needed:* Should be documented: pandas-on-Spark
> groupby idxmax/idxmin with skipna=False now returns null for NA groups
> (pandas 2) or raises on NA inputs (pandas 3) instead of a label; no opt-out,
> results change for existing skipna=False users.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark
> {{GroupBy.idxmax}} and {{GroupBy.idxmin}} with {{skipna=False}} now follow
> pandas semantics for NA: with pandas 2 they return null for groups containing
> NA values, and with pandas 3 they raise on NA-containing inputs, instead of
> returning an index label. There is no opt-out.
> * *[SPARK-55977][PS] Fix isin() to use strict type matching like pandas*
> *Component:* PS
> *Why no migration-guide note needed:* Should be documented: ps
> Series/DataFrame.isin() now uses strict Python-type matching (e.g. 1 no
> longer matches '1'), changing results for all pandas versions; this is
> pandas-parity, no opt-out config.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, pandas-on-Spark
> Series/DataFrame {{isin()}} uses strict Python-type matching like pandas, so
> values of incompatible types no longer match (for example integer 1 no longer
> matches string '1'). Results change across all pandas versions. There is no
> opt-out.
> * *[SPARK-54568][PYTHON] Avoid unnecessary pandas conversion in create
> dataframe from ndarray*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: createDataFrame
> from a numpy ndarray now requires pyarrow and converts ndarray to Arrow
> directly, dropping np.dtype-based StructType inference so inferred schema can
> differ.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, createDataFrame
> from a NumPy ndarray converts the array directly to an Arrow Table and now
> requires PyArrow. The previous np.dtype-based StructType inference is
> dropped, so the inferred schema may differ. To control the schema, pass an
> explicit schema.
> * *[SPARK-56186][PYTHON] Retire pypy*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: PyPy is no longer
> officially supported in PySpark (CI, docker image, classifier, and
> PyPy-specific code removed); PyPy users should migrate to CPython.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, PyPy is no longer
> officially supported in PySpark: PyPy CI, the PyPy docker image, the setup.py
> classifier, and PyPy-specific code/test skips have been removed. PyPy users
> should migrate to CPython. There is no opt-out.
> * *[SPARK-55096][PYTHON] Update pandas minimum version in
> {{connect/setup.py}}*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: minimum pandas
> raised to 2.2.0 for Spark Connect (was 2.0.0); pandas <2.2 is no longer
> supported on Connect.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, the minimum
> supported version for pandas on Spark Connect has been raised from 2.0.0 to
> 2.2.0, matching the minimum already required by PySpark.
> * *[SPARK-54962][PYTHON] Fix nullable integers handling in Pandas UDF*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: Pandas UDFs on
> nullable integer columns containing nulls now use a nullable Int extension
> dtype instead of float64, so values/dtype inside the UDF change (fixing
> precision loss for large integers); no opt-out.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, Pandas UDFs on a
> nullable integer column that contains nulls receive a pandas nullable integer
> extension dtype (e.g. Int64) instead of float64, fixing precision loss for
> large integers. The dtype and values seen inside the UDF change accordingly.
> There is no opt-out configuration.
> * *[SPARK-55583][PYTHON] Validate Arrow schema types in Python data source*
> *Component:* PYTHON
> *Why no migration-guide note needed:* Should be documented: a Python Data
> Source read returning a pa.RecordBatch whose Arrow types differ from the
> declared schema now fails with DATA_SOURCE_RETURN_SCHEMA_MISMATCH;
> type-mismatched batches that previously loaded by coincidence now error.
> *Proposed migration-guide message:* [PySpark] In Spark 4.2, a Python data
> source read that returns a {{pa.RecordBatch}} whose Arrow types differ from
> the declared schema now fails with
> {{{}DATA_SOURCE_RETURN_SCHEMA_MISMATCH{}}}. Type-mismatched batches that
> previously loaded by coincidence now error. There is no opt-out.
> * *[SPARK-55416][SS][PYTHON] Streaming Python Data Source memory leak when
> end-offset is not updated*
> *Component:* SS,PYTHON
> *Why no migration-guide note needed:* Should be documented: a Streaming
> Python Data Source SimpleDataSourceStreamReader whose read() returns a
> non-empty batch with end==start now fails with
> STREAM_READER_OFFSET_DID_NOT_ADVANCE instead of leaking memory; affects
> existing (buggy) reader impls.
> *Proposed migration-guide message:* [SS] Since Spark 4.2, a streaming Python
> data source SimpleDataSourceStreamReader whose {{read()}} returns a non-empty
> batch with end offset equal to start now fails with
> {{SIMPLE_STREAM_READER_OFFSET_DID_NOT_ADVANCE}} instead of leaking driver
> memory. Empty batches with end == start are still allowed. There is no
> opt-out.
> * *[SPARK-56206][SQL] Fix case-insensitive duplicate CTE name detection*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: duplicate CTE
> names differing only in case (e.g. WITH cte1, CTE1) now raise
> DUPLICATED_CTE_NAMES instead of silently overwriting; previously-accepted
> queries now fail. No opt-out.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, duplicate CTE name
> detection is case-insensitive. CTE definitions whose names differ only in
> case (e.g. {{{}WITH cte AS (...), CTE AS (...){}}}) now raise
> {{DUPLICATED_CTE_NAMES}} instead of silently overwriting the earlier
> definition. There is no opt-out; rename the conflicting CTEs.
> * *[SPARK-56652][SQL] Always emit RELY/NORELY in DESCRIBE EXTENDED
> constraint output*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: DESCRIBE EXTENDED
> now always prints RELY/NORELY for table constraints (previously omitted the
> default NORELY), changing the command output text for tools parsing it.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, {{DESCRIBE
> EXTENDED}} always emits the {{{}RELY{}}}/{{{}NORELY{}}} token for table
> constraints, including {{NORELY}} for the default state which was previously
> omitted. This matches {{SHOW CREATE TABLE}} output and changes the command's
> constraint output text for tools parsing it.
> * *[SPARK-55019][SQL] Allow DROP TABLE to drop VIEW*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: DROP TABLE on a
> view now drops the view by default instead of raising
> WRONG_COMMAND_FOR_OBJECT_TYPE; restore via
> spark.sql.dropTableOnView.enabled=false.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, DROP TABLE on a
> view drops the view by default instead of raising
> {{{}WRONG_COMMAND_FOR_OBJECT_TYPE{}}}. To restore the previous behavior, set
> {{spark.sql.dropTableOnView.enabled}} to {{{}false{}}}.
> * *[SPARK-54853][SQL] Always check {{hive.exec.max.dynamic.partitions}} on
> the spark side*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented:
> hive.exec.max.dynamic.partitions is now always enforced Spark-side and the
> session-level value is honored, changing when the limit error fires; error
> renamed to DYNAMIC_PARTITION_WRITE_PARTITION_NUM_LIMIT_EXCEEDED.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the
> {{hive.exec.max.dynamic.partitions}} limit for dynamic partition writes to
> Hive SerDe tables is always enforced on the Spark side and honors the
> session-level value, changing when the limit is checked. The error is now
> reported as {{{}DYNAMIC_PARTITION_WRITE_PARTITION_NUM_LIMIT_EXCEEDED{}}}.
> * *[SPARK-55372][SQL] Fix {{SHOW CREATE TABLE}} for tables / views with
> default collation*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: typeName/toString
> of an explicitly UTF8_BINARY-collated StringType/CharType now render 'string
> collate UTF8_BINARY' not 'string' (default non-collated unchanged), changing
> SHOW CREATE TABLE and schema output for such columns.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, a
> StringType/CharType/VarcharType with an explicit {{UTF8_BINARY}} collation
> renders its collation in {{{}typeName{}}}/{{{}toString{}}} (for example
> {{{}string collate UTF8_BINARY{}}}), changing SHOW CREATE TABLE and schema
> output for such columns. Default non-collated strings are unchanged. No
> opt-out.
> * *[SPARK-54918][SQL] Normalize floating numbers in array set operations*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented:
> array_distinct/union/intersect/except and arrays_overlap now normalize floats
> so 0.0/-0.0 and differently-bit NaNs are treated as equal, changing results
> of these array set operations. No opt-out.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the array set
> functions {{{}array_distinct{}}}, {{{}array_union{}}},
> {{{}array_intersect{}}}, {{{}array_except{}}}, and {{arrays_overlap}}
> normalize floating-point values, so {{0.0}} and {{-0.0}} and differently-bit
> NaN values are treated as equal. This changes the results of these functions;
> there is no opt-out.
> * *[SPARK-54777][SQL] Changed dropTable error handling in
> JDBCTableCatalog.dropTable(...)*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: JDBC DROP TABLE
> now only swallows object-not-found errors; other failures (permission, etc.)
> propagate instead of silently returning, so a drop that previously appeared
> to succeed now throws.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the JDBC table
> catalog only swallows object-not-found errors when running DROP TABLE; other
> failures such as permission-denied or constraint violations now propagate
> instead of silently returning success. A DROP TABLE that previously appeared
> to succeed may now throw.
> * *[SPARK-57040][SQL] JDBC connector supports pushdown TABLESAMPLE SYSTEM*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: V2 JDBC
> TABLESAMPLE with withReplacement=true is no longer pushed down (correctness
> fix; pushdown default-on), so .sample(withReplacement=true) on JDBC tables
> now returns different (correct) results.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the JDBC connector
> no longer pushes down {{TABLESAMPLE}} when {{withReplacement=true}} (a
> correctness fix, as no mainstream RDBMS supports sampling with replacement),
> and adds {{TABLESAMPLE SYSTEM}} pushdown for PostgreSQL. Results of
> sample-with-replacement on JDBC tables change accordingly.
> * *[SPARK-56031][SQL] Make Natural Join column matching respect case
> sensitivity conf*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: NATURAL JOIN now
> respects spark.sql.caseSensitive (default false), so joins on case-differing
> common columns instead of degrading to CROSS JOIN, changing results; set
> spark.sql.caseSensitive=true to match case-sensitively.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, NATURAL JOIN
> respects {{spark.sql.caseSensitive}} (default {{{}false{}}}), so common
> columns that differ only in case are joined instead of degrading to a CROSS
> JOIN, changing results. To match columns case-sensitively, set
> {{spark.sql.caseSensitive}} to {{{}true{}}}.
> * *[SPARK-31561][SQL] Add QUALIFY Clause*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: QUALIFY is now a
> (non-reserved) clause keyword, so a query using unquoted QUALIFY as a
> trailing table alias (FROM t QUALIFY) now parses as a QUALIFY clause; quote
> the identifier to restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the {{QUALIFY}}
> clause is supported, and {{QUALIFY}} becomes a (non-reserved) clause keyword.
> A query using unquoted {{QUALIFY}} as a trailing table alias (e.g. {{{}FROM t
> QUALIFY{}}}) is now parsed as a {{QUALIFY}} clause. To restore the previous
> behavior, quote the alias (e.g. {{{}`QUALIFY`{}}}).
> * *[SPARK-57188][SQL] Parameterless function takes precedence over UDF
> parameter*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: a parameterless
> built-in (current_user, current_date, etc.) now takes precedence over a
> same-named SQL UDF parameter, changing UDF body results. Set
> spark.sql.legacy.allowUdfParameterToShadowParameterlessFunction=true to
> restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, a parameterless
> built-in function ({{{}current_user{}}}, {{{}current_date{}}},
> {{{}session_user{}}}, etc.) takes precedence over a same-named SQL UDF
> parameter in the function body. To restore the previous behavior, set
> {{spark.sql.legacy.allowUdfParameterToShadowParameterlessFunction}} to
> {{{}true{}}}.
> * *[SPARK-56045][SQL] Add flag for ignoring Parquet UNKNOWN type annotation
> and revert to old behavior*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: reading Parquet
> files with UNKNOWN logical-type annotation now infers physical type (e.g.
> IntegerType) instead of NullType shipped in v4.1.0; opt back into NullType
> via spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled=true.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, reading Parquet
> files with the {{UNKNOWN}} logical-type annotation infers the physical type
> (for example IntegerType) instead of the NullType used in 4.1.0. To restore
> the 4.1.0 behavior of inferring NullType, set
> {{spark.sql.parquet.reader.respectUnknownTypeAnnotation.enabled}} to
> {{{}true{}}}.
> * *[SPARK-56414][SQL] Per-write options should take precedence over session
> config in file source writes*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: per-write options
> (e.g. parquet.outputTimestampType) now override the matching session SQLConf
> in Parquet/Avro writes; previously such options were silently ignored, so
> written file format can change when both are set.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, per-write options
> take precedence over session config for several Parquet/Avro write keys (e.g.
> {{{}spark.sql.parquet.outputTimestampType{}}},
> {{{}spark.sql.parquet.writeLegacyFormat{}}}). Previously such options were
> silently ignored, so the written file format can change when both are set.
> * *[SPARK-56251][SQL] Add default fetchSize for postgres to avoid loading
> all data in memory*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: Postgres JDBC
> reads now default fetchSize to 1000 (was 0/all-in-memory), enabling cursor
> fetch with autoCommit=false; changes default read behavior. Set fetchsize=0
> to restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the PostgreSQL
> JDBC dialect defaults the read {{fetchsize}} to {{1000}} (was {{{}0{}}}),
> enabling cursor-based fetching with {{autoCommit=false}} to avoid loading the
> whole table into memory. To restore the previous behavior, set the
> {{fetchsize}} option to {{{}0{}}}.
> * *[SPARK-55155][SQL] Support foldable expressions in {{SET CATALOG}}
> statement*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: SET CATALOG with
> a bare name now resolves to a session variable of that name first (if one
> exists) before treating it as a catalog name; there is no opt-out config.
> Edge case but a default behavior change.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, SET CATALOG
> accepts foldable expressions and a bare name is first resolved as a session
> variable of that name (if one exists) before being treated as a catalog name.
> There is no opt-out; a session variable that shadows a catalog name changes
> which catalog is set.
> * *[SPARK-51518][SQL] Support | as an alternative to |> for the SQL pipe
> operator token*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: single-char '|'
> is now a SQL pipe-operator token by default, so a query using a pipe keyword
> as a column name after bitwise-OR may reparse; restore via
> spark.sql.parser.singleCharacterPipeOperator.enabled=false.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the SQL parser
> accepts single-character {{|}} as an alternative to {{|>}} for the pipe
> operator token by default. A pipe keyword used as a column name after a
> bitwise-OR {{|}} may now reparse. To restore the previous behavior, set
> {{spark.sql.parser.singleCharacterPipeOperator.enabled}} to {{{}false{}}}.
> * *[SPARK-52812][SQL] Make Spark Connect Catalog.createTable eager*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: Spark Connect
> Catalog.createTable now executes eagerly instead of lazily, so the table is
> created (and errors like already-exists surface) immediately at the call
> rather than on a later action.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, Spark Connect
> {{Catalog.createTable}} executes eagerly: the table is created (and errors
> such as table-already-exists surface) immediately at the call rather than
> lazily on a later action. Code relying on the previous lazy behavior is
> affected.
> * *[SPARK-55198][SQL] spark-sql should skip comment line with leading
> whitespaces*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: spark-sql CLI now
> skips comment lines that have leading whitespace before -- (line.trim
> startsWith --), matching Hive/beeline; such lines were previously sent as
> SQL. CLI-only, no opt-out.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, the spark-sql CLI
> skips comment lines whose first non-whitespace characters are {{--}} (i.e.
> {{line.trim}} starts with {{{}--{}}}), aligning with Hive and beeline.
> Previously such leading-whitespace comment lines were sent as SQL. There is
> no opt-out.
> * *[SPARK-49110][SQL] Simplify SubqueryAlias.metadataOutput to always
> propagate metadata columns*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: SubqueryAlias now
> always propagates metadata columns by default; queries that failed with
> AnalysisException may now succeed and joins may newly raise ambiguous-column
> errors. Restore via subqueryAliasAlwaysPropagateMetadataColumns=false.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, {{SubqueryAlias}}
> always propagates metadata columns from its child, so some queries that
> previously failed with AnalysisException now succeed and joins may raise new
> ambiguous-column errors. To restore the legacy behavior, set
> {{{}spark.sql.analyzer.subqueryAliasAlwaysPropagateMetadataColumns=false{}}}.
> * *[SPARK-56678][SQL] Use structured Catalog/Namespace/Table rows in
> DESCRIBE TABLE EXTENDED for v2 tables and views*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: DESCRIBE TABLE
> EXTENDED for v2 tables/views now emits structured
> Catalog/Namespace/Database/Table rows instead of a single Name/Identifier
> row; consumers parsing the output are affected.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, {{DESCRIBE TABLE
> EXTENDED}} for v2 tables and views emits structured {{{}Catalog{}}},
> {{{}Namespace{}}}, {{{}Database{}}}, and {{{}Table{}}}/{{{}View{}}} rows
> instead of a single {{{}Name{}}}/{{{}Identifier{}}} row. Consumers that parse
> the command output may be affected.
> * *[SPARK-56654][SQL] Reject unpaired UTF-16 surrogates in Variant JSON
> parsing*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented:
> parse_json/try_parse_json/from_json('variant') now reject unpaired UTF-16
> surrogates (error/NULL) instead of substituting U+FFFD; previously-accepted
> JSON now fails. Set spark.sql.variant.validateUnicodeInJsonParsing=false to
> restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2,
> {{{}parse_json{}}}, {{{}try_parse_json{}}}, and {{from_json}} to variant
> reject unpaired UTF-16 surrogates (raising an error or returning NULL)
> instead of silently substituting U+FFFD. To restore the previous permissive
> behavior, set {{spark.sql.variant.validateUnicodeInJsonParsing}} to
> {{{}false{}}}.
> * *[SPARK-56554][SQL] Respect inferSchema option when parsing XML as variant*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: reading XML as
> Variant now honors inferSchema=false (leaf/attribute text kept as strings,
> not inferred boolean/long/decimal), changing results. Set
> spark.sql.xml.variant.respectInferSchema=false to restore.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, reading XML as
> Variant honors {{{}inferSchema=false{}}}, keeping leaf text and attribute
> values as strings instead of inferring boolean/long/decimal. To restore the
> previous behavior of always inferring types, set
> {{spark.sql.xml.variant.respectInferSchema}} to {{{}false{}}}.
> * *[SPARK-54718][SQL] Preserve attributes names during CTE newInstance()*
> *Component:* SQL
> *Why no migration-guide note needed:* Should be documented: CTE newInstance()
> now preserves attribute name casing by default, changing output column-name
> casing for self-joins on a CTE with case-differing duplicate columns; restore
> via spark.sql.legacy.cteDuplicateAttributeNames.
> *Proposed migration-guide message:* [SQL] Since Spark 4.2, CTE relation
> references preserve attribute name casing when re-instantiated, so self-joins
> on a CTE with case-differing duplicate columns keep the original column-name
> casing in the output. To restore the previous behavior, set
> {{spark.sql.legacy.cteDuplicateAttributeNames}} to {{{}true{}}}.
> * *[SPARK-56280][SS] normalize NaN and +/-0.0 in streaming dedupe node*
> *Component:* SS
> *Why no migration-guide note needed:* Should be documented: streaming
> dropDuplicates on float/double keys now normalizes NaN and +/-0.0, so
> differently-bit NaNs and signed zeros are treated as duplicates; dedup
> results change with no opt-out.
> *Proposed migration-guide message:* [SS] Since Spark 4.2, streaming
> {{{}dropDuplicates{}}}/{{{}dropDuplicatesWithinWatermark{}}} on float or
> double key columns normalize NaN and signed zero, so differently-bit NaN
> values and {{{}+0.0{}}}/{{{}-0.0{}}} are treated as duplicates. This changes
> deduplication results for queries with floating-point keys; there is no
> opt-out.
> * *[SPARK-55058][SS] Throw error on inconsistent checkpoint metadata*
> *Component:* SS
> *Why no migration-guide note needed:* Should be documented: restarting a
> streaming query whose checkpoint has offset/commit logs but no metadata file
> now fails with MISSING_METADATA_FILE by default instead of starting a new
> query id; disable via the verifyMetadataExists.enabled config=false.
> *Proposed migration-guide message:* [SS] Since Spark 4.2, restarting a
> streaming query whose checkpoint has offset and commit logs but no metadata
> file fails with {{STREAMING_CHECKPOINT_MISSING_METADATA_FILE}} instead of
> silently generating a new query id. To restore the previous behavior, set
> {{spark.sql.streaming.checkpoint.verifyMetadataExists.enabled}} to
> {{{}false{}}}.
> * *[SPARK-56239][UI] Fix SQL tab DataTables: API default limit, date format,
> and appId resolution*
> *Component:* UI
> *Why no migration-guide note needed:* Should be documented: the long-existing
> /applications/\{appId}/sql REST endpoint default length changed from 20 to
> -1, so it now returns all SQL executions by default; clients relying on the
> 20-row default get more rows.
> *Proposed migration-guide message:* [Core] Since Spark 4.2, the
> {{/applications/\{appId}/sql}} REST endpoint defaults the {{length}}
> parameter to {{-1}} (was {{{}20{}}}), so it returns all SQL executions by
> default. To restore the previous behavior, pass {{length=20}} (any {{length
> <= 0}} returns all executions).
> * *[SPARK-55075][K8S] Track executor pod creation errors with
> ExecutorFailureTracker*
> *Component:* K8S
> *Why no migration-guide note needed:* Should be documented: on K8s, executor
> pod-creation failures are now caught, logged and counted by
> ExecutorFailureTracker (continue until max failures) instead of being
> rethrown immediately, changing default failure semantics for K8s deployments.
> *Proposed migration-guide message:* [Core] Since Spark 4.2, on Kubernetes
> executor pod-creation failures are caught, logged, and counted by
> ExecutorFailureTracker (allocation continues until the max-failures
> threshold) instead of being rethrown immediately. There is no opt-out.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]