bjornjorgensen commented on code in PR #608: URL: https://github.com/apache/spark-website/pull/608#discussion_r2104985282
########## releases/_posts/2025-05-23-spark-release-4-0-0.md: ##########
@@ -0,0 +1,694 @@
+---
+layout: post
+title: Spark Release 4.0.0
+categories: []
+tags: []
+status: publish
+type: post
+published: true
+meta:
+  _edit_last: '4'
+  _wpas_done_all: '1'
+---
+
+Apache Spark 4.0.0 marks a significant milestone as the inaugural release in the 4.x series, embodying the collective effort of the vibrant open-source community. This release is a testament to tremendous collaboration, resolving over 5100 tickets with contributions from more than 390 individuals.
+
+Spark Connect continues its rapid advancement, delivering substantial improvements:
+- A new lightweight Python client ([pyspark-client](https://pypi.org/project/pyspark-client)) at just 1.5 MB.
+- Full API compatibility for the Java client.
+- Greatly expanded API coverage.
+- ML on Spark Connect.
+- A new client implementation for [Swift](https://github.com/apache/spark-connect-swift).
+
+Spark SQL is significantly enriched with powerful new features designed to boost expressiveness and versatility for SQL workloads, such as VARIANT data type support, SQL user-defined functions, session variables, pipe syntax, and string collation.
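These features compose naturally. A minimal PySpark sketch, assuming a 4.0 `SparkSession` named `spark`; the `sales` table and all other identifiers are invented:

```python
# Session variables (SPARK-42849) and a SQL user-defined function (SPARK-46057).
spark.sql("DECLARE VARIABLE min_price DOUBLE DEFAULT 10.0")
spark.sql("SET VARIABLE min_price = 25.0")
spark.sql(
    "CREATE TEMPORARY FUNCTION with_tax(price DOUBLE) "
    "RETURNS DOUBLE RETURN price * 1.25"
)

# SQL pipe syntax (SPARK-49555): each |> step transforms the previous result.
spark.sql("""
    FROM sales
    |> WHERE price > min_price
    |> SELECT item, with_tax(price) AS gross
""").show()

# VARIANT (SPARK-45827): parse semi-structured JSON once, extract typed fields.
spark.sql(
    """SELECT variant_get(parse_json('{"a": {"b": 42}}'), '$.a.b', 'int')"""
).show()
```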
+
+PySpark sees continuous dedication to both its functional breadth and the overall developer experience, bringing a native plotting API, a new Python Data Source API, support for Python UDTFs, and unified profiling for PySpark UDFs, alongside numerous other enhancements.
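As a quick taste of Python UDTFs, a hedged sketch; the class and the sample query are invented:

```python
from pyspark.sql.functions import udtf

# A table function that emits one row per whitespace-separated token.
@udtf(returnType="word: string, length: int")
class SplitWords:
    def eval(self, text: str):
        for w in text.split():
            yield (w, len(w))

# Usage, assuming an active SparkSession `spark`:
# spark.udtf.register("split_words", SplitWords)
# spark.sql("SELECT * FROM split_words('hello spark world')").show()
```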
+
+Structured Streaming evolves with key additions that provide greater control and ease of debugging, notably the introduction of the Arbitrary State API v2 for more flexible state management and the State Data Source for easier debugging.
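The State Data Source, for instance, lets you inspect a streaming query's state store as an ordinary DataFrame. A sketch, assuming an active `SparkSession` named `spark` and an invented checkpoint path:

```python
# Read the key/value state behind a streaming query's checkpoint.
state_df = spark.read.format("statestore").load("/tmp/checkpoints/wordcount")

# Options such as "batchId" or "operatorId" can narrow the snapshot when a
# query has several stateful operators.
state_df.show(truncate=False)
```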
+
+To download Apache Spark 4.0.0, please visit the [downloads](https://spark.apache.org/downloads.html) page. For [detailed changes](https://issues.apache.org/jira/projects/SPARK/versions/12353359), you can consult JIRA. We have also curated a list of high-level changes here, grouped by major modules.
+
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+
+### Core and Spark SQL Highlights
+
+- [[SPARK-45314]](https://issues.apache.org/jira/browse/SPARK-45314) Drop Scala 2.12 and make Scala 2.13 the default
+- [[SPARK-45315]](https://issues.apache.org/jira/browse/SPARK-45315) Drop JDK 8/11 and make JDK 17 the default
+- [[SPARK-45923]](https://issues.apache.org/jira/browse/SPARK-45923) Spark Kubernetes Operator
+- [[SPARK-45869]](https://issues.apache.org/jira/browse/SPARK-45869) Revisit and improve Spark Standalone Cluster
+- [[SPARK-42849]](https://issues.apache.org/jira/browse/SPARK-42849) Session Variables
+- [[SPARK-44444]](https://issues.apache.org/jira/browse/SPARK-44444) Use ANSI SQL mode by default (see the sketch after this list)
+- [[SPARK-46057]](https://issues.apache.org/jira/browse/SPARK-46057) Support SQL user-defined functions
+- [[SPARK-45827]](https://issues.apache.org/jira/browse/SPARK-45827) Add VARIANT data type
+- [[SPARK-49555]](https://issues.apache.org/jira/browse/SPARK-49555) SQL Pipe syntax
+- [[SPARK-46830]](https://issues.apache.org/jira/browse/SPARK-46830) String Collation support
+- [[SPARK-44265]](https://issues.apache.org/jira/browse/SPARK-44265) Built-in XML data source support
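ANSI mode is the behavior change most likely to surface in existing jobs: arithmetic and casts that previously returned NULL silently now raise errors, and the `try_*` function variants keep the permissive behavior where it is wanted. A small illustration, assuming a `SparkSession` named `spark`:

```python
# Raises a DIVIDE_BY_ZERO error under ANSI mode (the 4.0 default).
try:
    spark.sql("SELECT 1 / 0").collect()
except Exception as e:
    print(type(e).__name__)

# Opt back into null-on-error semantics explicitly.
spark.sql("SELECT try_divide(1, 0)").show()  # NULL
```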
+
+
+### Spark Core
+
+- [[SPARK-49524]](https://issues.apache.org/jira/browse/SPARK-49524) Improve K8s support
+- [[SPARK-47240]](https://issues.apache.org/jira/browse/SPARK-47240) SPIP: Structured Logging Framework for Apache Spark
+- [[SPARK-44893]](https://issues.apache.org/jira/browse/SPARK-44893) `ThreadInfo` improvements for monitoring APIs
+- [[SPARK-46861]](https://issues.apache.org/jira/browse/SPARK-46861) Avoid Deadlock in DAGScheduler
+- [[SPARK-47764]](https://issues.apache.org/jira/browse/SPARK-47764) Cleanup shuffle dependencies based on `ShuffleCleanupMode`
+- [[SPARK-49459]](https://issues.apache.org/jira/browse/SPARK-49459) Support CRC32C for Shuffle Checksum
+- [[SPARK-46383]](https://issues.apache.org/jira/browse/SPARK-46383) Reduce Driver Heap Usage by shortening `TaskInfo.accumulables()` lifespan
+- [[SPARK-45527]](https://issues.apache.org/jira/browse/SPARK-45527) Use fraction-based resource calculation
+- [[SPARK-47172]](https://issues.apache.org/jira/browse/SPARK-47172) Add AES-GCM as an optional AES cipher mode for RPC encryption
+- [[SPARK-47448]](https://issues.apache.org/jira/browse/SPARK-47448) Enable `spark.shuffle.service.removeShuffle` by default
+- [[SPARK-47674]](https://issues.apache.org/jira/browse/SPARK-47674) Enable `spark.metrics.appStatusSource.enabled` by default
+- [[SPARK-48063]](https://issues.apache.org/jira/browse/SPARK-48063) Enable `spark.stage.ignoreDecommissionFetchFailure` by default
+- [[SPARK-48268]](https://issues.apache.org/jira/browse/SPARK-48268) Add `spark.checkpoint.dir` config (see the sketch after this list)
+- [[SPARK-48292]](https://issues.apache.org/jira/browse/SPARK-48292) Revert SPARK-39195 (OutputCommitCoordinator) to fix duplication issues
+- [[SPARK-48518]](https://issues.apache.org/jira/browse/SPARK-48518) Make LZF compression run in parallel
+- [[SPARK-46132]](https://issues.apache.org/jira/browse/SPARK-46132) Support key password for JKS keys for RPC SSL
+- [[SPARK-46456]](https://issues.apache.org/jira/browse/SPARK-46456) Add `spark.ui.jettyStopTimeout` to set Jetty server stop timeout
+- [[SPARK-46256]](https://issues.apache.org/jira/browse/SPARK-46256) Parallel Compression Support for ZSTD
+- [[SPARK-45544]](https://issues.apache.org/jira/browse/SPARK-45544) Integrate SSL support into `TransportContext`
+- [[SPARK-45351]](https://issues.apache.org/jira/browse/SPARK-45351) Change `spark.shuffle.service.db.backend` default value to `ROCKSDB`
+- [[SPARK-44741]](https://issues.apache.org/jira/browse/SPARK-44741) Support regex-based `MetricFilter` in `StatsdSink`
+- [[SPARK-43987]](https://issues.apache.org/jira/browse/SPARK-43987) Separate `finalizeShuffleMerge` Processing to Dedicated Thread Pools
+- [[SPARK-45439]](https://issues.apache.org/jira/browse/SPARK-45439) Reduce memory usage of `LiveStageMetrics.accumIdsToMetricType`
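One small quality-of-life item from the list above: per the SPARK-48268 ticket, the checkpoint directory can now be supplied as a plain config instead of an explicit `setCheckpointDir()` call. A sketch; the app name and path are invented:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("checkpoint-dir-demo")
    .config("spark.checkpoint.dir", "/tmp/spark-checkpoints")
    .getOrCreate()
)

rdd = spark.sparkContext.parallelize(range(10))
rdd.checkpoint()  # picks up spark.checkpoint.dir; no setCheckpointDir() call
print(rdd.count())
```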
+
+
+### Spark SQL
+
+#### Features
+
+- [[SPARK-50541]](https://issues.apache.org/jira/browse/SPARK-50541) Describe Table As JSON
+- [[SPARK-48031]](https://issues.apache.org/jira/browse/SPARK-48031) Support view schema evolution
+- [[SPARK-50883]](https://issues.apache.org/jira/browse/SPARK-50883) Support altering multiple columns in the same command
+- [[SPARK-47627]](https://issues.apache.org/jira/browse/SPARK-47627) Add `SQL MERGE` syntax to enable schema evolution
+- [[SPARK-47430]](https://issues.apache.org/jira/browse/SPARK-47430) Support `GROUP BY` for `MapType`
+- [[SPARK-49093]](https://issues.apache.org/jira/browse/SPARK-49093) `GROUP BY` with MapType nested inside complex type
+- [[SPARK-49098]](https://issues.apache.org/jira/browse/SPARK-49098) Add write options for `INSERT`
+- [[SPARK-49451]](https://issues.apache.org/jira/browse/SPARK-49451) Allow duplicate keys in `parse_json`
+- [[SPARK-46536]](https://issues.apache.org/jira/browse/SPARK-46536) Support `GROUP BY calendar_interval_type`
+- [[SPARK-46908]](https://issues.apache.org/jira/browse/SPARK-46908) Support star clause in `WHERE` clause
+- [[SPARK-36680]](https://issues.apache.org/jira/browse/SPARK-36680) Support dynamic table options via `WITH OPTIONS` syntax
+- [[SPARK-35553]](https://issues.apache.org/jira/browse/SPARK-35553) Improve correlated subqueries
+- [[SPARK-47492]](https://issues.apache.org/jira/browse/SPARK-47492) Widen whitespace rules in lexer to allow Unicode
+- [[SPARK-46246]](https://issues.apache.org/jira/browse/SPARK-46246) `EXECUTE IMMEDIATE` SQL support (see the sketch after these lists)
+- [[SPARK-46207]](https://issues.apache.org/jira/browse/SPARK-46207) Support `MergeInto` in DataFrameWriterV2
+- [[SPARK-50129]](https://issues.apache.org/jira/browse/SPARK-50129) Add DataFrame APIs for subqueries
+- [[SPARK-50075]](https://issues.apache.org/jira/browse/SPARK-50075) DataFrame APIs for table-valued functions
+
+
+#### Functions
+
+- [[SPARK-52016]](https://issues.apache.org/jira/browse/SPARK-52016) New built-in functions in Spark 4.0
+- [[SPARK-44001]](https://issues.apache.org/jira/browse/SPARK-44001) Add option to allow unwrapping protobuf well-known wrapper types
+- [[SPARK-43427]](https://issues.apache.org/jira/browse/SPARK-43427) Spark protobuf: allow upcasting unsigned integer types
+- [[SPARK-44983]](https://issues.apache.org/jira/browse/SPARK-44983) Convert `binary` to `string` by `to_char` for the formats: hex, base64, utf-8
+- [[SPARK-44868]](https://issues.apache.org/jira/browse/SPARK-44868) Convert `datetime` to `string` by `to_char`/`to_varchar`
+- [[SPARK-45796]](https://issues.apache.org/jira/browse/SPARK-45796) Support `MODE() WITHIN GROUP (ORDER BY col)` (see the sketch after these lists)
+- [[SPARK-48658]](https://issues.apache.org/jira/browse/SPARK-48658) Encode/Decode functions report coding errors instead of mojibake
+- [[SPARK-45034]](https://issues.apache.org/jira/browse/SPARK-45034) Support deterministic mode function
+- [[SPARK-44778]](https://issues.apache.org/jira/browse/SPARK-44778) Add the alias `TIMEDIFF` for `TIMESTAMPDIFF`
+- [[SPARK-47497]](https://issues.apache.org/jira/browse/SPARK-47497) Make `to_csv` support arrays/maps/binary as pretty strings
+- [[SPARK-44840]](https://issues.apache.org/jira/browse/SPARK-44840) Make `array_insert()` 1-based for negative indexes
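Two of the additions above in use, as a hedged sketch; the `sales` table is invented and a `SparkSession` named `spark` is assumed:

```python
# EXECUTE IMMEDIATE (SPARK-46246): run dynamically constructed SQL with
# positional parameters bound via USING.
spark.sql(
    "EXECUTE IMMEDIATE 'SELECT * FROM sales WHERE price > ?' USING 25.0"
).show()

# mode() WITHIN GROUP (SPARK-45796): the most frequent value per group.
spark.sql("""
    SELECT item, mode() WITHIN GROUP (ORDER BY price) AS typical_price
    FROM sales
    GROUP BY item
""").show()
```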
+
+#### Query optimization
+
+- [[SPARK-46946]](https://issues.apache.org/jira/browse/SPARK-46946) Supporting broadcast of multiple filtering keys in `DynamicPruning`
+- [[SPARK-48445]](https://issues.apache.org/jira/browse/SPARK-48445) Don't inline UDFs with expensive children
+- [[SPARK-41413]](https://issues.apache.org/jira/browse/SPARK-41413) Avoid shuffle in Storage-Partitioned Join when partition keys mismatch, but expressions are compatible
+- [[SPARK-46941]](https://issues.apache.org/jira/browse/SPARK-46941) Prevent insertion of window group limit node with `SizeBasedWindowFunction`
+- [[SPARK-46707]](https://issues.apache.org/jira/browse/SPARK-46707) Add throwable field to expressions to improve predicate pushdown
+- [[SPARK-47511]](https://issues.apache.org/jira/browse/SPARK-47511) Canonicalize `WITH` expressions by reassigning IDs
+- [[SPARK-46502]](https://issues.apache.org/jira/browse/SPARK-46502) Support timestamp types in `UnwrapCastInBinaryComparison`
+- [[SPARK-46069]](https://issues.apache.org/jira/browse/SPARK-46069) Support unwrap timestamp type to date type
+- [[SPARK-46219]](https://issues.apache.org/jira/browse/SPARK-46219) Unwrap cast in join predicates
+- [[SPARK-45606]](https://issues.apache.org/jira/browse/SPARK-45606) Release restrictions on multi-layer runtime filter
+- [[SPARK-45909]](https://issues.apache.org/jira/browse/SPARK-45909) Remove `NumericType` cast if it can safely up-cast in `IsNotNull`
+
+#### Query execution
+
+- [[SPARK-45592]](https://issues.apache.org/jira/browse/SPARK-45592) Correctness issue in AQE with `InMemoryTableScanExec`
+- [[SPARK-50258]](https://issues.apache.org/jira/browse/SPARK-50258) Fix output column order changed issue after AQE
+- [[SPARK-46693]](https://issues.apache.org/jira/browse/SPARK-46693) Inject `LocalLimitExec` when matching `OffsetAndLimit` or `LimitAndOffset`
+- [[SPARK-48873]](https://issues.apache.org/jira/browse/SPARK-48873) Use `UnsafeRow` in JSON parser
+- [[SPARK-41471]](https://issues.apache.org/jira/browse/SPARK-41471) Reduce Spark shuffle when only one side of a join is `KeyGroupedPartitioning`
+- [[SPARK-45452]](https://issues.apache.org/jira/browse/SPARK-45452) Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API
+- [[SPARK-48649]](https://issues.apache.org/jira/browse/SPARK-48649) Add `ignoreInvalidPartitionPaths` configs for skipping invalid partition paths
+- [[SPARK-45882]](https://issues.apache.org/jira/browse/SPARK-45882) `BroadcastHashJoinExec` propagate partitioning should respect CoalescedHashPartitioning
+
+
+### Spark Connectors
+
+#### Data Source V2 framework
+
+- [[SPARK-45784]](https://issues.apache.org/jira/browse/SPARK-45784) Introduce clustering mechanism to Spark
+- [[SPARK-50820]](https://issues.apache.org/jira/browse/SPARK-50820) DSv2: Conditional nullification of metadata columns in DML
+- [[SPARK-51938]](https://issues.apache.org/jira/browse/SPARK-51938) Improve Storage Partition Join
+- [[SPARK-50700]](https://issues.apache.org/jira/browse/SPARK-50700) `spark.sql.catalog.spark_catalog` supports builtin magic value
+- [[SPARK-48781]](https://issues.apache.org/jira/browse/SPARK-48781) Add Catalog APIs for loading stored procedures
+- [[SPARK-49246]](https://issues.apache.org/jira/browse/SPARK-49246) `TableCatalog#loadTable` should indicate if it's for writing
+- [[SPARK-45965]](https://issues.apache.org/jira/browse/SPARK-45965) Move DSv2 partitioning expressions into functions.partitioning
+- [[SPARK-46272]](https://issues.apache.org/jira/browse/SPARK-46272) Support CTAS using DSv2 sources
+- [[SPARK-46043]](https://issues.apache.org/jira/browse/SPARK-46043) Support create table using DSv2 sources
+- [[SPARK-48668]](https://issues.apache.org/jira/browse/SPARK-48668) Support `ALTER NAMESPACE ... UNSET PROPERTIES` in v2
+- [[SPARK-46442]](https://issues.apache.org/jira/browse/SPARK-46442) DS V2 supports push down `PERCENTILE_CONT` and `PERCENTILE_DISC`
+- [[SPARK-49078]](https://issues.apache.org/jira/browse/SPARK-49078) Support show columns syntax in v2 table
+
+#### Hive Catalog
+
+- [[SPARK-45328]](https://issues.apache.org/jira/browse/SPARK-45328) Remove Hive support prior to 2.0.0
+- [[SPARK-47101]](https://issues.apache.org/jira/browse/SPARK-47101) Allow comma in top-level column names and relax HiveExternalCatalog schema check
+- [[SPARK-45265]](https://issues.apache.org/jira/browse/SPARK-45265) Support Hive 4.0 metastore
+
+#### XML
+
+- [[SPARK-44265]](https://issues.apache.org/jira/browse/SPARK-44265) Built-in XML data source support
+
+#### CSV
+
+- [[SPARK-46862]](https://issues.apache.org/jira/browse/SPARK-46862) Disable CSV column pruning in multi-line mode
+- [[SPARK-46890]](https://issues.apache.org/jira/browse/SPARK-46890) Fix CSV parsing bug with default values and column pruning
+- [[SPARK-50616]](https://issues.apache.org/jira/browse/SPARK-50616) Add File Extension Option to CSV DataSource Writer
+- [[SPARK-49125]](https://issues.apache.org/jira/browse/SPARK-49125) Allow duplicated column names in CSV writing
+- [[SPARK-49016]](https://issues.apache.org/jira/browse/SPARK-49016) Restore behavior for queries from raw CSV files
+- [[SPARK-48807]](https://issues.apache.org/jira/browse/SPARK-48807) Binary support for CSV datasource
+- [[SPARK-48602]](https://issues.apache.org/jira/browse/SPARK-48602) Make csv generator support different output style via `spark.sql.binaryOutputStyle`
+
+#### ORC
+
+- [[SPARK-46648]](https://issues.apache.org/jira/browse/SPARK-46648) Use zstd as the default ORC compression
+- [[SPARK-47456]](https://issues.apache.org/jira/browse/SPARK-47456) Support ORC Brotli codec
+- [[SPARK-41858]](https://issues.apache.org/jira/browse/SPARK-41858) Fix ORC reader perf regression due to DEFAULT value feature
+
+#### Avro
+
+- [[SPARK-47739]](https://issues.apache.org/jira/browse/SPARK-47739) Register logical Avro type
+- [[SPARK-49082]](https://issues.apache.org/jira/browse/SPARK-49082) Widening type promotions in `AvroDeserializer`
+- [[SPARK-46633]](https://issues.apache.org/jira/browse/SPARK-46633) Fix Avro reader to handle zero-length blocks
+- [[SPARK-50350]](https://issues.apache.org/jira/browse/SPARK-50350) Avro: add new function `schema_of_avro` (Scala side)
+- [[SPARK-46930]](https://issues.apache.org/jira/browse/SPARK-46930) Add support for custom prefix for Union type fields in Avro
+- [[SPARK-46746]](https://issues.apache.org/jira/browse/SPARK-46746) Attach codec extension to Avro datasource files
+- [[SPARK-46759]](https://issues.apache.org/jira/browse/SPARK-46759) Support compression level for xz and zstandard in Avro
+- [[SPARK-46766]](https://issues.apache.org/jira/browse/SPARK-46766) Add ZSTD Buffer Pool support for Avro datasource
+- [[SPARK-43380]](https://issues.apache.org/jira/browse/SPARK-43380) Fix Avro data type conversion issues without causing performance regression
+- [[SPARK-48545]](https://issues.apache.org/jira/browse/SPARK-48545) Create `to_avro` and `from_avro` SQL functions
+- [[SPARK-46990]](https://issues.apache.org/jira/browse/SPARK-46990) Fix loading empty Avro files (infinite loop)
+
+#### JDBC
+
+- [[SPARK-47361]](https://issues.apache.org/jira/browse/SPARK-47361) Improve JDBC data sources
+- [[SPARK-44977]](https://issues.apache.org/jira/browse/SPARK-44977) Upgrade Derby to 10.16.1.1
+- [[SPARK-47044]](https://issues.apache.org/jira/browse/SPARK-47044) Add executed query for JDBC external datasources to explain output
+- [[SPARK-45139]](https://issues.apache.org/jira/browse/SPARK-45139) Add `DatabricksDialect` to handle SQL type conversion
+
+#### Other notable Spark Connectors changes
+
+- [[SPARK-45905]](https://issues.apache.org/jira/browse/SPARK-45905) Least common type between decimal types should retain integral digits first
+- [[SPARK-45786]](https://issues.apache.org/jira/browse/SPARK-45786) Fix inaccurate Decimal multiplication and division results
+- [[SPARK-50705]](https://issues.apache.org/jira/browse/SPARK-50705) Make `QueryPlan` lock-free
+- [[SPARK-46743]](https://issues.apache.org/jira/browse/SPARK-46743) Fix corner-case with `COUNT` + constant folding subquery
+- [[SPARK-47509]](https://issues.apache.org/jira/browse/SPARK-47509) Block subquery expressions in lambda/higher-order functions for correctness
+- [[SPARK-48498]](https://issues.apache.org/jira/browse/SPARK-48498) Always do char padding in predicates
+- [[SPARK-45915]](https://issues.apache.org/jira/browse/SPARK-45915) Treat decimal(x, 0) the same as IntegralType in PromoteStrings
+- [[SPARK-46220]](https://issues.apache.org/jira/browse/SPARK-46220) Restrict charsets in decode()
+- [[SPARK-45816]](https://issues.apache.org/jira/browse/SPARK-45816) Return `NULL` when overflowing during casting from timestamp to integers
+- [[SPARK-45586]](https://issues.apache.org/jira/browse/SPARK-45586) Reduce compiler latency for plans with large expression trees
+- [[SPARK-45507]](https://issues.apache.org/jira/browse/SPARK-45507) Correctness fix for nested correlated scalar subqueries with `COUNT` aggregates
+- [[SPARK-44550]](https://issues.apache.org/jira/browse/SPARK-44550) Enable correctness fixes for null `IN` (empty list) under ANSI
+- [[SPARK-47911]](https://issues.apache.org/jira/browse/SPARK-47911) Introduce a universal `BinaryFormatter` to make binary output consistent
+
+
+### PySpark Highlights
+
+- [[SPARK-49530]](https://issues.apache.org/jira/browse/SPARK-49530) Introducing PySpark Plotting API
+- [[SPARK-47540]](https://issues.apache.org/jira/browse/SPARK-47540) SPIP: Pure Python Package (Spark Connect)
+- [[SPARK-50132]](https://issues.apache.org/jira/browse/SPARK-50132) Add DataFrame API for Lateral Joins
+- [[SPARK-45981]](https://issues.apache.org/jira/browse/SPARK-45981) Improve Python language test coverage
+- [[SPARK-46858]](https://issues.apache.org/jira/browse/SPARK-46858) Upgrade Pandas to 2
+- [[SPARK-46910]](https://issues.apache.org/jira/browse/SPARK-46910) Eliminate JDK Requirement in PySpark Installation
+- [[SPARK-47274]](https://issues.apache.org/jira/browse/SPARK-47274) Provide more useful context for DataFrame API errors
+- [[SPARK-44076]](https://issues.apache.org/jira/browse/SPARK-44076) SPIP: Python Data Source API (see the sketch after this list)
+- [[SPARK-43797]](https://issues.apache.org/jira/browse/SPARK-43797) Python User-defined Table Functions
+- [[SPARK-46685]](https://issues.apache.org/jira/browse/SPARK-46685) PySpark UDF Unified Profiling
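The Python Data Source API deserves a closer look: a data source is now a small pure-Python class. A minimal sketch; the source name, schema, and generated rows are all invented:

```python
from pyspark.sql.datasource import DataSource, DataSourceReader

class FibonacciReader(DataSourceReader):
    def read(self, partition):
        # Yield plain tuples matching the declared schema.
        a, b = 0, 1
        for _ in range(10):
            yield (a,)
            a, b = b, a + b

class FibonacciSource(DataSource):
    @classmethod
    def name(cls):
        return "fibonacci"

    def schema(self):
        return "value int"

    def reader(self, schema):
        return FibonacciReader()

# Usage, assuming an active SparkSession `spark`:
# spark.dataSource.register(FibonacciSource)
# spark.read.format("fibonacci").load().show()
```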
+
+#### DataFrame APIs and Features
+
+- [[SPARK-51079]](https://issues.apache.org/jira/browse/SPARK-51079) Support large variable types in pandas UDF, `createDataFrame` and `toPandas` with Arrow
+- [[SPARK-50718]](https://issues.apache.org/jira/browse/SPARK-50718) Support `addArtifact(s)` for PySpark
+- [[SPARK-50778]](https://issues.apache.org/jira/browse/SPARK-50778) Add `metadataColumn` to PySpark DataFrame
+- [[SPARK-50719]](https://issues.apache.org/jira/browse/SPARK-50719) Support `interruptOperation` for PySpark
+- [[SPARK-50790]](https://issues.apache.org/jira/browse/SPARK-50790) Implement `parse_json` in PySpark
+- [[SPARK-49306]](https://issues.apache.org/jira/browse/SPARK-49306) Create SQL function aliases for `zeroifnull` and `nullifzero`
+- [[SPARK-50132]](https://issues.apache.org/jira/browse/SPARK-50132) Add DataFrame API for Lateral Joins
+- [[SPARK-43295]](https://issues.apache.org/jira/browse/SPARK-43295) Support string type columns for `DataFrameGroupBy.sum`
+- [[SPARK-45575]](https://issues.apache.org/jira/browse/SPARK-45575) Support time travel options for `df.read` API
+- [[SPARK-45755]](https://issues.apache.org/jira/browse/SPARK-45755) Improve `Dataset.isEmpty()` by applying global limit 1
+  - Improves performance of `isEmpty()` by pushing down a global limit of 1.

Review Comment:
   missing link

########## releases/_posts/2025-05-23-spark-release-4-0-0.md: ##########
+- [[SPARK-45755]](https://issues.apache.org/jira/browse/SPARK-45755) Improve `Dataset.isEmpty()` by applying global limit 1
+  - Improves performance of `isEmpty()` by pushing down a global limit of 1.
+- [[SPARK-48761]](https://issues.apache.org/jira/browse/SPARK-48761) Introduce `clusterBy` DataFrameWriter API for Scala
+- [[SPARK-45929]](https://issues.apache.org/jira/browse/SPARK-45929) Support `groupingSets` operation in DataFrame API (sketched below)
+  - Extends `groupingSets(...)` to DataFrame/DS-level APIs.

Review Comment:
   missing link
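For reference, the two writer/aggregation additions flagged above look roughly like this in PySpark. A hedged sketch; the data, table name, and columns are invented, and the `clusterBy` ticket above tracks the Scala API, with a Python counterpart in 4.0:

```python
from pyspark.sql import functions as F

# Toy data, assuming an active SparkSession `spark`.
df = spark.createDataFrame(
    [("us", "book", 12.0), ("us", "pen", 2.5), ("eu", "book", 11.0)],
    "region string, item string, price double",
)

# Declare clustering columns when writing out a table (SPARK-48761).
df.write.clusterBy("region").saveAsTable("events_clustered")

# DataFrame-level grouping sets, previously reachable only through SQL
# GROUPING SETS (SPARK-45929).
(df.groupingSets([["region"], ["region", "item"]], "region", "item")
   .agg(F.sum("price").alias("total"))
   .show())
```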
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org