Re: [PR] DataFusion 54.0.0 blog post [datafusion-site]

via GitHub Thu, 11 Jun 2026 08:17:09 -0700


alamb commented on code in PR #197:
URL: https://github.com/apache/datafusion-site/pull/197#discussion_r3397077831



##########
content/blog/2026-06-08-datafusion-54.0.0.md:
##########
@@ -0,0 +1,437 @@
+---
+layout: post
+title: Apache DataFusion 54.0.0 Released
+date: 2026-06-08
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[TOC]
+
+We are proud to announce the release of [DataFusion 54.0.0]. This post highlights
+some of the major improvements since [DataFusion 53.0.0]. Notable additions
+include `LATERAL` joins, SQL lambda functions, and a new Avro reader, alongside
+significant join, scan, and planning performance improvements. The complete list
+of changes is available in the [changelog]. This release represents roughly 11
+weeks of development and 740 commits. Thanks to the [139 contributors]
+(a new record!) for making it possible.
+
+[DataFusion 54.0.0]: https://crates.io/crates/datafusion/54.0.0
+[DataFusion 53.0.0]: https://datafusion.apache.org/blog/2026/04/02/datafusion-53.0.0/
+[changelog]: https://github.com/apache/datafusion/blob/branch-54/dev/changelog/54.0.0.md
+[139 contributors]: https://github.com/apache/datafusion/blob/branch-54/dev/changelog/54.0.0.md#credits
+
+## Performance Improvements 🚀
+
+<img
+src="/blog/images/datafusion-54.0.0/performance_over_time_clickbench.png"
+width="100%"
+class="img-fluid"
+alt="Performance over time"
+/>
+
+**Figure 1**: Average and median normalized execution times for DataFusion 54.0.0 on ClickBench queries, compared to previous releases.
+Query times are normalized using the ClickBench definition. See the
+[DataFusion Benchmarking Page](https://alamb.github.io/datafusion-benchmarking/)
+for more details.
+
+We continue to make significant performance improvements in DataFusion, as
+explained below. This release prunes more redundant work out of plans and
+makes joins, repartitioning, scans, and many built-in functions faster.
+
+### Execution Operator Improvements
+
+**Physical Execution of Uncorrelated Scalar Subqueries**:
+DataFusion previously executed an uncorrelated scalar subquery (one that doesn't
+depend on the outer query) by rewriting it into a join. DataFusion 54 instead
+evaluates it once with a new physical operator. This lets functions use their
+specialized scalar code paths, and allows uncorrelated scalar subqueries in
+`ORDER BY`, `JOIN ON`, and as arguments to aggregate functions.
+Thanks to [@neilconway] for implementing this feature, with reviews from
+[@Dandandan], [@alamb], and [@timsaucer]. Related PRs: [#21240]
+
+**Faster Sort-Merge Joins**:
+Semi, anti, and mark joins now track matches with a per-row bitset instead of
+materializing `(outer, inner)` pairs. Batched deferred filtering makes
+near-unique `LEFT` and `FULL` joins 20-50x faster. Finally, join-key comparisons
+now use a `DynComparator` that resolves the column type once rather than per row,
+making microbenchmarks up to 12% faster and TPC-H ~5% faster overall.
+Thanks to [@mbutrovich] for this work, with reviews from [@Dandandan],
+[@comphead], and [@rluvaton]. Related PRs: [#20806], [#21184], [#21484], [#21517]
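
As a rough illustration of the bitset idea described above, here is a minimal Python sketch of a semi join over two inputs already sorted by key. It is a toy model, not DataFusion's operator: the function name `semi_join_bitset` and the list-based inputs are assumptions made for this example.

```python
def semi_join_bitset(outer_keys, inner_keys):
    """Return indices of outer rows that have at least one match in inner_keys.

    Both inputs must already be sorted by key, as in a sort-merge join.
    """
    matched = bytearray((len(outer_keys) + 7) // 8)  # one bit per outer row
    i = j = 0
    while i < len(outer_keys) and j < len(inner_keys):
        if outer_keys[i] < inner_keys[j]:
            i += 1
        elif outer_keys[i] > inner_keys[j]:
            j += 1
        else:
            # Mark the outer row; no (outer, inner) pair is materialized.
            matched[i // 8] |= 1 << (i % 8)
            i += 1  # keep j in place: the next outer row may share this key
    return [r for r in range(len(outer_keys)) if (matched[r // 8] >> (r % 8)) & 1]
```

An anti join would return the rows whose bit is still unset, and a mark join would return the bit itself as an extra column.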
+
+**Faster Repartitioning**:
+`RepartitionExec` now coalesces batches before sending them to distributor
+channels, cutting per-batch overhead for up to 50% faster execution on some
+repartition-heavy queries.
+Thanks to [@gabotechs] for this work, with reviews from [@Dandandan] and
+[@alamb]. Related PRs: [#22010]
+
+**Faster Functions and Hashing**:
+DataFusion ships hundreds of built-in functions, so speeding them up pays off
+across many workloads. This release optimizes many, including [array_to_string],
+[array_concat], [array_sort], [split_part], [substr], [strpos], [left], [right],
+[string_agg], and [approx_distinct], plus better `NULL` handling across many
+array and datetime functions. The `first_value` and `last_value` aggregates are
+also substantially faster over `Utf8` and `Binary` columns thanks to a new
+`GroupsAccumulator` ([#21090]). DataFusion 54 also swaps `ahash` for [foldhash] in
+`datafusion-common`, and optimizes [regexp_replace] by stripping trailing `.*`
+from anchored patterns.
+Thanks to the many contributors who drove this work, especially [@UBarney],
+[@neilconway], [@Dandandan], [@zhangxffff], [@lyne7-sc], [@CuteChuanChuan],
+[@kumarUjjawal], and [@coderfender].
+
+### Planner Improvements
+
+**Pruning Functionally Redundant Sort Keys**:
+Sorting is expensive, so it pays to sort by as few columns as possible.
+DataFusion 54 now drops functionally redundant `ORDER BY` keys: when an earlier
+key determines a later one, the later key can't change the ordering, so removing
+it cuts sorting cost without affecting results.
+Thanks to [@xiedeyantu] for implementing this feature, with reviews from
+[@alamb] and [@neilconway]. Related PRs: [#21362]
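
The pruning rule above can be sketched in a few lines of Python. This is an illustrative model only: `prune_sort_keys` and the `determines` map are hypothetical names, the map is assumed to be transitively closed already, and only single-column dependencies are modeled.

```python
def prune_sort_keys(sort_keys, determines):
    """Drop ORDER BY keys that are functionally determined by earlier keys.

    sort_keys:  column names in ORDER BY order
    determines: hypothetical dependency map, column -> set of columns it determines
                (assumed to be transitively closed)
    """
    kept = []
    implied = set()  # columns whose value is fixed once the kept keys are fixed
    for key in sort_keys:
        if key in implied:
            continue  # rows tied on earlier keys already agree on this key
        kept.append(key)
        implied.add(key)
        implied |= determines.get(key, set())
    return kept
```

For example, if `id` determines `name`, then `ORDER BY id, name, ts` sorts the same way as `ORDER BY id, ts`, and `prune_sort_keys(["id", "name", "ts"], {"id": {"name"}})` returns `["id", "ts"]`.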
+
+**Skip Redundant Parquet Filters**:
+When statistics prove a filter matches every row in a Parquet row group,
+DataFusion now skips evaluating it — both row filters and page-level pruning —
+for that row group instead of re-checking each row.
+Thanks to [@xudong963] for implementing this, building on a suggestion from
+[@crepererum]. Related issues and PRs: [#19028], [#21637]
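
A minimal sketch of the idea, assuming a single predicate of the form `col > threshold` and integer min/max statistics. The `RowGroupStats` class and both functions are hypothetical names for this example and do not reflect DataFusion's or Parquet's actual APIs.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    min_value: int
    max_value: int
    null_count: int


def filter_always_true(stats, threshold):
    """True if `col > threshold` provably holds for every row in the row group."""
    # NULLs never satisfy the comparison, so any NULL defeats the proof.
    return stats.null_count == 0 and stats.min_value > threshold


def scan_row_group(values, stats, threshold):
    if filter_always_true(stats, threshold):
        return list(values)  # skip per-row evaluation entirely
    return [v for v in values if v is not None and v > threshold]
```

The result is the same either way; the statistics only decide whether the per-row check can be skipped.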
+
+**Statistics-Driven Sort Pushdown and TopK**:
+Files and Parquet row groups are now ordered using statistics, which can avoid
+sorting entirely and improve dynamic filtering and early stopping for TopK
+(`ORDER BY ... LIMIT`) queries. The most promising data is read first, often
+satisfying the `LIMIT` before scanning the rest.
+Thanks to [@zhuqi-lucas] for driving this work, with reviews from [@adriangb].
+Related PRs: [#21182], [#21426], [#21956]
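
To show why reading the most promising data first helps, here is a toy Python model of `ORDER BY col ASC LIMIT k` over files that carry a per-file minimum statistic. The function name `topk_smallest` and the `(min_value, values)` tuples are assumptions for this sketch, not DataFusion structures.

```python
import heapq


def topk_smallest(files, k):
    """files: list of (min_value, values). Returns (k smallest values, files scanned)."""
    heap = []  # max-heap via negation, holding the best k values seen so far
    scanned = 0
    # Visit files in order of their minimum statistic.
    for min_value, values in sorted(files, key=lambda f: f[0]):
        if len(heap) == k and min_value >= -heap[0]:
            break  # no remaining file can beat the current k-th value
        scanned += 1
        for v in values:
            if len(heap) < k:
                heapq.heappush(heap, -v)
            elif v < -heap[0]:
                heapq.heapreplace(heap, -v)
    return sorted(-x for x in heap), scanned
```

With files whose minimums are 50, 1, and 20 and `k = 2`, the file with minimum 1 is read first, and if it already holds two values below 20 the other two files are never scanned.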
+
+**Improved Statistics and Cardinality Estimation**:
+Good plans depend on good statistics. This release extracts NDV (number of
+distinct values) statistics from Parquet metadata, uses NDV for equality-filter
+selectivity, adds a pluggable [StatisticsRegistry] for operator-level statistics
+propagation, and improves cardinality estimation for semi and anti joins.
+Thanks to [@asolimando], [@jonathanc-n], and [@buraksenn] for driving this work.
+Related PRs: [#19957], [#20789], [#21077], [#21081], [#21483], [#20904]
+
+### Scan Improvements
+
+**Morsel-Driven Parquet Scans**:
+Parquet scan parallelism was previously bounded by the slowest scan thread, so
+data skew (large row groups, less-selective filters, or variable object store
+latency) left cores underutilized. DataFusion 54 reworks the Parquet scan around
+a [morsel-driven design], where idle threads dynamically pull small units of work
+("morsels") instead of each being assigned a fixed partition up front. This
+spreads work more evenly and can be up to ~2x faster for skewed scans such as
+ClickBench.
+Thanks to [@Dandandan], [@alamb], [@adriangb], [@xudong963], and [@zhuqi-lucas]
+for collaborating on this substantial effort. Related issues and PRs: [#20529],
+[#21327], [#21342], [#21351]
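
The effect of skew can be seen in a small scheduling simulation. This is a deliberately simplified model, not the DataFusion scheduler: `static_makespan`, `morsel_makespan`, and the per-unit `costs` list are hypothetical, and each cost stands in for the time to scan one unit of work. `costs` must be non-empty.

```python
import heapq


def static_makespan(costs, workers):
    """Each worker is assigned a fixed contiguous partition up front."""
    size = -(-len(costs) // workers)  # ceiling division
    return max(sum(costs[i:i + size]) for i in range(0, len(costs), size))


def morsel_makespan(costs, workers):
    """Whichever worker becomes idle first pulls the next morsel."""
    free_at = [0] * workers  # min-heap of the times at which each worker is idle
    for cost in costs:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + cost)
    return max(free_at)
```

With `costs = [8, 8, 1, 1, 1, 1, 1, 1]` and four workers, the static split puts both expensive units on one worker and finishes at time 16, while the pull-based schedule finishes at time 8.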
+
+**Struct Field Filter Pushdown and Leaf-Level Projection**:
+Filters on struct fields (e.g. `WHERE s['foo'] > 67`) are now pushed down into the
+Parquet decoder rather than evaluated after a full scan, and both filtering and
+projection read only the struct leaves they actually access,
+significantly improving performance for nested and `Variant` data in large
+Parquet files.
+Thanks to [@friendlymatthew] for this work, with reviews from [@adriangb],
+[@cetra3], and [@AdamGS]. Related PRs: [#20822], [#20854], [#20925]
+
+## New Features ✨
+
+### `LATERAL` Joins
+
+Lateral joins have long been requested ([#10048]). DataFusion 54 adds basic
+support for `CROSS JOIN LATERAL`, `INNER JOIN LATERAL`, and `LEFT JOIN LATERAL`
+([#21202], [#21352]). A lateral subquery in the `FROM` clause can reference
+columns from preceding tables — handy for expanding a per-row series or
+correlating against a set-returning function. It uses decorrelation, so the
+subquery is evaluated once rather than re-executed per outer row.
+
+```sql
+-- For each row in t1, expand a series 1..t1_int and join the values back
+SELECT t1_id, t1_name, i
+FROM join_t1 t1
+CROSS JOIN LATERAL (
+    SELECT * FROM unnest(generate_series(1, t1_int))
+) AS series(i);
+```
+
+Thanks to [@neilconway] for implementing this feature, with reviews from
+[@Dandandan], [@alamb], and [@crm26].
+
+### Lambda Functions
+
+DataFusion now supports lambda expressions (`x -> expr`) with column capture,
+plus new higher-order array UDFs like [array_transform], [array_filter], and
+[array_any_match] ([#21323], [#21679]). Lambdas express per-element computation
+directly in SQL:
+
+```sql
+-- Apply `x * 10` to every element
+SELECT array_transform([1, 2, 3, 4, 5], x -> x * 10);
+-- [10, 20, 30, 40, 50]
+
+-- Keep only elements where `x > 2`
+SELECT array_filter([1, 2, 3, 4, 5], x -> x > 2);
+-- [3, 4, 5]
+
+-- True if any element satisfies `x > 2`
+SELECT array_any_match([1, 2, 3], x -> x > 2);
+-- true
+```
+
+Lambdas compose, so you can filter then transform in one expression:
+
+```sql
+-- Keep elements > 2, then multiply each survivor by 10
+SELECT array_transform(array_filter([1, 2, 3, 4, 5], x -> x > 2), x -> x * 10);
+-- [30, 40, 50]
+```
+
+Thanks to [@gstvg] and [@rluvaton] for leading this effort, [@ologlogn] and
+[@LiaCastaneda] for the [array_filter] and [array_any_match] functions, and
+[@comphead] and [@martin-g] for reviews.

Review Comment:
   Thank you -- great idea. added in cf8e6eb



