geoffreyclaude commented on code in PR #135:
URL: https://github.com/apache/datafusion-site/pull/135#discussion_r2719884290
##########
content/blog/2026-01-08-datafusion-52.0.0.md:
##########
@@ -0,0 +1,379 @@
+---
+layout: post
+title: Apache DataFusion 52.0.0 Released
+date: 2026-01-08
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[TOC]
+
+We are proud to announce the release of [DataFusion 52.0.0]. This post highlights
+some of the major improvements since [DataFusion 51.0.0]. The complete list of
+changes is available in the [changelog]. Thanks to the [121 contributors] for
+making this release possible.
+
+TODO: confirm the release date for 52.0.0 and update the front matter if needed.
+
+[DataFusion 52.0.0]: https://crates.io/crates/datafusion/52.0.0
+[DataFusion 51.0.0]: https://datafusion.apache.org/blog/2025/11/25/datafusion-51.0.0/
+[changelog]: https://github.com/apache/datafusion/blob/branch-52/dev/changelog/52.0.0.md
+[121 contributors]: https://github.com/apache/datafusion/blob/branch-52/dev/changelog/52.0.0.md#credits
+
+## Performance Improvements 🚀
+
+We continue to make significant performance improvements in DataFusion as explained below.
+
+### Faster `CASE` Expressions
+
+DataFusion 52 adds lookup-table-based evaluation for certain `CASE` expressions,
+avoiding repeated evaluation and accelerating common ETL patterns such as:
+
+```sql
+CASE company
+ WHEN 1 THEN 'Apple'
+ WHEN 5 THEN 'Samsung'
+ WHEN 2 THEN 'Motorola'
+ WHEN 3 THEN 'LG'
+ ELSE 'Other'
+END
+```
+
+This is the final work in our `CASE` performance epic ([#18075]), which has
+improved `CASE` evaluation significantly. Related PR: [#18183]. Thanks to
+[rluvaton] and [pepijnve] for the implementation.
+
+[rluvaton]: https://github.com/rluvaton
+[pepijnve]: https://github.com/pepijnve
+
+
+[#18075]: https://github.com/apache/datafusion/issues/18075
+[#18183]: https://github.com/apache/datafusion/pull/18183
+
+### New Merge Join
+
+DataFusion 52 includes a rewrite of the sort-merge join (SMJ) operator, with
+speedups of three orders of magnitude in some pathological cases such as the
+case in [#18487], which also affected [Apache Comet] workloads. Benchmarks in
+[#18875] show dramatic gains for TPC-H Q21 (minutes to milliseconds) while
+leaving other queries unchanged or modestly faster. Thanks to [mbutrovich] for
+the implementation and reviews from [Dandandan].
+
+[#18487]: https://github.com/apache/datafusion/issues/18487
+[#18875]: https://github.com/apache/datafusion/pull/18875
+[Apache Comet]: https://datafusion.apache.org/comet/
+[mbutrovich]: https://github.com/mbutrovich
+
+### Rewritten merge join
+
+DataFusion 52 includes a rewrite of the sort-merge join (SMJ) output buffering to
+avoid excessive `concat_batches` work and to use `BatchCoalescer` internally and
+for final output. This change targets pathological slowdowns like the reported
+LeftAnti join case in [#18487], which also affected Comet workloads that rely on
+SMJ. Benchmarks in [#18875] show dramatic gains for TPC-H Q21 (moving from
+minutes to milliseconds) while leaving most other queries unchanged or modestly
+faster, and the update is fully internal with no user-facing API changes.
+
+
+### Caching Improvements
+
+This release also includes several additional caching improvements.
+
+A new statistics cache for Parquet Metadata avoids repeatedly (re)calculating
+statistics for Parquet backed files. This significantly improves planning time
+for certain queries. You can see the contents of the new cache using the
+[statistics_cache] function in the CLI:
+
+[statistics_cache]: https://datafusion.apache.org/user-guide/cli/functions.html#statistics-cache
+
+
+```sql
+select * from statistics_cache();
++------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+
+| path             | file_modified       | file_size_bytes | e_tag                  | version | num_rows        | num_columns | table_size_bytes   | statistics_size_bytes |
++------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+
+| .../hits.parquet | 2022-06-25T22:22:22 | 14779976446     | 0-5e24d1ee16380-370f48 | NULL    | Exact(99997497) | 105         | Exact(36445943240) | 0                     |
++------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+
+```
+Thanks to [bharath-techie] and [nuno-faria] for implementing the statistics cache,
+with reviews from [martin-g], [alamb], and [alchemist51].
+Related PRs: [#18971], [#19054]
+
+[#18971]: https://github.com/apache/datafusion/pull/18971
+[#19054]: https://github.com/apache/datafusion/pull/19054
+[bharath-techie]: https://github.com/bharath-techie
+[nuno-faria]: https://github.com/nuno-faria
+[martin-g]: https://github.com/martin-g
+[alchemist51]: https://github.com/alchemist51
+
+
+A prefix-aware list-files cache accelerates evaluating partition predicates for
+Hive partitioned tables.
+
+```sql
+-- Read the hive partitioned dataset from Overture Maps (100s of Parquet files)
+CREATE EXTERNAL TABLE overturemaps
+STORED AS PARQUET LOCATION 's3://overturemaps-us-west-2/release/2025-12-17.0/';
+-- Find all files where the path contains `theme=base` without requiring another LIST call
+select count(*) from overturemaps where theme='base';
+```
+
+You can see the
+contents of the new cache using the [list_files_cache] function in the CLI:
+
+[list_files_cache]: https://datafusion.apache.org/user-guide/cli/functions.html#list-files-cache
+
+```sql
+create external table overturemaps
+stored as parquet
+location 's3://overturemaps-us-west-2/release/2025-12-17.0/theme=base/type=infrastructure';
+0 row(s) fetched.
+> select table, path, metadata_size_bytes, expires_in, unnest(metadata_list)['file_size_bytes'] as file_size_bytes, unnest(metadata_list)['e_tag'] as e_tag from list_files_cache() limit 10;
++--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+
+| table        | path                                                | metadata_size_bytes | expires_in                        | file_size_bytes | e_tag                                 |
++--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+
+| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 999055952       | "35fc8fbe8400960b54c66fbb408c48e8-60" |
+| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 975592768       | "8a16e10b722681cdc00242564b502965-59" |
+...
+| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 1016732378      | "6d70857a0473ed9ed3fc6e149814168b-61" |
+| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 991363784       | "c9cafb42fcbb413f851691c895dd7c2b-60" |
+| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 1032469715      | "7540252d0d67158297a67038a3365e0f-62" |
++--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+
+```
+
+Thanks to [BlakeOrth] and [Yuvraj-cyborg] for implementing the list-files cache work,
+with reviews from [gabotechs], [alamb], [alchemist51], [martin-g], and [BlakeOrth].
+Related PRs: [#18146], [#18855], [#19366], [#19298]
+
+[Epic #17214]: https://github.com/apache/datafusion/issues/17214
+[#18146]: https://github.com/apache/datafusion/pull/18146
+[#18855]: https://github.com/apache/datafusion/pull/18855
+[#19366]: https://github.com/apache/datafusion/pull/19366
+[#19298]: https://github.com/apache/datafusion/pull/19298
+[BlakeOrth]: https://github.com/BlakeOrth
+[Yuvraj-cyborg]: https://github.com/Yuvraj-cyborg
+
+
+### Improved Hash Join Filter Pushdown
+
+Starting in DataFusion 51, filtering information from `HashJoinExec` is passed
+dynamically to scans, as explained in the [Dynamic Filtering Blog] using a
+technique referred to as [Sideways Information Passing] in Database research
+literature. The initial implementation passed min/max values for the join keys.
+DataFusion 52 extends the optimization ([#17171] / [#18393]) to use an `IN` list when the
+build size is small, such as when the join is very selective. The `IN` list is
+pushed down to the probe side scan and is used to prune files, row groups, and
+individual rows. Thanks to [adriangb] for implementing this feature, with
+reviews from [LiaCastaneda], [asolimando], [comphead], and [mbutrovich].
+
+
+[Sideways Information Passing]: https://dl.acm.org/doi/10.1109/ICDE.2008.4497486
+[Dynamic Filtering blog]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/#hash-join-dynamic-filters
+
+[#17171]: https://github.com/apache/datafusion/issues/17171
+[#18393]: https://github.com/apache/datafusion/pull/18393
+[adriangb]: https://github.com/adriangb
+[LiaCastaneda]: https://github.com/LiaCastaneda
+[asolimando]: https://github.com/asolimando
+[comphead]: https://github.com/comphead
+
+
+## Major Features ✨
+
+### Arrow IPC Stream file support
+
+DataFusion can now read Arrow IPC stream files ([#18457]). This expands
+interoperability with systems that emit Arrow streams directly, making it
+simpler to ingest Arrow-native data without conversion. Thanks to [corasaurus-hex]
+for implementing this feature, with reviews from [martin-g], [Jefffrey],
+[jdcasale], [2010YOUY01], and [timsaucer].
+
+```sql
+CREATE EXTERNAL TABLE ipc_events
+STORED AS ARROW
+LOCATION 's3://bucket/events.arrow';
+```
+
+Related PRs: [#18457]
+
+[#18457]: https://github.com/apache/datafusion/pull/18457
+[corasaurus-hex]: https://github.com/corasaurus-hex
+[Jefffrey]: https://github.com/Jefffrey
+[jdcasale]: https://github.com/jdcasale
+[2010YOUY01]: https://github.com/2010YOUY01
+[timsaucer]: https://github.com/timsaucer
+
+### More Extensible SQL Planning with `RelationPlanner`
+
+DataFusion now has an API for extending the SQL planner for relations, as
+explained in the [Extending SQL in DataFusion Blog]. With this new API, you can
+customize DataFusion to support almost any SQL syntax, such as the following
+(which are not supported by default):
Review Comment:
I feel that this is slightly misleading: it reads as if the
`RelationPlanner` is what now allows extending expressions and types (and
relations). Maybe something like:
```
In addition to the existing expression and types extension points, this new
API now allows extending FROM clauses, leading DataFusion to support almost any
SQL syntax, such as the following (which are not supported by default):
```
But reworded a bit to be less of a run-on sentence...
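
To make the distinction concrete, here is a toy model of the three extension
points mentioned above. It is only a sketch: `ToyPlanner`, the hook lists, and
the `SAMPLE` syntax are all made up for illustration and are not DataFusion's
API. The point is that expression and type hooks already exist side by side,
and the new piece is a third hook list that handles relations (FROM clauses).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical hook type: takes a SQL fragment, returns a plan description,
# or None when the hook does not recognize the fragment.
Hook = Callable[[str], Optional[str]]


@dataclass
class ToyPlanner:
    """Toy dispatcher; the names here are assumptions, not DataFusion's API."""

    expr_hooks: List[Hook] = field(default_factory=list)      # existing
    type_hooks: List[Hook] = field(default_factory=list)      # existing
    relation_hooks: List[Hook] = field(default_factory=list)  # the new one

    @staticmethod
    def _first_match(hooks: List[Hook], fragment: str, kind: str) -> str:
        # Try each registered hook in order; the first non-None result wins.
        for hook in hooks:
            planned = hook(fragment)
            if planned is not None:
                return planned
        raise ValueError(f"unsupported {kind}: {fragment!r}")

    def plan_expr(self, fragment: str) -> str:
        return self._first_match(self.expr_hooks, fragment, "expression")

    def plan_type(self, fragment: str) -> str:
        return self._first_match(self.type_hooks, fragment, "type")

    def plan_relation(self, fragment: str) -> str:
        return self._first_match(self.relation_hooks, fragment, "relation")


def sample_relation_hook(fragment: str) -> Optional[str]:
    # Made-up FROM-clause syntax: "<table> SAMPLE <percent>".
    parts = fragment.split()
    if len(parts) == 3 and parts[1].upper() == "SAMPLE":
        return f"Sample(table={parts[0]}, percent={parts[2]})"
    return None


planner = ToyPlanner(relation_hooks=[sample_relation_hook])
print(planner.plan_relation("events SAMPLE 10"))
```

Without a registered relation hook, `plan_relation` raises `ValueError`, which
mirrors the "not supported by default" wording in the post.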