alamb commented on code in PR #115:
URL: https://github.com/apache/datafusion-site/pull/115#discussion_r2378734213


##########
content/blog/2025-09-29-datafusion-50.0.0.md:
##########
@@ -0,0 +1,390 @@
+---
+layout: post
+title: Apache DataFusion 50.0.0 Released
+date: 2025-09-24
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[TOC]
+
+<!-- see https://github.com/apache/datafusion/issues/16347 for details -->
+
+## Introduction
+
+We are proud to announce the release of [DataFusion 50.0.0]. This blog post
+highlights some of the major improvements since the release of [DataFusion
+49.0.0]. The complete list of changes is available in the [changelog].
+
+[DataFusion 50.0.0]: https://crates.io/crates/datafusion/50.0.0
+[DataFusion 49.0.0]: https://datafusion.apache.org/blog/2025/07/28/datafusion-49.0.0/
+[changelog]: https://github.com/apache/datafusion/blob/branch-50/dev/changelog/50.0.0.md
+
+
+## Performance Improvements 🚀
+
+> **📝TODO** *Update chart*
+
+DataFusion continues to focus on enhancing performance, as shown in ClickBench
+and other benchmark results.
+
+<img src="/blog/images/datafusion-49.0.0/performance_over_time_clickbench.png"
+  width="100%" class="img-responsive" alt="ClickBench performance results over
+  time for DataFusion" />
+
+**Figure 1**: ClickBench performance improvements over time. Average and median
+normalized query execution times for ClickBench queries for each git revision.
+Query times are normalized using the ClickBench definition. Data and definitions
+on the [DataFusion Benchmarking
+Page](https://alamb.github.io/datafusion-benchmarking/).
+
+Here are some noteworthy optimizations added since DataFusion 49:
+
+**Dynamic Filter Pushdown Improvements**
+
+The dynamic filter pushdown optimization, which allows runtime filters to cut
+down on the amount of data read, has been extended to support **inner hash
+joins**. This optimization dramatically improves the performance of inner joins
+when one of the relations is relatively small or filtered by a highly selective
+predicate. Consider the following example:
+
+```sql
+-- retrieve the orders of the customer with c_phone = '25-989-741-2988'
+SELECT *
+FROM customer
+JOIN orders on c_custkey = o_custkey
+WHERE c_phone = '25-989-741-2988';
+```
+
+While previously the entire `orders` relation would be scanned to join with the
+target customer, now the dynamic filter pushdown can filter it right at the
+source, keeping the amount of data read to a minimum. The result is an order of
+magnitude faster execution time. This
+[article](https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/) goes
+into more detail about the dynamic filter pushdown optimization in DataFusion.
+
+The dynamic filter pushdown optimization in the TopK operator has also been
+improved in DataFusion 50.0.0, ensuring that the filters used are as selective
+as possible. You can read more about it in this
+[ticket](https://github.com/apache/datafusion/pull/16433).
+
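+These TopK filters apply to queries that combine `ORDER BY` with `LIMIT`. For
+example (using a hypothetical `hits` table in the style of ClickBench):
+
+```sql
+-- while the scan is still running, the current "top 10" threshold is pushed
+-- down as a dynamic filter, so rows that cannot make the top 10 are skipped
+SELECT *
+FROM hits
+ORDER BY "EventTime" DESC
+LIMIT 10;
+```
+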
+The next step will be to extend the dynamic filters to other types of joins,
+such as left and right ones.
+
+**Nested Loop Optimization**
+
+The nested loop join has been rewritten to reduce the execution time and the
+amount of memory required by using a finer-grained approach. Briefly, we now
+join one right batch with one left row at a time instead of joining a right
+batch with the entire left side in one step. This prevents having to potentially
+materialize large amounts of data at once. This new implementation also avoids
+some `indices <-> batches` conversions that were required in the old approach,
+further reducing the execution time.
+
+When evaluating this new approach in a microbenchmark, we have measured up to 5x
+improvements in execution time and 99% less memory usage. More details and
+results can be found in this
+[ticket](https://github.com/apache/datafusion/pull/16996).
+
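+As an illustration (the query below is a hypothetical example rather than one of
+the benchmark queries), DataFusion falls back to the nested loop join whenever a
+join has no equality condition, so queries like the following benefit directly
+from this rewrite:
+
+```sql
+-- a join with only an inequality condition cannot use a hash join,
+-- so it is executed with the nested loop join operator
+SELECT c.c_name, o.o_orderkey
+FROM customer c
+JOIN orders o ON o.o_totalprice > c.c_acctbal;
+```
+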
+**Parquet Metadata Caching**
+
+The metadata of Parquet files, such as min/max statistics and page indexes, is
+now cached to avoid unnecessary disk/network round-trips. This is especially
+useful with multiple small reads over relatively large files, allowing us to
+achieve an order of magnitude faster execution time. More information can be
+found in the [Parquet Metadata Cache](#parquet-metadata-cache) section.
+
+## Community Growth 📈
+
+In the last month and a half, between `49.0.0` and `50.0.0`, we have seen our
+community grow:
+
+1. New PMC members and committers: **📝TODO** joined the PMC. **📝TODO** joined
+   as committers. See the [mailing list] for more details.
+2. In the [core DataFusion repo] alone, we reviewed and accepted 318 PRs
+   from 79 different contributors, created over 235 issues, and closed 197 of them
+   🚀. All changes are listed in the detailed [changelogs].
+3. DataFusion published *[Using External Indexes, Metadata Stores, Catalogs and
+   Caches to Accelerate Queries on Apache Parquet]* and *[Dynamic Filters:
+   Passing Information Between Operators During Execution for 25x Faster
+   Queries]*, which detail several substantial performance optimizations.
+
+<!--
+# Unique committers
+$ git shortlog -sn 49.0.0..50.0.0  . | wc -l
+    79
+# commits
+$ git log --pretty=oneline 49.0.0..50.0.0  . | wc -l
+    318
+
+https://crates.io/crates/datafusion/49.0.0
+DataFusion 49 released July 25, 2025
+
+https://crates.io/crates/datafusion/50.0.0
+DataFusion 50 released September 16, 2025
+
+Issues created in this time: 117 open, 118 closed = 235 total
+https://github.com/apache/datafusion/issues?q=is%3Aissue+created%3A2025-07-25..2025-09-16
+
+Issues closed: 197
+https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2025-07-25..2025-09-16
+
+PRs merged in this time 371
+https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2025-07-25..2025-09-16
+-->
+
+
+[core DataFusion repo]: https://github.com/apache/arrow-datafusion
+[changelogs]: https://github.com/apache/datafusion/tree/main/dev/changelog
+[mailing list]: https://lists.apache.org/[email protected]
+[Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Apache Parquet]: https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/
+[Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/
+
+## New Features ✨
+
+### Spilling Sorts
+
+Larger-than-memory sorts are now much better supported in DataFusion 50.0.0,
+thanks to the recent introduction of multi-level merge sorts (more details in the
+respective [ticket](https://github.com/apache/datafusion/pull/15700)). This makes
+it possible to execute more queries that would otherwise trigger *out-of-memory*
+errors, by relying on disk spilling.
+
+### Dynamic Filter Pushdown For Hash Joins
+
+The [dynamic filter pushdown
+optimization](https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/)
+has been extended to inner hash joins, dramatically reducing the amount of
+scanned data in some workloads. More information can be found in the respective
+[ticket](https://github.com/apache/datafusion/pull/16445). This technique is
+also sometimes referred to as [*Sideways information
+passing*](https://www.cs.cmu.edu/~15721-f24/papers/Sideways_Information_Passing.pdf).
+
+These filters are automatically applied to inner hash joins, while future work
+will aim to introduce them to other join types. They can be toggled with the
+following setting:
+
+```sql
+datafusion.optimizer.enable_dynamic_filter_pushdown
+```
+
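+This setting is enabled by default. If you ever need to compare plans with and
+without the optimization, it can be turned off for a session like this:
+
+```sql
+-- dynamic filter pushdown is on by default; disable it for the current session
+SET datafusion.optimizer.enable_dynamic_filter_pushdown = false;
+```
+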
+The following example shows how execution plans look in DataFusion 50.0.0 with
+this optimization:
+
+```sql
+EXPLAIN ANALYZE
+SELECT *
+FROM customer
+JOIN orders on c_custkey = o_custkey
+WHERE c_phone = '25-989-741-2988';
+
+-- plan excerpt
+HashJoinExec
+    DataSourceExec:
+      predicate=c_phone@4 = 25-989-741-2988
+      metrics=[output_rows=1, ...]
+    DataSourceExec:
+      -- dynamic filter is added here, filtering directly at scan time
+      predicate=DynamicFilterPhysicalExpr [ o_custkey@1 >= 1 AND o_custkey@1 <= 1 ]
+      -- the number of output rows is kept to a minimum
+      metrics=[output_rows=11, ...]
+```
+
+### Parquet Metadata Cache
+
+The metadata of Parquet files (statistics, page indexes, ...) is now
+automatically cached to reduce disk/network round-trips and repeated decodings
+of the same information. With a simple microbenchmark that executes point reads
+(e.g., `SELECT v FROM t WHERE k = x`) over large files, we measured a 12x
+improvement in execution time (more details can be found in the respective
+[ticket](https://github.com/apache/datafusion/pull/16971/)). Further work was
+made to make this optimization production-ready, such as making the cache limit
+configurable. More details can be found in this
+[Epic](https://github.com/apache/datafusion/issues/17000).
+
+The cache can be configured with the following runtime parameter:
+
+```sql
+datafusion.runtime.metadata_cache_limit
+```
+
+By default, it uses up to 50MB of memory. Setting the limit to 0 will disable
+any metadata caching. The default `FileMetadataCache` implementation uses a
+*Least-recently-used* eviction algorithm. If necessary, we can provide a custom
+[`FileMetadataCache`](https://docs.rs/datafusion/50.0.0/datafusion/execution/cache/cache_manager/trait.FileMetadataCache.html)
+implementation when setting up the `RuntimeEnv`.
+
+If the underlying file changes, the cache is automatically invalidated.
+
+Here is the metadata caching in action:
+
+```sql
+-- disabling the metadata cache
+> SET datafusion.runtime.metadata_cache_limit = '0M';
+
+-- simple query (t.parquet: 100M rows, 3 cols)
+> EXPLAIN ANALYZE SELECT * FROM 't.parquet' LIMIT 1;
+DataSourceExec: ... metrics=[..., metadata_load_time=229.196422ms, ...]
+Elapsed 0.246 seconds.
+
+-- enabling the metadata cache
+> SET datafusion.runtime.metadata_cache_limit = '50M';
+
+> EXPLAIN ANALYZE SELECT * FROM 't.parquet' LIMIT 1;
+DataSourceExec: ... metrics=[..., metadata_load_time=228.612µs, ...]
+Elapsed 0.003 seconds. -- 82x improvement in this specific query
+```
+
+We can also inspect the cache contents through the
+`FileMetadataCache::list_entries` method. In `datafusion-cli`, we can use
+the
+[`metadata_cache()`](https://datafusion.apache.org/user-guide/cli/functions.html#metadata-cache)
+function:
+
+```sql
+> SELECT * FROM metadata_cache();
++---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
+| path          | file_modified           | file_size_bytes | e_tag                    | version | metadata_size_bytes | hits | extra           |
++---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
+| .../t.parquet | 2025-09-21T17:40:13.650 | 420827020       | 0-63f5331fb4458-19154f8c | NULL    | 44480534            | 27   | page_index=true |
++---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
+1 row(s) fetched.
+Elapsed 0.003 seconds.
+```
+
+### `QUALIFY` Clause
+
+The `QUALIFY` clause is now available in DataFusion
+([#16933](https://github.com/apache/datafusion/pull/16933)). It allows window
+function columns to be filtered without requiring a subquery (similarly to what
+`HAVING` does for aggregations).
+
+For example, this query:
+```sql
+SELECT a, b, c
+FROM (
+   SELECT a, b, c, rank() OVER(PARTITION BY a ORDER BY b) as rk
+   FROM t
+)
+WHERE rk = 1
+```
+
+can now be written like this:
+```sql
+SELECT a, b, c, rank() OVER(PARTITION BY a ORDER BY b) as rk
+FROM t
+QUALIFY rk = 1
+```
+
+Although it is not a part of the SQL standard (yet), it has been gaining
+adoption in several SQL analytical systems, such as DuckDB, Snowflake, and
+BigQuery.
+
+### `FILTER` Support for Window Functions
+
+In keeping with the theme, the `FILTER` clause has been extended to support
+[aggregate window functions](https://github.com/apache/datafusion/pull/17378).
+This allows these functions to be applied to specific rows without having to
+rely on `CASE` expressions, similar to what was already possible with regular
+aggregate functions.
+
+> **📝TODO** *Add a practical example?*

Review Comment:
   I poked around and asked ChatGPT and, to be honest, I couldn't really come up with anything compelling that actually meant something.
   
   How about we start with the example from the ticket? 
   
   ```suggestion
   
   For example, you could gather multiple distinct sets of values matching different criteria with a single pass over the input:
   
   ```sql
   SELECT 
     ARRAY_AGG(c2) FILTER (WHERE c2 >= 2) OVER (...),        -- e.g. [2, 3, 4]
     ARRAY_AGG(CASE WHEN c2 >= 2 THEN c2 END) OVER (...)     -- e.g. [NULL, NULL, 2, 3, 4]
   ...
   FROM table
   ```
   
   ```



##########
content/blog/2025-09-29-datafusion-50.0.0.md:
##########
@@ -0,0 +1,391 @@
+---
+layout: post
+title: Apache DataFusion 50.0.0 Released
+date: 2025-09-29
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[TOC]
+
+<!-- see https://github.com/apache/datafusion/issues/16347 for details -->
+
+## Introduction
+
+We are proud to announce the release of [DataFusion 50.0.0]. This blog post
+highlights some of the major improvements since the release of [DataFusion
+49.0.0]. The complete list of changes is available in the [changelog].
+
+[DataFusion 50.0.0]: https://crates.io/crates/datafusion/50.0.0
+[DataFusion 49.0.0]: https://datafusion.apache.org/blog/2025/07/28/datafusion-49.0.0/
+[changelog]: https://github.com/apache/datafusion/blob/branch-50/dev/changelog/50.0.0.md
+
+
+## Performance Improvements 🚀
+
+> **📝TODO** *Update chart*
+
+DataFusion continues to focus on enhancing performance, as shown in ClickBench
+and other benchmark results.
+
+<img src="/blog/images/datafusion-49.0.0/performance_over_time_clickbench.png"
+  width="100%" class="img-responsive" alt="ClickBench performance results over
+  time for DataFusion" />
+
+**Figure 1**: ClickBench performance improvements over time. Average and median
+normalized query execution times for ClickBench queries for each git revision.
+Query times are normalized using the ClickBench definition. Data and definitions
+on the [DataFusion Benchmarking
+Page](https://alamb.github.io/datafusion-benchmarking/).
+
+Here are some noteworthy optimizations added since DataFusion 49:
+
+**Dynamic Filter Pushdown Improvements**
+
+The dynamic filter pushdown optimization, which allows runtime filters to cut
+down on the amount of data read, has been extended to support **inner hash
+joins**. This optimization dramatically improves the performance of inner joins
+when one of the relations is relatively small or filtered by a highly selective
+predicate. Consider the following example:
+
+```sql
+-- retrieve the orders of the customer with c_phone = '25-989-741-2988'
+SELECT *
+FROM customer
+JOIN orders on c_custkey = o_custkey
+WHERE c_phone = '25-989-741-2988';
+```
+
+While previously the entire `orders` relation would be scanned to join with the
+target customer, now the dynamic filter pushdown can filter it right at the
+source, keeping the amount of data read to a minimum. The result is an order of
+magnitude faster execution time. This
+[article](https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/) goes
+into more detail about the dynamic filter pushdown optimization in DataFusion.
+
+The dynamic filter pushdown optimization in the TopK operator has also been
+improved in DataFusion 50.0.0, ensuring that the filters used are as selective
+as possible. You can read more about it in this
+[ticket](https://github.com/apache/datafusion/pull/16433).
+
+The next step will be to [extend the dynamic filters to other types of
+joins](https://github.com/apache/datafusion/issues/16973), such as left and
+right ones.
+
+**Nested Loop Optimization**
+
+The nested loop join has been rewritten to reduce execution time and memory
+usage by adopting a finer-grained approach. Specifically, we now limit the 
+intermediate data size to around a single `RecordBatch` for better memory
+efficiency, and we have eliminated redundant conversions from the old 
+implementation to further improve execution speed.
+
+When evaluating this new approach in a microbenchmark, we have measured up to 5x
+improvements in execution time and 99% less memory usage. More details and
+results can be found in this
+[ticket](https://github.com/apache/datafusion/pull/16996).
+
+**Parquet Metadata Caching**
+
+The metadata of Parquet files, such as min/max statistics and page indexes, is
+now cached to avoid unnecessary disk/network round-trips. This is especially
+useful with multiple small reads over relatively large files, allowing us to
+achieve an order of magnitude faster execution time. More information can be
+found in the [Parquet Metadata Cache](#parquet-metadata-cache) section.
+
+## Community Growth 📈
+
+In the last month and a half, between `49.0.0` and `50.0.0`, we have seen our
+community grow:
+
+1. Qi Zhu ([zhuqi-lucas](https://github.com/zhuqi-lucas)) and Yoav Cohen
+   ([yoavcloud](https://github.com/yoavcloud)) became committers. See the
+   [mailing list] for more details.
+2. In the [core DataFusion repo] alone, we reviewed and accepted 318 PRs
+   from 79 different contributors, created over 235 issues, and closed 197 of them
+   🚀. All changes are listed in the detailed [changelogs].
+3. DataFusion published several blogs, including *[Using External Indexes, Metadata Stores, Catalogs and
+   Caches to Accelerate Queries on Apache Parquet]*, *[Dynamic Filters:
+   Passing Information Between Operators During Execution for 25x Faster
+   Queries]*, and *[Implementing User Defined Types and Custom Metadata 
+   in DataFusion]*.
+
+<!--
+# Unique committers
+$ git shortlog -sn 49.0.0..50.0.0  . | wc -l
+    79
+# commits
+$ git log --pretty=oneline 49.0.0..50.0.0  . | wc -l
+    318
+
+https://crates.io/crates/datafusion/49.0.0
+DataFusion 49 released July 25, 2025
+
+https://crates.io/crates/datafusion/50.0.0
+DataFusion 50 released September 16, 2025
+
+Issues created in this time: 117 open, 118 closed = 235 total
+https://github.com/apache/datafusion/issues?q=is%3Aissue+created%3A2025-07-25..2025-09-16
+
+Issues closed: 197
+https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2025-07-25..2025-09-16
+
+PRs merged in this time 371
+https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2025-07-25..2025-09-16
+-->
+
+
+[core DataFusion repo]: https://github.com/apache/arrow-datafusion
+[changelogs]: https://github.com/apache/datafusion/tree/main/dev/changelog
+[mailing list]: https://lists.apache.org/[email protected]
+[Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Apache Parquet]: https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/
+[Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/
+[Implementing User Defined Types and Custom Metadata in DataFusion]: https://datafusion.apache.org/blog/2025/09/21/custom-types-using-metadata/
+
+## New Features ✨
+
+### Spilling Sorts
+
+Larger-than-memory sorts are now much better supported in DataFusion 50.0.0,
+thanks to the recent introduction of multi-level merge sorts (more details in the
+respective [ticket](https://github.com/apache/datafusion/pull/15700)). This makes
+it possible to execute more queries that would otherwise trigger *out-of-memory*
+errors, by relying on disk spilling.
+
+### Dynamic Filter Pushdown For Hash Joins
+
+The [dynamic filter pushdown
+optimization](https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/)
+has been extended to inner hash joins, dramatically reducing the amount of
+scanned data in some workloads. More information can be found in the respective
+[ticket](https://github.com/apache/datafusion/pull/16445). This technique is
+also sometimes referred to as [*Sideways information
+passing*](https://www.cs.cmu.edu/~15721-f24/papers/Sideways_Information_Passing.pdf).
+
+These filters are automatically applied to inner hash joins, while future work
+will aim to introduce them to other join types. They can be toggled with the
+following setting (enabled by default):
+
+```sql
+SET datafusion.optimizer.enable_dynamic_filter_pushdown = true;
+```
+
+We do not anticipate the need to turn them off. The following example shows how
+execution plans look in DataFusion 50.0.0 with this optimization:
+
+```sql
+EXPLAIN ANALYZE
+SELECT *
+FROM customer
+JOIN orders on c_custkey = o_custkey
+WHERE c_phone = '25-989-741-2988';
+
+-- plan excerpt
+HashJoinExec
+    DataSourceExec:
+      predicate=c_phone@4 = 25-989-741-2988
+      metrics=[output_rows=1, ...]
+    DataSourceExec:
+      -- dynamic filter is added here, filtering directly at scan time
+      predicate=DynamicFilterPhysicalExpr [ o_custkey@1 >= 1 AND o_custkey@1 <= 1 ]
+      -- the number of output rows is kept to a minimum
+      metrics=[output_rows=11, ...]
+```
+
+### Parquet Metadata Cache
+
+The metadata of Parquet files (statistics, page indexes, ...) is now
+automatically cached to reduce disk/network round-trips and repeated decodings
+of the same information. With a simple microbenchmark that executes point reads
+(e.g., `SELECT v FROM t WHERE k = x`) over large files, we measured a 12x
+improvement in execution time (more details can be found in the respective
+[ticket](https://github.com/apache/datafusion/pull/16971/)). Further work was
+made to make this optimization production-ready, such as making the cache limit
+configurable. More details can be found in this
+[Epic](https://github.com/apache/datafusion/issues/17000).
+
+The cache can be configured with the following runtime parameter:
+
+```sql
+datafusion.runtime.metadata_cache_limit
+```
+
+By default, it uses up to 50MB of memory. Setting the limit to 0 will disable
+any metadata caching. The default `FileMetadataCache` implementation uses a
+*Least-recently-used* eviction algorithm. If necessary, we can provide a custom
+[`FileMetadataCache`](https://docs.rs/datafusion/50.0.0/datafusion/execution/cache/cache_manager/trait.FileMetadataCache.html)
+implementation when setting up the `RuntimeEnv`.
+
+If the underlying file changes, the cache is automatically invalidated.
+
+Here is the metadata caching in action:
+
+```sql
+-- disabling the metadata cache
+> SET datafusion.runtime.metadata_cache_limit = '0M';
+
+-- simple query (t.parquet: 100M rows, 3 cols)
+> EXPLAIN ANALYZE SELECT * FROM 't.parquet' LIMIT 1;
+DataSourceExec: ... metrics=[..., metadata_load_time=229.196422ms, ...]
+Elapsed 0.246 seconds.
+
+-- enabling the metadata cache
+> SET datafusion.runtime.metadata_cache_limit = '50M';
+
+> EXPLAIN ANALYZE SELECT * FROM 't.parquet' LIMIT 1;
+DataSourceExec: ... metrics=[..., metadata_load_time=228.612µs, ...]
+Elapsed 0.003 seconds. -- 82x improvement in this specific query
+```
+
+We can also inspect the cache contents through the
+`FileMetadataCache::list_entries` method. In `datafusion-cli`, we can use
+the
+[`metadata_cache()`](https://datafusion.apache.org/user-guide/cli/functions.html#metadata-cache)
+function:
+
+```sql
+> SELECT * FROM metadata_cache();
++---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
+| path          | file_modified           | file_size_bytes | e_tag                    | version | metadata_size_bytes | hits | extra           |
++---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
+| .../t.parquet | 2025-09-21T17:40:13.650 | 420827020       | 0-63f5331fb4458-19154f8c | NULL    | 44480534            | 27   | page_index=true |
++---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
+1 row(s) fetched.
+Elapsed 0.003 seconds.
+```
+
+### `QUALIFY` Clause
+
+The `QUALIFY` clause is now available in DataFusion
+([#16933](https://github.com/apache/datafusion/pull/16933)). It allows window
+function columns to be filtered without requiring a subquery (similarly to what
+`HAVING` does for aggregations).
+
+For example, this query:
+```sql
+SELECT a, b, c
+FROM (
+   SELECT a, b, c, rank() OVER(PARTITION BY a ORDER BY b) as rk
+   FROM t
+)
+WHERE rk = 1
+```
+
+can now be written like this:
+```sql
+SELECT a, b, c, rank() OVER(PARTITION BY a ORDER BY b) as rk
+FROM t
+QUALIFY rk = 1
+```
+
+Although it is not a part of the SQL standard (yet), it has been gaining
+adoption in several SQL analytical systems, such as DuckDB, Snowflake, and
+BigQuery.
+
+### `FILTER` Support for Window Functions
+
+In keeping with the theme, the `FILTER` clause has been extended to support
+[aggregate window functions](https://github.com/apache/datafusion/pull/17378).
+This allows these functions to be applied to specific rows without having to
+rely on `CASE` expressions, similar to what was already possible with regular
+aggregate functions.
+
+> **📝TODO** *Add a practical example?*
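+
+As a sketch of what such an example could look like (the table and column
+names below are hypothetical):
+
+```sql
+-- compute two differently filtered running totals in a single pass,
+-- without resorting to CASE expressions
+SELECT
+  order_date,
+  SUM(amount) FILTER (WHERE status = 'paid')
+    OVER (ORDER BY order_date) AS paid_running_total,
+  SUM(amount) FILTER (WHERE status = 'refunded')
+    OVER (ORDER BY order_date) AS refunded_running_total
+FROM orders;
+```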
+
+### Behavior of User-Defined Functions
+
+DataFusion 50.0.0 now allows User-Defined Functions (UDFs) to access the global
+configuration parameters
+([#16970](https://github.com/apache/datafusion/pull/16970)), so that their
+behavior can better suit users' workloads. As an example, time UDFs can now use
+custom time zones instead of being limited to UTC.
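+
+As a purely illustrative sketch of the kind of behavior this enables (whether a
+given function consults a particular setting depends on its implementation), a
+function can now take session options such as the existing
+`datafusion.execution.time_zone` setting into account:
+
+```sql
+-- the session configuration is now visible to function implementations,
+-- so functions that opt in can honor settings like the session time zone
+SET datafusion.execution.time_zone = '+08:00';
+SELECT now();
+```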

Review Comment:
   ```suggestion
   DataFusion 50.0.0 now passes the session configuration parameters to
   User-Defined Functions (UDFs) via
   [ScalarFunctionArgs](https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.ScalarFunctionArgs.html);
   see [#16970](https://github.com/apache/datafusion/pull/16970). This allows
   behavior that varies based on runtime state; for example, time UDFs can use a
   session-specified time zone in addition to UTC.
   ```



##########
content/blog/2025-09-29-datafusion-50.0.0.md:
##########
@@ -0,0 +1,391 @@
+---
+layout: post
+title: Apache DataFusion 50.0.0 Released
+date: 2025-09-29
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[TOC]
+
+<!-- see https://github.com/apache/datafusion/issues/16347 for details -->
+
+## Introduction
+
+We are proud to announce the release of [DataFusion 50.0.0]. This blog post
+highlights some of the major improvements since the release of [DataFusion
+49.0.0]. The complete list of changes is available in the [changelog].
+
+[DataFusion 50.0.0]: https://crates.io/crates/datafusion/50.0.0
+[DataFusion 49.0.0]: https://datafusion.apache.org/blog/2025/07/28/datafusion-49.0.0/
+[changelog]: https://github.com/apache/datafusion/blob/branch-50/dev/changelog/50.0.0.md
+
+
+## Performance Improvements 🚀
+
+> **📝TODO** *Update chart*
+
+DataFusion continues to focus on enhancing performance, as shown in ClickBench
+and other benchmark results.
+
+<img src="/blog/images/datafusion-49.0.0/performance_over_time_clickbench.png"
+  width="100%" class="img-responsive" alt="ClickBench performance results over
+  time for DataFusion" />
+
+**Figure 1**: ClickBench performance improvements over time. Average and median
+normalized query execution times for ClickBench queries for each git revision.
+Query times are normalized using the ClickBench definition. Data and definitions
+on the [DataFusion Benchmarking
+Page](https://alamb.github.io/datafusion-benchmarking/).
+
+Here are some noteworthy optimizations added since DataFusion 49:
+
+**Dynamic Filter Pushdown Improvements**
+
+The dynamic filter pushdown optimization, which allows runtime filters to cut
+down on the amount of data read, has been extended to support **inner hash
+joins**. This optimization dramatically improves the performance of inner joins
+when one of the relations is relatively small or filtered by a highly selective
+predicate. Consider the following example:
+
+```sql
+-- retrieve the orders of the customer with c_phone = '25-989-741-2988'
+SELECT *
+FROM customer
+JOIN orders on c_custkey = o_custkey
+WHERE c_phone = '25-989-741-2988';
+```
+
+While previously the entire `orders` relation would be scanned to join with the
+target customer, now the dynamic filter pushdown can filter it right at the
+source, keeping the amount of data read to a minimum. The result is an order of
+magnitude faster execution time. This
+[article](https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/) goes
+into more detail about the dynamic filter pushdown optimization in DataFusion.
+
+The dynamic filter pushdown optimization in the TopK operator has also been
+improved in DataFusion 50.0.0, ensuring that the filters used are as selective
+as possible. You can read more about it in this
+[ticket](https://github.com/apache/datafusion/pull/16433).
+
+The next step will be to [extend the dynamic filters to other types of
+joins](https://github.com/apache/datafusion/issues/16973), such as left and
+right ones.
+
+**Nested Loop Optimization**
+
+The nested loop join has been rewritten to reduce execution time and memory
+usage by adopting a finer-grained approach. Specifically, we now limit the 
+intermediate data size to around a single `RecordBatch` for better memory
+efficiency, and we have eliminated redundant conversions from the old 
+implementation to further improve execution speed.
+
+When evaluating this new approach in a microbenchmark, we have measured up to 5x
+improvements in execution time and 99% less memory usage. More details and
+results can be found in this
+[ticket](https://github.com/apache/datafusion/pull/16996).
+
+**Parquet Metadata Caching**
+
+The metadata of Parquet files, such as min/max statistics and page indexes, is
+now cached to avoid unnecessary disk/network round-trips. This is especially
+useful with multiple small reads over relatively large files, allowing us to
+achieve an order of magnitude faster execution time. More information can be
+found in the [Parquet Metadata Cache](#parquet-metadata-cache) section.
+
+## Community Growth 📈
+
+In the last month and a half, between `49.0.0` and `50.0.0`, we have seen our
+community grow:
+
+1. Qi Zhu ([zhuqi-lucas](https://github.com/zhuqi-lucas)) and Yoav Cohen
+   ([yoavcloud](https://github.com/yoavcloud)) became committers. See the
+   [mailing list] for more details.
+2. In the [core DataFusion repo] alone, we reviewed and accepted 318 PRs
+   from 79 different contributors, created over 235 issues, and closed 197 of them
+   🚀. All changes are listed in the detailed [changelogs].
+3. DataFusion published several blogs, including *[Using External Indexes, Metadata Stores, Catalogs and
+   Caches to Accelerate Queries on Apache Parquet]*, *[Dynamic Filters:
+   Passing Information Between Operators During Execution for 25x Faster
+   Queries]*, and *[Implementing User Defined Types and Custom Metadata 
+   in DataFusion]*.
+
+<!--
+# Unique committers
+$ git shortlog -sn 49.0.0..50.0.0  . | wc -l
+    79
+# commits
+$ git log --pretty=oneline 49.0.0..50.0.0  . | wc -l
+    318
+
+https://crates.io/crates/datafusion/49.0.0
+DataFusion 49 released July 25, 2025
+
+https://crates.io/crates/datafusion/50.0.0
+DataFusion 50 released September 16, 2025
+
+Issues created in this time: 117 open, 118 closed = 235 total
+https://github.com/apache/datafusion/issues?q=is%3Aissue+created%3A2025-07-25..2025-09-16
+
+Issues closed: 197
+https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2025-07-25..2025-09-16
+
+PRs merged in this time 371
+https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2025-07-25..2025-09-16
+-->
+
+
+[core DataFusion repo]: https://github.com/apache/arrow-datafusion
+[changelogs]: https://github.com/apache/datafusion/tree/main/dev/changelog
+[mailing list]: https://lists.apache.org/[email protected]
+[Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Apache Parquet]: https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/
+[Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/
+[Implementing User Defined Types and Custom Metadata in DataFusion]: https://datafusion.apache.org/blog/2025/09/21/custom-types-using-metadata/
+
+## New Features ✨
+
+### Spilling Sorts
+
+Larger-than-memory sorts are now much better supported in DataFusion 50.0.0,
+thanks to the recent introduction of multi-level merge sorts (more details in the
+respective [ticket](https://github.com/apache/datafusion/pull/15700)). This makes
+it possible to execute more queries that would otherwise trigger *out-of-memory*
+errors, by relying on disk spilling.
+
+### Dynamic Filter Pushdown For Hash Joins
+
+The [dynamic filter pushdown
+optimization](https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/)
+has been extended to inner hash joins, dramatically reducing the amount of
+scanned data in some workloads. More information can be found in the respective
+[ticket](https://github.com/apache/datafusion/pull/16445). This technique is
+also sometimes referred to as [*Sideways information
+passing*](https://www.cs.cmu.edu/~15721-f24/papers/Sideways_Information_Passing.pdf).
+
+These filters are automatically applied to inner hash joins, while future work
+will aim to introduce them to other join types. They can be toggled with the
+following setting (enabled by default):
+
+```sql
+SET datafusion.optimizer.enable_dynamic_filter_pushdown = true;
+```
+
+We do not anticipate the need to turn them off. The following example shows how
+execution plans look in DataFusion 50.0.0 with this optimization:
+
+```sql
+EXPLAIN ANALYZE
+SELECT *
+FROM customer
+JOIN orders on c_custkey = o_custkey
+WHERE c_phone = '25-989-741-2988';
+
+-- plan excerpt
+HashJoinExec
+    DataSourceExec:
+      predicate=c_phone@4 = 25-989-741-2988
+      metrics=[output_rows=1, ...]
+    DataSourceExec:
+      -- dynamic filter is added here, filtering directly at scan time
+      predicate=DynamicFilterPhysicalExpr [ o_custkey@1 >= 1 AND o_custkey@1 <= 1 ]
+      -- the number of output rows is kept to a minimum
+      metrics=[output_rows=11, ...]
+```
+
+### Parquet Metadata Cache
+
+The metadata of Parquet files (statistics, page indexes, ...) is now
+automatically cached to reduce disk/network round-trips and repeated decodings
+of the same information. With a simple microbenchmark that executes point reads
+(e.g., `SELECT v FROM t WHERE k = x`) over large files, we measured a 12x
+improvement in execution time (more details can be found in the respective
+[ticket](https://github.com/apache/datafusion/pull/16971/)). Further work was
+made to make this optimization production-ready, such as making the cache limit
+configurable. More details can be found in this
+[Epic](https://github.com/apache/datafusion/issues/17000).
+
+The cache can be configured with the following runtime parameter:
+
+```sql
+datafusion.runtime.metadata_cache_limit
+```
+
+By default, it uses up to 50MB of memory. Setting the limit to 0 will disable
+any metadata caching. The default `FileMetadataCache` implementation uses a
+*Least-recently-used* eviction algorithm. If necessary, we can provide a custom
+[`FileMetadataCache`](https://docs.rs/datafusion/50.0.0/datafusion/execution/cache/cache_manager/trait.FileMetadataCache.html)
+implementation when setting up the `RuntimeEnv`.
+
+If the underlying file changes, the cache is automatically invalidated.
+
+Here is the metadata caching in action:
+
+```sql
+-- disabling the metadata cache
+> SET datafusion.runtime.metadata_cache_limit = '0M';
+
+-- simple query (t.parquet: 100M rows, 3 cols)
+> EXPLAIN ANALYZE SELECT * FROM 't.parquet' LIMIT 1;
+DataSourceExec: ... metrics=[..., metadata_load_time=229.196422ms, ...]
+Elapsed 0.246 seconds.
+
+-- enabling the metadata cache
+> SET datafusion.runtime.metadata_cache_limit = '50M';
+
+> EXPLAIN ANALYZE SELECT * FROM 't.parquet' LIMIT 1;
+DataSourceExec: ... metrics=[..., metadata_load_time=228.612µs, ...]
+Elapsed 0.003 seconds. -- 82x improvement in this specific query
+```
+
+We can also inspect the cache contents through the
+`FileMetadataCache::list_entries` method. In `datafusion-cli`, we can use
+the
+[`metadata_cache()`](https://datafusion.apache.org/user-guide/cli/functions.html#metadata-cache)
+function:
+
+```sql
+> SELECT * FROM metadata_cache();
++---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
+| path          | file_modified           | file_size_bytes | e_tag                    | version | metadata_size_bytes | hits | extra           |
++---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
+| .../t.parquet | 2025-09-21T17:40:13.650 | 420827020       | 0-63f5331fb4458-19154f8c | NULL    | 44480534            | 27   | page_index=true |
++---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
+1 row(s) fetched.
+Elapsed 0.003 seconds.
+```
+
+### `QUALIFY` Clause
+
+The `QUALIFY` clause is now available in DataFusion
+([#16933](https://github.com/apache/datafusion/pull/16933)). It allows window
+function columns to be filtered without requiring a subquery (similarly to what
+`HAVING` does for aggregations).
+
+For example, this query:
+```sql
+SELECT a, b, c
+FROM (
+   SELECT a, b, c, rank() OVER(PARTITION BY a ORDER BY b) as rk
+   FROM t
+)
+WHERE rk = 1
+```
+
+can now be written like this:
+```sql
+SELECT a, b, c, rank() OVER(PARTITION BY a ORDER BY b) as rk
+FROM t
+QUALIFY rk = 1
+```
+
+Although it is not a part of the SQL standard (yet), it has been gaining
+adoption in several SQL analytical systems, such as DuckDB, Snowflake, and
+BigQuery.
+
+### `FILTER` Support for Window Functions
+
+In keeping with the theme, the `FILTER` clause has been extended to support
+[aggregate window functions](https://github.com/apache/datafusion/pull/17378).
+This allows these functions to be applied to specific rows without having to
+rely on `CASE` expressions, similar to what was already possible with regular
+aggregate functions.
+
+> **📝TODO** *Add a practical example?*
+
+### Behavior of User-Defined Functions

Review Comment:
   ```suggestion
   ### `ConfigOptions` Now Available to Functions
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

