alamb commented on code in PR #91: URL: https://github.com/apache/datafusion-site/pull/91#discussion_r2231626685
########## content/blog/2025-07-28-datafusion-49.0.0.md: ########## @@ -0,0 +1,424 @@ +--- +layout: post +title: Apache DataFusion 49.0.0 Released +date: 2025-07-28 +author: pmc +categories: [release] +--- + +<!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %} +--> + +<!-- see https://github.com/apache/datafusion/issues/16347 for details --> + +## Introduction + +We are proud to announce the release of [DataFusion 49.0.0]. This blog post highlights some of +the major improvements since the release of [DataFusion 48.0.0]. The complete list of changes is available in the [changelog]. + +[DataFusion 49.0.0]: https://crates.io/crates/datafusion/49.0.0 +[DataFusion 48.0.0]: https://datafusion.apache.org/blog/2025/07/18/datafusion-48.0.0/ +[changelog]: https://github.com/apache/datafusion/blob/branch-49/dev/changelog/49.0.0.md + + +## Performance Improvements 🚀 + +DataFusion continues to focus on enhancing performance, as shown in the ClickBench and other results. 
+ +<img + src="/blog/images/datafusion-49.0.0/performance_over_time_clickbench.png" + width="100%" + class="img-responsive" + alt="ClickBench performance results over time for DataFusion" +/> + +**Figure 1**: ClickBench performance improvements over time +Average and median normalized query execution times for ClickBench queries for each git revision. +Query times are normalized using the ClickBench definition. Data and definitions on the +[DataFusion Benchmarking Page](https://alamb.github.io/datafusion-benchmarking/). + +<!-- +NOTE: Andrew is working on gathering these numbers + +<img +src="/blog/images/datafusion-49.0.0/performance_over_time_planning.png" +width="80%" +class="img-responsive" +alt="Planning benchmark performance results over time for DataFusion" +/> + +**Figure 2**: Planning benchmark performance improved XXX between DataFusion 48.0.1 and DataFusion 49.0.0. Chart source: TODO +--> + +Here are some noteworthy optimizations added since DataFusion 48: + +**Equivalence system upgrade:** The lower levels of the equivalence system, which is used to implement the Review Comment: FYI @ozankabak ########## content/blog/2025-07-28-datafusion-49.0.0.md: ########## @@ -0,0 +1,424 @@ +--- +layout: post +title: Apache DataFusion 49.0.0 Released +date: 2025-07-28 +author: pmc +categories: [release] +--- + +<!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. 
You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %} +--> + +<!-- see https://github.com/apache/datafusion/issues/16347 for details --> + +## Introduction + +We are proud to announce the release of [DataFusion 49.0.0]. This blog post highlights some of +the major improvements since the release of [DataFusion 48.0.0]. The complete list of changes is available in the [changelog]. + +[DataFusion 49.0.0]: https://crates.io/crates/datafusion/49.0.0 +[DataFusion 48.0.0]: https://datafusion.apache.org/blog/2025/07/18/datafusion-48.0.0/ +[changelog]: https://github.com/apache/datafusion/blob/branch-49/dev/changelog/49.0.0.md + + +## Performance Improvements 🚀 + +DataFusion continues to focus on enhancing performance, as shown in the ClickBench and other results. + +<img + src="/blog/images/datafusion-49.0.0/performance_over_time_clickbench.png" + width="100%" + class="img-responsive" + alt="ClickBench performance results over time for DataFusion" +/> + +**Figure 1**: ClickBench performance improvements over time +Average and median normalized query execution times for ClickBench queries for each git revision. +Query times are normalized using the ClickBench definition. Data and definitions on the +[DataFusion Benchmarking Page](https://alamb.github.io/datafusion-benchmarking/). 
+ +<!-- +NOTE: Andrew is working on gathering these numbers + +<img +src="/blog/images/datafusion-49.0.0/performance_over_time_planning.png" +width="80%" +class="img-responsive" +alt="Planning benchmark performance results over time for DataFusion" +/> + +**Figure 2**: Planning benchmark performance improved XXX between DataFusion 48.0.1 and DataFusion 49.0.0. Chart source: TODO +--> + +Here are some noteworthy optimizations added since DataFusion 48: + +**Equivalence system upgrade:** The lower levels of the equivalence system, which is used to implement the + optimizations described in [Using Ordering for Better Plans], were rewritten, leading to + much faster planning times, especially for queries with a [large number of columns](https://github.com/apache/datafusion/pull/16217#pullrequestreview-2891941229). This change also prepares + the way for more sophisticated sort-based optimizations in the future. (PR [#16217](https://github.com/apache/datafusion/pull/16217) by [ozankabak](https://github.com/ozankabak)). + +[Using Ordering for Better Plans]: https://datafusion.apache.org/blog/2025/03/11/ordering-analysis + +**Dynamic Filters and TopK pushdown** + +DataFusion now supports dynamic filters, which are improved during query execution, +and physical filter pushdown. Together, these features improve the performance of +queries that use `LIMIT` and `ORDER BY` clauses, such as the following: + +```sql +SELECT * +FROM data +ORDER BY timestamp DESC +LIMIT 10 +``` + +While the query above is simple, without dynamic filtering or knowing that the data +is already sorted by `timestamp`, a query engine must decode *all* of the data to +find the top 10 values. With the dynamic filters system, DataFusion applies an +increasingly selective filter during query execution. It checks the **current** +top 10 values of the `timestamp` column **before** opening files or reading +Parquet Row Groups and Data Pages, which can skip older data very quickly. 
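The skipping behaviour described in the paragraph above can be illustrated with a toy, standard-library-only sketch. It is not DataFusion's implementation: `Chunk` and `top_k_desc` are hypothetical names invented for this illustration, and a `Chunk` merely stands in for a file or Parquet row group that exposes a max statistic. The sketch assumes a min-heap of the current top `k` values, whose smallest element acts as the dynamic threshold that tightens as better values are seen.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical stand-in for a file or Parquet row group (not a DataFusion type).
struct Chunk {
    /// Max statistic that can be checked without decoding the values.
    max_timestamp: i64,
    timestamps: Vec<i64>,
}

impl Chunk {
    fn new(timestamps: Vec<i64>) -> Self {
        let max_timestamp = timestamps.iter().copied().max().unwrap_or(i64::MIN);
        Self { max_timestamp, timestamps }
    }
}

/// Returns the `k` largest timestamps in descending order, together with the
/// number of chunks that actually had to be decoded.
fn top_k_desc(chunks: &[Chunk], k: usize) -> (Vec<i64>, usize) {
    if k == 0 {
        return (Vec::new(), 0);
    }
    // Min-heap over the current top-k values: the root is the smallest of them.
    let mut heap: BinaryHeap<Reverse<i64>> = BinaryHeap::with_capacity(k);
    let mut decoded = 0;

    for chunk in chunks {
        // The "dynamic filter": only meaningful once k values are held.
        let threshold = if heap.len() == k {
            heap.peek().map(|r| r.0)
        } else {
            None
        };
        if let Some(t) = threshold {
            // No value in this chunk can enter the top k, so skip decoding it.
            if chunk.max_timestamp <= t {
                continue;
            }
        }

        decoded += 1;
        for &ts in &chunk.timestamps {
            if heap.len() < k {
                heap.push(Reverse(ts));
            } else if heap.peek().map_or(false, |r| ts > r.0) {
                // Replace the smallest held value; the threshold tightens.
                heap.pop();
                heap.push(Reverse(ts));
            }
        }
    }

    let mut top: Vec<i64> = heap.into_iter().map(|Reverse(ts)| ts).collect();
    top.sort_unstable_by(|a, b| b.cmp(a));
    (top, decoded)
}
```

Calling `top_k_desc(&chunks, 3)` on chunks holding `[100, 90, 80]`, `[70, 60, 50]`, `[200, 10, 5]` and `[40, 30]` decodes only the first and third chunk, because the other two have a max statistic at or below the threshold at the time they are considered. A real engine applies the same idea to file, row-group and page statistics rather than to in-memory vectors.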
+ +Dynamic predicates are a common feature of advanced engines such as [Dynamic +Filters in Starburst] and [Top-K Aggregation Optimization at Snowflake]. The +technique drastically improves query performance (we've seen over a 1.5x +improvement for some TPC-H-style queries), especially in combination with late +materialization and columnar file formats such as Parquet. We [plan to write a +blog post] explaining the details of this optimization in the future, and we expect to +use the same mechanism to implement additional optimizations such as [Sideways +Information Passing for joins] (Issue +[#15037](https://github.com/apache/datafusion/issues/15037) PR +[#15770](https://github.com/apache/datafusion/pull/15770) by +[adriangb](https://github.com/adriangb)). Review Comment: FYI @adriangb ########## content/blog/2025-07-28-datafusion-49.0.0.md: ########## @@ -0,0 +1,424 @@ +--- +layout: post +title: Apache DataFusion 49.0.0 Released +date: 2025-07-28 +author: pmc +categories: [release] +--- + +<!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %} +--> + +<!-- see https://github.com/apache/datafusion/issues/16347 for details --> + +## Introduction + +We are proud to announce the release of [DataFusion 49.0.0]. 
This blog post highlights some of +the major improvements since the release of [DataFusion 48.0.0]. The complete list of changes is available in the [changelog]. + +[DataFusion 49.0.0]: https://crates.io/crates/datafusion/49.0.0 +[DataFusion 48.0.0]: https://datafusion.apache.org/blog/2025/07/18/datafusion-48.0.0/ +[changelog]: https://github.com/apache/datafusion/blob/branch-49/dev/changelog/49.0.0.md + + +## Performance Improvements 🚀 + +DataFusion continues to focus on enhancing performance, as shown in the ClickBench and other results. + +<img + src="/blog/images/datafusion-49.0.0/performance_over_time_clickbench.png" + width="100%" + class="img-responsive" + alt="ClickBench performance results over time for DataFusion" +/> + +**Figure 1**: ClickBench performance improvements over time +Average and median normalized query execution times for ClickBench queries for each git revision. +Query times are normalized using the ClickBench definition. Data and definitions on the +[DataFusion Benchmarking Page](https://alamb.github.io/datafusion-benchmarking/). + +<!-- +NOTE: Andrew is working on gathering these numbers + +<img +src="/blog/images/datafusion-49.0.0/performance_over_time_planning.png" +width="80%" +class="img-responsive" +alt="Planning benchmark performance results over time for DataFusion" +/> + +**Figure 2**: Planning benchmark performance improved XXX between DataFusion 48.0.1 and DataFusion 49.0.0. Chart source: TODO +--> + +Here are some noteworthy optimizations added since DataFusion 48: + +**Equivalence system upgrade:** The lower levels of the equivalence system, which is used to implement the + optimizations described in [Using Ordering for Better Plans], were rewritten, leading to + much faster planning times, especially for queries with a [large number of columns](https://github.com/apache/datafusion/pull/16217#pullrequestreview-2891941229). This change also prepares + the way for more sophisticated sort-based optimizations in the future. 
(PR [#16217](https://github.com/apache/datafusion/pull/16217) by [ozankabak](https://github.com/ozankabak)). + +[Using Ordering for Better Plans]: https://datafusion.apache.org/blog/2025/03/11/ordering-analysis + +**Dynamic Filters and TopK pushdown** + +DataFusion now supports dynamic filters, which are improved during query execution, +and physical filter pushdown. Together, these features improve the performance of +queries that use `LIMIT` and `ORDER BY` clauses, such as the following: + +```sql +SELECT * +FROM data +ORDER BY timestamp DESC +LIMIT 10 +``` + +While the query above is simple, without dynamic filtering or knowing that the data +is already sorted by `timestamp`, a query engine must decode *all* of the data to +find the top 10 values. With the dynamic filters system, DataFusion applies an +increasingly selective filter during query execution. It checks the **current** +top 10 values of the `timestamp` column **before** opening files or reading +Parquet Row Groups and Data Pages, which can skip older data very quickly. + +Dynamic predicates are a common feature of advanced engines such as [Dynamic +Filters in Starburst] and [Top-K Aggregation Optimization at Snowflake]. The +technique drastically improves query performance (we've seen over a 1.5x +improvement for some TPC-H-style queries), especially in combination with late +materialization and columnar file formats such as Parquet. We [plan to write a +blog post] explaining the details of this optimization in the future, and we expect to +use the same mechanism to implement additional optimizations such as [Sideways +Information Passing for joins] (Issue +[#15037](https://github.com/apache/datafusion/issues/15037) PR +[#15770](https://github.com/apache/datafusion/pull/15770) by +[adriangb](https://github.com/adriangb)). 
+ + +[Dynamic Filters in Starburst]: https://docs.starburst.io/latest/admin/dynamic-filtering.html +[Top-K Aggregation Optimization at Snowflake]: https://www.snowflake.com/en/engineering-blog/optimizing-top-k-aggregation-snowflake/ +[plan to write a blog post]: https://github.com/apache/datafusion/issues/15513 +[Sideways Information Passing for joins]: https://github.com/apache/datafusion/issues/7955 + + + +## Community Growth 📈 + +The last few months, between `46.0.0` and `49.0.0`, have seen our community grow: + +1. New PMC members and committers: [berkay], [xudong963] and [timsaucer] joined the PMC. Review Comment: fyi @berkaysynnada @xudong963 @timsaucer @blaginin @milenkovicm @adriangb @kosiew ########## content/blog/2025-07-28-datafusion-49.0.0.md: ########## @@ -0,0 +1,424 @@ +--- +layout: post +title: Apache DataFusion 49.0.0 Released +date: 2025-07-28 +author: pmc +categories: [release] +--- + +<!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %} +--> + +<!-- see https://github.com/apache/datafusion/issues/16347 for details --> + +## Introduction + +We are proud to announce the release of [DataFusion 49.0.0]. This blog post highlights some of +the major improvements since the release of [DataFusion 48.0.0]. 
The complete list of changes is available in the [changelog]. + +[DataFusion 49.0.0]: https://crates.io/crates/datafusion/49.0.0 +[DataFusion 48.0.0]: https://datafusion.apache.org/blog/2025/07/18/datafusion-48.0.0/ +[changelog]: https://github.com/apache/datafusion/blob/branch-49/dev/changelog/49.0.0.md + + +## Performance Improvements 🚀 + +DataFusion continues to focus on enhancing performance, as shown in the ClickBench and other results. + +<img + src="/blog/images/datafusion-49.0.0/performance_over_time_clickbench.png" + width="100%" + class="img-responsive" + alt="ClickBench performance results over time for DataFusion" +/> + +**Figure 1**: ClickBench performance improvements over time +Average and median normalized query execution times for ClickBench queries for each git revision. +Query times are normalized using the ClickBench definition. Data and definitions on the +[DataFusion Benchmarking Page](https://alamb.github.io/datafusion-benchmarking/). + +<!-- +NOTE: Andrew is working on gathering these numbers + +<img +src="/blog/images/datafusion-49.0.0/performance_over_time_planning.png" +width="80%" +class="img-responsive" +alt="Planning benchmark performance results over time for DataFusion" +/> + +**Figure 2**: Planning benchmark performance improved XXX between DataFusion 48.0.1 and DataFusion 49.0.0. Chart source: TODO +--> + +Here are some noteworthy optimizations added since DataFusion 48: + +**Equivalence system upgrade:** The lower levels of the equivalence system, which is used to implement the + optimizations described in [Using Ordering for Better Plans], were rewritten, leading to + much faster planning times, especially for queries with a [large number of columns](https://github.com/apache/datafusion/pull/16217#pullrequestreview-2891941229). This change also prepares + the way for more sophisticated sort-based optimizations in the future. 
(PR [#16217](https://github.com/apache/datafusion/pull/16217) by [ozankabak](https://github.com/ozankabak)). + +[Using Ordering for Better Plans]: https://datafusion.apache.org/blog/2025/03/11/ordering-analysis + +**Dynamic Filters and TopK pushdown** + +DataFusion now supports dynamic filters, which are improved during query execution, +and physical filter pushdown. Together, these features improve the performance of +queries that use `LIMIT` and `ORDER BY` clauses, such as the following: + +```sql +SELECT * +FROM data +ORDER BY timestamp DESC +LIMIT 10 +``` + +While the query above is simple, without dynamic filtering or knowing that the data +is already sorted by `timestamp`, a query engine must decode *all* of the data to +find the top 10 values. With the dynamic filters system, DataFusion applies an +increasingly selective filter during query execution. It checks the **current** +top 10 values of the `timestamp` column **before** opening files or reading +Parquet Row Groups and Data Pages, which can skip older data very quickly. + +Dynamic predicates are a common feature of advanced engines such as [Dynamic +Filters in Starburst] and [Top-K Aggregation Optimization at Snowflake]. The +technique drastically improves query performance (we've seen over a 1.5x +improvement for some TPC-H-style queries), especially in combination with late +materialization and columnar file formats such as Parquet. We [plan to write a +blog post] explaining the details of this optimization in the future, and we expect to +use the same mechanism to implement additional optimizations such as [Sideways +Information Passing for joins] (Issue +[#15037](https://github.com/apache/datafusion/issues/15037) PR +[#15770](https://github.com/apache/datafusion/pull/15770) by +[adriangb](https://github.com/adriangb)). 
+ + +[Dynamic Filters in Starburst]: https://docs.starburst.io/latest/admin/dynamic-filtering.html +[Top-K Aggregation Optimization at Snowflake]: https://www.snowflake.com/en/engineering-blog/optimizing-top-k-aggregation-snowflake/ +[plan to write a blog post]: https://github.com/apache/datafusion/issues/15513 +[Sideways Information Passing for joins]: https://github.com/apache/datafusion/issues/7955 + + + +## Community Growth 📈 + +The last few months, between `46.0.0` and `49.0.0`, have seen our community grow: + +1. New PMC members and committers: [berkay], [xudong963] and [timsaucer] joined the PMC. + [blaginin], [milenkovicm], [adriangb] and [kosiew] joined as committers. See the [mailing list] for more details. +2. In the [core DataFusion repo] alone, we reviewed and accepted over 850 PRs from 172 different + committers, created over 669 issues, and closed 379 of them 🚀. All changes are listed in the detailed + [changelogs]. +3. DataFusion published a number of blog posts, including [User defined Window Functions], [Optimizing SQL (and DataFrames) + in DataFusion part 1], [part 2], [Using Rust async for Query Execution and Cancelling Long-Running Queries], and + [Embedding User-Defined Indexes in Apache Parquet Files]. + + +<!-- +# Unique committers +$ git shortlog -sn 46.0.0..49.0.0-rc1 .| wc -l + 172 +# commits +$ git log --pretty=oneline 46.0.0..49.0.0-rc1 . 
| wc -l + 884 + + +https://crates.io/crates/datafusion/49.0.0 +DataFusion 49 released July 25, 2025 + +https://crates.io/crates/datafusion/46.0.0 +DataFusion 46 released March 7, 2025 + +Issues created in this time: 290 open, 379 closed = 669 total +https://github.com/apache/datafusion/issues?q=is%3Aissue+created%3A2025-03-07..2025-07-25 + +Issues closed: 508 +https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2025-03-07..2025-07-25 + +PRs merged in this time 874 +https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2025-03-07..2025-07-25 + +--> + + +[core DataFusion repo]: https://github.com/apache/arrow-datafusion +[changelogs]: https://github.com/apache/datafusion/tree/main/dev/changelog +[mailing list]: https://lists.apache.org/list.html?d...@datafusion.apache.org +[berkay]: https://github.com/berkaysynnada +[xudong963]: https://github.com/xudong963 +[timsaucer]: https://github.com/timsaucer +[blaginin]: https://github.com/blaginin +[milenkovicm]: https://github.com/milenkovicm +[adriangb]: https://github.com/adriangb +[kosiew]: https://github.com/kosiew +[User defined Window Functions]: https://datafusion.apache.org/blog/2025/04/19/user-defined-window-functions +[Optimizing SQL (and DataFrames) in DataFusion part 1]: https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-one +[part 2]: https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-two +[Using Rust async for Query Execution and Cancelling Long-Running Queries]: https://datafusion.apache.org/blog/2025/06/30/cancellation +[Embedding User-Defined Indexes in Apache Parquet Files]: https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/ + + +## New Features ✨ + +### Async User-Defined Functions + +It is now possible to write `async` User-Defined Functions +(UDFs) in DataFusion that perform asynchronous +operations, such as network requests or database queries, without blocking the +execution of the query. 
This enables new use cases, such as +integrating with large language models (LLMs) or other external services, and we can't +wait to see what the community builds with it. + +See the [documentation] for more details and the [async UDF example] for +working code. + +[documentation]: https://datafusion.apache.org/library-user-guide/functions/adding-udfs.html +[async UDF example]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/async_udf.rs + +You could, for example, implement a function `ask_llm` that asks a large language model +(LLM) service a question based on the content of two columns. + +```sql +SELECT * +FROM animal a +WHERE ask_llm(a.name, 'Is this animal furry?') +``` + +The implementation of an async UDF is almost identical to a normal +UDF, except that it must implement the `AsyncScalarUDFImpl` trait in addition to `ScalarUDFImpl` and +provide an `async` implementation via `invoke_async_with_args`: + +```rust +#[derive(Debug)] +struct AskLLM { + signature: Signature, +} + +#[async_trait] +impl AsyncScalarUDFImpl for AskLLM { + /// The `invoke_async_with_args` method is similar to `invoke_with_args`, + /// but it returns a `Future` that resolves to the result. + /// + /// Since this signature is `async`, it can do any `async` operations, such + /// as network requests. + async fn invoke_async_with_args( + &self, + args: ScalarFunctionArgs, + options: &ConfigOptions, + ) -> Result<ArrayRef> { + // Converts the arguments to arrays for simplicity. + let args = ColumnarValue::values_to_arrays(&args.args)?; + let [column_of_interest, question] = take_function_args(self.name(), args)?; + let client = Client::new(); + + // Make a network request to a hypothetical LLM service + let res = client + .post(URI) + .headers(get_llm_headers(options)) + .json(&req) + .send() + .await? 
+ .json::<LLMResponse>() + .await?; + + let results = extract_results_from_llm_response(&res); + + Ok(Arc::new(results)) + } +} +``` + +(Issue [#6518](https://github.com/apache/datafusion/issues/6518), +[PR #14837](https://github.com/apache/datafusion/pull/14837) from +[goldmedal](https://github.com/goldmedal) 🏆) Review Comment: fyi @goldmedal ########## content/blog/2025-07-28-datafusion-49.0.0.md: ########## @@ -0,0 +1,424 @@ +--- +layout: post +title: Apache DataFusion 49.0.0 Released +date: 2025-07-28 +author: pmc +categories: [release] +--- + +<!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %} +--> + +<!-- see https://github.com/apache/datafusion/issues/16347 for details --> + +## Introduction + +We are proud to announce the release of [DataFusion 49.0.0]. This blog post highlights some of +the major improvements since the release of [DataFusion 48.0.0]. The complete list of changes is available in the [changelog]. 
+ +[DataFusion 49.0.0]: https://crates.io/crates/datafusion/49.0.0 +[DataFusion 48.0.0]: https://datafusion.apache.org/blog/2025/07/18/datafusion-48.0.0/ +[changelog]: https://github.com/apache/datafusion/blob/branch-49/dev/changelog/49.0.0.md + + +## Performance Improvements 🚀 + +DataFusion continues to focus on enhancing performance, as shown in the ClickBench and other results. + +<img + src="/blog/images/datafusion-49.0.0/performance_over_time_clickbench.png" + width="100%" + class="img-responsive" + alt="ClickBench performance results over time for DataFusion" +/> + +**Figure 1**: ClickBench performance improvements over time +Average and median normalized query execution times for ClickBench queries for each git revision. +Query times are normalized using the ClickBench definition. Data and definitions on the +[DataFusion Benchmarking Page](https://alamb.github.io/datafusion-benchmarking/). + +<!-- +NOTE: Andrew is working on gathering these numbers + +<img +src="/blog/images/datafusion-49.0.0/performance_over_time_planning.png" +width="80%" +class="img-responsive" +alt="Planning benchmark performance results over time for DataFusion" +/> + +**Figure 2**: Planning benchmark performance improved XXX between DataFusion 48.0.1 and DataFusion 49.0.0. Chart source: TODO +--> + +Here are some noteworthy optimizations added since DataFusion 48: + +**Equivalence system upgrade:** The lower levels of the equivalence system, which is used to implement the + optimizations described in [Using Ordering for Better Plans], were rewritten, leading to + much faster planning times, especially for queries with a [large number of columns](https://github.com/apache/datafusion/pull/16217#pullrequestreview-2891941229). This change also prepares + the way for more sophisticated sort-based optimizations in the future. (PR [#16217](https://github.com/apache/datafusion/pull/16217) by [ozankabak](https://github.com/ozankabak)). 
+ +[Using Ordering for Better Plans]: https://datafusion.apache.org/blog/2025/03/11/ordering-analysis + +**Dynamic Filters and TopK pushdown** + +DataFusion now supports dynamic filters, which are improved during query execution, +and physical filter pushdown. Together, these features improve the performance of +queries that use `LIMIT` and `ORDER BY` clauses, such as the following: + +```sql +SELECT * +FROM data +ORDER BY timestamp DESC +LIMIT 10 +``` + +While the query above is simple, without dynamic filtering or knowing that the data +is already sorted by `timestamp`, a query engine must decode *all* of the data to +find the top 10 values. With the dynamic filters system, DataFusion applies an +increasingly selective filter during query execution. It checks the **current** +top 10 values of the `timestamp` column **before** opening files or reading +Parquet Row Groups and Data Pages, which can skip older data very quickly. + +Dynamic predicates are a common feature of advanced engines such as [Dynamic +Filters in Starburst] and [Top-K Aggregation Optimization at Snowflake]. The +technique drastically improves query performance (we've seen over a 1.5x +improvement for some TPC-H-style queries), especially in combination with late +materialization and columnar file formats such as Parquet. We [plan to write a +blog post] explaining the details of this optimization in the future, and we expect to +use the same mechanism to implement additional optimizations such as [Sideways +Information Passing for joins] (Issue +[#15037](https://github.com/apache/datafusion/issues/15037) PR +[#15770](https://github.com/apache/datafusion/pull/15770) by +[adriangb](https://github.com/adriangb)). 
+ + +[Dynamic Filters in Starburst]: https://docs.starburst.io/latest/admin/dynamic-filtering.html +[Top-K Aggregation Optimization at Snowflake]: https://www.snowflake.com/en/engineering-blog/optimizing-top-k-aggregation-snowflake/ +[plan to write a blog post]: https://github.com/apache/datafusion/issues/15513 +[Sideways Information Passing for joins]: https://github.com/apache/datafusion/issues/7955 + + + +## Community Growth 📈 + +The last few months, between `46.0.0` and `49.0.0`, have seen our community grow: + +1. New PMC members and committers: [berkay], [xudong963] and [timsaucer] joined the PMC. + [blaginin], [milenkovicm], [adriangb] and [kosiew] joined as committers. See the [mailing list] for more details. +2. In the [core DataFusion repo] alone, we reviewed and accepted over 850 PRs from 172 different + committers, created over 669 issues, and closed 379 of them 🚀. All changes are listed in the detailed + [changelogs]. +3. DataFusion published a number of blog posts, including [User defined Window Functions], [Optimizing SQL (and DataFrames) + in DataFusion part 1], [part 2], [Using Rust async for Query Execution and Cancelling Long-Running Queries], and + [Embedding User-Defined Indexes in Apache Parquet Files]. + + +<!-- +# Unique committers +$ git shortlog -sn 46.0.0..49.0.0-rc1 .| wc -l + 172 +# commits +$ git log --pretty=oneline 46.0.0..49.0.0-rc1 . 
| wc -l + 884 + + +https://crates.io/crates/datafusion/49.0.0 +DataFusion 49 released July 25, 2025 + +https://crates.io/crates/datafusion/46.0.0 +DataFusion 46 released March 7, 2025 + +Issues created in this time: 290 open, 379 closed = 669 total +https://github.com/apache/datafusion/issues?q=is%3Aissue+created%3A2025-03-07..2025-07-25 + +Issues closed: 508 +https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2025-03-07..2025-07-25 + +PRs merged in this time 874 +https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2025-03-07..2025-07-25 + +--> + + +[core DataFusion repo]: https://github.com/apache/arrow-datafusion +[changelogs]: https://github.com/apache/datafusion/tree/main/dev/changelog +[mailing list]: https://lists.apache.org/list.html?d...@datafusion.apache.org +[berkay]: https://github.com/berkaysynnada +[xudong963]: https://github.com/xudong963 +[timsaucer]: https://github.com/timsaucer +[blaginin]: https://github.com/blaginin +[milenkovicm]: https://github.com/milenkovicm +[adriangb]: https://github.com/adriangb +[kosiew]: https://github.com/kosiew +[User defined Window Functions]: https://datafusion.apache.org/blog/2025/04/19/user-defined-window-functions +[Optimizing SQL (and DataFrames) in DataFusion part 1]: https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-one +[part 2]: https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-two +[Using Rust async for Query Execution and Cancelling Long-Running Queries]: https://datafusion.apache.org/blog/2025/06/30/cancellation +[Embedding User-Defined Indexes in Apache Parquet Files]: https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/ + + +## New Features ✨ + +### Async User-Defined Functions + +It is now possible to write `async` User-Defined Functions +(UDFs) in DataFusion that perform asynchronous +operations, such as network requests or database queries, without blocking the +execution of the query. 
This enables new use cases, such as +integrating with large language models (LLMs) or other external services, and we can't +wait to see what the community builds with it. + +See the [documentation] for more details and the [async UDF example] for +working code. + +[documentation]: https://datafusion.apache.org/library-user-guide/functions/adding-udfs.html +[async UDF example]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/async_udf.rs + +You could, for example, implement a function `ask_llm` that asks a large language model +(LLM) service a question about the values in a column. + +```sql +SELECT * +FROM animal a +WHERE ask_llm(a.name, 'Is this animal furry?') +``` + +The implementation of an async UDF is almost identical to a normal +UDF, except that it must implement the `AsyncScalarUDFImpl` trait in addition to `ScalarUDFImpl` and +provide an `async` implementation via `invoke_async_with_args`: + +```rust +#[derive(Debug)] +struct AskLLM { + signature: Signature, +} + +#[async_trait] +impl AsyncScalarUDFImpl for AskLLM { + /// The `invoke_async_with_args` method is similar to `invoke_with_args`, + /// but it returns a `Future` that resolves to the result. + /// + /// Since this signature is `async`, it can do any `async` operations, such + /// as network requests. + async fn invoke_async_with_args( + &self, + args: ScalarFunctionArgs, + options: &ConfigOptions, + ) -> Result<ArrayRef> { + // Converts the arguments to arrays for simplicity. + let args = ColumnarValue::values_to_arrays(&args.args)?; + let [column_of_interest, question] = take_function_args(self.name(), args)?; + let client = Client::new(); + + // Build the request body (`build_llm_request` is a hypothetical helper) + let req = build_llm_request(&column_of_interest, &question); + + // Make a network request to a hypothetical LLM service + let res = client + .post(URI) + .headers(get_llm_headers(options)) + .json(&req) + .send() + .await?
+ .json::<LLMResponse>() + .await?; + + let results = extract_results_from_llm_response(&res); + + Ok(Arc::new(results)) + } +} +``` + +(Issue [#6518](https://github.com/apache/datafusion/issues/6518), +[PR #14837](https://github.com/apache/datafusion/pull/14837) from +[goldmedal](https://github.com/goldmedal) 🏆) + + +### Better Cancellation for Certain Long-Running Queries + +In rare cases, it was previously not possible to cancel long-running queries, +leading to unresponsiveness. Other projects would likely have fixed this issue +by treating the symptom, but [pepijnve] and the DataFusion community worked together to +treat the root cause. The general solution required a deep understanding of the +DataFusion execution engine, Rust `Streams`, and the tokio cooperative +scheduling model. The [resulting PR] is a model of careful +community engineering and a great example of using Rust's `async` ecosystem +to implement complex functionality. It even resulted in a [contribution upstream to tokio] +(since accepted). See the [blog post] for more details. + +[resulting PR]: https://github.com/apache/datafusion/pull/16398 +[blog post]: https://datafusion.apache.org/blog/2025/06/30/cancellation +[contribution upstream to tokio]: https://github.com/tokio-rs/tokio/pull/7405 +[pepijnve]: https://github.com/pepijnve + +### Metadata for User Defined Types such as `Variant` and `Geometry` + +User-defined types have been [a long-requested feature], and this release provides +the low-level APIs to support them efficiently. + +1. Metadata handling in PRs [#15646](https://github.com/apache/datafusion/pull/15646) and [#16170](https://github.com/apache/datafusion/pull/16170) from [timsaucer] +2. 
Pushdown of filters and expressions (see "Dynamic Filters and TopK pushdown" section above) + +[a long-requested feature]: https://github.com/apache/datafusion/issues/12644 +[timsaucer]: https://github.com/timsaucer + +We still have some work to do to fully support user-defined types, specifically +in documentation and testing, and we would +love your help in this area. If you are interested in contributing, +please see [issue #12644](https://github.com/apache/datafusion/issues/12644). + +### Parquet Modular Encryption + +DataFusion now supports reading and writing encrypted [Apache Parquet] files with [modular +encryption]. This allows users to encrypt specific columns in a Parquet file +using different keys, while still being able to read data without needing to +decrypt the entire file. + +[Apache Parquet]: https://parquet.apache.org/ +[modular encryption]: https://parquet.apache.org/docs/file-format/data-pages/encryption/ + +Here is an example of how to configure DataFusion to read an encrypted Parquet +table with two columns, `double_field` and `float_field`, using modular +encryption: + +```sql +CREATE EXTERNAL TABLE encrypted_parquet_table +( +double_field double, +float_field float +) +STORED AS PARQUET LOCATION 'pq/' OPTIONS ( + -- encryption + 'format.crypto.file_encryption.encrypt_footer' 'true', + 'format.crypto.file_encryption.footer_key_as_hex' '30313233343536373839303132333435', -- b"0123456789012345" + 'format.crypto.file_encryption.column_key_as_hex::double_field' '31323334353637383930313233343530', -- b"1234567890123450" + 'format.crypto.file_encryption.column_key_as_hex::float_field' '31323334353637383930313233343531', -- b"1234567890123451" + -- decryption + 'format.crypto.file_decryption.footer_key_as_hex' '30313233343536373839303132333435', -- b"0123456789012345" + 'format.crypto.file_decryption.column_key_as_hex::double_field' '31323334353637383930313233343530', -- b"1234567890123450" + 
'format.crypto.file_decryption.column_key_as_hex::float_field' '31323334353637383930313233343531', -- b"1234567890123451" +); +``` + +([Issue #15216](https://github.com/apache/datafusion/issues/15216), +[PR #16351](https://github.com/apache/datafusion/pull/16351) +from [corwinjoy](https://github.com/corwinjoy) and [adamreeve](https://github.com/adamreeve)) + + +### Support for `WITHIN GROUP` for Ordered-Set Aggregate Functions + +DataFusion now supports the `WITHIN GROUP` clause for [ordered-set aggregate +functions] such as `approx_percentile_cont`, `percentile_cont`, and +`percentile_disc`, which lets users state explicitly how the input values are ordered. + +For example, the following query computes the 50th percentile (median) of the `temperature` column +in the `city_data` table by ordering the rows on `temperature`: + +```sql +SELECT + percentile_disc(0.5) WITHIN GROUP (ORDER BY temperature) AS median_temperature +FROM city_data; +``` + +[ordered-set aggregate functions]: https://www.postgresql.org/docs/9.4/functions-aggregate.html#FUNCTIONS-ORDEREDSET-TABLE + +(Issue [#11732](https://github.com/apache/datafusion/issues/11732), +PR [#13511](https://github.com/apache/datafusion/pull/13511), +by [Garamda](https://github.com/Garamda)) Review Comment: FYI @Garamda ########## content/blog/2025-07-28-datafusion-49.0.0.md: ########## @@ -0,0 +1,424 @@ +--- +layout: post +title: Apache DataFusion 49.0.0 Released +date: 2025-07-28 +author: pmc +categories: [release] +--- + +<!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License.
You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %} +--> + +<!-- see https://github.com/apache/datafusion/issues/16347 for details --> + +## Introduction + +We are proud to announce the release of [DataFusion 49.0.0]. This blog post highlights some of +the major improvements since the release of [DataFusion 48.0.0]. The complete list of changes is available in the [changelog]. + +[DataFusion 49.0.0]: https://crates.io/crates/datafusion/49.0.0 +[DataFusion 48.0.0]: https://datafusion.apache.org/blog/2025/07/18/datafusion-48.0.0/ +[changelog]: https://github.com/apache/datafusion/blob/branch-49/dev/changelog/49.0.0.md + + +## Performance Improvements 🚀 + +DataFusion continues to focus on enhancing performance, as shown in the ClickBench and other results. + +<img + src="/blog/images/datafusion-49.0.0/performance_over_time_clickbench.png" + width="100%" + class="img-responsive" + alt="ClickBench performance results over time for DataFusion" +/> + +**Figure 1**: ClickBench performance improvements over time +Average and median normalized query execution times for ClickBench queries for each git revision. +Query times are normalized using the ClickBench definition. Data and definitions on the +[DataFusion Benchmarking Page](https://alamb.github.io/datafusion-benchmarking/). 
+ +<!-- +NOTE: Andrew is working on gathering these numbers + +<img +src="/blog/images/datafusion-49.0.0/performance_over_time_planning.png" +width="80%" +class="img-responsive" +alt="Planning benchmark performance results over time for DataFusion" +/> + +**Figure 2**: Planning benchmark performance improved XXX between DataFusion 48.0.1 and DataFusion 49.0.0. Chart source: TODO +--> + +Here are some noteworthy optimizations added since DataFusion 48: + +**Equivalence system upgrade:** The lower levels of the equivalence system, which is used to implement the + optimizations described in [Using Ordering for Better Plans], were rewritten, leading to + much faster planning times, especially for queries with a [large number of columns](https://github.com/apache/datafusion/pull/16217#pullrequestreview-2891941229). This change also prepares + the way for more sophisticated sort-based optimizations in the future. (PR [#16217](https://github.com/apache/datafusion/pull/16217) by [ozankabak](https://github.com/ozankabak)). + +[Using Ordering for Better Plans]: https://datafusion.apache.org/blog/2025/03/11/ordering-analysis + +**Dynamic Filters and TopK pushdown** + +DataFusion now supports dynamic filters, which are improved during query execution, +and physical filter pushdown. Together, these features improve the performance of +queries that use `LIMIT` and `ORDER BY` clauses, such as the following: + +```sql +SELECT * +FROM data +ORDER BY timestamp DESC +LIMIT 10 +``` + +While the query above is simple, without dynamic filtering or knowing that the data +is already sorted by `timestamp`, a query engine must decode *all* of the data to +find the top 10 values. With the dynamic filters system, DataFusion applies an +increasingly selective filter during query execution. It checks the **current** +top 10 values of the `timestamp` column **before** opening files or reading +Parquet Row Groups and Data Pages, which can skip older data very quickly. 
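The pruning idea described above can be sketched in a few lines of plain Rust. This is only an illustration under stated assumptions, not DataFusion's actual implementation: `RowGroup`, `max_timestamp`, and `top_k_with_dynamic_filter` are hypothetical names, and each "row group" is modeled as an in-memory vector with a precomputed maximum standing in for Parquet statistics.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical stand-in for a Parquet row group: the values plus a max
/// statistic that can be checked without decoding the values.
struct RowGroup {
    max_timestamp: i64,
    timestamps: Vec<i64>,
}

/// Returns the `k` largest timestamps in descending order, together with the
/// number of row groups skipped by consulting only their statistics.
fn top_k_with_dynamic_filter(groups: &[RowGroup], k: usize) -> (Vec<i64>, usize) {
    // Min-heap holding the current top-k values; its root is the threshold.
    let mut heap: BinaryHeap<Reverse<i64>> = BinaryHeap::new();
    let mut skipped = 0;

    for group in groups {
        // The "dynamic filter": once k values are held, only timestamps
        // larger than the smallest of them can change the result.
        let threshold = if heap.len() == k {
            heap.peek().map(|r| r.0)
        } else {
            None
        };
        if let Some(t) = threshold {
            if group.max_timestamp <= t {
                skipped += 1;
                continue; // nothing in this group can qualify
            }
        }

        for &ts in &group.timestamps {
            if heap.len() < k {
                heap.push(Reverse(ts));
            } else if heap.peek().map_or(false, |r| ts > r.0) {
                // Replace the current smallest value, tightening the filter.
                heap.pop();
                heap.push(Reverse(ts));
            }
        }
    }

    let mut result: Vec<i64> = heap.into_iter().map(|r| r.0).collect();
    result.sort_unstable_by(|a, b| b.cmp(a));
    (result, skipped)
}
```

The threshold only ever increases as better values are found, so later row groups are more likely to be skipped. That is the sense in which the filter becomes "increasingly selective" during execution.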
+ +Dynamic predicates are a common feature of advanced engines such as [Dynamic +Filters in Starburst] and [Top-K Aggregation Optimization at Snowflake]. The +technique drastically improves query performance (we've seen over a 1.5x +improvement for some TPC-H-style queries), especially in combination with late +materialization and columnar file formats such as Parquet. We [plan to write a +blog post] explaining the details of this optimization in the future, and we expect to +use the same mechanism to implement additional optimizations such as [Sideways +Information Passing for joins] (Issue +[#15037](https://github.com/apache/datafusion/issues/15037) PR +[#15770](https://github.com/apache/datafusion/pull/15770) by +[adriangb](https://github.com/adriangb)). + + +[Dynamic Filters in Starburst]: https://docs.starburst.io/latest/admin/dynamic-filtering.html +[Top-K Aggregation Optimization at Snowflake]: https://www.snowflake.com/en/engineering-blog/optimizing-top-k-aggregation-snowflake/ +[plan to write a blog post]: https://github.com/apache/datafusion/issues/15513 +[Sideways Information Passing for joins]: https://github.com/apache/datafusion/issues/7955 + + + +## Community Growth 📈 + +The last few months, between `46.0.0` and `49.0.0`, have seen our community grow: + +1. New PMC members and committers: [berkay], [xudong963] and [timsaucer] joined the PMC. + [blaginin], [milenkovicm], [adriangb] and [kosiew] joined as committers. See the [mailing list] for more details. +2. In the [core DataFusion repo] alone, we reviewed and accepted over 850 PRs from 172 different + committers, created over 669 issues, and closed 379 of them 🚀. All changes are listed in the detailed + [changelogs]. +3. 
DataFusion published a number of blog posts, including [User defined Window Functions], [Optimizing SQL (and DataFrames) + in DataFusion part 1], [part 2], [Using Rust async for Query Execution and Cancelling Long-Running Queries], and + [Embedding User-Defined Indexes in Apache Parquet Files]. + + +<!-- +# Unique committers +$ git shortlog -sn 46.0.0..49.0.0-rc1 .| wc -l + 172 +# commits +$ git log --pretty=oneline 46.0.0..49.0.0-rc1 . | wc -l + 884 + + +https://crates.io/crates/datafusion/49.0.0 +DataFusion 49 released July 25, 2025 + +https://crates.io/crates/datafusion/46.0.0 +DataFusion 46 released March 7, 2025 + +Issues created in this time: 290 open, 379 closed = 669 total +https://github.com/apache/datafusion/issues?q=is%3Aissue+created%3A2025-03-07..2025-07-25 + +Issues closed: 508 +https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2025-03-07..2025-07-25 + +PRs merged in this time 874 +https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2025-03-07..2025-07-25 + +--> + + +[core DataFusion repo]: https://github.com/apache/arrow-datafusion +[changelogs]: https://github.com/apache/datafusion/tree/main/dev/changelog +[mailing list]: https://lists.apache.org/list.html?d...@datafusion.apache.org +[berkay]: https://github.com/berkaysynnada +[xudong963]: https://github.com/xudong963 +[timsaucer]: https://github.com/timsaucer +[blaginin]: https://github.com/blaginin +[milenkovicm]: https://github.com/milenkovicm +[adriangb]: https://github.com/adriangb +[kosiew]: https://github.com/kosiew +[User defined Window Functions]: https://datafusion.apache.org/blog/2025/04/19/user-defined-window-functions +[Optimizing SQL (and DataFrames) in DataFusion part 1]: https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-one +[part 2]: https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-two +[Using Rust async for Query Execution and Cancelling Long-Running Queries]: 
https://datafusion.apache.org/blog/2025/06/30/cancellation +[Embedding User-Defined Indexes in Apache Parquet Files]: https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/ + + +## New Features ✨ + +### Async User-Defined Functions + +It is now possible to write `async` User-Defined Functions +(UDFs) in DataFusion that perform asynchronous +operations, such as network requests or database queries, without blocking the +execution of the query. This enables new use cases, such as +integrating with large language models (LLMs) or other external services, and we can't +wait to see what the community builds with it. + +See the [documentation] for more details and the [async UDF example] for +working code. + +[documentation]: https://datafusion.apache.org/library-user-guide/functions/adding-udfs.html +[async UDF example]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/async_udf.rs + +You could, for example, implement a function `ask_llm` that asks a large language model +(LLM) service a question about the values in a column. + +```sql +SELECT * +FROM animal a +WHERE ask_llm(a.name, 'Is this animal furry?') +``` + +The implementation of an async UDF is almost identical to a normal +UDF, except that it must implement the `AsyncScalarUDFImpl` trait in addition to `ScalarUDFImpl` and +provide an `async` implementation via `invoke_async_with_args`: + +```rust +#[derive(Debug)] +struct AskLLM { + signature: Signature, +} + +#[async_trait] +impl AsyncScalarUDFImpl for AskLLM { + /// The `invoke_async_with_args` method is similar to `invoke_with_args`, + /// but it returns a `Future` that resolves to the result. + /// + /// Since this signature is `async`, it can do any `async` operations, such + /// as network requests. + async fn invoke_async_with_args( + &self, + args: ScalarFunctionArgs, + options: &ConfigOptions, + ) -> Result<ArrayRef> { + // Converts the arguments to arrays for simplicity.
+ let args = ColumnarValue::values_to_arrays(&args.args)?; + let [column_of_interest, question] = take_function_args(self.name(), args)?; + let client = Client::new(); + + // Build the request body (`build_llm_request` is a hypothetical helper) + let req = build_llm_request(&column_of_interest, &question); + + // Make a network request to a hypothetical LLM service + let res = client + .post(URI) + .headers(get_llm_headers(options)) + .json(&req) + .send() + .await? + .json::<LLMResponse>() + .await?; + + let results = extract_results_from_llm_response(&res); + + Ok(Arc::new(results)) + } +} +``` + +(Issue [#6518](https://github.com/apache/datafusion/issues/6518), +[PR #14837](https://github.com/apache/datafusion/pull/14837) from +[goldmedal](https://github.com/goldmedal) 🏆) + + +### Better Cancellation for Certain Long-Running Queries + +In rare cases, it was previously not possible to cancel long-running queries, +leading to unresponsiveness. Other projects would likely have fixed this issue +by treating the symptom, but [pepijnve] and the DataFusion community worked together to +treat the root cause. The general solution required a deep understanding of the +DataFusion execution engine, Rust `Streams`, and the tokio cooperative +scheduling model. The [resulting PR] is a model of careful +community engineering and a great example of using Rust's `async` ecosystem +to implement complex functionality. It even resulted in a [contribution upstream to tokio] +(since accepted). See the [blog post] for more details. + +[resulting PR]: https://github.com/apache/datafusion/pull/16398 +[blog post]: https://datafusion.apache.org/blog/2025/06/30/cancellation +[contribution upstream to tokio]: https://github.com/tokio-rs/tokio/pull/7405 +[pepijnve]: https://github.com/pepijnve + +### Metadata for User Defined Types such as `Variant` and `Geometry` + +User-defined types have been [a long-requested feature], and this release provides +the low-level APIs to support them efficiently. + +1.
Metadata handling in PRs [#15646](https://github.com/apache/datafusion/pull/15646) and [#16170](https://github.com/apache/datafusion/pull/16170) from [timsaucer] +2. Pushdown of filters and expressions (see "Dynamic Filters and TopK pushdown" section above) + +[a long-requested feature]: https://github.com/apache/datafusion/issues/12644 +[timsaucer]: https://github.com/timsaucer Review Comment: fyi @timsaucer ########## content/blog/2025-07-28-datafusion-49.0.0.md: ########## @@ -0,0 +1,424 @@ +--- +layout: post +title: Apache DataFusion 49.0.0 Released +date: 2025-07-28 +author: pmc +categories: [release] +--- + +<!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %} +--> + +<!-- see https://github.com/apache/datafusion/issues/16347 for details --> + +## Introduction + +We are proud to announce the release of [DataFusion 49.0.0]. This blog post highlights some of +the major improvements since the release of [DataFusion 48.0.0]. The complete list of changes is available in the [changelog]. 
+ +[DataFusion 49.0.0]: https://crates.io/crates/datafusion/49.0.0 +[DataFusion 48.0.0]: https://datafusion.apache.org/blog/2025/07/18/datafusion-48.0.0/ +[changelog]: https://github.com/apache/datafusion/blob/branch-49/dev/changelog/49.0.0.md + + +## Performance Improvements 🚀 + +DataFusion continues to focus on enhancing performance, as shown in the ClickBench and other results. + +<img + src="/blog/images/datafusion-49.0.0/performance_over_time_clickbench.png" + width="100%" + class="img-responsive" + alt="ClickBench performance results over time for DataFusion" +/> + +**Figure 1**: ClickBench performance improvements over time +Average and median normalized query execution times for ClickBench queries for each git revision. +Query times are normalized using the ClickBench definition. Data and definitions on the +[DataFusion Benchmarking Page](https://alamb.github.io/datafusion-benchmarking/). + +<!-- +NOTE: Andrew is working on gathering these numbers + +<img +src="/blog/images/datafusion-49.0.0/performance_over_time_planning.png" +width="80%" +class="img-responsive" +alt="Planning benchmark performance results over time for DataFusion" +/> + +**Figure 2**: Planning benchmark performance improved XXX between DataFusion 48.0.1 and DataFusion 49.0.0. Chart source: TODO +--> + +Here are some noteworthy optimizations added since DataFusion 48: + +**Equivalence system upgrade:** The lower levels of the equivalence system, which is used to implement the + optimizations described in [Using Ordering for Better Plans], were rewritten, leading to + much faster planning times, especially for queries with a [large number of columns](https://github.com/apache/datafusion/pull/16217#pullrequestreview-2891941229). This change also prepares + the way for more sophisticated sort-based optimizations in the future. (PR [#16217](https://github.com/apache/datafusion/pull/16217) by [ozankabak](https://github.com/ozankabak)). 
+ +[Using Ordering for Better Plans]: https://datafusion.apache.org/blog/2025/03/11/ordering-analysis + +**Dynamic Filters and TopK pushdown** + +DataFusion now supports dynamic filters, which are improved during query execution, +and physical filter pushdown. Together, these features improve the performance of +queries that use `LIMIT` and `ORDER BY` clauses, such as the following: + +```sql +SELECT * +FROM data +ORDER BY timestamp DESC +LIMIT 10 +``` + +While the query above is simple, without dynamic filtering or knowing that the data +is already sorted by `timestamp`, a query engine must decode *all* of the data to +find the top 10 values. With the dynamic filters system, DataFusion applies an +increasingly selective filter during query execution. It checks the **current** +top 10 values of the `timestamp` column **before** opening files or reading +Parquet Row Groups and Data Pages, which can skip older data very quickly. + +Dynamic predicates are a common feature of advanced engines such as [Dynamic +Filters in Starburst] and [Top-K Aggregation Optimization at Snowflake]. The +technique drastically improves query performance (we've seen over a 1.5x +improvement for some TPC-H-style queries), especially in combination with late +materialization and columnar file formats such as Parquet. We [plan to write a +blog post] explaining the details of this optimization in the future, and we expect to +use the same mechanism to implement additional optimizations such as [Sideways +Information Passing for joins] (Issue +[#15037](https://github.com/apache/datafusion/issues/15037) PR +[#15770](https://github.com/apache/datafusion/pull/15770) by +[adriangb](https://github.com/adriangb)). 
+ + +[Dynamic Filters in Starburst]: https://docs.starburst.io/latest/admin/dynamic-filtering.html +[Top-K Aggregation Optimization at Snowflake]: https://www.snowflake.com/en/engineering-blog/optimizing-top-k-aggregation-snowflake/ +[plan to write a blog post]: https://github.com/apache/datafusion/issues/15513 +[Sideways Information Passing for joins]: https://github.com/apache/datafusion/issues/7955 + + + +## Community Growth 📈 + +The last few months, between `46.0.0` and `49.0.0`, have seen our community grow: + +1. New PMC members and committers: [berkay], [xudong963] and [timsaucer] joined the PMC. + [blaginin], [milenkovicm], [adriangb] and [kosiew] joined as committers. See the [mailing list] for more details. +2. In the [core DataFusion repo] alone, we reviewed and accepted over 850 PRs from 172 different + committers, created over 669 issues, and closed 379 of them 🚀. All changes are listed in the detailed + [changelogs]. +3. DataFusion published a number of blog posts, including [User defined Window Functions], [Optimizing SQL (and DataFrames) + in DataFusion part 1], [part 2], [Using Rust async for Query Execution and Cancelling Long-Running Queries], and + [Embedding User-Defined Indexes in Apache Parquet Files]. + + +<!-- +# Unique committers +$ git shortlog -sn 46.0.0..49.0.0-rc1 .| wc -l + 172 +# commits +$ git log --pretty=oneline 46.0.0..49.0.0-rc1 . 
| wc -l + 884 + + +https://crates.io/crates/datafusion/49.0.0 +DataFusion 49 released July 25, 2025 + +https://crates.io/crates/datafusion/46.0.0 +DataFusion 46 released March 7, 2025 + +Issues created in this time: 290 open, 379 closed = 669 total +https://github.com/apache/datafusion/issues?q=is%3Aissue+created%3A2025-03-07..2025-07-25 + +Issues closed: 508 +https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2025-03-07..2025-07-25 + +PRs merged in this time 874 +https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2025-03-07..2025-07-25 + +--> + + +[core DataFusion repo]: https://github.com/apache/arrow-datafusion +[changelogs]: https://github.com/apache/datafusion/tree/main/dev/changelog +[mailing list]: https://lists.apache.org/list.html?d...@datafusion.apache.org +[berkay]: https://github.com/berkaysynnada +[xudong963]: https://github.com/xudong963 +[timsaucer]: https://github.com/timsaucer +[blaginin]: https://github.com/blaginin +[milenkovicm]: https://github.com/milenkovicm +[adriangb]: https://github.com/adriangb +[kosiew]: https://github.com/kosiew +[User defined Window Functions]: https://datafusion.apache.org/blog/2025/04/19/user-defined-window-functions +[Optimizing SQL (and DataFrames) in DataFusion part 1]: https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-one +[part 2]: https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-two +[Using Rust async for Query Execution and Cancelling Long-Running Queries]: https://datafusion.apache.org/blog/2025/06/30/cancellation +[Embedding User-Defined Indexes in Apache Parquet Files]: https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/ + + +## New Features ✨ + +### Async User-Defined Functions + +It is now possible to write `async` User-Defined Functions +(UDFs) in DataFusion that perform asynchronous +operations, such as network requests or database queries, without blocking the +execution of the query. 
This enables new use cases, such as +integrating with large language models (LLMs) or other external services, and we can't +wait to see what the community builds with it. + +See the [documentation] for more details and the [async UDF example] for +working code. + +[documentation]: https://datafusion.apache.org/library-user-guide/functions/adding-udfs.html +[async UDF example]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/async_udf.rs + +You could, for example, implement a function `ask_llm` that asks a large language model +(LLM) service a question about the values in a column. + +```sql +SELECT * +FROM animal a +WHERE ask_llm(a.name, 'Is this animal furry?') +``` + +The implementation of an async UDF is almost identical to a normal +UDF, except that it must implement the `AsyncScalarUDFImpl` trait in addition to `ScalarUDFImpl` and +provide an `async` implementation via `invoke_async_with_args`: + +```rust +#[derive(Debug)] +struct AskLLM { + signature: Signature, +} + +#[async_trait] +impl AsyncScalarUDFImpl for AskLLM { + /// The `invoke_async_with_args` method is similar to `invoke_with_args`, + /// but it returns a `Future` that resolves to the result. + /// + /// Since this signature is `async`, it can do any `async` operations, such + /// as network requests. + async fn invoke_async_with_args( + &self, + args: ScalarFunctionArgs, + options: &ConfigOptions, + ) -> Result<ArrayRef> { + // Converts the arguments to arrays for simplicity. + let args = ColumnarValue::values_to_arrays(&args.args)?; + let [column_of_interest, question] = take_function_args(self.name(), args)?; + let client = Client::new(); + + // Build the request body (`build_llm_request` is a hypothetical helper) + let req = build_llm_request(&column_of_interest, &question); + + // Make a network request to a hypothetical LLM service + let res = client + .post(URI) + .headers(get_llm_headers(options)) + .json(&req) + .send() + .await?
+            .json::<LLMResponse>()
+            .await?;
+
+        let results = extract_results_from_llm_response(&res);
+
+        Ok(Arc::new(results))
+    }
+}
+```
+
+(Issue [#6518](https://github.com/apache/datafusion/issues/6518),
+[PR #14837](https://github.com/apache/datafusion/pull/14837) from
+[goldmedal](https://github.com/goldmedal) 🏆)
+
+
+### Better Cancellation for Certain Long-Running Queries
+
+In rare cases, it was previously not possible to cancel long-running queries,
+leading to unresponsiveness. Other projects would likely have fixed this issue
+by treating the symptom, but [pepijnve] and the DataFusion community worked together to
+treat the root cause. The general solution required a deep understanding of the
+DataFusion execution engine, Rust `Streams`, and the tokio cooperative
+scheduling model. The [resulting PR] is a model of careful
+community engineering and a great example of using Rust's `async` ecosystem
+to implement complex functionality. It even resulted in a [contribution upstream to tokio]
+(since accepted). See the [blog post] for more details.
+
+[resulting PR]: https://github.com/apache/datafusion/pull/16398
+[blog post]: https://datafusion.apache.org/blog/2025/06/30/cancellation
+[contribution upstream to tokio]: https://github.com/tokio-rs/tokio/pull/7405
+[pepijnve]: https://github.com/pepijnve
+
+### Metadata for User Defined Types such as `Variant` and `Geometry`
+
+User-defined types have been [a long-requested feature], and this release provides
+the low-level APIs to support them efficiently.
+
+1. Metadata handling in PRs [#15646](https://github.com/apache/datafusion/pull/15646) and [#16170](https://github.com/apache/datafusion/pull/16170) from [timsaucer]
+2. Pushdown of filters and expressions (see "Dynamic Filters and TopK pushdown" section above)
+
+[a long-requested feature]: https://github.com/apache/datafusion/issues/12644
+[timsaucer]: https://github.com/timsaucer
+
+We still have some work to do to fully support user-defined types, specifically
+in documentation and testing, and we would
+love your help in this area. If you are interested in contributing,
+please see [issue #12644](https://github.com/apache/datafusion/issues/12644).
+
+### Parquet Modular Encryption
+
+DataFusion now supports reading and writing encrypted [Apache Parquet] files with [modular
+encryption]. This allows users to encrypt specific columns in a Parquet file
+using different keys, while still being able to read data without needing to
+decrypt the entire file.
+
+[Apache Parquet]: https://parquet.apache.org/
+[modular encryption]: https://parquet.apache.org/docs/file-format/data-pages/encryption/
+
+Here is an example of how to configure DataFusion to read an encrypted Parquet
+table with two columns, `double_field` and `float_field`, using modular
+encryption:
+
+```sql
+CREATE EXTERNAL TABLE encrypted_parquet_table
+(
+double_field double,
+float_field float
+)
+STORED AS PARQUET LOCATION 'pq/' OPTIONS (
+    -- encryption
+    'format.crypto.file_encryption.encrypt_footer' 'true',
+    'format.crypto.file_encryption.footer_key_as_hex' '30313233343536373839303132333435', -- b"0123456789012345"
+    'format.crypto.file_encryption.column_key_as_hex::double_field' '31323334353637383930313233343530', -- b"1234567890123450"
+    'format.crypto.file_encryption.column_key_as_hex::float_field' '31323334353637383930313233343531', -- b"1234567890123451"
+    -- decryption
+    'format.crypto.file_decryption.footer_key_as_hex' '30313233343536373839303132333435', -- b"0123456789012345"
+    'format.crypto.file_decryption.column_key_as_hex::double_field' '31323334353637383930313233343530', -- b"1234567890123450"
+    'format.crypto.file_decryption.column_key_as_hex::float_field' '31323334353637383930313233343531', -- b"1234567890123451"
+);
+```
+
+([Issue #15216](https://github.com/apache/datafusion/issues/15216),
+[PR #16351](https://github.com/apache/datafusion/pull/16351)
+from [corwinjoy](https://github.com/corwinjoy) and [adamreeve](https://github.com/adamreeve))
+
+
+### Support for `WITHIN GROUP` for Ordered-Set Aggregate Functions
+
+DataFusion now supports the `WITHIN GROUP` clause for [ordered-set aggregate
+functions] such as `approx_percentile_cont`, `percentile_cont`, and
+`percentile_disc`, which allows users to specify the precise order.
+
+For example, the following query computes the 50th percentile (median) of the `temperature` column
+in the `city_data` table. The `ORDER BY` inside `WITHIN GROUP` names the column whose values are ranked:
+
+```sql
+SELECT
+    percentile_disc(0.5) WITHIN GROUP (ORDER BY temperature) AS median_temperature
+FROM city_data;
+```
+
+[ordered-set aggregate functions]: https://www.postgresql.org/docs/9.4/functions-aggregate.html#FUNCTIONS-ORDEREDSET-TABLE
+
+(Issue [#11732](https://github.com/apache/datafusion/issues/11732),
+PR [#13511](https://github.com/apache/datafusion/pull/13511),
+by [Garamda](https://github.com/Garamda))
+
+### Compressed Spill Files
+
+DataFusion now supports compressing the files written to disk when spilling
+larger-than-memory datasets while sorting and grouping. Using compression
+can significantly reduce the
+size of the intermediate files and improve performance when reading them back into memory.
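A minimal sketch of how spill compression might be switched on from SQL. It assumes the codec is controlled by the `datafusion.execution.spill_compression` setting, with `uncompressed`, `lz4_frame`, and `zstd` as accepted values; that name comes from the DataFusion configuration documentation rather than from this post, and the `events` table is hypothetical.

```sql
-- Assumed option name and values; verify against the configuration docs for your version.
SET datafusion.execution.spill_compression = 'zstd';

-- Hypothetical table. A large sort like this may spill to disk once the
-- memory limit is reached; with the setting above, spill files would be compressed.
SELECT *
FROM events
ORDER BY event_time;
```

As a rule of thumb, `lz4_frame` is likely the cheaper codec in CPU terms while `zstd` tends to produce smaller files, so the better choice depends on whether spilling is CPU-bound or I/O-bound.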
+
+(Issue [#16130](https://github.com/apache/datafusion/issues/16130),
+PR [#16268](https://github.com/apache/datafusion/pull/16268)
+by [ding-young](https://github.com/ding-young))

Review Comment:
   FYI @ding-young

##########
content/blog/2025-07-28-datafusion-49.0.0.md:
##########
@@ -0,0 +1,424 @@
+### Better Cancellation for Certain Long-Running Queries
+
+In rare cases, it was previously not possible to cancel long-running queries,
+leading to unresponsiveness. Other projects would likely have fixed this issue
+by treating the symptom, but [pepijnve] and the DataFusion community worked together to
+treat the root cause. The general solution required a deep understanding of the
+DataFusion execution engine, Rust `Streams`, and the tokio cooperative
+scheduling model. The [resulting PR] is a model of careful
+community engineering and a great example of using Rust's `async` ecosystem
+to implement complex functionality. It even resulted in a [contribution upstream to tokio]
+(since accepted). See the [blog post] for more details.
+
+[resulting PR]: https://github.com/apache/datafusion/pull/16398
+[blog post]: https://datafusion.apache.org/blog/2025/06/30/cancellation
+[contribution upstream to tokio]: https://github.com/tokio-rs/tokio/pull/7405

Review Comment:
   FYI @pepijnve

##########
content/blog/2025-07-28-datafusion-49.0.0.md:
##########
@@ -0,0 +1,424 @@
+### Parquet Modular Encryption
+
+DataFusion now supports reading and writing encrypted [Apache Parquet] files with [modular
+encryption]. This allows users to encrypt specific columns in a Parquet file
+using different keys, while still being able to read data without needing to
+decrypt the entire file.
+
+[Apache Parquet]: https://parquet.apache.org/
+[modular encryption]: https://parquet.apache.org/docs/file-format/data-pages/encryption/
+
+Here is an example of how to configure DataFusion to read an encrypted Parquet
+table with two columns, `double_field` and `float_field`, using modular
+encryption:
+
+```sql
+CREATE EXTERNAL TABLE encrypted_parquet_table
+(
+double_field double,
+float_field float
+)
+STORED AS PARQUET LOCATION 'pq/' OPTIONS (
+    -- encryption
+    'format.crypto.file_encryption.encrypt_footer' 'true',
+    'format.crypto.file_encryption.footer_key_as_hex' '30313233343536373839303132333435', -- b"0123456789012345"
+    'format.crypto.file_encryption.column_key_as_hex::double_field' '31323334353637383930313233343530', -- b"1234567890123450"
+    'format.crypto.file_encryption.column_key_as_hex::float_field' '31323334353637383930313233343531', -- b"1234567890123451"
+    -- decryption
+    'format.crypto.file_decryption.footer_key_as_hex' '30313233343536373839303132333435', -- b"0123456789012345"
+    'format.crypto.file_decryption.column_key_as_hex::double_field' '31323334353637383930313233343530', -- b"1234567890123450"
+    'format.crypto.file_decryption.column_key_as_hex::float_field' '31323334353637383930313233343531', -- b"1234567890123451"
+);
+```
+
+([Issue #15216](https://github.com/apache/datafusion/issues/15216),

Review Comment:
   FYI @corwinjoy and @adamreeve

##########
content/blog/2025-07-28-datafusion-49.0.0.md:
##########
@@ -0,0 +1,424 @@
+---
+layout: post
+title: Apache DataFusion 49.0.0 Released
+date: 2025-07-28
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<!-- see https://github.com/apache/datafusion/issues/16347 for details -->
+
+## Introduction
+
+We are proud to announce the release of [DataFusion 49.0.0]. This blog post highlights some of
+the major improvements since the release of [DataFusion 48.0.0]. The complete list of changes is available in the [changelog].
+
+[DataFusion 49.0.0]: https://crates.io/crates/datafusion/49.0.0
+[DataFusion 48.0.0]: https://datafusion.apache.org/blog/2025/07/18/datafusion-48.0.0/
+[changelog]: https://github.com/apache/datafusion/blob/branch-49/dev/changelog/49.0.0.md
+
+
+## Performance Improvements 🚀
+
+DataFusion continues to focus on enhancing performance, as shown in the ClickBench and other results.
+ +<img + src="/blog/images/datafusion-49.0.0/performance_over_time_clickbench.png" + width="100%" + class="img-responsive" + alt="ClickBench performance results over time for DataFusion" +/> + +**Figure 1**: ClickBench performance improvements over time +Average and median normalized query execution times for ClickBench queries for each git revision. +Query times are normalized using the ClickBench definition. Data and definitions on the +[DataFusion Benchmarking Page](https://alamb.github.io/datafusion-benchmarking/). + +<!-- +NOTE: Andrew is working on gathering these numbers + +<img +src="/blog/images/datafusion-49.0.0/performance_over_time_planning.png" +width="80%" +class="img-responsive" +alt="Planning benchmark performance results over time for DataFusion" +/> + +**Figure 2**: Planning benchmark performance improved XXX between DataFusion 48.0.1 and DataFusion 49.0.0. Chart source: TODO +--> + +Here are some noteworthy optimizations added since DataFusion 48: + +**Equivalence system upgrade:** The lower levels of the equivalence system, which is used to implement the + optimizations described in [Using Ordering for Better Plans], were rewritten, leading to + much faster planning times, especially for queries with a [large number of columns](https://github.com/apache/datafusion/pull/16217#pullrequestreview-2891941229). This change also prepares + the way for more sophisticated sort-based optimizations in the future. (PR [#16217](https://github.com/apache/datafusion/pull/16217) by [ozankabak](https://github.com/ozankabak)). + +[Using Ordering for Better Plans]: https://datafusion.apache.org/blog/2025/03/11/ordering-analysis + +**Dynamic Filters and TopK pushdown** + +DataFusion now supports dynamic filters, which are improved during query execution, +and physical filter pushdown. 
Together, these features improve the performance of +queries that use `LIMIT` and `ORDER BY` clauses, such as the following: + +```sql +SELECT * +FROM data +ORDER BY timestamp DESC +LIMIT 10 +``` + +While the query above is simple, without dynamic filtering or knowing that the data +is already sorted by `timestamp`, a query engine must decode *all* of the data to +find the top 10 values. With the dynamic filters system, DataFusion applies an +increasingly selective filter during query execution. It checks the **current** +top 10 values of the `timestamp` column **before** opening files or reading +Parquet Row Groups and Data Pages, which can skip older data very quickly. + +Dynamic predicates are a common feature of advanced engines such as [Dynamic +Filters in Starburst] and [Top-K Aggregation Optimization at Snowflake]. The +technique drastically improves query performance (we've seen over a 1.5x +improvement for some TPC-H-style queries), especially in combination with late +materialization and columnar file formats such as Parquet. We [plan to write a +blog post] explaining the details of this optimization in the future, and we expect to +use the same mechanism to implement additional optimizations such as [Sideways +Information Passing for joins] (Issue +[#15037](https://github.com/apache/datafusion/issues/15037) PR +[#15770](https://github.com/apache/datafusion/pull/15770) by +[adriangb](https://github.com/adriangb)). + + +[Dynamic Filters in Starburst]: https://docs.starburst.io/latest/admin/dynamic-filtering.html +[Top-K Aggregation Optimization at Snowflake]: https://www.snowflake.com/en/engineering-blog/optimizing-top-k-aggregation-snowflake/ +[plan to write a blog post]: https://github.com/apache/datafusion/issues/15513 +[Sideways Information Passing for joins]: https://github.com/apache/datafusion/issues/7955 + + + +## Community Growth 📈 + +The last few months, between `46.0.0` and `49.0.0`, have seen our community grow: + +1. 
New PMC members and committers: [berkay], [xudong963] and [timsaucer] joined the PMC. + [blaginin], [milenkovicm], [adriangb] and [kosiew] joined as committers. See the [mailing list] for more details. +2. In the [core DataFusion repo] alone, we reviewed and accepted over 850 PRs from 172 different + committers, created over 669 issues, and closed 379 of them 🚀. All changes are listed in the detailed + [changelogs]. +3. DataFusion published a number of blog posts, including [User defined Window Functions], [Optimizing SQL (and DataFrames) + in DataFusion part 1], [part 2], [Using Rust async for Query Execution and Cancelling Long-Running Queries], and + [Embedding User-Defined Indexes in Apache Parquet Files]. + + +<!-- +# Unique committers +$ git shortlog -sn 46.0.0..49.0.0-rc1 .| wc -l + 172 +# commits +$ git log --pretty=oneline 46.0.0..49.0.0-rc1 . | wc -l + 884 + + +https://crates.io/crates/datafusion/49.0.0 +DataFusion 49 released July 25, 2025 + +https://crates.io/crates/datafusion/46.0.0 +DataFusion 46 released March 7, 2025 + +Issues created in this time: 290 open, 379 closed = 669 total +https://github.com/apache/datafusion/issues?q=is%3Aissue+created%3A2025-03-07..2025-07-25 + +Issues closed: 508 +https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2025-03-07..2025-07-25 + +PRs merged in this time 874 +https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2025-03-07..2025-07-25 + +--> + + +[core DataFusion repo]: https://github.com/apache/arrow-datafusion +[changelogs]: https://github.com/apache/datafusion/tree/main/dev/changelog +[mailing list]: https://lists.apache.org/list.html?d...@datafusion.apache.org +[berkay]: https://github.com/berkaysynnada +[xudong963]: https://github.com/xudong963 +[timsaucer]: https://github.com/timsaucer +[blaginin]: https://github.com/blaginin +[milenkovicm]: https://github.com/milenkovicm +[adriangb]: https://github.com/adriangb +[kosiew]: https://github.com/kosiew +[User defined Window 
Functions]: https://datafusion.apache.org/blog/2025/04/19/user-defined-window-functions +[Optimizing SQL (and DataFrames) in DataFusion part 1]: https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-one +[part 2]: https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-two +[Using Rust async for Query Execution and Cancelling Long-Running Queries]: https://datafusion.apache.org/blog/2025/06/30/cancellation +[Embedding User-Defined Indexes in Apache Parquet Files]: https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/ + + +## New Features ✨ + +### Async User-Defined Functions + +It is now possible to write `async` User-Defined Functions +(UDFs) in DataFusion that perform asynchronous +operations, such as network requests or database queries, without blocking the +execution of the query. This enables new use cases, such as +integrating with large language models (LLMs) or other external services, and we can't +wait to see what the community builds with it. + +See the [documentation] for more details and the [async UDF example] for +working code. + +[documentation]: https://datafusion.apache.org/library-user-guide/functions/adding-udfs.html +[async UDF example]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/async_udf.rs + +You could, for example, implement a function `ask_llm` that asks a large language model +(LLM) service a question based on the content of two columns. 
+ +```sql +SELECT * +FROM animal a +WHERE ask_llm(a.name, 'Is this animal furry?') +``` + +The implementation of an async UDF is almost identical to a normal +UDF, except that it must implement the `AsyncScalarUDFImpl` trait in addition to `ScalarUDFImpl` and +provide an `async` implementation via `invoke_async_with_args`: + +```rust +#[derive(Debug)] +struct AskLLM { + signature: Signature, +} + +#[async_trait] +impl AsyncScalarUDFImpl for AskLLM { + /// The `invoke_async_with_args` method is similar to `invoke_with_args`, + /// but it returns a `Future` that resolves to the result. + /// + /// Since this signature is `async`, it can do any `async` operations, such + /// as network requests. + async fn invoke_async_with_args( + &self, + args: ScalarFunctionArgs, + options: &ConfigOptions, + ) -> Result<ArrayRef> { + // Converts the arguments to arrays for simplicity. + let args = ColumnarValue::values_to_arrays(&args.args)?; + let [column_of_interest, question] = take_function_args(self.name(), args)?; + let client = Client::new(); + + // Make a network request to a hypothetical LLM service + // (`req`, `URI`, `LLMResponse`, and the helper functions are placeholders) + let res = client + .post(URI) + .headers(get_llm_headers(options)) + .json(&req) + .send() + .await? + .json::<LLMResponse>() + .await?; + + let results = extract_results_from_llm_response(&res); + + Ok(Arc::new(results)) + } +} +``` + +(Issue [#6518](https://github.com/apache/datafusion/issues/6518), +[PR #14837](https://github.com/apache/datafusion/pull/14837) from +[goldmedal](https://github.com/goldmedal) 🏆) + + +### Better Cancellation for Certain Long-Running Queries + +In rare cases, it was previously not possible to cancel long-running queries, +leading to unresponsiveness. Other projects would likely have fixed this issue +by treating the symptom, but [pepijnve] and the DataFusion community worked together to +treat the root cause. 
The general solution required a deep understanding of the +DataFusion execution engine, Rust `Streams`, and the tokio cooperative +scheduling model. The [resulting PR] is a model of careful +community engineering and a great example of using Rust's `async` ecosystem +to implement complex functionality. It even resulted in a [contribution upstream to tokio] +(since accepted). See the [blog post] for more details. + +[resulting PR]: https://github.com/apache/datafusion/pull/16398 +[blog post]: https://datafusion.apache.org/blog/2025/06/30/cancellation +[contribution upstream to tokio]: https://github.com/tokio-rs/tokio/pull/7405 +[pepijnve]: https://github.com/pepijnve + +### Metadata for User Defined Types such as `Variant` and `Geometry` + +User-defined types have been [a long-requested feature], and this release provides +the low-level APIs to support them efficiently. + +1. Metadata handling in PRs [#15646](https://github.com/apache/datafusion/pull/15646) and [#16170](https://github.com/apache/datafusion/pull/16170) from [timsaucer] +2. Pushdown of filters and expressions (see "Dynamic Filters and TopK pushdown" section above) + +[a long-requested feature]: https://github.com/apache/datafusion/issues/12644 +[timsaucer]: https://github.com/timsaucer + +We still have some work to do to fully support user-defined types, specifically +in documentation and testing, and we would +love your help in this area. If you are interested in contributing, +please see [issue #12644](https://github.com/apache/datafusion/issues/12644). + +### Parquet Modular Encryption + +DataFusion now supports reading and writing encrypted [Apache Parquet] files with [modular +encryption]. This allows users to encrypt specific columns in a Parquet file +using different keys, while still being able to read data without needing to +decrypt the entire file. 
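The encryption and decryption options in this section (for example `format.crypto.file_encryption.footer_key_as_hex`) take each key as a hex string. As a minimal sketch, assuming the key is available as raw bytes, the following standard-library Rust shows one way to produce such a string. The helper name `key_as_hex` is hypothetical and is not part of DataFusion.

```rust
/// Hypothetical helper (not a DataFusion API): hex-encode raw key bytes,
/// two lowercase hex digits per byte, in the form the `*_as_hex` options use.
fn key_as_hex(key: &[u8]) -> String {
    key.iter().map(|byte| format!("{byte:02x}")).collect()
}
```

For instance, `key_as_hex(b"0123456789012345")` yields `30313233343536373839303132333435`, which is the footer key value used in this section's example. Real keys would normally come from a key management system rather than a literal in source code.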
+ +[Apache Parquet]: https://parquet.apache.org/ +[modular encryption]: https://parquet.apache.org/docs/file-format/data-pages/encryption/ + +Here is an example of how to configure DataFusion to read an encrypted Parquet +table with two columns, `double_field` and `float_field`, using modular +encryption: + +```sql +CREATE EXTERNAL TABLE encrypted_parquet_table +( +double_field double, +float_field float +) +STORED AS PARQUET LOCATION 'pq/' OPTIONS ( + -- encryption + 'format.crypto.file_encryption.encrypt_footer' 'true', + 'format.crypto.file_encryption.footer_key_as_hex' '30313233343536373839303132333435', -- b"0123456789012345" + 'format.crypto.file_encryption.column_key_as_hex::double_field' '31323334353637383930313233343530', -- b"1234567890123450" + 'format.crypto.file_encryption.column_key_as_hex::float_field' '31323334353637383930313233343531', -- b"1234567890123451" + -- decryption + 'format.crypto.file_decryption.footer_key_as_hex' '30313233343536373839303132333435', -- b"0123456789012345" + 'format.crypto.file_decryption.column_key_as_hex::double_field' '31323334353637383930313233343530', -- b"1234567890123450" + 'format.crypto.file_decryption.column_key_as_hex::float_field' '31323334353637383930313233343531', -- b"1234567890123451" +); +``` + +([Issue #15216](https://github.com/apache/datafusion/issues/15216), +[PR #16351](https://github.com/apache/datafusion/pull/16351) +from [corwinjoy](https://github.com/corwinjoy) and [adamreeve](https://github.com/adamreeve)) + + +### Support for `WITHIN GROUP` for Ordered-Set Aggregate Functions + +DataFusion now supports the `WITHIN GROUP` clause for [ordered-set aggregate +functions] such as `approx_percentile_cont`, `percentile_cont`, and +`percentile_disc`, which allows users to specify the precise order. 
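To illustrate what an ordered-set aggregate computes, here is a minimal Rust sketch of `percentile_disc` that follows PostgreSQL's definition: the first value, in the `WITHIN GROUP` ordering, whose position equals or exceeds the requested fraction. It is an illustration only and not DataFusion's implementation; the function signature and the `i64` input type are assumptions made for the example.

```rust
/// Illustrative only: discrete percentile over `values`. The sort below plays
/// the role of `WITHIN GROUP (ORDER BY value)`.
fn percentile_disc(fraction: f64, values: &[i64]) -> Option<i64> {
    // Reject empty input and fractions outside [0, 1] (including NaN).
    if values.is_empty() || !(0.0..=1.0).contains(&fraction) {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_unstable();
    // 1-based rank of the first row whose position is >= `fraction` of the rows.
    let rank = (fraction * sorted.len() as f64).ceil() as usize;
    // A fraction of 0 maps to the first row.
    Some(sorted[rank.max(1) - 1])
}
```

Because the result is always one of the input values, the ordering column is also the column being aggregated, which is why the `ORDER BY` expression inside `WITHIN GROUP` determines the result.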
+ +For example, the following query computes the 50th percentile of the `temperature` column +in the `city_data` table: + +```sql +SELECT + percentile_disc(0.5) WITHIN GROUP (ORDER BY temperature) AS median_temperature +FROM city_data; +``` + +[ordered-set aggregate functions]: https://www.postgresql.org/docs/9.4/functions-aggregate.html#FUNCTIONS-ORDEREDSET-TABLE + +(Issue [#11732](https://github.com/apache/datafusion/issues/11732), +PR [#13511](https://github.com/apache/datafusion/pull/13511), +by [Garamda](https://github.com/Garamda)) + +### Compressed Spill Files + +DataFusion now supports compressing the files written to disk when spilling +larger-than-memory datasets while sorting and grouping. Using compression +can significantly reduce the +size of the intermediate files and improve performance when reading them back into memory. + +(Issue [#16130](https://github.com/apache/datafusion/issues/16130), +PR [#16268](https://github.com/apache/datafusion/pull/16268) +by [ding-young](https://github.com/ding-young)) + +### Support for `REGEXP_INSTR` function + +DataFusion now supports the [`REGEXP_INSTR` function], which returns the position of a +regular expression match within a string. 
+ +For example, to find the position of the first match of the regular expression +`C(.)(..)` in the string `ABCDEF`, you can use: + +```sql +> SELECT regexp_instr('ABCDEF', 'C(.)(..)'); ++---------------------------------------------------------------+ +| regexp_instr(Utf8("ABCDEF"),Utf8("C(.)(..)")) | ++---------------------------------------------------------------+ +| 3 | ++---------------------------------------------------------------+ +``` + +[`REGEXP_INSTR` function]: https://datafusion.apache.org/user-guide/sql/scalar_functions.html#regexp-instr +([Issue #13009](https://github.com/apache/datafusion/issues/13009), +[PR #15928](https://github.com/apache/datafusion/pull/15928) +by [nirnayroy](https://github.com/nirnayroy)) Review Comment: FYI @nirnayroy -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.