This is an automated email from the ASF dual-hosted git repository.
berkay pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
The following commit(s) were added to refs/heads/main by this push:
new b353771 Blog post for DataFusion 46.0.0 (#64)
b353771 is described below
commit b353771a5c9643245bb666151db3bb4421d0bf10
Author: oznur-synnada <[email protected]>
AuthorDate: Wed Mar 26 16:13:20 2025 +0300
Blog post for DataFusion 46.0.0 (#64)
* Create 2025-03-24-datafusion-46.0.0.md
* Update 2025-03-24-datafusion-46.0.0.md
* Update content/blog/2025-03-24-datafusion-46.0.0.md
Co-authored-by: Mehmet Ozan Kabak <[email protected]>
* Update content/blog/2025-03-24-datafusion-46.0.0.md
Co-authored-by: Mehmet Ozan Kabak <[email protected]>
* Update content/blog/2025-03-24-datafusion-46.0.0.md
Co-authored-by: Mehmet Ozan Kabak <[email protected]>
* Update content/blog/2025-03-24-datafusion-46.0.0.md
Co-authored-by: Mehmet Ozan Kabak <[email protected]>
* Update content/blog/2025-03-24-datafusion-46.0.0.md
Co-authored-by: Mehmet Ozan Kabak <[email protected]>
* Update content/blog/2025-03-24-datafusion-46.0.0.md
Co-authored-by: Mehmet Ozan Kabak <[email protected]>
* Update content/blog/2025-03-24-datafusion-46.0.0.md
Co-authored-by: Mehmet Ozan Kabak <[email protected]>
* Update content/blog/2025-03-24-datafusion-46.0.0.md
Co-authored-by: Mehmet Ozan Kabak <[email protected]>
* Update content/blog/2025-03-24-datafusion-46.0.0.md
Co-authored-by: Mehmet Ozan Kabak <[email protected]>
* Update content/blog/2025-03-24-datafusion-46.0.0.md
Co-authored-by: Mehmet Ozan Kabak <[email protected]>
* Update 2025-03-24-datafusion-46.0.0.md
* Add upgrade guide link
* Add example image for diagnostics
* Update 2025-03-24-datafusion-46.0.0.md
* Update content/blog/2025-03-24-datafusion-46.0.0.md
Co-authored-by: Oleks V <[email protected]>
---------
Co-authored-by: Mehmet Ozan Kabak <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: berkaysynnada <[email protected]>
Co-authored-by: Berkay Şahin
<[email protected]>
Co-authored-by: Oleks V <[email protected]>
---
content/blog/2025-03-24-datafusion-46.0.0.md | 142 +++++++++++++++++++++
.../datafusion-46.0.0/diagnostic-example.png | Bin 0 -> 140405 bytes
2 files changed, 142 insertions(+)
diff --git a/content/blog/2025-03-24-datafusion-46.0.0.md
b/content/blog/2025-03-24-datafusion-46.0.0.md
new file mode 100644
index 0000000..5e80830
--- /dev/null
+++ b/content/blog/2025-03-24-datafusion-46.0.0.md
@@ -0,0 +1,142 @@
+---
+layout: post
+title: Apache DataFusion 46.0.0 Released
+date: 2025-03-24
+author: Oznur Hanci and Berkay Sahin on behalf of the PMC
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+We’re excited to announce the release of **Apache DataFusion 46.0.0**! This
new version represents a significant milestone for the project, packing in a
wide range of improvements and fixes. You can find the complete details in the
full
[changelog](https://github.com/apache/datafusion/blob/branch-46/dev/changelog/46.0.0.md).
We’ll highlight the most important changes below and guide you through
upgrading.
+
+## Breaking Changes
+
+DataFusion 46.0.0 brings a few **breaking changes** that may require
adjustments to your code as described in the [Upgrade
Guide](https://datafusion.apache.org/library-user-guide/upgrading.html). Here
are the most notable ones:
+
+- **[Unified `DataSourceExec` Execution
Plan](https://github.com/apache/datafusion/pull/14224):** DataFusion 46.0.0
introduces a major refactor of scan operators. The separate
file-format-specific execution plan nodes (`ParquetExec`, `CsvExec`,
`JsonExec`, `AvroExec`, etc.) have been **deprecated and merged into a single
`DataSourceExec` plan**. Format-specific logic is now encapsulated in new
`DataSource` and `FileSource` traits. This change simplifies the execution
model, but if you h [...]
+- **[Error Handling
Improvements](https://github.com/apache/arrow-datafusion/issues/7360)
(`DataFusionError::Collection`):** We began overhauling DataFusion’s approach
to error handling. In this release, a new error variant
`DataFusionError::Collection` (and related mechanisms) has been introduced to
aggregate multiple errors into one. This is part of a broader effort to provide
richer error context and reduce internal panics. As a result, some error types
or messages have chan [...]
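The idea of aggregating several errors into one can be sketched in a few lines. This is a minimal Python illustration, not DataFusion's Rust API: `CollectionError` and `validate_columns` are hypothetical names invented here, and the sketch only shows the general pattern of checking everything first and reporting all failures together.

```python
class CollectionError(Exception):
    """Hypothetical stand-in for an error that wraps several underlying errors."""

    def __init__(self, errors):
        self.errors = list(errors)
        super().__init__("; ".join(str(e) for e in self.errors))


def validate_columns(requested, available):
    """Check every requested column and report all unknown ones at once."""
    errors = [
        ValueError(f"unknown column: {name}")
        for name in requested
        if name not in available
    ]
    if errors:
        raise CollectionError(errors)


try:
    validate_columns(["a", "x", "y"], {"a", "b"})
except CollectionError as err:
    # Both problems are reported together instead of stopping at the first.
    print(len(err.errors))  # 2
```

Collecting errors this way lets a caller fix every problem in one pass rather than discovering them one at a time.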
+
+## Performance Improvements
+
+DataFusion 46.0.0 comes with a slew of performance enhancements across the
board. Here are some of the noteworthy optimizations in this release:
+
+- **Faster `median()` (no grouping):** The `median()` aggregate function got a
special fast path when used without a `GROUP BY`. By optimizing its
accumulator, median calculation is about **2× faster** in the single-group
case. If you use `MEDIAN()` on large datasets (especially as a single value),
you should notice reduced query times (PR
[#14399](https://github.com/apache/datafusion/pull/14399) by
[@2010YOUY01](https://github.com/2010YOUY01)).
+- **Optimized `FIRST_VALUE`/`LAST_VALUE`:** The `FIRST_VALUE` and `LAST_VALUE`
window functions have been improved by avoiding an internal sort of rows.
Instead of sorting each partition, the implementation now uses a direct
approach to pick the first/last element. This yields **10–100% performance
improvement** for these functions, depending on the scenario. Queries using
`FIRST_VALUE(...) OVER (PARTITION BY ... ORDER BY ...)` will run faster,
especially when partitions are large (PR [# [...]
+- **`repeat()` String Function Boost:** Repeating strings is now more
efficient – the `repeat(text, n)` function was optimized by about **50%**. This
was achieved by reducing allocations and using a more efficient concatenation
strategy. If you generate large repeated strings in queries, this can cut the
time nearly in half (PR
[#14697](https://github.com/apache/datafusion/pull/14697) by
[@zjregee](https://github.com/zjregee)).
+- **Ultra-fast `uuid()` UDF:** The `uuid()` function (which generates random
UUID strings) received a major speed-up. It’s now roughly **40× faster** than
before! The new implementation avoids unnecessary string copying and uses a
more direct conversion to hex, making bulk UUID generation far more practical
(PR [#14675](https://github.com/apache/datafusion/pull/14675) by
[@simonvandel](https://github.com/simonvandel)).
+- **Accelerated `chr()` and `to_hex()`:** Several scalar functions have been
micro-optimized. The `chr()` function (which returns the character for a given
ASCII code) is about **4× faster** now, and the `to_hex()` function (which
converts numbers to a hex string) is roughly **2× faster**. These improvements
may be most noticeable in tight loops or when these functions are applied to
large arrays of values (PR
[#14700](https://github.com/apache/datafusion/pull/14700) for `chr`,
[#14686](ht [...]
+- **No More RowConverter in Grouped Ordering:** We removed an inefficient step
in the *partial grouping* algorithm. The `GroupOrderingPartial` operator no
longer converts data to “row format” for each batch (via `RowConverter`).
Instead, it uses a direct arrow-based approach to detect sort key changes. This
eliminated overhead and yields a nice speedup for certain aggregation queries.
(PR [#14566](https://github.com/apache/datafusion/pull/14566) by
[@ctsk](https://github.com/ctsk)).
+- **Predicate Pruning for `NOT LIKE`:** DataFusion’s parquet reader can now
prune row groups using `NOT LIKE` filters, similar to how it handles `LIKE`.
This means if you have a filter such as `column NOT LIKE 'prefix%'`, DataFusion
can use min/max statistics to skip reading files/parts that can be determined
to either entirely match or not match the predicate. In particular, a pattern
like `NOT LIKE 'X%'` can skip data ranges that definitely start with "X". While
a niche case, it contri [...]
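The reasoning behind `NOT LIKE 'X%'` pruning can be sketched as follows. This is an illustrative Python model rather than DataFusion's implementation: the function name and the sample statistics are hypothetical, and it assumes the min/max statistics are exact (not truncated) string values.

```python
def can_skip_not_like_prefix(min_value: str, max_value: str, prefix: str) -> bool:
    """Return True if a row group can be skipped for `col NOT LIKE 'prefix%'`.

    Strings sharing a prefix form one contiguous range in lexicographic order,
    so if both the minimum and the maximum start with the prefix, every value
    in between does too, and no row can satisfy the NOT LIKE filter.
    """
    return min_value.startswith(prefix) and max_value.startswith(prefix)


# Hypothetical (min, max) statistics for three row groups.
row_groups = [("Xa", "Xz"), ("Wa", "Xb"), ("Ya", "Zz")]
skipped = [can_skip_not_like_prefix(lo, hi, "X") for lo, hi in row_groups]
print(skipped)  # [True, False, False]
```

Only the first row group is skipped: the second may contain values that start with "W", and the third contains no matching prefix at all, so its rows all pass the filter and must be read.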
+
+## Google Summer of Code 2025
+
+Another exciting development: **Apache DataFusion has been accepted as a
mentoring organization for Google Summer of Code (GSoC) 2025**! 🎉 This means
that this summer, students from around the world will have the opportunity to
contribute to DataFusion under the guidance of our committers. We have put
together [a list of project
ideas](https://datafusion.apache.org/contributor-guide/gsoc_project_ideas.html)
that candidates can choose from.
+
+If you’re interested, check out our [GSoC Application
Guidelines](https://datafusion.apache.org/contributor-guide/gsoc_application_guidelines.html).
We encourage students to reach out, discuss ideas with us, and apply.
+
+## Highlighted New Features
+
+### Improved Diagnostics
+
+DataFusion 46.0.0 introduces a new [**SQL Diagnostics
framework**](https://github.com/apache/datafusion/issues/14429) to make error
messages more understandable. This comes in the form of new `Diagnostic` and
`DiagnosticEntry` types, which allow the system to attach rich context (like
source query text spans) to error messages. In practical terms, certain planner
errors will now point to the exact location in your SQL query that caused the
issue.
+
+For example, if you reference an unknown table or omit a column from `GROUP BY`,
the error message will include the query snippet causing the error. These
diagnostics are meant for end-users of applications built on DataFusion,
providing clearer messages instead of generic errors. Here’s an example:
+
+<img src="/blog/images/datafusion-46.0.0/diagnostic-example.png"
alt="diagnostic-example" width="80%" class="img-responsive">
+
+Currently, diagnostics cover unresolved table/column references, missing
`GROUP BY` columns, ambiguous references, wrong number of UNION columns, type
mismatches, and a few others. Future releases will extend this to more error
types. This feature should greatly ease debugging of complex SQL by pinpointing
errors directly in the query text. We thank
[@eliaperantoni](https://github.com/eliaperantoni) for his contributions to
this project.
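To show what attaching a source span to an error buys you, here is a small Python sketch that underlines the offending part of a query. It is not DataFusion's `Diagnostic` API: `render_span`, the character offsets, and the message text are all assumptions made for illustration.

```python
def render_span(sql: str, start: int, end: int, message: str) -> str:
    """Underline sql[start:end] (assumed to sit on one line) and attach a message."""
    line_start = sql.rfind("\n", 0, start) + 1
    line_end = sql.find("\n", start)
    if line_end == -1:
        line_end = len(sql)
    line = sql[line_start:line_end]
    # Pad up to the span's column, then draw one caret per character.
    underline = " " * (start - line_start) + "^" * max(1, end - start)
    return f"{line}\n{underline} {message}"


query = "SELECT nme FROM people"
start = query.index("nme")
print(render_span(query, start, start + 3, "column 'nme' not found"))
```

With a span available, an application can point at the exact token instead of only reporting that some column is missing.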
+
+### Unified `DataSourceExec` for Table Providers
+
+As mentioned, DataFusion now uses a unified `DataSourceExec` for reading
tables, which is both a breaking change and a feature. *Why is this important?*
The new approach simplifies how custom table providers are integrated and
optimized. Namely, the optimizer can treat file scans uniformly and push down
filters/limits more consistently when there is one execution plan that handles
all data sources. The new `DataSourceExec` is paired with a `DataSource` trait
that encapsulates format-spec [...]
+
+All built-in sources (Parquet, CSV, Avro, Arrow, JSON, etc.) have been
migrated to this framework. This unification makes the codebase cleaner and
sets the stage for future enhancements (like consistent metadata handling and
limit pushdown across all formats). Check out PR
[#14224](https://github.com/apache/datafusion/pull/14224) for design details.
We thank [@mertak-synnada](https://github.com/mertak-synnada) and
[@ozankabak](https://github.com/ozankabak) for their contributions.
+
+### FFI Support for Scalar UDFs
+
+DataFusion’s Foreign Function Interface (FFI) has been extended to support
[**user-defined scalar
functions**](https://github.com/apache/datafusion/pull/14579) defined in
external languages. In 46.0.0, you can now expose a custom scalar UDF through
the FFI layer and use it in DataFusion as if it were built-in. This is
particularly exciting for the **Python bindings** and other language
integrations – it means you could define a function in Python (or C, etc.) and
register it with DataFus [...]
+
+### New Statistics/Distribution Framework
+
+This release, thanks mainly to [@Fly-Style](https://github.com/Fly-Style) with
contributions from [@ozankabak](https://github.com/ozankabak) and
[@berkaysynnada](https://github.com/berkaysynnada), includes the initial pieces
of a [**redesigned statistics
framework**](https://github.com/apache/datafusion/pull/14699). DataFusion’s
optimizer can now represent column data distributions using a new
`Distribution` enum, instead of the old precision or range estimations. The
supported distribut [...]
+
+For example, if a filter expression is applied to a column with a known
uniform distribution range, the optimizer can propagate that to estimate result
selectivity more accurately. Similarly, comparisons (`=`, `>`, etc.) on columns
yield Bernoulli distributions (with true/false probabilities) in this model.
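A rough model of that estimation step might look like the sketch below. The `Uniform` and `Bernoulli` classes and the `greater_than` helper are hypothetical Python stand-ins, not the variants or methods of DataFusion's `Distribution` enum. They only illustrate how a comparison on a uniformly distributed column turns into a true/false probability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Uniform:
    """A column whose values are assumed uniformly spread over [low, high]."""
    low: float
    high: float


@dataclass(frozen=True)
class Bernoulli:
    """Outcome of a boolean expression: probability that it evaluates to true."""
    p: float


def greater_than(column: Uniform, constant: float) -> Bernoulli:
    """Model `column > constant` as a Bernoulli distribution."""
    width = column.high - column.low
    if width <= 0:
        # Degenerate range: the column holds a single value.
        return Bernoulli(1.0 if column.low > constant else 0.0)
    fraction = (column.high - constant) / width
    # Clamp to [0, 1] for constants outside the column's range.
    return Bernoulli(min(1.0, max(0.0, fraction)))


estimate = greater_than(Uniform(0.0, 100.0), 75.0)
print(estimate.p)  # 0.25
```

Here a filter such as `c > 75` on a column spread uniformly over 0 to 100 is estimated to keep a quarter of the rows.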
+
+This is a foundational change with many follow-on PRs underway. Although
the immediate user-visible effect is limited (the optimizer didn't magically
improve by an order of magnitude overnight), it lays the groundwork for more
advanced query planning in the future. Over time, as the statistics information
encapsulated in `Distribution`s gets integrated, DataFusion will be able to make
smarter decisions like more aggressive parquet pruning, better join orderings,
and so on based on data dis [...]
+
+### Aggregate Monotonicity and Window Ordering
+
+DataFusion 46.0.0 adds a new concept of
[**set-monotonicity**](https://github.com/apache/datafusion/pull/14271) for
certain transformations, which helps avoid unnecessary sort operations. In
particular, the planner now understands when a **window function introduces new
orderings of data**.
+
+For example, DataFusion now recognizes that a window-aggregate like `MAX` on a
column can produce a result that is **monotonically increasing**, even if the
input column is unordered — depending on the window frame used.
+
+Consider the following query:
+
+```sql
+SELECT MAX(c1) OVER (
+ ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
+) AS max_c1
+FROM c1_table
+ORDER BY max_c1;
+```
+
+In earlier versions of DataFusion, this query would require an additional
`SortExec` on `max_c1` to satisfy the `ORDER BY` clause. However, with the new
set-monotonicity logic, the planner knows that each value `MAX(...) OVER (...)`
produces is at least as large as the one in the previous row, making the extra
sort redundant. This leads to more efficient query execution.
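The property the planner relies on is easy to check by hand. The Python sketch below (sample values chosen arbitrarily) computes a running maximum over an unordered input, mirroring the `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` frame, and confirms the output is already sorted.

```python
from itertools import accumulate

values = [3, 1, 4, 1, 5, 9, 2, 6]

# MAX(c1) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
running_max = list(accumulate(values, max))
print(running_max)  # [3, 3, 4, 4, 5, 9, 9, 9]

# The window output is already in ascending order, so sorting is a no-op.
assert running_max == sorted(running_max)
```

Because each frame only ever grows, the maximum can never decrease from one row to the next, regardless of how the input is ordered.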
+
+PR [#14271](https://github.com/apache/datafusion/pull/14271) introduced the
core monotonicity tracking for aggregates and window functions.
+PR [#14813](https://github.com/apache/datafusion/pull/14813) improved ordering
preservation within various window frame types and added extensive test
coverage.
+Huge thanks to [@berkaysynnada](https://github.com/berkaysynnada) and
[@mertak-synnada](https://github.com/mertak-synnada) for designing and
implementing this optimizer enhancement!
+
+### UNION [ALL | DISTINCT] BY NAME Support
+
+DataFusion now supports `UNION BY NAME` and `UNION ALL BY NAME`, which align
columns by name instead of position. This matches functionality found in
systems like Spark and DuckDB and simplifies combining heterogeneously ordered
result sets.
+
+You no longer need to rewrite column order manually — just write:
+
+```sql
+SELECT col1, col2 FROM t1
+UNION ALL BY NAME
+SELECT col2, col1 FROM t2;
+```
+
+Under the hood, this is supported by the new `union_by_name()` and
`union_by_name_distinct()` plan builder methods.
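Conceptually, aligning by name means reordering the right-hand input to the left-hand column order before appending rows. The Python sketch below models that step only. `union_all_by_name` is a hypothetical helper, not the plan builder method, and it assumes both inputs have exactly the same set of column names.

```python
def union_all_by_name(left_cols, left_rows, right_cols, right_rows):
    """Append right rows to left rows, matching columns by name."""
    if sorted(left_cols) != sorted(right_cols):
        raise ValueError("this sketch assumes both inputs have the same column names")
    # For each output column, find its position in the right-hand input.
    positions = [right_cols.index(name) for name in left_cols]
    reordered = [tuple(row[i] for i in positions) for row in right_rows]
    return list(left_cols), list(left_rows) + reordered


cols, rows = union_all_by_name(
    ["col1", "col2"], [(1, "a")],
    ["col2", "col1"], [("b", 2)],
)
print(cols, rows)  # ['col1', 'col2'] [(1, 'a'), (2, 'b')]
```

A positional `UNION ALL` on the same inputs would instead place `"b"` under `col1`, which is the mistake `BY NAME` avoids.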
+
+Thanks to [@rkrishn7](https://github.com/rkrishn7) for PR
[#14538](https://github.com/apache/datafusion/pull/14538).
+
+### New `range()` Table Function
+
+A new table-valued function `range(start, stop, step)` has been added to make it
easy to generate integer sequences, similar to PostgreSQL’s `generate_series()`
or Spark’s `range()`.
+
+Example:
+
+```sql
+SELECT * FROM range(1, 10, 2);
+```
+
+This returns: 1, 3, 5, 7, 9. It’s great for testing, cross joins, surrogate
keys, and more.
+
+Thanks to [@simonvandel](https://github.com/simonvandel) for PR
[#14830](https://github.com/apache/datafusion/pull/14830).
+
+## Upgrade Guide and Changelog
+
+Upgrading to 46.0.0 should be straightforward for most users, but do review
the [Upgrade Guide for DataFusion
46.0.0](https://datafusion.apache.org/library-user-guide/upgrading.html) for
detailed steps and code changes. The upgrade guide covers the breaking changes
mentioned (like replacing old exec nodes with `DataSourceExec`, updating UDF
invocation to `invoke_with_args`, etc.) and provides code snippets to help with
the transition. For a comprehensive list of all changes, please refer [...]
+
+## Get Involved
+
+Apache DataFusion is an open-source project, and we welcome involvement from
anyone interested. Now is a great time to take 46.0.0 for a spin: try it out on
your workloads, and let us know if you encounter any issues or have
suggestions. You can report bugs or request features on our GitHub issue
tracker, or better yet, submit a pull request. Join our community discussions –
whether you have questions, want to share how you’re using DataFusion, or are
looking to contribute, we’d love to [...]
+
+Happy querying!
\ No newline at end of file
diff --git a/content/images/datafusion-46.0.0/diagnostic-example.png
b/content/images/datafusion-46.0.0/diagnostic-example.png
new file mode 100644
index 0000000..0dc3df5
Binary files /dev/null and
b/content/images/datafusion-46.0.0/diagnostic-example.png differ