findepi commented on code in PR #6: URL: https://github.com/apache/datafusion-site/pull/6#discussion_r1684332130
########## _posts/2024-07-09-datafusion-40.0.0.md: ########## @@ -0,0 +1,450 @@ +--- +layout: post +title: "Apache Arrow DataFusion 40.0.0 Released" +date: "2024-07-09 00:00:00" +author: alamb +categories: [release] +--- + +<!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %} +--> + +<!-- see https://github.com/apache/datafusion/issues/9602 for details --> + +## Introduction + +We recently [released DataFusion 40.0.0]. This blog highlights some of the many +major improvements since we [released DataFusion 34.0.0] +and a preview of where the community is thinking about improving in the next 6 months. + +[released DataFusion 34.0.0]: https://datafusion.apache.org/blog/2024/01/19/datafusion-34.0.0/ +[released DataFusion 40.0.0]: https://crates.io/crates/datafusion/40.0.0 + +<!-- todo update this intro --> +[Apache Arrow DataFusion] is an extensible query engine, written in [Rust], that +uses [Apache Arrow] as its in-memory format. DataFusion is used by developers to +create new, fast data centric systems such as databases, dataframe libraries, +machine learning and streaming applications. 
While [DataFusion’s primary design +goal] is to accelerate creating other data centric systems, it has a +reasonable experience directly out of the box as a [dataframe library] and +[command line SQL tool]. + +[DataFusion’s primary design goal]: https://arrow.apache.org/datafusion/user-guide/introduction.html#project-goals +[dataframe library]: https://arrow.apache.org/datafusion-python/ +[command line SQL tool]: https://arrow.apache.org/datafusion/user-guide/cli.html + + +[apache arrow datafusion]: https://datafusion.apache.org/ +[apache arrow]: https://arrow.apache.org +[rust]: https://www.rust-lang.org/ + +DataFusion's core thesis is that as a community together, we can build much more +advanced technology than any of us as individuals or companies could do alone. +Without DataFusion, highly performant vectorized query engines would remain +the domain of a few large companies and world-class research institutions. +With DataFusion, we can all build on top of a shared foundation, and focus on +what makes our projects unique. + + + + +# Community Growth + +In the last 6 months, between `34.0.0` and `40.0.0`, our community has continued to +grow in new and exciting ways. + +1. DataFusion became a top-level Apache Software Foundation project (read the + [press release] and [blog post]). +2. We added several PMC members and new + committers: [@comphead], [@mustafasrepo], and [@ozankabak] joined the PMC, and + [@jonahgao] and [@lewiszlw] joined as committers. See the [mailing list] for + more details. +3. [DataFusion Comet] was [donated] and is nearing its first release. +4. In the [core DataFusion repo] alone we reviewed and accepted almost 1500 PRs from 182 different + committers, created over 1000 issues and closed 781 of them. This is up from + 1000 PRs from 124 committers with 650 issues created in our last post 🤯. You + can find a list of all changes in the detailed [CHANGELOG]. +5. 
DataFusion meetups in multiple cities around the world: [Austin], [San Francisco], + [Hangzhou], [New York], and [Belgrade]. +6. Many new projects in the [datafusion-contrib] organization, including + [Table Providers], [SQL Lancer], [Open Variant], [JSON], and [ORC]. + +[core DataFusion repo]: https://github.com/apache/arrow-datafusion +[CHANGELOG]: https://github.com/apache/datafusion/blob/main/datafusion/CHANGELOG.md +[press release]: https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion +[blog post]: https://datafusion.apache.org/blog/2024/05/07/datafusion-tlp/ +[@comphead]: https://github.com/comphead +[@mustafasrepo]: https://github.com/mustafasrepo +[@ozankabak]: https://github.com/ozankabak +[@jonahgao]: https://github.com/jonahgao +[@lewiszlw]: https://github.com/lewiszlw +[mailing list]: https://lists.apache.org/list.html?d...@datafusion.apache.org +[Austin]: https://github.com/apache/datafusion/discussions/8522 +[San Francisco]: https://github.com/apache/datafusion/discussions/10800 +[Hangzhou]: https://www.huodongxing.com/event/5761971909400?td=1965290734055 +[New York]: https://github.com/apache/datafusion/discussions/11213 +[Belgrade]: https://github.com/apache/datafusion/discussions/11431 + +[datafusion-contrib]: https://github.com/datafusion-contrib +[Table Providers]: https://github.com/datafusion-contrib/datafusion-table-providers +[SQL Lancer]: https://github.com/datafusion-contrib/datafusion-sqllancer Review Comment: The README says "DataFusion-SQLancer", but the repo name is "datafusion-sqllancer" (double "l"). Is this intentional? If not, can we choose one spelling and stick to it? 
########## _posts/2024-07-09-datafusion-40.0.0.md: ########## @@ -0,0 +1,450 @@ +--- +layout: post +title: "Apache Arrow DataFusion 40.0.0 Released" +date: "2024-07-09 00:00:00" +author: alamb +categories: [release] +--- + +<!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %} +--> + +<!-- see https://github.com/apache/datafusion/issues/9602 for details --> + +## Introduction + +We recently [released DataFusion 40.0.0]. This blog highlights some of the many +major improvements since we [released DataFusion 34.0.0] +and a preview of where the community is thinking about improving in the next 6 months. + +[released DataFusion 34.0.0]: https://datafusion.apache.org/blog/2024/01/19/datafusion-34.0.0/ +[released DataFusion 40.0.0]: https://crates.io/crates/datafusion/40.0.0 + +<!-- todo update this intro --> +[Apache Arrow DataFusion] is an extensible query engine, written in [Rust], that +uses [Apache Arrow] as its in-memory format. DataFusion is used by developers to +create new, fast data centric systems such as databases, dataframe libraries, +machine learning and streaming applications. 
While [DataFusion’s primary design +goal] is to accelerate creating other data centric systems, it has a +reasonable experience directly out of the box as a [dataframe library] and +[command line SQL tool]. + +[DataFusion’s primary design goal]: https://arrow.apache.org/datafusion/user-guide/introduction.html#project-goals +[dataframe library]: https://arrow.apache.org/datafusion-python/ +[command line SQL tool]: https://arrow.apache.org/datafusion/user-guide/cli.html + + +[apache arrow datafusion]: https://datafusion.apache.org/ +[apache arrow]: https://arrow.apache.org +[rust]: https://www.rust-lang.org/ + +DataFusion's core thesis is that as a community together, we can build much more +advanced technology than any of us as individuals or companies could do alone. +Without DataFusion, highly performant vectorized query engines would remain +the domain of a few large companies and world-class research institutions. +With DataFusion, we can all build on top of a shared foundation, and focus on +what makes our projects unique. + + + + +# Community Growth + +In the last 6 months, between `34.0.0` and `40.0.0`, our community has continued to +grow in new and exciting ways. + +1. DataFusion became a top-level Apache Software Foundation project (read the + [press release] and [blog post]). +2. We added several PMC members and new + committers: [@comphead], [@mustafasrepo], and [@ozankabak] joined the PMC, and + [@jonahgao] and [@lewiszlw] joined as committers. See the [mailing list] for + more details. +3. [DataFusion Comet] was [donated] and is nearing its first release. +4. In the [core DataFusion repo] alone we reviewed and accepted almost 1500 PRs from 182 different + committers, created over 1000 issues and closed 781 of them. This is up from + 1000 PRs from 124 committers with 650 issues created in our last post 🤯. You + can find a list of all changes in the detailed [CHANGELOG]. +5. 
DataFusion meetups in multiple cities around the world: [Austin], [San Francisco], + [Hangzhou], [New York], and [Belgrade]. +6. Many new projects in the [datafusion-contrib] organization, including + [Table Providers], [SQL Lancer], [Open Variant], [JSON], and [ORC]. + +[core DataFusion repo]: https://github.com/apache/arrow-datafusion +[CHANGELOG]: https://github.com/apache/datafusion/blob/main/datafusion/CHANGELOG.md +[press release]: https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion +[blog post]: https://datafusion.apache.org/blog/2024/05/07/datafusion-tlp/ +[@comphead]: https://github.com/comphead +[@mustafasrepo]: https://github.com/mustafasrepo +[@ozankabak]: https://github.com/ozankabak +[@jonahgao]: https://github.com/jonahgao +[@lewiszlw]: https://github.com/lewiszlw +[mailing list]: https://lists.apache.org/list.html?d...@datafusion.apache.org +[Austin]: https://github.com/apache/datafusion/discussions/8522 +[San Francisco]: https://github.com/apache/datafusion/discussions/10800 +[Hangzhou]: https://www.huodongxing.com/event/5761971909400?td=1965290734055 +[New York]: https://github.com/apache/datafusion/discussions/11213 +[Belgrade]: https://github.com/apache/datafusion/discussions/11431 + +[datafusion-contrib]: https://github.com/datafusion-contrib +[Table Providers]: https://github.com/datafusion-contrib/datafusion-table-providers +[SQL Lancer]: https://github.com/datafusion-contrib/datafusion-sqllancer +[Open Variant]: https://github.com/datafusion-contrib/datafusion-functions-variant +[JSON]: https://github.com/datafusion-contrib/datafusion-functions-json +[ORC]: https://github.com/datafusion-contrib/datafusion-orc + +<!-- +$ git log --pretty=oneline 34.0.0..40.0.0 . | wc -l + 1453 (up from 1009) + +$ git shortlog -sn 34.0.0..40.0.0 . 
| wc -l + 182 (up from 124) + + +https://crates.io/crates/datafusion/34.0.0 +DataFusion 34 released Dec 17, 2023 + +https://crates.io/crates/datafusion/40.0.0 +DataFusion 34 released July 12, 2024 + +Issues created in this time: 321 open, 781 closed (up from 214 open, 437 closed) +https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+created%3A2023-12-17..2024-07-12 + +Issues closed: 911 (up from 517) +https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2023-12-17..2024-07-12 + +PRs merged in this time 1490 (up from 908) +https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2023-12-17..2024-07-12 + +--> + + +In addition, DataFusion has been appearing in more and more writing, both online and offline. Here are some highlights: + +1. [Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine], was presented in [SIGMOD '24], one of the major database conferences +2. DataFusion described as part of the trend to define "the POSIX of databases" in ["What Goes Around Comes Around... And Around...] from Andy Pavlo and Mike Stonebraker +3. ["Why you should keep an eye on Apache DataFusion and its community"] +4. [Apache DataFusion offline meetup in the Bay Area] + + +[DataFusion Comet]: https://datafusion.apache.org/comet/ +[donated]: https://arrow.apache.org/blog/2024/03/06/comet-donation/ +[SIGMOD '24]: https://2024.sigmod.org/ + + +[Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine]: https://dl.acm.org/doi/10.1145/3626246.3653368 +["What Goes Around Comes Around... 
And Around...]: https://db.cs.cmu.edu/papers/2024/whatgoesaround-sigmodrec2024.pdf +["Why you should keep an eye on Apache DataFusion and its community"]: https://www.cpard.xyz/posts/datafusion/ +[Apache DataFusion offline meetup in the Bay Area]: https://www.tisonkun.org/2024/07/15/datafusion-meetup-san-francisco/ + + +# Improved Performance + +Performance is a key feature of DataFusion, and the community continues to work +to keep DataFusion state of the art in this area. One area where DataFusion +improved significantly is the time it takes to convert a SQL query into a plan that can be +executed. Planning is now almost 2x faster for TPC-DS and TPC-H queries, and +over 10x faster for some queries with many columns. + +Here is a chart showing the improvement due to the concerted effort of +many contributors (TODO list contributors by name) over several months (see +[ticket] for more details). + +<img src="{{ site.baseurl }}/assets/datafusion-40.0.0/improved-planning-time.png" width="700"> + +[ticket]: https://github.com/apache/datafusion/issues/9637 + +Also, we implemented [specialization for single Utf8/LargeUtf8/Binary/LargeBinary] +group by columns, which resulted in a 40% performance improvement for some +benchmarks. + +[specialization for single Utf8/LargeUtf8/Binary/LargeBinary]: https://github.com/apache/datafusion/pull/8827 + +We are also in the final phases of our initial integration of the new [Arrow +StringView], which will provide a significant performance improvement +for many workloads. This feature should be available in future versions of DataFusion. +Kudos to [@XiangpengHong], [@PsiACE], [@Weigju>XXXX], +[@AriesDevil], and [@alamb] for driving this along. + +[Arrow StringView]: https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html + + +# Improved Quality + +DataFusion continues to improve in overall quality. 
One of the most exciting +improvements is the addition of a new [SQLancer]-based [DataFusion Fuzzing] +suite thanks to [@2010YOUY01] that has already found several bugs (kudos to [@jonahgao], +YYY, and ZZZ for fixing them so fast). + +[SQLancer]: https://github.com/apache/datafusion/issues/11030 +[DataFusion Fuzzing]: https://github.com/datafusion-contrib/datafusion-sqllancer +[@2010YOUY01]: https://github.com/2010YOUY01 + + +## Improved Documentation + +We continue to improve the documentation to make it easier to get started using DataFusion with +the [Library Users Guide], [API documentation], and [Examples]. + +Some notable new examples include: +* [sql_analysis.rs] to analyse SQL queries with DataFusion structures (thanks [@LorrensP-2158466]) +* [plan_to_sql.rs] to generate SQL from DataFusion `Expr` and `LogicalPlan` (thanks [@edmondop]) + +[Library Users Guide]: https://datafusion.apache.org/library-user-guide/index.html +[API documentation]: https://docs.rs/datafusion/latest/datafusion/index.html +[Examples]: https://github.com/apache/datafusion/tree/main/datafusion-examples +[sql_analysis.rs]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/sql_analysis.rs +[plan_to_sql.rs]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/plan_to_sql.rs +[@LorrensP-2158466]: https://github.com/LorrensP-2158466 +[@edmondop]: https://github.com/edmondop + +# New Features ✨ + +There are too many new features in the last 6 months to list them all, but here +are some highlights: + +## SQL +* Support for unnest (TODO LINK) +* Support for recursive CTEs (https://github.com/apache/datafusion/pull/9619 / https://github.com/apache/datafusion/issues/462) +* Support for `CREATE FUNCTION` (see below) +* New functions: TODO find list +* Improved support for structured types such as `STRUCT`, `LIST`/`ARRAY` and `MAP` + +```sql +> select {'foo': {'bar': 2}}; ++--------------------------------------------------------------+ +| 
named_struct(Utf8("foo"),named_struct(Utf8("bar"),Int64(2))) | ++--------------------------------------------------------------+ +| {foo: {bar: 2}} | ++--------------------------------------------------------------+ +1 row(s) fetched. +Elapsed 0.002 seconds. +``` + +## SQL Unparser + +DataFusion now supports converting `Expr`s and `LogicalPlan`s back to SQL text. +This can be useful in query federation to push predicates down into other +systems that only accept SQL, and for building systems that generate SQL. + +For example, you can now convert a logical expression back to SQL text: + +```rust +// Form a logical expression that represents the SQL "a < 5 OR a = 8" +let expr = col("a").lt(lit(5)).or(col("a").eq(lit(8))); +// Convert the expression back to SQL text +let sql = expr_to_sql(&expr)?.to_string(); +assert_eq!(sql, "a < 5 OR a = 8"); +``` + +You can also do more complex things, such as parsing SQL, +modifying the plan, and then converting it back to SQL: + +```rust +let df = ctx + // Use SQL to read some data from the parquet file + .sql("SELECT int_col, double_col, CAST(date_string_col as VARCHAR) FROM alltypes_plain") + .await?; +// Programmatically add new filters `id > 1 and tinyint_col < double_col` +let df = df.filter(col("id").gt(lit(1)).and(col("tinyint_col").lt(col("double_col"))))?; +// Convert the new logical plan back to SQL +let sql = plan_to_sql(df.logical_plan())?.to_string(); +assert_eq!(sql, + "SELECT alltypes_plain.int_col, alltypes_plain.double_col, CAST(alltypes_plain.date_string_col AS VARCHAR) \ + FROM alltypes_plain WHERE ((alltypes_plain.id > 1) AND (alltypes_plain.tinyint_col < alltypes_plain.double_col))" +); +``` + +See the [Plan to SQL example] or the APIs [expr_to_sql] and [plan_to_sql] for more details. 
+ +[expr_to_sql]: https://docs.rs/datafusion/latest/datafusion/sql/unparser/fn.expr_to_sql.html +[plan_to_sql]: https://docs.rs/datafusion/latest/datafusion/sql/unparser/fn.plan_to_sql.html +[Plan to SQL example]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/plan_to_sql.rs + + +BTW it would be great if someone made a demo showing how to do this (see https://github.com/apache/datafusion/issues/9326 ) + +## Low Level APIs for Fast Parquet Access (indexing) + +With the rising prevalence of Parquet files stored on object stores, it is +important for query engines to support efficient access to Parquet files. Part +of doing this efficiently is to minimize the number of requests made to an +object store by caching metadata and skipping over parts of the file that are +not needed (e.g. via an index). + +DataFusion's Parquet reader has long internally supported advanced predicate +pushdown by reading the Parquet metadata from the file footer and pruning based +on row group level statistics as well as data page level statistics. DataFusion +now also supports users supplying their own low level pruning information via +the [`ParquetAccessPlan`] API. + +This API can be used along with index information to selectively skip decoding +parts of the file. This feature has been used to add [efficient support] for +reading from DeltaLake tables and handling [deletion vectors]. + +```text + βββββββββββββββββββββββββ If the RowSelection does not include any + β ... β rows from a particular Data Page, that + β β Data Page is not fetched or decoded. + β βββββββββββββββββββββ β Note this requires a PageIndex + β β ββββββββββββ β β +Row β β βDataPage 0β β β ββββββββββββββββββββββ +Groups β β ββββββββββββ β β β β + β β ββββββββββββ β β β ParquetExec β + β β ... 
βDataPage 1β ββΌ βΌ β β β β (Parquet Reader) β + β β ββββββββββββ β β β β β β β ββ β + β β ββββββββββββ β β β βββββββββββββββββ β + β β βDataPage 2β β β If only rows β βParquetMetadataβ β + β β ββββββββββββ β β from DataPage 1 β βββββββββββββββββ β + β βββββββββββββββββββββ β are selected, ββββββββββββββββββββββ + β β only DataPage 1 + β ... β is fetched and + β β decoded + β βββββββββββββββββββββ β + β β Thrift metadata β β + β βββββββββββββββββββββ β + βββββββββββββββββββββββββ + Parquet File +``` + +See the [parquet_index.rs] and [advanced_parquet_index.rs] examples for more details. + +Thanks to [@alamb] and XXX for this feature. Review Comment: just a reminder to update XXX ########## _posts/2024-07-09-datafusion-40.0.0.md: ########## @@ -0,0 +1,450 @@ +--- +layout: post +title: "Apache Arrow DataFusion 40.0.0 Released" +date: "2024-07-09 00:00:00" +author: alamb +categories: [release] +--- + +<!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %} +--> + +<!-- see https://github.com/apache/datafusion/issues/9602 for details --> + +## Introduction + +We recently [released DataFusion 40.0.0]. 
This blog highlights some of the many +major improvements since we [released DataFusion 34.0.0] +and a preview of where the community is thinking about improving in the next 6 months. + +[released DataFusion 34.0.0]: https://datafusion.apache.org/blog/2024/01/19/datafusion-34.0.0/ +[released DataFusion 40.0.0]: https://crates.io/crates/datafusion/40.0.0 + +<!-- todo update this intro --> +[Apache Arrow DataFusion] is an extensible query engine, written in [Rust], that +uses [Apache Arrow] as its in-memory format. DataFusion is used by developers to +create new, fast data centric systems such as databases, dataframe libraries, +machine learning and streaming applications. While [DataFusionβs primary design +goal] is to accelerate creating other data centric systems, it has a +reasonable experience directly out of the box as a [dataframe library] and +[command line SQL tool]. + +[DataFusionβs primary design goal]: https://arrow.apache.org/datafusion/user-guide/introduction.html#project-goals +[dataframe library]: https://arrow.apache.org/datafusion-python/ +[command line SQL tool]: https://arrow.apache.org/datafusion/user-guide/cli.html + + +[apache arrow datafusion]: https://datafusion.apache.org/ +[apache arrow]: https://arrow.apache.org +[rust]: https://www.rust-lang.org/ + +DataFusion's core thesis is that as a community together, we can build much more +advanced technology than any of us as individuals or companies could do alone. +Without DataFusion, highly performant vectorized query engines would remain +the domain of a few large companies and world-class research institutions. +With DataFusion, we can all build on top of a shared foundation, and focus on +what makes our projects unique. + + + + +# Community Growth π + +In the last 6 months, between `34.0.0` and `40.0.0`, our community continues to +grow in new ane exciting ways. + +1. DataFusion became a top level Apache Software Foundation project (read the + [press release] and [blog post]). +2. 
We added several PMC members and new + committers [@comphead], [@mustafasrepo], [@ozankabak] joined the PMC, + [@jonahgao] and [@lewiszlw] joined as a committer. See the [mailing list] for + more details. +3. [DataFusion Comet] was [donated] and is nearing its first release. +4. In the [core DataFusion repo] alone we reviewed and accepted almost 1500 PRs from 182 different + committers, created over 1000 issues and closed 781 of them π. This is up from + 1000 PRs from 124 committers with 650 issues created in our last post π€―. You + can find a list of all changes in the detailed [CHANGELOG]. +5. DataFusion meetups in multiple cities around the world: [Austin], [San Francisco], + [Hangzhou], [New York], and [Belgrade]. +6. Many new projects in the [datafusion-contrib] organization, including + [Table Providers], [SQL Lancer], [Open Variant], [JSON], and [ORC]. + +[core DataFusion repo]: https://github.com/apache/arrow-datafusion +[CHANGELOG]: https://github.com/apache/datafusion/blob/main/datafusion/CHANGELOG.md +[press release]: https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion +[blog post]: https://datafusion.apache.org/blog/2024/05/07/datafusion-tlp/ +[@comphead]: https://github.com/comphead +[@mustafasrepo]: https://github.com/mustafasrepo +[@ozankabak]: https://github.com/ozankabak +[@jonahgao]: https://github.com/jonahgao +[@lewiszlw]: https://github.com/lewiszlw +[mailing list]: https://lists.apache.org/list.html?d...@datafusion.apache.org +[Austin]: https://github.com/apache/datafusion/discussions/8522 +[San Francisco]: https://github.com/apache/datafusion/discussions/10800 +[Hangzhou]: https://www.huodongxing.com/event/5761971909400?td=1965290734055 +[New York]: https://github.com/apache/datafusion/discussions/11213 +[Belgrade]: https://github.com/apache/datafusion/discussions/11431 + +[datafusion-contrib]: https://github.com/datafusion-contrib +[Table Providers]: 
https://github.com/datafusion-contrib/datafusion-table-providers +[SQL Lancer]: https://github.com/datafusion-contrib/datafusion-sqllancer +[Open Variant]: https://github.com/datafusion-contrib/datafusion-functions-variant +[JSON]: https://github.com/datafusion-contrib/datafusion-functions-json +[ORC]: https://github.com/datafusion-contrib/datafusion-orc + +<!-- +$ git log --pretty=oneline 34.0.0..40.0.0 . | wc -l + 1453 (up from 1009) + +$ git shortlog -sn 34.0.0..40.0.0 . | wc -l + 182 (up from 124) + + +https://crates.io/crates/datafusion/34.0.0 +DataFusion 34 released Dec 17, 2023 + +https://crates.io/crates/datafusion/40.0.0 +DataFusion 34 released July 12, 2024 + +Issues created in this time: 321 open, 781 closed (up from 214 open, 437 closed) +https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+created%3A2023-12-17..2024-07-12 + +Issues closed: 911 (up from 517) +https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2023-12-17..2024-07-12 + +PRs merged in this time 1490 (up from 908) +https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2023-12-17..2024-07-12 + +--> + + +In addition, DataFusion has been appearing in more and more writing, both online and offline. Here are some highlights: + +1. [Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine], was presented in [SIGMOD '24], one of the major database conferences +2. DataFusion described as part of the trend to define "the POSIX of databases" in ["What Goes Around Comes Around... And Around...] from Andy Pavlo and Mike Stonebraker +3. ["Why you should keep an eye on Apache DataFusion and its community"] +4. 
[Apache DataFusion offline meetup in the Bay Area] + + +[DataFusion Comet]: https://datafusion.apache.org/comet/ +[donated]: https://arrow.apache.org/blog/2024/03/06/comet-donation/ +[SIGMOD '24]: https://2024.sigmod.org/ + + +[Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine]: https://dl.acm.org/doi/10.1145/3626246.3653368 +["What Goes Around Comes Around... And Around...]: https://db.cs.cmu.edu/papers/2024/whatgoesaround-sigmodrec2024.pdf +["Why you should keep an eye on Apache DataFusion and its community"]: https://www.cpard.xyz/posts/datafusion/ +[Apache DataFusion offline meetup in the Bay Area]: https://www.tisonkun.org/2024/07/15/datafusion-meetup-san-francisco/ + + +# Improved Performance π + +Performance is a key feature of DataFusion, and the community continues to work +to keep DataFusion state of the art in this area. One major area DataFusion +improved is the time it takes to convert a SQL query into a plan that can be +executed. Planning is now almost 2x faster for TPC-DS and TPC-H queries, and +over 10x faster for some queries with many columns. + +Here is a chart showing the improvement due to the concerted effort of +many contributors (TODO list contributors by name) over several months (see +[ticket] for more details) + +<img src="{{ site.baseurl }}/assets/datafusion-40.0.0/improved-planning-time.png" width="700"> + +[ticket]: https://github.com/apache/datafusion/issues/9637 + +Also, we implemented [specialization for single Uft8/LargeUtf8/Binary/LargeBinary] +group by columns which resulted in a 40% performance improvement for some +benchmarks. + +[specialization for single Uft8/LargeUtf8/Binary/LargeBinary]: https://github.com/apache/datafusion/pull/8827 + +We are also in the final phases of our initial integration of the new [Arrow +StringView] which will provide a significant performance improvement +for many workloads. This feature should be available in future versions of DataFusion. 
+Kudos to [@XiangpengHong], [@PsiACE], [@Weigju>XXXX] and
+[@AriesDevil], and [@alamb] for driving this along.
+
+[Arrow StringView]: https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html
+
+
+# Improved Quality
+
+DataFusion continues to improve overall in quality. One of the most exciting
+improvements is the addition of a new [SQLancer] based [DataFusion Fuzzing]
+suite thanks to [@2010YOUY01] that has already found several bugs (kudos to [@jonahgao],
+YYY, and ZZZ for fixing them so fast).

Review Comment:
   reminder to replace this

##########
_posts/2024-07-09-datafusion-40.0.0.md:
##########
@@ -0,0 +1,450 @@
+We are also in the final phases of our initial integration of the new [Arrow
+StringView] which will provide a significant performance improvement
+for many workloads. This feature should be available in future versions of DataFusion.

Review Comment:
   > performance improvement for many workloads

   same here -- try to be a little more explicit or just don't sound like avoiding being explicit
   "for many workloads that compute expressions over string values" ?
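One way to make the StringView claim more concrete, in the spirit of the comment above: a minimal sketch of the 16-byte view layout, assuming the encoding described in the Arrow columnar format (a 4-byte length, then either up to 12 inline bytes, or a 4-byte prefix plus a buffer index and offset). The names `View`, `make_view`, `bytes`, and `definitely_different` are hypothetical and are not the arrow-rs or DataFusion API. The sketch only illustrates why comparisons over string values can often be decided from the views alone, without reading the string buffer.

```rust
/// Toy model of one 16-byte Arrow string view (hypothetical, not the arrow-rs API).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum View {
    /// Strings of at most 12 bytes are stored entirely inside the view.
    Inline { len: u32, data: [u8; 12] },
    /// Longer strings keep a 4-byte prefix and point into a shared buffer.
    Indirect { len: u32, prefix: [u8; 4], buffer_index: u32, offset: u32 },
}

/// Builds a view for `value`, appending long strings to `buffer` (treated as buffer 0).
fn make_view(value: &str, buffer: &mut Vec<u8>) -> View {
    let bytes = value.as_bytes();
    let len = bytes.len() as u32;
    if bytes.len() <= 12 {
        let mut data = [0u8; 12];
        data[..bytes.len()].copy_from_slice(bytes);
        View::Inline { len, data }
    } else {
        let offset = buffer.len() as u32;
        buffer.extend_from_slice(bytes);
        let mut prefix = [0u8; 4];
        prefix.copy_from_slice(&bytes[..4]);
        View::Indirect { len, prefix, buffer_index: 0, offset }
    }
}

/// Reads the full bytes back, consulting the buffer only for long strings.
fn bytes<'a>(view: &'a View, buffer: &'a [u8]) -> &'a [u8] {
    match view {
        View::Inline { len, data } => &data[..*len as usize],
        View::Indirect { len, offset, .. } => {
            let start = *offset as usize;
            &buffer[start..start + *len as usize]
        }
    }
}

/// Length and first four bytes, both readable without touching the buffer.
fn len_and_prefix(view: &View) -> (u32, [u8; 4]) {
    match view {
        View::Inline { len, data } => (*len, [data[0], data[1], data[2], data[3]]),
        View::Indirect { len, prefix, .. } => (*len, *prefix),
    }
}

/// True when the views alone prove the two strings differ.
fn definitely_different(a: &View, b: &View) -> bool {
    len_and_prefix(a) != len_and_prefix(b)
}

fn main() {
    let mut buffer = Vec::new();
    let short = make_view("datafusion", &mut buffer);
    let long_a = make_view("apache arrow datafusion", &mut buffer);
    let long_b = make_view("apache arrow stringview", &mut buffer);

    // Different lengths: decided without reading the buffer.
    println!("{}", definitely_different(&short, &long_a));
    // Same length and same prefix: the buffer bytes would have to be compared.
    println!("{}", definitely_different(&long_a, &long_b));
    // Long strings are still recoverable through the offset into the buffer.
    println!("{}", String::from_utf8_lossy(bytes(&long_b, &buffer)));
}
```

In this toy model, filters and equality checks that mostly see short strings or strings with distinct prefixes never touch the buffer, which is the kind of workload the suggested wording ("expressions over string values") points at. Whether a given DataFusion query benefits would still need to be measured.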
##########
_posts/2024-07-09-datafusion-40.0.0.md:
##########
@@ -0,0 +1,450 @@
+[SQLancer]: https://github.com/apache/datafusion/issues/11030
+[DataFusion Fuzzing]: https://github.com/datafusion-contrib/datafusion-sqllancer
+[@2010YOUY01]: https://github.com/2010YOUY01
+
+
+## Improved Documentation
+
+We continue to improve the documentation to make it easier to get started using DataFusion with
+the [Library Users Guide], [API documentation], and [Examples].
+
+Some notable new examples include:
+* [sql_analysis.rs] to analyse SQL queries with DataFusion structures (thanks [@LorrensP-2158466])
+* [plan_to_sql.rs] to generate SQL from DataFusion Expr and LogicalPlan (thanks [@edmondop])
+
+[Library Users Guide]: https://datafusion.apache.org/library-user-guide/index.html
+[API documentation]: https://docs.rs/datafusion/latest/datafusion/index.html
+[Examples]: https://github.com/apache/datafusion/tree/main/datafusion-examples
+[sql_analysis.rs]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/sql_analysis.rs
+[plan_to_sql.rs]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/plan_to_sql.rs
+[@LorrensP-2158466]: https://github.com/LorrensP-2158466
+[@edmondop]: https://github.com/edmondop
+
+# New Features ✨
+
+There are too many new features in the last 6 months to list them all, but here
+are some highlights:
+
+## SQL
+* Support for unnest (TODO LINK)
+* Support Recursive CTEs https://github.com/apache/datafusion/pull/9619 / https://github.com/apache/datafusion/issues/462
+* Support for `CREATE FUNCTION` (see below)
+* New functions: TODO find list
+* Improved support for structured types such a `STRUCT`, `LIST`/`ARRAY` and `MAP`
+
+```rust
+> select {'foo': {'bar': 2}};
++--------------------------------------------------------------+
+| named_struct(Utf8("foo"),named_struct(Utf8("bar"),Int64(2))) |
++--------------------------------------------------------------+
+| {foo: {bar: 2}}                                              |
++--------------------------------------------------------------+
+1 row(s) fetched.
+Elapsed 0.002 seconds.
+```
+
+## SQL Unparser
+
+DataFusion now supports converting `Expr`s and `LogicalPlan`s BACK to SQL text.

Review Comment:
   a side question - is there assumption that every plan can be converted to SQL?
   what happens with operations that are result of optimization of original SQL, like partial aggregations?

##########
_posts/2024-07-09-datafusion-40.0.0.md:
##########
@@ -0,0 +1,450 @@
+Also, we implemented [specialization for single Uft8/LargeUtf8/Binary/LargeBinary]
+group by columns which resulted in a 40% performance improvement for some
+benchmarks.

Review Comment:
   > for some benchmarks

   can this be a little more explicit about what type of workloads are expected to benefit?

##########
_posts/2024-07-09-datafusion-40.0.0.md:
##########
@@ -0,0 +1,450 @@
+## SQL
+* Support for unnest (TODO LINK)
+* Support Recursive CTEs https://github.com/apache/datafusion/pull/9619 / https://github.com/apache/datafusion/issues/462

Review Comment:
   nit: double space (it's not rendered, right?)
########## _posts/2024-07-09-datafusion-40.0.0.md: ########## @@ -0,0 +1,450 @@ +--- +layout: post +title: "Apache Arrow DataFusion 40.0.0 Released" +date: "2024-07-09 00:00:00" +author: alamb +categories: [release] +--- + +<!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %} +--> + +<!-- see https://github.com/apache/datafusion/issues/9602 for details --> + +## Introduction + +We recently [released DataFusion 40.0.0]. This blog highlights some of the many +major improvements since we [released DataFusion 34.0.0] +and a preview of where the community is thinking about improving in the next 6 months. + +[released DataFusion 34.0.0]: https://datafusion.apache.org/blog/2024/01/19/datafusion-34.0.0/ +[released DataFusion 40.0.0]: https://crates.io/crates/datafusion/40.0.0 + +<!-- todo update this intro --> +[Apache Arrow DataFusion] is an extensible query engine, written in [Rust], that +uses [Apache Arrow] as its in-memory format. DataFusion is used by developers to +create new, fast data centric systems such as databases, dataframe libraries, +machine learning and streaming applications. 
While [DataFusionβs primary design +goal] is to accelerate creating other data centric systems, it has a +reasonable experience directly out of the box as a [dataframe library] and +[command line SQL tool]. + +[DataFusionβs primary design goal]: https://arrow.apache.org/datafusion/user-guide/introduction.html#project-goals +[dataframe library]: https://arrow.apache.org/datafusion-python/ +[command line SQL tool]: https://arrow.apache.org/datafusion/user-guide/cli.html + + +[apache arrow datafusion]: https://datafusion.apache.org/ +[apache arrow]: https://arrow.apache.org +[rust]: https://www.rust-lang.org/ + +DataFusion's core thesis is that as a community together, we can build much more +advanced technology than any of us as individuals or companies could do alone. +Without DataFusion, highly performant vectorized query engines would remain +the domain of a few large companies and world-class research institutions. +With DataFusion, we can all build on top of a shared foundation, and focus on +what makes our projects unique. + + + + +# Community Growth π + +In the last 6 months, between `34.0.0` and `40.0.0`, our community continues to +grow in new ane exciting ways. + +1. DataFusion became a top level Apache Software Foundation project (read the + [press release] and [blog post]). +2. We added several PMC members and new + committers [@comphead], [@mustafasrepo], [@ozankabak] joined the PMC, + [@jonahgao] and [@lewiszlw] joined as a committer. See the [mailing list] for + more details. +3. [DataFusion Comet] was [donated] and is nearing its first release. +4. In the [core DataFusion repo] alone we reviewed and accepted almost 1500 PRs from 182 different + committers, created over 1000 issues and closed 781 of them π. This is up from + 1000 PRs from 124 committers with 650 issues created in our last post π€―. You + can find a list of all changes in the detailed [CHANGELOG]. +5. 
DataFusion meetups in multiple cities around the world: [Austin], [San Francisco], + [Hangzhou], [New York], and [Belgrade]. +6. Many new projects in the [datafusion-contrib] organization, including + [Table Providers], [SQL Lancer], [Open Variant], [JSON], and [ORC]. + +[core DataFusion repo]: https://github.com/apache/arrow-datafusion +[CHANGELOG]: https://github.com/apache/datafusion/blob/main/datafusion/CHANGELOG.md +[press release]: https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion +[blog post]: https://datafusion.apache.org/blog/2024/05/07/datafusion-tlp/ +[@comphead]: https://github.com/comphead +[@mustafasrepo]: https://github.com/mustafasrepo +[@ozankabak]: https://github.com/ozankabak +[@jonahgao]: https://github.com/jonahgao +[@lewiszlw]: https://github.com/lewiszlw +[mailing list]: https://lists.apache.org/list.html?d...@datafusion.apache.org +[Austin]: https://github.com/apache/datafusion/discussions/8522 +[San Francisco]: https://github.com/apache/datafusion/discussions/10800 +[Hangzhou]: https://www.huodongxing.com/event/5761971909400?td=1965290734055 +[New York]: https://github.com/apache/datafusion/discussions/11213 +[Belgrade]: https://github.com/apache/datafusion/discussions/11431 + +[datafusion-contrib]: https://github.com/datafusion-contrib +[Table Providers]: https://github.com/datafusion-contrib/datafusion-table-providers +[SQL Lancer]: https://github.com/datafusion-contrib/datafusion-sqllancer +[Open Variant]: https://github.com/datafusion-contrib/datafusion-functions-variant +[JSON]: https://github.com/datafusion-contrib/datafusion-functions-json +[ORC]: https://github.com/datafusion-contrib/datafusion-orc + +<!-- +$ git log --pretty=oneline 34.0.0..40.0.0 . | wc -l + 1453 (up from 1009) + +$ git shortlog -sn 34.0.0..40.0.0 . 
| wc -l + 182 (up from 124) + + +https://crates.io/crates/datafusion/34.0.0 +DataFusion 34 released Dec 17, 2023 + +https://crates.io/crates/datafusion/40.0.0 +DataFusion 34 released July 12, 2024 + +Issues created in this time: 321 open, 781 closed (up from 214 open, 437 closed) +https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+created%3A2023-12-17..2024-07-12 + +Issues closed: 911 (up from 517) +https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2023-12-17..2024-07-12 + +PRs merged in this time 1490 (up from 908) +https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2023-12-17..2024-07-12 + +--> + + +In addition, DataFusion has been appearing in more and more writing, both online and offline. Here are some highlights: + +1. [Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine], was presented in [SIGMOD '24], one of the major database conferences +2. DataFusion described as part of the trend to define "the POSIX of databases" in ["What Goes Around Comes Around... And Around...] from Andy Pavlo and Mike Stonebraker +3. ["Why you should keep an eye on Apache DataFusion and its community"] +4. [Apache DataFusion offline meetup in the Bay Area] + + +[DataFusion Comet]: https://datafusion.apache.org/comet/ +[donated]: https://arrow.apache.org/blog/2024/03/06/comet-donation/ +[SIGMOD '24]: https://2024.sigmod.org/ + + +[Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine]: https://dl.acm.org/doi/10.1145/3626246.3653368 +["What Goes Around Comes Around... 
And Around...]: https://db.cs.cmu.edu/papers/2024/whatgoesaround-sigmodrec2024.pdf +["Why you should keep an eye on Apache DataFusion and its community"]: https://www.cpard.xyz/posts/datafusion/ +[Apache DataFusion offline meetup in the Bay Area]: https://www.tisonkun.org/2024/07/15/datafusion-meetup-san-francisco/ + + +# Improved Performance π + +Performance is a key feature of DataFusion, and the community continues to work +to keep DataFusion state of the art in this area. One major area DataFusion +improved is the time it takes to convert a SQL query into a plan that can be +executed. Planning is now almost 2x faster for TPC-DS and TPC-H queries, and +over 10x faster for some queries with many columns. + +Here is a chart showing the improvement due to the concerted effort of +many contributors (TODO list contributors by name) over several months (see +[ticket] for more details) + +<img src="{{ site.baseurl }}/assets/datafusion-40.0.0/improved-planning-time.png" width="700"> + +[ticket]: https://github.com/apache/datafusion/issues/9637 + +Also, we implemented [specialization for single Uft8/LargeUtf8/Binary/LargeBinary] +group by columns which resulted in a 40% performance improvement for some +benchmarks. + +[specialization for single Uft8/LargeUtf8/Binary/LargeBinary]: https://github.com/apache/datafusion/pull/8827 + +We are also in the final phases of our initial integration of the new [Arrow +StringView] which will provide a significant performance improvement +for many workloads. This feature should be available in future versions of DataFusion. +Kudos to [@XiangpengHong], [@PsiACE], [@Weigju>XXXX] and +[@AriesDevil], and [@alamb] for driving this along. + +[Arrow StringView]: https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html + + +# Improved Quality π + +DataFusion continues to improve overall in quality. 
One of the most exciting +improvements is the addition of a new [SQLancer] based [DataFusion Fuzzing] +suite thanks to [@2010YOUY01] that has already found several bugs (kudos to [@jonahgao], +YYY, and ZZZ for fixing them so fast). + +[SQLancer]: https://github.com/apache/datafusion/issues/11030 +[DataFusion Fuzzing]: https://github.com/datafusion-contrib/datafusion-sqllancer +[@2010YOUY01]: https://github.com/2010YOUY01 + + +## Improved Documentation π + +We continue to improve the documentation to make it easier to get started using DataFusion with +the [Library Users Guide], [API documentation], and [Examples]. + +Some notable new examples include: +* [sql_analysis.rs] to analyse SQL queries with DataFusion structures (thanks [@LorrensP-2158466]) +* [plan_to_sql.rs] to generate SQL from DataFusion Expr and LogicalPlan (thanks [@edmondop]) + +[Library Users Guide]: https://datafusion.apache.org/library-user-guide/index.html +[API documentation]: https://docs.rs/datafusion/latest/datafusion/index.html +[Examples]: https://github.com/apache/datafusion/tree/main/datafusion-examples +[sql_analysis.rs]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/sql_analysis.rs +[plan_to_sql.rs]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/plan_to_sql.rs +[@LorrensP-2158466]: https://github.com/LorrensP-2158466 +[@edmondop]: https://github.com/edmondop + +# New Features β¨ + +There are too many new features in the last 6 months to list them all, but here +are some highlights: + +## SQL +* Support for unnest (TODO LINK) +* Support Recursive CTEs https://github.com/apache/datafusion/pull/9619 / https://github.com/apache/datafusion/issues/462 +* Support for `CREATE FUNCTION` (see below) +* New functions: TODO find list +* Improved support for structured types such a `STRUCT`, `LIST`/`ARRAY` and `MAP` + +```rust +> select {'foo': {'bar': 2}}; ++--------------------------------------------------------------+ +| 
named_struct(Utf8("foo"),named_struct(Utf8("bar"),Int64(2))) | ++--------------------------------------------------------------+ +| {foo: {bar: 2}} | ++--------------------------------------------------------------+ +1 row(s) fetched. +Elapsed 0.002 seconds. +``` + +## SQL Unparser Review Comment: Expression to SQL formatter? Renderer? ########## _posts/2024-07-09-datafusion-40.0.0.md: ########## @@ -0,0 +1,450 @@ +--- +layout: post +title: "Apache Arrow DataFusion 40.0.0 Released" +date: "2024-07-09 00:00:00" +author: alamb +categories: [release] +--- + +<!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %} +--> + +<!-- see https://github.com/apache/datafusion/issues/9602 for details --> + +## Introduction + +We recently [released DataFusion 40.0.0]. This blog highlights some of the many +major improvements since we [released DataFusion 34.0.0] +and a preview of where the community is thinking about improving in the next 6 months. 
+ +[released DataFusion 34.0.0]: https://datafusion.apache.org/blog/2024/01/19/datafusion-34.0.0/ +[released DataFusion 40.0.0]: https://crates.io/crates/datafusion/40.0.0 + +<!-- todo update this intro --> +[Apache Arrow DataFusion] is an extensible query engine, written in [Rust], that +uses [Apache Arrow] as its in-memory format. DataFusion is used by developers to +create new, fast data centric systems such as databases, dataframe libraries, +machine learning and streaming applications. While [DataFusionβs primary design +goal] is to accelerate creating other data centric systems, it has a +reasonable experience directly out of the box as a [dataframe library] and +[command line SQL tool]. + +[DataFusionβs primary design goal]: https://arrow.apache.org/datafusion/user-guide/introduction.html#project-goals +[dataframe library]: https://arrow.apache.org/datafusion-python/ +[command line SQL tool]: https://arrow.apache.org/datafusion/user-guide/cli.html + + +[apache arrow datafusion]: https://datafusion.apache.org/ +[apache arrow]: https://arrow.apache.org +[rust]: https://www.rust-lang.org/ + +DataFusion's core thesis is that as a community together, we can build much more +advanced technology than any of us as individuals or companies could do alone. +Without DataFusion, highly performant vectorized query engines would remain +the domain of a few large companies and world-class research institutions. +With DataFusion, we can all build on top of a shared foundation, and focus on +what makes our projects unique. + + + + +# Community Growth π + +In the last 6 months, between `34.0.0` and `40.0.0`, our community continues to +grow in new ane exciting ways. + +1. DataFusion became a top level Apache Software Foundation project (read the + [press release] and [blog post]). +2. We added several PMC members and new + committers [@comphead], [@mustafasrepo], [@ozankabak] joined the PMC, + [@jonahgao] and [@lewiszlw] joined as a committer. 
See the [mailing list] for + more details. +3. [DataFusion Comet] was [donated] and is nearing its first release. +4. In the [core DataFusion repo] alone we reviewed and accepted almost 1500 PRs from 182 different + committers, created over 1000 issues and closed 781 of them π. This is up from + 1000 PRs from 124 committers with 650 issues created in our last post π€―. You + can find a list of all changes in the detailed [CHANGELOG]. +5. DataFusion meetups in multiple cities around the world: [Austin], [San Francisco], + [Hangzhou], [New York], and [Belgrade]. +6. Many new projects in the [datafusion-contrib] organization, including + [Table Providers], [SQL Lancer], [Open Variant], [JSON], and [ORC]. + +[core DataFusion repo]: https://github.com/apache/arrow-datafusion +[CHANGELOG]: https://github.com/apache/datafusion/blob/main/datafusion/CHANGELOG.md +[press release]: https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion +[blog post]: https://datafusion.apache.org/blog/2024/05/07/datafusion-tlp/ +[@comphead]: https://github.com/comphead +[@mustafasrepo]: https://github.com/mustafasrepo +[@ozankabak]: https://github.com/ozankabak +[@jonahgao]: https://github.com/jonahgao +[@lewiszlw]: https://github.com/lewiszlw +[mailing list]: https://lists.apache.org/list.html?d...@datafusion.apache.org +[Austin]: https://github.com/apache/datafusion/discussions/8522 +[San Francisco]: https://github.com/apache/datafusion/discussions/10800 +[Hangzhou]: https://www.huodongxing.com/event/5761971909400?td=1965290734055 +[New York]: https://github.com/apache/datafusion/discussions/11213 +[Belgrade]: https://github.com/apache/datafusion/discussions/11431 + +[datafusion-contrib]: https://github.com/datafusion-contrib +[Table Providers]: https://github.com/datafusion-contrib/datafusion-table-providers +[SQL Lancer]: https://github.com/datafusion-contrib/datafusion-sqllancer Review Comment: same for the anchor within this 
blog. it's "SQL Lancer" but the text later uses "SQLancer" spelling. ########## _posts/2024-07-09-datafusion-40.0.0.md: ########## @@ -0,0 +1,450 @@ +--- +layout: post +title: "Apache Arrow DataFusion 40.0.0 Released" +date: "2024-07-09 00:00:00" +author: alamb +categories: [release] +--- + +<!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %} +--> + +<!-- see https://github.com/apache/datafusion/issues/9602 for details --> + +## Introduction + +We recently [released DataFusion 40.0.0]. This blog highlights some of the many +major improvements since we [released DataFusion 34.0.0] +and a preview of where the community is thinking about improving in the next 6 months. + +[released DataFusion 34.0.0]: https://datafusion.apache.org/blog/2024/01/19/datafusion-34.0.0/ +[released DataFusion 40.0.0]: https://crates.io/crates/datafusion/40.0.0 + +<!-- todo update this intro --> +[Apache Arrow DataFusion] is an extensible query engine, written in [Rust], that +uses [Apache Arrow] as its in-memory format. DataFusion is used by developers to +create new, fast data centric systems such as databases, dataframe libraries, +machine learning and streaming applications. 
While [DataFusion’s primary design +goal] is to accelerate creating other data centric systems, it has a +reasonable experience directly out of the box as a [dataframe library] and +[command line SQL tool]. + +[DataFusion’s primary design goal]: https://arrow.apache.org/datafusion/user-guide/introduction.html#project-goals +[dataframe library]: https://arrow.apache.org/datafusion-python/ +[command line SQL tool]: https://arrow.apache.org/datafusion/user-guide/cli.html + + +[apache arrow datafusion]: https://datafusion.apache.org/ +[apache arrow]: https://arrow.apache.org +[rust]: https://www.rust-lang.org/ + +DataFusion's core thesis is that as a community together, we can build much more +advanced technology than any of us as individuals or companies could do alone. +Without DataFusion, highly performant vectorized query engines would remain +the domain of a few large companies and world-class research institutions. +With DataFusion, we can all build on top of a shared foundation, and focus on +what makes our projects unique. + + + + +# Community Growth + +In the last 6 months, between `34.0.0` and `40.0.0`, our community continues to +grow in new and exciting ways. + +1. DataFusion became a top level Apache Software Foundation project (read the + [press release] and [blog post]). +2. We added several PMC members and new + committers [@comphead], [@mustafasrepo], [@ozankabak] joined the PMC, + [@jonahgao] and [@lewiszlw] joined as committers. See the [mailing list] for + more details. +3. [DataFusion Comet] was [donated] and is nearing its first release. +4. In the [core DataFusion repo] alone we reviewed and accepted almost 1500 PRs from 182 different + committers, created over 1000 issues and closed 781 of them. This is up from + 1000 PRs from 124 committers with 650 issues created in our last post 🤯. You + can find a list of all changes in the detailed [CHANGELOG]. +5. 
DataFusion meetups in multiple cities around the world: [Austin], [San Francisco], + [Hangzhou], [New York], and [Belgrade]. +6. Many new projects in the [datafusion-contrib] organization, including + [Table Providers], [SQL Lancer], [Open Variant], [JSON], and [ORC]. + +[core DataFusion repo]: https://github.com/apache/arrow-datafusion +[CHANGELOG]: https://github.com/apache/datafusion/blob/main/datafusion/CHANGELOG.md +[press release]: https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion +[blog post]: https://datafusion.apache.org/blog/2024/05/07/datafusion-tlp/ +[@comphead]: https://github.com/comphead +[@mustafasrepo]: https://github.com/mustafasrepo +[@ozankabak]: https://github.com/ozankabak +[@jonahgao]: https://github.com/jonahgao +[@lewiszlw]: https://github.com/lewiszlw +[mailing list]: https://lists.apache.org/list.html?d...@datafusion.apache.org +[Austin]: https://github.com/apache/datafusion/discussions/8522 +[San Francisco]: https://github.com/apache/datafusion/discussions/10800 +[Hangzhou]: https://www.huodongxing.com/event/5761971909400?td=1965290734055 +[New York]: https://github.com/apache/datafusion/discussions/11213 +[Belgrade]: https://github.com/apache/datafusion/discussions/11431 + +[datafusion-contrib]: https://github.com/datafusion-contrib +[Table Providers]: https://github.com/datafusion-contrib/datafusion-table-providers +[SQL Lancer]: https://github.com/datafusion-contrib/datafusion-sqllancer +[Open Variant]: https://github.com/datafusion-contrib/datafusion-functions-variant +[JSON]: https://github.com/datafusion-contrib/datafusion-functions-json +[ORC]: https://github.com/datafusion-contrib/datafusion-orc + +<!-- +$ git log --pretty=oneline 34.0.0..40.0.0 . | wc -l + 1453 (up from 1009) + +$ git shortlog -sn 34.0.0..40.0.0 . 
| wc -l + 182 (up from 124) + + +https://crates.io/crates/datafusion/34.0.0 +DataFusion 34 released Dec 17, 2023 + +https://crates.io/crates/datafusion/40.0.0 +DataFusion 34 released July 12, 2024 + +Issues created in this time: 321 open, 781 closed (up from 214 open, 437 closed) +https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+created%3A2023-12-17..2024-07-12 + +Issues closed: 911 (up from 517) +https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2023-12-17..2024-07-12 + +PRs merged in this time 1490 (up from 908) +https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2023-12-17..2024-07-12 + +--> + + +In addition, DataFusion has been appearing in more and more writing, both online and offline. Here are some highlights: + +1. [Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine], was presented in [SIGMOD '24], one of the major database conferences +2. DataFusion described as part of the trend to define "the POSIX of databases" in ["What Goes Around Comes Around... And Around...] from Andy Pavlo and Mike Stonebraker +3. ["Why you should keep an eye on Apache DataFusion and its community"] +4. [Apache DataFusion offline meetup in the Bay Area] + + +[DataFusion Comet]: https://datafusion.apache.org/comet/ +[donated]: https://arrow.apache.org/blog/2024/03/06/comet-donation/ +[SIGMOD '24]: https://2024.sigmod.org/ + + +[Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine]: https://dl.acm.org/doi/10.1145/3626246.3653368 +["What Goes Around Comes Around... 
And Around...]: https://db.cs.cmu.edu/papers/2024/whatgoesaround-sigmodrec2024.pdf +["Why you should keep an eye on Apache DataFusion and its community"]: https://www.cpard.xyz/posts/datafusion/ +[Apache DataFusion offline meetup in the Bay Area]: https://www.tisonkun.org/2024/07/15/datafusion-meetup-san-francisco/ + + +# Improved Performance π + +Performance is a key feature of DataFusion, and the community continues to work +to keep DataFusion state of the art in this area. One major area DataFusion +improved is the time it takes to convert a SQL query into a plan that can be +executed. Planning is now almost 2x faster for TPC-DS and TPC-H queries, and +over 10x faster for some queries with many columns. + +Here is a chart showing the improvement due to the concerted effort of +many contributors (TODO list contributors by name) over several months (see +[ticket] for more details) + +<img src="{{ site.baseurl }}/assets/datafusion-40.0.0/improved-planning-time.png" width="700"> + +[ticket]: https://github.com/apache/datafusion/issues/9637 + +Also, we implemented [specialization for single Uft8/LargeUtf8/Binary/LargeBinary] +group by columns which resulted in a 40% performance improvement for some +benchmarks. + +[specialization for single Uft8/LargeUtf8/Binary/LargeBinary]: https://github.com/apache/datafusion/pull/8827 + +We are also in the final phases of our initial integration of the new [Arrow +StringView] which will provide a significant performance improvement +for many workloads. This feature should be available in future versions of DataFusion. +Kudos to [@XiangpengHong], [@PsiACE], [@Weigju>XXXX] and +[@AriesDevil], and [@alamb] for driving this along. + +[Arrow StringView]: https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html + + +# Improved Quality π + +DataFusion continues to improve overall in quality. 
One of the most exciting +improvements is the addition of a new [SQLancer] based [DataFusion Fuzzing] +suite thanks to [@2010YOUY01] that has already found several bugs (kudos to [@jonahgao], +YYY, and ZZZ for fixing them so fast). + +[SQLancer]: https://github.com/apache/datafusion/issues/11030 +[DataFusion Fuzzing]: https://github.com/datafusion-contrib/datafusion-sqllancer +[@2010YOUY01]: https://github.com/2010YOUY01 + + +## Improved Documentation + +We continue to improve the documentation to make it easier to get started using DataFusion with +the [Library Users Guide], [API documentation], and [Examples]. + +Some notable new examples include: +* [sql_analysis.rs] to analyse SQL queries with DataFusion structures (thanks [@LorrensP-2158466]) +* [plan_to_sql.rs] to generate SQL from DataFusion Expr and LogicalPlan (thanks [@edmondop]) + +[Library Users Guide]: https://datafusion.apache.org/library-user-guide/index.html +[API documentation]: https://docs.rs/datafusion/latest/datafusion/index.html +[Examples]: https://github.com/apache/datafusion/tree/main/datafusion-examples +[sql_analysis.rs]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/sql_analysis.rs +[plan_to_sql.rs]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/plan_to_sql.rs +[@LorrensP-2158466]: https://github.com/LorrensP-2158466 +[@edmondop]: https://github.com/edmondop + +# New Features ✨ + +There are too many new features in the last 6 months to list them all, but here +are some highlights: + +## SQL +* Support for unnest (TODO LINK) +* Support Recursive CTEs https://github.com/apache/datafusion/pull/9619 / https://github.com/apache/datafusion/issues/462 +* Support for `CREATE FUNCTION` (see below) +* New functions: TODO find list +* Improved support for structured types such as `STRUCT`, `LIST`/`ARRAY` and `MAP` + +```rust +> select {'foo': {'bar': 2}}; ++--------------------------------------------------------------+ +| 
named_struct(Utf8("foo"),named_struct(Utf8("bar"),Int64(2))) | ++--------------------------------------------------------------+ +| {foo: {bar: 2}} | ++--------------------------------------------------------------+ +1 row(s) fetched. +Elapsed 0.002 seconds. +``` + +## SQL Unparser + +DataFusion now supports converting `Expr`s and `LogicalPlan`s BACK to SQL text. +This can be useful in query federation to push predicates down into other +systems that only accept SQL, and for building systems that generate SQL. + +For example, you can now convert a logical expression back to SQL text: + +```rust +// Form a logical expression that represents the SQL "a < 5 OR a = 8" +let expr = col("a").lt(lit(5)).or(col("a").eq(lit(8))); +// convert the expression back to SQL text +let sql = expr_to_sql(&expr)?.to_string(); +assert_eq!(sql, "a < 5 OR a = 8"); +``` + +You can also do even more complex things like parsing SQL, +modifying the plan, and then converting it back to SQL: + +```rust +let df = ctx + // Use SQL to read some data from the parquet file + .sql("SELECT int_col, double_col, CAST(date_string_col as VARCHAR) FROM alltypes_plain") + .await?; +// Programmatically add new filters `id > 1 and tinyint_col < double_col` +let df = df.filter(col("id").gt(lit(1)).and(col("tinyint_col").lt(col("double_col"))))?; +// Convert the new logical plan back to SQL +let sql = plan_to_sql(df.logical_plan())?.to_string(); +assert_eq!(sql, + "SELECT alltypes_plain.int_col, alltypes_plain.double_col, CAST(alltypes_plain.date_string_col AS VARCHAR) \ + FROM alltypes_plain WHERE ((alltypes_plain.id > 1) AND (alltypes_plain.tinyint_col < alltypes_plain.double_col))" +); +``` + +See the [Plan to SQL example] or the APIs [expr_to_sql] and [plan_to_sql] for more details. 
+
+[expr_to_sql]: https://docs.rs/datafusion/latest/datafusion/sql/unparser/fn.expr_to_sql.html
+[plan_to_sql]: https://docs.rs/datafusion/latest/datafusion/sql/unparser/fn.plan_to_sql.html
+[Plan to SQL example]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/plan_to_sql.rs
+
+
+BTW it would be great if someone made a demo showing how to do this (see https://github.com/apache/datafusion/issues/9326 )
+
+## Low Level APIs for Fast Parquet Access (indexing)
+
+With the rising prevalence of Parquet files stored on object stores, it is
+important for query engines to support efficient access to Parquet files. Part
+of doing this efficiently is to minimize the number of requests made to an
+object store by caching metadata and skipping over parts of the file that are
+not needed (e.g. via an index).
+
+DataFusion's Parquet reader has long internally supported advanced predicate
+pushdown by reading the Parquet metadata from the file footer and pruning based
+on row group level statistics as well as data page level statistics. DataFusion
+now also supports users supplying their own low level pruning information via
+the [`ParquetAccessPlan`] API.
+
+This API can be used along with index information to selectively skip decoding
+parts of the file. This feature has been used to add [efficient support] for
+reading from DeltaLake tables and handling [deletion vectors].
+
+```text
+                ┌───────────────────────┐   If the RowSelection does not include any
+                │          ...          │   rows from a particular Data Page, that
+                │                       │   Data Page is not fetched or decoded.
+                │ ┌───────────────────┐ │   Note this requires a PageIndex
+                │ │     ┌──────────┐  │ │
+Row             │ │     │DataPage 0│  │ │                 ┌────────────────────┐
+Groups          │ │     └──────────┘  │ │                 │                    │
+                │ │     ┌──────────┐  │ │                 │    ParquetExec     │
+                │ │ ... │DataPage 1│ ◀┼ ┼ ─ ─ ─ ─ ─ ─ ─ ─ ┤  (Parquet Reader)  │
+                │ │     └──────────┘  │ │                 │                    │
+                │ │     ┌──────────┐  │ │                 │ ┌───────────────┐  │
+                │ │     │DataPage 2│  │ │ If only rows    │ │ParquetMetadata│  │
+                │ │     └──────────┘  │ │ from DataPage 1 │ └───────────────┘  │
+                │ └───────────────────┘ │ are selected,   └────────────────────┘
+                │                       │ only DataPage 1
+                │          ...          │ is fetched and
+                │                       │ decoded
+                │ ┌───────────────────┐ │
+                │ │  Thrift metadata  │ │
+                │ └───────────────────┘ │
+                └───────────────────────┘
+                       Parquet File
+```
+
+See the [parquet_index.rs] and [advanced_parquet_index.rs] examples for more details.
+
+Thanks to [@alamb] and XXX for this feature.
+
+[`ParquetAccessPlan`]: https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetAccessPlan.html
+[efficient support]: https://github.com/spiceai/spiceai/pull/1891
+[deletion vectors]: https://docs.delta.io/latest/delta-deletion-vectors.html
+[parquet_index.rs]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs
+[advanced_parquet_index.rs]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs
+
+# Building Systems is Easier with DataFusion 🛠️
+
+* Faster and easier to use [TreeNode API] for traversing and manipulating plans and expressions.
+* All functions now use the same [Scalar User Defined Function API], making it easier to customize
+  DataFusion's behavior without sacrificing performance. See [ticket] for more details.
+* DataFusion can now be compiled to [WASM].
+
+
+[TreeNode API]: https://docs.rs/datafusion/latest/datafusion/common/tree_node/trait.TreeNode.html#overview
+[Scalar User Defined Function API]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html
+[ticket]: https://github.com/apache/arrow-datafusion/issues/8045
+[WASM]: https://github.com/apache/datafusion/discussions/9834
+
+
+## User Defined SQL Parsing Extensions
+
+As of DataFusion 40.0.0, the [`ExprPlanner`] API lets you provide a custom
+implementation for planning expressions in SQL. This allows you to more
+easily extend DataFusion's SQL planner to support custom operators or syntax.
+
+For example, [datafusion-functions-json] uses this API to add support for JSON
+operators in SQL queries. It provides a custom implementation for planning
+JSON operators such as `->` and `->>` with the following code:
+
+```rust
+struct MyCustomPlanner;
+
+impl ExprPlanner for MyCustomPlanner {
+    // Provide a custom implementation for planning binary operators
+    // such as `->` and `->>`
+    fn plan_binary_op(
+        &self,
+        expr: RawBinaryExpr,
+        _schema: &DFSchema,
+    ) -> Result<PlannerResult<RawBinaryExpr>> {
+        match &expr.op {
+           BinaryOperator::Arrow => { /* plan -> operator */ }
+           BinaryOperator::LongArrow => { /* plan ->> operator */ }
+           ...
+        }
+    }
+}
+```
+
+Thanks to samuel colvin, jayzhan and others (TOD LINKS)

Review Comment:
   TOD -> TODO so it's easier to find



##########
_posts/2024-07-09-datafusion-40.0.0.md:
##########
@@ -0,0 +1,450 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 40.0.0 Released"
+date: "2024-07-09 00:00:00"
+author: alamb
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
[...]
+
+[expr_to_sql]: https://docs.rs/datafusion/latest/datafusion/sql/unparser/fn.expr_to_sql.html
+[plan_to_sql]: https://docs.rs/datafusion/latest/datafusion/sql/unparser/fn.plan_to_sql.html
+[Plan to SQL example]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/plan_to_sql.rs
+
+
+BTW it would be great if someone made a demo showing how to do this (see https://github.com/apache/datafusion/issues/9326 )
+
+## Low Level APIs for Fast Parquet Access (indexing)
+
+With the rising prevalence of Parquet files stored on object stores, it is
+important for query engines to support efficient access to Parquet files. Part
+of doing this efficiently is to minimize the number of requests made to an
+object store by caching metadata and skipping over parts of the file that are

Review Comment:
   Do we also join requests for adjacent or nearby pages (e.g. sibling columns a,b or siblings a,c where b is small)?



##########
_posts/2024-07-09-datafusion-40.0.0.md:
##########
@@ -0,0 +1,450 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 40.0.0 Released"
+date: "2024-07-09 00:00:00"
+author: alamb
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
[...]
+# Building Systems is Easier with DataFusion 🛠️
+
+* Faster and easier to use [TreeNode API] for traversing and manipulating plans and expressions.
+* All functions now use the same [Scalar User Defined Function API], making it easier to customize
+  DataFusion's behavior without sacrificing performance. See [ticket] for more details.
+* DataFusion can now be compiled to [WASM].

Review Comment:
   ❤️



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.