Throne3d commented on code in PR #6:
URL: https://github.com/apache/datafusion-site/pull/6#discussion_r1687365607


##########
_posts/2024-07-23-datafusion-40.0.0.md:
##########
@@ -0,0 +1,492 @@
+---
+layout: post
+title: "Apache DataFusion 40.0.0 Released"
+date: "2024-07-21 00:00:00"
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<!-- see https://github.com/apache/datafusion/issues/9602 for details -->
+
+## Introduction
+
+We are proud to announce [DataFusion 40.0.0]. This blog highlights some of the
+many major improvements since we released [DataFusion 34.0.0] and a preview of
+what the community is thinking about in the next 6 months. We are hoping to make
+more regular blog posts -- if you are interested in helping write them, please
+reach out!
+
+[DataFusion 34.0.0]: https://datafusion.apache.org/blog/2024/01/19/datafusion-34.0.0/
+[DataFusion 40.0.0]: https://crates.io/crates/datafusion/40.0.0
+
+[Apache DataFusion] is an extensible query engine, written in [Rust], that
+uses [Apache Arrow] as its in-memory format. DataFusion is used by developers to
+create new, fast data centric systems such as databases, dataframe libraries,
+machine learning and streaming applications. While [DataFusion’s primary design
+goal] is to accelerate the creation of other data centric systems, it has a
+reasonable experience directly out of the box as a [dataframe library] and
+[command line SQL tool].
+
+[DataFusion’s primary design goal]: https://datafusion.apache.org/user-guide/introduction.html#project-goals
+[dataframe library]: https://datafusion.apache.org/python/
+[command line SQL tool]: https://datafusion.apache.org/user-guide/cli/
+
+[apache datafusion]: https://datafusion.apache.org/
+[apache arrow]: https://arrow.apache.org
+[rust]: https://www.rust-lang.org/
+
+DataFusion's core thesis is that as a community together, we can build much more
+advanced technology than any of us as individuals or companies could do alone. 
+Without DataFusion, highly performant vectorized query engines would remain
+the domain of a few large companies and world-class research institutions. 
+With DataFusion, we can all build on top of a shared foundation, and focus on
+what makes our projects unique.
+
+
+# Community Growth  📈 
+
+In the last 6 months, between `34.0.0` and `40.0.0`, our community has continued to
+grow in new and exciting ways.
+
+1. DataFusion became a top level Apache Software Foundation project (read the
+   [press release] and [blog post]).
+2. We added several PMC members and new committers: [@comphead],
+   [@mustafasrepo], and [@ozankabak] joined the PMC, and [@jonahgao] and
+   [@lewiszlw] joined as committers. See the [mailing list] for
+   more details.
+3. [DataFusion Comet] was [donated] and is nearing its first release.
+4. In the [core DataFusion repo] alone we reviewed and accepted almost 1500 PRs from 182 different
+   committers, created over 1000 issues and closed 781 of them 🚀. This is up
+   almost 50% from our last post (1000 PRs from 124 committers with 650 issues
+   created in our last post) 🤯. All changes are listed in the detailed
+   [CHANGELOG].
+5. DataFusion focused meetups happened or are happening in multiple cities 
+   around the world: [Austin], [San Francisco], [Hangzhou], [New York], and
+   [Belgrade].
+6. Many new projects started in the [datafusion-contrib] organization, including
+   [Table Providers], [SQLancer], [Open Variant], [JSON], and [ORC].  
+
+[core DataFusion repo]: https://github.com/apache/arrow-datafusion
+[CHANGELOG]: https://github.com/apache/datafusion/blob/main/datafusion/CHANGELOG.md
+[press release]: https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion
+[blog post]: https://datafusion.apache.org/blog/2024/05/07/datafusion-tlp/
+[mailing list]: https://lists.apache.org/list.html?d...@datafusion.apache.org
+[Austin]: https://github.com/apache/datafusion/discussions/8522
+[San Francisco]: https://github.com/apache/datafusion/discussions/10800
+[Hangzhou]: https://www.huodongxing.com/event/5761971909400?td=1965290734055
+[New York]: https://github.com/apache/datafusion/discussions/11213
+[Belgrade]: https://github.com/apache/datafusion/discussions/11431
+[datafusion-contrib]: https://github.com/datafusion-contrib
+[Table Providers]: https://github.com/datafusion-contrib/datafusion-table-providers
+[SQLancer]: https://github.com/datafusion-contrib/datafusion-sqlancer
+[Open Variant]: https://github.com/datafusion-contrib/datafusion-functions-variant
+[JSON]: https://github.com/datafusion-contrib/datafusion-functions-json
+[ORC]: https://github.com/datafusion-contrib/datafusion-orc
+
+<!--
+$ git log --pretty=oneline 34.0.0..40.0.0 . | wc -l
+     1453 (up from 1009)
+
+$ git shortlog -sn 34.0.0..40.0.0 . | wc -l
+      182 (up from 124)
+
+https://crates.io/crates/datafusion/34.0.0
+DataFusion 34 released Dec 17, 2023
+
+https://crates.io/crates/datafusion/40.0.0
+DataFusion 40 released July 12, 2024
+
+Issues created in this time: 321 open, 781 closed (up from 214 open, 437 closed)
+https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+created%3A2023-12-17..2024-07-12
+
+Issues closed: 911 (up from 517)
+https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2023-12-17..2024-07-12
+
+PRs merged in this time 1490 (up from 908)
+https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2023-12-17..2024-07-12
+
+-->
+
+
+In addition, DataFusion has been appearing publicly more and more, both online and offline. Here are some highlights:
+
+1. [Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine], was presented in [SIGMOD '24], one of the major database conferences
+2. As part of the trend to define "the POSIX of databases" in ["What Goes Around Comes Around... And Around...] from Andy Pavlo and Mike Stonebraker
+3. ["Why you should keep an eye on Apache DataFusion and its community"]
+4. [Apache DataFusion offline meetup in the Bay Area]
+
+
+[DataFusion Comet]: https://datafusion.apache.org/comet/
+[donated]: https://arrow.apache.org/blog/2024/03/06/comet-donation/
+[SIGMOD '24]: https://2024.sigmod.org/
+
+
+[Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine]: https://dl.acm.org/doi/10.1145/3626246.3653368
+["What Goes Around Comes Around... And Around...]: https://db.cs.cmu.edu/papers/2024/whatgoesaround-sigmodrec2024.pdf
+["Why you should keep an eye on Apache DataFusion and its community"]: https://www.cpard.xyz/posts/datafusion/
+[Apache DataFusion offline meetup in the Bay Area]: https://www.tisonkun.org/2024/07/15/datafusion-meetup-san-francisco/
+
+
+# Improved Performance 🚀 
+
+Performance is a key feature of DataFusion, and the community continues to work
+to keep DataFusion state of the art in this area. One major area DataFusion
+improved is the time it takes to convert a SQL query into a plan that can be
+executed. Planning is now almost 2x faster for TPC-DS and TPC-H queries, and
+over 10x faster for some queries with many columns.
+
+Here is a chart showing the improvement due to the concerted effort of many
+contributors including [@jackwener], [@alamb], [@Lordworms], [@dmitrybugakov],
+[@appletreeisyellow], [@ClSlaid], [@rohitrastogi], [@emgeee], [@kevinmingtarja],
+and [@peter-toth] over several months (see [ticket] for more details)
+
+<img src="{{ site.baseurl }}/assets/datafusion-40.0.0/improved-planning-time.png" width="700">
+
+[ticket]: https://github.com/apache/datafusion/issues/9637
+
+DataFusion is now up to 40% faster for queries that `GROUP BY` a single string
+or binary column due to a [specialization for single
+Utf8/LargeUtf8/Binary/LargeBinary]. We are working on improving performance when
+there are [multiple variable length columns in the `GROUP BY` clause].
+
+[specialization for single Utf8/LargeUtf8/Binary/LargeBinary]: https://github.com/apache/datafusion/pull/8827
+
+We are also in the final phases of [integrating] the new [Arrow StringView]
+which significantly improves performance for workloads that scan, filter and
+group by variable length string and binary data. We expect the improvement to be
+especially pronounced for Parquet files due to [upstream work in the parquet
+reader]. Kudos to [@XiangpengHong], [@AriesDevil], [@PsiACE], [@Weijun-H],
+[@a10y], and [@RinChanNOWWW] for driving this project.
+
+[integrating]: https://github.com/apache/datafusion/issues/10918
+[Arrow StringView]: https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html
+[multiple variable length columns in the `GROUP BY` clause]: https://github.com/apache/datafusion/issues/9403
+[upstream work in the parquet reader]: https://github.com/apache/arrow-rs/issues/5530
+
+# Improved Quality 📋
+
+DataFusion continues to improve in overall quality. In addition to ongoing bug
+fixes, one of the most exciting improvements is the addition of a new [SQLancer]
+based [DataFusion Fuzzing] suite, thanks to [@2010YOUY01], that has already found
+several bugs. Thanks also to [@jonahgao], [@tshauck], [@xinlifoobar], and
+[@LorrensP-2158466] for fixing them so fast.
+
+[DataFusion Fuzzing]: https://github.com/apache/datafusion/issues/11030
+
+
+## Improved Documentation 📚
+
+We continue to improve the documentation to make it easier to get started using DataFusion with
+the [Library Users Guide], [API documentation], and [Examples].
+
+Some notable new examples include:
+* [sql_analysis.rs] to analyse SQL queries with DataFusion structures (thanks [@LorrensP-2158466])
+* [function_factory.rs] to create custom functions via SQL (thanks [@milenkovicm])
+* [plan_to_sql.rs] to generate SQL from DataFusion Expr and LogicalPlan (thanks [@edmondop])
+* [parquet_index.rs] and [advanced_parquet_index.rs] for parquet indexing, described more below (thanks [@alamb])
+
+[Library Users Guide]: https://datafusion.apache.org/library-user-guide/index.html
+[API documentation]: https://docs.rs/datafusion/latest/datafusion/index.html
+[Examples]: https://github.com/apache/datafusion/tree/main/datafusion-examples
+[sql_analysis.rs]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/sql_analysis.rs
+[plan_to_sql.rs]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/plan_to_sql.rs
+
+# New Features ✨
+
+There are too many new features in the last 6 months to list them all, but here
+are some highlights:
+
+## SQL 
+* Support for`UNNEST` (thanks [@duongcongtoai], [@JasonLi-cn] and [@jayzhan211]) 

Review Comment:
   ```suggestion
   * Support for `UNNEST` (thanks [@duongcongtoai], [@JasonLi-cn] and [@jayzhan211]) 
   ```



##########
_posts/2024-07-23-datafusion-40.0.0.md:
##########
@@ -0,0 +1,492 @@
+["What Goes Around Comes Around... And Around...]: https://db.cs.cmu.edu/papers/2024/whatgoesaround-sigmodrec2024.pdf

Review Comment:
   ```suggestion
   ["What Goes Around Comes Around... And Around..."]: https://db.cs.cmu.edu/papers/2024/whatgoesaround-sigmodrec2024.pdf
   ```



##########
_posts/2024-07-23-datafusion-40.0.0.md:
##########
@@ -0,0 +1,492 @@
+# New Features ✨
+
+There are too many new features in the last 6 months to list them all, but here
+are some highlights:
+
+## SQL 
+* Support for`UNNEST` (thanks [@duongcongtoai], [@JasonLi-cn] and [@jayzhan211]) 
+* Support for [Recursive CTEs] (thanks [@jonahgao] and [@matthewgapp]) 
+* Support for `CREATE FUNCTION` (see below) 
+* Many new SQL functions
+
+[Recursive CTEs]: https://github.com/apache/datafusion/issues/462
+
+DataFusion now has much improved support for structured types such as `STRUCT`,
+`LIST`/`ARRAY` and `MAP`. For example, you can now use syntax like:
+
+```rust
+> select {'foo': {'bar': 2}};
++--------------------------------------------------------------+
+| named_struct(Utf8("foo"),named_struct(Utf8("bar"),Int64(2))) |
++--------------------------------------------------------------+
+| {foo: {bar: 2}}                                              |
++--------------------------------------------------------------+
+1 row(s) fetched.
+Elapsed 0.002 seconds.
+```
+
+## SQL Unparser (SQL Formatter)
+
+DataFusion now supports converting `Expr`s and `LogicalPlan`s BACK to SQL text.
+This can be useful in query federation to push predicates down into other
+systems that only accept SQL, and for building systems that generate SQL.
+
+For example, you can now convert a logical expression back to SQL text:
+
+```rust
+// Form a logical expression that represents the SQL "a < 5 OR a = 8"
+let expr = col("a").lt(lit(5)).or(col("a").eq(lit(8)));
+// convert the expression back to SQL text
+let sql = expr_to_sql(&expr)?.to_string();
+assert_eq!(sql, "a < 5 OR a = 8");
+```
+
+You can also do complex things like parsing SQL, modifying the plan, and converting
+it back to SQL:
+
+```rust
+let df = ctx
+  // Use SQL to read some data from the parquet file
+  .sql("SELECT int_col, double_col, CAST(date_string_col as VARCHAR) FROM alltypes_plain")
+  .await?;
+// Programmatically add new filters `id > 1 and tinyint_col < double_col`
+let df = df.filter(col("id").gt(lit(1)).and(col("tinyint_col").lt(col("double_col"))))?;
+// Convert the new logical plan back to SQL
+let sql = plan_to_sql(df.logical_plan())?.to_string();
+assert_eq!(sql,
+           "SELECT alltypes_plain.int_col, alltypes_plain.double_col, CAST(alltypes_plain.date_string_col AS VARCHAR) \
+           FROM alltypes_plain WHERE ((alltypes_plain.id > 1) AND (alltypes_plain.tinyint_col < alltypes_plain.double_col))"
+);
+```
+
+See the [Plan to SQL example] or the APIs [expr_to_sql] and [plan_to_sql] for 
more details.
+
+[expr_to_sql]: 
https://docs.rs/datafusion/latest/datafusion/sql/unparser/fn.expr_to_sql.html
+[plan_to_sql]: 
https://docs.rs/datafusion/latest/datafusion/sql/unparser/fn.plan_to_sql.html
+[Plan to SQL example]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/plan_to_sql.rs
+
+
+
+## Low Level APIs for Fast Parquet Access (indexing)
+
+With their rising prevalence, supporting efficient access to Parquet files
+stored remotely on object storage is important. Part of doing this efficiently
+is to minimize the number of object stpre requests made by caching metadata and

Review Comment:
   ```suggestion
   is to minimize the number of object store requests made by caching metadata 
and
   ```
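
As background for the sentence under review, the idea of cutting object store requests by caching metadata can be sketched in plain Rust. This is only an illustration under stated assumptions: `MetadataCache`, `FileMetadata`, `fetch_footer`, and the request counter are hypothetical names, and no DataFusion or `object_store` API is used.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for parsed Parquet footer metadata.
#[derive(Clone, Debug, PartialEq)]
struct FileMetadata {
    num_row_groups: usize,
}

/// Hypothetical cache keyed by object path. It counts how many
/// (simulated) object store requests were actually issued.
struct MetadataCache {
    entries: HashMap<String, FileMetadata>,
    requests: usize,
}

impl MetadataCache {
    fn new() -> Self {
        Self { entries: HashMap::new(), requests: 0 }
    }

    /// Simulates a remote footer read; a real system would issue a
    /// ranged read against the object store here.
    fn fetch_footer(&mut self, _path: &str) -> FileMetadata {
        self.requests += 1;
        FileMetadata { num_row_groups: 3 }
    }

    /// Returns cached metadata when present, otherwise fetches and stores it.
    fn get(&mut self, path: &str) -> FileMetadata {
        if let Some(meta) = self.entries.get(path) {
            return meta.clone();
        }
        let meta = self.fetch_footer(path);
        self.entries.insert(path.to_string(), meta.clone());
        meta
    }
}

fn main() {
    let mut cache = MetadataCache::new();
    let first = cache.get("data/part-0.parquet");
    let second = cache.get("data/part-0.parquet");
    assert_eq!(first, second);
    // Only the first lookup reached the (simulated) object store.
    assert_eq!(cache.requests, 1);
}
```

The second lookup is served from memory, so repeated planning against the same file costs one request instead of one per query.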



##########
_posts/2024-07-23-datafusion-40.0.0.md:
##########
@@ -0,0 +1,492 @@
+---
+layout: post
+title: "Apache DataFusion 40.0.0 Released"
+date: "2024-07-21 00:00:00"
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<!-- see https://github.com/apache/datafusion/issues/9602 for details -->
+
+## Introduction
+
+We are proud to announce [DataFusion 40.0.0]. This blog highlights some of the
+many major improvements since we released [DataFusion 34.0.0] and a preview of
+what the community is thinking about in the next 6 months. We are hoping to 
make
+more regular blog posts -- if you are interested in helping write them, please
+reach out!
+
+[DataFusion 34.0.0]: 
https://datafusion.apache.org/blog/2024/01/19/datafusion-34.0.0/
+[DataFusion 40.0.0]: https://crates.io/crates/datafusion/40.0.0
+
+[Apache DataFusion] is an extensible query engine, written in [Rust], that
+uses [Apache Arrow] as its in-memory format. DataFusion is used by developers 
to
+create new, fast data centric systems such as databases, dataframe libraries,
+machine learning and streaming applications. While [DataFusion’s primary design
+goal] is to accelerate the creation of other data centric systems, it has a
+reasonable experience directly out of the box as a [dataframe library] and
+[command line SQL tool].
+
+[DataFusion’s primary design goal]: 
https://datafusion.apache.org/user-guide/introduction.html#project-goals
+[dataframe library]: https://datafusion.apache.org/python/
+[command line SQL tool]: https://datafusion.apache.org/user-guide/cli/
+
+[apache datafusion]: https://datafusion.apache.org/
+[apache arrow]: https://arrow.apache.org
+[rust]: https://www.rust-lang.org/
+
+DataFusion's core thesis is that as a community together, we can build much 
more
+advanced technology than any of us as individuals or companies could do alone. 
+Without DataFusion, highly performant vectorized query engines would remain
+the domain of a few large companies and world-class research institutions. 
+With DataFusion, we can all build on top of a shared foundation, and focus on
+what makes our projects unique.
+
+
+# Community Growth 📈
+
+In the last 6 months, between `34.0.0` and `40.0.0`, our community continues to
+grow in new and exciting ways.
+
+1. DataFusion became a top level Apache Software Foundation project (read the
+   [press release] and [blog post]).
+2. We added several PMC members and new committers: [@comphead],
+   [@mustafasrepo], and [@ozankabak] joined the PMC, and [@jonahgao] and
+   [@lewiszlw] joined as committers. See the [mailing list] for more details.
+3. [DataFusion Comet] was [donated] and is nearing its first release.
+4. In the [core DataFusion repo] alone we reviewed and accepted almost 1500 PRs
+   from 182 different committers, created over 1000 issues and closed 781 of
+   them 🚀. This is up almost 50% from our last post (1000 PRs from 124
+   committers with 650 issues created) 🤯. All changes are listed in the
+   detailed [CHANGELOG].
+5. DataFusion focused meetups happened or are happening in multiple cities 
+   around the world: [Austin], [San Francisco], [Hangzhou], [New York], and
+   [Belgrade].
+6. Many new projects started in the [datafusion-contrib] organization, 
including
+   [Table Providers], [SQLancer], [Open Variant], [JSON], and [ORC].  
+
+[core DataFusion repo]: https://github.com/apache/arrow-datafusion
+[CHANGELOG]: 
https://github.com/apache/datafusion/blob/main/datafusion/CHANGELOG.md
+[press release]: 
https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion
+[blog post]: https://datafusion.apache.org/blog/2024/05/07/datafusion-tlp/
+[mailing list]: https://lists.apache.org/list.html?d...@datafusion.apache.org
+[Austin]: https://github.com/apache/datafusion/discussions/8522
+[San Francisco]: https://github.com/apache/datafusion/discussions/10800
+[Hangzhou]: https://www.huodongxing.com/event/5761971909400?td=1965290734055
+[New York]: https://github.com/apache/datafusion/discussions/11213
+[Belgrade]: https://github.com/apache/datafusion/discussions/11431
+[datafusion-contrib]: https://github.com/datafusion-contrib
+[Table Providers]: 
https://github.com/datafusion-contrib/datafusion-table-providers
+[SQLancer]: https://github.com/datafusion-contrib/datafusion-sqlancer
+[Open Variant]: 
https://github.com/datafusion-contrib/datafusion-functions-variant
+[JSON]: https://github.com/datafusion-contrib/datafusion-functions-json
+[ORC]: https://github.com/datafusion-contrib/datafusion-orc
+
+<!--
+$ git log --pretty=oneline 34.0.0..40.0.0 . | wc -l
+     1453 (up from 1009)
+
+$ git shortlog -sn 34.0.0..40.0.0 . | wc -l
+      182 (up from 124)
+
+https://crates.io/crates/datafusion/34.0.0
+DataFusion 34 released Dec 17, 2023
+
+https://crates.io/crates/datafusion/40.0.0
+DataFusion 40 released July 12, 2024
+
+Issues created in this time: 321 open, 781 closed (up from 214 open, 437 
closed)
+https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+created%3A2023-12-17..2024-07-12
+
+Issues closed: 911 (up from 517)
+https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2023-12-17..2024-07-12
+
+PRs merged in this time 1490 (up from 908)
+https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2023-12-17..2024-07-12
+
+-->
+
+
+In addition, DataFusion has been appearing publicly more and more, both online 
and offline. Here are some highlights:
+
+1. [Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query 
Engine], was presented in [SIGMOD '24], one of the major database conferences
+2. As part of the trend to define "the POSIX of databases" in ["What Goes 
Around Comes Around... And Around...] from Andy Pavlo and Mike Stonebraker
+3. ["Why you should keep an eye on Apache DataFusion and its community"]
+4. [Apache DataFusion offline meetup in the Bay Area]
+
+
+[DataFusion Comet]: https://datafusion.apache.org/comet/
+[donated]: https://arrow.apache.org/blog/2024/03/06/comet-donation/
+[SIGMOD '24]: https://2024.sigmod.org/
+
+
+[Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine]: 
https://dl.acm.org/doi/10.1145/3626246.3653368
+["What Goes Around Comes Around... And Around...]: 
https://db.cs.cmu.edu/papers/2024/whatgoesaround-sigmodrec2024.pdf
+["Why you should keep an eye on Apache DataFusion and its community"]: 
https://www.cpard.xyz/posts/datafusion/
+[Apache DataFusion offline meetup in the Bay Area]: 
https://www.tisonkun.org/2024/07/15/datafusion-meetup-san-francisco/
+
+
+# Improved Performance 🚀
+
+Performance is a key feature of DataFusion, and the community continues to work
+to keep DataFusion state of the art in this area. One major area DataFusion
+improved is the time it takes to convert a SQL query into a plan that can be
+executed. Planning is now almost 2x faster for TPC-DS and TPC-H queries, and
+over 10x faster for some queries with many columns.
+
+Here is a chart showing the improvement due to the concerted effort of many
+contributors including [@jackwener], [@alamb], [@Lordworms], [@dmitrybugakov],
+[@appletreeisyellow], [@ClSlaid], [@rohitrastogi], [@emgeee], 
[@kevinmingtarja],
+and [@peter-toth] over several months (see [ticket] for more details)
+
+<img src="{{ site.baseurl }}/assets/datafusion-40.0.0/improved-planning-time.png" width="700">
+
+[ticket]: https://github.com/apache/datafusion/issues/9637
+
+DataFusion is now up to 40% faster for queries that `GROUP BY` a single string
+or binary column due to a [specialization for single
+Utf8/LargeUtf8/Binary/LargeBinary]. We are working on improving performance when
+there are [multiple variable length columns in the `GROUP BY` clause].
+
+[specialization for single Utf8/LargeUtf8/Binary/LargeBinary]: https://github.com/apache/datafusion/pull/8827
+
+We are also in the final phases of [integrating] the new [Arrow StringView]
+which significantly improves performance for workloads that scan, filter and
+group by variable length string and binary data. We expect the improvement to 
be
+especially pronounced for Parquet files due to [upstream work in the parquet
+reader]. Kudos to [@XiangpengHong], [@AriesDevil], [@PsiACE], [@Weijun-H],
+[@a10y], and [@RinChanNOWWW] for driving this project.
+
+[integrating]: https://github.com/apache/datafusion/issues/10918
+[Arrow StringView]: 
https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html
+[multiple variable length columns in the `GROUP BY` clause]: 
https://github.com/apache/datafusion/issues/9403
+[upstream work in the parquet reader]: 
https://github.com/apache/arrow-rs/issues/5530
+
+# Improved Quality 📋
+
+DataFusion continues to improve in overall quality. In addition to ongoing bug
+fixes, one of the most exciting improvements is the new [SQLancer]-based
+[DataFusion Fuzzing] suite, contributed by [@2010YOUY01], which has already
+found several bugs. Thanks to [@jonahgao], [@tshauck], [@xinlifoobar], and
+[@LorrensP-2158466] for fixing them so fast.
+
+[DataFusion Fuzzing]: https://github.com/apache/datafusion/issues/11030
+
+
+## Improved Documentation 📚
+
+We continue to improve the documentation to make it easier to get started 
using DataFusion with
+the [Library Users Guide], [API documentation], and [Examples].
+
+Some notable new examples include:
+* [sql_analysis.rs] to analyse SQL queries with DataFusion structures (thanks 
[@LorrensP-2158466])
+* [function_factory.rs] to create custom functions via SQL (thanks 
[@milenkovicm])
+* [plan_to_sql.rs] to generate SQL from DataFusion Expr and LogicalPlan 
(thanks [@edmondop])
+* [parquet_index.rs] and [advanced_parquet_index.rs] for parquet indexing, 
described more below (thanks [@alamb])
+
+[Library Users Guide]: 
https://datafusion.apache.org/library-user-guide/index.html
+[API documentation]: https://docs.rs/datafusion/latest/datafusion/index.html
+[Examples]: https://github.com/apache/datafusion/tree/main/datafusion-examples
+[sql_analysis.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/sql_analysis.rs
+[plan_to_sql.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/plan_to_sql.rs
+
+# New Features ✨
+
+There are too many new features in the last 6 months to list them all, but here
+are some highlights:
+
+## SQL 
+* Support for `UNNEST` (thanks [@duongcongtoai], [@JasonLi-cn] and [@jayzhan211])
+* Support for [Recursive CTEs] (thanks [@jonahgao] and [@matthewgapp]) 
+* Support for `CREATE FUNCTION` (see below) 
+* Many new SQL functions
+
+[Recursive CTEs]: https://github.com/apache/datafusion/issues/462
+
+DataFusion now has much improved support for structured types such as `STRUCT`,
+`LIST`/`ARRAY` and `MAP`. For example, you can now use syntax like:
+
+```rust
+> select {'foo': {'bar': 2}};
++--------------------------------------------------------------+
+| named_struct(Utf8("foo"),named_struct(Utf8("bar"),Int64(2))) |
++--------------------------------------------------------------+
+| {foo: {bar: 2}}                                              |
++--------------------------------------------------------------+
+1 row(s) fetched.
+Elapsed 0.002 seconds.
+```
+
+## SQL Unparser (SQL Formatter)
+
+DataFusion now supports converting `Expr`s and `LogicalPlan`s BACK to SQL text.
+This can be useful in query federation to push predicates down into other
+systems that only accept SQL, and for building systems that generate SQL.
+
+For example, you can now convert a logical expression back to SQL text:
+
+```rust
+// Form a logical expression that represents the SQL "a < 5 OR a = 8"
+let expr = col("a").lt(lit(5)).or(col("a").eq(lit(8)));
+// convert the expression back to SQL text
+let sql = expr_to_sql(&expr)?.to_string();
+assert_eq!(sql, "a < 5 OR a = 8");
+```
+
+You can also do more complex things like parsing SQL, modifying the plan, and
+converting it back to SQL:
+
+```rust
+let df = ctx
+  // Use SQL to read some data from the parquet file
+  .sql("SELECT int_col, double_col, CAST(date_string_col as VARCHAR) FROM alltypes_plain")
+  .await?;
+// Programmatically add new filters `id > 1 and tinyint_col < double_col`
+let df = df.filter(col("id").gt(lit(1)).and(col("tinyint_col").lt(col("double_col"))))?;
+// Convert the new logical plan back to SQL
+let sql = plan_to_sql(df.logical_plan())?.to_string();
+assert_eq!(sql,
+           "SELECT alltypes_plain.int_col, alltypes_plain.double_col, CAST(alltypes_plain.date_string_col AS VARCHAR) \
+           FROM alltypes_plain WHERE ((alltypes_plain.id > 1) AND (alltypes_plain.tinyint_col < alltypes_plain.double_col))"
+);
+```
+
+See the [Plan to SQL example] or the APIs [expr_to_sql] and [plan_to_sql] for 
more details.
+
+[expr_to_sql]: 
https://docs.rs/datafusion/latest/datafusion/sql/unparser/fn.expr_to_sql.html
+[plan_to_sql]: 
https://docs.rs/datafusion/latest/datafusion/sql/unparser/fn.plan_to_sql.html
+[Plan to SQL example]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/plan_to_sql.rs
+
+
+
+## Low Level APIs for Fast Parquet Access (indexing)
+
+With their rising prevalence, supporting efficient access to Parquet files
+stored remotely on object storage is important. Part of doing this efficiently
+is to minimize the number of object stpre requests made by caching metadata and
+skipping over parts of the file that are not needed (e.g. via an index).
+
+DataFusion's Parquet reader has long internally supported advanced predicate
+pushdown by reading the parquet metadata from the file footer and pruning based
+on row group level statistics as well as data page level statistics. DataFusion
+now also supports users supplying their own low level pruning information via
+the [`ParquetAccessPlan`] API.
+
+This API can be used along with index information to selectively skip decoding
+parts of the file. This feature has been used by SpiceAI to add [efficient
+support] for reading from DeltaLake tables and handling [deletion vectors].
+
+```text
+        ┌───────────────────────┐   If the RowSelection does not include any
+        │          ...          │   rows from a particular Data Page, that
+        │                       │   Data Page is not fetched or decoded.
+        │ ┌───────────────────┐ │   Note this requires a PageIndex
+        │ │     ┌──────────┐  │ │
+Row     │ │     │DataPage 0│  │ │                 ┌────────────────────┐
+Groups  │ │     └──────────┘  │ │                 │                    │
+        │ │     ┌──────────┐  │ │                 │    ParquetExec     │
+        │ │ ... │DataPage 1│ ◀┼ ┼ ─ ─ ─           │  (Parquet Reader)  │
+        │ │     └──────────┘  │ │      └ ─ ─ ─ ─ ─│                    │
+        │ │     ┌──────────┐  │ │                 │ ╔═══════════════╗  │
+        │ │     │DataPage 2│  │ │ If only rows    │ ║ParquetMetadata║  │
+        │ │     └──────────┘  │ │ from DataPage 1 │ ╚═══════════════╝  │
+        │ └───────────────────┘ │ are selected,   └────────────────────┘
+        │                       │ only DataPage 1
+        │          ...          │ is fetched and
+        │                       │ decoded
+        │ ╔═══════════════════╗ │
+        │ ║  Thrift metadata  ║ │
+        │ ╚═══════════════════╝ │
+        └───────────────────────┘
+         Parquet File
+```
+
+See the [parquet_index.rs] and [advanced_parquet_index.rs] examples for more 
details. 
+
+Thanks to [@alamb] and [@Ted-Jiang] for this feature.  
+
+[`ParquetAccessPlan`]: 
https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetAccessPlan.html
+[efficient support]: https://github.com/spiceai/spiceai/pull/1891
+[deletion vectors]: https://docs.delta.io/latest/delta-deletion-vectors.html
+[parquet_index.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs
+[advanced_parquet_index.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs
+
+# Building Systems is Easier with DataFusion 🛠️
+
+In addition to small API improvements, there are several new APIs that make
+it easier to build systems on top of DataFusion, for example:
+
+* Faster and easier to use [TreeNode API] for traversing and manipulating 
plans and expressions.
+* All functions now use the same [Scalar User Defined Function API], making it 
easier to customize
+  DataFusion's behavior without sacrificing performance. See the [UDF ticket] for more details.
+* DataFusion can now be compiled to [WASM].
+
+[TreeNode API]: 
https://docs.rs/datafusion/latest/datafusion/common/tree_node/trait.TreeNode.html#overview
+[Scalar User Defined Function API]: 
https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html
+[UDF ticket]: https://github.com/apache/arrow-datafusion/issues/8045
+[WASM]: https://github.com/apache/datafusion/discussions/9834
+
+## User Defined SQL Parsing Extensions
+
+As of DataFusion 40.0.0, the [`ExprPlanner`] allows easy extension of
+DataFusion's SQL planner to support custom operators or syntax.
+
+For example, the [datafusion-functions-json] project uses this API to add support
+for JSON operators in SQL queries. It provides a custom implementation for
+planning JSON operators such as `->` and `->>` with code like:
+
+```rust
+struct MyCustomPlanner;
+
+impl ExprPlanner for MyCustomPlanner {
+    // Provide a custom implementation for planning binary operators
+    // such as `->` and `->>`
+    fn plan_binary_op(
+        &self,
+        expr: RawBinaryExpr,
+        _schema: &DFSchema,
+    ) -> Result<PlannerResult<RawBinaryExpr>> {
+        match &expr.op {
+           BinaryOperator::Arrow => { /* plan -> operator */ }
+           BinaryOperator::LongArrow => { /* plan ->> operator */ }
+           ...
+        }
+    }
+}
+```
+
+Thanks to [@samuelcolvin], [@jayzhan211] and [@dharanad] for helping make this
+feature happen.
+
+[datafusion-functions-json]: 
https://github.com/datafusion-contrib/datafusion-functions-json
+[`ExprPlanner`]: 
https://docs.rs/datafusion/latest/datafusion/logical_expr/planner/trait.ExprPlanner.html
+
+## Pluggable Support for `CREATE FUNCTION` 
+
+DataFusion's new [`FunctionFactory`] API lets users provide a handler for
+`CREATE FUNCTION` in SQL. This feature lets you build systems that support
+defining functions in SQL such as:
+
+```sql
+-- SQL based functions
+CREATE FUNCTION my_func(DOUBLE, DOUBLE) RETURNS DOUBLE
+    RETURN $1 + $2
+;
+
+-- ML Models
+CREATE FUNCTION iris(FLOAT[]) RETURNS FLOAT[] LANGUAGE TORCH AS 
'models:/iris@champion';
+
+-- WebAssembly
+CREATE FUNCTION func(FLOAT[]) RETURNS FLOAT[] LANGUAGE WASM AS 'func.wasm'
+```
+
+Huge thanks to [@milenkovicm] for this feature. The [function_factory.rs]
+example shows how to make macro-like functions. It would be
+great if [someone made a demo] showing how easy WASM UDFs are to write 🎣.
+
+[fucntion_factory.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/function_factory.rs

Review Comment:
   ```suggestion
   [function_factory.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/function_factory.rs
   ```
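
For readers skimming this thread, the macro-like behavior that the quoted post mentions for `CREATE FUNCTION` can be pictured as positional-parameter substitution. The sketch below is a string-level toy with a hypothetical helper name (`expand_body`); it does not use the `FunctionFactory` API or any other DataFusion code.

```rust
/// Hypothetical helper: replaces `$1`, `$2`, ... in a SQL function body
/// with the caller's argument expressions. Returns an error when the body
/// refers to a position that has no matching argument.
fn expand_body(body: &str, args: &[&str]) -> Result<String, String> {
    let mut out = String::new();
    let mut chars = body.chars().peekable();
    while let Some(c) = chars.next() {
        if c != '$' {
            out.push(c);
            continue;
        }
        // Collect the digits that follow the `$`.
        let mut digits = String::new();
        while let Some(d) = chars.peek().copied().filter(|d| d.is_ascii_digit()) {
            digits.push(d);
            chars.next();
        }
        if digits.is_empty() {
            // A lone `$` is not a placeholder; keep it as-is.
            out.push('$');
            continue;
        }
        let position: usize = digits.parse().map_err(|e| format!("{e}"))?;
        // Placeholders are 1-based, the argument slice is 0-based.
        match position.checked_sub(1).and_then(|i| args.get(i)) {
            Some(arg) => out.push_str(arg),
            None => return Err(format!("no argument for ${position}")),
        }
    }
    Ok(out)
}

fn main() {
    // Two arguments, two placeholders: expands cleanly.
    let ok = expand_body("$1 + $2", &["a", "b"]);
    assert_eq!(ok, Ok("a + b".to_string()));
    // A placeholder beyond the argument list is reported as an error.
    let err = expand_body("$1 + $3", &["a", "b"]);
    assert!(err.is_err());
}
```

A real handler would more plausibly substitute expression trees than text, which avoids operator-precedence surprises when an argument is itself a compound expression.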



##########
_posts/2024-07-23-datafusion-40.0.0.md:
##########
@@ -0,0 +1,492 @@
+---
+layout: post
+title: "Apache DataFusion 40.0.0 Released"
+date: "2024-07-21 00:00:00"
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<!-- see https://github.com/apache/datafusion/issues/9602 for details -->
+
+## Introduction
+
+We are proud to announce [DataFusion 40.0.0]. This blog highlights some of the
+many major improvements since we released [DataFusion 34.0.0] and a preview of
+what the community is thinking about in the next 6 months. We are hoping to 
make
+more regular blog posts -- if you are interested in helping write them, please
+reach out!
+
+[DataFusion 34.0.0]: 
https://datafusion.apache.org/blog/2024/01/19/datafusion-34.0.0/
+[DataFusion 40.0.0]: https://crates.io/crates/datafusion/40.0.0
+
+[Apache DataFusion] is an extensible query engine, written in [Rust], that
+uses [Apache Arrow] as its in-memory format. DataFusion is used by developers 
to
+create new, fast data centric systems such as databases, dataframe libraries,
+machine learning and streaming applications. While [DataFusion’s primary design
+goal] is to accelerate the creation of other data centric systems, it has a
+reasonable experience directly out of the box as a [dataframe library] and
+[command line SQL tool].
+
+[DataFusion’s primary design goal]: 
https://datafusion.apache.org/user-guide/introduction.html#project-goals
+[dataframe library]: https://datafusion.apache.org/python/
+[command line SQL tool]: https://datafusion.apache.org/user-guide/cli/
+
+[apache datafusion]: https://datafusion.apache.org/
+[apache arrow]: https://arrow.apache.org
+[rust]: https://www.rust-lang.org/
+
+DataFusion's core thesis is that as a community together, we can build much 
more
+advanced technology than any of us as individuals or companies could do alone. 
+Without DataFusion, highly performant vectorized query engines would remain
+the domain of a few large companies and world-class research institutions. 
+With DataFusion, we can all build on top of a shared foundation, and focus on
+what makes our projects unique.
+
+
+# Community Growth 📈
+
+In the last 6 months, between `34.0.0` and `40.0.0`, our community continues to
+grow in new and exciting ways.
+
+1. DataFusion became a top level Apache Software Foundation project (read the
+   [press release] and [blog post]).
+2. We added several PMC members and new committers: [@comphead],
+   [@mustafasrepo], and [@ozankabak] joined the PMC, and [@jonahgao] and
+   [@lewiszlw] joined as committers. See the [mailing list] for more details.
+3. [DataFusion Comet] was [donated] and is nearing its first release.
+4. In the [core DataFusion repo] alone we reviewed and accepted almost 1500 PRs
+   from 182 different committers, created over 1000 issues and closed 781 of
+   them 🚀. This is up almost 50% from our last post (1000 PRs from 124
+   committers with 650 issues created) 🤯. All changes are listed in the
+   detailed [CHANGELOG].
+5. DataFusion focused meetups happened or are happening in multiple cities 
+   around the world: [Austin], [San Francisco], [Hangzhou], [New York], and
+   [Belgrade].
+6. Many new projects started in the [datafusion-contrib] organization, 
including
+   [Table Providers], [SQLancer], [Open Variant], [JSON], and [ORC].  
+
+[core DataFusion repo]: https://github.com/apache/arrow-datafusion
+[CHANGELOG]: 
https://github.com/apache/datafusion/blob/main/datafusion/CHANGELOG.md
+[press release]: 
https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion
+[blog post]: https://datafusion.apache.org/blog/2024/05/07/datafusion-tlp/
+[mailing list]: https://lists.apache.org/list.html?d...@datafusion.apache.org
+[Austin]: https://github.com/apache/datafusion/discussions/8522
+[San Francisco]: https://github.com/apache/datafusion/discussions/10800
+[Hangzhou]: https://www.huodongxing.com/event/5761971909400?td=1965290734055
+[New York]: https://github.com/apache/datafusion/discussions/11213
+[Belgrade]: https://github.com/apache/datafusion/discussions/11431
+[datafusion-contrib]: https://github.com/datafusion-contrib
+[Table Providers]: 
https://github.com/datafusion-contrib/datafusion-table-providers
+[SQLancer]: https://github.com/datafusion-contrib/datafusion-sqlancer
+[Open Variant]: 
https://github.com/datafusion-contrib/datafusion-functions-variant
+[JSON]: https://github.com/datafusion-contrib/datafusion-functions-json
+[ORC]: https://github.com/datafusion-contrib/datafusion-orc
+
+<!--
+$ git log --pretty=oneline 34.0.0..40.0.0 . | wc -l
+     1453 (up from 1009)
+
+$ git shortlog -sn 34.0.0..40.0.0 . | wc -l
+      182 (up from 124)
+
+https://crates.io/crates/datafusion/34.0.0
+DataFusion 34 released Dec 17, 2023
+
+https://crates.io/crates/datafusion/40.0.0
+DataFusion 40 released July 12, 2024
+
+Issues created in this time: 321 open, 781 closed (up from 214 open, 437 
closed)
+https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+created%3A2023-12-17..2024-07-12
+
+Issues closed: 911 (up from 517)
+https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2023-12-17..2024-07-12
+
+PRs merged in this time 1490 (up from 908)
+https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2023-12-17..2024-07-12
+
+-->
+
+
+In addition, DataFusion has been appearing publicly more and more, both online 
and offline. Here are some highlights:
+
+1. [Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query 
Engine], was presented in [SIGMOD '24], one of the major database conferences
+2. As part of the trend to define "the POSIX of databases" in ["What Goes 
Around Comes Around... And Around...] from Andy Pavlo and Mike Stonebraker
+3. ["Why you should keep an eye on Apache DataFusion and its community"]
+4. [Apache DataFusion offline meetup in the Bay Area]
+
+
+[DataFusion Comet]: https://datafusion.apache.org/comet/
+[donated]: https://arrow.apache.org/blog/2024/03/06/comet-donation/
+[SIGMOD '24]: https://2024.sigmod.org/
+
+
+[Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine]: 
https://dl.acm.org/doi/10.1145/3626246.3653368
+["What Goes Around Comes Around... And Around...]: 
https://db.cs.cmu.edu/papers/2024/whatgoesaround-sigmodrec2024.pdf
+["Why you should keep an eye on Apache DataFusion and its community"]: 
https://www.cpard.xyz/posts/datafusion/
+[Apache DataFusion offline meetup in the Bay Area]: 
https://www.tisonkun.org/2024/07/15/datafusion-meetup-san-francisco/
+
+
+# Improved Performance 🚀
+
+Performance is a key feature of DataFusion, and the community continues to work
+to keep DataFusion state of the art in this area. One major area DataFusion
+improved is the time it takes to convert a SQL query into a plan that can be
+executed. Planning is now almost 2x faster for TPC-DS and TPC-H queries, and
+over 10x faster for some queries with many columns.
+
+Here is a chart showing the improvement due to the concerted effort of many
+contributors including [@jackwener], [@alamb], [@Lordworms], [@dmitrybugakov],
+[@appletreeisyellow], [@ClSlaid], [@rohitrastogi], [@emgeee], 
[@kevinmingtarja],
+and [@peter-toth] over several months (see [ticket] for more details)
+
+<img src="{{ site.baseurl }}/assets/datafusion-40.0.0/improved-planning-time.png" width="700">
+
+[ticket]: https://github.com/apache/datafusion/issues/9637
+
+DataFusion is now up to 40% faster for queries that `GROUP BY` a single string
+or binary column due to a [specialization for single
+Utf8/LargeUtf8/Binary/LargeBinary]. We are working on improving performance when
+there are [multiple variable length columns in the `GROUP BY` clause].
+
+[specialization for single Utf8/LargeUtf8/Binary/LargeBinary]: https://github.com/apache/datafusion/pull/8827
+
+We are also in the final phases of [integrating] the new [Arrow StringView]
+which significantly improves performance for workloads that scan, filter and
+group by variable length string and binary data. We expect the improvement to 
be
+especially pronounced for Parquet files due to [upstream work in the parquet
+reader]. Kudos to [@XiangpengHong], [@AriesDevil], [@PsiACE], [@Weijun-H],
+[@a10y], and [@RinChanNOWWW] for driving this project.
+
+[integrating]: https://github.com/apache/datafusion/issues/10918
+[Arrow StringView]: 
https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html
+[multiple variable length columns in the `GROUP BY` clause]: 
https://github.com/apache/datafusion/issues/9403
+[upstream work in the parquet reader]: 
https://github.com/apache/arrow-rs/issues/5530
+
+# Improved Quality 📋
+
+DataFusion continues to improve in overall quality. In addition to ongoing bug
+fixes, one of the most exciting improvements is the new [SQLancer]-based
+[DataFusion Fuzzing] suite, contributed by [@2010YOUY01], which has already
+found several bugs. Thanks to [@jonahgao], [@tshauck], [@xinlifoobar], and
+[@LorrensP-2158466] for fixing them so fast.
+
+[DataFusion Fuzzing]: https://github.com/apache/datafusion/issues/11030
+
+
+## Improved Documentation 📚
+
+We continue to improve the documentation to make it easier to get started 
using DataFusion with
+the [Library Users Guide], [API documentation], and [Examples].
+
+Some notable new examples include:
+* [sql_analysis.rs] to analyse SQL queries with DataFusion structures (thanks 
[@LorrensP-2158466])
+* [function_factory.rs] to create custom functions via SQL (thanks 
[@milenkovicm])
+* [plan_to_sql.rs] to generate SQL from DataFusion Expr and LogicalPlan 
(thanks [@edmondop])
+* [parquet_index.rs] and [advanced_parquet_index.rs] for parquet indexing, 
described more below (thanks [@alamb])
+
+[Library Users Guide]: 
https://datafusion.apache.org/library-user-guide/index.html
+[API documentation]: https://docs.rs/datafusion/latest/datafusion/index.html
+[Examples]: https://github.com/apache/datafusion/tree/main/datafusion-examples
+[sql_analysis.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/sql_analysis.rs
+[plan_to_sql.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/plan_to_sql.rs
+
+# New Features ✨
+
+There are too many new features in the last 6 months to list them all, but here
+are some highlights:
+
+## SQL 
+* Support for `UNNEST` (thanks [@duongcongtoai], [@JasonLi-cn] and [@jayzhan211])
+* Support for [Recursive CTEs] (thanks [@jonahgao] and [@matthewgapp]) 
+* Support for `CREATE FUNCTION` (see below) 
+* Many new SQL functions
+
+[Recursive CTEs]: https://github.com/apache/datafusion/issues/462
+
+DataFusion now has much improved support for structured types such as `STRUCT`,
+`LIST`/`ARRAY` and `MAP`. For example, you can now use syntax like:
+
+```sql
+> select {'foo': {'bar': 2}};
++--------------------------------------------------------------+
+| named_struct(Utf8("foo"),named_struct(Utf8("bar"),Int64(2))) |
++--------------------------------------------------------------+
+| {foo: {bar: 2}}                                              |
++--------------------------------------------------------------+
+1 row(s) fetched.
+Elapsed 0.002 seconds.
+```
+
+## SQL Unparser (SQL Formatter)
+
+DataFusion now supports converting `Expr`s and `LogicalPlan`s BACK to SQL text.
+This can be useful in query federation to push predicates down into other
+systems that only accept SQL, and for building systems that generate SQL.
+
+For example, you can now convert a logical expression back to SQL text:
+
+```rust
+// Form a logical expression that represents the SQL "a < 5 OR a = 8"
+let expr = col("a").lt(lit(5)).or(col("a").eq(lit(8)));
+// convert the expression back to SQL text
+let sql = expr_to_sql(&expr)?.to_string();
+assert_eq!(sql, "a < 5 OR a = 8");
+```
+
+You can also do complex things like parsing SQL, modifying the plan, and
+converting it back to SQL:
+
+```rust
+let df = ctx
+  // Use SQL to read some data from the parquet file
+  .sql("SELECT int_col, double_col, CAST(date_string_col as VARCHAR) FROM alltypes_plain")
+  .await?;
+// Programmatically add new filters `id > 1 and tinyint_col < double_col`
+let df = df.filter(col("id").gt(lit(1)).and(col("tinyint_col").lt(col("double_col"))))?;
+// Convert the new logical plan back to SQL
+let sql = plan_to_sql(df.logical_plan())?.to_string();
+assert_eq!(
+    sql,
+    "SELECT alltypes_plain.int_col, alltypes_plain.double_col, CAST(alltypes_plain.date_string_col AS VARCHAR) \
+     FROM alltypes_plain WHERE ((alltypes_plain.id > 1) AND (alltypes_plain.tinyint_col < alltypes_plain.double_col))"
+);
+```
+
+See the [Plan to SQL example] or the APIs [expr_to_sql] and [plan_to_sql] for 
more details.
+
+[expr_to_sql]: 
https://docs.rs/datafusion/latest/datafusion/sql/unparser/fn.expr_to_sql.html
+[plan_to_sql]: 
https://docs.rs/datafusion/latest/datafusion/sql/unparser/fn.plan_to_sql.html
+[Plan to SQL example]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/plan_to_sql.rs
+
+
+
+## Low Level APIs for Fast Parquet Access (indexing)
+
+As Parquet files stored remotely on object storage become more prevalent,
+supporting efficient access to them is increasingly important. Part of doing
+this efficiently is minimizing the number of object store requests by caching
+metadata and skipping over parts of the file that are not needed (e.g. via an
+index).
+
+DataFusion's Parquet reader has long internally supported advanced predicate
+pushdown by reading the parquet metadata from the file footer and pruning based
+on row group level statistics as well as data page level statistics. DataFusion
+now also supports users supplying their own low level pruning information via
+the [`ParquetAccessPlan`] API.
+
+This API can be used along with index information to selectively skip decoding
+parts of the file. This feature has been used by SpiceAI to add [efficient
+support] for reading from DeltaLake tables and handling [deletion vectors].
+
+```text
+        ┌───────────────────────┐   If the RowSelection does not include any
+        │          ...          │   rows from a particular Data Page, that
+        │                       │   Data Page is not fetched or decoded.
+        │ ┌───────────────────┐ │   Note this requires a PageIndex
+        │ │     ┌──────────┐  │ │
+Row     │ │     │DataPage 0│  │ │                 ┌────────────────────┐
+Groups  │ │     └──────────┘  │ │                 │                    │
+        │ │     ┌──────────┐  │ │                 │    ParquetExec     │
+        │ │ ... │DataPage 1│ ◀┼ ┼ ─ ─ ─           │  (Parquet Reader)  │
+        │ │     └──────────┘  │ │      └ ─ ─ ─ ─ ─│                    │
+        │ │     ┌──────────┐  │ │                 │ ╔═══════════════╗  │
+        │ │     │DataPage 2│  │ │ If only rows    │ ║ParquetMetadata║  │
+        │ │     └──────────┘  │ │ from DataPage 1 │ ╚═══════════════╝  │
+        │ └───────────────────┘ │ are selected,   └────────────────────┘
+        │                       │ only DataPage 1
+        │          ...          │ is fetched and
+        │                       │ decoded
+        │ ╔═══════════════════╗ │
+        │ ║  Thrift metadata  ║ │
+        │ ╚═══════════════════╝ │
+        └───────────────────────┘
+         Parquet File
+```
+
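To make the per-row-group idea concrete, here is a minimal, std-only Rust sketch of an access plan that records, for each row group, whether to skip it, scan it fully, or read only selected row ranges. The `AccessPlan` and `RowGroupAccess` types and their methods are hypothetical stand-ins invented for this illustration; they are not the `ParquetAccessPlan` API itself, so consult its documentation for the real interface.

```rust
use std::ops::Range;

// NOTE: hypothetical, simplified model for illustration only;
// this is not DataFusion's `ParquetAccessPlan` API.

/// What a reader should do with one row group.
#[derive(Debug, Clone, PartialEq)]
enum RowGroupAccess {
    /// Do not fetch or decode the row group at all.
    Skip,
    /// Decode every row in the row group.
    Scan,
    /// Decode only the listed half-open row ranges.
    Selection(Vec<Range<usize>>),
}

/// One access decision per row group in the file.
#[derive(Debug, Clone, PartialEq)]
struct AccessPlan {
    row_groups: Vec<RowGroupAccess>,
}

impl AccessPlan {
    /// Start by scanning all `n` row groups.
    fn new_all(n: usize) -> Self {
        Self { row_groups: vec![RowGroupAccess::Scan; n] }
    }

    /// Mark row group `idx` as skipped, e.g. because an index ruled it out.
    fn skip(&mut self, idx: usize) {
        self.row_groups[idx] = RowGroupAccess::Skip;
    }

    /// Restrict row group `idx` to the given row ranges.
    fn select(&mut self, idx: usize, ranges: Vec<Range<usize>>) {
        self.row_groups[idx] = RowGroupAccess::Selection(ranges);
    }

    /// Total rows that must be decoded, given each row group's row count.
    fn rows_to_decode(&self, row_counts: &[usize]) -> usize {
        self.row_groups
            .iter()
            .zip(row_counts)
            .map(|(access, &count)| match access {
                RowGroupAccess::Skip => 0,
                RowGroupAccess::Scan => count,
                RowGroupAccess::Selection(ranges) => {
                    ranges.iter().map(|r| r.len()).sum::<usize>()
                }
            })
            .sum()
    }
}

fn main() {
    let row_counts = [100, 100, 50];
    let mut plan = AccessPlan::new_all(row_counts.len());
    plan.skip(0); // an external index says row group 0 has no matches
    plan.select(1, vec![10..20, 90..100]); // only 20 rows needed from row group 1
    // Row group 2 is still scanned in full: 0 + 20 + 50 rows.
    assert_eq!(plan.rows_to_decode(&row_counts), 70);
}
```

In this model, an external index only has to produce the per-row-group decisions; the reader can then avoid fetching and decoding everything the plan excludes.
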
+See the [parquet_index.rs] and [advanced_parquet_index.rs] examples for more 
details. 
+
+Thanks to [@alamb] and [@Ted-Jiang] for this feature.  
+
+[`ParquetAccessPlan`]: 
https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetAccessPlan.html
+[efficient support]: https://github.com/spiceai/spiceai/pull/1891
+[deletion vectors]: https://docs.delta.io/latest/delta-deletion-vectors.html
+[parquet_index.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs
+[advanced_parquet_index.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs
+
+# Building Systems is Easier with DataFusion 🛠️
+
+In addition to small API improvements, there are several new APIs that make
+it easier to build systems on top of DataFusion, for example:
+
+* Faster and easier to use [TreeNode API] for traversing and manipulating 
plans and expressions.
+* All functions now use the same [Scalar User Defined Function API], making it easier to customize
+  DataFusion's behavior without sacrificing performance. See the [function API ticket] for more details.
+* DataFusion can now be compiled to [WASM].
+
+[TreeNode API]: 
https://docs.rs/datafusion/latest/datafusion/common/tree_node/trait.TreeNode.html#overview
+[Scalar User Defined Function API]: 
https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html
+[function API ticket]: https://github.com/apache/arrow-datafusion/issues/8045
+[WASM]: https://github.com/apache/datafusion/discussions/9834
+
+## User Defined SQL Parsing Extensions
+
+As of DataFusion 40.0.0, the [`ExprPlanner`] allows easy extension of
+DataFusion's SQL planner to support custom operators or syntax.
+
+For example, the [datafusion-functions-json] project uses this API to add
+support for JSON operators in SQL queries. It provides a custom implementation
+for planning JSON operators such as `->` and `->>` with code like:
+
+```rust
+struct MyCustomPlanner;
+
+impl ExprPlanner for MyCustomPlanner {
+    // Provide a custom implementation for planning binary operators
+    // such as `->` and `->>`
+    fn plan_binary_op(
+        &self,
+        expr: RawBinaryExpr,
+        _schema: &DFSchema,
+    ) -> Result<PlannerResult<RawBinaryExpr>> {
+        match &expr.op {
+           BinaryOperator::Arrow => { /* plan -> operator */ }
+           BinaryOperator::LongArrow => { /* plan ->> operator */ }
+           ...
+        }
+    }
+}
+```
+
+Thanks to [@samuelcolvin], [@jayzhan211] and [@dharanad] for helping make this
+feature happen.
+
+[datafusion-functions-json]: 
https://github.com/datafusion-contrib/datafusion-functions-json
+[`ExprPlanner`]: 
https://docs.rs/datafusion/latest/datafusion/logical_expr/planner/trait.ExprPlanner.html
+
+## Pluggable Support for `CREATE FUNCTION` 
+
+DataFusion's new [`FunctionFactory`] API lets users provide a handler for
+`CREATE FUNCTION` in SQL. This feature lets you build systems that support
+defining functions in SQL such as:
+
+```sql
+-- SQL based functions
+CREATE FUNCTION my_func(DOUBLE, DOUBLE) RETURNS DOUBLE
+    RETURN $1 + $2
+;
+
+-- ML Models
+CREATE FUNCTION iris(FLOAT[]) RETURNS FLOAT[] LANGUAGE TORCH AS 'models:/iris@champion';
+
+-- WebAssembly
+CREATE FUNCTION func(FLOAT[]) RETURNS FLOAT[] LANGUAGE WASM AS 'func.wasm';
+```
+
+Huge thanks to [@milenkovicm] for this feature. There is an example of how to
+make macro-like functions in the [function_factory.rs] example. It would be
+great if [someone made a demo] showing how easy WASM UDFs are to build 🎣.
+
+[function_factory.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/function_factory.rs
+[someone made a demo]: https://github.com/apache/datafusion/issues/9326 
+
+[`FunctionFactory`]: 
https://docs.rs/datafusion/latest/datafusion/execution/context/trait.FunctionFactory.html
+
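As a rough illustration of what a macro-like SQL function involves, the sketch below expands positional placeholders such as `$1` and `$2` in a function body with the caller's argument expressions. It works on plain strings for brevity and is an assumption-laden stand-in: the `expand_body` helper is hypothetical and is not part of the `FunctionFactory` API or of the linked example.

```rust
// NOTE: hypothetical string-level sketch of placeholder expansion;
// it does not use or mirror DataFusion's `FunctionFactory` API.

/// Replace each `$N` in `body` with the N-th (1-based) argument expression,
/// wrapped in parentheses so operator precedence is preserved.
fn expand_body(body: &str, args: &[&str]) -> Result<String, String> {
    let mut out = String::new();
    let mut chars = body.chars().peekable();
    while let Some(c) = chars.next() {
        if c != '$' {
            out.push(c);
            continue;
        }
        // Collect the digits that follow the `$`.
        let mut digits = String::new();
        while let Some(d) = chars.peek().copied().filter(|ch| ch.is_ascii_digit()) {
            digits.push(d);
            chars.next();
        }
        let position: usize = digits
            .parse()
            .map_err(|_| "expected digits after `$`".to_string())?;
        // Placeholders are 1-based; `$0` and out-of-range positions are errors.
        let arg = position
            .checked_sub(1)
            .and_then(|i| args.get(i))
            .ok_or_else(|| format!("placeholder ${position} has no matching argument"))?;
        out.push('(');
        out.push_str(arg);
        out.push(')');
    }
    Ok(out)
}

fn main() {
    let expanded = expand_body("$1 + $2", &["a", "b * 2"]).unwrap();
    assert_eq!(expanded, "(a) + (b * 2)");
    // A two-argument call has no third placeholder, so this is rejected.
    assert!(expand_body("$1 + $3", &["a", "b"]).is_err());
}
```

A real handler would more likely rewrite the parsed expression tree rather than text, but the validation step is the same: every placeholder must refer to a declared argument.
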
+# Looking Ahead: The Next Six Months 🔭
+
+The community has been [discussing what we will work on in the next six 
months].
+Some major initiatives from that discussion are:
+
+1. *Aggregate Performance*: Improve the speed of [aggregating "high cardinality"]
+   data when there are many (e.g. millions of) distinct groups.
+
+2. *Modularity*: Make DataFusion even more modular, by completely unifying
+   built in and user [aggregate functions] and [window functions].
+
+3. *LogicalTypes*: [Introducing Logical Types] to make it easier to use
+   different encodings like `StringView`, `RunEnd` and `Dictionary` arrays as 
well
+   as user defined types. Thanks [@notfilippo] for driving this. 
+
+4. *Improved Documentation*: Write blog posts and videos explaining
+   how to use DataFusion for real-world use cases.
+
+5. *Testing*: Improve CI infrastructure and test coverage, more fuzz
+   testing, and better functional and performance regression testing.
+
+
+[discussing what we will work on in the next six months]: 
https://github.com/apache/datafusion/issues/11442
+[aggregating "high cardinality"]: 
https://github.com/apache/arrow-datafusion/issues/7000
+[Improved statistics handling]: 
https://github.com/apache/arrow-datafusion/issues/8227

Review Comment:
   I don't think this is referenced



##########
_posts/2024-07-23-datafusion-40.0.0.md:
##########
@@ -0,0 +1,492 @@
+---
+layout: post
+title: "Apache DataFusion 40.0.0 Released"
+date: "2024-07-21 00:00:00"
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<!-- see https://github.com/apache/datafusion/issues/9602 for details -->
+
+## Introduction
+
+We are proud to announce [DataFusion 40.0.0]. This blog highlights some of the
+many major improvements since we released [DataFusion 34.0.0] and a preview of
+what the community is thinking about in the next 6 months. We are hoping to 
make
+more regular blog posts -- if you are interested in helping write them, please
+reach out!
+
+[DataFusion 34.0.0]: 
https://datafusion.apache.org/blog/2024/01/19/datafusion-34.0.0/
+[DataFusion 40.0.0]: https://crates.io/crates/datafusion/40.0.0
+
+[Apache DataFusion] is an extensible query engine, written in [Rust], that
+uses [Apache Arrow] as its in-memory format. DataFusion is used by developers 
to
+create new, fast data centric systems such as databases, dataframe libraries,
+machine learning and streaming applications. While [DataFusion’s primary design
+goal] is to accelerate the creation of other data centric systems, it has a
+reasonable experience directly out of the box as a [dataframe library] and
+[command line SQL tool].
+
+[DataFusion’s primary design goal]: 
https://datafusion.apache.org/user-guide/introduction.html#project-goals
+[dataframe library]: https://datafusion.apache.org/python/
+[command line SQL tool]: https://datafusion.apache.org/user-guide/cli/
+
+[apache datafusion]: https://datafusion.apache.org/
+[apache arrow]: https://arrow.apache.org
+[rust]: https://www.rust-lang.org/
+
+DataFusion's core thesis is that as a community together, we can build much 
more
+advanced technology than any of us as individuals or companies could do alone. 
+Without DataFusion, highly performant vectorized query engines would remain
+the domain of a few large companies and world-class research institutions. 
+With DataFusion, we can all build on top of a shared foundation, and focus on
+what makes our projects unique.
+
+
+# Community Growth 📈
+
+In the last 6 months, between `34.0.0` and `40.0.0`, our community has
+continued to grow in new and exciting ways.
+
+1. DataFusion became a top level Apache Software Foundation project (read the
+   [press release] and [blog post]).
+2. We added several PMC members and new committers: [@comphead],
+   [@mustafasrepo], and [@ozankabak] joined the PMC, and [@jonahgao] and
+   [@lewiszlw] joined as committers. See the [mailing list] for more details.
+3. [DataFusion Comet] was [donated] and is nearing its first release.
+4. In the [core DataFusion repo] alone we reviewed and accepted almost 1500 PRs
+   from 182 different committers, created over 1000 issues and closed 781 of
+   them 🚀. This is up almost 50% from our last post (1000 PRs from 124
+   committers with 650 issues created) 🤯. All changes are listed in the
+   detailed [CHANGELOG].
+5. DataFusion focused meetups happened or are happening in multiple cities 
+   around the world: [Austin], [San Francisco], [Hangzhou], [New York], and
+   [Belgrade].
+6. Many new projects started in the [datafusion-contrib] organization, 
including
+   [Table Providers], [SQLancer], [Open Variant], [JSON], and [ORC].  
+
+[core DataFusion repo]: https://github.com/apache/arrow-datafusion
+[CHANGELOG]: 
https://github.com/apache/datafusion/blob/main/datafusion/CHANGELOG.md
+[press release]: 
https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion
+[blog post]: https://datafusion.apache.org/blog/2024/05/07/datafusion-tlp/
+[mailing list]: https://lists.apache.org/list.html?d...@datafusion.apache.org
+[Austin]: https://github.com/apache/datafusion/discussions/8522
+[San Francisco]: https://github.com/apache/datafusion/discussions/10800
+[Hangzhou]: https://www.huodongxing.com/event/5761971909400?td=1965290734055
+[New York]: https://github.com/apache/datafusion/discussions/11213
+[Belgrade]: https://github.com/apache/datafusion/discussions/11431
+[datafusion-contrib]: https://github.com/datafusion-contrib
+[Table Providers]: 
https://github.com/datafusion-contrib/datafusion-table-providers
+[SQLancer]: https://github.com/datafusion-contrib/datafusion-sqlancer
+[Open Variant]: 
https://github.com/datafusion-contrib/datafusion-functions-variant
+[JSON]: https://github.com/datafusion-contrib/datafusion-functions-json
+[ORC]: https://github.com/datafusion-contrib/datafusion-orc
+
+<!--
+$ git log --pretty=oneline 34.0.0..40.0.0 . | wc -l
+     1453 (up from 1009)
+
+$ git shortlog -sn 34.0.0..40.0.0 . | wc -l
+      182 (up from 124)
+
+https://crates.io/crates/datafusion/34.0.0
+DataFusion 34 released Dec 17, 2023
+
+https://crates.io/crates/datafusion/40.0.0
+DataFusion 40 released July 12, 2024
+
+Issues created in this time: 321 open, 781 closed (up from 214 open, 437 
closed)
+https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+created%3A2023-12-17..2024-07-12
+
+Issues closed: 911 (up from 517)
+https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2023-12-17..2024-07-12
+
+PRs merged in this time 1490 (up from 908)
+https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2023-12-17..2024-07-12
+
+-->
+
+
+In addition, DataFusion has been appearing publicly more and more, both online 
and offline. Here are some highlights:
+
+1. [Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query 
Engine] was presented at [SIGMOD '24], one of the major database conferences
+2. As part of the trend to define "the POSIX of databases" in ["What Goes 
Around Comes Around... And Around...] from Andy Pavlo and Mike Stonebraker

Review Comment:
   ```suggestion
   2. As part of the trend to define "the POSIX of databases" in ["What Goes 
Around Comes Around... And Around..."] from Andy Pavlo and Mike Stonebraker
   ```
   Missing end quote maybe?


