Re: [PR] [Website]: DataFusion 26-34 blog [arrow-site]

via GitHub Sun, 14 Jan 2024 10:58:43 -0800


alamb commented on code in PR #457:
URL: https://github.com/apache/arrow-site/pull/457#discussion_r1451737550



##########
_posts/2024-01-25-datafusion-34.0.0.md:
##########
@@ -0,0 +1,345 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 34.0.0 Released, Looking Forward to 2024"
+date: "2024-01-01 00:00:00"
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+[Apache Arrow DataFusion] is an extensible query engine, written in [Rust], that
+uses [Apache Arrow] as its in-memory format. DataFusion is used by developers to
+create new, fast data-centric systems such as databases, dataframe libraries,
+machine learning and streaming applications. While [DataFusion’s primary design
+goal] is to accelerate the creation of other data-centric systems, it also
+provides a reasonably good experience out of the box as a [dataframe library] and
+[command line SQL tool].
+
+[DataFusion’s primary design goal]: https://arrow.apache.org/datafusion/user-guide/introduction.html#project-goals
+[dataframe library]: https://arrow.apache.org/datafusion-python/
+[command line SQL tool]: https://arrow.apache.org/datafusion/user-guide/cli.html
+
+
+[apache arrow datafusion]: https://arrow.apache.org/datafusion/
+[apache arrow]: https://arrow.apache.org
+[rust]: https://www.rust-lang.org/
+
+We recently [released DataFusion 34.0.0]. This blog highlights some of the major
+improvements since we [released DataFusion 26.0.0] (spoiler alert: there are a lot)
+and previews where the community will likely spend time in the next 6 months.
+
+[released DataFusion 26.0.0]: https://arrow.apache.org/blog/2023/06/24/datafusion-25.0.0/
+[released DataFusion 34.0.0]: https://crates.io/crates/datafusion/34.0.0
+
+This may also be our last update blog post on the Apache Arrow Site. Future
+updates will likely be on the DataFusion website as we are working to [graduate
+to a top level project] (Apache Arrow DataFusion → Apache DataFusion!) which
+will help focus governance and project growth. Also exciting, our [first
+DataFusion in person meetup] is planned for March 2024.
+
+[graduate to a top level project]: https://github.com/apache/arrow-datafusion/discussions/6475
+[first DataFusion in person meetup]: https://github.com/apache/arrow-datafusion/discussions/8522
+
+DataFusion is very much a community endeavor. The core thesis is that as a
+community we can build much better and more advanced technology than any of us
+as individuals or companies could alone. In the last 6 months between `26.0.0` and
+`34.0.0`, community growth has been strong. We accepted and reviewed over a
+thousand PRs from 124 different committers, created over 650 issues and closed
+517 issues.
+
+<!--
+$ git log --pretty=oneline 26.0.0..34.0.0 . | wc -l
+     1009
+
+$ git shortlog -sn 26.0.0..34.0.0 . | wc -l
+      124
+
+https://crates.io/crates/datafusion/26.0.0
+DataFusion 26 released June 7, 2023
+
+https://crates.io/crates/datafusion/34.0.0
+DataFusion 34 released Dec 17, 2023
+
+Issues created in this time: 214 open, 437 closed
+https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+created%3A2023-06-23..2023-12-17
+
+Issues closed: 517
+https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2023-06-23..2023-12-17+
+
+PRs merged in this time 908
+https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2023-06-23..2023-12-17
+-->
+
+The rest of this post highlights a small portion of how we have improved
+DataFusion over the last 6 months and previews where we are heading. You can
+see a list of all changes in the detailed [CHANGELOG].
+
+[CHANGELOG]: https://github.com/apache/arrow-datafusion/blob/main/datafusion/CHANGELOG.md
+
+# Improved Performance 🚀 
+
+Performance is a key feature of DataFusion. We have made major improvements
+since `25.0.0`, resulting in a 2x overall runtime improvement in the
+[ClickBench] queries.
+
+<!--
+  Source: https://docs.google.com/spreadsheets/d/1FtI3652WIJMC5LmJbLfT3G06w0JQIxEPG4yfMafexh8/edit#gid=1879366976
+  Average runtime on 25.0.0: 7.2s (for the queries that actually ran)
+  Average runtime on 34.0.0: 3.6s (for the same queries that ran in 25.0.0)
+-->
+
+[ClickBench]: https://benchmark.clickhouse.com/
+
+<figure style="text-align: center;">
+  <img src="{{ site.baseurl }}/img/datafusion-34.0.0/compare.png" width="100%" class="img-responsive" alt="Fig 1: Performance improvements between DataFusion 25.0.0 and DataFusion 34.0.0.">
+  <figcaption>
+    Fig 1: Performance improvements between DataFusion 25.0.0 and DataFusion 34.0.0.
+    Note that several queries don't run on <code>25.0.0</code>, for various reasons such as requiring too much memory (Q33)
+    or unsupported SQL features.
+  </figcaption>
+</figure>
+
+Here are some of the specific enhancements we made:
+* [2-3x Better aggregation performance with many distinct groups]
+* Partially ordered grouping / streaming grouping
+* [Specialized operator for "TopK" `ORDER BY LIMIT XXX`] 
+* [Specialized operator for `min(col) GROUP BY .. ORDER by min(col) LIMIT XXX`]
+* [Improved Join Performance]
+* Eliminate redundant sorting with sort order aware optimizers
+
+[2-3x Better aggregation performance with many distinct groups]: https://arrow.apache.org/blog/2023/08/05/datafusion_fast_grouping/
+[Specialized operator for `min(col) GROUP BY .. ORDER by min(col) LIMIT XXX`]: https://github.com/apache/arrow-datafusion/pull/7192
+[Specialized operator for "TopK" `ORDER BY LIMIT XXX`]: https://github.com/apache/arrow-datafusion/pull/7721
+[Improved Join Performance]: https://github.com/apache/arrow-datafusion/pull/8126
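
The "TopK" item in the list above can be pictured with a small Python sketch. It only illustrates the general idea (keep the best K rows in a bounded heap instead of sorting the whole input) and is not DataFusion's Rust implementation; the function name `top_k`, the `(url, hits)` row shape, and the sample data are hypothetical.

```python
import heapq
from typing import Iterable, List, Tuple

Row = Tuple[str, int]  # hypothetical (url, hits) row


def top_k(rows: Iterable[Row], k: int) -> List[Row]:
    """Return the k rows with the largest hit count, largest first.

    Gives the same rows as `ORDER BY hits DESC LIMIT k`, but holds at most
    k rows in memory instead of sorting the whole input.
    """
    if k <= 0:
        return []
    heap: List[Tuple[int, int, Row]] = []  # min-heap of (hits, arrival order, row)
    for seq, row in enumerate(rows):
        entry = (row[1], seq, row)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            # The new row beats the smallest row kept so far: swap it in.
            heapq.heapreplace(heap, entry)
    ordered = sorted(heap, key=lambda e: e[0], reverse=True)
    return [row for _, _, row in ordered]


sample = [("a", 5), ("b", 9), ("c", 1), ("d", 7)]
print(top_k(sample, 2))  # prints [('b', 9), ('d', 7)]
```

The heap never grows beyond `k` entries, which is why this shape of query can be answered without materializing a full sort.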
+
+# New Features ✨
+
+## DML / Insert / Creating Files

Review Comment:
   @devinjdangelo  is there anything else you think is worth calling out about DML insert functionality?



##########
_posts/2024-01-25-datafusion-34.0.0.md:
##########
@@ -0,0 +1,345 @@
+# New Features ✨
+
+## DML / Insert / Creating Files
+
+DataFusion now supports writing data in parallel, to individual or multiple
+files, using `Parquet`, `CSV`, `JSON`, `ARROW` and user defined formats. 
+
+You can do this using the [`CREATE EXTERNAL TABLE` statement], for example:
+
+```sql
+❯ CREATE EXTERNAL TABLE awesome_table(x INT) STORED AS PARQUET LOCATION '/tmp/my_awesome_table';
+0 rows in set. Query took 0.003 seconds.
+
+❯ INSERT INTO awesome_table SELECT x * 10 FROM my_source_table;
++-------+
+| count |
++-------+
+| 3     |
++-------+
+1 row in set. Query took 0.024 seconds.
+```
+
+[`CREATE EXTERNAL TABLE` statement]: https://arrow.apache.org/datafusion/user-guide/sql/ddl.html#create-external-table
+
+You can also create files using the [`COPY` command], similar to [DuckDB’s `COPY`] command:
+
+[`COPY` command]: https://arrow.apache.org/datafusion/user-guide/sql/dml.html#copy
+[DuckDB’s `COPY`]: https://duckdb.org/docs/sql/statements/copy.html
+
+```sql
+❯ COPY (SELECT x + 1 FROM my_source_table) TO '/tmp/output.json';
++-------+
+| count |
++-------+
+| 3     |
++-------+
+1 row in set. Query took 0.014 seconds.
+```
+
+```shell
+$ cat /tmp/output.json
+{"x":1}
+{"x":2}
+{"x":3}
+
+# Assuming a similar COPY ... TO '/tmp/output.arrow' was also run:
+$ python3
+Python 3.11.7 (main, Dec  4 2023, 18:10:11) [Clang 15.0.0 (clang-1500.1.0.2.5)] on darwin
+Type "help", "copyright", "credits" or "license" for more information.
+>>> import pyarrow.feather as ft
+>>> table = ft.read_table("/tmp/output.arrow")
+>>> print(table)
+pyarrow.Table
+x: int32
+----
+x: [[1,2,3]]
+```
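
The JSON file shown above holds one object per line (newline-delimited JSON), so it can be consumed line by line. Below is a minimal Python sketch, assuming the file looks exactly like the `cat` output above. The helper name `read_ndjson` is hypothetical, and the sample file is recreated in a temporary directory only so that the snippet is self-contained.

```python
import json
import os
import tempfile
from typing import Dict, List


def read_ndjson(path: str) -> List[Dict[str, object]]:
    """Parse a newline-delimited JSON file into a list of row dicts."""
    parsed = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                parsed.append(json.loads(line))
    return parsed


# Recreate the sample shown above (an assumption, not real COPY output).
sample = '{"x":1}\n{"x":2}\n{"x":3}\n'
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.json")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)
    rows = read_ndjson(path)

print(rows)  # prints [{'x': 1}, {'x': 2}, {'x': 3}]
```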
+
+## Improved `STRUCT` and `ARRAY` support
+
+DataFusion `34.0.0` has much improved `STRUCT` and `ARRAY`
+support, including a full range of [struct functions] and [array functions].
+
+[struct functions]: https://arrow.apache.org/datafusion/user-guide/sql/scalar_functions.html#struct-functions
+[array functions]: https://arrow.apache.org/datafusion/user-guide/sql/scalar_functions.html#array-functions
+
+<!--
+❯ create table my_table as values ([1,2,3]), ([2]), ([4,5]);
+--> 
+
+For example, you can now use `[]` syntax and `array_length`:
+```sql
+❯ SELECT column1, 
+         column1[1] AS first_element, 
+         array_length(column1) AS len 
+  FROM my_table;
++-----------+---------------+-----+
+| column1   | first_element | len |
++-----------+---------------+-----+
+| [1, 2, 3] | 1             | 3   |
+| [2]       | 2             | 1   |
+| [4, 5]    | 4             | 2   |
++-----------+---------------+-----+
+```
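
Note that `column1[1]` returns the first element in the output above, so indexing is 1-based. Here is a tiny Python sketch of those semantics, using the same three arrays. The helper name `element_at` is hypothetical, and returning `None` for an out-of-range index is an assumption, since the example above does not show that case.

```python
from typing import List, Optional


def element_at(values: List[int], index: int) -> Optional[int]:
    """1-based element access, mirroring `column1[1]` in the query above."""
    if 1 <= index <= len(values):
        return values[index - 1]
    return None  # assumption: out-of-range behaviour is not shown in the post


table = [[1, 2, 3], [2], [4, 5]]
for column1 in table:
    # column1, first_element, len
    print(column1, element_at(column1, 1), len(column1))
```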
+
+Fields of a `STRUCT` column can be accessed with `[]` syntax as well:
+```sql
+❯ SELECT column1, column1['c0'] FROM  my_table;
++------------------+----------------------+
+| column1          | my_table.column1[c0] |
++------------------+----------------------+
+| {c0: foo, c1: 1} | foo                  |
+| {c0: bar, c1: 2} | bar                  |
++------------------+----------------------+
+2 rows in set. Query took 0.002 seconds.
+```
+
+## Other Features
+* Support grouping on datasets that exceed memory size, with [Group by spill to disk]
+* All operators now track and limit their memory consumption, including Joins
+
+[Group by spill to disk]: https://github.com/apache/arrow-datafusion/pull/7400
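
To give a feel for what grouping with spill to disk involves, here is a rough Python sketch of external aggregation for `COUNT(*) ... GROUP BY key`. It is an illustration under stated assumptions, not how DataFusion implements it: the name `spilling_count` is hypothetical, a cap on the number of in-memory groups (`max_groups`) stands in for a real memory limit, and spill files are plain JSON lines.

```python
import heapq
import json
import tempfile
from itertools import groupby
from typing import Dict, Iterable, Iterator, Tuple


def spilling_count(keys: Iterable[str], max_groups: int) -> Iterator[Tuple[str, int]]:
    """Yield (key, count) in key order, holding at most `max_groups` groups in memory."""
    spills = []  # temporary files, each holding (key, count) pairs sorted by key
    in_memory: Dict[str, int] = {}

    def spill() -> None:
        # Write the current partial counts to disk in key order, then forget them.
        f = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
        for key in sorted(in_memory):
            f.write(json.dumps([key, in_memory[key]]) + "\n")
        f.seek(0)
        spills.append(f)
        in_memory.clear()

    for key in keys:
        if key not in in_memory and len(in_memory) >= max_groups:
            spill()
        in_memory[key] = in_memory.get(key, 0) + 1

    def read(f) -> Iterator[Tuple[str, int]]:
        for line in f:
            key, count = json.loads(line)
            yield key, count

    # Merge the sorted runs and add up partial counts that share a key.
    runs = [read(f) for f in spills]
    runs.append(iter(sorted(in_memory.items())))
    try:
        for key, group in groupby(heapq.merge(*runs), key=lambda kv: kv[0]):
            yield key, sum(count for _, count in group)
    finally:
        for f in spills:
            f.close()


print(list(spilling_count(["a", "b", "a", "c", "b", "a"], max_groups=2)))
# prints [('a', 3), ('b', 2), ('c', 1)]
```

Because every run is sorted by key, the final merge only ever needs one partial count per run in memory at a time.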
+
+# Easier to Build Systems with DataFusion 🛠️
+
+## Documentation
+It is easier than ever to get started using DataFusion with the
+new [Library Users Guide] as well as significantly improved [API documentation].
+
+[Library Users Guide]: https://arrow.apache.org/datafusion/library-user-guide/index.html
+[API documentation]: https://docs.rs/datafusion/latest/datafusion/index.html
+
+## User Defined Window and Table Functions
+In addition to DataFusion's [User Defined Scalar Functions] and [User Defined Aggregate Functions], DataFusion now supports [User Defined Window Functions]
+and [User Defined Table Functions].
+
+For example, `datafusion-cli` implements the `parquet_metadata` function (TODO get a link to doc) as a
+user defined table function:
+
+```sql
+❯ SELECT 
+      path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size
+FROM 
+      parquet_metadata('hits.parquet')
+WHERE path_in_schema = '"WatchID"' 
+LIMIT 3;
+
++----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
+| path_in_schema | row_group_id | row_group_num_rows | stats_min           | stats_max           | total_compressed_size |
++----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
+| "WatchID"      | 0            | 450560             | 4611687214012840539 | 9223369186199968220 | 3883759               |
+| "WatchID"      | 1            | 612174             | 4611689135232456464 | 9223371478009085789 | 5176803               |
+| "WatchID"      | 2            | 344064             | 4611692774829951781 | 9223363791697310021 | 3031680               |
++----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
+3 rows in set. Query took 0.053 seconds.
+```
+
+
+[User Defined Scalar Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-scalar-udf
+[User Defined Aggregate Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-an-aggregate-udf
+[User Defined Window Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-window-udf
+[User Defined Table Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-user-defined-table-function
+
+
+### Growth of DataFusion 📈
+DataFusion has begun appearing more in the wild, such as:
+* New projects built on DataFusion such as [lancedb], [GlareDB], and [Arroyo].
+* Public talks such as [Apache Arrow Datafusion: Vectorized
+  Execution Framework For Maximum Performance] at [CommunityOverCode Asia 2023]
+* Blog posts such as [Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0], [Apache Arrow, Arrow/DataFusion, AI-native Data Infra] and [A Guide to User-Defined Functions in Apache Arrow DataFusion]
+
+[glaredb]: https://glaredb.com/
+[lancedb]: https://lancedb.com/
+[arroyo]: https://www.arroyo.dev/
+
+[Apache Arrow Datafusion: Vectorized Execution Framework For Maximum Performance]: https://www.youtube.com/watch?v=AJU9rdRNk9I
+[CommunityOverCode Asia 2023]: https://www.bagevent.com/event/8432178
+[Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0]: https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/
+[Apache Arrow, Arrow/DataFusion, AI-native Data Infra]: https://www.synnada.ai/blog/apache-arrow-arrow-datafusion-ai-native-data-infra-an-interview-with-our-ceo-ozan
+[A Guide to User-Defined Functions in Apache Arrow DataFusion]: https://www.linkedin.com/pulse/guide-user-defined-functions-apache-arrow-datafusion-dade-aderemi/
+
+We also [submitted a paper] to [SIGMOD 2024], one of the
+premier database conferences, describing DataFusion in a technically formal
+style and making the case that it is possible to create a modular and extensible query engine
+without sacrificing performance. We hope this paper will help people who may be
+considering DataFusion to decide if it is a good fit for their needs.
+
+[submitted a paper]: https://github.com/apache/arrow-datafusion/issues/6782
+[SIGMOD 2024]: https://2024.sigmod.org/
+
+# DataFusion in 2024 🥳
+
+This year some major initiatives contributors plan to focus on are:
+
+1. *Modularity*: Make DataFusion even more modular, such as [unifying how

Review Comment:
   @ozankabak  are there any plans you and your team may have that you want to share publicly?



##########
_posts/2024-01-25-datafusion-34.0.0.md:
##########
@@ -0,0 +1,345 @@
+Here are some of the specific enhancements we made:
+* [2-3x Better aggregation performance with many distinct groups]
+* Partially ordered grouping / streaming grouping
+* [Specialized operator for "TopK" `ORDER BY LIMIT XXX`] 
+* [Specialized operator for `min(col) GROUP BY .. ORDER by min(col) LIMIT XXX`]
+* [Improved Join Performance]
+* Eliminate redundant sorting with sort order aware optimizers

Review Comment:
   @ozankabak  or @mustafasrepo  are there specific join order optimizations 
that would be good to call out here (or references I can link to)?



##########
_posts/2024-01-25-datafusion-34.0.0.md:
##########
@@ -0,0 +1,345 @@
+Here are some of the specific enhancements we made:
+* [2-3x Better aggregation performance with many distinct groups]
+* Partially ordered grouping / streaming grouping
+* [Specialized operator for "TopK" `ORDER BY LIMIT XXX`] 
+* [Specialized operator for `min(col) GROUP BY .. ORDER by min(col) LIMIT XXX`]

Review Comment:
   @avantgardnerio do you know of any better ways to describe this optimization?

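For context on the optimization in question, here is a conceptual Python sketch of the idea behind evaluating `min(col) GROUP BY .. ORDER BY min(col) LIMIT k` without keeping state for every group. It is an illustration only and not DataFusion's implementation: the function names `topk_min_per_group` and `full_aggregate_then_sort` and the sample rows are hypothetical, and a linear scan over at most `k` entries stands in for whatever priority structure a real operator would use.

```python
def topk_min_per_group(rows, k):
    """Emulate: SELECT g, min(v) FROM rows GROUP BY g ORDER BY min(v) LIMIT k.

    Only the k groups with the smallest minimum seen so far are tracked,
    so memory is bounded by k rather than by the number of distinct groups.
    """
    if k <= 0:
        return []
    tracked = {}  # group key -> smallest value seen so far (at most k entries)
    for group, value in rows:
        if group in tracked:
            # Known group: keep the smaller value.
            if value < tracked[group]:
                tracked[group] = value
        elif len(tracked) < k:
            # Still room: start tracking this group.
            tracked[group] = value
        else:
            # Full: replace the tracked group with the largest minimum,
            # but only if this row beats it.
            worst = max(tracked, key=tracked.get)
            if value < tracked[worst]:
                del tracked[worst]
                tracked[group] = value
    return sorted(tracked.items(), key=lambda item: item[1])


def full_aggregate_then_sort(rows, k):
    """Reference plan: aggregate every group, sort, then apply the limit."""
    mins = {}
    for group, value in rows:
        mins[group] = min(value, mins.get(group, value))
    return sorted(mins.items(), key=lambda item: item[1])[:k]


# Hypothetical sample data with distinct per-group minimums.
rows = [("a", 5), ("b", 3), ("c", 9), ("a", 1), ("d", 4), ("c", 2), ("e", 8)]
print(topk_min_per_group(rows, 2))  # [('a', 1), ('c', 2)]
```

Both functions return the same rows for this input; the difference is that the first never holds more than `k` groups at once. Tie handling between groups with equal minimums is arbitrary in this sketch.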


##########
_posts/2024-01-25-datafusion-34.0.0.md:
##########
@@ -0,0 +1,345 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 34.0.0 Released, Looking Forward to 2024"
+date: "2024-01-01 00:00:00"
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+[Apache Arrow DataFusion] is an extensible query engine, written in [Rust], that
+uses [Apache Arrow] as its in-memory format. DataFusion is used by developers to
+create new, fast data centric systems such as databases, dataframe libraries,
+machine learning and streaming applications. While [DataFusion’s primary design
+goal] is to accelerate creating other data centric systems, it has a
+reasonably good experience out of the box as a [dataframe library] and
+[command line SQL tool].
+
+[DataFusion’s primary design goal]: https://arrow.apache.org/datafusion/user-guide/introduction.html#project-goals
+[dataframe library]: https://arrow.apache.org/datafusion-python/
+[command line SQL tool]: https://arrow.apache.org/datafusion/user-guide/cli.html
+
+
+[apache arrow datafusion]: https://arrow.apache.org/datafusion/
+[apache arrow]: https://arrow.apache.org
+[rust]: https://www.rust-lang.org/
+
+We recently [released DataFusion 34.0.0]. This blog highlights some of the major
+improvements since we [released DataFusion 26.0.0] (spoiler alert: it is a lot)
+and previews where the community will likely spend time in the next 6 months.
+
+[released DataFusion 26.0.0]: https://arrow.apache.org/blog/2023/06/24/datafusion-25.0.0/
+[released DataFusion 34.0.0]: https://crates.io/crates/datafusion/34.0.0
+
+This may also be our last update blog post on the Apache Arrow Site. Future
+updates will likely be on the DataFusion website as we are working to [graduate
+to a top level project] (Apache Arrow DataFusion → Apache DataFusion!) which
+will help focus governance and project growth. Also exciting, our [first
+DataFusion in person meetup] is planned for March 2024.
+
+[graduate to a top level project]: https://github.com/apache/arrow-datafusion/discussions/6475
+[first DataFusion in person meetup]: https://github.com/apache/arrow-datafusion/discussions/8522
+
+DataFusion is very much a community endeavor. The core thesis is that as a
+community we can build much better and more advanced technology than any of us as
+individuals or companies could alone. In the last 6 months between `26.0.0` and
+`34.0.0`, community growth has been strong. We accepted and reviewed over a
+thousand PRs from 124 different committers, created over 650 and closed 517
+issues.
+
+<!--
+$ git log --pretty=oneline 26.0.0..34.0.0 . | wc -l
+     1009
+
+$ git shortlog -sn 26.0.0..34.0.0 . | wc -l
+      124
+
+https://crates.io/crates/datafusion/26.0.0
+DataFusion 26 released June 7, 2023
+
+https://crates.io/crates/datafusion/34.0.0
+DataFusion 34 released Dec 17, 2023
+
+Issues created in this time: 214 open, 437 closed
+https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+created%3A2023-06-23..2023-12-17
+
+Issues closed: 517
+https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2023-06-23..2023-12-17+
+
+PRs merged in this time 908
+https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2023-06-23..2023-12-17
+-->
+
+The rest of this post highlights a small portion of how we have improved
+DataFusion over the last 6 months and previews where we are heading. You can
+see a list of all changes in the detailed [CHANGELOG].
+
+[CHANGELOG]: https://github.com/apache/arrow-datafusion/blob/main/datafusion/CHANGELOG.md
+
+# Improved Performance 🚀 
+
+Performance is a key feature of DataFusion. We have made major improvements
+since `25.0.0`, resulting in a 2x overall runtime improvement in the
+[ClickBench] queries.
+
+<!--
+  Source: https://docs.google.com/spreadsheets/d/1FtI3652WIJMC5LmJbLfT3G06w0JQIxEPG4yfMafexh8/edit#gid=1879366976
+  Average runtime on 25.0.0: 7.2s (for the queries that actually ran)
+  Average runtime on 34.0.0: 3.6s (for the same queries that ran in 25.0.0)
+-->
+
+[ClickBench]: https://benchmark.clickhouse.com/
+
+<figure style="text-align: center;">
+  <img src="{{ site.baseurl }}/img/datafusion-34.0.0/compare.png" width="100%" class="img-responsive" alt="Fig 1: Performance improvements between DataFusion 25.0.0 and DataFusion 34.0.0.">
+  <figcaption>
+    Fig 1: Performance improvements between DataFusion 25.0.0 and DataFusion 34.0.0.
+    Note that several queries don't run on <code>25.0.0</code>, for various reasons such as requiring too much memory (Q33)
+    or unsupported SQL features.
+  </figcaption>
+</figure>
+
+Here are some of the specific enhancements we made:
+* [2-3x Better aggregation performance with many distinct groups]
+* Partially ordered grouping / streaming grouping
+* [Specialized operator for "TopK" `ORDER BY LIMIT XXX`] 
+* [Specialized operator for `min(col) GROUP BY .. ORDER by min(col) LIMIT XXX`]
+* [Improved Join Performance]
+* Eliminate redundant sorting with sort order aware optimizers
+
+[2-3x Better aggregation performance with many distinct groups]: https://arrow.apache.org/blog/2023/08/05/datafusion_fast_grouping/
+[Specialized operator for `min(col) GROUP BY .. ORDER by min(col) LIMIT XXX`]: https://github.com/apache/arrow-datafusion/pull/7192
+[Specialized operator for "TopK" `ORDER BY LIMIT XXX`]: https://github.com/apache/arrow-datafusion/pull/7721
+[Improved Join Performance]: https://github.com/apache/arrow-datafusion/pull/8126
+
+# New Features ✨
+
+## DML / Insert / Creating Files
+
+DataFusion now supports writing data in parallel, to individual or multiple
+files, using `Parquet`, `CSV`, `JSON`, `ARROW` and user defined formats. 
+
+You can do this using the [`CREATE EXTERNAL TABLE` statement], for example:
+
+```sql
+❯ CREATE EXTERNAL TABLE awesome_table(x INT) STORED AS PARQUET LOCATION '/tmp/my_awesome_table';
+0 rows in set. Query took 0.003 seconds.
+
+❯ INSERT INTO awesome_table SELECT x * 10 FROM my_source_table;
++-------+
+| count |
++-------+
+| 3     |
++-------+
+1 row in set. Query took 0.024 seconds.
+```
+
+[`CREATE EXTERNAL TABLE` statement]: https://arrow.apache.org/datafusion/user-guide/sql/ddl.html#create-external-table
+
+You can also create files using the [`COPY` command], similarly to [DuckDB’s `COPY`] command:
+
+[`COPY` command]: https://arrow.apache.org/datafusion/user-guide/sql/dml.html#copy
+[DuckDB’s `COPY`]: https://duckdb.org/docs/sql/statements/copy.html
+
+```sql
+❯ COPY (SELECT x + 1 FROM my_source_table) TO '/tmp/output.json';
++-------+
+| count |
++-------+
+| 3     |
++-------+
+1 row in set. Query took 0.014 seconds.
+```
+
+```shell
+$ cat /tmp/output.json
+{"x":1}
+{"x":2}
+{"x":3}
+
+$ python3
+Python 3.11.7 (main, Dec  4 2023, 18:10:11) [Clang 15.0.0 (clang-1500.1.0.2.5)] on darwin
+Type "help", "copyright", "credits" or "license" for more information.
+>>> import pyarrow.feather as ft
+>>> table = ft.read_table("/tmp/output.arrow")
+>>> print(table)
+pyarrow.Table
+x: int32
+----
+x: [[1,2,3]]
+```
+
+## Improved `STRUCT` and `ARRAY` support

Review Comment:
   @jayzhan211 are there any other improvements you think we should call out about struct/array support over the last 6 months?


