[GitHub] [arrow-site] alamb commented on a diff in pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

GitBox Sat, 22 Oct 2022 03:12:50 -0700


alamb commented on code in PR #254:
URL: https://github.com/apache/arrow-site/pull/254#discussion_r1002434601



##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,226 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) 
[`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog 
contains an update on the project for the 5 months since our [last update in 
May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine written in Rust and used 
to create modern, fast and efficient data pipelines, ETL processes, and 
database systems. You may want to check out DataFusion to extend your Rust 
project with:
+
+- [SQL 
support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame 
API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or 
CSV data.
+- The ability to read from remote object stores such as AWS S3, Azure Blob Storage, and GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community 
growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial 
projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of 
the early open source projects to provide this capability. 2022 has validated 
our belief in the need for such a ["LLVM for database and AI 
systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf)
 with announcements such as the [release of Facebook's 
Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the 
major investments in 
[Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as 
the continued popularity of [Apache Calcite](https://calcite.apache.org/) and 
other similar technologies.
+
+While Velox and Acero focus on execution engines, DataFusion provides the 
entire suite of components needed to build most analytic systems, including a 
SQL frontend, a dataframe API, and  extension points for just about everything. 
Some [DataFusion users](https://github.com/apache/arrow-datafusion#known-uses) 
use a subset of the features such as the frontend (e.g. 
[dask-sql](https://dask-sql.readthedocs.io/en/latest/)) or the execution engine, 
such as [Blaze](https://github.com/blaze-init/blaze), and some users use many 
different components to build both SQL based and customized DSL based systems 
such as [InfluxDB IOx](https://github.com/influxdata/influxdb_iox/pulls) and 
[VegaFusion](https://github.com/vegafusion/vegafusion).
+
+One of DataFusion’s advantages is its implementation in 
[Rust](https://www.rust-lang.org/) and thus its easy integration with the 
broader Rust ecosystem. Rust continues to be a major source of benefit, from 
the [ease of parallelization with the high quality and standardized `async` 
ecosystem](https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/)
 to its modern dependency management system and wonderful 
performance. <!-- I wonder if we should link to clickbench?? -->
+<!--While we haven’t invested in the benchmarking ratings game datafusion 
continues to be quite speedy (todo quantity this, with some evidence) – maybe 
clickbench?-->
+
+<!-- I am not sure this section needed -->
+DataFusion belongs to and builds on the Apache Arrow ecosystem and its family 
of related technologies and projects in Rust. This includes:
+- [Arrow](https://arrow.apache.org/): low level array representation
+- [Parquet](https://parquet.apache.org/): high performance, full featured 
support for the parquet file format
+
+
+
+# DataFusion in Action
+
+While DataFusion really shines as an embeddable query engine, if you want to 
try it out and get a feel for its power, you can use the basic 
[`datafusion-cli`](https://docs.rs/datafusion-cli/13.0.0/datafusion_cli/) 
tool to see what is possible to add to your application.
+
+(TODO example here of using datafusion-cli to query from local parquet files 
on disk)
+
+TODO: also mention you can use the same thing to query data from S3
+
+
+
+# Summary
+
+We have increased the frequency of DataFusion releases to monthly instead of 
quarterly. This
+makes it easier for the increasing number of projects that now depend on 
DataFusion to pick up new features and fixes.
+
+We have also completed the "graduation" of [Ballista to its own top-level 
arrow-ballista 
repository](https://docs.google.com/document/d/1jNRbadyStSrV5kifwn0khufAwq6OnzGczG4z8oTQJP4/edit?usp=sharing)
+which decouples the two projects and allows each project to move even faster.
+
+Along with numerous other bug fixes and smaller improvements, here are some of 
the major advances:
+
+## Advanced SQL
+- Support for GROUPING SETS/CUBE/ROLLUP (#2716)
+- Custom window frame logic (support `ROWS`, `RANGE`, `PRECEDING` and 
`FOLLOWING` for window functions) (#3570)
+- Add support for correlated subqueries (#2885)
+- `ROLLUP` and `CUBE` grouping set expressions (#2446)
+- `SUM DISTINCT` aggregate support (#2405)
+- Support for `IN` and `NOT IN` Subqueries by rewriting them to `SEMI` / 
`ANTI` (#2421)
+- Support for non-equality predicates in the `ON` clause of `LEFT`, `RIGHT`, and 
`FULL` joins (#2591)
+- Exact `MEDIAN` (#3009)
+
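As a rough illustration of what `ROLLUP` and `CUBE` mean, the sketch below expands each into the list of grouping sets it stands for and then sums a value once per set. This is plain Python over in-memory rows, not DataFusion code: the helper names (`rollup`, `cube`, `grouped_sums`) and the sample data are hypothetical, and the sketch assumes the input contains no NULLs, so `None` in a key always means "column rolled up".

```python
from itertools import combinations


def rollup(columns):
    """ROLLUP(a, b) stands for the grouping sets (a, b), (a), ()."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]


def cube(columns):
    """CUBE(a, b) stands for every subset of the columns, largest first."""
    return [
        subset
        for size in range(len(columns), -1, -1)
        for subset in combinations(columns, size)
    ]


def grouped_sums(rows, columns, grouping_sets, value):
    """Sum `value` once per grouping set; rolled-up columns become None."""
    totals = {}
    for grouping_set in grouping_sets:
        for row in rows:
            key = tuple(row[c] if c in grouping_set else None for c in columns)
            totals[key] = totals.get(key, 0) + row[value]
    return totals


# Hypothetical sample data, for illustration only.
rows = [
    {"region": "eu", "product": "a", "sales": 1},
    {"region": "eu", "product": "b", "sales": 2},
    {"region": "us", "product": "a", "sales": 4},
]
columns = ["region", "product"]
totals = grouped_sums(rows, columns, rollup(columns), "sales")
```

Here `totals` holds one entry per (region, product) pair, one subtotal per region, and a single grand total under the key `(None, None)`.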
+## More DDL Support
+ - `CREATE VIEW` (#2279)
+ - `DESCRIBE <table>` (#2642)
+ - Custom / Dynamic table provider factories (#3311)
+ - `SHOW CREATE TABLE` support for views (#2830)
+
+## Faster Execution
+ - Optimizations of TopK (queries with a `LIMIT` or `OFFSET` clause):  
(#3527), (#2521)
+ - Reduce `left`/`right`/`full` joins to `inner` join (#2750)
+ - Convert  cross joins to inner joins when possible (#3482)
+ - Sort preserving `SortMergeJoin` (#2699)
+ - Improvements in group by and sort performance (#2375)
+
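The general idea behind a TopK optimization can be sketched as follows: for a query shaped like `ORDER BY x LIMIT k`, keep only the `k` best values seen so far in a bounded heap instead of sorting the whole input. This is a minimal Python sketch of that technique under the assumption of numeric sort keys; it is not a description of how DataFusion implements it, and the function name `top_k` is hypothetical.

```python
import heapq


def top_k(values, k):
    """Return the k smallest values in ascending order without a full sort."""
    if k <= 0:
        return []
    # Store negated values so that heap[0] is the largest value currently kept.
    heap = []
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, -v)
        elif v < -heap[0]:
            # v beats the worst kept value, so swap it in.
            heapq.heapreplace(heap, -v)
    return sorted(-x for x in heap)


# Hypothetical sample input, for illustration only.
smallest_two = top_k([5, 1, 4, 2, 3], 2)
```

The heap never grows beyond `k` entries, so memory stays proportional to the limit rather than to the input size.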
+## Optimizer Enhancements
+- Casting / coercion now happens during logical planning (#3185) (#3396) (#3636)
+- More sophisticated expression analysis and simplification
+
+## Parquet
+ - Parquet reader can now read directly from remote parquet files on object 
storage (#2489) (#2677) (#3051)
+ - Experimental support for “predicate pushdown” that implements late 
materialization after filtering during the scan (we plan another blog post on 
this soon).
+ - Support reading directly from S3 via `datafusion-cli` (#3631)
+
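To make "late materialization" concrete, here is a minimal Python sketch of the idea: evaluate the predicate on the filter column first, then materialize the remaining columns only for the rows that passed. Plain lists stand in for encoded Parquet column chunks, and the function name `scan_with_pushdown` and the sample data are hypothetical; this is not the DataFusion or Parquet reader API.

```python
def scan_with_pushdown(columns, filter_column, predicate, projection):
    """Filter on one column, then read the projected columns for matches only."""
    # Step 1: look only at the filter column and record the matching row indexes.
    selected = [i for i, v in enumerate(columns[filter_column]) if predicate(v)]
    # Step 2: materialize the projected columns just for the selected rows.
    return {name: [columns[name][i] for i in selected] for name in projection}


# Hypothetical sample data, for illustration only.
columns = {
    "id": [1, 2, 3, 4],
    "temp": [10, 35, 40, 5],
    "name": ["a", "b", "c", "d"],
}
result = scan_with_pushdown(columns, "temp", lambda t: t > 30, ["id", "name"])
```

The benefit in a real reader comes from step 2: values in the projected columns that belong to filtered-out rows never have to be decoded.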
+## DataType Support
+- Support for `IN` lists, casting, etc. for `Decimal`
+- Timestamps: `date_bin` built-in function (#3034), timestamp plus minus 
interval (#3110), TIME literal values (#3010)
+- Binary operations (AND, XOR, etc):  (#3037) (#1619) (#3420) (#3430)
+
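The behavior of a `date_bin`-style function can be sketched in a few lines of Python: it snaps a timestamp down to the start of the fixed-width bin that contains it, counting bins from an origin. The argument order (stride, source, origin) and the default origin of the Unix epoch are assumptions modeled on the PostgreSQL function of the same name, not taken from the DataFusion source.

```python
from datetime import datetime, timedelta


def date_bin(stride, source, origin=datetime(1970, 1, 1)):
    """Return the start of the `stride`-wide bin that contains `source`."""
    if stride <= timedelta(0):
        raise ValueError("stride must be positive")
    # Floor division of two timedeltas yields a whole number of bins,
    # rounding toward negative infinity for timestamps before the origin.
    bins = (source - origin) // stride
    return origin + bins * stride


# Hypothetical sample timestamp, for illustration only.
binned = date_bin(timedelta(minutes=15), datetime(2022, 10, 20, 10, 37, 12))
```

In this sketch `binned` is 10:30:00 on the same day, since 10:37:12 falls in the 15-minute bin that starts at 10:30.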
+
+
+# Upcoming Work
+With the community growing and code accelerating, there is so much great stuff 
on the horizon. Some features we expect to land in the next few months:
+
+- [Complete Parquet 
Pushdown](https://github.com/apache/arrow-datafusion/issues/3462)
+- [Additional date/time 
support](https://github.com/apache/arrow-datafusion/issues/3148)
+- Make it easier to implement FlightSQL using DataFusion (TODO LINK)
+

Review Comment:
   Absolutely -- done in 824648efdf





-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
