This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git
The following commit(s) were added to refs/heads/master by this push:
new 7c9d338 Add DataFusion 6.0.0 blog (#160)
7c9d338 is described below
commit 7c9d338606aeb8bbe975c2ba5a3f3652d7f06b59
Author: Andrew Lamb <[email protected]>
AuthorDate: Fri Nov 19 13:53:00 2021 -0500
Add DataFusion 6.0.0 blog (#160)
* Add DataFusion 6.0.0 blog
* Add direct links to async planning PR and doc
* Update _posts/2021-11-8-datafusion-6.0.0.md
Co-authored-by: Carlos <[email protected]>
* Clean up comments / commands
* Update _posts/2021-11-8-datafusion-6.0.0.md
* add more changelog
* Update _posts/2021-11-8-datafusion-6.0.0.md
Co-authored-by: Carlos <[email protected]>
Co-authored-by: Qingping Hou <[email protected]>
---
_posts/2021-11-8-datafusion-6.0.0.md | 159 +++++++++++++++++++++++++++++++++++
1 file changed, 159 insertions(+)
diff --git a/_posts/2021-11-8-datafusion-6.0.0.md
b/_posts/2021-11-8-datafusion-6.0.0.md
new file mode 100644
index 0000000..49da4ae
--- /dev/null
+++ b/_posts/2021-11-8-datafusion-6.0.0.md
@@ -0,0 +1,159 @@
+---
+layout: post
+title: Apache Arrow DataFusion 6.0.0 Release
+date: "2021-11-19 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[DataFusion](https://arrow.apache.org/datafusion/) is an embedded
+query engine which leverages the unique features of
+[Rust](https://www.rust-lang.org/) and [Apache
+Arrow](https://arrow.apache.org/) to provide a system that is high
+performance, easy to connect, easy to embed, and high quality.
+
+The Apache Arrow team is pleased to announce the DataFusion 6.0.0 release.
This covers 4 months of development work
+and includes 134 commits from the following 28 distinct contributors.
+
+<!--
+git log --pretty=oneline 5.0.0..6.0.0 datafusion datafusion-cli
datafusion-examples | wc -l
+ 134
+
+git shortlog -sn 5.0.0..6.0.0 datafusion datafusion-cli datafusion-examples |
wc -l
+ 29
+
+ Carlos and xudong963 are same individual
+-->
+
+```
+ 28 Andrew Lamb
+ 26 Jiayu Liu
+ 13 xudong963
+ 9 rdettai
+ 9 QP Hou
+ 6 Matthew Turner
+ 5 Daniël Heres
+ 4 Guillaume Balaine
+ 3 Francis Du
+ 3 Marco Neumann
+ 3 Jon Mease
+ 3 Nga Tran
+ 2 Yijie Shen
+ 2 Ruihang Xia
+ 2 Liang-Chi Hsieh
+ 2 baishen
+ 2 Andy Grove
+ 2 Jason Tianyi Wang
+ 1 Nan Zhu
+ 1 Antoine Wendlinger
+ 1 Krisztián Szűcs
+ 1 Mike Seddon
+ 1 Conner Murphy
+ 1 Patrick More
+ 1 Taehoon Moon
+ 1 Tiphaine Ruy
+ 1 adsharma
+ 1 lichuan6
+```
+
+The release notes below are not exhaustive and only expose selected highlights
of the release. Many other bug fixes
+and improvements have been made: we refer you to the complete
+[changelog](https://github.com/apache/arrow-datafusion/blob/6.0.0/datafusion/CHANGELOG.md).
+
+# New Website
+
+Befitting a growing project, DataFusion now has its
+[own website](https://arrow.apache.org/datafusion/) hosted as part of the
+main [Apache Arrow Website](https://arrow.apache.org)
+
+# Roadmap
+
+The community worked to gather their thoughts about where we are
+taking DataFusion into a public
+[Roadmap](https://arrow.apache.org/datafusion/specification/roadmap.html)
+for the first time
+
+# New Features
+
+- Runtime operator metrics collection framework
+- Object store abstraction for unified access to local or remote storage
+- Hive style table partitioning support, for Parquet, CSV, Avro and Json files
+- DataFrame API support for: `except`, `intersect`, `show`, `limit` and window
functions
+- SQL
+ - `EXPLAIN ANALYZE` with runtime metrics
+ - `trim ( [ LEADING | TRAILING | BOTH ] [ FROM ] string text [, characters
text ] )` syntax
+ - Postgres style regular expression matching operators `~`, `~*`, `!~`, and
`!~*`
+ - SQL set operators `UNION`, `INTERSECT`, and `EXCEPT`
+ - `cume_dist`, `percent_rank` window functions
+ - `digest`, `blake2s`, `blake2b`, `blake3` crypto functions
+ - HyperLogLog based `approx_distinct`
+ - `is distinct from` and `is not distinct from`
+ - `CREATE TABLE AS SELECT`
+ - Accessing elements of nested `Struct` and `List` columns (e.g. `SELECT
struct_column['field_name'], array_column[0] FROM ...`)
+ - Boolean expressions in `CASE` statement
+ - `DROP TABLE`
+ - `VALUES` List
+ - Postgres regex match operators
+- Support for Avro format
+- Support for `ScalarValue::Struct`
+- Automatic schema inference for CSV files
+- Better interactive editing support in `datafusion-cli` as well as `psql`
style commands such as `\d`, `\?`, and `\q`
+- Generic constant evaluation and simplification framework
+- Added common subexpression eliminate query plan optimization rule
+- Python binding 0.4.0 with all Datafusion 6.0.0 features
+
+With these new features, we are also now passing TPC-H queries 8, 13 and 21.
+
+For the full list of new features with their relevant PRs, see the
+[enhancements
section](https://github.com/apache/arrow-datafusion/blob/6.0.0/datafusion/CHANGELOG.md)
+in the changelog.
+
+# `async` planning and decoupling file format from table layout
+
+Driven by the need to support Hive style table partitioning, @rdettai
+introduced the following design change to the Datafusion core.
+
+- The code for reading specific file formats (`Parquet`, `Avro`, `CSV`, and
+`JSON`) was separated from the logic that handles grouping sets of
+files into execution partitions.
+- The query planning process was made `async`.
+
+As a result, we are able to replace the old `Parquet`, `CSV` and `JSON` table
+providers with a single `ListingTable` table provider.
+
+This also sets up DataFusion and its plug-in ecosystem to
+supporting a wide range of catalogs and various object store implementations.
+You can read more about this change in the
+[design
document](https://docs.google.com/document/d/1Bd4-PLLH-pHj0BquMDsJ6cVr_awnxTuvwNJuWsTHxAQ)
+and on the [arrow-datafusion#1010
PR](https://github.com/apache/arrow-datafusion/pull/1010).
+
+# How to Get Involved
+
+If you are interested in contributing to DataFusion, we would love to have
you! You
+can help by trying out DataFusion on some of your own data and projects and
filing bug reports and helping to
+improve the documentation, or contribute to the documentation, tests or code.
A list of open issues suitable for
+beginners is
[here](https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22)
+and the full list is [here](https://github.com/apache/arrow-datafusion/issues).
+
+Check out our new [Communication
Doc](https://arrow.apache.org/datafusion/community/communication.html) on more
+ways to engage with the community.