This is an automated email from the ASF dual-hosted git repository. alamb pushed a commit to branch alamb/arrow-9-blog in repository https://gitbox.apache.org/repos/asf/arrow-site.git
commit 6249e8d840f37ee594826ba5944727f03878270b Author: Andrew Lamb <[email protected]> AuthorDate: Fri Feb 4 09:03:55 2022 -0500 Blog post for arrow 9 release --- _posts/2022-02-04-rust-9.0.0.md | 139 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 139 insertions(+) diff --git a/_posts/2022-02-04-rust-9.0.0.md b/_posts/2022-02-04-rust-9.0.0.md new file mode 100644 index 0000000..12c8c54 --- /dev/null +++ b/_posts/2022-02-04-rust-9.0.0.md @@ -0,0 +1,139 @@ +--- +layout: post +title: "Recent Rust Apache Arrow and Parquet Highlights" +date: "2022-02-04 00:00:00 -0600" +author: pmc +categories: [release] +--- +<!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %} +--> + +The Rust implementation of [Apache Arrow] has just released version `9.0.0`. + +While a major version of this magnitude may shock some in the Rust +community to whom it implies a slow moving 20 year old piece of +software, nothing could be further from the truth! + +With regular and predictable bi-weekly releases, the library continues +to evolve and `9.0.0` is no exception. Some recent highlights + + +# `parquet`: async, performance, safety and nested types + +The parquet `9.0.0` release includes an `async` reader (TODO link to rustdoc +when published), a long time requested feature. Using the `async` +reader it is now possible to read only the relevant parts of a parquet +file from a networked source such as object storage. Previously the +entire file had to be buffered locally. We are hoping to add a `async` +writer in a future release and would love some +[help](https://github.com/apache/arrow-rs/issues/1269). + +It is also significantly faster to read parquet data (up to +[60x](https://github.com/apache/arrow-rs/pull/1180#issuecomment-1018518863) +in some cases) than with previous versions of the `parquet` +crate. Kudos to [tustvold](https://github.com/tustvold) and +[yordan-pavlov](https://github.com/yordan-pavlov) for their +contributions in these areas. + +With `8.0.0` and later, the code that reads and writes `RecordBatch`es +to and from Parquet now supports all types, including deeply nested +structs and lists. Thanks [helgikrs](https://github.com/helgikrs) for +cleaning up the last corner cases! + +Other notable recent additions to parquet are `UTF-8` validation on +string data for improved security against malicious inputs. + +Planned upcoming work includes [pushing more +filtering](https://github.com/apache/arrow-rs/issues/1191) directly +into the parquet scan as well as an `async` writer. + + +# `arrow`: performance, dyn kernels, and DecimalArray + +The [compute](https://docs.rs/arrow/latest/arrow/compute/index.html) +kernels have been improved significantly in arrow `9.0.0`. Some [filter +benchmarks](https://github.com/apache/arrow-rs/pull/1228#issue-1111889246) +are twice as fast and the SIMD kernels are also [significantly +faster](https://github.com/apache/arrow-rs/pull/1221). Many thanks to tustvold +[tustvold](https://github.com/tustvold) and +[jhorstmann](https://github.com/jhorstmann) + +We are working on new set of "dynamic" `dyn_` kernels (for example, +[`eq_dyn`](https://docs.rs/arrow/8.0.0/arrow/compute/kernels/comparison/fn.eq_dyn.html)) +that make it easier to invoke the heavily optimized kernels provided +by the `arrow` crate. Work is underway to expand the breadth of types +supported by these new kernels to make them even more useful. Thanks +to [matthewmturner](https://github.com/matthewmturner) and +[viirya](https://github.com/viirya) for their help in this +effort. + +While `arrow` has had basic support for `DecimalArray` since version +`3.0.0`, support has been expanded for `Decimal` type in calculation +kernels such as `sort`, `take` and `filter` thanks to some great +contributions from [liukun4515](https://github.com/liukun4515). There +is [ongoing work](https://github.com/apache/arrow-rs/pull/1223) to +improve the API ergonomics and performance of `DecimalArray` as well. + +# Security + +The `6.4.0` release resolved the last outstanding +[RUSTSEC](https://rustsec.org/) +[advisory](https://github.com/rustsec/advisory-db/pull/1131) on the +arrow crate and the `8.0.0` release resolved the last outstanding +known security issues. While these security issues were mostly limited +misuse of the low level "power user" APIs which most users do not (and +should not) be using, it is good to tighten up that area. + +Now that `arrow-rs` is releasing major versions every other week, we +are also able to update dependencies at the same pace, helping to +ensure that security fixes upstream can flow more quickly to +downstream projects. + +# Final shoutout +It takes a community to build great software, and we would like to +thank everyone who has contributed to the arrow-rs repository since +the `7.0.0` release: + +```console +git shortlog -sn 7.0.0..9.0.0 + 22 Raphael Taylor-Davies + 18 Andrew Lamb + 6 Helgi Kristvin Sigurbjarnarson + 6 Remzi Yang + 5 Jörn Horstmann + 4 Liang-Chi Hsieh + 3 Jiayu Liu + 2 dependabot[bot] + 2 Yijie Shen + 1 Matthew Turner + 1 Kun Liu + 1 Yang + 1 Edd Robinson + 1 Patrick More +``` + + +# How to Get Involved + +If you are interested in contributing to the Rust subproject in Apache Arrow, you can find a list of open issues +suitable for beginners [here](https://github.com/apache/arrow-rs/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) +and the full list [here](https://github.com/apache/arrow-rs/issues). + +Other ways to get involved include trying out Arrow on some of your data and filing bug reports, and helping to +improve the documentation.
