[GitHub] [arrow-site] vertexclique commented on a change in pull request #81: [Website] Rust release notes / blog post

vertexclique commented on a change in pull request #81:
URL: https://github.com/apache/arrow-site/pull/81#discussion_r509183577

## File path: _posts/2020-10-19-rust-2.0.0-release.md ##
@@ -0,0 +1,203 @@
+---
+layout: post
+title: "Apache Arrow 2.0.0 (Rust)"
+date: "2020-10-19 00:00:00 -0600"
+author: pmc
+categories: [release]
+---
+
+
+Apache Arrow 2.0.0 is a significant release for the Apache Arrow project in general, and the Rust subproject in
+particular with almost 200 issues resolved by 15 contributors. In this blog post we will go through the main changes
+affecting core Arrow, Parquet support and DataFusion query engine. The full list of resolved issues can be found
+[here][1].
+
+While the Java and C/C++ (used by Python and R) Arrow implementations likely remain the most feature rich, with the
+2.0.0 release, the Rust implementation is closing the feature gap quickly. Here are some of the highlights for this
+release.
+
+# Core Arrow Crate
+
+## Iterator Trait
+
+- Primitive arrays (e.g. array of integers) can now be converted to, and initialized from, an iterator. This exposes a
+very popular API - iterators - to arrow arrays. Work for other types will continue throughout 3.0.0.
+
+## Improved Variable-sized Arrays
+
+- Variable sized arrays (e.g. array of strings) have been internally refactored to more easily support their larger (64
+bit size offset) version. This allowed us to generalize some of the kernels to both (32 and 64) versions, and also
+perform type checks when building them.
+
+## Kernels
+
+There have been numerous improvements in the Arrow compute kernels, including:
+
+- New kernels have been added for string operations, including substring, min, max, concat, and length.
+- Aggregate sum is now implemented for SIMD with a 5x improvement over the non-SIMD operation
+- Many kernels have been improved to support dictionary-encoded arrays
+- Some kernels were optimized for arrays without nulls, making them significantly faster in that case.
+- Many kernels were optimized in the number of memory copies that are needed to apply them, and also on their
+implementation.
+
+## Other Improvements
+
+The Array trait now has `get_buffer_memory_size` and `get_array_memory_size` methods for determining the amount of
+memory allocated for the array.
+
+# Parquet
+
+Significant effort is underway to create a Parquet writer for Arrow data. This work has not been released as part of
+2.0.0, and is planned for the 3.0.0 release. The development of this writer is being carried out on the
+[rust-parquet-arrow-writer][2] branch, and the branch is regularly synchronized with the main branch.
+As part of the writer, the necessary improvements and features are being added to the reader.
+
+The main focus areas are:
+- Supporting nested Arrow types, such as `List>`
+- Ensuring correct round-trip between the reader and writer by encoding Arrow schemas in the Parquet metadata
+- Improve null value writing for Parquet
+
+A new `parquet_derive` crate has been created, which allows users to derive Parquet records for simple structs. Refer to
+the [parquet_derive crate][3] for usage examples.
+
+# DataFusion
+
+DataFusion is an in-memory query engine with DataFrame and SQL APIs, built on top of base Arrow support.
+
+## DataFrame API
+
+DataFusion now has a richer [DataFrame API][4] with improved documentation showing example usage,
+supporting the following operations:
+
+- select_columns
+- select
+- filter
+- aggregate
+- limit
+- sort
+- collect
+- explain
+
+## Performance & Scalability
+
+DataFusion query execution now uses `async`/`await` with the tokio threaded runtime rather than launching dedicated
+threads, making queries scale much better across available cores.
+
+The hash aggregate physical operator has been largely re-written, resulting in significant performance improvements.
+
+## Expressions and Compute
+
+### Improved Scalar Functions
+
+DataFusion has many new functions, both in the SQL and the DataFrame API:
+- Length of an string
+- COUNT(DISTINCT column)
+- to_timestamp
+- IsNull and IsNotNull
+- Min/Max for strings (lexicographic order)
+- Array of columns
+- Concatenation of strings
+- Aliases of aggregate expressions
+
+Many existing expressions were also significantly optimized (2-3x speedups) by avoiding memory copies and leveraging
+Arrow format’s invariants.
+
+Unary mathematical functions (such as sqrt) now support both 32 and 64 bit floats, and return the corresponding type,
+thereby allowing faster operations when higher precision is not needed.
+
+### Improved User-defined Functions (UDFs)
+The API to use and register UDFs has been significantly improved, allowing users to register UDFs and call them both
+via SQL and the DataFrame API. UDFs now also have the same generality as DataFusion’s own functions, including variadic
+and dynamically typed arguments.
+
+### User-defined Aggregate Functions (UDAFs)

vertexclique commented on a change in pull request #81:
URL: https://github.com/apache/arrow-site/pull/81#discussion_r508892936

## File path: _posts/2020-10-19-rust-2.0.0-release.md ##

vertexclique commented on a change in pull request #81:
URL: https://github.com/apache/arrow-site/pull/81#discussion_r509177428

## File path: _posts/2020-10-19-rust-2.0.0-release.md ##

vertexclique commented on a change in pull request #81:
URL: https://github.com/apache/arrow-site/pull/81#discussion_r508889318

## File path: _posts/2020-10-19-rust-2.0.0-release.md ##
+## Improved Variable-sized Arrays
+
+- Variable sized arrays (e.g. array of strings) have been internally refactored to more easily support their larger (64

Review comment:
```suggestion
- Variable sized arrays (e.g., array of strings) have been internally refactored to more easily support their larger (64-bit
```

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

vertexclique commented on a change in pull request #81:
URL: https://github.com/apache/arrow-site/pull/81#discussion_r508887309

## File path: _posts/2020-10-19-rust-2.0.0-release.md ##
+Apache Arrow 2.0.0 is a significant release for the Apache Arrow project in general, and the Rust subproject in
+particular with almost 200 issues resolved by 15 contributors. In this blog post we will go through the main changes

Review comment:
```suggestion
particularly with almost 200 issues resolved by 15 contributors. In this blog post, we will go through the main changes
```

## File path: _posts/2020-10-19-rust-2.0.0-release.md ##
+While the Java and C/C++ (used by Python and R) Arrow implementations likely remain the most feature rich, with the

Review comment:
```suggestion
While the Java and C/C++ (used by Python and R) Arrow implementations likely remain the most feature-rich, with the
```

## File path: _posts/2020-10-19-rust-2.0.0-release.md ##
+## Iterator Trait
+
+- Primitive arrays (e.g. array of integers) can now be converted to, and initialized from, an iterator. This exposes a

Review comment:
```suggestion
- Primitive arrays (e.g., array of integers) can now be converted to and initialized from an iterator. This exposes a
```

## File path: _posts/2020-10-19-rust-2.0.0-release.md ##
+## Improved Variable-sized Arrays
+
+- Variable sized arrays (e.g. array of strings) have been internally refactored to more easily support their larger (64

Review comment:
```suggestion
- Variable sized arrays (e.g., array of strings) have been internally refactored to more easily support their larger (64-bit
```

## File path: _posts/2020-10-19-rust-2.0.0-release.md ##
@@ -0,0 +1,203 @@
+---
+layout: post
+title: "Apache Arrow 2.0.0 (Rust)"
+date: "2020-10-19 00:00:00 -0600"
+author: pmc
+categories: [release]
+---
+
+
+Apache Arrow 2.0.0 is a significant release for the Apache Arrow project in general, and the Rust subproject in
+particular with almost 200 issues resolved by 15 contributors. In this blog post we will go through the main changes
+affecting core Arrow, Parquet support and DataFusion query engine. The full list of resolved issues can be found
+[here][1].
+
+While the Java and C/C++ (used by Python and R) Arrow implementations likely remain the most feature rich, with the
+2.0.0 release, the Rust