[GitHub] [arrow-site] nevi-me commented on a change in pull request #81: [Website] Rust release notes / blog post

2020-10-25 Thread GitBox


nevi-me commented on a change in pull request #81:
URL: https://github.com/apache/arrow-site/pull/81#discussion_r511649461



##
File path: _posts/2020-10-19-rust-2.0.0-release.md
##
@@ -0,0 +1,203 @@
+---
+layout: post
+title: "Apache Arrow 2.0.0 (Rust)"

Review comment:
   I like "Apache Arrow 2.0.0 Rust Highlights"





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow-site] nevi-me commented on a change in pull request #81: [Website] Rust release notes / blog post

2020-10-25 Thread GitBox


nevi-me commented on a change in pull request #81:
URL: https://github.com/apache/arrow-site/pull/81#discussion_r511649044



##
File path: _posts/2020-10-19-rust-2.0.0-release.md
##
@@ -0,0 +1,207 @@
+---
+layout: post
+title: "Apache Arrow 2.0.0 Rust Highlights"
+date: "2020-10-23 00:00:00 -0600"
+author: pmc
+categories: [release]
+---
+
+
+Apache Arrow 2.0.0 is a significant release for the Apache Arrow project in general, and the Rust subproject
+in particular, with almost 200 issues resolved by 15 contributors. In this blog post, we will go through the main
+changes affecting core Arrow, Parquet support, and the DataFusion query engine. The full list of resolved issues
+can be found [here][1].
+
+While the Java and C/C++ (used by Python and R) Arrow implementations likely remain the most feature-rich, the
+Rust implementation is closing the feature gap quickly with the 2.0.0 release. Here are some of the highlights of
+this release.
+
+# Core Arrow Crate
+
+## Iterator Trait
+
+- Primitive arrays (e.g., arrays of integers) can now be converted to and initialized from an iterator. This
+exposes a very popular API - iterators - to Arrow arrays. Work on other types will continue throughout 3.0.0.
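
As a rough sketch of the new API (assuming the `FromIterator` implementation for `Int32Array` in this release), an iterator of optional values can be collected directly into a primitive array:

```rust
use arrow::array::{Array, Int32Array};

fn main() {
    // Collect an iterator of Option<i32> into a primitive array;
    // each None item becomes a null slot in the resulting array.
    let array: Int32Array = (0..5)
        .map(|i| if i % 2 == 0 { Some(i) } else { None })
        .collect();
    assert_eq!(array.len(), 5);
    assert!(array.is_null(1));
}
```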
+
+## Improved Variable-sized Arrays
+
+- Variable-sized arrays (e.g., arrays of strings) have been internally refactored to more easily support their
+larger (64-bit offset) versions. This allowed us to generalize some of the kernels to both (32-bit and 64-bit)
+versions and to perform type checks when building them.
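
For example (a sketch assuming the `StringArray`/`LargeStringArray` pair exposed by the crate), both offset widths now share one generic implementation:

```rust
use arrow::array::{LargeStringArray, StringArray};

fn main() {
    // i32 offsets: total string data must stay within 2^31 - 1 bytes.
    let small = StringArray::from(vec!["a", "bb", "ccc"]);
    // i64 offsets: the same API, for arrays beyond the 32-bit limit.
    let large = LargeStringArray::from(vec!["a", "bb", "ccc"]);
    assert_eq!(small.value(1), large.value(1));
}
```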
+
+## Kernels
+
+There have been numerous improvements in the Arrow compute kernels, including:
+
+- New kernels have been added for string operations, including substring, min, max, concat, and length.
+- The aggregate sum kernel now has a SIMD implementation, with a 5x improvement over the non-SIMD operation.
+- Many kernels have been improved to support dictionary-encoded arrays.
+- Some kernels were optimized for arrays without nulls, making them significantly faster in that case.
+- Many kernels were optimized to reduce the number of memory copies needed to apply them, and their
+implementations were also improved; a sketch of a kernel call follows this list.
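
For example, invoking the aggregate sum kernel (a sketch assuming the `arrow::compute::kernels::aggregate::sum` path of this era of the crate):

```rust
use arrow::array::Int32Array;
use arrow::compute::kernels::aggregate::sum;

fn main() {
    let array = Int32Array::from(vec![Some(1), None, Some(3)]);
    // Null slots are skipped; None is returned only when
    // the array contains no valid values at all.
    assert_eq!(sum(&array), Some(4));
}
```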
+
+## Other Improvements
+
+The Array trait now has `get_buffer_memory_size` and `get_array_memory_size` methods for determining the
+amount of memory allocated for the array.
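
A brief sketch of the new methods (the exact byte counts depend on allocation, so only the relationship between the two is asserted):

```rust
use arrow::array::{Array, Int64Array};

fn main() {
    let array = Int64Array::from(vec![1, 2, 3, 4]);
    // Bytes held by the array's buffers (values, plus nulls if any).
    let buffers = array.get_buffer_memory_size();
    // Buffer bytes plus the size of the array's own structs.
    let total = array.get_array_memory_size();
    assert!(total >= buffers);
}
```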
+
+# Parquet
+
+A significant effort is underway to create a Parquet writer for Arrow data. This work has not been released as
+part of 2.0.0, and is planned for the 3.0.0 release. The development of this writer is being carried out on the
+[rust-parquet-arrow-writer][2] branch, which is regularly synchronized with the main branch.
+As part of the writer work, the necessary improvements and features are also being added to the reader.
+
+The main focus areas are:
+- Supporting nested Arrow types, such as `List<List<T>>`
+- Ensuring a correct round-trip between the reader and writer by encoding Arrow schemas in the Parquet metadata
+- Improving null value writing for Parquet
+
+A new `parquet_derive` crate has been created, which allows users to derive Parquet record writers for simple
+structs. Refer to the [parquet_derive crate][3] for usage examples; a brief sketch also follows.
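
As a rough sketch (the `ParquetRecordWriter` derive name is taken from the crate; the writer wiring is omitted here), deriving a writer for a flat struct looks like:

```rust
use parquet_derive::ParquetRecordWriter;

// The derive generates code that can write a slice of these
// records into a Parquet row group.
#[derive(ParquetRecordWriter)]
struct Sensor {
    id: i64,
    name: String,
    reading: f64,
}
```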
+
+# DataFusion
+
+DataFusion is an in-memory query engine with DataFrame and SQL APIs, built on top of the base Arrow support.
+
+## DataFrame API
+
+DataFusion now has a richer [DataFrame API][4] with improved documentation showing example usage,
+supporting the following operations (a usage sketch follows the list):
+
+- select_columns
+- select
+- filter
+- aggregate
+- limit
+- sort
+- collect
+- explain
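
A rough usage sketch (assuming an `ExecutionContext` with a table `t` already registered; the method signatures are as understood for this release and may differ):

```rust
use datafusion::error::Result;
use datafusion::execution::context::ExecutionContext;
use datafusion::logical_plan::col;

fn example(ctx: &mut ExecutionContext) -> Result<()> {
    // Build a query plan with the DataFrame operations listed above.
    let df = ctx.table("t")?;
    let df = df
        .select_columns(vec!["a", "b"])?
        .sort(vec![col("a").sort(true, true)])?
        .limit(10)?;
    let _plan = df.to_logical_plan();
    Ok(())
}
```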
+
+## Performance & Scalability
+
+DataFusion query execution now uses `async`/`await` with the tokio threaded runtime rather than launching
+dedicated threads, making queries scale much better across available cores.
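
Since execution is now driven by async tasks, results are awaited on a tokio runtime; a minimal sketch, assuming `DataFrame::collect` became an `async fn` in this release:

```rust
use datafusion::error::Result;
use datafusion::execution::context::ExecutionContext;

#[tokio::main]
async fn main() -> Result<()> {
    let mut ctx = ExecutionContext::new();
    // Assumes a table "t" was registered beforehand.
    let df = ctx.table("t")?;
    // The query runs as async tasks scheduled across tokio's
    // worker threads rather than on dedicated threads.
    let _batches = df.collect().await?;
    Ok(())
}
```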
+
+The hash aggregate physical operator has been largely rewritten, resulting in significant performance
+improvements.
+
+## Expressions and Compute
+
+### Improved Scalar Functions
+
+DataFusion has many new functions, available both in the SQL and the DataFrame APIs (a SQL sketch follows
+the list):
+- Length of a string
+- COUNT(DISTINCT column)
+- to_timestamp
+- IsNull and IsNotNull
+- Min/Max for strings (lexicographic order)
+- Array of columns
+- Concatenation of strings
+- Aliases of aggregate expressions
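
For instance, several of the new functions used from SQL (a sketch assuming a registered table `t` with columns `name` and `ts`, and assuming `ExecutionContext::sql` returns a `DataFrame`):

```rust
use datafusion::error::Result;
use datafusion::execution::context::ExecutionContext;

fn example(ctx: &mut ExecutionContext) -> Result<()> {
    // String length and timestamp parsing in a projection.
    let _df = ctx.sql("SELECT length(name), to_timestamp(ts) FROM t")?;
    // Distinct count as an aggregate over the whole table.
    let _agg = ctx.sql("SELECT COUNT(DISTINCT name) FROM t")?;
    Ok(())
}
```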
+
+Many existing expressions were also significantly optimized (2-3x speedups) by avoiding memory copies and by
+leveraging the Arrow format’s invariants.
+
+Unary mathematical functions (such as sqrt) now support both 32-bit and 64-bit floats and return the
+corresponding type, thereby allowing faster operations when higher precision is not needed.
+
+### Improved User-defined Functions (UDFs)
+The API to define and register UDFs has been significantly improved, allowing users to register UDFs and call
+them both via SQL and via the DataFrame API. UDFs now also have the same generality as DataFusion’s built-in
+functions, including variadic and dynamically typed arguments. A registration sketch follows.
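
A rough sketch of registering a UDF (the `create_udf` helper and its path under `datafusion::physical_plan::udf` are assumptions for this version and may differ):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Float64Array};
use arrow::datatypes::DataType;
use datafusion::error::Result;
use datafusion::execution::context::ExecutionContext;
use datafusion::physical_plan::udf::create_udf;

fn register(ctx: &mut ExecutionContext) {
    // A scalar UDF that doubles a Float64 column.
    let double = |args: &[ArrayRef]| -> Result<ArrayRef> {
        let input = args[0]
            .as_any()
            .downcast_ref::<Float64Array>()
            .expect("Float64 input");
        let output: Float64Array =
            input.iter().map(|v| v.map(|x| x * 2.0)).collect();
        Ok(Arc::new(output) as ArrayRef)
    };
    // Once registered, "double" is callable from SQL and the DataFrame API.
    ctx.register_udf(create_udf(
        "double",
        vec![DataType::Float64],
        Arc::new(DataType::Float64),
        Arc::new(double),
    ));
}
```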
+
+### User-defined Aggregate Functions (UDAFs)

[arrow] branch master updated: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

2020-10-25 Thread nevime
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new 7155cd5  ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources
7155cd5 is described below

commit 7155cd5488310c15d864428252ca71dd9ebd3b48
Author: rdettai 
AuthorDate: Sun Oct 25 22:25:14 2020 +0200

ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

https://issues.apache.org/jira/browse/ARROW-10135

Closes #8300 from rdettai/ARROW-10135-parquet-file-reader

Authored-by: rdettai 
Signed-off-by: Neville Dipale 
---
 rust/parquet/src/arrow/array_reader.rs           |  16 +-
 rust/parquet/src/file/footer.rs                  | 263 ++
 rust/parquet/src/file/mod.rs                     |   5 +
 rust/parquet/src/file/reader.rs                  | 992 +
 .../src/file/{reader.rs => serialized_reader.rs} | 569 ++--
 rust/parquet/src/file/writer.rs                  |   4 +-
 rust/parquet/src/util/cursor.rs                  | 113 +++
 rust/parquet/src/util/io.rs                      |  22 +-
 rust/parquet/src/util/mod.rs                     |   1 +
 9 files changed, 542 insertions(+), 1443 deletions(-)

diff --git a/rust/parquet/src/arrow/array_reader.rs 
b/rust/parquet/src/arrow/array_reader.rs
index 14bf7d2..b9db4f8 100644
--- a/rust/parquet/src/arrow/array_reader.rs
+++ b/rust/parquet/src/arrow/array_reader.rs
@@ -953,7 +953,7 @@ mod tests {
 use std::rc::Rc;
 use std::sync::Arc;
 
-fn make_column_chuncks(
+fn make_column_chunks(
 column_desc: ColumnDescPtr,
 encoding: Encoding,
 num_levels: usize,
@@ -964,11 +964,11 @@ mod tests {
 values:  Vec,
 page_lists:  Vec>,
 use_v2: bool,
-num_chuncks: usize,
+num_chunks: usize,
 ) where
 T::T: PartialOrd + SampleUniform + Copy,
 {
-for _i in 0..num_chuncks {
+for _i in 0..num_chunks {
 let mut pages = VecDeque::new();
 let mut data = Vec::new();
 let mut page_def_levels = Vec::new();
@@ -1039,7 +1039,7 @@ mod tests {
 {
 let mut data = Vec::new();
 let mut page_lists = Vec::new();
-make_column_chuncks::(
+make_column_chunks::(
 column_desc.clone(),
 Encoding::PLAIN,
 100,
@@ -1061,7 +1061,7 @@ mod tests {
 )
 .unwrap();
 
-// Read first 50 values, which are all from the first column chunck
+// Read first 50 values, which are all from the first column chunk
 let array = array_reader.next_batch(50).unwrap();
 let array = array
 .as_any()
@@ -1120,7 +1120,7 @@ mod tests {
 {
 let mut data = Vec::new();
 let mut page_lists = Vec::new();
-make_column_chuncks::<$arrow_parquet_type>(
+make_column_chunks::<$arrow_parquet_type>(
 column_desc.clone(),
 Encoding::PLAIN,
 100,
@@ -1225,7 +1225,7 @@ mod tests {
 let mut def_levels = Vec::new();
 let mut rep_levels = Vec::new();
 let mut page_lists = Vec::new();
-make_column_chuncks::(
+make_column_chunks::(
 column_desc.clone(),
 Encoding::PLAIN,
 100,
@@ -1250,7 +1250,7 @@ mod tests {
 
 let mut accu_len: usize = 0;
 
-// Read first 50 values, which are all from the first column chunck
+// Read first 50 values, which are all from the first column chunk
 let array = array_reader.next_batch(50).unwrap();
 assert_eq!(
 Some(_levels[accu_len..(accu_len + array.len())]),
diff --git a/rust/parquet/src/file/footer.rs b/rust/parquet/src/file/footer.rs
new file mode 100644
index 000..240381c
--- /dev/null
+++ b/rust/parquet/src/file/footer.rs
@@ -0,0 +1,263 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.

[arrow] branch master updated: ARROW-10332: [Rust] Allow CSV reader to iterate from start up to end

2020-10-25 Thread nevime
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
 new a764d3b  ARROW-10332: [Rust] Allow CSV reader to iterate from start up to end
a764d3b is described below

commit a764d3bafaaf593e5b3fe418975cd039c28f8494
Author: Jorge C. Leitao 
AuthorDate: Sun Oct 25 22:23:20 2020 +0200

ARROW-10332: [Rust] Allow CSV reader to iterate from start up to end

This PR proposes the following changes:

1. Make the CSV reader accept an optional argument to bound its iteration
2. Simplify the `next` code via iterators
3. Add a new struct to perform buffered iterations (useful to any reader)
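
Illustrating change 1 (a sketch; the bounds parameter's position and its `Option<(usize, usize)>` shape are inferred from the diff below):

    use std::fs::File;
    use std::sync::Arc;

    use arrow::csv;
    use arrow::datatypes::{DataType, Field, Schema};

    fn main() {
        let schema = Schema::new(vec![
            Field::new("city", DataType::Utf8, false),
            Field::new("lat", DataType::Float64, false),
            Field::new("lng", DataType::Float64, false),
        ]);
        let file = File::open("test/data/uk_cities.csv").unwrap();
        // Bound the iteration: read only rows 100..200 of the file.
        let mut csv = csv::Reader::new(
            file, Arc::new(schema), false, None, 1024,
            Some((100, 200)), None,
        );
        let _batch = csv.next().unwrap().unwrap();
    }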

Closes #8482 from jorgecarleitao/csv_many

Authored-by: Jorge C. Leitao 
Signed-off-by: Neville Dipale 
---
 rust/arrow/examples/read_csv.rs  |   2 +-
 rust/arrow/src/array/builder.rs  | 159 +++--
 rust/arrow/src/csv/reader.rs | 395 +++
 rust/arrow/src/util/buffered_iterator.rs | 138 +++
 rust/arrow/src/util/mod.rs   |   1 +
 rust/datafusion/src/physical_plan/csv.rs |   1 +
 6 files changed, 460 insertions(+), 236 deletions(-)

diff --git a/rust/arrow/examples/read_csv.rs b/rust/arrow/examples/read_csv.rs
index dcbc44c..8c8dfa0 100644
--- a/rust/arrow/examples/read_csv.rs
+++ b/rust/arrow/examples/read_csv.rs
@@ -35,7 +35,7 @@ fn main() -> Result<()> {
 
     let file = File::open("test/data/uk_cities.csv").unwrap();
 
-    let mut csv = csv::Reader::new(file, Arc::new(schema), false, None, 1024, None);
+    let mut csv = csv::Reader::new(file, Arc::new(schema), false, None, 1024, None, None);
     let _batch = csv.next().unwrap().unwrap();
     #[cfg(feature = "prettyprint")]
     {
diff --git a/rust/arrow/src/array/builder.rs b/rust/arrow/src/array/builder.rs
index ca45f9e..7d1122f 100644
--- a/rust/arrow/src/array/builder.rs
+++ b/rust/arrow/src/array/builder.rs
@@ -1990,6 +1990,79 @@ impl ArrayBuilder for StructBuilder {
 }
 }
 
+/// Returns a builder with capacity `capacity` that corresponds to the datatype `DataType`
+/// This function is useful to construct arrays from arbitrary vectors with a known/expected
+/// schema.
+pub fn make_builder(datatype: &DataType, capacity: usize) -> Box<dyn ArrayBuilder> {
+    match datatype {
+        DataType::Null => unimplemented!(),
+        DataType::Boolean => Box::new(BooleanBuilder::new(capacity)),
+        DataType::Int8 => Box::new(Int8Builder::new(capacity)),
+        DataType::Int16 => Box::new(Int16Builder::new(capacity)),
+        DataType::Int32 => Box::new(Int32Builder::new(capacity)),
+        DataType::Int64 => Box::new(Int64Builder::new(capacity)),
+        DataType::UInt8 => Box::new(UInt8Builder::new(capacity)),
+        DataType::UInt16 => Box::new(UInt16Builder::new(capacity)),
+        DataType::UInt32 => Box::new(UInt32Builder::new(capacity)),
+        DataType::UInt64 => Box::new(UInt64Builder::new(capacity)),
+        DataType::Float32 => Box::new(Float32Builder::new(capacity)),
+        DataType::Float64 => Box::new(Float64Builder::new(capacity)),
+        DataType::Binary => Box::new(BinaryBuilder::new(capacity)),
+        DataType::FixedSizeBinary(len) => {
+            Box::new(FixedSizeBinaryBuilder::new(capacity, *len))
+        }
+        DataType::Utf8 => Box::new(StringBuilder::new(capacity)),
+        DataType::Date32(DateUnit::Day) => Box::new(Date32Builder::new(capacity)),
+        DataType::Date64(DateUnit::Millisecond) => Box::new(Date64Builder::new(capacity)),
+        DataType::Time32(TimeUnit::Second) => {
+            Box::new(Time32SecondBuilder::new(capacity))
+        }
+        DataType::Time32(TimeUnit::Millisecond) => {
+            Box::new(Time32MillisecondBuilder::new(capacity))
+        }
+        DataType::Time64(TimeUnit::Microsecond) => {
+            Box::new(Time64MicrosecondBuilder::new(capacity))
+        }
+        DataType::Time64(TimeUnit::Nanosecond) => {
+            Box::new(Time64NanosecondBuilder::new(capacity))
+        }
+        DataType::Timestamp(TimeUnit::Second, _) => {
+            Box::new(TimestampSecondBuilder::new(capacity))
+        }
+        DataType::Timestamp(TimeUnit::Millisecond, _) => {
+            Box::new(TimestampMillisecondBuilder::new(capacity))
+        }
+        DataType::Timestamp(TimeUnit::Microsecond, _) => {
+            Box::new(TimestampMicrosecondBuilder::new(capacity))
+        }
+        DataType::Timestamp(TimeUnit::Nanosecond, _) => {
+            Box::new(TimestampNanosecondBuilder::new(capacity))
+        }
+        DataType::Interval(IntervalUnit::YearMonth) => {
+            Box::new(IntervalYearMonthBuilder::new(capacity))
+        }
+        DataType::Interval(IntervalUnit::DayTime) => {
+            Box::new(IntervalDayTimeBuilder::new(capacity))
+        }
+