[arrow-rs] branch llvm-cov created (now 2a3d561c9)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch llvm-cov
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git

      at  2a3d561c9 Check if llvm-cov will run on CI

This branch includes the following new commits:

     new 2a3d561c9 Check if llvm-cov will run on CI

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
[arrow-rs] 01/01: Check if llvm-cov will run on CI
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch llvm-cov
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git

commit 2a3d561c9e79381230ff9bf4d5670f4e549d5e74
Author: Wakahisa
AuthorDate: Mon Aug 1 00:10:05 2022 +0200

    Check if llvm-cov will run on CI
---
 .github/workflows/rust.yml | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/.github/workflows/rust.yml b/.github/workflows/rust.yml
index 8464a22b6..bd63efe02 100644
--- a/.github/workflows/rust.yml
+++ b/.github/workflows/rust.yml
@@ -76,24 +76,22 @@ jobs:
         arch: [ amd64 ]
         rust: [ stable ]
     steps:
-      - uses: actions/checkout@v2
+      - uses: actions/checkout@v3
         with:
           submodules: true
       - name: Setup Rust toolchain
         run: |
-          rustup toolchain install ${{ matrix.rust }}
+          rustup toolchain install ${{ matrix.rust }} --component llvm-tools-preview
           rustup default ${{ matrix.rust }}
       - name: Cache Cargo
         uses: actions/cache@v3
         with:
           path: /home/runner/.cargo
           key: cargo-coverage-cache3-
+      - name: Install cargo-llvm-cov
+        uses: taiki-e/install-action@cargo-llvm-cov
       - name: Run coverage
-        run: |
-          rustup toolchain install stable
-          rustup default stable
-          cargo install --version 0.18.2 cargo-tarpaulin
-          cargo tarpaulin --all --out Xml
+        run: cargo llvm-cov --all-features --workspace --lcov --output-path lcov.info
       - name: Report coverage
         continue-on-error: true
         run: bash <(curl -s https://codecov.io/bash)
[arrow-rs] branch master updated: Make `Schema::fields` and `Schema::metadata` `pub` (#2239)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new 3032a521c Make `Schema::fields` and `Schema::metadata` `pub` (#2239) 3032a521c is described below commit 3032a521c9691d4569a9d277046304bd4e4098fb Author: Andrew Lamb AuthorDate: Sun Jul 31 18:05:00 2022 -0400 Make `Schema::fields` and `Schema::metadata` `pub` (#2239) --- arrow/src/datatypes/schema.rs | 4 ++-- arrow/tests/schema.rs | 46 +++ 2 files changed, 48 insertions(+), 2 deletions(-) diff --git a/arrow/src/datatypes/schema.rs b/arrow/src/datatypes/schema.rs index 1574b1654..f1f28d611 100644 --- a/arrow/src/datatypes/schema.rs +++ b/arrow/src/datatypes/schema.rs @@ -33,11 +33,11 @@ use super::Field; /// memory layout. #[derive(Serialize, Deserialize, Debug, Clone, PartialEq, Eq)] pub struct Schema { -pub(crate) fields: Vec, +pub fields: Vec, /// A map of key-value pairs containing additional meta data. #[serde(skip_serializing_if = "HashMap::is_empty")] #[serde(default)] -pub(crate) metadata: HashMap, +pub metadata: HashMap, } impl Schema { diff --git a/arrow/tests/schema.rs b/arrow/tests/schema.rs new file mode 100644 index 0..ff544b689 --- /dev/null +++ b/arrow/tests/schema.rs @@ -0,0 +1,46 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. 
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use arrow::datatypes::{DataType, Field, Schema}; +use std::collections::HashMap; +/// The tests in this file ensure a `Schema` can be manipulated +/// outside of the arrow crate + +#[test] +fn schema_destructure() { +let meta = [("foo".to_string(), "baz".to_string())] +.into_iter() +.collect::>(); + +let field = Field::new("c1", DataType::Utf8, false); +let schema = Schema::new(vec![field]).with_metadata(meta); + +// Destructuring a Schema allows rewriting fields and metadata +// without copying +// +// Model this usecase below: + +let Schema { +mut fields, +metadata, +} = schema; +fields.push(Field::new("c2", DataType::Utf8, false)); + +let new_schema = Schema::new(fields).with_metadata(metadata); + +assert_eq!(new_schema.fields().len(), 2); +}
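[Editorial note on the change above: making `fields` and `metadata` `pub` enables move-based destructuring from outside the crate. A minimal, self-contained sketch of that pattern follows, using a stand-in `Schema` struct (with simplified `String` fields) rather than the real arrow types, which are not assumed available here.]

```rust
use std::collections::HashMap;

// Stand-in for arrow's Schema: once `fields` and `metadata` are `pub`,
// code outside the defining crate can destructure the struct by value.
pub struct Schema {
    pub fields: Vec<String>,
    pub metadata: HashMap<String, String>,
}

// Rebuild a schema with one extra field, moving (not cloning) the
// existing fields and metadata out of the old value.
fn rebuild(schema: Schema) -> Schema {
    let Schema { mut fields, metadata } = schema;
    fields.push("c2".to_string());
    Schema { fields, metadata }
}

fn main() {
    let schema = Schema {
        fields: vec!["c1".to_string()],
        metadata: HashMap::from([("foo".to_string(), "baz".to_string())]),
    };
    let new_schema = rebuild(schema);
    assert_eq!(new_schema.fields.len(), 2);
}
```

With `pub(crate)` fields this destructuring would fail to compile outside the crate; that is the usability gap the commit closes.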
[arrow-rs] branch master updated: fix the doc of value_length (#1957)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git

The following commit(s) were added to refs/heads/master by this push:
     new d6fc77870  fix the doc of value_length (#1957)
d6fc77870 is described below

commit d6fc77870974e8d468689aab94179e738072314e
Author: Remzi Yang <59198230+haoyang...@users.noreply.github.com>
AuthorDate: Wed Jun 29 12:21:05 2022 +0800

    fix the doc of value_length (#1957)

    Signed-off-by: remzi <1371656737...@gmail.com>
---
 arrow/src/array/array_list.rs | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arrow/src/array/array_list.rs b/arrow/src/array/array_list.rs
index 709e4e7ba..36ad30715 100644
--- a/arrow/src/array/array_list.rs
+++ b/arrow/src/array/array_list.rs
@@ -381,9 +381,9 @@ impl FixedSizeListArray {
         self.value_offset_at(self.data.offset() + i)
     }

-    /// Returns the length for value at index `i`.
+    /// Returns the length for an element.
     ///
-    /// Note this doesn't do any bound checking, for performance reason.
+    /// All elements have the same length as the array is a fixed size.
     #[inline]
     pub const fn value_length(&self) -> i32 {
         self.length
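[Editorial note: the doc fix above reflects that in a fixed-size list every element shares one length, so no index parameter is needed. A self-contained sketch of that invariant, using a toy struct rather than the real `FixedSizeListArray`:]

```rust
// Toy model of a fixed-size list: every element has the same length,
// so the element length is a constant and the child offset of element
// `i` is simply `i * length`.
struct FixedSizeList {
    length: i32, // length of every element
}

impl FixedSizeList {
    const fn value_length(&self) -> i32 {
        self.length
    }
    const fn value_offset_at(&self, i: i32) -> i32 {
        i * self.length
    }
}

fn main() {
    let list = FixedSizeList { length: 3 };
    assert_eq!(list.value_length(), 3);
    assert_eq!(list.value_offset_at(2), 6);
}
```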
[arrow-rs] branch master updated: Update indexmap dependency (#1929)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git

The following commit(s) were added to refs/heads/master by this push:
     new 27963e758  Update indexmap dependency (#1929)
27963e758 is described below

commit 27963e758cf14b437c6ba40016f5ac732a4bca6d
Author: Raphael Taylor-Davies <1781103+tustv...@users.noreply.github.com>
AuthorDate: Thu Jun 23 21:03:13 2022 +0100

    Update indexmap dependency (#1929)
---
 arrow/Cargo.toml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arrow/Cargo.toml b/arrow/Cargo.toml
index 136a2ae02..944dda9eb 100644
--- a/arrow/Cargo.toml
+++ b/arrow/Cargo.toml
@@ -41,7 +41,7 @@ bench = false
 serde = { version = "1.0", default-features = false }
 serde_derive = { version = "1.0", default-features = false }
 serde_json = { version = "1.0", default-features = false, features = ["preserve_order"] }
-indexmap = { version = "1.6", default-features = false, features = ["std"] }
+indexmap = { version = "1.9", default-features = false, features = ["std"] }
 rand = { version = "0.8", default-features = false, features = ["std", "std_rng"], optional = true }
 num = { version = "0.4", default-features = false, features = ["std"] }
 half = { version = "2.0", default-features = false }
[arrow-rs] branch master updated: Add ArrowWriter doctest (#1927) (#1930)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git

The following commit(s) were added to refs/heads/master by this push:
     new f8afc1424  Add ArrowWriter doctest (#1927) (#1930)
f8afc1424 is described below

commit f8afc1424a729c390df6b69e585db7274498106b
Author: Raphael Taylor-Davies <1781103+tustv...@users.noreply.github.com>
AuthorDate: Thu Jun 23 20:03:24 2022 +0100

    Add ArrowWriter doctest (#1927) (#1930)
---
 parquet/src/arrow/arrow_writer/mod.rs | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/parquet/src/arrow/arrow_writer/mod.rs b/parquet/src/arrow/arrow_writer/mod.rs
index 83f1bc70b..a18098ff1 100644
--- a/parquet/src/arrow/arrow_writer/mod.rs
+++ b/parquet/src/arrow/arrow_writer/mod.rs
@@ -48,6 +48,27 @@ mod levels;
 /// to produce row groups with `max_row_group_size` rows. Any remaining rows will be
 /// flushed on close, leading the final row group in the output file to potentially
 /// contain fewer than `max_row_group_size` rows
+///
+/// ```
+/// # use std::sync::Arc;
+/// # use bytes::Bytes;
+/// # use arrow::array::{ArrayRef, Int64Array};
+/// # use arrow::record_batch::RecordBatch;
+/// # use parquet::arrow::{ArrowReader, ArrowWriter, ParquetFileArrowReader};
+/// let col = Arc::new(Int64Array::from_iter_values([1, 2, 3])) as ArrayRef;
+/// let to_write = RecordBatch::try_from_iter([("col", col)]).unwrap();
+///
+/// let mut buffer = Vec::new();
+/// let mut writer = ArrowWriter::try_new(&mut buffer, to_write.schema(), None).unwrap();
+/// writer.write(&to_write).unwrap();
+/// writer.close().unwrap();
+///
+/// let mut reader = ParquetFileArrowReader::try_new(Bytes::from(buffer)).unwrap();
+/// let mut reader = reader.get_record_reader(1024).unwrap();
+/// let read = reader.next().unwrap().unwrap();
+///
+/// assert_eq!(to_write, read);
+/// ```
 pub struct ArrowWriter<W: Write> {
     /// Underlying Parquet writer
     writer: SerializedFileWriter<W>,
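[Editorial note: the doc comment above describes how `ArrowWriter` flushes row groups of at most `max_row_group_size` rows, with any remainder flushed on close. A minimal sketch of just that chunking rule, independent of the parquet crate:]

```rust
// Toy sketch of the row-group flushing rule: buffered rows are emitted
// in groups of at most `max_row_group_size`; the final group may hold
// fewer rows.
fn row_group_sizes(total_rows: usize, max_row_group_size: usize) -> Vec<usize> {
    let mut sizes = Vec::new();
    let mut remaining = total_rows;
    while remaining > 0 {
        let n = remaining.min(max_row_group_size);
        sizes.push(n);
        remaining -= n;
    }
    sizes
}

fn main() {
    // 10 rows with a limit of 4 rows per group -> groups of 4, 4 and 2.
    assert_eq!(row_group_sizes(10, 4), vec![4, 4, 2]);
}
```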
[arrow-rs] branch master updated (fcf655e19 -> 9bcd052bd)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git from fcf655e19 Zero copy page decoding from bytes (#1810) add 9bcd052bd Omit validity buffer in PrimitiveArray::from_iter when all values are valid (#1859) No new revisions were added by this update. Summary of changes: arrow/src/array/array.rs | 4 +++- arrow/src/array/array_primitive.rs | 30 +++--- 2 files changed, 26 insertions(+), 8 deletions(-)
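[Editorial note: the commit above omits the validity buffer when every collected value is valid, since in Arrow an absent null bitmap means "all valid". A self-contained sketch of the idea, with invented names (`collect_with_optional_bitmap`) and a plain `Vec<bool>` standing in for the bitmap:]

```rust
// Sketch: while collecting values, track whether any null was seen;
// if none was, drop the validity bitmap entirely instead of storing
// an all-ones bitmap.
fn collect_with_optional_bitmap(values: &[Option<i64>]) -> (Vec<i64>, Option<Vec<bool>>) {
    let mut data = Vec::with_capacity(values.len());
    let mut bitmap = Vec::with_capacity(values.len());
    let mut saw_null = false;
    for v in values {
        bitmap.push(v.is_some());
        saw_null |= v.is_none();
        data.push(v.unwrap_or(0)); // arbitrary filler for null slots
    }
    (data, if saw_null { Some(bitmap) } else { None })
}

fn main() {
    let (_, bitmap) = collect_with_optional_bitmap(&[Some(1), Some(2)]);
    assert!(bitmap.is_none()); // all valid -> no validity buffer

    let (_, bitmap) = collect_with_optional_bitmap(&[Some(1), None]);
    assert!(bitmap.is_some()); // a null forces the bitmap to exist
}
```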
[arrow-rs] branch master updated: Remove simd and avx512 bitwise kernels in favor of autovectorization (#1830)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new fb697ce43 Remove simd and avx512 bitwise kernels in favor of autovectorization (#1830) fb697ce43 is described below commit fb697ce4351fae39ebac810508ecc31583c6cdd7 Author: Jörn Horstmann AuthorDate: Sun Jun 12 19:09:02 2022 +0200 Remove simd and avx512 bitwise kernels in favor of autovectorization (#1830) * Remove simd and avx512 bitwise kernels since they are actually slightly slower than the autovectorized version * Add notes about target-cpu to README --- arrow/Cargo.toml| 1 - arrow/README.md | 14 ++ arrow/benches/buffer_bit_ops.rs | 61 ++-- arrow/src/arch/avx512.rs| 73 -- arrow/src/arch/mod.rs | 22 --- arrow/src/buffer/ops.rs | 307 +--- arrow/src/lib.rs| 4 - 7 files changed, 69 insertions(+), 413 deletions(-) diff --git a/arrow/Cargo.toml b/arrow/Cargo.toml index ebcdd9e7a..3f69888d5 100644 --- a/arrow/Cargo.toml +++ b/arrow/Cargo.toml @@ -61,7 +61,6 @@ bitflags = "1.2.1" [features] default = ["csv", "ipc", "test_utils"] -avx512 = [] csv = ["csv_crate"] ipc = ["flatbuffers"] simd = ["packed_simd"] diff --git a/arrow/README.md b/arrow/README.md index 67de57ff0..28240e77d 100644 --- a/arrow/README.md +++ b/arrow/README.md @@ -100,3 +100,17 @@ cargo run --example read_csv ``` [arrow]: https://arrow.apache.org/ + + +## Performance + +Most of the compute kernels benefit a lot from being optimized for a specific CPU target. +This is especially so on x86-64 since without specifying a target the compiler can only assume support for SSE2 vector instructions. +One of the following values as `-Ctarget-cpu=value` in `RUSTFLAGS` can therefore improve performance significantly: + + - `native`: Target the exact features of the cpu that the build is running on. 
+ This should give the best performance when building and running locally, but should be used carefully for example when building in a CI pipeline or when shipping pre-compiled software. + - `x86-64-v3`: Includes AVX2 support and is close to the intel `haswell` architecture released in 2013 and should be supported by any recent Intel or Amd cpu. + - `x86-64-v4`: Includes AVX512 support available on intel `skylake` server and `icelake`/`tigerlake`/`rocketlake` laptop and desktop processors. + +These flags should be used in addition to the `simd` feature, since they will also affect the code generated by the simd library. \ No newline at end of file diff --git a/arrow/benches/buffer_bit_ops.rs b/arrow/benches/buffer_bit_ops.rs index 063f39c92..6c6bb0463 100644 --- a/arrow/benches/buffer_bit_ops.rs +++ b/arrow/benches/buffer_bit_ops.rs @@ -17,11 +17,14 @@ #[macro_use] extern crate criterion; -use criterion::Criterion; + +use criterion::{Criterion, Throughput}; extern crate arrow; -use arrow::buffer::{Buffer, MutableBuffer}; +use arrow::buffer::{ +buffer_bin_and, buffer_bin_or, buffer_unary_not, Buffer, MutableBuffer, +}; /// Helper function to create arrays fn create_buffer(size: usize) -> Buffer { @@ -42,17 +45,59 @@ fn bench_buffer_or(left: , right: ) { criterion::black_box((left | right).unwrap()); } +fn bench_buffer_not(buffer: ) { +criterion::black_box(!buffer); +} + +fn bench_buffer_and_with_offsets( +left: , +left_offset: usize, +right: , +right_offset: usize, +len: usize, +) { +criterion::black_box(buffer_bin_and(left, left_offset, right, right_offset, len)); +} + +fn bench_buffer_or_with_offsets( +left: , +left_offset: usize, +right: , +right_offset: usize, +len: usize, +) { +criterion::black_box(buffer_bin_or(left, left_offset, right, right_offset, len)); +} + +fn bench_buffer_not_with_offsets(buffer: , offset: usize, len: usize) { +criterion::black_box(buffer_unary_not(buffer, offset, len)); +} + fn bit_ops_benchmark(c: Criterion) { let left = 
create_buffer(512 * 10); let right = create_buffer(512 * 10); -c.bench_function("buffer_bit_ops and", |b| { -b.iter(|| bench_buffer_and(, )) -}); +c.benchmark_group("buffer_binary_ops") +.throughput(Throughput::Bytes(3 * left.len() as u64)) +.bench_function("and", |b| b.iter(|| bench_buffer_and(, ))) +.bench_function("or", |b| b.iter(|| bench_buffer_or(, ))) +.bench_function("and_with_offset", |b| { +b.iter(|| { +bench_buffer_and_with_offsets(, 1, , 2, left.len() * 8 - 5) +}) +}) +.bench_function("or_with_offset", |b| { +b.iter(|
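[Editorial note: the commit above removed hand-written SIMD/AVX512 bitwise kernels because a plain word-at-a-time loop, which LLVM autovectorizes, was at least as fast. A minimal sketch of the kind of loop that benefits from the `-Ctarget-cpu` flags discussed in the README change:]

```rust
// A straightforward AND over 64-bit words. The compiler autovectorizes
// this loop; building with RUSTFLAGS="-Ctarget-cpu=x86-64-v3" (AVX2) or
// x86-64-v4 (AVX512) widens the generated vector instructions.
fn buffer_and(left: &[u64], right: &[u64]) -> Vec<u64> {
    left.iter().zip(right).map(|(l, r)| l & r).collect()
}

fn main() {
    let l = vec![0b1100u64; 8];
    let r = vec![0b1010u64; 8];
    assert!(buffer_and(&l, &r).iter().all(|&w| w == 0b1000));
}
```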
[arrow-rs] branch master updated: Read and skip validity buffer of UnionType Array for V4 ipc message (#1789)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new 73d552a7c Read and skip validity buffer of UnionType Array for V4 ipc message (#1789) 73d552a7c is described below commit 73d552a7cc794d0e3eaa3e5333e5bc1c98deeb45 Author: Liang-Chi Hsieh AuthorDate: Sun Jun 5 02:00:44 2022 -0700 Read and skip validity buffer of UnionType Array for V4 ipc message (#1789) * Read valididy buffer for V4 ipc message * Add unit test * Fix clippy --- arrow-flight/src/utils.rs | 1 + arrow/src/ipc/reader.rs| 31 -- arrow/src/ipc/writer.rs| 48 ++ .../flight_client_scenarios/integration_test.rs| 1 + .../flight_server_scenarios/integration_test.rs| 10 - 5 files changed, 86 insertions(+), 5 deletions(-) diff --git a/arrow-flight/src/utils.rs b/arrow-flight/src/utils.rs index 77526917f..dda3fc7fe 100644 --- a/arrow-flight/src/utils.rs +++ b/arrow-flight/src/utils.rs @@ -71,6 +71,7 @@ pub fn flight_data_to_arrow_batch( schema, dictionaries_by_id, None, +(), ) })? 
} diff --git a/arrow/src/ipc/reader.rs b/arrow/src/ipc/reader.rs index 03a960c4c..868098327 100644 --- a/arrow/src/ipc/reader.rs +++ b/arrow/src/ipc/reader.rs @@ -52,6 +52,7 @@ fn read_buffer(buf: ::Buffer, a_data: &[u8]) -> Buffer { /// - check if the bit width of non-64-bit numbers is 64, and /// - read the buffer as 64-bit (signed integer or float), and /// - cast the 64-bit array to the appropriate data type +#[allow(clippy::too_many_arguments)] fn create_array( nodes: &[ipc::FieldNode], field: , @@ -60,6 +61,7 @@ fn create_array( dictionaries_by_id: , mut node_index: usize, mut buffer_index: usize, +metadata: ::MetadataVersion, ) -> Result<(ArrayRef, usize, usize)> { use DataType::*; let data_type = field.data_type(); @@ -106,6 +108,7 @@ fn create_array( dictionaries_by_id, node_index, buffer_index, +metadata, )?; node_index = triple.1; buffer_index = triple.2; @@ -128,6 +131,7 @@ fn create_array( dictionaries_by_id, node_index, buffer_index, +metadata, )?; node_index = triple.1; buffer_index = triple.2; @@ -153,6 +157,7 @@ fn create_array( dictionaries_by_id, node_index, buffer_index, +metadata, )?; node_index = triple.1; buffer_index = triple.2; @@ -201,6 +206,13 @@ fn create_array( let len = union_node.length() as usize; +// In V4, union types has validity bitmap +// In V5 and later, union types have no validity bitmap +if metadata < ::MetadataVersion::V5 { +read_buffer([buffer_index], data); +buffer_index += 1; +} + let type_ids: Buffer = read_buffer([buffer_index], data)[..len].into(); @@ -226,6 +238,7 @@ fn create_array( dictionaries_by_id, node_index, buffer_index, +metadata, )?; node_index = triple.1; @@ -582,6 +595,7 @@ pub fn read_record_batch( schema: SchemaRef, dictionaries_by_id: , projection: Option<&[usize]>, +metadata: ::MetadataVersion, ) -> Result { let buffers = batch.buffers().ok_or_else(|| { ArrowError::IoError("Unable to get buffers from IPC RecordBatch".to_string()) @@ -607,6 +621,7 @@ pub fn read_record_batch( dictionaries_by_id, 
node_index, buffer_index, +metadata, )?; node_index = triple.1; buffer_index = triple.2; @@ -640,6 +655,7 @@ pub fn read_record_batch( dictionaries_by_id, node_index, buffer_index, +metadata, )?; node_index = triple.1; buffer_index = triple.2; @@ -656,6 +672,7 @@ pub fn read_dictionary( batch: ipc::DictionaryBatch, schema: , dictionaries_by_id: HashMap, +metadata: ::MetadataVersion, ) -> Result<()> { if batch.isDelta() { return Err(ArrowError::IoError( @@ -686,6 +703,7 @@ pub fn read_dictionary( Arc::new(schema), dictionaries_
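[Editorial note: the key change above is a version gate in the IPC reader — V4 messages carry a validity bitmap for union arrays that V5+ messages omit, so the reader must consume one extra buffer only for pre-V5 data. A self-contained sketch of that gate, with a toy `MetadataVersion` enum and an invented helper name:]

```rust
// V4 IPC union arrays have a (redundant) validity buffer; V5 and later
// do not. Deriving PartialOrd on the enum makes `metadata < V5` work,
// mirroring the comparison added in `create_array`.
#[derive(PartialEq, PartialOrd)]
enum MetadataVersion {
    V4,
    V5,
}

// Count how many buffers the union branch consumes before reading
// child arrays: the legacy validity buffer (V4 only) plus type_ids.
fn buffers_consumed_by_union(metadata: &MetadataVersion) -> usize {
    let mut consumed = 0;
    if *metadata < MetadataVersion::V5 {
        consumed += 1; // read and skip the legacy validity buffer
    }
    consumed + 1 // the type_ids buffer is always read
}

fn main() {
    assert_eq!(buffers_consumed_by_union(&MetadataVersion::V4), 2);
    assert_eq!(buffers_consumed_by_union(&MetadataVersion::V5), 1);
}
```

Forgetting the V4 skip misaligns every subsequent buffer index, which is why the real patch threads `metadata` through all `create_array` call sites.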
[arrow-rs] branch master updated (2a12e5043 -> c1a91dc6d)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git from 2a12e5043 Revert "Pin nightly version to bypass packed_simd build error (#1743)" (#1771) add c1a91dc6d Improve ParquetFileArrowReader UX (#1773) No new revisions were added by this update. Summary of changes: parquet/src/arrow/arrow_reader.rs | 12 1 file changed, 12 insertions(+)
[arrow-rs] branch master updated: Support casting Utf8 to Boolean (#1738)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new 486118cfa Support casting Utf8 to Boolean (#1738) 486118cfa is described below commit 486118cfa9bc1435edc1745f4025f963712bf631 Author: Alex Qyoun-ae <4062971+mazterq...@users.noreply.github.com> AuthorDate: Mon May 30 10:51:45 2022 +0400 Support casting Utf8 to Boolean (#1738) --- arrow/src/compute/kernels/cast.rs | 69 +++ 1 file changed, 62 insertions(+), 7 deletions(-) diff --git a/arrow/src/compute/kernels/cast.rs b/arrow/src/compute/kernels/cast.rs index 26aacff0b..93a8ebcb6 100644 --- a/arrow/src/compute/kernels/cast.rs +++ b/arrow/src/compute/kernels/cast.rs @@ -161,7 +161,7 @@ pub fn can_cast_types(from_type: , to_type: ) -> bool { (Dictionary(_, value_type), _) => can_cast_types(value_type, to_type), (_, Dictionary(_, value_type)) => can_cast_types(from_type, value_type), -(_, Boolean) => DataType::is_numeric(from_type), +(_, Boolean) => DataType::is_numeric(from_type) || from_type == , (Boolean, _) => DataType::is_numeric(to_type) || to_type == , (Utf8, LargeUtf8) => true, @@ -280,6 +280,8 @@ pub fn can_cast_types(from_type: , to_type: ) -> bool { /// /// Behavior: /// * Boolean to Utf8: `true` => '1', `false` => `0` +/// * Utf8 to boolean: `true`, `yes`, `on`, `1` => `true`, `false`, `no`, `off`, `0` => `false`, +/// short variants are accepted, other strings return null or error /// * Utf8 to numeric: strings that can't be parsed to numbers return null, float strings /// in integer casts return null /// * Numeric to boolean: 0 returns `false`, any other value returns `true` @@ -293,7 +295,6 @@ pub fn can_cast_types(from_type: , to_type: ) -> bool { /// Unsupported Casts /// * To or from `StructArray` /// * List to primitive -/// * Utf8 to boolean /// * Interval and duration pub fn cast(array: , to_type: ) -> 
Result { cast_with_options(array, to_type, _CAST_OPTIONS) @@ -396,6 +397,8 @@ macro_rules! cast_decimal_to_float { /// /// Behavior: /// * Boolean to Utf8: `true` => '1', `false` => `0` +/// * Utf8 to boolean: `true`, `yes`, `on`, `1` => `true`, `false`, `no`, `off`, `0` => `false`, +/// short variants are accepted, other strings return null or error /// * Utf8 to numeric: strings that can't be parsed to numbers return null, float strings /// in integer casts return null /// * Numeric to boolean: 0 returns `false`, any other value returns `true` @@ -409,7 +412,6 @@ macro_rules! cast_decimal_to_float { /// Unsupported Casts /// * To or from `StructArray` /// * List to primitive -/// * Utf8 to boolean pub fn cast_with_options( array: , to_type: , @@ -643,10 +645,7 @@ pub fn cast_with_options( Int64 => cast_numeric_to_bool::(array), Float32 => cast_numeric_to_bool::(array), Float64 => cast_numeric_to_bool::(array), -Utf8 => Err(ArrowError::CastError(format!( -"Casting from {:?} to {:?} not supported", -from_type, to_type, -))), +Utf8 => cast_utf8_to_boolean(array, cast_options), _ => Err(ArrowError::CastError(format!( "Casting from {:?} to {:?} not supported", from_type, to_type, @@ -1661,6 +1660,34 @@ fn cast_string_to_timestamp_ns( Ok(Arc::new(array) as ArrayRef) } +/// Casts Utf8 to Boolean +fn cast_utf8_to_boolean(from: , cast_options: ) -> Result { +let array = as_string_array(from); + +let output_array = array +.iter() +.map(|value| match value { +Some(value) => match value.to_ascii_lowercase().trim() { +"t" | "tr" | "tru" | "true" | "y" | "ye" | "yes" | "on" | "1" => { +Ok(Some(true)) +} +"f" | "fa" | "fal" | "fals" | "false" | "n" | "no" | "of" | "off" +| "0" => Ok(Some(false)), +invalid_value => match cast_options.safe { +true => Ok(None), +false => Err(ArrowError::CastError(format!( +"Cannot cast string '{}' to value of Boolean type", +invalid_value, +))), +}, +}, +None => Ok(None), +}) +.collect::>()?; + +Ok(Arc::new(output_array)) +} + /// Cast numeric 
types to Boolean
///
/// Any zero value returns `false`, any non-zero value returns `true`
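[Editorial note: the match arms in `cast_utf8_to_boolean` above are easy to verify in isolation. A self-contained replica of just the string-matching rule — truthy prefixes of "true"/"yes" plus "on"/"1", falsy prefixes of "false"/"no" plus "off"/"0", everything else `None` (null in safe mode, an error otherwise in the real kernel):]

```rust
// Standalone version of the Utf8 -> Boolean parsing rule from the
// commit above; returns None where the real kernel yields null/error.
fn utf8_to_bool(value: &str) -> Option<bool> {
    match value.to_ascii_lowercase().trim() {
        "t" | "tr" | "tru" | "true" | "y" | "ye" | "yes" | "on" | "1" => Some(true),
        "f" | "fa" | "fal" | "fals" | "false" | "n" | "no" | "of" | "off" | "0" => Some(false),
        _ => None,
    }
}

fn main() {
    assert_eq!(utf8_to_bool("TRUE"), Some(true));
    assert_eq!(utf8_to_bool(" no "), Some(false));
    assert_eq!(utf8_to_bool("maybe"), None);
}
```

Note the case-insensitive, whitespace-trimming behavior: lowercasing happens before trimming, so `" Yes "` matches the `"yes"` arm.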
[arrow-rs] branch master updated: Read/Write nested dictionaries under FixedSizeList in IPC (#1610)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new cbd0303c6 Read/Write nested dictionaries under FixedSizeList in IPC (#1610) cbd0303c6 is described below commit cbd0303c69d66d4c683fea29787c8d03c8942568 Author: Liang-Chi Hsieh AuthorDate: Sun Apr 24 23:15:57 2022 -0700 Read/Write nested dictionaries under FixedSizeList in IPC (#1610) * Read/Write nested dictionaries under FixedSizeList in IPC * Fix clippy --- arrow/src/ipc/reader.rs | 39 +++ arrow/src/ipc/writer.rs | 16 ++-- 2 files changed, 53 insertions(+), 2 deletions(-) diff --git a/arrow/src/ipc/reader.rs b/arrow/src/ipc/reader.rs index 33d608576..8a26167db 100644 --- a/arrow/src/ipc/reader.rs +++ b/arrow/src/ipc/reader.rs @@ -1573,4 +1573,43 @@ mod tests { offsets, ); } + +#[test] +fn test_roundtrip_stream_dict_of_fixed_size_list_of_dict() { +let values = StringArray::from(vec![Some("a"), None, Some("c"), None]); +let keys = Int8Array::from_iter_values([0, 0, 1, 2, 0, 1, 3, 1, 2]); +let dict_array = DictionaryArraytry_new(, ).unwrap(); +let dict_data = dict_array.data(); + +let list_data_type = DataType::FixedSizeList( +Box::new(Field::new_dict( +"item", +DataType::Dictionary(Box::new(DataType::Int8), Box::new(DataType::Utf8)), +true, +1, +false, +)), +3, +); +let list_data = ArrayData::builder(list_data_type) +.len(3) +.add_child_data(dict_data.clone()) +.build() +.unwrap(); +let list_array = FixedSizeListArray::from(list_data); + +let keys_for_dict = Int8Array::from_iter_values([0, 1, 0, 1, 1, 2, 0, 1, 2]); +let dict_dict_array = +DictionaryArraytry_new(_for_dict, _array).unwrap(); + +let schema = Arc::new(Schema::new(vec![Field::new( +"f1", +dict_dict_array.data_type().clone(), +false, +)])); +let input_batch = +RecordBatch::try_new(schema, vec![Arc::new(dict_dict_array)]).unwrap(); +let output_batch = 
roundtrip_ipc_stream(_batch); +assert_eq!(input_batch, output_batch); +} } diff --git a/arrow/src/ipc/writer.rs b/arrow/src/ipc/writer.rs index 1f73d16d2..efc878a12 100644 --- a/arrow/src/ipc/writer.rs +++ b/arrow/src/ipc/writer.rs @@ -27,7 +27,7 @@ use flatbuffers::FlatBufferBuilder; use crate::array::{ as_large_list_array, as_list_array, as_map_array, as_struct_array, as_union_array, -make_array, Array, ArrayData, ArrayRef, +make_array, Array, ArrayData, ArrayRef, FixedSizeListArray, }; use crate::buffer::{Buffer, MutableBuffer}; use crate::datatypes::*; @@ -147,7 +147,6 @@ impl IpcDataGenerator { dictionary_tracker: DictionaryTracker, write_options: , ) -> Result<()> { -// TODO: Handle other nested types (FixedSizeList) match column.data_type() { DataType::Struct(fields) => { let s = as_struct_array(column); @@ -181,6 +180,19 @@ impl IpcDataGenerator { write_options, )?; } +DataType::FixedSizeList(field, _) => { +let list = column +.as_any() +.downcast_ref::() +.expect("Unable to downcast to fixed size list array"); +self.encode_dictionaries( +field, +(), +encoded_dictionaries, +dictionary_tracker, +write_options, +)?; +} DataType::Map(field, _) => { let map_array = as_map_array(column);
[arrow-rs] branch master updated: Parquet: schema validation should allow scale == precision for decimal type (#1607)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new 4e22b8901 Parquet: schema validation should allow scale == precision for decimal type (#1607) 4e22b8901 is described below commit 4e22b890189762c22a1d33ad8fc9662c8582977c Author: Chao Sun AuthorDate: Fri Apr 22 23:42:01 2022 -0700 Parquet: schema validation should allow scale == precision for decimal type (#1607) --- parquet/src/schema/types.rs | 21 +++-- 1 file changed, 15 insertions(+), 6 deletions(-) diff --git a/parquet/src/schema/types.rs b/parquet/src/schema/types.rs index 8ae3c4c6e..b156bb671 100644 --- a/parquet/src/schema/types.rs +++ b/parquet/src/schema/types.rs @@ -467,13 +467,13 @@ impl<'a> PrimitiveTypeBuilder<'a> { return Err(general_err!("Invalid DECIMAL scale: {}", self.scale)); } -if self.scale >= self.precision { +if self.scale > self.precision { return Err(general_err!( -"Invalid DECIMAL: scale ({}) cannot be greater than or equal to precision \ +"Invalid DECIMAL: scale ({}) cannot be greater than precision \ ({})", -self.scale, -self.precision -)); +self.scale, +self.precision +)); } // Check precision and scale based on physical type limitations. 
@@ -1345,10 +1345,19 @@ mod tests { if let Err(e) = result { assert_eq!( format!("{}", e), -"Parquet error: Invalid DECIMAL: scale (2) cannot be greater than or equal to precision (1)" +"Parquet error: Invalid DECIMAL: scale (2) cannot be greater than precision (1)" ); } +// It is OK if precision == scale +result = Type::primitive_type_builder("foo", PhysicalType::BYTE_ARRAY) +.with_repetition(Repetition::REQUIRED) +.with_converted_type(ConvertedType::DECIMAL) +.with_precision(1) +.with_scale(1) +.build(); +assert!(result.is_ok()); + result = Type::primitive_type_builder("foo", PhysicalType::INT32) .with_repetition(Repetition::REQUIRED) .with_converted_type(ConvertedType::DECIMAL)
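[Editorial note: the validation change above is small but easy to get backwards. A self-contained sketch of the relaxed rule — `scale == precision` is now accepted and only `scale > precision` is rejected — using an invented helper name rather than the real `PrimitiveTypeBuilder`:]

```rust
// Relaxed DECIMAL check from the commit above: scale may equal
// precision; only scale > precision is invalid.
fn validate_decimal(precision: i32, scale: i32) -> Result<(), String> {
    if scale > precision {
        return Err(format!(
            "Invalid DECIMAL: scale ({}) cannot be greater than precision ({})",
            scale, precision
        ));
    }
    Ok(())
}

fn main() {
    assert!(validate_decimal(1, 1).is_ok()); // precision == scale now allowed
    assert!(validate_decimal(1, 2).is_err()); // scale > precision still rejected
}
```

A DECIMAL(1,1) such as 0.5 is perfectly representable, which is why the strict `>=` check was a bug.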
[arrow-datafusion] 01/01: add a Tablesource
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch rdbms-changes in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git commit 724f4e3363289607fed44ce30e9a1992df55d58a Author: Wakahisa AuthorDate: Mon Feb 14 22:50:05 2022 +0200 add a Tablesource Tablesource contains more information about the source of the table. It can be a relational table, file(s), in-memory or unspecified. --- datafusion/core/src/datasource/datasource.rs | 34 1 file changed, 34 insertions(+) diff --git a/datafusion/core/src/datasource/datasource.rs b/datafusion/core/src/datasource/datasource.rs index 1b59c857f..48a2dc09e 100644 --- a/datafusion/core/src/datasource/datasource.rs +++ b/datafusion/core/src/datasource/datasource.rs @@ -55,6 +55,35 @@ pub enum TableType { Temporary, } +/// Indicates the source of this table for metadata/catalog purposes. +#[derive(Debug, Clone, PartialEq)] +pub enum TableSource { +/// An ordinary physical table. +Relational { +/// +server: Option, +/// +database: Option, +/// +schema: Option, +/// +table: String +}, +/// A file on some file system +File { +/// +protocol: String, +/// +path: String, +/// +format: String, +}, +/// A transient table. +InMemory, +/// An unspecified source, used as the default +Unspecified, +} + /// Source table #[async_trait] pub trait TableProvider: Sync + Send { @@ -70,6 +99,11 @@ pub trait TableProvider: Sync + Send { TableType::Base } +/// The source of this table +fn table_source() -> TableSource { +TableSource::Unspecified +} + /// Create an ExecutionPlan that will scan the table. /// The table provider will be usually responsible of grouping /// the source data into partitions that can be efficiently
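[Editorial note: the `TableSource` commit above adds an enum plus a defaulted trait method. A self-contained sketch of that pattern — simplified variants (the real `Relational` and `File` variants carry more fields) and an invented `CsvProvider` for illustration:]

```rust
// Sketch of the TableSource pattern: providers report where their data
// lives, and the trait's default method returns Unspecified so existing
// implementations need no changes.
#[derive(Debug, Clone, PartialEq)]
enum TableSource {
    Relational { table: String },
    File { protocol: String, path: String, format: String },
    InMemory,
    Unspecified,
}

trait TableProvider {
    fn table_source(&self) -> TableSource {
        TableSource::Unspecified
    }
}

// Hypothetical provider that overrides the default.
struct CsvProvider;
impl TableProvider for CsvProvider {
    fn table_source(&self) -> TableSource {
        TableSource::File {
            protocol: "file".to_string(),
            path: "/data/example.csv".to_string(),
            format: "csv".to_string(),
        }
    }
}

// Hypothetical provider that relies on the default.
struct DefaultProvider;
impl TableProvider for DefaultProvider {}

fn main() {
    assert_eq!(DefaultProvider.table_source(), TableSource::Unspecified);
    assert!(matches!(CsvProvider.table_source(), TableSource::File { .. }));
}
```

Defaulting the trait method is what keeps the addition backwards compatible for every existing `TableProvider` implementation.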
[arrow-datafusion] branch rdbms-changes updated (e6614aa8f -> 724f4e336)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch rdbms-changes
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git

 discard e6614aa8f add a Tablesource
     add bed81eade MINOR: fix concat_ws corner bug (#2128)
     add 536210d73 fix df union all bug (#2108)
     add d54ba4e64 feat: 2061 create external table ddl table partition cols (#2099)
     add 88dd6ca3d Update sqlparser requirement from 0.15 to 0.16 (#2152)
     add a0d8b6633 cli: add cargo.lock (#2112)
     add fa5cef8c9 Fixed parquet path partitioning when only selecting partitioned columns (#2000)
     add 69ba713c4 #2109 schema infer max (#2139)
     add 5ae343404 [MINOR] after sqlparser update to 0.16, enable EXTRACT week. (#2157)
     add f99c2719a Update quarterly roadmap for Q2 (#2133)
     add 2a4a835bd fix: incorrect memory usage track for sort (#2135)
     add ceffb2fca Reduce SortExec memory usage by void constructing single huge batch (#2132)
     add 823011590 Add IF NOT EXISTS to `CREATE TABLE` and `CREATE EXTERNAL TABLE` (#2143)
     add 38498b7bf Reduce repetition in Decimal binary kernels, upgrade to arrow 11.1 (#2107)
     add 8b09a5c6c Add CREATE DATABASE command to SQL (#2094)
     add b890190a6 Add Coalesce function (#1969)
     add 0c4ffd4f7 Add delimiter for create external table (#2162)
     add ea16c30ed [MINOR] ignore suspicious slow test in Ballista (#2167)
     add e5e8125a1 Serialize scalar UDFs in physical plan (#2130)
     add f0200b0a9 [CLI] Add show tables for datafusion-cli (#2137)
     add 0da1f370f minor: Avoid per cell evaluation in Coalesce, use zip in CaseWhen (#2171)
     add 6504d2a78 enable explain for ballista (#2163)
     add fa9e01641 Implement fast path of with_new_children() in ExecutionPlan (#2168)
     add ddf29f112 implement 'StringConcat' operator to support sql like "select 'aa' || 'b' " (#2142)
     add 9815ac6ec Handle merged schemas in parquet pruning (#2170)
     add 70f2b1a9b add ballista plugin manager and udf plugin (#2131)
     add 9cbde6d0e cli: update lockfile (#2178)
     add dec9adcbe Optimize the evaluation of `IN` for large lists using InSet (#2156)
     add a63751494 fix: Sort with a lot of repetition values (#2182)
     add 2d908405f fix 'not' expression will 'NULL' constants (#2144)
     add 41d2ff2aa Make PhysicalAggregateExprNode has repeated PhysicalExprNode (#2184)
     add 73ed545b7 refactor: simplify `prepare_select_exprs` (#2190)
     add 7558a5591 make nightly clippy happy (#2186)
     add c46c91ff3 Multiple row-layout support, part-1: Restructure code for clearness (#2189)
     add 28a6da3d2 MINOR: handle `NULL` in advance to avoid value copy in `string_concat` (#2183)
     add f3360d30b Remove tokio::spawn from WindowAggExec (#2201) (#2203)
     add ee95d41cc Add LogicalPlan::SubqueryAlias (#2172)
     add 6d75948b6 Use `filter` (filter_record_batch) instead of `take` to avoid using indices (#2218)
     add 231027274 feat: Support simple Arrays with Literals (#2194)
     add d81657de0 `case when` supports `NULL` constant (#2197)
     add 7a6317a0e Add single line description of ExecutionPlan (#2216) (#2217)
     add f39692932 Make ParquetExec usable outside of a tokio runtime (#2201) (#2202)
     add 8058fbb38 Remove tokio::spawn from HashAggregateExec (#2201) (#2215)
     add 774b91bad minor refactor to avoid repeated code (#)
     add e7b08ed0e Range scan support for ParquetExec (#1990)
     add b1a28d077 update cli readme (#2220)
     add 8d5bb47f5 add sql level test for decimal data type (#2200)
     add d631a9ca2 chore: add `debug!` log in some execution operators (#2231)
     add 7e7b3ea02 minor: add editor config file (#2224)
     add 3d2e7b0bf Add type coercion rule for date + interval (#2235)
     new 724f4e336 add a Tablesource

This update added new revisions after undoing existing revisions.  That is
to say, some revisions that were in the old version of the branch are not in
the new version.  This situation occurs when a user --force pushes a change
and generates a repository containing something like this:

 * -- * -- B -- O -- O -- O   (e6614aa8f)
            \
             N -- N -- N   refs/heads/rdbms-changes (724f4e336)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions from the
common base, B.

Any revisions marked "omit" are not gone; other references still refer to
them.  Any revisions marked "discard" are gone forever.

The 1 revisions listed above as "new" are entirely new to this repository and
will be described in separate emails.  The revisions listed as "add" were
already present in the repository and have only been added to this reference.

Summary of changes:
 .../integration_hiveserver2.sh => .editorconfig    | 23 +-
 .github/workflows/rust.yml                         |  8 +-
 .gitignore
[arrow-datafusion] branch rdbms-changes updated (307abcc -> e6614aa)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch rdbms-changes
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git.

 discard 307abcc  add a Tablesource
     add 12996ce  revise document of installing ballista pinned to specified version (#2034)
     add 503618f  chore: rearrange the code and add comment (#2037)
     add 74bf7ab  fix bug the optimizer rule filter push down (#2039)
     add c1f6269  Use SessionContext to parse Expr protobuf (#2024)
     add 7ed3be6  I think using info in formal code is better than using println. (#2020)
     add d02d969  use cargo-tomlfmt to check Cargo.toml formatting in CI (#2033)
     add 2dcdb1f  Minor: tune log level, lint (#2046)
     add 8de2a76  minor: format the annotation (#2047)
     add 5936edc  Refactor SessionContext, SessionState and SessionConfig to support multi-tenancy configurations - Part 2 (#2029)
     add f5c0cea  fix panic in register_catalog if default catalog not named "datafusion" and information schema enabled (#2050)
     add 2e6833c  Update to arrow/parquet 11.0 (#2048)
     add 59c6d93  Add `write_json`, `read_json`, `register_json`, and `JsonFormat` to `CREATE EXTERNAL TABLE` functionality (#2023)
     add afbeaa6  Allow `CatalogProvider::register_catalog` to return an error (#2052)
     add 29d0a65  [Ballista][Scheduler] Change log level for noisy logs (#2060)
     add 634252b  Qualified wildcard (#2012)
     add 257d030  Change the DataFusion explain plans to make it clearer in the predicate/filter (#2063)
     add 0194a27  Split datafusion-object-store module (#2065)
     add d3c45c2  [MINOR] fix doc in `EXTRACT(field FROM source) (#2074)
     add e8ed603  #2004 approx percentile with weight (#2031)
     add 04da6a6  [Bug][Datafusion] fix TaskContext session_config bug (#2070)
     add 122837d  *: fix #1727 (#2085)
     add 3d31915  Fix lost filters and projections in ParquetExec, CSVExec etc (#2077)
     add d644fae  Remove dependency of common for the storage crate (#2076)
     add 703c789  *: remove duplicate test (#2089)
     add 8159294  fix issue#2058 file_format/json.rs attempt to subtract with overflow (#2066)
     add ff110d6  Short-circuit evaluation for `CaseWhen` (#2068)
     add 73ea6e1  [Ballista] Support Union in ballista. (#2098)
     add a09e1ae  add docs for approx functions (#2082)
     add 2598893  doc: separate and fix link for `extract` and `date_part` (#2104)
     add 2d6addd  Refactor SessionContext, BallistaContext to support multi-tenancy configurations - Part 3 (#2091)
     add 22fdca3  update zlib version to 1.2.12 (#2106)
     add 41b4e49  Reorganize the project folders (#2081)
     add b7d3bb1  Create jit-expression from datafusion expression (#2103)
     add 86df7ee  minor: replace array_equals in case evaluation with eq_dyn from arrow-rs (#2121)
     add 91673b3  Serialize timezone in timestamp scalar values (#2120)
     add 57a3a6a  minor: fix clippy on nightly rust (#2119)
     add f313e43  doc: update release schedule (#2110)
     add 9e3bec8  Fix case evaluation with NULLs (#2118)
     add 3063105  Minor: make disk_manager pub (#2126)
     add c43b9ab  issue#1967 ignore channel close (#2113)
     add f619d43  Minor add clarifying comment in parquet (#2127)
     add 4c2320e  JIT-compille DataFusion expression with column name (#2124)
     new e6614aa  add a Tablesource

This update added new revisions after undoing existing revisions.  That is
to say, some revisions that were in the old version of the branch are not in
the new version.  This situation occurs when a user --force pushes a change
and generates a repository containing something like this:

 * -- * -- B -- O -- O -- O   (307abcc)
            \
             N -- N -- N   refs/heads/rdbms-changes (e6614aa)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions from the
common base, B.

Any revisions marked "omit" are not gone; other references still refer to
them.  Any revisions marked "discard" are gone forever.

The 1 revisions listed above as "new" are entirely new to this repository and
will be described in separate emails.  The revisions listed as "add" were
already present in the repository and have only been added to this reference.

Summary of changes:
 .github/workflows/rust.yml                         |  57 +-
 Cargo.toml                                         |  15 +-
 ballista-examples/Cargo.toml                       |  16 +-
 .../bin/ballista-sql.rs => examples/test_sql.rs}   |  24 +-
 ballista-examples/src/bin/ballista-dataframe.rs    |   2 +-
 ballista-examples/src/bin/ballista-sql.rs          |   2 +-
 ballista/rust/client/Cargo.toml                    |  10 +-
 ballista/rust/client/README.md                     |   6 +-
 ballista/rust/client/src/context.rs                | 275 +++--
 ball
[arrow-datafusion] 01/01: add a Tablesource
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch rdbms-changes
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git

commit e6614aa8ff84ffc6d36d19ae5eaa3e71602df949
Author: Wakahisa
AuthorDate: Mon Feb 14 22:50:05 2022 +0200

    add a Tablesource
    
    Tablesource contains more information about the source of the table.
    It can be a relational table, file(s), in-memory or unspecified.
---
 datafusion/core/src/datasource/datasource.rs | 34 ++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/datafusion/core/src/datasource/datasource.rs b/datafusion/core/src/datasource/datasource.rs
index 1b59c85..48a2dc0 100644
--- a/datafusion/core/src/datasource/datasource.rs
+++ b/datafusion/core/src/datasource/datasource.rs
@@ -55,6 +55,35 @@ pub enum TableType {
     Temporary,
 }
 
+/// Indicates the source of this table for metadata/catalog purposes.
+#[derive(Debug, Clone, PartialEq)]
+pub enum TableSource {
+    /// An ordinary physical table.
+    Relational {
+        ///
+        server: Option<String>,
+        ///
+        database: Option<String>,
+        ///
+        schema: Option<String>,
+        ///
+        table: String,
+    },
+    /// A file on some file system
+    File {
+        ///
+        protocol: String,
+        ///
+        path: String,
+        ///
+        format: String,
+    },
+    /// A transient table.
+    InMemory,
+    /// An unspecified source, used as the default
+    Unspecified,
+}
+
 /// Source table
 #[async_trait]
 pub trait TableProvider: Sync + Send {
@@ -70,6 +99,11 @@ pub trait TableProvider: Sync + Send {
         TableType::Base
     }
 
+    /// The source of this table
+    fn table_source() -> TableSource {
+        TableSource::Unspecified
+    }
+
     /// Create an ExecutionPlan that will scan the table.
     /// The table provider will be usually responsible of grouping
     /// the source data into partitions that can be efficiently
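The commit above adds a `TableSource` enum and a defaulted trait method. The shape of that API can be sketched in a self-contained form; the `MyCsvTable` and `DefaultTable` providers below are hypothetical illustrations, and the trait is a simplified stand-in for DataFusion's async `TableProvider` (here the method takes `&self` for ergonomics, unlike the commit's associated function):

```rust
/// Simplified, dependency-free sketch of the TableSource enum from the commit.
#[derive(Debug, Clone, PartialEq)]
pub enum TableSource {
    /// An ordinary physical table.
    Relational {
        server: Option<String>,
        database: Option<String>,
        schema: Option<String>,
        table: String,
    },
    /// A file on some file system.
    File {
        protocol: String,
        path: String,
        format: String,
    },
    /// A transient table.
    InMemory,
    /// An unspecified source, used as the default.
    Unspecified,
}

/// Simplified stand-in for DataFusion's TableProvider trait.
pub trait TableProvider {
    /// Providers that do not override this report an unspecified source.
    fn table_source(&self) -> TableSource {
        TableSource::Unspecified
    }
}

/// Hypothetical provider that overrides the default.
struct MyCsvTable;

impl TableProvider for MyCsvTable {
    fn table_source(&self) -> TableSource {
        TableSource::File {
            protocol: "file".to_string(),
            path: "/data/users.csv".to_string(),
            format: "csv".to_string(),
        }
    }
}

/// Hypothetical provider that relies on the default method.
struct DefaultTable;
impl TableProvider for DefaultTable {}

fn main() {
    // The default implementation kicks in for providers that don't override it.
    assert_eq!(DefaultTable.table_source(), TableSource::Unspecified);
    println!("{:?}", MyCsvTable.table_source());
}
```

The default method keeps the change backward compatible: existing providers compile unchanged and simply report `Unspecified`.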
[arrow-datafusion] 01/01: add a Tablesource
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch rdbms-changes
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git

commit 307abcc3cc63ddf589b985c335a06cdd5f25650c
Author: Wakahisa
AuthorDate: Mon Feb 14 22:50:05 2022 +0200

    add a Tablesource
    
    Tablesource contains more information about the source of the table.
    It can be a relational table, file(s), in-memory or unspecified.
---
 datafusion/src/datasource/datasource.rs | 34 +++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/datafusion/src/datasource/datasource.rs b/datafusion/src/datasource/datasource.rs
index 1b59c85..48a2dc0 100644
--- a/datafusion/src/datasource/datasource.rs
+++ b/datafusion/src/datasource/datasource.rs
@@ -55,6 +55,35 @@ pub enum TableType {
     Temporary,
 }
 
+/// Indicates the source of this table for metadata/catalog purposes.
+#[derive(Debug, Clone, PartialEq)]
+pub enum TableSource {
+    /// An ordinary physical table.
+    Relational {
+        ///
+        server: Option<String>,
+        ///
+        database: Option<String>,
+        ///
+        schema: Option<String>,
+        ///
+        table: String,
+    },
+    /// A file on some file system
+    File {
+        ///
+        protocol: String,
+        ///
+        path: String,
+        ///
+        format: String,
+    },
+    /// A transient table.
+    InMemory,
+    /// An unspecified source, used as the default
+    Unspecified,
+}
+
 /// Source table
 #[async_trait]
 pub trait TableProvider: Sync + Send {
@@ -70,6 +99,11 @@ pub trait TableProvider: Sync + Send {
         TableType::Base
     }
 
+    /// The source of this table
+    fn table_source() -> TableSource {
+        TableSource::Unspecified
+    }
+
     /// Create an ExecutionPlan that will scan the table.
     /// The table provider will be usually responsible of grouping
     /// the source data into partitions that can be efficiently
[arrow-datafusion] branch rdbms-changes created (now 307abcc)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch rdbms-changes
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git.

        at 307abcc  add a Tablesource

This branch includes the following new commits:

     new 307abcc  add a Tablesource

The 1 revisions listed above as "new" are entirely new to this repository and
will be described in separate emails.  The revisions listed as "add" were
already present in the repository and have only been added to this reference.
[arrow-rs] branch master updated (6b0956a -> a2e629d)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git.

    from 6b0956a  Publicly export arrow::array::MapBuilder (#1355)
     add a2e629d  Remove delimiter from csv Writer (#1342)

No new revisions were added by this update.

Summary of changes:
 arrow/src/csv/writer.rs | 5 -
 1 file changed, 5 deletions(-)
[arrow-rs] branch master updated (bae3087 -> 6b0956a)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git.

    from bae3087  Make bounds configurable in csv ReaderBuilder (#1341)
     add 6b0956a  Publicly export arrow::array::MapBuilder (#1355)

No new revisions were added by this update.

Summary of changes:
 arrow/src/array/mod.rs | 1 +
 1 file changed, 1 insertion(+)
[arrow-rs] branch master updated (57545b0 -> bae3087)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git.

    from 57545b0  Refactor `StructArray::from` (#1360)
     add bae3087  Make bounds configurable in csv ReaderBuilder (#1341)

No new revisions were added by this update.

Summary of changes:
 arrow/src/csv/reader.rs | 34 --
 1 file changed, 32 insertions(+), 2 deletions(-)
[arrow-rs] branch master updated: Refactor `StructArray::from` (#1360)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git

The following commit(s) were added to refs/heads/master by this push:
     new 57545b0  Refactor `StructArray::from` (#1360)
57545b0 is described below

commit 57545b01f4a784b38a2fdaeda0cfefb4ccbdc5de
Author: Remzi Yang <59198230+haoyang...@users.noreply.github.com>
AuthorDate: Thu Feb 24 15:38:49 2022 +0800

    Refactor `StructArray::from` (#1360)
    
    * add async to default features
    
    Signed-off-by: remzi <1371656737...@gmail.com>
    
    * rewrite
    
    Signed-off-by: remzi <1371656737...@gmail.com>
    
    * update
    
    Signed-off-by: remzi <1371656737...@gmail.com>
---
 arrow/src/array/array_struct.rs | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arrow/src/array/array_struct.rs b/arrow/src/array/array_struct.rs
index 316ffc6..b82ee03 100644
--- a/arrow/src/array/array_struct.rs
+++ b/arrow/src/array/array_struct.rs
@@ -108,10 +108,12 @@ impl StructArray {
 
 impl From<ArrayData> for StructArray {
     fn from(data: ArrayData) -> Self {
-        let mut boxed_fields = vec![];
-        for cd in data.child_data() {
-            boxed_fields.push(make_array(cd.clone()));
-        }
+        let boxed_fields = data
+            .child_data()
+            .iter()
+            .map(|cd| make_array(cd.clone()))
+            .collect();
+
         Self { data, boxed_fields }
     }
 }
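The refactor above replaces a push-into-`Vec` loop with an iterator chain. The same transformation can be shown with plain standard-library types (the integer data below is a hypothetical stand-in for arrow's child arrays):

```rust
// Before: imperative loop that pushes each transformed element.
fn doubled_loop(input: &[i32]) -> Vec<i32> {
    let mut out = vec![];
    for v in input {
        out.push(v * 2);
    }
    out
}

// After: the same result via map/collect, mirroring the StructArray refactor.
fn doubled_iter(input: &[i32]) -> Vec<i32> {
    input.iter().map(|v| v * 2).collect()
}

fn main() {
    let data = [1, 2, 3];
    // Both forms produce identical output; the iterator version lets
    // collect() size the Vec up front from the iterator's length hint.
    assert_eq!(doubled_loop(&data), doubled_iter(&data));
    println!("{:?}", doubled_iter(&data)); // [2, 4, 6]
}
```

Besides being more idiomatic, the `collect()` form avoids the `mut` binding and can preallocate the result from the iterator's size hint.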
[arrow-rs] branch master updated: Add with_datetime_format to csv WriterBuilder (#1347)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git

The following commit(s) were added to refs/heads/master by this push:
     new ef95e52  Add with_datetime_format to csv WriterBuilder (#1347)
ef95e52 is described below

commit ef95e52c012a97facafbba9bc9eaa4ba3fcee8a3
Author: Sergey Glushchenko
AuthorDate: Wed Feb 23 20:15:28 2022 +0100

    Add with_datetime_format to csv WriterBuilder (#1347)
---
 arrow/src/csv/writer.rs | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arrow/src/csv/writer.rs b/arrow/src/csv/writer.rs
index 7752367..18e5c59 100644
--- a/arrow/src/csv/writer.rs
+++ b/arrow/src/csv/writer.rs
@@ -456,6 +456,12 @@ impl WriterBuilder {
         self
     }
 
+    /// Set the CSV file's datetime format
+    pub fn with_datetime_format(mut self, format: String) -> Self {
+        self.datetime_format = Some(format);
+        self
+    }
+
    /// Set the CSV file's time format
    pub fn with_time_format(mut self, format: String) -> Self {
        self.time_format = Some(format);
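The new method follows the consuming-builder pattern already used by `with_time_format`. A dependency-free sketch of that pattern (the struct below is a stand-in with the same field names, not arrow's actual `WriterBuilder`):

```rust
/// Hypothetical stand-in for the csv WriterBuilder; only the two
/// format-related fields are modeled here.
#[derive(Default)]
struct WriterBuilder {
    datetime_format: Option<String>,
    time_format: Option<String>,
}

impl WriterBuilder {
    /// Set the CSV file's datetime format (the method the commit adds).
    /// Takes `self` by value and returns it, so calls chain.
    fn with_datetime_format(mut self, format: String) -> Self {
        self.datetime_format = Some(format);
        self
    }

    /// Set the CSV file's time format (pre-existing sibling method).
    fn with_time_format(mut self, format: String) -> Self {
        self.time_format = Some(format);
        self
    }
}

fn main() {
    // Each with_* call consumes the builder and hands it back,
    // so configuration reads as a single chained expression.
    let builder = WriterBuilder::default()
        .with_datetime_format("%Y-%m-%dT%H:%M:%S".to_string())
        .with_time_format("%H:%M:%S".to_string());
    assert_eq!(builder.datetime_format.as_deref(), Some("%Y-%m-%dT%H:%M:%S"));
    assert_eq!(builder.time_format.as_deref(), Some("%H:%M:%S"));
    println!("builder configured");
}
```

Because each setter consumes and returns `Self`, callers can chain any subset of the `with_*` methods without intermediate `mut` bindings.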
svn commit: r52698 - in /release/arrow/arrow-rs-9.1.0: ./ apache-arrow-rs-9.1.0.tar.gz apache-arrow-rs-9.1.0.tar.gz.asc apache-arrow-rs-9.1.0.tar.gz.sha256 apache-arrow-rs-9.1.0.tar.gz.sha512
Author: nevime
Date: Tue Feb 22 17:16:30 2022
New Revision: 52698

Log:
Apache Arrow Rust 9.1.0

Added:
    release/arrow/arrow-rs-9.1.0/
    release/arrow/arrow-rs-9.1.0/apache-arrow-rs-9.1.0.tar.gz   (with props)
    release/arrow/arrow-rs-9.1.0/apache-arrow-rs-9.1.0.tar.gz.asc
    release/arrow/arrow-rs-9.1.0/apache-arrow-rs-9.1.0.tar.gz.sha256
    release/arrow/arrow-rs-9.1.0/apache-arrow-rs-9.1.0.tar.gz.sha512

Added: release/arrow/arrow-rs-9.1.0/apache-arrow-rs-9.1.0.tar.gz
==============================================================================
Binary file - no diff available.

Propchange: release/arrow/arrow-rs-9.1.0/apache-arrow-rs-9.1.0.tar.gz
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: release/arrow/arrow-rs-9.1.0/apache-arrow-rs-9.1.0.tar.gz.asc
==============================================================================
--- release/arrow/arrow-rs-9.1.0/apache-arrow-rs-9.1.0.tar.gz.asc (added)
+++ release/arrow/arrow-rs-9.1.0/apache-arrow-rs-9.1.0.tar.gz.asc Tue Feb 22 17:16:30 2022
@@ -0,0 +1,7 @@
+-----BEGIN PGP SIGNATURE-----
+
+iHUEABYKAB0WIQQ5BfJU+eUEtA//bPYABIjXcX0/sgUCYhEidQAKCRAABIjXcX0/
+somcAQDZT4ZXRV8g+Lv6WMf5Sn8KiJYmicwC2B2oouNMeiWLtQEA7WL/zMR2KEM9
+9RhX08BC9ljw+PIalrHHlLeZakbUOwo=
+=Z23e
+-----END PGP SIGNATURE-----

Added: release/arrow/arrow-rs-9.1.0/apache-arrow-rs-9.1.0.tar.gz.sha256
==============================================================================
--- release/arrow/arrow-rs-9.1.0/apache-arrow-rs-9.1.0.tar.gz.sha256 (added)
+++ release/arrow/arrow-rs-9.1.0/apache-arrow-rs-9.1.0.tar.gz.sha256 Tue Feb 22 17:16:30 2022
@@ -0,0 +1 @@
+3a60df0d820e3be77a99644fe443e108ff161c3da5227234e5807489eaec9561  apache-arrow-rs-9.1.0.tar.gz

Added: release/arrow/arrow-rs-9.1.0/apache-arrow-rs-9.1.0.tar.gz.sha512
==============================================================================
--- release/arrow/arrow-rs-9.1.0/apache-arrow-rs-9.1.0.tar.gz.sha512 (added)
+++ release/arrow/arrow-rs-9.1.0/apache-arrow-rs-9.1.0.tar.gz.sha512 Tue Feb 22 17:16:30 2022
@@ -0,0 +1 @@
+44adb67bf3559560fdeeadd5ae9188d022b818cb0da3d27cec7ad67d87660d0baab7eb4510c18bf3b8901f45b313de7b19553d42362b9b5867c29519432fcfd8  apache-arrow-rs-9.1.0.tar.gz
svn commit: r52697 - /release/arrow/KEYS
Author: nevime
Date: Tue Feb 22 16:16:27 2022
New Revision: 52697

Log:
Insert Neville Dipale keys to release

Modified:
    release/arrow/KEYS

Modified: release/arrow/KEYS
==============================================================================
--- release/arrow/KEYS (original)
+++ release/arrow/KEYS Tue Feb 22 16:16:27 2022
@@ -1167,3 +1167,23 @@ HoHsSwWTuz2UvPmxhH0LwKHBBmPOZWVF/2iN+cGN
 0rT1eQ==
 =awom
 -----END PGP PUBLIC KEY BLOCK-----
+pub   ed25519 2022-02-19 [SC] [expires: 2024-02-19]
+      3905F254F9E504B40FFF6CF6000488D7717D3FB2
+uid           [ultimate] Neville Dipale
+sig 3         000488D7717D3FB2 2022-02-19  Neville Dipale
+sub   cv25519 2022-02-19 [E] [expires: 2024-02-19]
+sig           000488D7717D3FB2 2022-02-19  Neville Dipale
+
+-----BEGIN PGP PUBLIC KEY BLOCK-----
+
+mDMEYhEgWBYJKwYBBAHaRw8BAQdAXN9r2gDzqnm3M14+5gjzOQGfE9Y7syUZPkZK
+IXFGigS0Ik5ldmlsbGUgRGlwYWxlIDxuZXZpbWVAYXBhY2hlLm9yZz6ImgQTFgoA
+QhYhBDkF8lT55QS0D/9s9gAEiNdxfT+yBQJiESBYAhsDBQkDwmcABQsJCAcCAyIC
+AQYVCgkICwIEFgIDAQIeBwIXgAAKCRAABIjXcX0/ssb7AP96RAhkNNRuaQa2uwbL
+jOSWZipmeW7flCxVKrEhntTIaAEA8oYIwNxuo73+zM9azRNCZbvvZIFlN+09qQMC
+xfkssAm4OARiESBYEgorBgEEAZdVAQUBAQdA2PqrNkrWXfOHuPrj1xeNfIG37fW8
+JXPzqy4/MaIUGSsDAQgHiH4EGBYKACYWIQQ5BfJU+eUEtA//bPYABIjXcX0/sgUC
+YhEgWAIbDAUJA8JnAAAKCRAABIjXcX0/sp36AQCS2vIDq364qtOQzWbotWgjgWH2
+yW1iX/b2CJSl0CZHTgD8CuqXjMk3WequwZhLb61ZqdeUWXvVqny4dxkSg3LFsQw=
+=4aGL
+-----END PGP PUBLIC KEY BLOCK-----
[arrow-rs] branch master updated: Arrow Rust + Conbench Integration (#1289)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git

The following commit(s) were added to refs/heads/master by this push:
     new 4b89f7e  Arrow Rust + Conbench Integration (#1289)
4b89f7e is described below

commit 4b89f7ee3549c24fa5997056b16a1cde60ce7043
Author: diana
AuthorDate: Tue Feb 22 02:20:54 2022 -0700

    Arrow Rust + Conbench Integration (#1289)
    
    * Arrow Rust + Conbench Integration
    
    * remove --src-dir
---
 conbench/.flake8                  |   2 +
 conbench/.gitignore               | 130
 conbench/.isort.cfg               |   2 +
 conbench/README.md                | 251 ++
 conbench/_criterion.py            |  98 +++
 conbench/benchmarks.json          |   8 ++
 conbench/benchmarks.py            |  41 +++
 conbench/requirements-test.txt    |   3 +
 conbench/requirements.txt         |   1 +
 dev/release/rat_exclude_files.txt |   5 +
 10 files changed, 541 insertions(+)

diff --git a/conbench/.flake8 b/conbench/.flake8
new file mode 100644
index 0000000..e44b810
--- /dev/null
+++ b/conbench/.flake8
@@ -0,0 +1,2 @@
+[flake8]
+ignore = E501
diff --git a/conbench/.gitignore b/conbench/.gitignore
new file mode 100755
index 0000000..aa44ee2
--- /dev/null
+++ b/conbench/.gitignore
@@ -0,0 +1,130 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+pip-wheel-metadata/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+.python-version
+
+# pipenv
+# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+# However, in case of collaboration, if having platform-specific dependencies or dependencies
+# having no cross-platform support, pipenv may install dependencies that don't work, or not
+# install all needed dependencies.
+#Pipfile.lock
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
diff --git a/conbench/.isort.cfg b/conbench/.isort.cfg
new file mode 100644
index 0000000..f238bf7
--- /dev/null
+++ b/conbench/.isort.cfg
@@ -0,0 +1,2 @@
+[settings]
+profile = black
diff --git a/conbench/README.md b/conbench/README.md
new file mode 100644
index 0000000..8c7f38c
--- /dev/null
+++ b/conbench/README.md
@@ -0,0 +1,251 @@
+
+
+# Arrow Rust + Conbench Integration
+
+
+## Quick start
+
+```
+$ cd ~/arrow-rs/conbench/
+$ conda create -y -n conbench python=3.9
+$ conda activate conbench
+(conbench) $ pip install -r requirements.txt
+(conbench) $ conbench arrow-rs
+```
+
+## Example output
+
+```
+{
+    "batch_id": "b68c559358cc43a3aab02d893d2693f4",
+    "context": {
+        "benchmark_language": "Rust"
+    },
+    "github": {
+        "commit": "ca33a0a50494f95840ade2e9509c3c3d4df35249",
+        "repository": "https://github.com/dianaclarke/arrow-rs"
+    },
+    "info": {},
+    "machine_info": {
+        "architecture_name": "x86_64",
+        "cpu_core_count": "8",
+        "cpu_frequency_max_hz": "24",
+        "cpu_l1d_cache_bytes": "65536",
+        "cpu_l1i_cache_bytes": "131072",
+        "cpu_l2_cache_bytes": "4194304",
+        "cpu_l3_cache_bytes": "0",
+        "cpu_model_name": "Apple M1",
+        "cpu_thread_count": "8",
+        "gpu_count": "0
svn commit: r52639 - /dev/arrow/KEYS
Author: nevime
Date: Sun Feb 20 08:40:26 2022
New Revision: 52639

Log:
Add Neville Dipale keys

Modified:
    dev/arrow/KEYS

Modified: dev/arrow/KEYS
==============================================================================
--- dev/arrow/KEYS (original)
+++ dev/arrow/KEYS Sun Feb 20 08:40:26 2022
@@ -1263,3 +1263,23 @@ HoHsSwWTuz2UvPmxhH0LwKHBBmPOZWVF/2iN+cGN
 0rT1eQ==
 =awom
 -----END PGP PUBLIC KEY BLOCK-----
+pub   ed25519 2022-02-19 [SC] [expires: 2024-02-19]
+      3905F254F9E504B40FFF6CF6000488D7717D3FB2
+uid           [ultimate] Neville Dipale
+sig 3         000488D7717D3FB2 2022-02-19  Neville Dipale
+sub   cv25519 2022-02-19 [E] [expires: 2024-02-19]
+sig           000488D7717D3FB2 2022-02-19  Neville Dipale
+
+-----BEGIN PGP PUBLIC KEY BLOCK-----
+
+mDMEYhEgWBYJKwYBBAHaRw8BAQdAXN9r2gDzqnm3M14+5gjzOQGfE9Y7syUZPkZK
+IXFGigS0Ik5ldmlsbGUgRGlwYWxlIDxuZXZpbWVAYXBhY2hlLm9yZz6ImgQTFgoA
+QhYhBDkF8lT55QS0D/9s9gAEiNdxfT+yBQJiESBYAhsDBQkDwmcABQsJCAcCAyIC
+AQYVCgkICwIEFgIDAQIeBwIXgAAKCRAABIjXcX0/ssb7AP96RAhkNNRuaQa2uwbL
+jOSWZipmeW7flCxVKrEhntTIaAEA8oYIwNxuo73+zM9azRNCZbvvZIFlN+09qQMC
+xfkssAm4OARiESBYEgorBgEEAZdVAQUBAQdA2PqrNkrWXfOHuPrj1xeNfIG37fW8
+JXPzqy4/MaIUGSsDAQgHiH4EGBYKACYWIQQ5BfJU+eUEtA//bPYABIjXcX0/sgUC
+YhEgWAIbDAUJA8JnAAAKCRAABIjXcX0/sp36AQCS2vIDq364qtOQzWbotWgjgWH2
+yW1iX/b2CJSl0CZHTgD8CuqXjMk3WequwZhLb61ZqdeUWXvVqny4dxkSg3LFsQw=
+=4aGL
+-----END PGP PUBLIC KEY BLOCK-----
svn commit: r52634 - in /dev/arrow/apache-arrow-rs-9.1.0-rc1: ./ apache-arrow-rs-9.1.0.tar.gz apache-arrow-rs-9.1.0.tar.gz.asc apache-arrow-rs-9.1.0.tar.gz.sha256 apache-arrow-rs-9.1.0.tar.gz.sha512
Author: nevime
Date: Sat Feb 19 17:02:05 2022
New Revision: 52634

Log:
Apache Arrow Rust 9.1.0 1

Added:
    dev/arrow/apache-arrow-rs-9.1.0-rc1/
    dev/arrow/apache-arrow-rs-9.1.0-rc1/apache-arrow-rs-9.1.0.tar.gz   (with props)
    dev/arrow/apache-arrow-rs-9.1.0-rc1/apache-arrow-rs-9.1.0.tar.gz.asc
    dev/arrow/apache-arrow-rs-9.1.0-rc1/apache-arrow-rs-9.1.0.tar.gz.sha256
    dev/arrow/apache-arrow-rs-9.1.0-rc1/apache-arrow-rs-9.1.0.tar.gz.sha512

Added: dev/arrow/apache-arrow-rs-9.1.0-rc1/apache-arrow-rs-9.1.0.tar.gz
==============================================================================
Binary file - no diff available.

Propchange: dev/arrow/apache-arrow-rs-9.1.0-rc1/apache-arrow-rs-9.1.0.tar.gz
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: dev/arrow/apache-arrow-rs-9.1.0-rc1/apache-arrow-rs-9.1.0.tar.gz.asc
==============================================================================
--- dev/arrow/apache-arrow-rs-9.1.0-rc1/apache-arrow-rs-9.1.0.tar.gz.asc (added)
+++ dev/arrow/apache-arrow-rs-9.1.0-rc1/apache-arrow-rs-9.1.0.tar.gz.asc Sat Feb 19 17:02:05 2022
@@ -0,0 +1,7 @@
+-----BEGIN PGP SIGNATURE-----
+
+iHUEABYKAB0WIQQ5BfJU+eUEtA//bPYABIjXcX0/sgUCYhEidQAKCRAABIjXcX0/
+somcAQDZT4ZXRV8g+Lv6WMf5Sn8KiJYmicwC2B2oouNMeiWLtQEA7WL/zMR2KEM9
+9RhX08BC9ljw+PIalrHHlLeZakbUOwo=
+=Z23e
+-----END PGP SIGNATURE-----

Added: dev/arrow/apache-arrow-rs-9.1.0-rc1/apache-arrow-rs-9.1.0.tar.gz.sha256
==============================================================================
--- dev/arrow/apache-arrow-rs-9.1.0-rc1/apache-arrow-rs-9.1.0.tar.gz.sha256 (added)
+++ dev/arrow/apache-arrow-rs-9.1.0-rc1/apache-arrow-rs-9.1.0.tar.gz.sha256 Sat Feb 19 17:02:05 2022
@@ -0,0 +1 @@
+3a60df0d820e3be77a99644fe443e108ff161c3da5227234e5807489eaec9561  apache-arrow-rs-9.1.0.tar.gz

Added: dev/arrow/apache-arrow-rs-9.1.0-rc1/apache-arrow-rs-9.1.0.tar.gz.sha512
==============================================================================
--- dev/arrow/apache-arrow-rs-9.1.0-rc1/apache-arrow-rs-9.1.0.tar.gz.sha512 (added)
+++ dev/arrow/apache-arrow-rs-9.1.0-rc1/apache-arrow-rs-9.1.0.tar.gz.sha512 Sat Feb 19 17:02:05 2022
@@ -0,0 +1 @@
+44adb67bf3559560fdeeadd5ae9188d022b818cb0da3d27cec7ad67d87660d0baab7eb4510c18bf3b8901f45b313de7b19553d42362b9b5867c29519432fcfd8  apache-arrow-rs-9.1.0.tar.gz
[arrow-rs] tag 9.1.0 created (now ecba7dc)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a change to tag 9.1.0 in repository https://gitbox.apache.org/repos/asf/arrow-rs.git. at ecba7dc (commit) No new revisions were added by this update.
[arrow-rs] branch master updated (041b77d -> ecba7dc)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git.

    from 041b77d  Update the document of function `MutableArrayData::extend` (#1336)
     add ecba7dc  Update versions and CHANGELOG for 9.1.0 release (#1325)

No new revisions were added by this update.

Summary of changes:
 CHANGELOG.md                                       | 65 +-
 arrow-flight/Cargo.toml                            |  4 +-
 arrow-pyarrow-integration-testing/Cargo.toml       |  4 +-
 arrow/Cargo.toml                                   |  2 +-
 arrow/README.md                                    |  2 +-
 arrow/test/dependency/default-features/Cargo.toml  |  2 +-
 .../test/dependency/no-default-features/Cargo.toml |  2 +-
 arrow/test/dependency/simd/Cargo.toml              |  2 +-
 dev/release/update_change_log.sh                   |  4 +-
 integration-testing/Cargo.toml                     |  2 +-
 parquet/Cargo.toml                                 |  6 +-
 parquet_derive/Cargo.toml                          |  4 +-
 parquet_derive/README.md                           |  4 +-
 .../test/dependency/default-features/Cargo.toml    |  2 +-
 parquet_derive_test/Cargo.toml                     |  6 +-
 15 files changed, 87 insertions(+), 24 deletions(-)
[arrow-rs] branch master updated (193b64c -> 041b77d)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git.

    from 193b64c  Clean up DictionaryArray construction in test (#1314)
     add 041b77d  Update the document of function `MutableArrayData::extend` (#1336)

No new revisions were added by this update.

Summary of changes:
 arrow/src/array/transform/mod.rs | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)
[arrow-rs] branch master updated: Clean up DictionaryArray construction in test (#1314)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git

The following commit(s) were added to refs/heads/master by this push:
     new 193b64c  Clean up DictionaryArray construction in test (#1314)
193b64c is described below

commit 193b64c69f0560a1a01ae4c04004b81afb02fab6
Author: Andrew Lamb
AuthorDate: Sat Feb 19 11:02:28 2022 -0500

    Clean up DictionaryArray construction in test (#1314)
---
 arrow/src/array/array_dictionary.rs | 25 -
 1 file changed, 4 insertions(+), 21 deletions(-)

diff --git a/arrow/src/array/array_dictionary.rs b/arrow/src/array/array_dictionary.rs
index 57153f1..7e82ad2 100644
--- a/arrow/src/array/array_dictionary.rs
+++ b/arrow/src/array/array_dictionary.rs
@@ -302,6 +302,7 @@ mod tests {
 
     use super::*;
     use crate::array::Int8Array;
+    use crate::datatypes::Int16Type;
     use crate::{
         array::Int16DictionaryArray, array::PrimitiveDictionaryBuilder,
         datatypes::DataType,
@@ -472,29 +473,11 @@ mod tests {
     #[test]
     fn test_dictionary_iter() {
         // Construct a value array
-        let value_data = ArrayData::builder(DataType::Int8)
-            .len(8)
-            .add_buffer(Buffer::from(
-                &[10_i8, 11, 12, 13, 14, 15, 16, 17].to_byte_slice(),
-            ))
-            .build()
-            .unwrap();
-
-        // Construct a buffer for value offsets, for the nested array:
-        let keys = Buffer::from(&[2_i16, 3, 4].to_byte_slice());
+        let values = Int8Array::from_iter_values([10_i8, 11, 12, 13, 14, 15, 16, 17]);
+        let keys = Int16Array::from_iter_values([2_i16, 3, 4]);
 
         // Construct a dictionary array from the above two
-        let key_type = DataType::Int16;
-        let value_type = DataType::Int8;
-        let dict_data_type =
-            DataType::Dictionary(Box::new(key_type), Box::new(value_type));
-        let dict_data = ArrayData::builder(dict_data_type)
-            .len(3)
-            .add_buffer(keys)
-            .add_child_data(value_data)
-            .build()
-            .unwrap();
-        let dict_array = Int16DictionaryArray::from(dict_data);
+        let dict_array = DictionaryArray::<Int16Type>::try_new(&keys, &values).unwrap();
 
         let mut key_iter = dict_array.keys_iter();
         assert_eq!(2, key_iter.next().unwrap().unwrap());
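The test above builds a dictionary array: an array of integer keys that index into a separate values array. The encoding itself can be sketched without arrow (plain `Vec`s stand in for the arrow arrays; the data matches the test's values `[10..=17]` and keys `[2, 3, 4]`):

```rust
// A dictionary array stores small integer keys that index into a shared
// values array; decoding a key k yields values[k]. This is a std-only
// sketch of the layout, not arrow's actual DictionaryArray type.
fn decode(keys: &[i16], values: &[i8]) -> Vec<i8> {
    keys.iter().map(|&k| values[k as usize]).collect()
}

fn main() {
    // Same data as the test: values 10..=17, keys [2, 3, 4].
    let values: Vec<i8> = (10..=17).collect();
    let keys: Vec<i16> = vec![2, 3, 4];

    // Keys 2, 3, 4 select values 12, 13, 14.
    assert_eq!(decode(&keys, &values), vec![12, 13, 14]);
    println!("{:?}", decode(&keys, &values)); // [12, 13, 14]
}
```

This also shows why the test's first key iteration yields `2`: the keys array is iterated as-is, and each key is only resolved to a value when the logical elements are requested.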
[arrow-rs] branch master updated: Cleanup: remove some dead / test only code (#1331)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new c0351f8 Cleanup: remove some dead / test only code (#1331) c0351f8 is described below commit c0351f84e61172f8403c468868919ec01538ce09 Author: Andrew Lamb AuthorDate: Sat Feb 19 10:57:52 2022 -0500 Cleanup: remove some dead / test only code (#1331) --- arrow/src/array/data.rs | 23 arrow/src/compute/util.rs | 93 ++- 2 files changed, 35 insertions(+), 81 deletions(-) diff --git a/arrow/src/array/data.rs b/arrow/src/array/data.rs index d2db0d0..cbbc56a 100644 --- a/arrow/src/array/data.rs +++ b/arrow/src/array/data.rs @@ -198,29 +198,6 @@ pub(crate) fn new_buffers(data_type: , capacity: usize) -> [MutableBuff } } -/// Ensures that at least `min_size` elements of type `data_type` can -/// be stored in a buffer of `buffer_size`. -/// -/// `buffer_index` is used in error messages to identify which buffer -/// had the invalid index -#[allow(dead_code)] -fn ensure_size( -data_type: , -min_size: usize, -buffer_size: usize, -buffer_index: usize, -) -> Result<()> { -// if min_size is zero, may not have buffers (e.g. NullArray) -if min_size > 0 && buffer_size < min_size { -Err(ArrowError::InvalidArgumentError(format!( -"Need at least {} bytes in buffers[{}] in array of type {:?}, but got {}", -buffer_size, buffer_index, data_type, min_size -))) -} else { -Ok(()) -} -} - /// Maps 2 [`MutableBuffer`]s into a vector of [Buffer]s whose size depends on `data_type`. #[inline] pub(crate) fn into_buffers( diff --git a/arrow/src/compute/util.rs b/arrow/src/compute/util.rs index 3f168c1..62c3be6 100644 --- a/arrow/src/compute/util.rs +++ b/arrow/src/compute/util.rs @@ -18,7 +18,7 @@ //! Common utilities for computation kernels. 
use crate::array::*; -use crate::buffer::{buffer_bin_and, buffer_bin_or, Buffer}; +use crate::buffer::{buffer_bin_and, Buffer}; use crate::datatypes::*; use crate::error::{ArrowError, Result}; use num::{One, ToPrimitive, Zero}; @@ -58,41 +58,6 @@ pub(super) fn combine_option_bitmap( } } -/// Compares the null bitmaps of two arrays using a bitwise `or` operation. -/// -/// This function is useful when implementing operations on higher level arrays. -#[allow(clippy::unnecessary_wraps)] -#[allow(dead_code)] -pub(super) fn compare_option_bitmap( -left_data: , -right_data: , -len_in_bits: usize, -) -> Result> { -let left_offset_in_bits = left_data.offset(); -let right_offset_in_bits = right_data.offset(); - -let left = left_data.null_buffer(); -let right = right_data.null_buffer(); - -match left { -None => match right { -None => Ok(None), -Some(r) => Ok(Some(r.bit_slice(right_offset_in_bits, len_in_bits))), -}, -Some(l) => match right { -None => Ok(Some(l.bit_slice(left_offset_in_bits, len_in_bits))), - -Some(r) => Ok(Some(buffer_bin_or( -l, -left_offset_in_bits, -r, -right_offset_in_bits, -len_in_bits, -))), -}, -} -} - /// Takes/filters a list array's inner data using the offsets of the list array. /// /// Where a list array has indices `[0,2,5,10]`, taking indices of `[2,0]` returns @@ -176,10 +141,44 @@ pub(super) mod tests { use std::sync::Arc; +use crate::buffer::buffer_bin_or; use crate::datatypes::DataType; use crate::util::bit_util; use crate::{array::ArrayData, buffer::MutableBuffer}; +/// Compares the null bitmaps of two arrays using a bitwise `or` operation. +/// +/// This function is useful when implementing operations on higher level arrays. 
+pub(super) fn compare_option_bitmap( +left_data: , +right_data: , +len_in_bits: usize, +) -> Result> { +let left_offset_in_bits = left_data.offset(); +let right_offset_in_bits = right_data.offset(); + +let left = left_data.null_buffer(); +let right = right_data.null_buffer(); + +match left { +None => match right { +None => Ok(None), +Some(r) => Ok(Some(r.bit_slice(right_offset_in_bits, len_in_bits))), +}, +Some(l) => match right { +None => Ok(Some(l.bit_slice(left_offset_in_bits, len_in_bits))), + +Some(r) => Ok(Some(buffer_bin_or( +l, +left_offset_in_bits, +r, +right_offset_in_bits, +len_in_bits, +))), +
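The relocated `compare_option_bitmap` helper unions two optional null bitmaps: an absent buffer passes the other side through, and two present buffers are combined bitwise. A minimal standalone sketch of that logic, with plain byte vectors standing in for Arrow `Buffer`s and the offset/slicing details omitted (names are illustrative, not the arrow-rs API):

```rust
/// Union of two optional validity bitmaps, mirroring the shape of
/// `compare_option_bitmap`: absent buffers pass the other side through,
/// two present buffers are OR-ed together byte by byte.
fn union_null_bitmaps(left: Option<&[u8]>, right: Option<&[u8]>) -> Option<Vec<u8>> {
    match (left, right) {
        (None, None) => None,
        (Some(l), None) => Some(l.to_vec()),
        (None, Some(r)) => Some(r.to_vec()),
        // a slot is set in the result if it is set in either input
        (Some(l), Some(r)) => Some(l.iter().zip(r).map(|(a, b)| a | b).collect()),
    }
}

fn main() {
    let l = [0b0000_1011u8];
    let r = [0b0000_1101u8];
    assert_eq!(
        union_null_bitmaps(Some(l.as_slice()), Some(r.as_slice())),
        Some(vec![0b0000_1111u8])
    );
    assert_eq!(union_null_bitmaps(None, None), None);
}
```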
[arrow-rs] branch master updated: fix failing csv_writer bench (#1293)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new 39f3f71 fix failing csv_writer bench (#1293) 39f3f71 is described below commit 39f3f711876ff113545b1a2d7023f66de77bb731 Author: Andy Grove AuthorDate: Thu Feb 10 00:58:08 2022 -0700 fix failing csv_writer bench (#1293) --- arrow/benches/csv_writer.rs | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arrow/benches/csv_writer.rs b/arrow/benches/csv_writer.rs index 62c5da9..3ecf514 100644 --- a/arrow/benches/csv_writer.rs +++ b/arrow/benches/csv_writer.rs @@ -25,6 +25,7 @@ use arrow::array::*; use arrow::csv; use arrow::datatypes::*; use arrow::record_batch::RecordBatch; +use std::env; use std::fs::File; use std::sync::Arc; @@ -56,7 +57,8 @@ fn criterion_benchmark(c: Criterion) { vec![Arc::new(c1), Arc::new(c2), Arc::new(c3), Arc::new(c4)], ) .unwrap(); -let file = File::create("target/bench_write_csv.csv").unwrap(); +let path = env::temp_dir().join("bench_write_csv.csv"); +let file = File::create(path).unwrap(); let mut writer = csv::Writer::new(file); let batches = vec![, , , , , , , , , , ];
[arrow-rs] branch master updated: JSON reader - empty nested list should not create child value (#826)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new e898de5 JSON reader - empty nested list should not create child value (#826) e898de5 is described below commit e898de57e4587c64387939f8a557bc5fa2dffeb8 Author: Wakahisa AuthorDate: Wed Oct 13 15:46:07 2021 +0200 JSON reader - empty nested list should not create child value (#826) * JSON reader - empty nested list should not create child value * PR review --- arrow/src/json/reader.rs | 41 ++ arrow/src/json/writer.rs | 52 2 files changed, 71 insertions(+), 22 deletions(-) diff --git a/arrow/src/json/reader.rs b/arrow/src/json/reader.rs index 9592b59..c2a2de9 100644 --- a/arrow/src/json/reader.rs +++ b/arrow/src/json/reader.rs @@ -1048,31 +1048,27 @@ impl Decoder { } DataType::Struct(fields) => { // extract list values, with non-lists converted to Value::Null -let array_item_count = rows -.iter() -.map(|row| match row { -Value::Array(values) => values.len(), -_ => 1, -}) -.sum(); +let array_item_count = cur_offset.to_usize().unwrap(); let num_bytes = bit_util::ceil(array_item_count, 8); let mut null_buffer = MutableBuffer::from_len_zeroed(num_bytes); let mut struct_index = 0; let rows: Vec = rows .iter() -.flat_map(|row| { -if let Value::Array(values) = row { -values.iter().for_each(|_| { -bit_util::set_bit( -null_buffer.as_slice_mut(), -struct_index, -); +.flat_map(|row| match row { +Value::Array(values) if !values.is_empty() => { +values.iter().for_each(|value| { +if !value.is_null() { +bit_util::set_bit( +null_buffer.as_slice_mut(), +struct_index, +); +} struct_index += 1; }); values.clone() -} else { -struct_index += 1; -vec![Value::Null] +} +_ => { +vec![] } }) .collect(); @@ -2209,6 +2205,7 @@ mod tests { {"a": [{"b": true, "c": {"d": "c_text"}}, {"b": null, "c": {"d": "d_text"}}, {"b": true, "c": {"d": null}}]} {"a": 
null} {"a": []} +{"a": [null]} "#; let mut reader = builder.build(Cursor::new(json_content)).unwrap(); @@ -2243,23 +2240,23 @@ mod tests { .null_bit_buffer(Buffer::from(vec![0b0011])) .build(); let a_list = ArrayDataBuilder::new(a_field.data_type().clone()) -.len(5) -.add_buffer(Buffer::from_slice_ref(&[0i32, 2, 3, 6, 6, 6])) +.len(6) +.add_buffer(Buffer::from_slice_ref(&[0i32, 2, 3, 6, 6, 6, 7])) .add_child_data(a) -.null_bit_buffer(Buffer::from(vec![0b00010111])) +.null_bit_buffer(Buffer::from(vec![0b00110111])) .build(); let expected = make_array(a_list); // compare `a` with result from json reader let batch = reader.next().unwrap().unwrap(); let read = batch.column(0); -assert_eq!(read.len(), 5); +assert_eq!(read.len(), 6); // compare the arrays the long way around, to better detect differences let read: = read.as_any().downcast_ref::().unwrap(); let expected = expected.as_any().downcast_ref::().unwrap(); assert_eq!( read.data().buffers()[0], -Buffer::from_slice_ref(&[0i32, 2, 3, 6, 6, 6]) +Buffer::from_slice_ref(&[0i32, 2, 3, 6, 6, 6, 7]) ); // compare list null buffers assert_eq!(read.data().null_buffer(), expected.data().null_buffer()); diff --git a/arrow/src/json/writer.rs b/arrow/src/json/wr
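The corrected test expects the offsets buffer `[0i32, 2, 3, 6, 6, 6, 7]` because the added `{"a": [null]}` row contributes one (null) child slot, while the `null` and `[]` rows contribute none. A sketch of how such list offsets accumulate (illustrative only, not the reader's actual decoding path):

```rust
/// Each row appends its cumulative element count to the offsets buffer.
/// An empty list (or a null row) repeats the previous offset; `[null]`
/// still advances the offset by one, adding a null child slot.
fn list_offsets(rows: &[Vec<Option<i32>>]) -> Vec<i32> {
    let mut offsets = vec![0i32];
    let mut cur = 0i32;
    for row in rows {
        cur += row.len() as i32;
        offsets.push(cur);
    }
    offsets
}

fn main() {
    let rows = vec![
        vec![Some(1), Some(2)],          // 2 values
        vec![Some(3)],                   // 1 value
        vec![Some(4), Some(5), Some(6)], // 3 values
        vec![],                          // null row, modeled as empty here
        vec![],                          // empty list []
        vec![None],                      // [null]: one null child slot
    ];
    assert_eq!(list_offsets(&rows), vec![0, 2, 3, 6, 6, 6, 7]);
}
```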
[arrow-rs] branch master updated: Fix null count when casting ListArray (#816)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new a835f2c Fix null count when casting ListArray (#816) a835f2c is described below commit a835f2cd1c1c8f7aca092eeafab16f76a07f285f Author: Andrew Lamb AuthorDate: Wed Oct 6 20:29:04 2021 -0400 Fix null count when casting ListArray (#816) --- arrow/src/compute/kernels/cast.rs | 38 ++ 1 file changed, 18 insertions(+), 20 deletions(-) diff --git a/arrow/src/compute/kernels/cast.rs b/arrow/src/compute/kernels/cast.rs index 593adec..a0847d1 100644 --- a/arrow/src/compute/kernels/cast.rs +++ b/arrow/src/compute/kernels/cast.rs @@ -1680,12 +1680,8 @@ fn cast_list_inner( let array_data = ArrayData::new( to_type.clone(), array.len(), -Some(cast_array.null_count()), -cast_array -.data() -.null_bitmap() -.clone() -.map(|bitmap| bitmap.bits), +Some(data.null_count()), +data.null_bitmap().clone().map(|bitmap| bitmap.bits), array.offset(), // reuse offset buffer data.buffers().to_vec(), @@ -2025,7 +2021,6 @@ mod tests { #[test] fn test_cast_list_i32_to_list_u16() { -// Construct a value array let value_data = Int32Array::from(vec![0, 0, 0, -1, -2, -1, 2, 1]) .data() .clone(); @@ -2033,6 +2028,7 @@ mod tests { let value_offsets = Buffer::from_slice_ref(&[0, 3, 6, 8]); // Construct a list array from the above two +// [[0,0,0], [-1, -2, -1], [2, 1]] let list_data_type = DataType::List(Box::new(Field::new("item", DataType::Int32, true))); let list_data = ArrayData::builder(list_data_type) @@ -2047,9 +2043,13 @@ mod tests { ::List(Box::new(Field::new("item", DataType::UInt16, true))), ) .unwrap(); + +// For the ListArray itself, there are no null values (as there were no nulls when they went in) +// // 3 negative values should get lost when casting to unsigned, // 1 value should overflow -assert_eq!(4, cast_array.null_count()); +assert_eq!(0, 
cast_array.null_count()); + // offsets should be the same assert_eq!( list_array.data().buffers().to_vec(), @@ -2061,23 +2061,21 @@ mod tests { .downcast_ref::() .unwrap(); assert_eq!(DataType::UInt16, array.value_type()); -assert_eq!(4, array.values().null_count()); assert_eq!(3, array.value_length(0)); assert_eq!(3, array.value_length(1)); assert_eq!(2, array.value_length(2)); + +// expect 4 nulls: negative numbers and overflow let values = array.values(); +assert_eq!(4, values.null_count()); let u16arr = values.as_any().downcast_ref::().unwrap(); -assert_eq!(8, u16arr.len()); -assert_eq!(4, u16arr.null_count()); - -assert_eq!(0, u16arr.value(0)); -assert_eq!(0, u16arr.value(1)); -assert_eq!(0, u16arr.value(2)); -assert!(!u16arr.is_valid(3)); -assert!(!u16arr.is_valid(4)); -assert!(!u16arr.is_valid(5)); -assert_eq!(2, u16arr.value(6)); -assert!(!u16arr.is_valid(7)); + +let expected: UInt16Array = +vec![Some(0), Some(0), Some(0), None, None, None, Some(2), None] +.into_iter() +.collect(); + +assert_eq!(u16arr, ); } #[test]
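The corrected expectation is that the `ListArray` itself reports `null_count == 0` while the cast child values gain 4 nulls. A standalone sketch of the value-level behavior, using `u16::try_from` as a stand-in for the cast kernel (the overflowing input value here is illustrative):

```rust
/// Lossy i32 -> u16 cast: out-of-range values become None (null),
/// mirroring how the cast kernel nullifies negatives and overflows.
fn cast_i32_to_u16(values: &[i32]) -> Vec<Option<u16>> {
    values.iter().map(|&v| u16::try_from(v).ok()).collect()
}

fn main() {
    // child values of [[0,0,0], [-1,-2,-1], [2, 131072]] (illustrative)
    let child = [0, 0, 0, -1, -2, -1, 2, 1 << 17];
    let cast = cast_i32_to_u16(&child);

    // three negatives + one overflow become null at the *values* level
    assert_eq!(cast.iter().filter(|v| v.is_none()).count(), 4);

    // the three lists themselves were all valid going in, and the fix
    // copies that validity unchanged: list-level null count stays 0
    let list_validity = [true, true, true];
    assert_eq!(list_validity.iter().filter(|v| !**v).count(), 0);
}
```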
[arrow-datafusion] branch master updated: reduce ScalarValue from trait boilerplate with macro (#989)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git The following commit(s) were added to refs/heads/master by this push: new bb82ca1 reduce ScalarValue from trait boilerplate with macro (#989) bb82ca1 is described below commit bb82ca100b233653811d14a4cc18cad5e5bd7536 Author: QP Hou AuthorDate: Sat Sep 11 13:58:48 2021 -0700 reduce ScalarValue from trait boilerplate with macro (#989) Co-authored-by: Jorge Leitao Co-authored-by: Jorge Leitao --- datafusion/src/scalar.rs | 152 --- 1 file changed, 24 insertions(+), 128 deletions(-) diff --git a/datafusion/src/scalar.rs b/datafusion/src/scalar.rs index 86d1765..77d4c82 100644 --- a/datafusion/src/scalar.rs +++ b/datafusion/src/scalar.rs @@ -1122,137 +1122,33 @@ impl ScalarValue { } } -impl From for ScalarValue { -fn from(value: f64) -> Self { -Some(value).into() -} -} - -impl From> for ScalarValue { -fn from(value: Option) -> Self { -ScalarValue::Float64(value) -} -} - -impl From for ScalarValue { -fn from(value: f32) -> Self { -Some(value).into() -} -} - -impl From> for ScalarValue { -fn from(value: Option) -> Self { -ScalarValue::Float32(value) -} -} - -impl From for ScalarValue { -fn from(value: i8) -> Self { -Some(value).into() -} -} - -impl From> for ScalarValue { -fn from(value: Option) -> Self { -ScalarValue::Int8(value) -} -} - -impl From for ScalarValue { -fn from(value: i16) -> Self { -Some(value).into() -} -} - -impl From> for ScalarValue { -fn from(value: Option) -> Self { -ScalarValue::Int16(value) -} -} - -impl From for ScalarValue { -fn from(value: i32) -> Self { -Some(value).into() -} -} - -impl From> for ScalarValue { -fn from(value: Option) -> Self { -ScalarValue::Int32(value) -} -} - -impl From for ScalarValue { -fn from(value: i64) -> Self { -Some(value).into() -} -} - -impl From> for ScalarValue { -fn from(value: Option) -> Self { -ScalarValue::Int64(value) -} -} - -impl 
From for ScalarValue { -fn from(value: bool) -> Self { -Some(value).into() -} -} - -impl From> for ScalarValue { -fn from(value: Option) -> Self { -ScalarValue::Boolean(value) -} -} - -impl From for ScalarValue { -fn from(value: u8) -> Self { -Some(value).into() -} -} - -impl From> for ScalarValue { -fn from(value: Option) -> Self { -ScalarValue::UInt8(value) -} -} - -impl From for ScalarValue { -fn from(value: u16) -> Self { -Some(value).into() -} -} - -impl From> for ScalarValue { -fn from(value: Option) -> Self { -ScalarValue::UInt16(value) -} -} - -impl From for ScalarValue { -fn from(value: u32) -> Self { -Some(value).into() -} -} - -impl From> for ScalarValue { -fn from(value: Option) -> Self { -ScalarValue::UInt32(value) -} -} +macro_rules! impl_scalar { +($ty:ty, $scalar:tt) => { +impl From<$ty> for ScalarValue { +fn from(value: $ty) -> Self { +ScalarValue::$scalar(Some(value)) +} +} -impl From for ScalarValue { -fn from(value: u64) -> Self { -Some(value).into() -} +impl From> for ScalarValue { +fn from(value: Option<$ty>) -> Self { +ScalarValue::$scalar(value) +} +} +}; } -impl From> for ScalarValue { -fn from(value: Option) -> Self { -ScalarValue::UInt64(value) -} -} +impl_scalar!(f64, Float64); +impl_scalar!(f32, Float32); +impl_scalar!(i8, Int8); +impl_scalar!(i16, Int16); +impl_scalar!(i32, Int32); +impl_scalar!(i64, Int64); +impl_scalar!(bool, Boolean); +impl_scalar!(u8, UInt8); +impl_scalar!(u16, UInt16); +impl_scalar!(u32, UInt32); +impl_scalar!(u64, UInt64); impl From<> for ScalarValue { fn from(value: ) -> Self {
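A minimal, self-contained version of the same macro pattern, with a two-variant stand-in enum rather than DataFusion's full `ScalarValue`: one `macro_rules!` invocation generates both the `From<T>` and `From<Option<T>>` impls per variant.

```rust
#[derive(Debug, PartialEq)]
enum Scalar {
    Int32(Option<i32>),
    Float64(Option<f64>),
}

// One macro replaces two hand-written impl blocks per type/variant pair.
macro_rules! impl_scalar {
    ($ty:ty, $variant:tt) => {
        impl From<$ty> for Scalar {
            fn from(value: $ty) -> Self {
                Scalar::$variant(Some(value))
            }
        }
        impl From<Option<$ty>> for Scalar {
            fn from(value: Option<$ty>) -> Self {
                Scalar::$variant(value)
            }
        }
    };
}

impl_scalar!(i32, Int32);
impl_scalar!(f64, Float64);

fn main() {
    assert_eq!(Scalar::from(7i32), Scalar::Int32(Some(7)));
    assert_eq!(Scalar::from(None::<f64>), Scalar::Float64(None));
}
```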
[arrow-rs] branch master updated: Added PartialEq to RecordBatch (#750)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new 0e4e75b Added PartialEq to RecordBatch (#750) 0e4e75b is described below commit 0e4e75b7cc5ac8e934b5846df75612ce8e641bfb Author: Matthew Turner AuthorDate: Sat Sep 11 12:52:23 2021 -0400 Added PartialEq to RecordBatch (#750) * Added PartialEq to RecordBatch * derive PartialEq and add tests --- arrow/src/record_batch.rs | 159 +- 1 file changed, 158 insertions(+), 1 deletion(-) diff --git a/arrow/src/record_batch.rs b/arrow/src/record_batch.rs index bb4b301..b6e5566 100644 --- a/arrow/src/record_batch.rs +++ b/arrow/src/record_batch.rs @@ -37,7 +37,7 @@ use crate::error::{ArrowError, Result}; /// serialization and computation functions, possibly incremental. /// See also [CSV reader](crate::csv::Reader) and /// [JSON reader](crate::json::Reader). -#[derive(Clone, Debug)] +#[derive(Clone, Debug, PartialEq)] pub struct RecordBatch { schema: SchemaRef, columns: Vec>, @@ -741,4 +741,161 @@ mod tests { "Invalid argument error: batches[1] schema is different with argument schema.", ); } + +#[test] +fn record_batch_equality() { +let id_arr1 = Int32Array::from(vec![1, 2, 3, 4]); +let val_arr1 = Int32Array::from(vec![5, 6, 7, 8]); +let schema1 = Schema::new(vec![ +Field::new("id", DataType::Int32, false), +Field::new("val", DataType::Int32, false), +]); + +let id_arr2 = Int32Array::from(vec![1, 2, 3, 4]); +let val_arr2 = Int32Array::from(vec![5, 6, 7, 8]); +let schema2 = Schema::new(vec![ +Field::new("id", DataType::Int32, false), +Field::new("val", DataType::Int32, false), +]); + +let batch1 = RecordBatch::try_new( +Arc::new(schema1), +vec![Arc::new(id_arr1), Arc::new(val_arr1)], +) +.unwrap(); + +let batch2 = RecordBatch::try_new( +Arc::new(schema2), +vec![Arc::new(id_arr2), Arc::new(val_arr2)], +) +.unwrap(); + +assert_eq!(batch1, 
batch2); +} + +#[test] +fn record_batch_vals_ne() { +let id_arr1 = Int32Array::from(vec![1, 2, 3, 4]); +let val_arr1 = Int32Array::from(vec![5, 6, 7, 8]); +let schema1 = Schema::new(vec![ +Field::new("id", DataType::Int32, false), +Field::new("val", DataType::Int32, false), +]); + +let id_arr2 = Int32Array::from(vec![1, 2, 3, 4]); +let val_arr2 = Int32Array::from(vec![1, 2, 3, 4]); +let schema2 = Schema::new(vec![ +Field::new("id", DataType::Int32, false), +Field::new("val", DataType::Int32, false), +]); + +let batch1 = RecordBatch::try_new( +Arc::new(schema1), +vec![Arc::new(id_arr1), Arc::new(val_arr1)], +) +.unwrap(); + +let batch2 = RecordBatch::try_new( +Arc::new(schema2), +vec![Arc::new(id_arr2), Arc::new(val_arr2)], +) +.unwrap(); + +assert_ne!(batch1, batch2); +} + +#[test] +fn record_batch_column_names_ne() { +let id_arr1 = Int32Array::from(vec![1, 2, 3, 4]); +let val_arr1 = Int32Array::from(vec![5, 6, 7, 8]); +let schema1 = Schema::new(vec![ +Field::new("id", DataType::Int32, false), +Field::new("val", DataType::Int32, false), +]); + +let id_arr2 = Int32Array::from(vec![1, 2, 3, 4]); +let val_arr2 = Int32Array::from(vec![5, 6, 7, 8]); +let schema2 = Schema::new(vec![ +Field::new("id", DataType::Int32, false), +Field::new("num", DataType::Int32, false), +]); + +let batch1 = RecordBatch::try_new( +Arc::new(schema1), +vec![Arc::new(id_arr1), Arc::new(val_arr1)], +) +.unwrap(); + +let batch2 = RecordBatch::try_new( +Arc::new(schema2), +vec![Arc::new(id_arr2), Arc::new(val_arr2)], +) +.unwrap(); + +assert_ne!(batch1, batch2); +} + +#[test] +fn record_batch_column_number_ne() { +let id_arr1 = Int32Array::from(vec![1, 2, 3, 4]); +let val_arr1 = Int32Array::from(vec![5, 6, 7, 8]); +let schema1 = Schema::new(vec![ +Field::new("id", DataType::Int32, false), +Field::new("val", DataType::Int32, false), +]); + +let id_arr2 = Int32Array::from(vec![1, 2, 3, 4]); +let val_arr2 = Int32Array::from(vec![5, 6, 7, 8]);
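The commit derives `PartialEq` rather than implementing it by hand, so equality is field-by-field over `schema` and `columns`. A trivial stand-in illustrating the behavior the new tests check (the `Batch` struct here is a simplified analog, not `RecordBatch`):

```rust
// Deriving PartialEq compares every field; two batches are equal only if
// both the column names (schema) and the column values match.
#[derive(Clone, Debug, PartialEq)]
struct Batch {
    names: Vec<String>,
    columns: Vec<Vec<i32>>,
}

fn main() {
    let a = Batch {
        names: vec!["id".into(), "val".into()],
        columns: vec![vec![1, 2, 3, 4], vec![5, 6, 7, 8]],
    };
    let b = a.clone();
    assert_eq!(a, b);

    let mut c = b.clone();
    c.names[1] = "num".into();
    assert_ne!(a, c); // differing field names compare unequal
}
```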
[arrow-rs] branch master updated: fix: Handle slices in unary kernel (#739)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new 7ae6910 fix: Handle slices in unary kernel (#739) 7ae6910 is described below commit 7ae691049b89e2ae54c4315021f305560ff167b6 Author: Ben Chambers <35960+bjchamb...@users.noreply.github.com> AuthorDate: Thu Sep 2 17:12:47 2021 -0700 fix: Handle slices in unary kernel (#739) --- arrow/src/buffer/immutable.rs | 2 +- arrow/src/compute/kernels/arity.rs | 24 +++- 2 files changed, 24 insertions(+), 2 deletions(-) diff --git a/arrow/src/buffer/immutable.rs b/arrow/src/buffer/immutable.rs index c00af6e..f0aefd9 100644 --- a/arrow/src/buffer/immutable.rs +++ b/arrow/src/buffer/immutable.rs @@ -184,7 +184,7 @@ impl Buffer { /// If the offset is byte-aligned the returned buffer is a shallow clone, /// otherwise a new buffer is allocated and filled with a copy of the bits in the range. 
pub fn bit_slice(&self, offset: usize, len: usize) -> Self { -if offset % 8 == 0 && len % 8 == 0 { +if offset % 8 == 0 { return self.slice(offset / 8); } diff --git a/arrow/src/compute/kernels/arity.rs b/arrow/src/compute/kernels/arity.rs index 4aa7f3d..d7beae6 100644 --- a/arrow/src/compute/kernels/arity.rs +++ b/arrow/src/compute/kernels/arity.rs @@ -30,7 +30,10 @@ fn into_primitive_array_data( O::DATA_TYPE, array.len(), None, -array.data_ref().null_buffer().cloned(), +array +.data_ref() +.null_buffer() +.map(|b| b.bit_slice(array.offset(), array.len())), 0, vec![buffer], vec![], @@ -72,3 +75,22 @@ where let data = into_primitive_array_data::<_, O>(array, buffer); PrimitiveArray::<O>::from(data) } + +#[cfg(test)] +mod tests { +use super::*; +use crate::array::{as_primitive_array, Float64Array}; + +#[test] +fn test_unary_f64_slice() { +let input = +Float64Array::from(vec![Some(5.1f64), None, Some(6.8), None, Some(7.2)]); +let input_slice = input.slice(1, 4); +let input_slice: &Float64Array = as_primitive_array(&input_slice); +let result = unary(input_slice, |n| n.round()); +assert_eq!( +result, +Float64Array::from(vec![None, Some(7.0), None, Some(7.0)]) +) +} +}
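The `bit_slice` change relaxes the zero-copy fast path: a byte-aligned *offset* alone is enough, regardless of `len`. A standalone sketch of the two paths (illustrative LSB-first bit layout, as Arrow uses; not the arrow-rs `Buffer` API):

```rust
/// Slice `len` bits starting at `offset` out of a packed bitmap.
fn bit_slice(buf: &[u8], offset: usize, len: usize) -> Vec<u8> {
    if offset % 8 == 0 {
        // fast path: drop whole leading bytes, keep ceil(len/8) bytes
        let bytes = (len + 7) / 8;
        return buf[offset / 8..offset / 8 + bytes].to_vec();
    }
    // slow path: repack bits one at a time into a fresh buffer
    let mut out = vec![0u8; (len + 7) / 8];
    for i in 0..len {
        let src = offset + i;
        if buf[src / 8] & (1 << (src % 8)) != 0 {
            out[i / 8] |= 1 << (i % 8);
        }
    }
    out
}

fn main() {
    let buf = [0b1010_1100u8, 0b0000_0001];
    // offset 8 is byte-aligned, len 3 is not a multiple of 8: still fast path
    assert_eq!(bit_slice(&buf, 8, 3), vec![0b0000_0001]);
    // offset 2 is unaligned: bits 2..6 of the first byte are 1,1,0,1
    assert_eq!(bit_slice(&buf, 2, 4), vec![0b0000_1011]);
}
```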
[arrow-rs] branch master updated: Write boolean stats for boolean columns (not i32 stats) (#661)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new 857dbaf Write boolean stats for boolean columns (not i32 stats) (#661) 857dbaf is described below commit 857dbafcbaa721f22ac485f38ccaff3faf8d2ab9 Author: Andrew Lamb AuthorDate: Sun Aug 8 08:32:47 2021 -0400 Write boolean stats for boolean columns (not i32 stats) (#661) --- parquet/src/column/writer.rs | 12 +--- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/parquet/src/column/writer.rs b/parquet/src/column/writer.rs index 3cb17e1..af76c84 100644 --- a/parquet/src/column/writer.rs +++ b/parquet/src/column/writer.rs @@ -919,7 +919,7 @@ impl ColumnWriterImpl { }; match self.descr.physical_type() { Type::INT32 => gen_stats_section!(i32, int32, min, max, distinct, nulls), -Type::BOOLEAN => gen_stats_section!(i32, int32, min, max, distinct, nulls), +Type::BOOLEAN => gen_stats_section!(bool, boolean, min, max, distinct, nulls), Type::INT64 => gen_stats_section!(i64, int64, min, max, distinct, nulls), Type::INT96 => gen_stats_section!(Int96, int96, min, max, distinct, nulls), Type::FLOAT => gen_stats_section!(f32, float, min, max, distinct, nulls), @@ -1691,13 +1691,11 @@ mod tests { fn test_bool_statistics() { let stats = statistics_roundtrip::<BoolType>(&[true, false, false, true]); assert!(stats.has_min_max_set()); -// should it be BooleanStatistics?? -// https://github.com/apache/arrow-rs/issues/659 -if let Statistics::Int32(stats) = stats { -assert_eq!(stats.min(), &0); -assert_eq!(stats.max(), &1); +if let Statistics::Boolean(stats) = stats { +assert_eq!(stats.min(), &false); +assert_eq!(stats.max(), &true); } else { -panic!("expecting Statistics::Int32, got {:?}", stats); +panic!("expecting Statistics::Boolean, got {:?}", stats); } }
[arrow-rs] branch master updated: allocate enough bytes when writing booleans (#658)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new 75432ed allocate enough bytes when writing booleans (#658) 75432ed is described below commit 75432edb05ff001481df728607fc5b9be969c266 Author: Ben Chambers <35960+bjchamb...@users.noreply.github.com> AuthorDate: Sun Aug 8 00:57:17 2021 -0700 allocate enough bytes when writing booleans (#658) * allocate enough bytes when writing booleans * round up to nearest multiple of 256 --- parquet/src/arrow/arrow_writer.rs | 28 +++- parquet/src/data_type.rs | 8 +++- 2 files changed, 34 insertions(+), 2 deletions(-) diff --git a/parquet/src/arrow/arrow_writer.rs b/parquet/src/arrow/arrow_writer.rs index 4726734..7728cd4 100644 --- a/parquet/src/arrow/arrow_writer.rs +++ b/parquet/src/arrow/arrow_writer.rs @@ -227,7 +227,7 @@ fn write_leaves( ArrowDataType::FixedSizeList(_, _) | ArrowDataType::Union(_) => { Err(ParquetError::NYI( format!( -"Attempting to write an Arrow type {:?} to parquet that is not yet implemented", +"Attempting to write an Arrow type {:?} to parquet that is not yet implemented", array.data_type() ) )) @@ -1200,6 +1200,32 @@ mod tests { } #[test] +fn bool_large_single_column() { +let values = Arc::new( +[None, Some(true), Some(false)] +.iter() +.cycle() +.copied() +.take(200_000) +.collect::(), +); +let schema = +Schema::new(vec![Field::new("col", values.data_type().clone(), true)]); +let expected_batch = +RecordBatch::try_new(Arc::new(schema), vec![values]).unwrap(); +let file = get_temp_file("bool_large_single_column", &[]); + +let mut writer = ArrowWriter::try_new( +file.try_clone().unwrap(), +expected_batch.schema(), +None, +) +.expect("Unable to write file"); +writer.write(_batch).unwrap(); +writer.close().unwrap(); +} + +#[test] fn i8_single_column() { required_and_optional::(0..SMALL_SIZE as i8, 
"i8_single_column"); } diff --git a/parquet/src/data_type.rs b/parquet/src/data_type.rs index 127ba95..3573362 100644 --- a/parquet/src/data_type.rs +++ b/parquet/src/data_type.rs @@ -588,6 +588,7 @@ pub(crate) mod private { use crate::util::bit_util::{BitReader, BitWriter}; use crate::util::memory::ByteBufferPtr; +use arrow::util::bit_util::round_upto_power_of_2; use byteorder::ByteOrder; use std::convert::TryInto; @@ -669,7 +670,12 @@ pub(crate) mod private { bit_writer: BitWriter, ) -> Result<()> { if bit_writer.bytes_written() + values.len() / 8 >= bit_writer.capacity() { -bit_writer.extend(256); +let bits_available = +(bit_writer.capacity() - bit_writer.bytes_written()) * 8; +let bits_needed = values.len() - bits_available; +let bytes_needed = (bits_needed + 7) / 8; +let bytes_needed = round_upto_power_of_2(bytes_needed, 256); +bit_writer.extend(bytes_needed); } for value in values { if !bit_writer.put_value(*value as u64, 1) {
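The fix sizes the extension from the actual bit deficit and rounds up to a multiple of 256, instead of always extending by a fixed 256 bytes (which a large boolean batch could outrun). The arithmetic can be sketched as:

```rust
/// Round n up to the nearest multiple of 256 (256 is a power of two,
/// so this is a mask trick rather than a division).
fn round_upto_multiple_of_256(n: usize) -> usize {
    (n + 255) & !255
}

/// Bytes the bit writer must grow by to fit `num_bits` more values
/// given `bits_available` of spare capacity.
fn bytes_needed(num_bits: usize, bits_available: usize) -> usize {
    let bits_needed = num_bits.saturating_sub(bits_available);
    let bytes = (bits_needed + 7) / 8; // ceil to whole bytes
    round_upto_multiple_of_256(bytes)
}

fn main() {
    assert_eq!(round_upto_multiple_of_256(1), 256);
    assert_eq!(round_upto_multiple_of_256(256), 256);
    assert_eq!(round_upto_multiple_of_256(257), 512);
    // 200_000 boolean values with no spare capacity: 25_000 bytes -> 25_088
    assert_eq!(bytes_needed(200_000, 0), 25_088);
}
```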
[arrow-rs] branch master updated: Fix parquet string statistics generation (#643)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new 4618ef5 Fix parquet string statistics generation (#643) 4618ef5 is described below commit 4618ef539521a09a1e46246a29ea31807e98bb7c Author: Andrew Lamb AuthorDate: Sun Aug 8 03:46:14 2021 -0400 Fix parquet string statistics generation (#643) * Fix string statistics generation, add tests * fix Int96 stats test * Add notes for additional tickets --- parquet/src/column/writer.rs | 122 +++ parquet/src/data_type.rs | 29 +- 2 files changed, 134 insertions(+), 17 deletions(-) diff --git a/parquet/src/column/writer.rs b/parquet/src/column/writer.rs index d5b8457..3cb17e1 100644 --- a/parquet/src/column/writer.rs +++ b/parquet/src/column/writer.rs @@ -1688,6 +1688,128 @@ mod tests { } #[test] +fn test_bool_statistics() { +let stats = statistics_roundtrip::(&[true, false, false, true]); +assert!(stats.has_min_max_set()); +// should it be BooleanStatistics?? 
+// https://github.com/apache/arrow-rs/issues/659 +if let Statistics::Int32(stats) = stats { +assert_eq!(stats.min(), &0); +assert_eq!(stats.max(), &1); +} else { +panic!("expecting Statistics::Int32, got {:?}", stats); +} +} + +#[test] +fn test_int32_statistics() { +let stats = statistics_roundtrip::(&[-1, 3, -2, 2]); +assert!(stats.has_min_max_set()); +if let Statistics::Int32(stats) = stats { +assert_eq!(stats.min(), &-2); +assert_eq!(stats.max(), &3); +} else { +panic!("expecting Statistics::Int32, got {:?}", stats); +} +} + +#[test] +fn test_int64_statistics() { +let stats = statistics_roundtrip::(&[-1, 3, -2, 2]); +assert!(stats.has_min_max_set()); +if let Statistics::Int64(stats) = stats { +assert_eq!(stats.min(), &-2); +assert_eq!(stats.max(), &3); +} else { +panic!("expecting Statistics::Int64, got {:?}", stats); +} +} + +#[test] +fn test_int96_statistics() { +let input = vec![ +Int96::from(vec![1, 20, 30]), +Int96::from(vec![3, 20, 10]), +Int96::from(vec![0, 20, 30]), +Int96::from(vec![2, 20, 30]), +] +.into_iter() +.collect::>(); + +let stats = statistics_roundtrip::(); +assert!(stats.has_min_max_set()); +if let Statistics::Int96(stats) = stats { +assert_eq!(stats.min(), ::from(vec![0, 20, 30])); +assert_eq!(stats.max(), ::from(vec![3, 20, 10])); +} else { +panic!("expecting Statistics::Int96, got {:?}", stats); +} +} + +#[test] +fn test_float_statistics() { +let stats = statistics_roundtrip::(&[-1.0, 3.0, -2.0, 2.0]); +assert!(stats.has_min_max_set()); +if let Statistics::Float(stats) = stats { +assert_eq!(stats.min(), &-2.0); +assert_eq!(stats.max(), &3.0); +} else { +panic!("expecting Statistics::Float, got {:?}", stats); +} +} + +#[test] +fn test_double_statistics() { +let stats = statistics_roundtrip::(&[-1.0, 3.0, -2.0, 2.0]); +assert!(stats.has_min_max_set()); +if let Statistics::Double(stats) = stats { +assert_eq!(stats.min(), &-2.0); +assert_eq!(stats.max(), &3.0); +} else { +panic!("expecting Statistics::Double, got {:?}", stats); +} +} + 
+#[test] +fn test_byte_array_statistics() { +let input = vec!["aawaa", "zz", "aaw", "m", "qrs"] +.iter() +.map(|&s| s.into()) +.collect::<Vec<ByteArray>>(); + +let stats = statistics_roundtrip::<ByteArrayType>(&input); +assert!(stats.has_min_max_set()); +if let Statistics::ByteArray(stats) = stats { +assert_eq!(stats.min(), &ByteArray::from("aaw")); +assert_eq!(stats.max(), &ByteArray::from("zz")); +} else { +panic!("expecting Statistics::ByteArray, got {:?}", stats); +} +} + +#[test] +fn test_fixed_len_byte_array_statistics() { +let input = vec!["aawaa", "zz ", "aaw ", "m", "qrs "] +.iter() +.map(|&s| { +let b: ByteArray = s.into(); +b.into() +}) +.collect::>(); + +let stats = statistics_roun
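The byte-array statistics tests rely on byte-wise lexicographic ordering, which is why "aaw" (a prefix of "aawaa") is the minimum and "zz" the maximum of the input. In Rust, `&str` comparison is already byte-lexicographic, so the expectation can be checked directly:

```rust
/// Lexicographic min and max over a set of string values, the same
/// ordering parquet uses for byte-array column statistics.
fn min_max<'a>(values: &[&'a str]) -> (&'a str, &'a str) {
    let min = *values.iter().min().unwrap();
    let max = *values.iter().max().unwrap();
    (min, max)
}

fn main() {
    let input = ["aawaa", "zz", "aaw", "m", "qrs"];
    // "aaw" sorts before its extension "aawaa"; "zz" sorts after everything
    assert_eq!(min_max(&input), ("aaw", "zz"));
}
```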
[arrow-rs] branch master updated: Remove undefined behavior in `value` method of boolean and primitive arrays (#644)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new 6bf1988 Remove undefined behavior in `value` method of boolean and primitive arrays (#644) 6bf1988 is described below commit 6bf1988852f87da21a163203eec4c83a7b692901 Author: Daniël Heres AuthorDate: Tue Aug 3 09:11:24 2021 +0200 Remove undefined behavior in `value` method of boolean and primitive arrays (#644) * Remove UB in `value` * Add safety note --- arrow/src/array/array_boolean.rs| 6 -- arrow/src/array/array_primitive.rs | 6 ++ arrow/src/array/array_string.rs | 25 + arrow/src/compute/kernels/comparison.rs | 11 --- 4 files changed, 19 insertions(+), 29 deletions(-) diff --git a/arrow/src/array/array_boolean.rs b/arrow/src/array/array_boolean.rs index 5357614..9274e65 100644 --- a/arrow/src/array/array_boolean.rs +++ b/arrow/src/array/array_boolean.rs @@ -115,9 +115,11 @@ impl BooleanArray { /// Returns the boolean value at index `i`. /// -/// Note this doesn't do any bound checking, for performance reason. +/// Panics of offset `i` is out of bounds pub fn value(, i: usize) -> bool { -debug_assert!(i < self.len()); +assert!(i < self.len()); +// Safety: +// `i < self.len() unsafe { self.value_unchecked(i) } } } diff --git a/arrow/src/array/array_primitive.rs b/arrow/src/array/array_primitive.rs index 0765629..9c14f88 100644 --- a/arrow/src/array/array_primitive.rs +++ b/arrow/src/array/array_primitive.rs @@ -101,12 +101,10 @@ impl PrimitiveArray { /// Returns the primitive value at index `i`. /// -/// Note this doesn't do any bound checking, for performance reason. 
-/// # Safety -/// caller must ensure that the passed in offset is less than the array len() +/// Panics if offset `i` is out of bounds #[inline] pub fn value(&self, i: usize) -> T::Native { -debug_assert!(i < self.len()); +assert!(i < self.len()); unsafe { self.value_unchecked(i) } } diff --git a/arrow/src/array/array_string.rs b/arrow/src/array/array_string.rs index 0b48e57..2fa4c48 100644 --- a/arrow/src/array/array_string.rs +++ b/arrow/src/array/array_string.rs @@ -81,6 +81,7 @@ impl GenericStringArray { /// Returns the element at index /// # Safety /// caller is responsible for ensuring that index is within the array bounds +#[inline] pub unsafe fn value_unchecked(&self, i: usize) -> &str { let end = self.value_offsets().get_unchecked(i + 1); let start = self.value_offsets().get_unchecked(i); @@ -103,28 +104,12 @@ impl GenericStringArray { } /// Returns the element at index `i` as &str +#[inline] pub fn value(&self, i: usize) -> &str { assert!(i < self.data.len(), "StringArray out of bounds access"); -//Soundness: length checked above, offset buffer length is 1 larger than logical array length -let end = unsafe { self.value_offsets().get_unchecked(i + 1) }; -let start = unsafe { self.value_offsets().get_unchecked(i) }; - -// Soundness -// pointer alignment & location is ensured by RawPtrBox -// buffer bounds/offset is ensured by the value_offset invariants -// ISSUE: utf-8 well formedness is not checked -unsafe { -// Safety of `to_isize().unwrap()` -// `start` and `end` are &OffsetSize, which is a generic type that implements the -// OffsetSizeTrait.
Currently, only i32 and i64 implement OffsetSizeTrait, -// both of which should cleanly cast to isize on an architecture that supports -// 32/64-bit offsets -let slice = std::slice::from_raw_parts( -self.value_data.as_ptr().offset(start.to_isize().unwrap()), -(*end - *start).to_usize().unwrap(), -); -std::str::from_utf8_unchecked(slice) -} +// Safety: +// `i < self.data.len() +unsafe { self.value_unchecked(i) } } fn from_list(v: GenericListArray) -> Self { diff --git a/arrow/src/compute/kernels/comparison.rs b/arrow/src/compute/kernels/comparison.rs index f54d305..a899d5b 100644 --- a/arrow/src/compute/kernels/comparison.rs +++ b/arrow/src/compute/kernels/comparison.rs @@ -46,7 +46,10 @@ macro_rules! compare_op { let null_bit_buffer = combine_option_bitmap($left.data_ref(), $right.data_ref(), $left.len())?; -let comparison = (0..$left.len()).map(|i| $op($left.value(i), $right.value(i))); +// Safety: +// `i < $left.len()` and $left.len() == $right.len() +let comparison = (0..$left.len()) +.map(|i| unsafe { $o
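The pattern this commit applies is the same in every file: the safe accessor promotes a `debug_assert!` to a hard `assert!`, so the `unsafe` unchecked read can no longer be reached with an out-of-bounds index even in release builds. A minimal standalone sketch of that pattern, with an illustrative `Primitive` type rather than the actual arrow-rs arrays:

```rust
struct Primitive {
    values: Vec<i64>,
}

impl Primitive {
    /// # Safety
    /// Caller must ensure that `i < self.values.len()`.
    unsafe fn value_unchecked(&self, i: usize) -> i64 {
        *self.values.get_unchecked(i)
    }

    /// Panics if `i` is out of bounds.
    fn value(&self, i: usize) -> i64 {
        // A hard assert! (not debug_assert!) guards the unsafe call even in
        // release builds -- the core of the fix above.
        assert!(i < self.values.len());
        // Safety: `i < self.values.len()` was checked above.
        unsafe { self.value_unchecked(i) }
    }
}

fn main() {
    let a = Primitive { values: vec![10, 20, 30] };
    assert_eq!(a.value(1), 20);
    println!("in-bounds access ok; a.value(3) now panics instead of invoking UB");
}
```

The cost is one branch per access; callers that can prove bounds themselves keep the `unsafe` fast path via `value_unchecked`.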
[arrow-rs] branch master updated: update documentation (#648)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new fe1e1f6 update documentation (#648) fe1e1f6 is described below commit fe1e1f68eb78bf093b1f6faa62a0fddcf0a69f82 Author: Ruihang Xia AuthorDate: Tue Aug 3 01:32:41 2021 +0800 update documentation (#648) Signed-off-by: Ruihang Xia --- dev/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/dev/README.md b/dev/README.md index b4ea02b..f9d2070 100644 --- a/dev/README.md +++ b/dev/README.md @@ -30,8 +30,8 @@ We have provided a script to assist with verifying release candidates: bash dev/release/verify-release-candidate.sh 0.7.0 0 ``` -Currently this only works on Linux (patches to expand to macOS welcome!). Read -the script for information about system dependencies. +This works on Linux and macOS. Read the script for information about system +dependencies. On Windows, we have a script that verifies C++ and Python (requires Visual Studio 2015):
[arrow-rs] branch master updated: Fix data corruption in json decoder f64-to-i64 cast (#652)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new b075c3c Fix data corruption in json decoder f64-to-i64 cast (#652) b075c3c is described below commit b075c3cef6d48e1e5ffce7e5c555d9c740885fae Author: Christian Williams AuthorDate: Mon Aug 2 13:29:09 2021 -0400 Fix data corruption in json decoder f64-to-i64 cast (#652) * Add failing test for JSON writer i64 bug * Add special handling for i64/u64 to json decoder array builder * Fix linter error - linter wants .flatten on a new line --- arrow/src/json/reader.rs| 13 +++-- arrow/test/data/arrays.json | 2 +- 2 files changed, 12 insertions(+), 3 deletions(-) diff --git a/arrow/src/json/reader.rs b/arrow/src/json/reader.rs index c4e8470..4912c5e 100644 --- a/arrow/src/json/reader.rs +++ b/arrow/src/json/reader.rs @@ -927,8 +927,16 @@ impl Decoder { rows.iter() .map(|row| { row.get(_name) -.and_then(|value| value.as_f64()) -.and_then(num::cast::cast) +.and_then(|value| { +if value.is_i64() { +value.as_i64().map(num::cast::cast) +} else if value.is_u64() { +value.as_u64().map(num::cast::cast) +} else { +value.as_f64().map(num::cast::cast) +} +}) +.flatten() }) .collect::>(), )) @@ -1933,6 +1941,7 @@ mod tests { .unwrap(); assert_eq!(1, aa.value(0)); assert_eq!(-10, aa.value(1)); +assert_eq!(162766868459400, aa.value(2)); let bb = batch .column(b.0) .as_any() diff --git a/arrow/test/data/arrays.json b/arrow/test/data/arrays.json index 5dbdd19..6de2b03 100644 --- a/arrow/test/data/arrays.json +++ b/arrow/test/data/arrays.json @@ -1,3 +1,3 @@ {"a":1, "b":[2.0, 1.3, -6.1], "c":[false, true], "d":"4"} {"a":-10, "b":[2.0, 1.3, -6.1], "c":[true, true], "d":"4"} -{"a":2, "b":[2.0, null, -6.1], "c":[false, null], "d":"text"} +{"a":162766868459400, "b":[2.0, null, -6.1], "c":[false, null], "d":"text"}
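The fix distinguishes `i64`/`u64` JSON values from `f64` values when building integer arrays. One reason funneling every number through `as_f64()` is hazardous: an f64 has a 53-bit significand, so integers above 2^53 silently lose precision on the round trip. A standalone illustration in plain Rust, independent of the arrow JSON reader:

```rust
fn main() {
    // 2^53 + 1 is the first integer that f64 (53-bit significand) cannot
    // represent exactly, so casting it through f64 corrupts the value.
    let big: i64 = 9_007_199_254_740_993;
    let through_f64 = big as f64 as i64; // the old all-values-via-f64 path
    assert_ne!(through_f64, big); // rounded to the nearest representable value
    println!("direct: {}, via f64: {}", big, through_f64);
}
```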
[arrow-rs] branch master updated (9be938e -> b38a4b6)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git. from 9be938e Minimal MapArray support (#491) add b38a4b6 Add human readable Format for parquet ByteArray (#642) No new revisions were added by this update. Summary of changes: parquet/src/data_type.rs | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-)
[arrow-rs] branch master updated: Minimal MapArray support (#491)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new 9be938e Minimal MapArray support (#491) 9be938e is described below commit 9be938e8d2847cf8d41bc59f0c907f23ff61cc3c Author: Wakahisa AuthorDate: Sat Jul 31 07:20:56 2021 +0200 Minimal MapArray support (#491) * add DataType::Map to datatypes * barebones MapArray and MapBuilder This commit adds the MapArray and MapBuilder. The interfaces are however incomplete at this stage. * minimal IPC read and write * barebones MapArray (missed) * add equality for map, relying on list A map is a list with some specific rules, so for equality it is the same as a list * json reader for MapArray * add schema roundtrip * read and write maps from/to arrow map * clippy * Calculate map levels separately Avoids the generic case of list > struct > [ley, value], which adds overhead * Fix map reader context and path * Map array tests * add doc comments and clean up code * wip: review feedback * add test for map * fix clippy 1.54 lints --- arrow/src/array/array.rs | 26 +++ arrow/src/array/array_map.rs | 421 + arrow/src/array/builder.rs | 211 +++ arrow/src/array/data.rs| 5 +- arrow/src/array/equal/mod.rs | 11 +- arrow/src/array/equal/utils.rs | 2 +- arrow/src/array/equal_json.rs | 32 +++ arrow/src/array/mod.rs | 2 + arrow/src/datatypes/datatype.rs| 31 +++ arrow/src/datatypes/field.rs | 33 +++ arrow/src/datatypes/mod.rs | 175 +++ arrow/src/ipc/convert.rs | 18 ++ arrow/src/ipc/reader.rs| 22 +- arrow/src/ipc/writer.rs| 4 + arrow/src/json/reader.rs | 177 arrow/src/util/integration_util.rs | 50 + parquet/src/arrow/array_reader.rs | 235 +++-- parquet/src/arrow/arrow_reader.rs | 16 ++ parquet/src/arrow/arrow_writer.rs | 39 parquet/src/arrow/levels.rs| 132 +++- parquet/src/arrow/schema.rs| 312 +-- 21 files changed, 1914 insertions(+), 40 deletions(-) diff --git 
a/arrow/src/array/array.rs b/arrow/src/array/array.rs index d715bc4..4702179 100644 --- a/arrow/src/array/array.rs +++ b/arrow/src/array/array.rs @@ -296,6 +296,7 @@ pub fn make_array(data: ArrayData) -> ArrayRef { DataType::List(_) => Arc::new(ListArray::from(data)) as ArrayRef, DataType::LargeList(_) => Arc::new(LargeListArray::from(data)) as ArrayRef, DataType::Struct(_) => Arc::new(StructArray::from(data)) as ArrayRef, +DataType::Map(_, _) => Arc::new(MapArray::from(data)) as ArrayRef, DataType::Union(_) => Arc::new(UnionArray::from(data)) as ArrayRef, DataType::FixedSizeList(_, _) => { Arc::new(FixedSizeListArray::from(data)) as ArrayRef @@ -452,6 +453,9 @@ pub fn new_null_array(data_type: , length: usize) -> ArrayRef { .map(|field| ArrayData::new_empty(field.data_type())) .collect(), )), +DataType::Map(field, _keys_sorted) => { +new_null_list_array::(data_type, field.data_type(), length) +} DataType::Union(_) => { unimplemented!("Creating null Union array not yet supported") } @@ -658,6 +662,28 @@ mod tests { } #[test] +fn test_null_map() { +let data_type = DataType::Map( +Box::new(Field::new( +"entry", +DataType::Struct(vec![ +Field::new("key", DataType::Utf8, false), +Field::new("key", DataType::Int32, true), +]), +false, +)), +false, +); +let array = new_null_array(_type, 9); +let a = array.as_any().downcast_ref::().unwrap(); +assert_eq!(a.len(), 9); +assert_eq!(a.value_offsets()[9], 0i32); +for i in 0..9 { +assert!(a.is_null(i)); +} +} + +#[test] fn test_null_dictionary() { let values = vec![None, None, None, None, None, None, None, None, None] as Vec>; diff --git a/arrow/src/array/array_map.rs b/arrow/src/array/array_map.rs new file mode 100644 index 000..b10c39e --- /dev/null +++ b/arrow/src/array/array_map.rs @@ -0,0 +1,421 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information
[arrow-rs] branch master updated: Remove Git SHA from created_by Parquet file metadata (#590)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new 9f40f89 Remove Git SHA from created_by Parquet file metadata (#590) 9f40f89 is described below commit 9f40f899e439d072fc859e0b4abf46776387e0d1 Author: Carol (Nichols || Goulding) <193874+carols10ce...@users.noreply.github.com> AuthorDate: Thu Jul 22 04:17:17 2021 -0400 Remove Git SHA from created_by Parquet file metadata (#590) So that Parquet files will contain the same content whether or not your home directory is checked into Git or not ;) Fixes #589. --- parquet/build.rs | 23 ++- 1 file changed, 2 insertions(+), 21 deletions(-) diff --git a/parquet/build.rs b/parquet/build.rs index b42b2a4..8aada18 100644 --- a/parquet/build.rs +++ b/parquet/build.rs @@ -15,29 +15,10 @@ // specific language governing permissions and limitations // under the License. -use std::process::Command; - fn main() { -// Set Parquet version, build hash and "created by" string. +// Set Parquet version and "created by" string. let version = env!("CARGO_PKG_VERSION"); -let mut created_by = format!("parquet-rs version {}", version); -if let Ok(git_hash) = run(Command::new("git").arg("rev-parse").arg("HEAD")) { -created_by.push_str(format!(" (build {})", git_hash).as_str()); -println!("cargo:rustc-env=PARQUET_BUILD={}", git_hash); -} +let created_by = format!("parquet-rs version {}", version); println!("cargo:rustc-env=PARQUET_VERSION={}", version); println!("cargo:rustc-env=PARQUET_CREATED_BY={}", created_by); } - -/// Runs command and returns either content of stdout for successful execution, -/// or an error message otherwise. 
-fn run(command: &mut Command) -> Result<String, String> {
-    println!("Running: `{:?}`", command);
-    match command.output() {
-        Ok(ref output) if output.status.success() => {
-            Ok(String::from_utf8_lossy(&output.stdout).trim().to_string())
-        }
-        Ok(ref output) => Err(format!("Failed: `{:?}` ({})", command, output.status)),
-        Err(error) => Err(format!("Failed: `{:?}` ({})", command, error)),
-    }
-}
[arrow-rs] branch master updated: Exclude .github in rat files (#551)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new fc78af6 Exclude .github in rat files (#551) fc78af6 is described below commit fc78af6324513cc3da9fea8c80658d85dfcd8263 Author: Andrew Lamb AuthorDate: Wed Jul 14 10:45:29 2021 -0400 Exclude .github in rat files (#551) --- dev/release/rat_exclude_files.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/dev/release/rat_exclude_files.txt b/dev/release/rat_exclude_files.txt index d64a431..c5435d0 100644 --- a/dev/release/rat_exclude_files.txt +++ b/dev/release/rat_exclude_files.txt @@ -12,3 +12,4 @@ filtered_rat.txt rat.txt # auto-generated arrow-flight/src/arrow.flight.protocol.rs +.github/*
[arrow-rs] branch master updated: refactor: remove lifetime from DynComparator (#542)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git

The following commit(s) were added to refs/heads/master by this push:
     new fde79a2  refactor: remove lifetime from DynComparator (#542)

fde79a2 is described below

commit fde79a2d58ac4076d3450549ae042fc112ad026d
Author: Edd Robinson
AuthorDate: Wed Jul 14 07:14:32 2021 +0100

    refactor: remove lifetime from DynComparator (#542)

    This commit removes the need for an explicit lifetime on the
    `DynComparator`. The rationale behind this change is that callers may
    wish to share this comparator amongst threads and the explicit lifetime
    makes this harder to achieve.

    As a nice side-effect, performance of the sort kernel seems to have
    improved:

    ```
    $ critcmp master pr
    group                         master           pr
    -----                         ------           --
    bool sort 2^12                1.03  310.8±1.34µs   1.00  302.8±7.78µs
    bool sort nulls 2^12          1.01  287.4±2.22µs   1.00  284.0±3.23µs
    sort 2^10                     1.04   98.7±3.58µs   1.00   94.6±0.50µs
    sort 2^12                     1.05  510.7±5.56µs   1.00  486.2±9.94µs
    sort 2^12 limit 10            1.05   48.1±0.38µs   1.00   45.6±0.30µs
    sort 2^12 limit 100           1.04   52.8±0.37µs   1.00   50.6±0.41µs
    sort 2^12 limit 1000          1.06  141.1±0.94µs   1.00  132.7±0.95µs
    sort 2^12 limit 2^12          1.03  501.2±4.01µs   1.00  486.5±4.87µs
    sort nulls 2^10               1.02   70.9±0.72µs   1.00   69.4±0.51µs
    sort nulls 2^12               1.02  369.7±3.51µs   1.00  363.0±18.52µs
    sort nulls 2^12 limit 10      1.01   70.6±1.22µs   1.00   70.0±1.27µs
    sort nulls 2^12 limit 100     1.00   71.7±0.82µs   1.00   71.8±1.60µs
    sort nulls 2^12 limit 1000    1.01   80.5±1.55µs   1.00   79.4±1.41µs
    sort nulls 2^12 limit 2^12    1.05  375.4±4.78µs   1.00  356.1±3.04µs
    ```
---
 arrow/src/array/ord.rs            | 48 ++++++++++++++-----------------
 arrow/src/compute/kernels/sort.rs |  6 ++---
 2 files changed, 22 insertions(+), 32 deletions(-)

diff --git a/arrow/src/array/ord.rs b/arrow/src/array/ord.rs
index 187542a..7fb4668 100644
--- a/arrow/src/array/ord.rs
+++ b/arrow/src/array/ord.rs
@@ -27,7 +27,7 @@ use crate::error::{ArrowError, Result};
 use num::Float;

 /// Compare the values at two arbitrary indices in two arrays.
-pub type DynComparator<'a> = Box<dyn Fn(usize, usize) -> Ordering + 'a>;
+pub type DynComparator = Box<dyn Fn(usize, usize) -> Ordering + Send + Sync>;

 /// compares two floats, placing NaNs at last
 fn cmp_nans_last<T: Float>(a: &T, b: &T) -> Ordering {
@@ -39,60 +39,50 @@
     }
 }

-fn compare_primitives<'a, T: ArrowPrimitiveType>(
-    left: &'a Array,
-    right: &'a Array,
-) -> DynComparator<'a>
+fn compare_primitives<T: ArrowPrimitiveType>(left: &Array, right: &Array) -> DynComparator
 where
     T::Native: Ord,
 {
-    let left = left.as_any().downcast_ref::<PrimitiveArray<T>>().unwrap();
-    let right = right.as_any().downcast_ref::<PrimitiveArray<T>>().unwrap();
+    let left: PrimitiveArray<T> = PrimitiveArray::from(left.data().clone());
+    let right: PrimitiveArray<T> = PrimitiveArray::from(right.data().clone());
     Box::new(move |i, j| left.value(i).cmp(&right.value(j)))
 }

-fn compare_boolean<'a>(left: &'a Array, right: &'a Array) -> DynComparator<'a> {
-    let left = left.as_any().downcast_ref::<BooleanArray>().unwrap();
-    let right = right.as_any().downcast_ref::<BooleanArray>().unwrap();
+fn compare_boolean(left: &Array, right: &Array) -> DynComparator {
+    let left: BooleanArray = BooleanArray::from(left.data().clone());
+    let right: BooleanArray = BooleanArray::from(right.data().clone());
+
     Box::new(move |i, j| left.value(i).cmp(&right.value(j)))
 }

-fn compare_float<'a, T: ArrowPrimitiveType>(
-    left: &'a Array,
-    right: &'a Array,
-) -> DynComparator<'a>
+fn compare_float<T: ArrowPrimitiveType>(left: &Array, right: &Array) -> DynComparator
 where
     T::Native: Float,
 {
-    let left = left.as_any().downcast_ref::<PrimitiveArray<T>>().unwrap();
-    let right = right.as_any().downcast_ref::<PrimitiveArray<T>>().unwrap();
+    let left: PrimitiveArray<T> = PrimitiveArray::from(left.data().clone());
+    let right: PrimitiveArray<T> = PrimitiveArray::from(right.data().clone());
     Box::new(move |i, j| cmp_nans_last(&left.value(i), &right.value(j)))
 }

-fn compare_string<'a, T>(left: &'a Array, right: &'a Array) ->
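The motivation above, sharing the comparator amongst threads, can be sketched standalone: when the closure owns clones of the input data instead of borrowing it, no lifetime escapes, and the boxed closure can be `Send + Sync`. Illustrative types only, not the arrow-rs `DynComparator` API:

```rust
use std::cmp::Ordering;
use std::thread;

// Same shape as the new type alias: no lifetime, Send + Sync bounds.
type DynComparator = Box<dyn Fn(usize, usize) -> Ordering + Send + Sync>;

fn build_comparator(left: &[i32], right: &[i32]) -> DynComparator {
    // Own the data rather than borrowing it, so the closure is 'static.
    let (left, right) = (left.to_vec(), right.to_vec());
    Box::new(move |i, j| left[i].cmp(&right[j]))
}

fn main() {
    let cmp = build_comparator(&[1, 5], &[3, 3]);
    assert_eq!(cmp(0, 0), Ordering::Less);
    // Send + Sync: the comparator can move to another thread, which the
    // old lifetime-parameterised version could not do.
    let handle = thread::spawn(move || cmp(1, 1));
    assert_eq!(handle.join().unwrap(), Ordering::Greater);
    println!("comparator usable across threads");
}
```

The trade-off is an upfront clone of the array data (cheap in arrow, where `data().clone()` is a shallow buffer clone) in exchange for a `'static` closure.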
[arrow-rs] branch master updated: Fix build, Make the js package a feature that can be enabled for wasm, rather than always on (#545)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new cdcf013 Fix build, Make the js package a feature that can be enabled for wasm, rather than always on (#545) cdcf013 is described below commit cdcf013f610c169d2a2efa493d586c76da521053 Author: Andrew Lamb AuthorDate: Wed Jul 14 00:35:41 2021 -0400 Fix build, Make the js package a feature that can be enabled for wasm, rather than always on (#545) * Fix build, add js feature * fix command --- .github/workflows/rust.yml | 2 +- arrow/Cargo.toml | 3 ++- arrow/README.md| 1 + 3 files changed, 4 insertions(+), 2 deletions(-) diff --git a/.github/workflows/rust.yml b/.github/workflows/rust.yml index 76511bf..5579072 100644 --- a/.github/workflows/rust.yml +++ b/.github/workflows/rust.yml @@ -332,7 +332,7 @@ jobs: export CARGO_HOME="/github/home/.cargo" export CARGO_TARGET_DIR="/github/home/target" cd arrow - cargo build --target wasm32-unknown-unknown + cargo build --features=js --target wasm32-unknown-unknown # test builds with various feature flags default-build: diff --git a/arrow/Cargo.toml b/arrow/Cargo.toml index eef7dbc..ca343eb 100644 --- a/arrow/Cargo.toml +++ b/arrow/Cargo.toml @@ -43,7 +43,7 @@ indexmap = "1.6" rand = { version = "0.8", default-features = false } # getrandom is a dependency of rand, not (directly) of arrow # need to specify `js` feature to build on wasm -getrandom = { version = "0.2", features = ["js"] } +getrandom = { version = "0.2", optional = true } num = "0.4" csv_crate = { version = "1.1", optional = true, package="csv" } regex = "1.3" @@ -64,6 +64,7 @@ csv = ["csv_crate"] ipc = ["flatbuffers"] simd = ["packed_simd"] prettyprint = ["prettytable-rs"] +js = ["getrandom/js"] # The test utils feature enables code used in benchmarks and tests but # not the core arrow code itself test_utils = ["rand/std", 
"rand/std_rng"] diff --git a/arrow/README.md b/arrow/README.md index f9b7308..77e36ec 100644 --- a/arrow/README.md +++ b/arrow/README.md @@ -30,6 +30,7 @@ The arrow crate provides the following optional features: - `csv` (default) - support for reading and writing Arrow arrays to/from csv files - `ipc` (default) - support for the [arrow-flight]((https://crates.io/crates/arrow-flight) IPC and wire format - `prettyprint` - support for formatting record batches as textual columns +- `js` - support for building arrow for WebAssembly / JavaScript - `simd` - (_Requires Nightly Rust_) alternate optimized implementations of some [compute](https://github.com/apache/arrow/tree/master/rust/arrow/src/compute) kernels using explicit SIMD processor intrinsics.
[arrow-rs] branch master updated: Remove unused futures dependency from arrow-flight (#528)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new 6538fe5 Remove unused futures dependency from arrow-flight (#528) 6538fe5 is described below commit 6538fe597b5952af02f45b715d9363845583129b Author: Andrew Lamb AuthorDate: Fri Jul 9 08:14:11 2021 -0400 Remove unused futures dependency from arrow-flight (#528) --- arrow-flight/Cargo.toml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arrow-flight/Cargo.toml b/arrow-flight/Cargo.toml index 941cc2b..693da46 100644 --- a/arrow-flight/Cargo.toml +++ b/arrow-flight/Cargo.toml @@ -33,6 +33,8 @@ bytes = "1" prost = "0.7" prost-derive = "0.7" tokio = { version = "1.0", features = ["macros", "rt", "rt-multi-thread"] } + +[dev-dependencies] futures = { version = "0.3", default-features = false, features = ["alloc"]} [build-dependencies]
[arrow-rs] branch master updated: simplify interactions with arrow flight APIs (#377)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new 21d69ca simplify interactions with arrow flight APIs (#377) 21d69ca is described below commit 21d69cab9b21398b0947da28b5aac3e22139e818 Author: Gary Pennington <31890086+garyanap...@users.noreply.github.com> AuthorDate: Mon Jul 5 07:44:48 2021 +0100 simplify interactions with arrow flight APIs (#377) * simplify interactions with arrow flight APIs Initial work to implement some basic traits * more polishing and introduction of a couple of wrapper types Some more polishing of the basic code I provided last week. * More polishing Add support for representing tickets as base64 encoded strings. Also: more polishing of Display, etc... * improve BOOLEAN writing logic and report error on encoding fail When writing BOOLEAN data, writing more than 2048 rows of data will overflow the hard-coded 256 buffer set for the bit-writer in the PlainEncoder. Once this occurs, further attempts to write to the encoder fail, becuase capacity is exceeded, but the errors are silently ignored. This fix improves the error detection and reporting at the point of encoding and modifies the logic for bit_writing (BOOLEANS). The bit_writer is initially allocated 256 bytes (as at present), then each time the capacity is exceeded the capacity is incremented by another 256 bytes. This certainly resolves the current problem, but it's not exactly a great fix because the capacity of the bit_writer could now grow substantially. Other data types seem to have a more sophisticated mechanism for writing data which doesn't involve growing or having a fixed size buffer. It would be desirable to make the BOOLEAN type use this same mechanism if possible, but that level of change is more intrusive and probably requires greater knowledge of the implementation than I possess. 
resolves: #349 * only manipulate the bit_writer for BOOLEAN data Tacky, but I can't think of better way to do this without specialization. * better isolation of changes Remove the byte tracking from the PlainEncoder and use the existing bytes_written() method in BitWriter. This is neater. * add test for boolean writer The test ensures that we can write > 2048 rows to a parquet file and that when we read the data back, it finishes without hanging (defined as taking < 5 seconds). If we don't want that extra complexity, we could remove the thread/channel stuff and just try to read the file and let the test runner terminate hanging tests. * fix capacity calculation error in bool encoding The values.len() reports the number of values to be encoded and so must be divided by 8 (bits in a bytes) to determine the effect on the byte capacity of the bit_writer. * make BasicAuth accessible Following merge with master, make sure this is exposed so that integration tests work. also: there has been a release since I last looked at this so update the deprecation warnings. * fix documentation for ipc_message_from_arrow_schema TryFrom, not From * replace deprecated functions in integrations tests with traits clippy complains about using deprecated functions, so replace them with the new trait support. 
also: fix the trait documentation * address review comments - update deprecated warnings - improve TryFrom for DescriptorType --- arrow-flight/Cargo.toml| 1 + arrow-flight/src/lib.rs| 429 - arrow-flight/src/utils.rs | 137 ++- .../flight_client_scenarios/integration_test.rs| 7 +- .../flight_server_scenarios/integration_test.rs| 24 +- 5 files changed, 484 insertions(+), 114 deletions(-) diff --git a/arrow-flight/Cargo.toml b/arrow-flight/Cargo.toml index c6027f8..04a1a93 100644 --- a/arrow-flight/Cargo.toml +++ b/arrow-flight/Cargo.toml @@ -27,6 +27,7 @@ license = "Apache-2.0" [dependencies] arrow = { path = "../arrow", version = "5.0.0-SNAPSHOT" } +base64 = "0.13" tonic = "0.4" bytes = "1" prost = "0.7" diff --git a/arrow-flight/src/lib.rs b/arrow-flight/src/lib.rs index 6af2e74..a431cfc 100644 --- a/arrow-flight/src/lib.rs +++ b/arrow-flight/src/lib.rs @@ -15,6 +15,433 @@ // specific language governing permissions and limitations // under the License. -include!("arrow.flight.protocol.rs"); +use arrow::datatypes::Sch
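The BOOLEAN-encoder fix described in the commit message above (grow the bit-writer's buffer in 256-byte increments instead of silently dropping writes once a fixed 256-byte buffer fills) can be sketched as follows. This is an illustrative bit writer, not the actual parquet-rs `BitWriter`:

```rust
struct BitWriter {
    buffer: Vec<u8>,
    bit_len: usize, // bits written so far
}

impl BitWriter {
    fn new() -> Self {
        // Start at 256 bytes, as before.
        BitWriter { buffer: vec![0; 256], bit_len: 0 }
    }

    fn put_bool(&mut self, v: bool) {
        let byte = self.bit_len / 8;
        if byte >= self.buffer.len() {
            // The fix: extend capacity by another 256 bytes instead of
            // letting the write fail silently.
            self.buffer.resize(self.buffer.len() + 256, 0);
        }
        if v {
            self.buffer[byte] |= 1 << (self.bit_len % 8);
        }
        self.bit_len += 1;
    }

    fn bytes_written(&self) -> usize {
        (self.bit_len + 7) / 8
    }
}

fn main() {
    let mut w = BitWriter::new();
    // More than 2048 booleans used to overflow the fixed buffer.
    for i in 0..4096 {
        w.put_bool(i % 2 == 0);
    }
    assert_eq!(w.bytes_written(), 512);
    println!("wrote {} bytes without dropping data", w.bytes_written());
}
```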
[arrow-rs] branch master updated: fix reader schema (#513)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new ef88876 fix reader schema (#513) ef88876 is described below commit ef8887609017680d94b2a35f9889aa10cf3b3de8 Author: Wakahisa AuthorDate: Wed Jun 30 23:56:44 2021 +0200 fix reader schema (#513) We aren't comparing the right values --- parquet/benches/arrow_array_reader.rs | 8 ++-- 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/parquet/benches/arrow_array_reader.rs b/parquet/benches/arrow_array_reader.rs index 6e87512..acc5141 100644 --- a/parquet/benches/arrow_array_reader.rs +++ b/parquet/benches/arrow_array_reader.rs @@ -31,13 +31,9 @@ fn build_test_schema() -> SchemaDescPtr { let message_type = " message test_schema { REQUIRED INT32 mandatory_int32_leaf; -REPEATED Group test_mid_int32 { -OPTIONAL INT32 optional_int32_leaf; -} +OPTIONAL INT32 optional_int32_leaf; REQUIRED BYTE_ARRAY mandatory_string_leaf (UTF8); -REPEATED Group test_mid_string { -OPTIONAL BYTE_ARRAY optional_string_leaf (UTF8); -} +OPTIONAL BYTE_ARRAY optional_string_leaf (UTF8); } "; parse_message_type(message_type)
[arrow-rs] branch parquet-fix-list-reader created (now 3d6523a)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a change to branch parquet-fix-list-reader in repository https://gitbox.apache.org/repos/asf/arrow-rs.git. at 3d6523a fix reader schema This branch includes the following new commits: new 3d6523a fix reader schema The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.
[arrow-rs] 01/01: fix reader schema
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch parquet-fix-list-reader in repository https://gitbox.apache.org/repos/asf/arrow-rs.git commit 3d6523afd89be5b0b3d681ab0b12073eb63c9fc6 Author: Neville Dipale AuthorDate: Sat Jun 26 13:08:40 2021 +0200 fix reader schema We aren't comparing the right values --- parquet/benches/arrow_array_reader.rs | 8 ++-- 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/parquet/benches/arrow_array_reader.rs b/parquet/benches/arrow_array_reader.rs index 6e87512..acc5141 100644 --- a/parquet/benches/arrow_array_reader.rs +++ b/parquet/benches/arrow_array_reader.rs @@ -31,13 +31,9 @@ fn build_test_schema() -> SchemaDescPtr { let message_type = " message test_schema { REQUIRED INT32 mandatory_int32_leaf; -REPEATED Group test_mid_int32 { -OPTIONAL INT32 optional_int32_leaf; -} +OPTIONAL INT32 optional_int32_leaf; REQUIRED BYTE_ARRAY mandatory_string_leaf (UTF8); -REPEATED Group test_mid_string { -OPTIONAL BYTE_ARRAY optional_string_leaf (UTF8); -} +OPTIONAL BYTE_ARRAY optional_string_leaf (UTF8); } "; parse_message_type(message_type)
[arrow-rs] branch master updated: Implement function slice for RecordBatch (#490)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new de62168 Implement function slice for RecordBatch (#490) de62168 is described below commit de62168a4f428e3c334e1cfa5c5db23272f313d7 Author: baishen AuthorDate: Fri Jun 25 11:36:44 2021 -0500 Implement function slice for RecordBatch (#490) * Implement RecordBatch::slice() * optimize * optimize * add test case * fix clippy --- arrow/src/record_batch.rs | 91 +++ 1 file changed, 84 insertions(+), 7 deletions(-) diff --git a/arrow/src/record_batch.rs b/arrow/src/record_batch.rs index f1fd867..4d2abc3 100644 --- a/arrow/src/record_batch.rs +++ b/arrow/src/record_batch.rs @@ -244,6 +244,31 @@ impl RecordBatch { [..] } +/// Return a new RecordBatch where each column is sliced +/// according to `offset` and `length` +/// +/// # Panics +/// +/// Panics if `offset` with `length` is greater than column length. +pub fn slice(, offset: usize, length: usize) -> RecordBatch { +if self.schema.fields().is_empty() { +assert!((offset + length) == 0); +return RecordBatch::new_empty(self.schema.clone()); +} +assert!((offset + length) <= self.num_rows()); + +let columns = self +.columns() +.iter() +.map(|column| column.slice(offset, length)) +.collect(); + +Self { +schema: self.schema.clone(), +columns, +} +} + /// Create a `RecordBatch` from an iterable list of pairs of the /// form `(field_name, array)`, with the same requirements on /// fields and arrays as [`RecordBatch::try_new`]. 
This method is @@ -414,16 +439,68 @@ mod tests { let record_batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(a), Arc::new(b)]) .unwrap(); -check_batch(record_batch) +check_batch(record_batch, 5) } -fn check_batch(record_batch: RecordBatch) { -assert_eq!(5, record_batch.num_rows()); +fn check_batch(record_batch: RecordBatch, num_rows: usize) { +assert_eq!(num_rows, record_batch.num_rows()); assert_eq!(2, record_batch.num_columns()); assert_eq!(::Int32, record_batch.schema().field(0).data_type()); assert_eq!(::Utf8, record_batch.schema().field(1).data_type()); -assert_eq!(5, record_batch.column(0).data().len()); -assert_eq!(5, record_batch.column(1).data().len()); +assert_eq!(num_rows, record_batch.column(0).data().len()); +assert_eq!(num_rows, record_batch.column(1).data().len()); +} + +#[test] +#[should_panic(expected = "assertion failed: (offset + length) <= self.num_rows()")] +fn create_record_batch_slice() { +let schema = Schema::new(vec![ +Field::new("a", DataType::Int32, false), +Field::new("b", DataType::Utf8, false), +]); +let expected_schema = schema.clone(); + +let a = Int32Array::from(vec![1, 2, 3, 4, 5, 6, 7, 8]); +let b = StringArray::from(vec!["a", "b", "c", "d", "e", "f", "h", "i"]); + +let record_batch = +RecordBatch::try_new(Arc::new(schema), vec![Arc::new(a), Arc::new(b)]) +.unwrap(); + +let offset = 2; +let length = 5; +let record_batch_slice = record_batch.slice(offset, length); + +assert_eq!(record_batch_slice.schema().as_ref(), _schema); +check_batch(record_batch_slice, 5); + +let offset = 2; +let length = 0; +let record_batch_slice = record_batch.slice(offset, length); + +assert_eq!(record_batch_slice.schema().as_ref(), _schema); +check_batch(record_batch_slice, 0); + +let offset = 2; +let length = 10; +let _record_batch_slice = record_batch.slice(offset, length); +} + +#[test] +#[should_panic(expected = "assertion failed: (offset + length) == 0")] +fn create_record_batch_slice_empty_batch() { +let schema = Schema::new(vec![]); + 
+let record_batch = RecordBatch::new_empty(Arc::new(schema)); + +let offset = 0; +let length = 0; +let record_batch_slice = record_batch.slice(offset, length); +assert_eq!(0, record_batch_slice.schema().fields().len()); + +let offset = 1; +let length = 2; +let _record_batch_slice = record_batch.slice(offset, length); } #[test] @@ -445,7 +522,7 @@ mod tests { Field::new("b", DataType::Utf8, false), ]); assert_eq
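The slice contract introduced above (panic when `offset + length` exceeds the row count; an empty schema only admits the empty slice) can be sketched with plain vectors standing in for Arrow columns. Illustrative code, not the arrow-rs implementation:

```rust
struct Batch {
    columns: Vec<Vec<i32>>, // stand-in for Arc<dyn Array> columns
    num_rows: usize,
}

impl Batch {
    /// Panics if `offset + length` is greater than the row count.
    fn slice(&self, offset: usize, length: usize) -> Batch {
        if self.columns.is_empty() {
            // A schema with no fields has no rows, so only the empty
            // slice is valid.
            assert!(offset + length == 0);
            return Batch { columns: vec![], num_rows: 0 };
        }
        assert!(offset + length <= self.num_rows);
        Batch {
            columns: self
                .columns
                .iter()
                .map(|c| c[offset..offset + length].to_vec())
                .collect(),
            num_rows: length,
        }
    }
}

fn main() {
    let batch = Batch { columns: vec![(1..=8).collect()], num_rows: 8 };
    let s = batch.slice(2, 5);
    assert_eq!(s.num_rows, 5);
    assert_eq!(s.columns[0], vec![3, 4, 5, 6, 7]);
    println!("slice ok: {:?}", s.columns[0]);
}
```

In the real implementation each column is sliced with the zero-copy `Array::slice`, so only offsets change, not the underlying buffers.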
[arrow-rs] branch master updated: remove stale comment and update unit tests (#472)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new 6e2f684 remove stale comment and update unit tests (#472) 6e2f684 is described below commit 6e2f68420e03fe6926e8c2ffbd4441fc8cc1aeab Author: Jiayu Liu AuthorDate: Sun Jun 20 00:40:15 2021 +0800 remove stale comment and update unit tests (#472) --- arrow/src/array/array_struct.rs | 24 ++-- arrow/src/array/builder.rs | 24 ++-- 2 files changed, 4 insertions(+), 44 deletions(-) diff --git a/arrow/src/array/array_struct.rs b/arrow/src/array/array_struct.rs index 9c11b83..f721d35 100644 --- a/arrow/src/array/array_struct.rs +++ b/arrow/src/array/array_struct.rs @@ -362,28 +362,8 @@ mod tests { .add_buffer(Buffer::from(&[1, 2, 0, 4].to_byte_slice())) .build(); -assert_eq!(_string_data, arr.column(0).data()); - -// TODO: implement equality for ArrayData -assert_eq!(expected_int_data.len(), arr.column(1).data().len()); -assert_eq!( -expected_int_data.null_count(), -arr.column(1).data().null_count() -); -assert_eq!( -expected_int_data.null_bitmap(), -arr.column(1).data().null_bitmap() -); -let expected_value_buf = expected_int_data.buffers()[0].clone(); -let actual_value_buf = arr.column(1).data().buffers()[0].clone(); -for i in 0..expected_int_data.len() { -if !expected_int_data.is_null(i) { -assert_eq!( -expected_value_buf.as_slice()[i * 4..(i + 1) * 4], -actual_value_buf.as_slice()[i * 4..(i + 1) * 4] -); -} -} +assert_eq!(expected_string_data, *arr.column(0).data()); +assert_eq!(expected_int_data, *arr.column(1).data()); } #[test] diff --git a/arrow/src/array/builder.rs b/arrow/src/array/builder.rs index eacd764..66f2d81 100644 --- a/arrow/src/array/builder.rs +++ b/arrow/src/array/builder.rs @@ -3050,28 +3050,8 @@ mod tests { .add_buffer(Buffer::from_slice_ref(&[1, 2, 0, 4])) .build(); -assert_eq!(_string_data, 
arr.column(0).data()); - -// TODO: implement equality for ArrayData -assert_eq!(expected_int_data.len(), arr.column(1).data().len()); -assert_eq!( -expected_int_data.null_count(), -arr.column(1).data().null_count() -); -assert_eq!( -expected_int_data.null_bitmap(), -arr.column(1).data().null_bitmap() -); -let expected_value_buf = expected_int_data.buffers()[0].clone(); -let actual_value_buf = arr.column(1).data().buffers()[0].clone(); -for i in 0..expected_int_data.len() { -if !expected_int_data.is_null(i) { -assert_eq!( -expected_value_buf.as_slice()[i * 4..(i + 1) * 4], -actual_value_buf.as_slice()[i * 4..(i + 1) * 4] -); -} -} +assert_eq!(expected_string_data, *arr.column(0).data()); +assert_eq!(expected_int_data, *arr.column(1).data()); } #[test]
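This cleanup works because `ArrayData` gained a `PartialEq` implementation, so tests no longer compare length, null count, null bitmap, and buffer bytes by hand. A minimal stand-in sketch of the idea (`ArrayDataLike` is hypothetical, not the real arrow-rs type):

```rust
// Once a container type implements PartialEq (here via derive), the
// field-by-field assertions the commit deletes collapse into one assert_eq!.
#[derive(Debug, PartialEq)]
struct ArrayDataLike {
    len: usize,
    null_count: usize,
    buffer: Vec<u8>,
}

fn make(len: usize, null_count: usize, buffer: Vec<u8>) -> ArrayDataLike {
    ArrayDataLike { len, null_count, buffer }
}

fn main() {
    // Before: assert_eq! on len, null_count, null_bitmap, then a manual loop
    // comparing buffer slices. After: structural equality in one line.
    assert_eq!(make(4, 1, vec![1, 2, 0, 4]), make(4, 1, vec![1, 2, 0, 4]));
    assert_ne!(make(4, 1, vec![1, 2, 0, 4]), make(4, 0, vec![1, 2, 0, 4]));
}
```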
[arrow-rs] branch master updated: remove unused patch file (#471)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new 8bdbf9d remove unused patch file (#471) 8bdbf9d is described below commit 8bdbf9d593c9270a0fe6ed9746d8d96c2bb27a19 Author: Jiayu Liu AuthorDate: Sun Jun 20 00:07:58 2021 +0800 remove unused patch file (#471) --- arrow/format-0ed34c83.patch | 220 1 file changed, 220 deletions(-) diff --git a/arrow/format-0ed34c83.patch b/arrow/format-0ed34c83.patch deleted file mode 100644 index 5da0a0c..000 --- a/arrow/format-0ed34c83.patch +++ /dev/null @@ -1,220 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -diff --git a/format/Message.fbs b/format/Message.fbs -index 1a7e0dfff..f1c18d765 100644 a/format/Message.fbs -+++ b/format/Message.fbs -@@ -28,7 +28,7 @@ namespace org.apache.arrow.flatbuf; - /// Metadata about a field at some level of a nested type tree (but not - /// its children). 
- /// --/// For example, a List with values [[1, 2, 3], null, [4], [5, 6], null] -+/// For example, a List with values `[[1, 2, 3], null, [4], [5, 6], null]` - /// would have {length: 5, null_count: 2} for its List node, and {length: 6, - /// null_count: 0} for its Int16 node, as separate FieldNode structs - struct FieldNode { -diff --git a/format/Schema.fbs b/format/Schema.fbs -index 3b37e5d85..3b00dd478 100644 a/format/Schema.fbs -+++ b/format/Schema.fbs -@@ -110,10 +110,11 @@ table FixedSizeList { - /// not enforced. - /// - /// Map -+/// ```text - /// - child[0] entries: Struct - /// - child[0] key: K - /// - child[1] value: V --/// -+/// ``` - /// Neither the "entries" field nor the "key" field may be nullable. - /// - /// The metadata is structured so that Arrow systems without special handling -@@ -129,7 +130,7 @@ enum UnionMode:short { Sparse, Dense } - /// A union is a complex type with children in Field - /// By default ids in the type vector refer to the offsets in the children - /// optionally typeIds provides an indirection between the child offset and the type id --/// for each child typeIds[offset] is the id used in the type vector -+/// for each child `typeIds[offset]` is the id used in the type vector - table Union { - mode: UnionMode; - typeIds: [ int ]; // optional, describes typeid of each child. 
-diff --git a/format/SparseTensor.fbs b/format/SparseTensor.fbs -index 3fe8a7582..a6fd2f9e7 100644 a/format/SparseTensor.fbs -+++ b/format/SparseTensor.fbs -@@ -37,21 +37,21 @@ namespace org.apache.arrow.flatbuf; - /// - /// For example, let X be a 2x3x4x5 tensor, and it has the following - /// 6 non-zero values: --/// -+/// ```text - /// X[0, 1, 2, 0] := 1 - /// X[1, 1, 2, 3] := 2 - /// X[0, 2, 1, 0] := 3 - /// X[0, 1, 3, 0] := 4 - /// X[0, 1, 2, 1] := 5 - /// X[1, 2, 0, 4] := 6 --/// -+/// ``` - /// In COO format, the index matrix of X is the following 4x6 matrix: --/// -+/// ```text - /// [[0, 0, 0, 0, 1, 1], - ///[1, 1, 1, 2, 1, 2], - ///[2, 2, 3, 1, 2, 0], - ///[0, 1, 0, 0, 3, 4]] --/// -+/// ``` - /// When isCanonical is true, the indices is sorted in lexicographical order - /// (row-major order), and it does not have duplicated entries. Otherwise, - /// the indices may not be sorted, or may have duplicated entries. -@@ -86,26 +86,27 @@ table SparseMatrixIndexCSX { - - /// indptrBuffer stores the location and size of indptr array that - /// represents the range of the rows. -- /// The i-th row spans from indptr[i] to indptr[i+1] in the data. -+ /// The i-th row spans from `indptr[i]` to `indptr[i+1]` in the data. - /// The length of this array is 1 + (the number of rows), and the type - /// of index value is long. - /// - /// For example, let X be the following 6x4 matrix: -- /// -+ /// ```text - /// X := [[0, 1, 2, 0], - /// [0, 0, 3, 0], - /// [0, 4, 0, 5], - /// [0, 0, 0, 0], - /// [6, 0, 7, 8], - /// [0, 9, 0, 0]]. -- /// -+ /// ``` - /// The array of non-zero
[arrow-rs] branch master updated: Implement the Iterator trait for the json Reader. (#451)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git

The following commit(s) were added to refs/heads/master by this push:
     new e5cda31  Implement the Iterator trait for the json Reader. (#451)

e5cda31 is described below

commit e5cda312b697c3d610637b28c58b6f1b104b41cc
Author: Laurent Mazare
AuthorDate: Sun Jun 13 08:22:38 2021 +0800

    Implement the Iterator trait for the json Reader. (#451)

    * Implement the Iterator trait for the json Reader.

    * Use transpose.
---
 arrow/src/json/reader.rs | 39 +++
 1 file changed, 39 insertions(+)

diff --git a/arrow/src/json/reader.rs b/arrow/src/json/reader.rs
index d0b9c19..9235142 100644
--- a/arrow/src/json/reader.rs
+++ b/arrow/src/json/reader.rs
@@ -1569,6 +1569,14 @@ impl ReaderBuilder {
     }
 }

+impl<R: Read> Iterator for Reader<R> {
+    type Item = Result<RecordBatch>;
+
+    fn next(&mut self) -> Option<Self::Item> {
+        self.next().transpose()
+    }
+}
+
 #[cfg(test)]
 mod tests {
     use crate::{
@@ -2946,4 +2954,35 @@ mod tests {
         assert_eq!(batch.num_columns(), 1);
         assert_eq!(batch.num_rows(), 3);
     }
+
+    #[test]
+    fn test_json_iterator() {
+        let builder = ReaderBuilder::new().infer_schema(None).with_batch_size(5);
+        let reader: Reader<File> = builder
+            .build::<File>(File::open("test/data/basic.json").unwrap())
+            .unwrap();
+        let schema = reader.schema();
+        let (col_a_index, _) = schema.column_with_name("a").unwrap();
+
+        let mut sum_num_rows = 0;
+        let mut num_batches = 0;
+        let mut sum_a = 0;
+        for batch in reader {
+            let batch = batch.unwrap();
+            assert_eq!(4, batch.num_columns());
+            sum_num_rows += batch.num_rows();
+            num_batches += 1;
+            let batch_schema = batch.schema();
+            assert_eq!(schema, batch_schema);
+            let a_array = batch
+                .column(col_a_index)
+                .as_any()
+                .downcast_ref::<Int64Array>()
+                .unwrap();
+            sum_a += (0..a_array.len()).map(|i| a_array.value(i)).sum::<i64>();
+        }
+        assert_eq!(12, sum_num_rows);
+        assert_eq!(3, num_batches);
+        assert_eq!(111, sum_a);
+    }
 }
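The pattern in this commit: the reader already has a fallible, `Option`-returning read method (`Result<Option<T>>`, where `Ok(None)` means end of input), and `Result::transpose` flips that into the `Option<Result<T>>` shape the `Iterator` trait wants. A self-contained sketch with illustrative types (in the real commit, `Reader`'s inherent `next` method plays the role of `next_batch` here):

```rust
// A toy reader: Ok(None) signals end of input; Err would signal a read failure.
struct Reader {
    batches: Vec<i64>,
    pos: usize,
}

impl Reader {
    fn next_batch(&mut self) -> Result<Option<i64>, String> {
        if self.pos < self.batches.len() {
            self.pos += 1;
            Ok(Some(self.batches[self.pos - 1]))
        } else {
            Ok(None)
        }
    }
}

impl Iterator for Reader {
    type Item = Result<i64, String>;

    fn next(&mut self) -> Option<Self::Item> {
        // Result<Option<T>, E> -> Option<Result<T, E>>
        self.next_batch().transpose()
    }
}

fn main() {
    let reader = Reader { batches: vec![1, 2, 3], pos: 0 };
    // The reader now works with for-loops and all Iterator adapters.
    let sum: i64 = reader.map(|b| b.unwrap()).sum();
    assert_eq!(sum, 6);
}
```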
[arrow-rs] branch master updated: Add Decimal to CsvWriter and improve debug display (#406)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new fb45112 Add Decimal to CsvWriter and improve debug display (#406) fb45112 is described below commit fb451125c4ed49a425de10afb6f42af0d9723a19 Author: Ádám Lippai AuthorDate: Sun Jun 13 02:20:08 2021 +0200 Add Decimal to CsvWriter and improve debug display (#406) * Add Decimal to CsvWriter and improve debug display * Measure CSV writer instead of file and data creation * Re-use decimal formatting --- arrow/benches/csv_writer.rs | 19 ++- arrow/src/array/array_binary.rs | 36 arrow/src/csv/writer.rs | 23 --- arrow/src/util/display.rs | 27 --- 4 files changed, 62 insertions(+), 43 deletions(-) diff --git a/arrow/benches/csv_writer.rs b/arrow/benches/csv_writer.rs index 50b94d6..62c5da9 100644 --- a/arrow/benches/csv_writer.rs +++ b/arrow/benches/csv_writer.rs @@ -28,14 +28,14 @@ use arrow::record_batch::RecordBatch; use std::fs::File; use std::sync::Arc; -fn record_batches_to_csv() { +fn criterion_benchmark(c: Criterion) { #[cfg(feature = "csv")] { let schema = Schema::new(vec![ Field::new("c1", DataType::Utf8, false), Field::new("c2", DataType::Float64, true), Field::new("c3", DataType::UInt32, false), -Field::new("c3", DataType::Boolean, true), +Field::new("c4", DataType::Boolean, true), ]); let c1 = StringArray::from(vec![ @@ -59,16 +59,17 @@ fn record_batches_to_csv() { let file = File::create("target/bench_write_csv.csv").unwrap(); let mut writer = csv::Writer::new(file); let batches = vec![, , , , , , , , , , ]; -#[allow(clippy::unit_arg)] -criterion::black_box(for batch in batches { -writer.write(batch).unwrap() + +c.bench_function("record_batches_to_csv", |b| { +b.iter(|| { +#[allow(clippy::unit_arg)] +criterion::black_box(for batch in { +writer.write(batch).unwrap() +}); +}); }); } } -fn criterion_benchmark(c: Criterion) 
{ -c.bench_function("record_batches_to_csv", |b| b.iter(record_batches_to_csv)); -} - criterion_group!(benches, criterion_benchmark); criterion_main!(benches); diff --git a/arrow/src/array/array_binary.rs b/arrow/src/array/array_binary.rs index 0cb4db4..0b374db 100644 --- a/arrow/src/array/array_binary.rs +++ b/arrow/src/array/array_binary.rs @@ -666,6 +666,17 @@ impl DecimalArray { self.length * i as i32 } +#[inline] +pub fn value_as_string(, row: usize) -> String { +let decimal_string = self.value(row).to_string(); +if self.scale == 0 { +decimal_string +} else { +let splits = decimal_string.split_at(decimal_string.len() - self.scale); +format!("{}.{}", splits.0, splits.1) +} +} + pub fn from_fixed_size_list_array( v: FixedSizeListArray, precision: usize, @@ -729,7 +740,9 @@ impl fmt::Debug for DecimalArray { fn fmt(, f: fmt::Formatter) -> fmt::Result { write!(f, "DecimalArray<{}, {}>\n[\n", self.precision, self.scale)?; print_long_array(self, f, |array, index, f| { -fmt::Debug::fmt((index), f) +let formatted_decimal = array.value_as_string(index); + +write!(f, "{}", formatted_decimal) })?; write!(f, "]") } @@ -758,7 +771,7 @@ impl Array for DecimalArray { #[cfg(test)] mod tests { use crate::{ -array::{LargeListArray, ListArray}, +array::{DecimalBuilder, LargeListArray, ListArray}, datatypes::Field, }; @@ -1163,17 +1176,16 @@ mod tests { #[test] fn test_decimal_array_fmt_debug() { -let values: [u8; 32] = [ -192, 219, 180, 17, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 64, 36, 75, 238, 253, -255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, -]; -let array_data = ArrayData::builder(DataType::Decimal(23, 6)) -.len(2) -.add_buffer(Buffer::from([..])) -.build(); -let arr = DecimalArray::from(array_data); +let values: Vec = vec![888700, -888700]; +let mut decimal_builder = DecimalBuilder::new(3, 23, 6); + +values.iter().for_each(|| { +decimal_builder.append_value(value).unwrap(); +}); +decimal_builder.append_null().unwrap(); +let arr = decimal_builder.
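The `value_as_string` addition formats a decimal by splitting its integer string representation `scale` digits from the right. A standalone sketch of that logic (illustrative, not the real `DecimalArray` API; like the commit's version, it assumes the value has at least `scale` digits, so no zero-padding is done):

```rust
fn decimal_to_string(value: i128, scale: usize) -> String {
    let s = value.to_string();
    if scale == 0 {
        s
    } else {
        // Split `scale` digits off the right-hand side:
        // 123456789 at scale 4 becomes "12345" + "." + "6789".
        let (int_part, frac_part) = s.split_at(s.len() - scale);
        format!("{}.{}", int_part, frac_part)
    }
}

fn main() {
    assert_eq!(decimal_to_string(123456789, 4), "12345.6789");
    assert_eq!(decimal_to_string(-888700, 2), "-8887.00");
    assert_eq!(decimal_to_string(42, 0), "42");
}
```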
[arrow-rs] branch master updated: remove unnecessary wraps in sortk (#445)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new efe86cd remove unnecessary wraps in sortk (#445) efe86cd is described below commit efe86cdf329ec4bfad3b72bd23ee6558340fa297 Author: Jiayu Liu AuthorDate: Sun Jun 13 08:00:35 2021 +0800 remove unnecessary wraps in sortk (#445) --- arrow/src/compute/kernels/sort.rs | 96 +-- 1 file changed, 51 insertions(+), 45 deletions(-) diff --git a/arrow/src/compute/kernels/sort.rs b/arrow/src/compute/kernels/sort.rs index dff5695..b0eecb9 100644 --- a/arrow/src/compute/kernels/sort.rs +++ b/arrow/src/compute/kernels/sort.rs @@ -163,7 +163,7 @@ pub fn sort_to_indices( let (v, n) = partition_validity(values); -match values.data_type() { +Ok(match values.data_type() { DataType::Boolean => sort_boolean(values, v, n, , limit), DataType::Int8 => { sort_primitive::(values, v, n, cmp, , limit) @@ -278,10 +278,12 @@ pub fn sort_to_indices( DataType::Float64 => { sort_list::(values, v, n, , limit) } -t => Err(ArrowError::ComputeError(format!( -"Sort not supported for list type {:?}", -t -))), +t => { +return Err(ArrowError::ComputeError(format!( +"Sort not supported for list type {:?}", +t +))) +} }, DataType::LargeList(field) => match field.data_type() { DataType::Int8 => sort_list::(values, v, n, , limit), @@ -304,10 +306,12 @@ pub fn sort_to_indices( DataType::Float64 => { sort_list::(values, v, n, , limit) } -t => Err(ArrowError::ComputeError(format!( -"Sort not supported for list type {:?}", -t -))), +t => { +return Err(ArrowError::ComputeError(format!( +"Sort not supported for list type {:?}", +t +))) +} }, DataType::FixedSizeList(field, _) => match field.data_type() { DataType::Int8 => sort_list::(values, v, n, , limit), @@ -330,10 +334,12 @@ pub fn sort_to_indices( DataType::Float64 => { sort_list::(values, v, n, , limit) } -t => 
Err(ArrowError::ComputeError(format!( -"Sort not supported for list type {:?}", -t -))), +t => { +return Err(ArrowError::ComputeError(format!( +"Sort not supported for list type {:?}", +t +))) +} }, DataType::Dictionary(key_type, value_type) if *value_type.as_ref() == DataType::Utf8 => @@ -363,17 +369,21 @@ pub fn sort_to_indices( DataType::UInt64 => { sort_string_dictionary::(values, v, n, , limit) } -t => Err(ArrowError::ComputeError(format!( -"Sort not supported for dictionary key type {:?}", -t -))), +t => { +return Err(ArrowError::ComputeError(format!( +"Sort not supported for dictionary key type {:?}", +t +))) +} } } -t => Err(ArrowError::ComputeError(format!( -"Sort not supported for data type {:?}", -t -))), -} +t => { +return Err(ArrowError::ComputeError(format!( +"Sort not supported for data type {:?}", +t +))) +} +}) } /// Options that define how sort kernels should behave @@ -396,14 +406,13 @@ impl Default for SortOptions { } /// Sort primitive values -#[allow(clippy::unnecessary_wraps)] fn sort_boolean( values: , value_indices: Vec, null_indices: Vec, options: , limit: Option, -) -> Result { +) -> UInt32Array { let values = values .as_any() .downcast_ref::() @@ -469,11 +478,10 @@ fn sort_boolean( vec![], ); -Ok(UInt32Array::from(result_data)) +UInt32Array::from(result_data) } /// Sort primitive values -#[allow(clippy::unnecessary_wraps)] fn sort_primitive( values: , value_indices: Vec, @@ -481,7 +489,7 @@ fn sort_primitive( cmp: F, options: , limit: Option, -) -> Result +) -> UInt32Array where T
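The refactor replaces per-arm `Ok(...)`/`Err(...)` with a single `Ok(match ...)` whose unsupported arms bail out via `return Err(...)`; this in turn lets the helper sort functions return plain arrays instead of `Result`, dropping the `#[allow(clippy::unnecessary_wraps)]` attributes. A minimal sketch of the pattern (types are illustrative):

```rust
#[derive(Debug, PartialEq)]
enum SortKind {
    Primitive,
    String,
}

fn sort_kind(data_type: &str) -> Result<SortKind, String> {
    // Wrapping the whole match in Ok(...) keeps the happy-path arms free of
    // Result plumbing; only the unsupported arm needs an explicit early return.
    Ok(match data_type {
        "Int8" | "Int16" | "Float64" => SortKind::Primitive,
        "Utf8" => SortKind::String,
        t => return Err(format!("Sort not supported for data type {:?}", t)),
    })
}

fn main() {
    assert_eq!(sort_kind("Utf8"), Ok(SortKind::String));
    assert!(sort_kind("List").is_err());
}
```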
[arrow-datafusion] 01/01: add expr::like and expr::notlike to pruning logic
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch i507-string-like-prune in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git commit 1062d5c8e77291bd7ae2245b2f701c12d4d27310 Author: Neville Dipale AuthorDate: Sat Jun 5 11:57:56 2021 +0200 add expr::like and expr::notlike to pruning logic --- datafusion/src/physical_optimizer/pruning.rs | 96 +++- 1 file changed, 94 insertions(+), 2 deletions(-) diff --git a/datafusion/src/physical_optimizer/pruning.rs b/datafusion/src/physical_optimizer/pruning.rs index c65733b..0e43e4e 100644 --- a/datafusion/src/physical_optimizer/pruning.rs +++ b/datafusion/src/physical_optimizer/pruning.rs @@ -42,6 +42,7 @@ use crate::{ logical_plan::{Expr, Operator}, optimizer::utils, physical_plan::{planner::DefaultPhysicalPlanner, ColumnarValue, PhysicalExpr}, +scalar::ScalarValue, }; /// Interface to pass statistics information to [`PruningPredicates`] @@ -548,7 +549,7 @@ fn build_predicate_expression( // allow partial failure in predicate expression generation // this can still produce a useful predicate when multiple conditions are joined using AND Err(_) => { -return Ok(logical_plan::lit(true)); +return Ok(unhandled); } }; let corrected_op = expr_builder.correct_operator(op); @@ -586,8 +587,45 @@ fn build_predicate_expression( .min_column_expr()? 
.lt_eq(expr_builder.scalar_expr().clone()) } +Operator::Like => { +match &**right { +// If the literal is a 'starts_with' +Expr::Literal(ScalarValue::Utf8(Some(string))) +if !string.starts_with('%') => +{ +let scalar_expr = + Expr::Literal(ScalarValue::Utf8(Some(string.replace('%', ""; +// Behaves like Eq +let min_column_expr = expr_builder.min_column_expr()?; +let max_column_expr = expr_builder.max_column_expr()?; +min_column_expr +.lt_eq(scalar_expr.clone()) +.and(scalar_expr.lt_eq(max_column_expr)) +} +_ => unhandled, +} +} +Operator::NotLike => { +match &**right { +// If the literal is a 'starts_with' +Expr::Literal(ScalarValue::Utf8(Some(string))) +if !string.starts_with('%') => +{ +let scalar_expr = + Expr::Literal(ScalarValue::Utf8(Some(string.replace('%', ""; +// Behaves like Eq +let min_column_expr = expr_builder.min_column_expr()?; +let max_column_expr = expr_builder.max_column_expr()?; +// Inverse of Like +min_column_expr +.gt_eq(scalar_expr.clone()) +.and(scalar_expr.gt_eq(max_column_expr)) +} +_ => unhandled, +} +} // other expressions are not supported -_ => logical_plan::lit(true), +_ => unhandled, }; Ok(statistics_expr) } @@ -1096,6 +1134,60 @@ mod tests { } #[test] +fn row_group_predicate_starts_with() -> Result<()> { +let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]); +// test LIKE operator that is converted to a 'starts_with' +let expr = col("c1").like(lit("Banana%")); +let expected_expr = +"#c1_min LtEq Utf8(\"Banana\") And Utf8(\"Banana\") LtEq #c1_max"; +let predicate_expr = +build_predicate_expression(, , RequiredStatColumns::new())?; +assert_eq!(format!("{:?}", predicate_expr), expected_expr); + +Ok(()) +} + +#[test] +fn row_group_predicate_like() -> Result<()> { +let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]); +// test LIKE operator that can't be converted to a 'starts_with' +let expr = col("c1").like(lit("%Banana%")); +let expected_expr = "Boolean(true)"; +let predicate_expr = 
+build_predicate_expression(, , RequiredStatColumns::new())?; +assert_eq!(format!("{:?}", predicate_expr), expected_expr); + +Ok(()) +} + +#[test] +fn row_group_predicate_not_starts_with() -> Result<()> { +let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]); +// test LIKE operator that can't be converted to a 'star
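The pruning idea behind these tests: a LIKE pattern that is a plain prefix (no leading `%`) can be checked against per-row-group min/max statistics, because a row group can only contain a match if the prefix falls lexicographically within `[min, max]` — the same shape as the `Eq` case (`min <= literal AND literal <= max`). Patterns starting with `%` stay unhandled (`Boolean(true)`, i.e. keep the row group). A reduced sketch with illustrative function names, evaluating the predicate directly instead of building an expression tree:

```rust
// Extract the literal prefix from a LIKE pattern, if it has one.
fn prefix_of_like(pattern: &str) -> Option<String> {
    if pattern.starts_with('%') {
        None // e.g. '%Banana%' cannot be pruned with min/max stats
    } else {
        Some(pattern.replace('%', ""))
    }
}

// Can this row group possibly contain a row matching `col LIKE pattern`?
fn may_contain_match(min: &str, max: &str, pattern: &str) -> bool {
    match prefix_of_like(pattern) {
        // Behaves like the Eq case in the commit: min <= prefix <= max.
        Some(prefix) => min <= prefix.as_str() && prefix.as_str() <= max,
        None => true, // unhandled pattern: conservatively keep the row group
    }
}

fn main() {
    assert!(may_contain_match("Apple", "Cherry", "Banana%")); // keep
    assert!(!may_contain_match("Date", "Fig", "Banana%"));    // prune
    assert!(may_contain_match("Date", "Fig", "%Banana%"));    // can't prune
}
```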
[arrow-datafusion] branch i507-string-like-prune created (now 1062d5c)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a change to branch i507-string-like-prune in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git. at 1062d5c add expr::like and expr::notlike to pruning logic This branch includes the following new commits: new 1062d5c add expr::like and expr::notlike to pruning logic The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.
[arrow-rs] branch master updated: use prettiery to auto format md files (#398)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new 2ddc717 use prettiery to auto format md files (#398) 2ddc717 is described below commit 2ddc7174af170e923c77d02ad9bd58027bd260e1 Author: Jiayu Liu AuthorDate: Sat Jun 5 13:01:58 2021 +0800 use prettiery to auto format md files (#398) --- .github/workflows/dev.yml | 14 +++- CODE_OF_CONDUCT.md | 4 +-- CONTRIBUTING.md| 26 +++ README.md | 34 ++-- arrow/README.md| 36 ++--- .../tests/fixtures/crossbow-success-message.md | 12 +++ dev/release/README.md | 35 integration-testing/README.md | 10 +++--- parquet/README.md | 37 +++--- 9 files changed, 107 insertions(+), 101 deletions(-) diff --git a/.github/workflows/dev.yml b/.github/workflows/dev.yml index 9d8146a..545cb97 100644 --- a/.github/workflows/dev.yml +++ b/.github/workflows/dev.yml @@ -27,7 +27,6 @@ env: ARCHERY_DOCKER_PASSWORD: ${{ secrets.DOCKERHUB_TOKEN }} jobs: - lint: name: Lint C++, Python, R, Rust, Docker, RAT runs-on: ubuntu-latest @@ -41,3 +40,16 @@ jobs: run: pip install -e dev/archery[docker] - name: Lint run: archery lint --rat + prettier: +name: Use prettier to check formatting of documents +runs-on: ubuntu-latest +steps: + - uses: actions/checkout@v2 + - uses: actions/setup-node@v2 +with: + node-version: "14" + - name: Prettier check +run: | + # if you encounter error, try rerun the command below with --write instead of --check + # and commit the changes + npx prettier@2.3.0 --check {arrow,arrow-flight,dev,integration-testing,parquet}/**/*.md README.md CODE_OF_CONDUCT.md CONTRIBUTING.md diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md index 2efe740..9a24b9b 100644 --- a/CODE_OF_CONDUCT.md +++ b/CODE_OF_CONDUCT.md @@ -19,6 +19,6 @@ # Code of Conduct -* [Code of Conduct for The Apache Software Foundation][1] +- [Code of Conduct for The Apache Software 
Foundation][1] -[1]: https://www.apache.org/foundation/policies/conduct.html \ No newline at end of file +[1]: https://www.apache.org/foundation/policies/conduct.html diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 3e636d9..18d6a7b 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -21,15 +21,15 @@ ## Did you find a bug? -The Arrow project uses JIRA as a bug tracker. To report a bug, you'll have +The Arrow project uses JIRA as a bug tracker. To report a bug, you'll have to first create an account on the -[Apache Foundation JIRA](https://issues.apache.org/jira/). The JIRA server -hosts bugs and issues for multiple Apache projects. The JIRA project name +[Apache Foundation JIRA](https://issues.apache.org/jira/). The JIRA server +hosts bugs and issues for multiple Apache projects. The JIRA project name for Arrow is "ARROW". To be assigned to an issue, ask an Arrow JIRA admin to go to [Arrow Roles](https://issues.apache.org/jira/plugins/servlet/project-config/ARROW/roles), -click "Add users to a role," and add you to the "Contributor" role. Most +click "Add users to a role," and add you to the "Contributor" role. Most committers are authorized to do this; if you're a committer and aren't able to load that project admin page, have someone else add you to the necessary role. @@ -39,15 +39,15 @@ Before you create a new bug entry, we recommend you first among existing Arrow issues. When you create a new JIRA entry, please don't forget to fill the "Component" -field. Arrow has many subcomponents and this helps triaging and filtering -tremendously. Also, we conventionally prefix the issue title with the component +field. Arrow has many subcomponents and this helps triaging and filtering +tremendously. Also, we conventionally prefix the issue title with the component name in brackets, such as "[C++] Crash in Array::Frobnicate()", so as to make lists more easy to navigate, and we'd be grateful if you did the same. 
## Did you write a patch that fixes a bug or brings an improvement? -First create a JIRA entry as described above. Then, submit your changes -as a GitHub Pull Request. We'll ask you to prefix the pull request title +First create a JIRA entry as described above. Then, submit your changes +as a GitHub Pull Request. We'll ask you to prefix the pull request title with the JIRA issue number and the component name in brackets.
[arrow-rs] branch master updated: MINOR: update install instruction (#400)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git

The following commit(s) were added to refs/heads/master by this push:
     new db63714  MINOR: update install instruction (#400)

db63714 is described below

commit db6371400ec4dae83e49859a13c8173f8501b1e4
Author: Ádám Lippai
AuthorDate: Sat Jun 5 06:54:32 2021 +0200

    MINOR: update install instruction (#400)

    We have frequent releases and, honoring semver, removed minor and patch version pinning.
---
 parquet/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/parquet/README.md b/parquet/README.md
index 326c966..7f47b56 100644
--- a/parquet/README.md
+++ b/parquet/README.md
@@ -27,7 +27,7 @@ Add this to your Cargo.toml:

 ```toml
 [dependencies]
-parquet = "4.1.0"
+parquet = "^4"
 ```

 and this to your crate root:
[arrow-rs] branch master updated: Fix typo in release script, update release location (#380)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new f41cb17 Fix typo in release script, update release location (#380) f41cb17 is described below commit f41cb17066146552701bb7eb67bc13b2ef9ff1b6 Author: Andrew Lamb AuthorDate: Sun May 30 02:25:18 2021 -0400 Fix typo in release script, update release location (#380) * Fix typo in release script * release to `arrow-rs-{version}` directory --- dev/release/create-tarball.sh | 2 +- dev/release/release-tarball.sh | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/dev/release/create-tarball.sh b/dev/release/create-tarball.sh index ab3e1d2..9fadedf 100755 --- a/dev/release/create-tarball.sh +++ b/dev/release/create-tarball.sh @@ -73,7 +73,7 @@ echo "" echo "-" cat <https://dist.apache.org/repos/dist/release/arrow ${tmp_dir}/release echo "Copy ${version}-rc${rc} to release working copy" -release_version=arrow-${version} +release_version=arrow-rs-${version} mkdir -p ${tmp_dir}/release/${release_version} cp -r ${tmp_dir}/dev/* ${tmp_dir}/release/${release_version}/ svn add ${tmp_dir}/release/${release_version}
[arrow-rs] branch active_release updated: Add crate badges (#362) (#373)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch active_release in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/active_release by this push: new 58d53cf Add crate badges (#362) (#373) 58d53cf is described below commit 58d53cfc8dcf018baf5e15097c3f8a402dc48ea1 Author: Andrew Lamb AuthorDate: Thu May 27 02:20:22 2021 -0400 Add crate badges (#362) (#373) * Add crate badges * Format markdown Co-authored-by: Dominik Moritz --- arrow-flight/README.md | 5 ++--- arrow/README.md| 2 ++ parquet/README.md | 16 3 files changed, 20 insertions(+), 3 deletions(-) diff --git a/arrow-flight/README.md b/arrow-flight/README.md index ba63f65..4205ebb 100644 --- a/arrow-flight/README.md +++ b/arrow-flight/README.md @@ -19,11 +19,10 @@ # Apache Arrow Flight +[![Crates.io](https://img.shields.io/crates/v/arrow-flight.svg)](https://crates.io/crates/arrow-flight) + Apache Arrow Flight is a gRPC based protocol for exchanging Arrow data between processes. See the blog post [Introducing Apache Arrow Flight: A Framework for Fast Data Transport](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for more information. This crate simply provides the Rust implementation of the [Flight.proto](../../format/Flight.proto) gRPC protocol and provides an example that demonstrates how to build a Flight server implemented with Tonic. Note that building a Flight server also requires an implementation of Arrow IPC which is based on the Flatbuffers serialization framework. The Rust implementation of Arrow IPC is not yet complete although the generated Flatbuffers code is available as part of the core Arrow crate. 
- - - diff --git a/arrow/README.md b/arrow/README.md index 674c3fc..e873509 100644 --- a/arrow/README.md +++ b/arrow/README.md @@ -19,6 +19,8 @@ # Native Rust implementation of Apache Arrow +[![Crates.io](https://img.shields.io/crates/v/arrow.svg)](https://crates.io/crates/arrow) + This crate contains a native Rust implementation of the [Arrow columnar format](https://arrow.apache.org/docs/format/Columnar.html). ## Developer's guide diff --git a/parquet/README.md b/parquet/README.md index 836a23b..d032fed 100644 --- a/parquet/README.md +++ b/parquet/README.md @@ -19,19 +19,25 @@ # An Apache Parquet implementation in Rust +[![Crates.io](https://img.shields.io/crates/v/parquet.svg)](https://crates.io/crates/parquet) + ## Usage + Add this to your Cargo.toml: + ```toml [dependencies] parquet = "5.0.0-SNAPSHOT" ``` and this to your crate root: + ```rust extern crate parquet; ``` Example usage of reading data: + ```rust use std::fs::File; use std::path::Path; @@ -44,6 +50,7 @@ while let Some(record) = iter.next() { println!("{}", record); } ``` + See [crate documentation](https://docs.rs/crate/parquet/5.0.0-SNAPSHOT) on available API. ## Upgrading from versions prior to 4.0 @@ -61,12 +68,14 @@ It is preferred that `LogicalType` is used, as it supports nanosecond precision timestamps without using the deprecated `Int96` Parquet type. ## Supported Parquet Version + - Parquet-format 2.6.0 To update Parquet format to a newer version, check if [parquet-format](https://github.com/sunchao/parquet-format-rs) version is available. Then simply update version of `parquet-format` crate in Cargo.toml. ## Features + - [X] All encodings supported - [X] All compression codecs supported - [X] Read support @@ -87,15 +96,18 @@ Parquet requires LLVM. Our windows CI image includes LLVM but to build the libr users will have to install LLVM. Follow [this](https://github.com/appveyor/ci/issues/2651) link for info. ## Build + Run `cargo build` or `cargo build --release` to build in release mode. 
Some features take advantage of SSE4.2 instructions, which can be enabled by adding `RUSTFLAGS="-C target-feature=+sse4.2"` before the `cargo build` command. ## Test + Run `cargo test` for unit tests. To also run tests related to the binaries, use `cargo test --features cli`. ## Binaries + The following binaries are provided (use `cargo install --features cli` to install them): - **parquet-schema** for printing Parquet file schema and metadata. `Usage: parquet-schema `, where `file-path` is the path to a Parquet file. Use `-v/--verbose` flag @@ -111,16 +123,20 @@ be printed). Use `-j/--json` to print records in JSON lines format. files to read. If you see `Library not loaded` error, please make sure `LD_LIBRARY_PATH` is set properly: + ``` export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(rustc --print sysroot)/lib ``` ## Benchmarks + Run `cargo bench` for benchmarks. ## Docs + To build documentation, run `cargo doc --no-deps`. To compile and view in the browser, run `cargo doc --no-deps --open`. ## License + Licensed und
[arrow-rs] branch active_release updated: Only register Flight.proto with cargo if it exists (#351) (#374)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch active_release in repository https://gitbox.apache.org/repos/asf/arrow-rs.git

The following commit(s) were added to refs/heads/active_release by this push:
     new f0702df  Only register Flight.proto with cargo if it exists (#351) (#374)

f0702df is described below

commit f0702df314434a1c79184c019b09d2aa2c39c00f
Author: Andrew Lamb
AuthorDate: Thu May 27 02:19:50 2021 -0400

    Only register Flight.proto with cargo if it exists (#351) (#374)

    Co-authored-by: Raphael Taylor-Davies <1781103+tustv...@users.noreply.github.com>
---
 arrow-flight/build.rs | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arrow-flight/build.rs b/arrow-flight/build.rs
index bc84f37..1cbfceb 100644
--- a/arrow-flight/build.rs
+++ b/arrow-flight/build.rs
@@ -23,9 +23,6 @@ use std::{
 };

 fn main() -> Result<(), Box<dyn Error>> {
-    // avoid rerunning build if the file has not changed
-    println!("cargo:rerun-if-changed=../format/Flight.proto");
-
     // override the build location, in order to check in the changes to proto files
     env::set_var("OUT_DIR", "src");

@@ -33,6 +30,9 @@ fn main() -> Result<(), Box<dyn Error>> {
     // built or released so we build an absolute path to the proto file
     let path = Path::new("../format/Flight.proto");
     if path.exists() {
+        // avoid rerunning build if the file has not changed
+        println!("cargo:rerun-if-changed=../format/Flight.proto");
+
         tonic_build::compile_protos("../format/Flight.proto")?;

         // read file contents to string
         let mut file = OpenOptions::new()
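The fix moves the `cargo:rerun-if-changed` directive inside the `path.exists()` check: the proto file is not shipped in published crates, and registering a missing file makes cargo treat the build script as always dirty, forcing a rerun on every build. A reduced sketch of the decision (illustrative, not the full build script):

```rust
use std::path::Path;

// Emit the rerun-if-changed directive only for files that actually exist;
// pointing cargo at a missing path forces the build script to rerun every build.
fn rerun_directive(path: &Path) -> Option<String> {
    if path.exists() {
        Some(format!("cargo:rerun-if-changed={}", path.display()))
    } else {
        None
    }
}

fn main() {
    let proto = Path::new("../format/Flight.proto");
    if let Some(directive) = rerun_directive(proto) {
        println!("{}", directive);
        // tonic_build::compile_protos(proto) would run here in the real script.
    }
}
```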
[arrow-rs] branch master updated (7753f41 -> f26ffb3)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git. from 7753f41 Only register Flight.proto with cargo if it exists (#351) add f26ffb3 Remove superfluous space (#363) No new revisions were added by this update. Summary of changes: .github/pull_request_template.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
[arrow-rs] branch master updated (4a27a3b -> 7753f41)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git. from 4a27a3b Add crate badges (#362) add 7753f41 Only register Flight.proto with cargo if it exists (#351) No new revisions were added by this update. Summary of changes: arrow-flight/build.rs | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
[arrow-rs] branch master updated: Add crate badges (#362)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new 4a27a3b Add crate badges (#362) 4a27a3b is described below commit 4a27a3b3c797e801d919ac30cd432f27f9a3d28c Author: Dominik Moritz AuthorDate: Wed May 26 13:20:04 2021 -0700 Add crate badges (#362) * Add crate badges * Format markdown --- arrow-flight/README.md | 5 ++--- arrow/README.md| 2 ++ parquet/README.md | 16 3 files changed, 20 insertions(+), 3 deletions(-) diff --git a/arrow-flight/README.md b/arrow-flight/README.md index ba63f65..4205ebb 100644 --- a/arrow-flight/README.md +++ b/arrow-flight/README.md @@ -19,11 +19,10 @@ # Apache Arrow Flight +[![Crates.io](https://img.shields.io/crates/v/arrow-flight.svg)](https://crates.io/crates/arrow-flight) + Apache Arrow Flight is a gRPC based protocol for exchanging Arrow data between processes. See the blog post [Introducing Apache Arrow Flight: A Framework for Fast Data Transport](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for more information. This crate simply provides the Rust implementation of the [Flight.proto](../../format/Flight.proto) gRPC protocol and provides an example that demonstrates how to build a Flight server implemented with Tonic. Note that building a Flight server also requires an implementation of Arrow IPC which is based on the Flatbuffers serialization framework. The Rust implementation of Arrow IPC is not yet complete although the generated Flatbuffers code is available as part of the core Arrow crate. 
- - - diff --git a/arrow/README.md b/arrow/README.md index 7c54da0..f67d582 100644 --- a/arrow/README.md +++ b/arrow/README.md @@ -19,6 +19,8 @@ # Native Rust implementation of Apache Arrow +[![Crates.io](https://img.shields.io/crates/v/arrow.svg)](https://crates.io/crates/arrow) + This crate contains a native Rust implementation of the [Arrow columnar format](https://arrow.apache.org/docs/format/Columnar.html). ## Developer's guide diff --git a/parquet/README.md b/parquet/README.md index 836a23b..d032fed 100644 --- a/parquet/README.md +++ b/parquet/README.md @@ -19,19 +19,25 @@ # An Apache Parquet implementation in Rust +[![Crates.io](https://img.shields.io/crates/v/parquet.svg)](https://crates.io/crates/parquet) + ## Usage + Add this to your Cargo.toml: + ```toml [dependencies] parquet = "5.0.0-SNAPSHOT" ``` and this to your crate root: + ```rust extern crate parquet; ``` Example usage of reading data: + ```rust use std::fs::File; use std::path::Path; @@ -44,6 +50,7 @@ while let Some(record) = iter.next() { println!("{}", record); } ``` + See [crate documentation](https://docs.rs/crate/parquet/5.0.0-SNAPSHOT) on available API. ## Upgrading from versions prior to 4.0 @@ -61,12 +68,14 @@ It is preferred that `LogicalType` is used, as it supports nanosecond precision timestamps without using the deprecated `Int96` Parquet type. ## Supported Parquet Version + - Parquet-format 2.6.0 To update Parquet format to a newer version, check if [parquet-format](https://github.com/sunchao/parquet-format-rs) version is available. Then simply update version of `parquet-format` crate in Cargo.toml. ## Features + - [X] All encodings supported - [X] All compression codecs supported - [X] Read support @@ -87,15 +96,18 @@ Parquet requires LLVM. Our windows CI image includes LLVM but to build the libr users will have to install LLVM. Follow [this](https://github.com/appveyor/ci/issues/2651) link for info. ## Build + Run `cargo build` or `cargo build --release` to build in release mode. 
Some features take advantage of SSE4.2 instructions, which can be enabled by adding `RUSTFLAGS="-C target-feature=+sse4.2"` before the `cargo build` command. ## Test + Run `cargo test` for unit tests. To also run tests related to the binaries, use `cargo test --features cli`. ## Binaries + The following binaries are provided (use `cargo install --features cli` to install them): - **parquet-schema** for printing Parquet file schema and metadata. `Usage: parquet-schema `, where `file-path` is the path to a Parquet file. Use `-v/--verbose` flag @@ -111,16 +123,20 @@ be printed). Use `-j/--json` to print records in JSON lines format. files to read. If you see `Library not loaded` error, please make sure `LD_LIBRARY_PATH` is set properly: + ``` export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(rustc --print sysroot)/lib ``` ## Benchmarks + Run `cargo bench` for benchmarks. ## Docs + To build documentation, run `cargo doc --no-deps`. To compile and view in the browser, run `cargo doc --no-deps --open`. ## License + Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0.
[arrow-rs] branch master updated: Version upgrades (#304)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new a959c85 Version upgrades (#304) a959c85 is described below commit a959c85f8e567e7f117445f78a7c524e57edfaf4 Author: Daniël Heres AuthorDate: Mon May 17 08:09:38 2021 +0200 Version upgrades (#304) --- arrow/Cargo.toml | 2 +- parquet/Cargo.toml | 8 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/arrow/Cargo.toml b/arrow/Cargo.toml index d66ac25..6d532ce 100644 --- a/arrow/Cargo.toml +++ b/arrow/Cargo.toml @@ -42,7 +42,7 @@ serde_json = { version = "1.0", features = ["preserve_order"] } indexmap = "1.6" rand = "0.7" csv = "1.1" -num = "0.3" +num = "0.4" regex = "1.3" lazy_static = "1.4" packed_simd = { version = "0.3.4", optional = true, package = "packed_simd_2" } diff --git a/parquet/Cargo.toml b/parquet/Cargo.toml index fc221b0..1e54047 100644 --- a/parquet/Cargo.toml +++ b/parquet/Cargo.toml @@ -38,11 +38,11 @@ snap = { version = "1.0", optional = true } brotli = { version = "3.3", optional = true } flate2 = { version = "1.0", optional = true } lz4 = { version = "1.23", optional = true } -zstd = { version = "0.7", optional = true } +zstd = { version = "0.8", optional = true } chrono = "0.4" -num-bigint = "0.3" +num-bigint = "0.4" arrow = { path = "../arrow", version = "5.0.0-SNAPSHOT", optional = true } -base64 = { version = "0.12", optional = true } +base64 = { version = "0.13", optional = true } clap = { version = "2.33.3", optional = true } serde_json = { version = "1.0", features = ["preserve_order"], optional = true } @@ -53,7 +53,7 @@ snap = "1.0" brotli = "3.3" flate2 = "1.0" lz4 = "1.23" -zstd = "0.7" +zstd = "0.8" arrow = { path = "../arrow", version = "5.0.0-SNAPSHOT" } serde_json = { version = "1.0", features = ["preserve_order"] }
[arrow-rs] branch master updated: Fix subtraction underflow when sorting string arrays with many nulls (#285)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new ce8e67c Fix subtraction underflow when sorting string arrays with many nulls (#285) ce8e67c is described below commit ce8e67c28ad1431cda36b38434e53871c2dd520a Author: Michael Edwards AuthorDate: Thu May 13 13:28:46 2021 +0200 Fix subtraction underflow when sorting string arrays with many nulls (#285) --- arrow/src/compute/kernels/sort.rs | 285 -- 1 file changed, 274 insertions(+), 11 deletions(-) diff --git a/arrow/src/compute/kernels/sort.rs b/arrow/src/compute/kernels/sort.rs index 9287425..7cd463d 100644 --- a/arrow/src/compute/kernels/sort.rs +++ b/arrow/src/compute/kernels/sort.rs @@ -410,24 +410,27 @@ fn sort_boolean( len = limit.min(len); } if !descending { -sort_by(&mut valids, len - nulls_len, |a, b| cmp(a.1, b.1)); +sort_by(&mut valids, len.saturating_sub(nulls_len), |a, b| { +cmp(a.1, b.1) +}); } else { -sort_by(&mut valids, len - nulls_len, |a, b| cmp(a.1, b.1).reverse()); +sort_by(&mut valids, len.saturating_sub(nulls_len), |a, b| { +cmp(a.1, b.1).reverse() +}); // reverse to keep a stable ordering nulls.reverse(); } // collect results directly into a buffer instead of a vec to avoid another aligned allocation -let mut result = MutableBuffer::new(values.len() * std::mem::size_of::<u32>()); +let result_capacity = len * std::mem::size_of::<u32>(); +let mut result = MutableBuffer::new(result_capacity); // sets len to capacity so we can access the whole buffer as a typed slice -result.resize(values.len() * std::mem::size_of::<u32>(), 0); +result.resize(result_capacity, 0); let result_slice: &mut [u32] = result.typed_data_mut(); -debug_assert_eq!(result_slice.len(), nulls_len + valids_len); - if options.nulls_first { let size = nulls_len.min(len); -result_slice[0..nulls_len.min(len)].copy_from_slice(&nulls); +result_slice[0..size].copy_from_slice(&nulls[0..size]); 
if nulls_len < len { insert_valid_values(result_slice, nulls_len, [0..len - size]); } @@ -626,9 +629,13 @@ where len = limit.min(len); } if !descending { -sort_by( valids, len - nulls_len, |a, b| cmp(a.1, b.1)); +sort_by( valids, len.saturating_sub(nulls_len), |a, b| { +cmp(a.1, b.1) +}); } else { -sort_by( valids, len - nulls_len, |a, b| cmp(a.1, b.1).reverse()); +sort_by( valids, len.saturating_sub(nulls_len), |a, b| { +cmp(a.1, b.1).reverse() +}); // reverse to keep a stable ordering nulls.reverse(); } @@ -689,11 +696,11 @@ where len = limit.min(len); } if !descending { -sort_by( valids, len - nulls_len, |a, b| { +sort_by( valids, len.saturating_sub(nulls_len), |a, b| { cmp_array(a.1.as_ref(), b.1.as_ref()) }); } else { -sort_by( valids, len - nulls_len, |a, b| { +sort_by( valids, len.saturating_sub(nulls_len), |a, b| { cmp_array(a.1.as_ref(), b.1.as_ref()).reverse() }); // reverse to keep a stable ordering @@ -1285,6 +1292,48 @@ mod tests { None, vec![5, 0, 2, 1, 4, 3], ); + +// valid values less than limit with extra nulls +test_sort_to_indices_primitive_arrays::( +vec![Some(2.0), None, None, Some(1.0)], +Some(SortOptions { +descending: false, +nulls_first: false, +}), +Some(3), +vec![3, 0, 1], +); + +test_sort_to_indices_primitive_arrays::( +vec![Some(2.0), None, None, Some(1.0)], +Some(SortOptions { +descending: false, +nulls_first: true, +}), +Some(3), +vec![1, 2, 3], +); + +// more nulls than limit +test_sort_to_indices_primitive_arrays::( +vec![Some(1.0), None, None, None], +Some(SortOptions { +descending: false, +nulls_first: true, +}), +Some(2), +vec![1, 2], +); + +test_sort_to_indices_primitive_arrays::( +vec![Some(1.0), None, None, None], +Some(SortOptions { +descending: false, +nulls_first: false, +}), +Some(2), +vec![0, 1], +); } #[test] @@ -1329,6 +1378,48 @@ mod tests { Some(3), vec![5, 0, 2], ); + +// valid values less than limit with extra nulls +test_sort_to_indices_boolean_arr
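The core of the fix is replacing `len - nulls_len` with `len.saturating_sub(nulls_len)`: when the requested limit is smaller than the null count, the clamped `len` can be less than `nulls_len`, and the plain subtraction underflows (panicking in debug builds). A minimal sketch of just that arithmetic (the function name is illustrative):

```rust
// Number of valid (non-null) slots to sort, given the clamped output
// length and the null count. With `len - nulls_len` this underflows
// whenever there are more nulls than the requested limit; saturating
// subtraction clamps the result to zero instead.
fn valids_to_sort(len: usize, nulls_len: usize) -> usize {
    len.saturating_sub(nulls_len)
}

fn main() {
    // limit = 2 but 3 nulls: plain subtraction would underflow here
    assert_eq!(valids_to_sort(2, 3), 0);
    assert_eq!(valids_to_sort(5, 2), 3);
}
```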
[arrow-rs] branch master updated: Fix null struct and list roundtrip (#270)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new 8226219 Fix null struct and list roundtrip (#270) 8226219 is described below commit 8226219fe7104f6c8a2740806f96f02c960d991c Author: Wakahisa AuthorDate: Tue May 11 07:42:41 2021 +0200 Fix null struct and list roundtrip (#270) * fix null struct and list inconsistencies in writer * fix list reader null and empty slot calculation * remove stray TODOs --- parquet/src/arrow/array_reader.rs | 95 - parquet/src/arrow/arrow_writer.rs | 54 ++--- parquet/src/arrow/levels.rs | 430 +- 3 files changed, 265 insertions(+), 314 deletions(-) diff --git a/parquet/src/arrow/array_reader.rs b/parquet/src/arrow/array_reader.rs index f209b8b..f54e446 100644 --- a/parquet/src/arrow/array_reader.rs +++ b/parquet/src/arrow/array_reader.rs @@ -615,6 +615,8 @@ pub struct ListArrayReader { item_type: ArrowType, list_def_level: i16, list_rep_level: i16, +list_empty_def_level: i16, +list_null_def_level: i16, def_level_buffer: Option, rep_level_buffer: Option, _marker: PhantomData, @@ -628,6 +630,8 @@ impl ListArrayReader { item_type: ArrowType, def_level: i16, rep_level: i16, +list_null_def_level: i16, +list_empty_def_level: i16, ) -> Self { Self { item_reader, @@ -635,6 +639,8 @@ impl ListArrayReader { item_type, list_def_level: def_level, list_rep_level: rep_level, +list_null_def_level, +list_empty_def_level, def_level_buffer: None, rep_level_buffer: None, _marker: PhantomData, @@ -843,61 +849,49 @@ impl ArrayReader for ListArrayReader { // Where n is the max definition level of the list's parent. // If a Parquet schema's only leaf is the list, then n = 0. 
-// TODO: ARROW-10391 - add a test case with a non-nullable child, check if max is 3 -let list_field_type = match self.get_data_type() { -ArrowType::List(field) -| ArrowType::FixedSizeList(field, _) -| ArrowType::LargeList(field) => field, -_ => { -// Panic: this is safe as we only write lists from list datatypes -unreachable!() -} -}; -let max_list_def_range = if list_field_type.is_nullable() { 3 } else { 2 }; -let max_list_definition = *(def_levels.iter().max().unwrap()); -// TODO: ARROW-10391 - Find a reliable way of validating deeply-nested lists -// debug_assert!( -// max_list_definition >= max_list_def_range, -// "Lift definition max less than range" -// ); -let list_null_def = max_list_definition - max_list_def_range; -let list_empty_def = max_list_definition - 1; -let mut null_list_indices: Vec = Vec::new(); -for i in 0..def_levels.len() { -if def_levels[i] == list_null_def { -null_list_indices.push(i); -} -} +// If the list index is at empty definition, the child slot is null +let null_list_indices: Vec = def_levels +.iter() +.enumerate() +.filter_map(|(index, def)| { +if *def <= self.list_empty_def_level { +Some(index) +} else { +None +} +}) +.collect(); let batch_values = match null_list_indices.len() { 0 => next_batch_array.clone(), _ => remove_indices(next_batch_array.clone(), item_type, null_list_indices)?, }; -// null list has def_level = 0 -// empty list has def_level = 1 -// null item in a list has def_level = 2 -// non-null item has def_level = 3 // first item in each list has rep_level = 0, subsequent items have rep_level = 1 - let mut offsets: Vec = Vec::new(); let mut cur_offset = OffsetSize::zero(); -for i in 0..rep_levels.len() { -if rep_levels[i] == 0 { -offsets.push(cur_offset) +def_levels.iter().zip(rep_levels).for_each(|(d, r)| { +if *r == 0 || d == _empty_def_level { +offsets.push(cur_offset); } -if def_levels[i] >= list_empty_def { +if d > _empty_def_level { cur_offset += OffsetSize::one(); } -} +}); offsets.push(cur_offset); let 
num_bytes = bit_util::ceil(offsets.len(), 8); -let mut null_buf = MutableBuffer::new(num_bytes).with_bitse
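The offset rebuild in the new code can be illustrated in isolation. Using the level convention from the commit's comment (null list = 0, empty list = 1, null item = 2, non-null item = 3), a repetition level of 0 starts a new list, and only definition levels above the empty-list level contribute an item. A simplified, self-contained sketch (a standalone function, not the actual `ListArrayReader` method):

```rust
// Simplified reconstruction of list offsets from Parquet definition and
// repetition levels, following the commit's convention:
// null list = 0, empty list = 1, null item = 2, non-null item = 3.
fn build_offsets(def_levels: &[i16], rep_levels: &[i16], list_empty_def_level: i16) -> Vec<i32> {
    let mut offsets = Vec::new();
    let mut cur_offset = 0i32;
    for (&d, &r) in def_levels.iter().zip(rep_levels.iter()) {
        // rep level 0 starts a new list; an empty-list slot also marks
        // a list boundary without contributing an item
        if r == 0 || d == list_empty_def_level {
            offsets.push(cur_offset);
        }
        // only definition levels above the empty-list level carry an item
        if d > list_empty_def_level {
            cur_offset += 1;
        }
    }
    offsets.push(cur_offset);
    offsets
}

fn main() {
    // [[a, b], [], [c]] -> offsets [0, 2, 2, 3]
    assert_eq!(build_offsets(&[3, 3, 1, 3], &[0, 1, 0, 0], 1), vec![0, 2, 2, 3]);
}
```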
[arrow-rs] branch master updated: Speed up bound checking in `take` (#281)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new 510f02f Speed up bound checking in `take` (#281) 510f02f is described below commit 510f02f449193bea9df3f423d18ce7a9e4112bdf Author: Daniël Heres AuthorDate: Tue May 11 07:35:05 2021 +0200 Speed up bound checking in `take` (#281) * WIP improve take performance * WIP * Bound checking speed * Simplify * fmt * Improve formatting --- arrow/benches/take_kernels.rs | 19 ++- arrow/src/compute/kernels/take.rs | 25 +++-- 2 files changed, 37 insertions(+), 7 deletions(-) diff --git a/arrow/benches/take_kernels.rs b/arrow/benches/take_kernels.rs index 2853eb5..b1d03d7 100644 --- a/arrow/benches/take_kernels.rs +++ b/arrow/benches/take_kernels.rs @@ -23,7 +23,7 @@ use rand::Rng; extern crate arrow; -use arrow::compute::take; +use arrow::compute::{take, TakeOptions}; use arrow::datatypes::*; use arrow::util::test_util::seedable_rng; use arrow::{array::*, util::bench_util::*}; @@ -46,6 +46,12 @@ fn bench_take(values: Array, indices: ) { criterion::black_box(take(values, , None).unwrap()); } +fn bench_take_bounds_check(values: Array, indices: ) { +criterion::black_box( +take(values, , Some(TakeOptions { check_bounds: true })).unwrap(), +); +} + fn add_benchmark(c: Criterion) { let values = create_primitive_array::(512, 0.0); let indices = create_random_index(512, 0.0); @@ -56,6 +62,17 @@ fn add_benchmark(c: Criterion) { b.iter(|| bench_take(, )) }); +let values = create_primitive_array::(512, 0.0); +let indices = create_random_index(512, 0.0); +c.bench_function("take check bounds i32 512", |b| { +b.iter(|| bench_take_bounds_check(, )) +}); +let values = create_primitive_array::(1024, 0.0); +let indices = create_random_index(1024, 0.0); +c.bench_function("take check bounds i32 1024", |b| { +b.iter(|| bench_take_bounds_check(, )) +}); + let 
indices = create_random_index(512, 0.5); c.bench_function("take i32 nulls 512", |b| { b.iter(|| bench_take(, )) diff --git a/arrow/src/compute/kernels/take.rs b/arrow/src/compute/kernels/take.rs index 0217573..d325ce4 100644 --- a/arrow/src/compute/kernels/take.rs +++ b/arrow/src/compute/kernels/take.rs @@ -100,17 +100,30 @@ where let options = options.unwrap_or_default(); if options.check_bounds { let len = values.len(); -for i in 0..indices.len() { -if indices.is_valid(i) { -let ix = ToPrimitive::to_usize((i)).ok_or_else(|| { +if indices.null_count() > 0 { +indices.iter().flatten().try_for_each(|index| { +let ix = ToPrimitive::to_usize().ok_or_else(|| { ArrowError::ComputeError("Cast to usize failed".to_string()) })?; if ix >= len { return Err(ArrowError::ComputeError( -format!("Array index out of bounds, cannot get item at index {} from {} entries", ix, len)) -); +format!("Array index out of bounds, cannot get item at index {} from {} entries", ix, len)) +); } -} +Ok(()) +})?; +} else { +indices.values().iter().try_for_each(|index| { +let ix = ToPrimitive::to_usize(index).ok_or_else(|| { +ArrowError::ComputeError("Cast to usize failed".to_string()) +})?; +if ix >= len { +return Err(ArrowError::ComputeError( +format!("Array index out of bounds, cannot get item at index {} from {} entries", ix, len)) +); +} +Ok(()) +})? } } match values.data_type() {
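The speedup comes from splitting the bounds check on the null count: when the index array contains no nulls, the raw index values can be scanned directly, skipping the per-element validity test. A plain-slice sketch of the two paths (the real code operates on Arrow arrays; names here are illustrative):

```rust
// Bounds check split by null count, mirroring the commit's structure:
// the no-null fast path iterates raw index values and skips validity
// checks entirely.
fn check_bounds(indices: &[u32], validity: Option<&[bool]>, len: usize) -> Result<(), String> {
    let oob = |ix: u32| {
        format!(
            "Array index out of bounds, cannot get item at index {} from {} entries",
            ix, len
        )
    };
    match validity {
        // slow path: skip null slots, check only valid indices
        Some(mask) => indices.iter().zip(mask.iter()).try_for_each(|(&ix, &valid)| {
            if valid && ix as usize >= len { Err(oob(ix)) } else { Ok(()) }
        }),
        // fast path: no nulls, check every raw value
        None => indices.iter().try_for_each(|&ix| {
            if ix as usize >= len { Err(oob(ix)) } else { Ok(()) }
        }),
    }
}

fn main() {
    assert!(check_bounds(&[0, 1, 2], None, 3).is_ok());
    assert!(check_bounds(&[3], None, 3).is_err());
    // an out-of-range value in a null slot is ignored
    assert!(check_bounds(&[99], Some(&[false]), 3).is_ok());
}
```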
[arrow-rs] branch master updated: support full u32 and u64 roundtrip through parquet (#258)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new 2f5f58a support full u32 and u64 roundtrip through parquet (#258) 2f5f58a is described below commit 2f5f58a2087be67b0109f0c8843c216a4fd1 Author: Marco Neumann AuthorDate: Mon May 10 18:44:58 2021 +0200 support full u32 and u64 roundtrip through parquet (#258) * re-export arity kernels in `arrow::compute` Seems logical since all other kernels are re-exported as well under this flat hierarchy. * return file from `parquet::arrow::arrow_writer::tests::[one_column]_roundtrip` * support full arrow u64 through parquet - updates arrow to parquet type mapping to use reinterpret/overflow cast for u64<->i64 similar to what the C++ stack does - changes statistics calculation to account for the fact that u64 should be compared unsigned (as per spec) Fixes #254. * avoid copying array when reading u64 from parquet * support full arrow u32 through parquet This is idential to the solution we now have for u64. 
--- arrow/src/compute/mod.rs | 1 + parquet/src/arrow/array_reader.rs | 30 ++-- parquet/src/arrow/arrow_writer.rs | 141 +- parquet/src/column/writer.rs | 59 4 files changed, 193 insertions(+), 38 deletions(-) diff --git a/arrow/src/compute/mod.rs b/arrow/src/compute/mod.rs index be1aa27..166f156 100644 --- a/arrow/src/compute/mod.rs +++ b/arrow/src/compute/mod.rs @@ -23,6 +23,7 @@ mod util; pub use self::kernels::aggregate::*; pub use self::kernels::arithmetic::*; +pub use self::kernels::arity::*; pub use self::kernels::boolean::*; pub use self::kernels::cast::*; pub use self::kernels::comparison::*; diff --git a/parquet/src/arrow/array_reader.rs b/parquet/src/arrow/array_reader.rs index d125cf6..f209b8b 100644 --- a/parquet/src/arrow/array_reader.rs +++ b/parquet/src/arrow/array_reader.rs @@ -268,10 +268,29 @@ impl ArrayReader for PrimitiveArrayReader { } } +let target_type = self.get_data_type().clone(); let arrow_data_type = match T::get_physical_type() { PhysicalType::BOOLEAN => ArrowBooleanType::DATA_TYPE, -PhysicalType::INT32 => ArrowInt32Type::DATA_TYPE, -PhysicalType::INT64 => ArrowInt64Type::DATA_TYPE, +PhysicalType::INT32 => { +match target_type { +ArrowType::UInt32 => { +// follow C++ implementation and use overflow/reinterpret cast from i32 to u32 which will map +// `i32::MIN..0` to `(i32::MAX as u32)..u32::MAX` +ArrowUInt32Type::DATA_TYPE +} +_ => ArrowInt32Type::DATA_TYPE, +} +} +PhysicalType::INT64 => { +match target_type { +ArrowType::UInt64 => { +// follow C++ implementation and use overflow/reinterpret cast from i64 to u64 which will map +// `i64::MIN..0` to `(i64::MAX as u64)..u64::MAX` +ArrowUInt64Type::DATA_TYPE +} +_ => ArrowInt64Type::DATA_TYPE, +} +} PhysicalType::FLOAT => ArrowFloat32Type::DATA_TYPE, PhysicalType::DOUBLE => ArrowFloat64Type::DATA_TYPE, PhysicalType::INT96 @@ -343,15 +362,14 @@ impl ArrayReader for PrimitiveArrayReader { // are datatypes which we must convert explicitly. 
// These are: // - date64: we should cast int32 to date32, then date32 to date64. -let target_type = self.get_data_type(); let array = match target_type { ArrowType::Date64 => { // this is cheap as it internally reinterprets the data let a = arrow::compute::cast(, ::Date32)?; -arrow::compute::cast(, target_type)? +arrow::compute::cast(, _type)? } ArrowType::Decimal(p, s) => { -let mut builder = DecimalBuilder::new(array.len(), *p, *s); +let mut builder = DecimalBuilder::new(array.len(), p, s); match array.data_type() { ArrowType::Int32 => { let values = array.as_any().downcast_ref::().unwrap(); @@ -380,7 +398,7 @@ impl ArrayReader for PrimitiveArrayReader { } Arc::new(builder.finish()) as ArrayRef } -_ => arrow::compute::cast(, target_type)?, +_ => arrow::compute::cast(, _type)?, }; // save
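Parquet has no unsigned 64-bit physical type, so the commit follows the C++ implementation and reinterprets the stored INT64 bit pattern as `u64` on read. A one-line sketch of that mapping (negative stored values land in the upper half of the `u64` range):

```rust
// Bit-for-bit reinterpret cast from the stored i64 to u64: in Rust,
// `as` between same-width integers preserves the two's-complement bit
// pattern, so i64::MIN..0 maps onto (i64::MAX as u64) + 1 ..= u64::MAX.
fn reinterpret_i64_as_u64(v: i64) -> u64 {
    v as u64
}

fn main() {
    assert_eq!(reinterpret_i64_as_u64(-1), u64::MAX);
    assert_eq!(reinterpret_i64_as_u64(i64::MIN), (i64::MAX as u64) + 1);
    // non-negative values round-trip unchanged
    assert_eq!(reinterpret_i64_as_u64(42), 42);
}
```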
[arrow-rs] branch nevi-me-patch-1 created (now 4e61130)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a change to branch nevi-me-patch-1 in repository https://gitbox.apache.org/repos/asf/arrow-rs.git. at 4e61130 Update PR template by commenting out instructions This branch includes the following new commits: new 4e61130 Update PR template by commenting out instructions The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.
[arrow-rs] 01/01: Update PR template by commenting out instructions
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch nevi-me-patch-1 in repository https://gitbox.apache.org/repos/asf/arrow-rs.git commit 4e6113026a186aff92ff304af5faffceefa1cdd4 Author: Wakahisa AuthorDate: Mon May 10 18:35:27 2021 +0200 Update PR template by commenting out instructions Some contributors don't remove the guidelines when creating PRs, so it might be more convenient if we hide them behind comments. The comments are still visible when editing, but are not displayed when the markdown is rendered --- .github/pull_request_template.md | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md index 5da0d08..95403e1 100644 --- a/.github/pull_request_template.md +++ b/.github/pull_request_template.md @@ -1,19 +1,31 @@ # Which issue does this PR close? + Closes #. # Rationale for this change + + # What changes are included in this PR? + # Are there any user-facing changes? + + +
[arrow-rs] branch master updated: Fix typo in csv/reader.rs (#265)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new a870b24 Fix typo in csv/reader.rs (#265) a870b24 is described below commit a870b24bd4eb76d3e0e5c718c9956a7dcdee52fd Author: Dominik Moritz AuthorDate: Thu May 6 22:36:56 2021 -0700 Fix typo in csv/reader.rs (#265) --- arrow/src/csv/reader.rs | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arrow/src/csv/reader.rs b/arrow/src/csv/reader.rs index 9fafc38..00f1d7f 100644 --- a/arrow/src/csv/reader.rs +++ b/arrow/src/csv/reader.rs @@ -353,7 +353,7 @@ impl Reader { } // Initialize batch_records with StringRecords so they -// can be reused accross batches +// can be reused across batches let mut batch_records = Vec::with_capacity(batch_size); batch_records.resize_with(batch_size, Default::default);
[arrow-rs] branch master updated: Fix empty Schema::metadata deserialization error (#260)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new 64ea8da Fix empty Schema::metadata deserialization error (#260) 64ea8da is described below commit 64ea8dae64b05a1a4ffcde739b02411219653dc2 Author: hulunbier AuthorDate: Fri May 7 13:32:32 2021 +0800 Fix empty Schema::metadata deserialization error (#260) * Fix empty Schema::metadata deserialization error Hope this fixes issue #241 * Rename UT name to `test_ser_de_metadata` Co-authored-by: hulunbier --- arrow/src/datatypes/schema.rs | 33 + 1 file changed, 33 insertions(+) diff --git a/arrow/src/datatypes/schema.rs b/arrow/src/datatypes/schema.rs index ad89b29..cfc0744 100644 --- a/arrow/src/datatypes/schema.rs +++ b/arrow/src/datatypes/schema.rs @@ -35,6 +35,7 @@ pub struct Schema { pub(crate) fields: Vec, /// A map of key-value pairs containing additional meta data. #[serde(skip_serializing_if = "HashMap::is_empty")] +#[serde(default)] pub(crate) metadata: HashMap, } @@ -335,3 +336,35 @@ struct MetadataKeyValue { key: String, value: String, } + +#[cfg(test)] +mod tests { +use crate::datatypes::DataType; + +use super::*; + +#[test] +fn test_ser_de_metadata() { +// ser/de with empty metadata +let mut schema = Schema::new(vec![ +Field::new("name", DataType::Utf8, false), +Field::new("address", DataType::Utf8, false), +Field::new("priority", DataType::UInt8, false), +]); + +let json = serde_json::to_string().unwrap(); +let de_schema = serde_json::from_str().unwrap(); + +assert_eq!(schema, de_schema); + +// ser/de with non-empty metadata +schema.metadata = [("key".to_owned(), "val".to_owned())] +.iter() +.cloned() +.collect(); +let json = serde_json::to_string().unwrap(); +let de_schema = serde_json::from_str().unwrap(); + +assert_eq!(schema, de_schema); +} +}
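The fix pairs the existing `#[serde(skip_serializing_if = "HashMap::is_empty")]` with `#[serde(default)]`: the first attribute omits the `metadata` key entirely when the map is empty, so deserialization must treat a missing key as an empty map rather than an error. A dependency-free sketch of that fallback behavior (the actual fix is just the added attribute):

```rust
use std::collections::HashMap;

// Stand-in for what #[serde(default)] gives the metadata field: an
// absent entry in the parsed input falls back to the type's Default
// (an empty map) instead of failing deserialization.
fn metadata_or_default(
    parsed: Option<HashMap<String, String>>,
) -> HashMap<String, String> {
    parsed.unwrap_or_default()
}

fn main() {
    // key omitted on the wire because the map was empty when serialized
    assert!(metadata_or_default(None).is_empty());
}
```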
[arrow-rs] branch master updated: Added env to run rust in integration. (#253)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new 508f25c Added env to run rust in integration. (#253) 508f25c is described below commit 508f25c10032857da34ea88cc8166f0741616a32 Author: Jorge Leitao AuthorDate: Wed May 5 06:47:26 2021 +0200 Added env to run rust in integration. (#253) --- .github/workflows/integration.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/integration.yml b/.github/workflows/integration.yml index 8dd2bd8..115bfad 100644 --- a/.github/workflows/integration.yml +++ b/.github/workflows/integration.yml @@ -48,4 +48,4 @@ jobs: - name: Setup Archery run: pip install -e dev/archery[docker] - name: Execute Docker Build -run: archery docker run conda-integration +run: archery docker run -e ARCHERY_INTEGRATION_WITH_RUST=1 conda-integration
[arrow-rs] branch master updated: fix NaN handling in parquet statistics (#256)
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/arrow-rs.git The following commit(s) were added to refs/heads/master by this push: new 04779e0 fix NaN handling in parquet statistics (#256) 04779e0 is described below commit 04779e0b57efa2f88c75abc080cd5feb70737484 Author: Marco Neumann AuthorDate: Wed May 5 06:46:24 2021 +0200 fix NaN handling in parquet statistics (#256) Closes #255. --- parquet/src/column/writer.rs | 91 +--- 1 file changed, 86 insertions(+), 5 deletions(-) diff --git a/parquet/src/column/writer.rs b/parquet/src/column/writer.rs index 0b56594..64e4880 100644 --- a/parquet/src/column/writer.rs +++ b/parquet/src/column/writer.rs @@ -921,12 +921,16 @@ impl ColumnWriterImpl { } } +#[allow(clippy::eq_op)] fn update_page_min_max( self, val: ::T) { -if self.min_page_value.as_ref().map_or(true, |min| min > val) { -self.min_page_value = Some(val.clone()); -} -if self.max_page_value.as_ref().map_or(true, |max| max < val) { -self.max_page_value = Some(val.clone()); +// simple "isNaN" check that works for all types +if val == val { +if self.min_page_value.as_ref().map_or(true, |min| min > val) { +self.min_page_value = Some(val.clone()); +} +if self.max_page_value.as_ref().map_or(true, |max| max < val) { +self.max_page_value = Some(val.clone()); +} } } @@ -1652,6 +1656,68 @@ mod tests { ); } +#[test] +fn test_float_statistics_nan_middle() { +let stats = statistics_roundtrip::(&[1.0, f32::NAN, 2.0]); +assert!(stats.has_min_max_set()); +if let Statistics::Float(stats) = stats { +assert_eq!(stats.min(), &1.0); +assert_eq!(stats.max(), &2.0); +} else { +panic!("expecting Statistics::Float"); +} +} + +#[test] +fn test_float_statistics_nan_start() { +let stats = statistics_roundtrip::(&[f32::NAN, 1.0, 2.0]); +assert!(stats.has_min_max_set()); +if let Statistics::Float(stats) = stats { +assert_eq!(stats.min(), &1.0); +assert_eq!(stats.max(), &2.0); +} else { 
+panic!("expecting Statistics::Float"); +} +} + +#[test] +fn test_float_statistics_nan_only() { +let stats = statistics_roundtrip::(&[f32::NAN, f32::NAN]); +assert!(!stats.has_min_max_set()); +assert!(matches!(stats, Statistics::Float(_))); +} + +#[test] +fn test_double_statistics_nan_middle() { +let stats = statistics_roundtrip::(&[1.0, f64::NAN, 2.0]); +assert!(stats.has_min_max_set()); +if let Statistics::Double(stats) = stats { +assert_eq!(stats.min(), &1.0); +assert_eq!(stats.max(), &2.0); +} else { +panic!("expecting Statistics::Float"); +} +} + +#[test] +fn test_double_statistics_nan_start() { +let stats = statistics_roundtrip::(&[f64::NAN, 1.0, 2.0]); +assert!(stats.has_min_max_set()); +if let Statistics::Double(stats) = stats { +assert_eq!(stats.min(), &1.0); +assert_eq!(stats.max(), &2.0); +} else { +panic!("expecting Statistics::Float"); +} +} + +#[test] +fn test_double_statistics_nan_only() { +let stats = statistics_roundtrip::(&[f64::NAN, f64::NAN]); +assert!(!stats.has_min_max_set()); +assert!(matches!(stats, Statistics::Double(_))); +} + /// Performs write-read roundtrip with randomly generated values and levels. /// `max_size` is maximum number of values or levels (if `max_def_level` > 0) to write /// for a column. @@ -1905,4 +1971,19 @@ mod tests { Ok(()) } } + +/// Write data into parquet using [`get_test_page_writer`] and [`get_test_column_writer`] and returns generated statistics. +fn statistics_roundtrip(values: &[::T]) -> Statistics { +let page_writer = get_test_page_writer(); +let props = Arc::new(WriterProperties::builder().build()); +let mut writer = get_test_column_writer::(page_writer, 0, 0, props); +writer.write_batch(values, None, None).unwrap(); + +let (_bytes_written, _rows_written, metadata) = writer.close().unwrap(); +if let Some(stats) = metadata.statistics() { +stats.clone() +} else { +panic!("metadata missing statistics"); +} +} }
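The `val == val` comparison is a type-generic NaN test: IEEE 754 defines NaN as unequal to itself, so the check holds for every non-NaN value and works across all column types without a dedicated `is_nan` method (hence the `#[allow(clippy::eq_op)]`). A concrete `f32` sketch of the guarded min/max update:

```rust
// Min/max update that skips NaN, mirroring the commit: `v == v` is
// false only for NaN, so NaN values never enter the page statistics.
#[allow(clippy::eq_op)]
fn update_min_max(min: &mut Option<f32>, max: &mut Option<f32>, v: f32) {
    if v == v {
        if min.map_or(true, |m| m > v) {
            *min = Some(v);
        }
        if max.map_or(true, |m| m < v) {
            *max = Some(v);
        }
    }
}

fn main() {
    let (mut min, mut max) = (None, None);
    for &v in &[1.0_f32, f32::NAN, 2.0] {
        update_min_max(&mut min, &mut max, v);
    }
    assert_eq!(min, Some(1.0));
    assert_eq!(max, Some(2.0));

    // all-NaN input leaves min/max unset, matching the tests above
    let (mut min, mut max) = (None, None);
    update_min_max(&mut min, &mut max, f32::NAN);
    assert_eq!(min, None);
    assert_eq!(max, None);
}
```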
[arrow-datafusion] branch master updated: Revert "Add datafusion-python (#69)" (#257)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git

The following commit(s) were added to refs/heads/master by this push:
     new d0af907  Revert "Add datafusion-python (#69)" (#257)
d0af907 is described below

commit d0af907652aa8773d1de21dfd2f15bbcf6f50ce3
Author: Andy Grove
AuthorDate: Tue May 4 08:51:44 2021 -0600

    Revert "Add datafusion-python (#69)" (#257)

    This reverts commit 46bde0bd148aacf1677a575cb9ddbc154b6c4fb3.
---
 .github/workflows/python_build.yml |  89 ---
 .github/workflows/python_test.yaml |  58
 Cargo.toml                         |   4 +-
 dev/release/rat_exclude_files.txt  |   1 -
 python/.cargo/config               |  22 ---
 python/.dockerignore               |  19 ---
 python/.gitignore                  |  20 ---
 python/Cargo.toml                  |  57 ---
 python/README.md                   | 146 --
 python/pyproject.toml              |  20 ---
 python/rust-toolchain              |   1 -
 python/src/context.rs              | 115 ---
 python/src/dataframe.rs            | 161
 python/src/errors.rs               |  61
 python/src/expression.rs           | 162
 python/src/functions.rs            | 165 -
 python/src/lib.rs                  |  44 --
 python/src/scalar.rs               |  36 -
 python/src/to_py.rs                |  77 --
 python/src/to_rust.rs              | 111 --
 python/src/types.rs                |  76 --
 python/src/udaf.rs                 | 147 ---
 python/src/udf.rs                  |  62
 python/tests/__init__.py           |  16 --
 python/tests/generic.py            |  75 --
 python/tests/test_df.py            | 115 ---
 python/tests/test_sql.py           | 294 -
 python/tests/test_udaf.py          |  91
 28 files changed, 1 insertion(+), 2244 deletions(-)

diff --git a/.github/workflows/python_build.yml b/.github/workflows/python_build.yml
deleted file mode 100644
index c86bb81..000
--- a/.github/workflows/python_build.yml
+++ /dev/null
@@ -1,89 +0,0 @@
-# Licensed to the Apache Software Foundation (ASF) under one
-# or more contributor license agreements. See the NOTICE file
-# distributed with this work for additional information
-# regarding copyright ownership. The ASF licenses this file
-# to you under the Apache License, Version 2.0 (the
-# "License"); you may not use this file except in compliance
-# with the License. You may obtain a copy of the License at
-#
-#   http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing,
-# software distributed under the License is distributed on an
-# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-# KIND, either express or implied. See the License for the
-# specific language governing permissions and limitations
-# under the License.
-
-name: Build
-on:
-  push:
-    tags:
-      - v*
-
-jobs:
-  build-python-mac-win:
-    name: Mac/Win
-    runs-on: ${{ matrix.os }}
-    strategy:
-      fail-fast: false
-      matrix:
-        python-version: [3.6, 3.7, 3.8]
-        os: [macos-latest, windows-latest]
-    steps:
-      - uses: actions/checkout@v2
-
-      - uses: actions/setup-python@v1
-        with:
-          python-version: ${{ matrix.python-version }}
-
-      - uses: actions-rs/toolchain@v1
-        with:
-          toolchain: nightly-2021-01-06
-
-      - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip
-          pip install maturin
-
-      - name: Build Python package
-        run: cd python && maturin build --release --no-sdist --strip --interpreter python${{matrix.python_version}}
-
-      - name: List wheels
-        if: matrix.os == 'windows-latest'
-        run: dir python/target\wheels\
-
-      - name: List wheels
-        if: matrix.os != 'windows-latest'
-        run: find ./python/target/wheels/
-
-      - name: Archive wheels
-        uses: actions/upload-artifact@v2
-        with:
-          name: dist
-          path: python/target/wheels/*
-
-  build-manylinux:
-    name: Manylinux
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v2
-      - name: Build wheels
-        run: docker run --rm -v $(pwd):/io konstin2/maturin build --release --manylinux
-      - name: Archive wheels
-        uses: actions/upload-artifact@v2
-        with:
-          name: dist
-          path: python/target/wheels/*
-
-  release:
-    name: Publish in PyPI
-    needs: [build-manylinux, build-python-mac-win]
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/download-artifact@v2
-      - name: Publish to PyPI
-
[arrow-rs] branch master updated (8f030db -> 6a65543)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git.

    from 8f030db  Made integration tests always run. (#248)
     add 6a65543  fix parquet max_definition for non-null structs (#246)

No new revisions were added by this update.

Summary of changes:
 parquet/src/arrow/arrow_writer.rs |  60 --
 parquet/src/arrow/levels.rs       | 124 +++---
 2 files changed, 170 insertions(+), 14 deletions(-)
[arrow-rs] branch master updated (51513c1 -> 111d5d6)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git.

    from 51513c1  ARROW-12411: [Rust] Create RecordBatches from Iterators (#7)
     add 111d5d6  Support string dictionaries in csv reader (#228) (#229)

No new revisions were added by this update.

Summary of changes:
 arrow/src/csv/reader.rs | 147 +++-
 1 file changed, 121 insertions(+), 26 deletions(-)
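The commit above adds dictionary-array support to the CSV reader. Dictionary encoding stores each distinct string once and represents rows as integer keys into that table, which saves memory for low-cardinality columns. A minimal standalone sketch of the idea in plain Rust (the `DictBuilder` type and its API here are illustrative, not arrow-rs's `StringDictionaryBuilder`):

```rust
use std::collections::HashMap;

// Illustrative dictionary encoder: each distinct value is stored once in
// `values`; each row becomes an integer key into that table.
#[derive(Default)]
struct DictBuilder {
    keys: Vec<u32>,         // one key per row
    values: Vec<String>,    // distinct values, in first-seen order
    index: HashMap<String, u32>,
}

impl DictBuilder {
    fn append(&mut self, s: &str) {
        let key = match self.index.get(s) {
            Some(&k) => k,
            None => {
                let k = self.values.len() as u32;
                self.values.push(s.to_string());
                self.index.insert(s.to_string(), k);
                k
            }
        };
        self.keys.push(key);
    }
}

fn main() {
    let mut b = DictBuilder::default();
    for s in ["US", "UK", "US", "US", "DE"] {
        b.append(s);
    }
    // 5 rows, but only 3 distinct strings are stored
    assert_eq!(b.keys, vec![0, 1, 0, 0, 2]);
    assert_eq!(b.values, ["US", "UK", "DE"]);
    println!("keys={:?} values={:?}", b.keys, b.values);
}
```

For a CSV column like country codes, this is the trade a dictionary-encoded reader makes: fixed-width integer keys per row plus one copy of each distinct string, instead of one heap-allocated string per row.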
[arrow] branch master updated (249fa7c -> 892776f)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.

    from 249fa7c  ARROW-12123: [Rust][DataFusion] Use smallvec for indices for better join performance
     add 892776f  ARROW-12153: [Rust] [Parquet] Return file stats after writing file

No new revisions were added by this update.

Summary of changes:
 rust/datafusion/src/execution/context.rs |  2 +-
 rust/parquet/src/arrow/arrow_reader.rs   |  2 +-
 rust/parquet/src/arrow/arrow_writer.rs   |  2 +-
 rust/parquet/src/file/writer.rs          | 14 +++---
 4 files changed, 10 insertions(+), 10 deletions(-)
[arrow] branch master updated (8de898d -> cd4379d)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.

    from 8de898d  ARROW-12138: [Go][IPC] Update flatbuffers definitions
     add cd4379d  ARROW-12121: [Rust] [Parquet] Arrow writer benchmarks

No new revisions were added by this update.

Summary of changes:
 rust/parquet/Cargo.toml              |   5 +
 rust/parquet/benches/arrow_writer.rs | 201 +++
 2 files changed, 206 insertions(+)
 create mode 100644 rust/parquet/benches/arrow_writer.rs
[arrow] branch master updated (9aa0f85 -> 4de0ed7)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.

    from 9aa0f85  ARROW-11973 [Rust][DataFusion] Boolean kleene kernels
     add 4de0ed7  ARROW-12120: [Rust] Generate random arrays and batches

No new revisions were added by this update.

Summary of changes:
 rust/arrow/benches/aggregate_kernels.rs  |   4 +-
 rust/arrow/benches/comparison_kernels.rs |   2 +-
 rust/arrow/benches/concatenate_kernel.rs |   8 +-
 rust/arrow/benches/equal.rs              |   4 +-
 rust/arrow/benches/filter_kernels.rs     |   2 +-
 rust/arrow/benches/mutable_array.rs      |   4 +-
 rust/arrow/benches/take_kernels.rs       |  12 +-
 rust/arrow/src/util/bench_util.rs        |  50 -
 rust/arrow/src/util/data_gen.rs          | 347 +++
 rust/arrow/src/util/mod.rs               |   1 +
 10 files changed, 415 insertions(+), 19 deletions(-)
 create mode 100644 rust/arrow/src/util/data_gen.rs
[arrow] branch master updated: ARROW-12043: [Rust] [Parquet] Write FSB arrays
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git

The following commit(s) were added to refs/heads/master by this push:
     new 894dd17  ARROW-12043: [Rust] [Parquet] Write FSB arrays
894dd17 is described below

commit 894dd17c9602439c2b84c0b849fb0966606ceb1c
Author: Neville Dipale
AuthorDate: Sun Mar 28 11:01:56 2021 +0200

    ARROW-12043: [Rust] [Parquet] Write FSB arrays

    Minor change to compute the levels for FSB arrays and write them out.
    Added a roundtrip test.

    Closes #9771 from nevi-me/ARROW-12043

    Authored-by: Neville Dipale
    Signed-off-by: Neville Dipale
---
 rust/parquet/src/arrow/arrow_writer.rs | 28 ++--
 rust/parquet/src/arrow/levels.rs       | 30 --
 rust/parquet/src/arrow/mod.rs          |  2 +-
 3 files changed, 43 insertions(+), 17 deletions(-)

diff --git a/rust/parquet/src/arrow/arrow_writer.rs b/rust/parquet/src/arrow/arrow_writer.rs
index 1ce907f..a3577ca 100644
--- a/rust/parquet/src/arrow/arrow_writer.rs
+++ b/rust/parquet/src/arrow/arrow_writer.rs
@@ -146,7 +146,8 @@ fn write_leaves(
         | ArrowDataType::Binary
         | ArrowDataType::Utf8
         | ArrowDataType::LargeUtf8
-        | ArrowDataType::Decimal(_, _) => {
+        | ArrowDataType::Decimal(_, _)
+        | ArrowDataType::FixedSizeBinary(_) => {
             let mut col_writer = get_col_writer(&mut row_group_writer)?;
             write_leaf(
                 &mut col_writer,
@@ -189,11 +190,14 @@ fn write_leaves(
         ArrowDataType::Float16 => Err(ParquetError::ArrowError(
             "Float16 arrays not supported".to_string(),
         )),
-        ArrowDataType::FixedSizeList(_, _)
-        | ArrowDataType::FixedSizeBinary(_)
-        | ArrowDataType::Union(_) => Err(ParquetError::NYI(
-            "Attempting to write an Arrow type that is not yet implemented".to_string(),
-        )),
+        ArrowDataType::FixedSizeList(_, _) | ArrowDataType::Union(_) => {
+            Err(ParquetError::NYI(
+                format!(
+                    "Attempting to write an Arrow type {:?} to parquet that is not yet implemented",
+                    array.data_type()
+                )
+            ))
+        }
     }
 }

@@ -1225,6 +1229,18 @@ mod tests {
     }

     #[test]
+    fn fixed_size_binary_single_column() {
+        let mut builder = FixedSizeBinaryBuilder::new(16, 4);
+        builder.append_value(b"0123").unwrap();
+        builder.append_null().unwrap();
+        builder.append_value(b"8910").unwrap();
+        builder.append_value(b"1112").unwrap();
+        let array = Arc::new(builder.finish());
+
+        one_column_roundtrip("fixed_size_binary_single_column", array, true);
+    }
+
+    #[test]
     fn string_single_column() {
         let raw_values: Vec<_> = (0..SMALL_SIZE).map(|i| i.to_string()).collect();
         let raw_strs = raw_values.iter().map(|s| s.as_str());

diff --git a/rust/parquet/src/arrow/levels.rs b/rust/parquet/src/arrow/levels.rs
index 641e330..2168670 100644
--- a/rust/parquet/src/arrow/levels.rs
+++ b/rust/parquet/src/arrow/levels.rs
@@ -136,7 +136,8 @@ impl LevelInfo {
             | DataType::Interval(_)
             | DataType::Binary
             | DataType::LargeBinary
-            | DataType::Decimal(_, _) => {
+            | DataType::Decimal(_, _)
+            | DataType::FixedSizeBinary(_) => {
                 // we return a vector of 1 value to represent the primitive
                 vec![self.calculate_child_levels(
                     array_offsets,
                     array_mask,
                     false,
                     field.is_nullable(),
                 )]
             }
-            DataType::FixedSizeBinary(_) => unimplemented!(),
             DataType::List(list_field) | DataType::LargeList(list_field) => {
                 // Calculate the list level
                 let list_level = self.calculate_child_levels(
@@ -189,7 +189,8 @@ impl LevelInfo {
             | DataType::Utf8
             | DataType::LargeUtf8
             | DataType::Dictionary(_, _)
-            | DataType::Decimal(_, _) => {
+            | DataType::Decimal(_, _)
+            | DataType::FixedSizeBinary(_) => {
                 vec![list_level.calculate_child_levels(
                     child_offsets,
                     child_mask,
                     false,
                     list_field.is_nullable(),
                 )]
             }
-            DataType::FixedSizeBinary(_) => unimplemented!(),
             DataType::List(_) | DataType::LargeList(_) | DataType::Struct(_) => {
                 list_level.calculate_ar
[arrow] branch master updated: ARROW-12116: [Rust] Fix and ignore 1.51 clippy lints
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git

The following commit(s) were added to refs/heads/master by this push:
     new 60011c0  ARROW-12116: [Rust] Fix and ignore 1.51 clippy lints
60011c0 is described below

commit 60011c081508b09724469d7a4d1d93b4bd015fe4
Author: Neville Dipale
AuthorDate: Sun Mar 28 00:59:48 2021 +0200

    ARROW-12116: [Rust] Fix and ignore 1.51 clippy lints

    There's an acronym Rust lint that started failing after 1.51 was announced.
    The lint is in the `arrow::ffi` and `arrow::ipc::gen` modules, so I'm instead
    ignoring it and documenting this.

    Closes #9815 from nevi-me/1-51-lints

    Authored-by: Neville Dipale
    Signed-off-by: Neville Dipale
---
 rust/arrow/src/lib.rs                       | 13 ++---
 rust/arrow/src/util/pretty.rs               |  3 +--
 rust/datafusion/src/lib.rs                  |  4 +++-
 rust/datafusion/src/logical_plan/builder.rs |  4 ++--
 rust/parquet/src/lib.rs                     |  5 -
 5 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/rust/arrow/src/lib.rs b/rust/arrow/src/lib.rs
index 68a820b..30f968c9 100644
--- a/rust/arrow/src/lib.rs
+++ b/rust/arrow/src/lib.rs
@@ -129,11 +129,18 @@
 #![cfg_attr(feature = "avx512", feature(avx512_target_feature))]
 #![allow(dead_code)]
 #![allow(non_camel_case_types)]
+#![deny(clippy::redundant_clone)]
+#![allow(
+    // introduced to ignore lint errors when upgrading from 2020-04-22 to 2020-11-14
+    clippy::float_equality_without_abs,
+    clippy::type_complexity,
+    // upper_case_acronyms lint was introduced in Rust 1.51.
+    // It is triggered in the ffi module, and ipc::gen, which we have no control over
+    clippy::upper_case_acronyms,
+    clippy::vec_init_then_push
+)]
 #![allow(bare_trait_objects)]
 #![warn(missing_debug_implementations)]
-#![deny(clippy::redundant_clone)]
-// introduced to ignore lint errors when upgrading from 2020-04-22 to 2020-11-14
-#![allow(clippy::float_equality_without_abs, clippy::type_complexity)]

 pub mod alloc;
 mod arch;

diff --git a/rust/arrow/src/util/pretty.rs b/rust/arrow/src/util/pretty.rs
index 7baf559..f354899 100644
--- a/rust/arrow/src/util/pretty.rs
+++ b/rust/arrow/src/util/pretty.rs
@@ -93,8 +93,7 @@ fn create_column(field: &Field, columns: &[ArrayRef]) -> Result {

     for col in columns {
         for row in 0..col.len() {
-            let mut cells = Vec::new();
-            cells.push(Cell::new(&array_value_to_string(&col, row)?));
+            let cells = vec![Cell::new(&array_value_to_string(&col, row)?)];
             table.add_row(Row::new(cells));
         }
     }

diff --git a/rust/datafusion/src/lib.rs b/rust/datafusion/src/lib.rs
index 3e1e1e2..2733430 100644
--- a/rust/datafusion/src/lib.rs
+++ b/rust/datafusion/src/lib.rs
@@ -18,9 +18,11 @@
 // Clippy lints, some should be disabled incrementally
 #![allow(
     clippy::float_cmp,
+    clippy::from_over_into,
     clippy::module_inception,
     clippy::new_without_default,
-    clippy::type_complexity
+    clippy::type_complexity,
+    clippy::upper_case_acronyms
 )]

 //! [DataFusion](https://github.com/apache/arrow/tree/master/rust/datafusion)

diff --git a/rust/datafusion/src/logical_plan/builder.rs b/rust/datafusion/src/logical_plan/builder.rs
index aa0380e..e748872 100644
--- a/rust/datafusion/src/logical_plan/builder.rs
+++ b/rust/datafusion/src/logical_plan/builder.rs
@@ -303,8 +303,8 @@ impl LogicalPlanBuilder {

         Ok(Self::from(&LogicalPlan::Aggregate {
             input: Arc::new(self.plan.clone()),
-            group_expr: group_expr,
-            aggr_expr: aggr_expr,
+            group_expr,
+            aggr_expr,
             schema: DFSchemaRef::new(aggr_schema),
         }))
     }

diff --git a/rust/parquet/src/lib.rs b/rust/parquet/src/lib.rs
index 19e1a0f..a931b95 100644
--- a/rust/parquet/src/lib.rs
+++ b/rust/parquet/src/lib.rs
@@ -23,13 +23,16 @@
     clippy::cast_ptr_alignment,
     clippy::float_cmp,
     clippy::float_equality_without_abs,
+    clippy::from_over_into,
     clippy::many_single_char_names,
     clippy::needless_range_loop,
     clippy::new_without_default,
     clippy::or_fun_call,
     clippy::same_item_push,
     clippy::too_many_arguments,
-    clippy::transmute_ptr_to_ptr
+    clippy::transmute_ptr_to_ptr,
+    clippy::upper_case_acronyms,
+    clippy::vec_init_then_push
 )]

 #[macro_use]
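The `pretty.rs` change in the commit above is the standard fix for clippy's `vec_init_then_push` lint: a `Vec::new()` followed immediately by `push` calls collapses into a single `vec![]` expression. A minimal sketch of the before/after pattern (the variable names are illustrative, not from the arrow source):

```rust
fn main() {
    // Before: flagged by clippy::vec_init_then_push
    let mut cells_before = Vec::new();
    cells_before.push("a");
    cells_before.push("b");

    // After: same contents, built in one expression
    let cells_after = vec!["a", "b"];

    assert_eq!(cells_before, cells_after);
    println!("{:?}", cells_after);
}
```

Besides being shorter, the `vec![]` form lets the binding be immutable, which is why the diff also drops the `mut` on `cells`.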
[arrow] branch master updated (143c2be -> 2c5e264)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.

    from 143c2be  ARROW-11736: [R] Allow string compute functions to be optional
     add 2c5e264  ARROW-11365: [Rust] [Parquet] Logical type printer and parser

No new revisions were added by this update.

Summary of changes:
 rust/parquet/src/arrow/schema.rs   |  51 +++-
 rust/parquet/src/basic.rs          |  73 +-
 rust/parquet/src/schema/parser.rs  | 484 -
 rust/parquet/src/schema/printer.rs | 423 +---
 4 files changed, 903 insertions(+), 128 deletions(-)
[arrow] branch master updated (0bea590 -> 4eefa35)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.

    from 0bea590  ARROW-11422: [C#] add decimal support
     add 4eefa35  ARROW-12019: [Rust] [Parquet] Update README for 2.6.0 support

No new revisions were added by this update.

Summary of changes:
 rust/parquet/README.md    | 18 +++---
 rust/parquet/src/basic.rs |  7 ---
 2 files changed, 15 insertions(+), 10 deletions(-)
[arrow] branch master updated (41833d3 -> 21483ad)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.

    from 41833d3  ARROW-12071: [GLib] Keep input stream reference of GArrowJSONReader
     add 21483ad  ARROW-12076: [Rust] Fix build

No new revisions were added by this update.

Summary of changes:
 rust/arrow/src/compute/kernels/comparison.rs | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
[arrow] branch master updated (ae87509 -> eebf64b)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.

    from ae87509  ARROW-12038: [Rust][DataFusion] Upgrade hashbrown to 0.11
     add eebf64b  ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays

No new revisions were added by this update.

Summary of changes:
 rust/arrow/examples/dynamic_types.rs              |   2 +-
 rust/arrow/src/array/array.rs                     |  82 +++
 rust/arrow/src/array/array_binary.rs              |  45 ++--
 rust/arrow/src/array/array_boolean.rs             |  16 +-
 rust/arrow/src/array/array_dictionary.rs          |  28 +--
 rust/arrow/src/array/array_list.rs                |  48 ++--
 rust/arrow/src/array/array_primitive.rs           |  34 +--
 rust/arrow/src/array/array_string.rs              |  24 +-
 rust/arrow/src/array/array_struct.rs              |  34 ++-
 rust/arrow/src/array/array_union.rs               |  28 +--
 rust/arrow/src/array/builder.rs                   |  10 +-
 rust/arrow/src/array/data.rs                      |  36 ++-
 rust/arrow/src/array/equal/dictionary.rs          |   4 +-
 rust/arrow/src/array/equal/fixed_list.rs          |   4 +-
 rust/arrow/src/array/equal/list.rs                |   4 +-
 rust/arrow/src/array/equal/mod.rs                 | 255 +++--
 rust/arrow/src/array/ffi.rs                       |  15 +-
 rust/arrow/src/array/null.rs                      |  19 +-
 rust/arrow/src/array/ord.rs                       |   4 +-
 rust/arrow/src/array/transform/mod.rs             | 116 +-
 rust/arrow/src/compute/kernels/arithmetic.rs      |  21 +-
 rust/arrow/src/compute/kernels/arity.rs           |   2 +-
 rust/arrow/src/compute/kernels/boolean.rs         |  13 +-
 rust/arrow/src/compute/kernels/cast.rs            |  47 ++--
 rust/arrow/src/compute/kernels/comparison.rs      |  24 +-
 rust/arrow/src/compute/kernels/concat.rs          |   9 +-
 rust/arrow/src/compute/kernels/filter.rs          |   6 +-
 rust/arrow/src/compute/kernels/length.rs          |   3 +-
 rust/arrow/src/compute/kernels/limit.rs           |   4 +-
 rust/arrow/src/compute/kernels/sort.rs            |  13 +-
 rust/arrow/src/compute/kernels/substring.rs       |   3 +-
 rust/arrow/src/compute/kernels/take.rs            |  57 +++--
 rust/arrow/src/compute/kernels/window.rs          |   3 +-
 rust/arrow/src/compute/kernels/zip.rs             |   3 +-
 rust/arrow/src/compute/util.rs                    |  12 +-
 rust/arrow/src/ffi.rs                             |  21 +-
 rust/arrow/src/ipc/reader.rs                      |  12 +-
 rust/arrow/src/ipc/writer.rs                      |   6 +-
 rust/arrow/src/json/reader.rs                     |  26 ++-
 rust/arrow/src/json/writer.rs                     |   6 +-
 rust/arrow/src/record_batch.rs                    |   6 +-
 rust/arrow/src/util/integration_util.rs           |  10 +-
 .../src/physical_plan/math_expressions.rs         |   4 +-
 rust/integration-testing/src/lib.rs               |  10 +-
 rust/parquet/src/arrow/array_reader.rs            |  16 +-
 rust/parquet/src/arrow/arrow_writer.rs            |  81 +--
 rust/parquet/src/arrow/levels.rs                  |   4 +-
 47 files changed, 614 insertions(+), 616 deletions(-)
[arrow] branch master updated (6112255 -> ae87509)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.

    from 6112255  ARROW-10250: [C++][FlightRPC] Consistently use FlightClientOptions::Defaults
     add ae87509  ARROW-12038: [Rust][DataFusion] Upgrade hashbrown to 0.11

No new revisions were added by this update.

Summary of changes:
 rust/datafusion/Cargo.toml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[arrow] branch master updated (775a714 -> ef64d00)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.

    from 775a714  ARROW-10903 [Rust] Implement FromIter>> constructor for FixedSizeBinaryArray
     add ef64d00  ARROW-11824: [Rust] [Parquet] Use logical types in Arrow schema conversion

No new revisions were added by this update.

Summary of changes:
 rust/arrow/src/array/array_binary.rs |   8 +-
 rust/parquet/src/arrow/schema.rs     | 254 ++
 rust/parquet/src/schema/parser.rs    |  10 +-
 rust/parquet/src/schema/types.rs     | 293 ++-
 4 files changed, 419 insertions(+), 146 deletions(-)
[arrow] branch master updated (976ddbf -> 69d436d)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.

    from 976ddbf  ARROW-11896: [Rust] Disable Debug symbols on CI test builds
     add 69d436d  ARROW-11803: [Rust] [Parquet] Support v2 LogicalType

No new revisions were added by this update.

Summary of changes:
 rust/parquet/src/arrow/array_reader.rs |   13 +-
 rust/parquet/src/arrow/schema.rs       |   96 +--
 rust/parquet/src/basic.rs              | 1098 +++-
 rust/parquet/src/column/reader.rs      |    4 +-
 rust/parquet/src/file/footer.rs        |    1 +
 rust/parquet/src/file/writer.rs        |   57 +-
 rust/parquet/src/record/api.rs         |  116 ++--
 rust/parquet/src/record/reader.rs      |   10 +-
 rust/parquet/src/schema/mod.rs         |    4 +-
 rust/parquet/src/schema/parser.rs      |   39 +-
 rust/parquet/src/schema/printer.rs     |   42 +-
 rust/parquet/src/schema/types.rs       |  181 --
 rust/parquet/src/schema/visitor.rs     |    8 +-
 13 files changed, 1143 insertions(+), 526 deletions(-)
[arrow] branch master updated (b07027e -> bfa99d9)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.

    from b07027e  ARROW-11735: [R] Allow Parquet and Arrow Dataset to be optional components
     add bfa99d9  ARROW-11881: [Rust][DataFusion] Fix clippy lint

No new revisions were added by this update.

Summary of changes:
 rust/datafusion/src/physical_plan/merge.rs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)