[arrow] branch master updated (18495e0 -> 7189b91)
This is an automated email from the ASF dual-hosted git repository.

emkornfield pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.

    from 18495e0  ARROW-10294: [Java] Resolve problems of DecimalVector APIs on ArrowBufs
     add 7189b91  ARROW-9475: [Java] Clean up usages of BaseAllocator, use BufferAllocator in…

No new revisions were added by this update.

Summary of changes:
 .../java/org/apache/arrow/memory/Accountant.java   |  3 +-
 .../org/apache/arrow/memory/AllocationManager.java | 39 --
 .../org/apache/arrow/memory/BaseAllocator.java     | 16 ++---
 .../org/apache/arrow/memory/BufferAllocator.java   | 32 ++
 .../java/org/apache/arrow/memory/BufferLedger.java | 22 ++--
 .../memory/DefaultAllocationManagerFactory.java    |  2 +-
 .../memory/DefaultAllocationManagerFactory.java    |  2 +-
 .../arrow/memory/NettyAllocationManager.java       |  6 ++--
 .../org/apache/arrow/memory/TestBaseAllocator.java |  2 +-
 .../arrow/memory/TestNettyAllocationManager.java   |  2 +-
 .../memory/DefaultAllocationManagerFactory.java    |  2 +-
 .../arrow/memory/UnsafeAllocationManager.java      |  4 +--
 12 files changed, 87 insertions(+), 45 deletions(-)
[arrow] branch master updated (1d10f22 -> 18495e0)
This is an automated email from the ASF dual-hosted git repository.

emkornfield pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.

    from 1d10f22  ARROW-10236: [Rust] Add can_cast_types to arrow cast kernel, use in DataFusion
     add 18495e0  ARROW-10294: [Java] Resolve problems of DecimalVector APIs on ArrowBufs

No new revisions were added by this update.

Summary of changes:
 .../src/main/codegen/data/ValueVectorTypes.tdd      |  2 +-
 .../src/main/codegen/templates/ComplexWriters.java  |  4 ++--
 .../codegen/templates/UnionFixedSizeListWriter.java |  2 +-
 .../src/main/codegen/templates/UnionListWriter.java |  4 ++--
 .../java/org/apache/arrow/vector/DecimalVector.java | 12 ++--
 .../arrow/vector/complex/impl/PromotableWriter.java |  2 +-
 .../apache/arrow/vector/util/DecimalUtility.java    |  2 +-
 .../org/apache/arrow/vector/ITTestLargeVector.java  | 21 -
 8 files changed, 34 insertions(+), 15 deletions(-)
[arrow] branch master updated (35ace39 -> 1d10f22)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.

    from 35ace39  ARROW-10174: [Java] Fix reading/writing dict structs
     add 1d10f22  ARROW-10236: [Rust] Add can_cast_types to arrow cast kernel, use in DataFusion

No new revisions were added by this update.

Summary of changes:
 rust/arrow/src/compute/kernels/cast.rs           | 478 ++-
 rust/arrow/src/datatypes.rs                      |  10 +
 rust/datafusion/src/logical_plan/mod.rs          |  13 +-
 rust/datafusion/src/physical_plan/expressions.rs |  26 +-
 4 files changed, 505 insertions(+), 22 deletions(-)
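The point of ARROW-10236's `can_cast_types` is that a planner such as DataFusion can reject an unsupported cast at plan time instead of failing mid-execution. A minimal sketch of the idea in plain Rust — the `DataType` enum and the set of supported pairs here are illustrative only, not the arrow crate's actual cast matrix (the real function is `arrow::compute::kernels::cast::can_cast_types(&DataType, &DataType) -> bool`):

```rust
// Illustrative, simplified DataType; not the arrow crate's.
#[derive(Debug, PartialEq)]
enum DataType {
    Int32,
    Int64,
    Float64,
    Utf8,
    Boolean,
}

/// Returns true if a cast kernel exists for `from` -> `to`.
/// A planner calls this while building the plan, before any data moves.
fn can_cast_types(from: &DataType, to: &DataType) -> bool {
    use DataType::*;
    match (from, to) {
        // identity casts are always supported
        (a, b) if a == b => true,
        // numeric conversions
        (Int32, Int64) | (Int64, Int32) | (Int32, Float64) | (Int64, Float64) => true,
        // numbers format to strings, and strings parse to numbers
        (Int32, Utf8) | (Int64, Utf8) | (Float64, Utf8) => true,
        (Utf8, Int32) | (Utf8, Int64) | (Utf8, Float64) => true,
        // no kernel registered for anything else in this sketch
        _ => false,
    }
}

fn main() {
    assert!(can_cast_types(&DataType::Int32, &DataType::Utf8));
    assert!(!can_cast_types(&DataType::Boolean, &DataType::Float64));
}
```

The payoff is the error surface: a bad `CAST` expression produces a planning error naming both types, rather than a runtime panic deep inside the kernel.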
[arrow] 04/07: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch rust-parquet-arrow-writer
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 58855a888438fe2063db8499659d83fe1bb91b66
Author: Carol (Nichols || Goulding)
AuthorDate: Sat Oct 3 02:34:38 2020 +0200

    ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types

    In this commit, I:

    - Extracted a `build_field` function for code shared between `schema_to_fb`
      and `schema_to_fb_offset` that needed to change
    - Uncommented the dictionary field in the Arrow schema roundtrip test and
      added a dictionary field to the IPC roundtrip test
    - If a field is a dictionary field, called `add_dictionary` with the
      dictionary field information on the flatbuffer field, building the
      dictionary as [the C++ code does][cpp-dictionary], with the same comment
    - When getting the field type for a dictionary field, used the `value_type`
      as [the C++ code does][cpp-value-type], with the same comment

    The tests pass because the Parquet -> Arrow conversion for dictionaries is
    [already supported][parquet-to-arrow].

    [cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440
    [cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667
    [parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127

    Closes #8291 from carols10cents/rust-parquet-arrow-writer

    Authored-by: Carol (Nichols || Goulding)
    Signed-off-by: Neville Dipale
---
 rust/arrow/src/datatypes.rs      |   4 +-
 rust/arrow/src/ipc/convert.rs    | 105 ++-
 rust/parquet/src/arrow/schema.rs |  20
 3 files changed, 93 insertions(+), 36 deletions(-)

diff --git a/rust/arrow/src/datatypes.rs b/rust/arrow/src/datatypes.rs
index 0d05f82..c647af6 100644
--- a/rust/arrow/src/datatypes.rs
+++ b/rust/arrow/src/datatypes.rs
@@ -189,8 +189,8 @@ pub struct Field {
     name: String,
     data_type: DataType,
     nullable: bool,
-    dict_id: i64,
-    dict_is_ordered: bool,
+    pub(crate) dict_id: i64,
+    pub(crate) dict_is_ordered: bool,
 }

 pub trait ArrowNativeType:
diff --git a/rust/arrow/src/ipc/convert.rs b/rust/arrow/src/ipc/convert.rs
index 7a5795d..8f429bf 100644
--- a/rust/arrow/src/ipc/convert.rs
+++ b/rust/arrow/src/ipc/convert.rs
@@ -34,18 +34,8 @@ pub fn schema_to_fb(schema: ) -> FlatBufferBuilder {
     let mut fields = vec![];
     for field in schema.fields() {
-        let fb_field_name = fbb.create_string(field.name().as_str());
-        let field_type = get_fb_field_type(field.data_type(), fbb);
-        let mut field_builder = ipc::FieldBuilder::new( fbb);
-        field_builder.add_name(fb_field_name);
-        field_builder.add_type_type(field_type.type_type);
-        field_builder.add_nullable(field.is_nullable());
-        match field_type.children {
-            None => {}
-            Some(children) => field_builder.add_children(children),
-        };
-        field_builder.add_type_(field_type.type_);
-        fields.push(field_builder.finish());
+        let fb_field = build_field( fbb, field);
+        fields.push(fb_field);
     }
     let mut custom_metadata = vec![];
@@ -80,18 +70,8 @@ pub fn schema_to_fb_offset<'a: 'b, 'b>( ) -> WIPOffset> {
     let mut fields = vec![];
     for field in schema.fields() {
-        let fb_field_name = fbb.create_string(field.name().as_str());
-        let field_type = get_fb_field_type(field.data_type(), fbb);
-        let mut field_builder = ipc::FieldBuilder::new(fbb);
-        field_builder.add_name(fb_field_name);
-        field_builder.add_type_type(field_type.type_type);
-        field_builder.add_nullable(field.is_nullable());
-        match field_type.children {
-            None => {}
-            Some(children) => field_builder.add_children(children),
-        };
-        field_builder.add_type_(field_type.type_);
-        fields.push(field_builder.finish());
+        let fb_field = build_field(fbb, field);
+        fields.push(fb_field);
     }
     let mut custom_metadata = vec![];
@@ -333,6 +313,38 @@ pub(crate) struct FBFieldType<'b> {
     pub(crate) children: Option,
 }

+/// Create an IPC Field from an Arrow Field
+pub(crate) fn build_field<'a: 'b, 'b>(
+    fbb: FlatBufferBuilder<'a>,
+    field: ,
+) -> WIPOffset> {
+    let fb_field_name = fbb.create_string(field.name().as_str());
+    let field_type = get_fb_field_type(field.data_type(), fbb);
+
+    let fb_dictionary = if let Dictionary(index_type, _) = field.data_type() {
+        Some(get_fb_dictionary(
+            index_type,
+            field.dict_id,
[diff truncated in the original email]
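The `value_type` rule the commit borrows from C++ can be shown in isolation: for a `Dictionary(index_type, value_type)` field, the serialized field type is the *value* type, while the index type travels separately in the dictionary-encoding metadata. A sketch with a hypothetical `DataType` enum, not the arrow crate's:

```rust
// Illustrative, simplified DataType; not the arrow crate's.
#[derive(Debug, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    // (index type, value type)
    Dictionary(Box<DataType>, Box<DataType>),
}

/// Type to record in the serialized IPC field, mirroring the C++ behaviour
/// the commit cites: dictionaries serialize their value type.
fn serialized_field_type(dt: &DataType) -> &DataType {
    match dt {
        DataType::Dictionary(_, value_type) => value_type,
        other => other,
    }
}

fn main() {
    let dict = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
    assert_eq!(serialized_field_type(&dict), &DataType::Utf8);
    assert_eq!(serialized_field_type(&DataType::Int32), &DataType::Int32);
}
```

This is why the roundtrip works against readers that predate dictionary support: they still see a well-formed field of the value type.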
[arrow] 07/07: ARROW-10225: [Rust] [Parquet] Fix null comparison in roundtrip
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch rust-parquet-arrow-writer
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit f70e6db575cce746d2a4cd1c9e5a99629c27926c
Author: Neville Dipale
AuthorDate: Thu Oct 8 17:08:59 2020 +0200

    ARROW-10225: [Rust] [Parquet] Fix null comparison in roundtrip

    Closes #8388 from nevi-me/ARROW-10225

    Authored-by: Neville Dipale
    Signed-off-by: Neville Dipale
---
 rust/parquet/src/arrow/arrow_writer.rs | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/rust/parquet/src/arrow/arrow_writer.rs b/rust/parquet/src/arrow/arrow_writer.rs
index 40e2553..a17e424 100644
--- a/rust/parquet/src/arrow/arrow_writer.rs
+++ b/rust/parquet/src/arrow/arrow_writer.rs
@@ -724,7 +724,11 @@ mod tests {
         assert_eq!(expected_data.offset(), actual_data.offset());
         assert_eq!(expected_data.buffers(), actual_data.buffers());
         assert_eq!(expected_data.child_data(), actual_data.child_data());
-        assert_eq!(expected_data.null_bitmap(), actual_data.null_bitmap());
+        // Null counts should be the same, not necessarily bitmaps
+        // A null bitmap is optional if an array has no nulls
+        if expected_data.null_count() != 0 {
+            assert_eq!(expected_data.null_bitmap(), actual_data.null_bitmap());
+        }
     }
 }
@@ -1001,7 +1005,7 @@ mod tests {
 }

 #[test]
-#[ignore] // Binary support isn't correct yet - null_bitmap doesn't match
+#[ignore] // Binary support isn't correct yet - buffers don't match
 fn binary_single_column() {
     let one_vec: Vec = (0..SMALL_SIZE as u8).collect();
     let many_vecs: Vec<_> = std::iter::repeat(one_vec).take(SMALL_SIZE).collect();
@@ -1026,7 +1030,6 @@ mod tests {
 }

 #[test]
-#[ignore] // String support isn't correct yet - null_bitmap doesn't match
 fn string_single_column() {
     let raw_values: Vec<_> = (0..SMALL_SIZE).map(|i| i.to_string()).collect();
     let raw_strs = raw_values.iter().map(|s| s.as_str());
@@ -1035,7 +1038,6 @@ mod tests {
 }

 #[test]
-#[ignore] // Large string support isn't correct yet - null_bitmap doesn't match
 fn large_string_single_column() {
     let raw_values: Vec<_> = (0..SMALL_SIZE).map(|i| i.to_string()).collect();
     let raw_strs = raw_values.iter().map(|s| s.as_str());
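The reason for the guard in the diff above: Arrow allows a writer to omit the validity bitmap entirely when an array contains no nulls, so two logically equal arrays can legitimately differ in whether the bitmap is physically present. A minimal model of the fixed comparison, using a hypothetical `ArrayData` struct rather than the arrow crate's:

```rust
// Illustrative model; the real arrow crate's ArrayData has more fields.
struct ArrayData {
    null_count: usize,
    // None is a valid encoding when null_count == 0
    null_bitmap: Option<Vec<u8>>,
}

/// Roundtrip equality check per ARROW-10225: compare bitmaps only when
/// nulls actually exist; otherwise Some(all-valid) and None are both
/// acceptable encodings of "no nulls".
fn roundtrip_equal(expected: &ArrayData, actual: &ArrayData) -> bool {
    if expected.null_count != actual.null_count {
        return false;
    }
    if expected.null_count != 0 {
        return expected.null_bitmap == actual.null_bitmap;
    }
    true
}

fn main() {
    let written = ArrayData { null_count: 0, null_bitmap: Some(vec![0xFF]) };
    let read_back = ArrayData { null_count: 0, null_bitmap: None };
    // Equal despite the physically different bitmap representation.
    assert!(roundtrip_equal(&written, &read_back));
}
```

Comparing the raw `Option` directly, as the old assertion did, fails exactly this case even though the data is identical.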
[arrow] 02/07: ARROW-8423: [Rust] [Parquet] Serialize Arrow schema metadata
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch rust-parquet-arrow-writer
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 12b11b29293b6ace9cb99cffa93fd1b74b4849be
Author: Neville Dipale
AuthorDate: Tue Aug 18 18:39:37 2020 +0200

    ARROW-8423: [Rust] [Parquet] Serialize Arrow schema metadata

    This will allow preserving Arrow-specific metadata when writing or reading
    Parquet files created from C++ or Rust. If the schema can't be deserialised,
    the normal Parquet -> Arrow schema conversion is performed.

    Closes #7917 from nevi-me/ARROW-8243

    Authored-by: Neville Dipale
    Signed-off-by: Neville Dipale
---
 rust/parquet/Cargo.toml                |   3 +-
 rust/parquet/src/arrow/arrow_writer.rs |  27 ++-
 rust/parquet/src/arrow/mod.rs          |   4 +
 rust/parquet/src/arrow/schema.rs       | 306 -
 rust/parquet/src/file/properties.rs    |   6 +-
 5 files changed, 290 insertions(+), 56 deletions(-)

diff --git a/rust/parquet/Cargo.toml b/rust/parquet/Cargo.toml
index 50d7c34..60e43c9 100644
--- a/rust/parquet/Cargo.toml
+++ b/rust/parquet/Cargo.toml
@@ -40,6 +40,7 @@ zstd = { version = "0.5", optional = true }
 chrono = "0.4"
 num-bigint = "0.3"
 arrow = { path = "../arrow", version = "2.0.0-SNAPSHOT", optional = true }
+base64 = { version = "*", optional = true }

 [dev-dependencies]
 rand = "0.7"
@@ -52,4 +53,4 @@ arrow = { path = "../arrow", version = "2.0.0-SNAPSHOT" }
 serde_json = { version = "1.0", features = ["preserve_order"] }

 [features]
-default = ["arrow", "snap", "brotli", "flate2", "lz4", "zstd"]
+default = ["arrow", "snap", "brotli", "flate2", "lz4", "zstd", "base64"]
diff --git a/rust/parquet/src/arrow/arrow_writer.rs b/rust/parquet/src/arrow/arrow_writer.rs
index 0c1c490..1ca8d50 100644
--- a/rust/parquet/src/arrow/arrow_writer.rs
+++ b/rust/parquet/src/arrow/arrow_writer.rs
@@ -24,6 +24,7 @@ use arrow::datatypes::{DataType as ArrowDataType, SchemaRef};
 use arrow::record_batch::RecordBatch;
 use arrow_array::Array;

+use super::schema::add_encoded_arrow_schema_to_metadata;
 use crate::column::writer::ColumnWriter;
 use crate::errors::{ParquetError, Result};
 use crate::file::properties::WriterProperties;
@@ -53,17 +54,17 @@ impl ArrowWriter {
     pub fn try_new(
         writer: W,
         arrow_schema: SchemaRef,
-        props: Option>,
+        props: Option,
     ) -> Result {
         let schema = crate::arrow::arrow_to_parquet_schema(_schema)?;
-        let props = match props {
-            Some(props) => props,
-            None => Rc::new(WriterProperties::builder().build()),
-        };
+        // add serialized arrow schema
+        let mut props = props.unwrap_or_else(|| WriterProperties::builder().build());
+        add_encoded_arrow_schema_to_metadata(_schema, props);
+
         let file_writer = SerializedFileWriter::new(
             writer.try_clone()?,
             schema.root_schema_ptr(),
-            props,
+            Rc::new(props),
         )?;

         Ok(Self {
@@ -495,7 +496,7 @@ mod tests {
     use arrow::record_batch::{RecordBatch, RecordBatchReader};

     use crate::arrow::{ArrowReader, ParquetFileArrowReader};
-    use crate::file::reader::SerializedFileReader;
+    use crate::file::{metadata::KeyValue, reader::SerializedFileReader};
     use crate::util::test_common::get_temp_file;

     #[test]
@@ -584,7 +585,7 @@ mod tests {
         )
         .unwrap();

-        let mut file = get_temp_file("test_arrow_writer.parquet", &[]);
+        let mut file = get_temp_file("test_arrow_writer_binary.parquet", &[]);
         let mut writer =
             ArrowWriter::try_new(file.try_clone().unwrap(), Arc::new(schema), None)
                 .unwrap();
@@ -674,8 +675,16 @@ mod tests {
         )
         .unwrap();

+        let props = WriterProperties::builder()
+            .set_key_value_metadata(Some(vec![KeyValue {
+                key: "test_key".to_string(),
+                value: Some("test_value".to_string()),
+            }]))
+            .build();
+
         let file = get_temp_file("test_arrow_writer_complex.parquet", &[]);
-        let mut writer = ArrowWriter::try_new(file, Arc::new(schema), None).unwrap();
+        let mut writer =
+            ArrowWriter::try_new(file, Arc::new(schema), Some(props)).unwrap();
         writer.write().unwrap();
         writer.close().unwrap();
     }
diff --git a/rust/parquet/src/arrow/mod.rs b/rust/parquet/src/arrow/mod.rs
index 8499481..2b012fb 100644
--- a/rust/parquet/src/arrow/mod.rs
+++ b/rust/parquet/src/arrow/mod.rs
@@ -58,6 +58,10 @@ pub mod schema;

 pub use self::arrow_reader::ArrowReader;
 pub use self::arrow_reader::ParquetFileArrowReader;
+pub use self::arrow_writer::ArrowWriter;
 pub use self::schema::{
     arrow_to_parquet_schema, parquet_to_arrow_schema, parquet_to_arrow_schema_by_columns,
 };
+
+/// Schema metadata key used to store
[diff truncated in the original email]
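The read-side contract described in this commit — use the embedded Arrow schema when present, otherwise fall back to plain Parquet-to-Arrow conversion — reduces to a lookup in the Parquet file's key-value metadata. A conceptual sketch; the key name and helper are illustrative (the Arrow implementations share a well-known `ARROW:schema` key, but the constant name here is my own):

```rust
use std::collections::HashMap;

// Hypothetical constant name for the well-known metadata key.
const ARROW_SCHEMA_META_KEY: &str = "ARROW:schema";

/// Returns the embedded, base64-encoded IPC schema if the writer stored one.
/// A caller that gets None falls back to converting the Parquet schema.
fn embedded_arrow_schema(meta: &HashMap<String, String>) -> Option<&str> {
    meta.get(ARROW_SCHEMA_META_KEY).map(|s| s.as_str())
}

fn main() {
    let mut meta = HashMap::new();
    // No embedded schema: reader performs normal Parquet -> Arrow conversion.
    assert!(embedded_arrow_schema(&meta).is_none());

    meta.insert(ARROW_SCHEMA_META_KEY.to_string(), "…base64 bytes…".to_string());
    // Embedded schema present: Arrow-specific details (dictionaries,
    // timezones, field metadata) survive the roundtrip.
    assert!(embedded_arrow_schema(&meta).is_some());
}
```

The fallback path is what keeps files written by non-Arrow Parquet writers readable: absence of the key is normal, not an error.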
[arrow] 05/07: ARROW-10191: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch rust-parquet-arrow-writer
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit de95e847bc82cbd28b3963edc843baaa10bb99ab
Author: Carol (Nichols || Goulding)
AuthorDate: Tue Oct 6 08:44:26 2020 -0600

    ARROW-10191: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all
    supported Arrow DataTypes

    Note that this PR goes to the rust-parquet-arrow-writer branch, not master.

    Inspired by tests in cpp/src/parquet/arrow/arrow_reader_writer_test.cc,
    these perform a round trip Arrow -> Parquet -> Arrow of a single
    RecordBatch with a single column of values of each of the supported data
    types, plus some of the unsupported ones. Tests that currently fail are
    marked with either `#[should_panic]` (if they fail because of a panic) or
    `#[ignore]` (if they fail because the values don't match).

    I am comparing the RecordBatch's column's data before and after the round
    trip directly; I'm not sure whether that is appropriate, because for some
    data types the `null_bitmap` isn't matching and I'm not sure if it's
    supposed to. So I would love advice on that front, and I would love to
    know if these tests are useful!

    Closes #8330 from carols10cents/roundtrip-tests

    Lead-authored-by: Carol (Nichols || Goulding)
    Co-authored-by: Neville Dipale
    Signed-off-by: Andy Grove
---
 rust/arrow/src/compute/kernels/cast.rs |  20 +-
 rust/parquet/src/arrow/array_reader.rs | 102 +---
 rust/parquet/src/arrow/arrow_writer.rs | 413 -
 rust/parquet/src/arrow/converter.rs    |  25 +-
 4 files changed, 523 insertions(+), 37 deletions(-)

diff --git a/rust/arrow/src/compute/kernels/cast.rs b/rust/arrow/src/compute/kernels/cast.rs
index 08c6a2b..30180ca 100644
--- a/rust/arrow/src/compute/kernels/cast.rs
+++ b/rust/arrow/src/compute/kernels/cast.rs
@@ -356,11 +356,27 @@ pub fn cast(array: , to_type: ) -> Result {
         // temporal casts
         (Int32, Date32(_)) => cast_array_data::(array, to_type.clone()),
-        (Int32, Time32(_)) => cast_array_data::(array, to_type.clone()),
+        (Int32, Time32(unit)) => match unit {
+            TimeUnit::Second => {
+                cast_array_data::(array, to_type.clone())
+            }
+            TimeUnit::Millisecond => {
+                cast_array_data::(array, to_type.clone())
+            }
+            _ => unreachable!(),
+        },
         (Date32(_), Int32) => cast_array_data::(array, to_type.clone()),
         (Time32(_), Int32) => cast_array_data::(array, to_type.clone()),
         (Int64, Date64(_)) => cast_array_data::(array, to_type.clone()),
-        (Int64, Time64(_)) => cast_array_data::(array, to_type.clone()),
+        (Int64, Time64(unit)) => match unit {
+            TimeUnit::Microsecond => {
+                cast_array_data::(array, to_type.clone())
+            }
+            TimeUnit::Nanosecond => {
+                cast_array_data::(array, to_type.clone())
+            }
+            _ => unreachable!(),
+        },
         (Date64(_), Int64) => cast_array_data::(array, to_type.clone()),
         (Time64(_), Int64) => cast_array_data::(array, to_type.clone()),
         (Date32(DateUnit::Day), Date64(DateUnit::Millisecond)) => {
diff --git a/rust/parquet/src/arrow/array_reader.rs b/rust/parquet/src/arrow/array_reader.rs
index 14bf7d2..4fbc54d 100644
--- a/rust/parquet/src/arrow/array_reader.rs
+++ b/rust/parquet/src/arrow/array_reader.rs
@@ -35,9 +35,10 @@ use crate::arrow::converter::{
     BinaryArrayConverter, BinaryConverter, BoolConverter, BooleanArrayConverter,
     Converter, Date32Converter, FixedLenBinaryConverter, FixedSizeArrayConverter,
     Float32Converter, Float64Converter, Int16Converter, Int32Converter, Int64Converter,
-    Int8Converter, Int96ArrayConverter, Int96Converter, TimestampMicrosecondConverter,
-    TimestampMillisecondConverter, UInt16Converter, UInt32Converter, UInt64Converter,
-    UInt8Converter, Utf8ArrayConverter, Utf8Converter,
+    Int8Converter, Int96ArrayConverter, Int96Converter, Time32MillisecondConverter,
+    Time32SecondConverter, Time64MicrosecondConverter, Time64NanosecondConverter,
+    TimestampMicrosecondConverter, TimestampMillisecondConverter, UInt16Converter,
+    UInt32Converter, UInt64Converter, UInt8Converter, Utf8ArrayConverter, Utf8Converter,
 };
 use crate::arrow::record_reader::RecordReader;
 use crate::arrow::schema::parquet_to_arrow_field;
@@ -196,11 +197,27 @@ impl ArrayReader for PrimitiveArrayReader {
             .convert(self.record_reader.cast::()),
             _ => Err(general_err!("No conversion from parquet type to arrow type for date with unit {:?}", unit)),
         }
-        (ArrowType::Time32(_), PhysicalType::INT32) => {
[diff truncated in the original email]
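The cast.rs change above dispatches on the requested time unit instead of assuming one concrete type: `Time32` may only carry seconds or milliseconds, `Time64` only micro- or nanoseconds. A self-contained sketch of that dispatch, with illustrative enums and an error instead of the kernel's `unreachable!()` — not the arrow crate's actual types:

```rust
// Illustrative; the arrow crate has its own TimeUnit and typed arrays.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Int32 -> Time32 is a pure reinterpretation: the stored i32 values are
/// unchanged, but only the two Time32-legal units are accepted.
fn int32_to_time32(values: &[i32], unit: TimeUnit) -> Result<Vec<i32>, String> {
    match unit {
        TimeUnit::Second | TimeUnit::Millisecond => Ok(values.to_vec()),
        u => Err(format!("{:?} is not a valid Time32 unit", u)),
    }
}

fn main() {
    assert_eq!(int32_to_time32(&[1, 2, 3], TimeUnit::Second).unwrap(), vec![1, 2, 3]);
    assert!(int32_to_time32(&[1], TimeUnit::Nanosecond).is_err());
}
```

The real kernel can use `unreachable!()` because the `DataType` itself constrains which units can appear; this sketch surfaces the constraint as an error instead.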
[arrow] 06/07: ARROW-10168: [Rust] [Parquet] Schema roundtrip - use Arrow schema from Parquet metadata when available
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch rust-parquet-arrow-writer
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit e1b613b1ec9239bf58d5882081aeeb75fa06c3d3
Author: Carol (Nichols || Goulding)
AuthorDate: Thu Oct 8 00:16:42 2020 +0200

    ARROW-10168: [Rust] [Parquet] Schema roundtrip - use Arrow schema from
    Parquet metadata when available

    @nevi-me This is one commit on top of
    https://github.com/apache/arrow/pull/8330 that I'm opening to get some
    feedback from you about whether this will help with ARROW-10168. I *think*
    this brings the Rust implementation more in line with C++, but I'm not
    certain.

    I tried removing the `#[ignore]` attributes from the `LargeArray` and
    `LargeUtf8` tests, but they're still failing because the schemas don't
    match yet — it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638)
    will need to change as well. That `build_array_reader` function's code
    looks very similar to the code I've changed here; is there a possibility
    for the code to be shared, or is there a reason they're separate?

    Closes #8354 from carols10cents/schema-roundtrip

    Lead-authored-by: Carol (Nichols || Goulding)
    Co-authored-by: Neville Dipale
    Signed-off-by: Neville Dipale
---
 rust/arrow/src/ipc/convert.rs           |   4 +-
 rust/parquet/src/arrow/array_reader.rs  | 106 +
 rust/parquet/src/arrow/arrow_reader.rs  |  36 --
 rust/parquet/src/arrow/arrow_writer.rs  |   4 +-
 rust/parquet/src/arrow/converter.rs     |  52 +++-
 rust/parquet/src/arrow/mod.rs           |   3 +-
 rust/parquet/src/arrow/record_reader.rs |   1 +
 rust/parquet/src/arrow/schema.rs        | 205 +++-
 8 files changed, 338 insertions(+), 73 deletions(-)

diff --git a/rust/arrow/src/ipc/convert.rs b/rust/arrow/src/ipc/convert.rs
index 8f429bf..a02b6c4 100644
--- a/rust/arrow/src/ipc/convert.rs
+++ b/rust/arrow/src/ipc/convert.rs
@@ -334,7 +334,9 @@ pub(crate) fn build_field<'a: 'b, 'b>(
     let mut field_builder = ipc::FieldBuilder::new(fbb);
     field_builder.add_name(fb_field_name);
-    fb_dictionary.map(|dictionary| field_builder.add_dictionary(dictionary));
+    if let Some(dictionary) = fb_dictionary {
+        field_builder.add_dictionary(dictionary)
+    }
     field_builder.add_type_type(field_type.type_type);
     field_builder.add_nullable(field.is_nullable());
     match field_type.children {
diff --git a/rust/parquet/src/arrow/array_reader.rs b/rust/parquet/src/arrow/array_reader.rs
index 4fbc54d..40df284 100644
--- a/rust/parquet/src/arrow/array_reader.rs
+++ b/rust/parquet/src/arrow/array_reader.rs
@@ -29,16 +29,20 @@ use arrow::array::{
     Int16BufferBuilder, StructArray,
 };
 use arrow::buffer::{Buffer, MutableBuffer};
-use arrow::datatypes::{DataType as ArrowType, DateUnit, Field, IntervalUnit, TimeUnit};
+use arrow::datatypes::{
+    DataType as ArrowType, DateUnit, Field, IntervalUnit, Schema, TimeUnit,
+};

 use crate::arrow::converter::{
     BinaryArrayConverter, BinaryConverter, BoolConverter, BooleanArrayConverter,
     Converter, Date32Converter, FixedLenBinaryConverter, FixedSizeArrayConverter,
     Float32Converter, Float64Converter, Int16Converter, Int32Converter, Int64Converter,
-    Int8Converter, Int96ArrayConverter, Int96Converter, Time32MillisecondConverter,
-    Time32SecondConverter, Time64MicrosecondConverter, Time64NanosecondConverter,
-    TimestampMicrosecondConverter, TimestampMillisecondConverter, UInt16Converter,
-    UInt32Converter, UInt64Converter, UInt8Converter, Utf8ArrayConverter, Utf8Converter,
+    Int8Converter, Int96ArrayConverter, Int96Converter, LargeBinaryArrayConverter,
+    LargeBinaryConverter, LargeUtf8ArrayConverter, LargeUtf8Converter,
+    Time32MillisecondConverter, Time32SecondConverter, Time64MicrosecondConverter,
+    Time64NanosecondConverter, TimestampMicrosecondConverter,
+    TimestampMillisecondConverter, UInt16Converter, UInt32Converter, UInt64Converter,
+    UInt8Converter, Utf8ArrayConverter, Utf8Converter,
 };
 use crate::arrow::record_reader::RecordReader;
 use crate::arrow::schema::parquet_to_arrow_field;
@@ -612,6 +616,7 @@ impl ArrayReader for StructArrayReader {
 /// Create array reader from parquet schema, column indices, and parquet file reader.
 pub fn build_array_reader(
     parquet_schema: SchemaDescPtr,
+    arrow_schema: Schema,
     column_indices: T,
     file_reader: Rc,
 ) -> Result>
@@ -650,13 +655,19 @@ where
         fields: filtered_root_fields,
     };

-    ArrayReaderBuilder::new(Rc::new(proj), Rc::new(leaves), file_reader)
-        .build_array_reader()
+    ArrayReaderBuilder::new(
+        Rc::new(proj),
+        Rc::new(arrow_schema),
+        Rc::new(leaves),
+        file_reader,
+    )
+    .build_array_reader()
 }
[diff truncated in the original email]
[arrow] 03/07: ARROW-10095: [Rust] Update rust-parquet-arrow-writer branch's encode_arrow_schema with ipc changes
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch rust-parquet-arrow-writer in repository https://gitbox.apache.org/repos/asf/arrow.git commit ebe81729da4e983ec0f580ea0582059c0237d41c Author: Carol (Nichols || Goulding) AuthorDate: Fri Sep 25 17:54:11 2020 +0200 ARROW-10095: [Rust] Update rust-parquet-arrow-writer branch's encode_arrow_schema with ipc changes Note that this PR is deliberately filed against the rust-parquet-arrow-writer branch, not master!! Hi! I'm looking to help out with the rust-parquet-arrow-writer branch, and I just pulled it down and it wasn't compiling because in 75f804efbfe367175fef5a2238d9cd2d30ed3afe, `schema_to_bytes` was changed to take `IpcWriteOptions` and to return `EncodedData`. This updates `encode_arrow_schema` to use those changes, which should get this branch compiling and passing tests again. I'm kind of guessing which JIRA ticket this should be associated with; honestly I think this commit can just be squashed with https://github.com/apache/arrow/commit/8f0ed91469f2e569472edaa3b69ffde051088555 next time this branch gets rebased. Please let me know if I should change anything, I'm happy to! 
Closes #8274 from carols10cents/update-with-ipc-changes Authored-by: Carol (Nichols || Goulding) Signed-off-by: Neville Dipale --- rust/parquet/src/arrow/arrow_writer.rs | 2 +- rust/parquet/src/arrow/schema.rs | 8 +--- 2 files changed, 6 insertions(+), 4 deletions(-) diff --git a/rust/parquet/src/arrow/arrow_writer.rs b/rust/parquet/src/arrow/arrow_writer.rs index 1ca8d50..e0ad207 100644 --- a/rust/parquet/src/arrow/arrow_writer.rs +++ b/rust/parquet/src/arrow/arrow_writer.rs @@ -22,7 +22,7 @@ use std::rc::Rc; use arrow::array as arrow_array; use arrow::datatypes::{DataType as ArrowDataType, SchemaRef}; use arrow::record_batch::RecordBatch; -use arrow_array::Array; +use arrow_array::{Array, PrimitiveArrayOps}; use super::schema::add_encoded_arrow_schema_to_metadata; use crate::column::writer::ColumnWriter; diff --git a/rust/parquet/src/arrow/schema.rs b/rust/parquet/src/arrow/schema.rs index d4cfe1f..d5a0ff9 100644 --- a/rust/parquet/src/arrow/schema.rs +++ b/rust/parquet/src/arrow/schema.rs @@ -27,6 +27,7 @@ use std::collections::{HashMap, HashSet}; use std::rc::Rc; use arrow::datatypes::{DataType, DateUnit, Field, Schema, TimeUnit}; +use arrow::ipc::writer; use crate::basic::{LogicalType, Repetition, Type as PhysicalType}; use crate::errors::{ParquetError::ArrowError, Result}; @@ -120,15 +121,16 @@ fn get_arrow_schema_from_metadata(encoded_meta: ) -> Option { /// Encodes the Arrow schema into the IPC format, and base64 encodes it fn encode_arrow_schema(schema: ) -> String { -let mut serialized_schema = arrow::ipc::writer::schema_to_bytes(); +let options = writer::IpcWriteOptions::default(); +let mut serialized_schema = arrow::ipc::writer::schema_to_bytes(, ); // manually prepending the length to the schema as arrow uses the legacy IPC format // TODO: change after addressing ARROW-9777 -let schema_len = serialized_schema.len(); +let schema_len = serialized_schema.ipc_message.len(); let mut len_prefix_schema = Vec::with_capacity(schema_len + 8); 
len_prefix_schema.append(&mut vec![255u8, 255, 255, 255]); len_prefix_schema.append((schema_len as u32).to_le_bytes().to_vec().as_mut()); -len_prefix_schema.append(&mut serialized_schema); +len_prefix_schema.append(&mut serialized_schema.ipc_message); base64::encode(&len_prefix_schema) }
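The legacy length-prefix framing this patch preserves can be sketched in plain Rust. This is a simplified stand-in: `ipc_message` plays the role of the patched `EncodedData::ipc_message` field, and the final base64 step is omitted to keep the sketch dependency-free:

```rust
// Legacy IPC framing: a 0xFFFFFFFF continuation marker, then the message
// length as a little-endian u32, then the message itself (to be revisited
// under ARROW-9777, per the TODO in the patch).
fn len_prefix(ipc_message: &[u8]) -> Vec<u8> {
    let schema_len = ipc_message.len();
    let mut out = Vec::with_capacity(schema_len + 8);
    out.extend_from_slice(&[255u8, 255, 255, 255]);
    out.extend_from_slice(&(schema_len as u32).to_le_bytes());
    out.extend_from_slice(ipc_message);
    out
}

fn main() {
    let framed = len_prefix(b"schema");
    assert_eq!(framed[..4].to_vec(), vec![255u8, 255, 255, 255]);
    assert_eq!(framed[4..8].to_vec(), 6u32.to_le_bytes().to_vec());
    assert_eq!(framed[8..].to_vec(), b"schema".to_vec());
    // The real encode_arrow_schema then base64-encodes the framed bytes.
}
```

The 8-byte capacity headroom in the patch corresponds exactly to the 4-byte marker plus the 4-byte length.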
[arrow] 01/07: ARROW-8289: [Rust] Parquet Arrow writer with nested support
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch rust-parquet-arrow-writer in repository https://gitbox.apache.org/repos/asf/arrow.git commit 4e6a836b42b064a50582bcc9d6cfca2b7e77a46a Author: Neville Dipale AuthorDate: Thu Aug 13 18:47:34 2020 +0200 ARROW-8289: [Rust] Parquet Arrow writer with nested support **Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable. ___ This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet): * writing primitives except for booleans and binary * nested structs * null values (via definition levels) It does not yet support: - Boolean arrays (have to be handled differently from numeric values) - Binary arrays - Dictionary arrays - Union arrays (are they even possible?) I have only added a test by creating a nested schema, which I tested on pyarrow. ```jupyter # schema of test_complex.parquet a: int32 not null b: int32 c: struct> not null child 0, d: double child 1, e: struct child 0, f: float ``` This PR potentially addresses: * https://issues.apache.org/jira/browse/ARROW-8289 * https://issues.apache.org/jira/browse/ARROW-8423 * https://issues.apache.org/jira/browse/ARROW-8424 * https://issues.apache.org/jira/browse/ARROW-8425 And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above. ___ **Help Needed** I'm implementing the definition and repetition levels from first principles, based on an old Parquet post on the Twitter engineering blog. 
It's likely that I'm not getting some concepts correct, so I would appreciate help with: * Checking if my logic is correct * Guidance or suggestions on how to more efficiently extract levels from arrays * Adding tests - I suspect we might need a lot of tests, so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file I also don't know if the various encoding levels (dictionary, RLE, etc.) and compression levels are applied automagically, or if that'd be something we need to explicitly enable. CC @sunchao @sadikovi @andygrove @paddyhoran Might be of interest to @mcassels @maxburke Closes #7319 from nevi-me/arrow-parquet-writer Lead-authored-by: Neville Dipale Co-authored-by: Max Burke Co-authored-by: Andy Grove Co-authored-by: Max Burke Signed-off-by: Neville Dipale --- rust/parquet/src/arrow/arrow_writer.rs | 682 + rust/parquet/src/arrow/mod.rs | 5 +- rust/parquet/src/schema/types.rs | 6 +- 3 files changed, 691 insertions(+), 2 deletions(-) diff --git a/rust/parquet/src/arrow/arrow_writer.rs b/rust/parquet/src/arrow/arrow_writer.rs new file mode 100644 index 000..0c1c490 --- /dev/null +++ b/rust/parquet/src/arrow/arrow_writer.rs @@ -0,0 +1,682 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. 
See the License for the +// specific language governing permissions and limitations +// under the License. + +//! Contains writer which writes arrow data into parquet data. + +use std::rc::Rc; + +use arrow::array as arrow_array; +use arrow::datatypes::{DataType as ArrowDataType, SchemaRef}; +use arrow::record_batch::RecordBatch; +use arrow_array::Array; + +use crate::column::writer::ColumnWriter; +use crate::errors::{ParquetError, Result}; +use crate::file::properties::WriterProperties; +use crate::{ +data_type::*, +file::writer::{FileWriter, ParquetWriter, RowGroupWriter, SerializedFileWriter}, +}; + +/// Arrow writer +/// +/// Writes Arrow `RecordBatch`es to a Parquet writer +pub struct ArrowWriter { +/// Underlying Parquet writer +writer: SerializedFileWriter, +/// A copy of the Arrow schema. +/// +/// The schema is used to verify that each record batch written has the correct schema +arrow_schema: SchemaRef, +} + +impl ArrowWriter { +/// Try to create a new Arrow writer +/// +/// The writer will
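The definition-level logic the commit message asks for help with can be illustrated for the simplest case, a flat nullable column. This is a sketch of the Dremel encoding with a hypothetical helper, not the branch's actual implementation:

```rust
// For a flat, nullable column the definition level is 1 where a value is
// present and 0 where it is null; repetition levels are all 0 because
// nothing repeats at the top level. Nesting adds one level per nullable
// or repeated ancestor.
fn definition_levels<T>(values: &[Option<T>]) -> Vec<i16> {
    values
        .iter()
        .map(|v| if v.is_some() { 1 } else { 0 })
        .collect()
}

fn main() {
    let column = [Some(1i32), None, Some(3), None, Some(5)];
    assert_eq!(definition_levels(&column), vec![1, 0, 1, 0, 1]);
}
```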
[arrow] 07/07: ARROW-10225: [Rust] [Parquet] Fix null comparison in roundtrip
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch rust-parquet-arrow-writer in repository https://gitbox.apache.org/repos/asf/arrow.git commit f70e6db575cce746d2a4cd1c9e5a99629c27926c Author: Neville Dipale AuthorDate: Thu Oct 8 17:08:59 2020 +0200 ARROW-10225: [Rust] [Parquet] Fix null comparison in roundtrip Closes #8388 from nevi-me/ARROW-10225 Authored-by: Neville Dipale Signed-off-by: Neville Dipale --- rust/parquet/src/arrow/arrow_writer.rs | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/rust/parquet/src/arrow/arrow_writer.rs b/rust/parquet/src/arrow/arrow_writer.rs index 40e2553..a17e424 100644 --- a/rust/parquet/src/arrow/arrow_writer.rs +++ b/rust/parquet/src/arrow/arrow_writer.rs @@ -724,7 +724,11 @@ mod tests { assert_eq!(expected_data.offset(), actual_data.offset()); assert_eq!(expected_data.buffers(), actual_data.buffers()); assert_eq!(expected_data.child_data(), actual_data.child_data()); -assert_eq!(expected_data.null_bitmap(), actual_data.null_bitmap()); +// Null counts should be the same, not necessarily bitmaps +// A null bitmap is optional if an array has no nulls +if expected_data.null_count() != 0 { +assert_eq!(expected_data.null_bitmap(), actual_data.null_bitmap()); +} } } @@ -1001,7 +1005,7 @@ mod tests { } #[test] -#[ignore] // Binary support isn't correct yet - null_bitmap doesn't match +#[ignore] // Binary support isn't correct yet - buffers don't match fn binary_single_column() { let one_vec: Vec = (0..SMALL_SIZE as u8).collect(); let many_vecs: Vec<_> = std::iter::repeat(one_vec).take(SMALL_SIZE).collect(); @@ -1026,7 +1030,6 @@ mod tests { } #[test] -#[ignore] // String support isn't correct yet - null_bitmap doesn't match fn string_single_column() { let raw_values: Vec<_> = (0..SMALL_SIZE).map(|i| i.to_string()).collect(); let raw_strs = raw_values.iter().map(|s| s.as_str()); @@ -1035,7 +1038,6 @@ mod tests { } #[test] -#[ignore] // Large string support isn't 
correct yet - null_bitmap doesn't match fn large_string_single_column() { let raw_values: Vec<_> = (0..SMALL_SIZE).map(|i| i.to_string()).collect(); let raw_strs = raw_values.iter().map(|s| s.as_str());
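The reasoning behind the fix — an array with no nulls may or may not materialize a validity bitmap, so bitmaps are only meaningfully comparable when the null count is non-zero — can be sketched with stand-in types (not the real `arrow` crate's `ArrayData`):

```rust
// Stand-in for arrow's ArrayData: an array with no nulls may either omit
// its validity bitmap or carry an all-ones one, so comparing the
// Option<bitmap> directly reports a difference between logically equal arrays.
#[derive(Debug)]
struct ArrayData {
    null_count: usize,
    null_bitmap: Option<Vec<u8>>,
}

fn nulls_equal(a: &ArrayData, b: &ArrayData) -> bool {
    if a.null_count != b.null_count {
        return false;
    }
    // Only compare bitmaps when nulls actually exist; a null bitmap is
    // optional when null_count == 0.
    a.null_count == 0 || a.null_bitmap == b.null_bitmap
}

fn main() {
    let materialized = ArrayData { null_count: 0, null_bitmap: Some(vec![0xFF]) };
    let omitted = ArrayData { null_count: 0, null_bitmap: None };
    assert!(nulls_equal(&materialized, &omitted));
}
```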
[arrow] 05/07: ARROW-10191: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch rust-parquet-arrow-writer in repository https://gitbox.apache.org/repos/asf/arrow.git commit de95e847bc82cbd28b3963edc843baaa10bb99ab Author: Carol (Nichols || Goulding) AuthorDate: Tue Oct 6 08:44:26 2020 -0600 ARROW-10191: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes Note that this PR goes to the rust-parquet-arrow-writer branch, not master. Inspired by tests in cpp/src/parquet/arrow/arrow_reader_writer_test.cc These perform round-trip Arrow -> Parquet -> Arrow of a single RecordBatch with a single column of values of each of the supported data types and some of the unsupported ones. Tests that currently fail are either marked with `#[should_panic]` (if the reason they fail is because of a panic) or `#[ignore]` (if the reason they fail is because the values don't match). I am comparing the RecordBatch's column's data before and after the round trip directly; I'm not sure whether this is appropriate because for some data types, the `null_bitmap` isn't matching and I'm not sure if it's supposed to or not. So I would love advice on that front, and I would love to know if these tests are useful or not! 
Closes #8330 from carols10cents/roundtrip-tests Lead-authored-by: Carol (Nichols || Goulding) Co-authored-by: Neville Dipale Signed-off-by: Andy Grove --- rust/arrow/src/compute/kernels/cast.rs | 20 +- rust/parquet/src/arrow/array_reader.rs | 102 +--- rust/parquet/src/arrow/arrow_writer.rs | 413 - rust/parquet/src/arrow/converter.rs| 25 +- 4 files changed, 523 insertions(+), 37 deletions(-) diff --git a/rust/arrow/src/compute/kernels/cast.rs b/rust/arrow/src/compute/kernels/cast.rs index 08c6a2b..30180ca 100644 --- a/rust/arrow/src/compute/kernels/cast.rs +++ b/rust/arrow/src/compute/kernels/cast.rs @@ -356,11 +356,27 @@ pub fn cast(array: , to_type: ) -> Result { // temporal casts (Int32, Date32(_)) => cast_array_data::(array, to_type.clone()), -(Int32, Time32(_)) => cast_array_data::(array, to_type.clone()), +(Int32, Time32(unit)) => match unit { +TimeUnit::Second => { +cast_array_data::(array, to_type.clone()) +} +TimeUnit::Millisecond => { +cast_array_data::(array, to_type.clone()) +} +_ => unreachable!(), +}, (Date32(_), Int32) => cast_array_data::(array, to_type.clone()), (Time32(_), Int32) => cast_array_data::(array, to_type.clone()), (Int64, Date64(_)) => cast_array_data::(array, to_type.clone()), -(Int64, Time64(_)) => cast_array_data::(array, to_type.clone()), +(Int64, Time64(unit)) => match unit { +TimeUnit::Microsecond => { +cast_array_data::(array, to_type.clone()) +} +TimeUnit::Nanosecond => { +cast_array_data::(array, to_type.clone()) +} +_ => unreachable!(), +}, (Date64(_), Int64) => cast_array_data::(array, to_type.clone()), (Time64(_), Int64) => cast_array_data::(array, to_type.clone()), (Date32(DateUnit::Day), Date64(DateUnit::Millisecond)) => { diff --git a/rust/parquet/src/arrow/array_reader.rs b/rust/parquet/src/arrow/array_reader.rs index 14bf7d2..4fbc54d 100644 --- a/rust/parquet/src/arrow/array_reader.rs +++ b/rust/parquet/src/arrow/array_reader.rs @@ -35,9 +35,10 @@ use crate::arrow::converter::{ BinaryArrayConverter, BinaryConverter, 
BoolConverter, BooleanArrayConverter, Converter, Date32Converter, FixedLenBinaryConverter, FixedSizeArrayConverter, Float32Converter, Float64Converter, Int16Converter, Int32Converter, Int64Converter, -Int8Converter, Int96ArrayConverter, Int96Converter, TimestampMicrosecondConverter, -TimestampMillisecondConverter, UInt16Converter, UInt32Converter, UInt64Converter, -UInt8Converter, Utf8ArrayConverter, Utf8Converter, +Int8Converter, Int96ArrayConverter, Int96Converter, Time32MillisecondConverter, +Time32SecondConverter, Time64MicrosecondConverter, Time64NanosecondConverter, +TimestampMicrosecondConverter, TimestampMillisecondConverter, UInt16Converter, +UInt32Converter, UInt64Converter, UInt8Converter, Utf8ArrayConverter, Utf8Converter, }; use crate::arrow::record_reader::RecordReader; use crate::arrow::schema::parquet_to_arrow_field; @@ -196,11 +197,27 @@ impl ArrayReader for PrimitiveArrayReader { .convert(self.record_reader.cast::()), _ => Err(general_err!("No conversion from parquet type to arrow type for date with unit {:?}", unit)), } -(ArrowType::Time32(_), PhysicalType::INT32) => { -
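The cast change above replaces a wildcard `Time32(_)`/`Time64(_)` arm with a match on the time unit, so each bit width only casts through units it can represent. A simplified sketch with stand-in enums (not the real `arrow` `DataType`) shows the dispatch shape:

```rust
#[derive(Debug)]
enum TimeUnit { Second, Millisecond, Microsecond, Nanosecond }

// Dispatch on the unit instead of a single wildcard arm, so a 32-bit time
// type is only ever produced for second/millisecond precision; the patch
// marks micro/nanoseconds as unreachable!() for Time32, rejected here.
fn time32_cast_target(unit: &TimeUnit) -> Result<&'static str, String> {
    match unit {
        TimeUnit::Second => Ok("Time32SecondType"),
        TimeUnit::Millisecond => Ok("Time32MillisecondType"),
        other => Err(format!("no 32-bit time type for {:?}", other)),
    }
}

fn main() {
    assert_eq!(time32_cast_target(&TimeUnit::Second), Ok("Time32SecondType"));
    assert!(time32_cast_target(&TimeUnit::Nanosecond).is_err());
}
```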
[arrow] 06/07: ARROW-10168: [Rust] [Parquet] Schema roundtrip - use Arrow schema from Parquet metadata when available
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch rust-parquet-arrow-writer in repository https://gitbox.apache.org/repos/asf/arrow.git commit e1b613b1ec9239bf58d5882081aeeb75fa06c3d3 Author: Carol (Nichols || Goulding) AuthorDate: Thu Oct 8 00:16:42 2020 +0200 ARROW-10168: [Rust] [Parquet] Schema roundtrip - use Arrow schema from Parquet metadata when available @nevi-me This is one commit on top of https://github.com/apache/arrow/pull/8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain. I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well. That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate? 
Closes #8354 from carols10cents/schema-roundtrip Lead-authored-by: Carol (Nichols || Goulding) Co-authored-by: Neville Dipale Signed-off-by: Neville Dipale --- rust/arrow/src/ipc/convert.rs | 4 +- rust/parquet/src/arrow/array_reader.rs | 106 + rust/parquet/src/arrow/arrow_reader.rs | 36 -- rust/parquet/src/arrow/arrow_writer.rs | 4 +- rust/parquet/src/arrow/converter.rs | 52 +++- rust/parquet/src/arrow/mod.rs | 3 +- rust/parquet/src/arrow/record_reader.rs | 1 + rust/parquet/src/arrow/schema.rs| 205 +++- 8 files changed, 338 insertions(+), 73 deletions(-) diff --git a/rust/arrow/src/ipc/convert.rs b/rust/arrow/src/ipc/convert.rs index 8f429bf..a02b6c4 100644 --- a/rust/arrow/src/ipc/convert.rs +++ b/rust/arrow/src/ipc/convert.rs @@ -334,7 +334,9 @@ pub(crate) fn build_field<'a: 'b, 'b>( let mut field_builder = ipc::FieldBuilder::new(fbb); field_builder.add_name(fb_field_name); -fb_dictionary.map(|dictionary| field_builder.add_dictionary(dictionary)); +if let Some(dictionary) = fb_dictionary { +field_builder.add_dictionary(dictionary) +} field_builder.add_type_type(field_type.type_type); field_builder.add_nullable(field.is_nullable()); match field_type.children { diff --git a/rust/parquet/src/arrow/array_reader.rs b/rust/parquet/src/arrow/array_reader.rs index 4fbc54d..40df284 100644 --- a/rust/parquet/src/arrow/array_reader.rs +++ b/rust/parquet/src/arrow/array_reader.rs @@ -29,16 +29,20 @@ use arrow::array::{ Int16BufferBuilder, StructArray, }; use arrow::buffer::{Buffer, MutableBuffer}; -use arrow::datatypes::{DataType as ArrowType, DateUnit, Field, IntervalUnit, TimeUnit}; +use arrow::datatypes::{ +DataType as ArrowType, DateUnit, Field, IntervalUnit, Schema, TimeUnit, +}; use crate::arrow::converter::{ BinaryArrayConverter, BinaryConverter, BoolConverter, BooleanArrayConverter, Converter, Date32Converter, FixedLenBinaryConverter, FixedSizeArrayConverter, Float32Converter, Float64Converter, Int16Converter, Int32Converter, Int64Converter, -Int8Converter, 
Int96ArrayConverter, Int96Converter, Time32MillisecondConverter, -Time32SecondConverter, Time64MicrosecondConverter, Time64NanosecondConverter, -TimestampMicrosecondConverter, TimestampMillisecondConverter, UInt16Converter, -UInt32Converter, UInt64Converter, UInt8Converter, Utf8ArrayConverter, Utf8Converter, +Int8Converter, Int96ArrayConverter, Int96Converter, LargeBinaryArrayConverter, +LargeBinaryConverter, LargeUtf8ArrayConverter, LargeUtf8Converter, +Time32MillisecondConverter, Time32SecondConverter, Time64MicrosecondConverter, +Time64NanosecondConverter, TimestampMicrosecondConverter, +TimestampMillisecondConverter, UInt16Converter, UInt32Converter, UInt64Converter, +UInt8Converter, Utf8ArrayConverter, Utf8Converter, }; use crate::arrow::record_reader::RecordReader; use crate::arrow::schema::parquet_to_arrow_field; @@ -612,6 +616,7 @@ impl ArrayReader for StructArrayReader { /// Create array reader from parquet schema, column indices, and parquet file reader. pub fn build_array_reader( parquet_schema: SchemaDescPtr, +arrow_schema: Schema, column_indices: T, file_reader: Rc, ) -> Result> @@ -650,13 +655,19 @@ where fields: filtered_root_fields, }; -ArrayReaderBuilder::new(Rc::new(proj), Rc::new(leaves), file_reader) -.build_array_reader() +ArrayReaderBuilder::new( +Rc::new(proj), +Rc::new(arrow_schema), +Rc::new(leaves), +file_reader, +) +.build_array_reader() }
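One small change in `ipc/convert.rs` above replaces `Option::map` used purely for a side effect with `if let`. A minimal illustration of the pattern, using a hypothetical `add_optional` helper in place of the field-builder call:

```rust
// Hypothetical helper mirroring the shape of the fixed build_field code:
// `if let` replaces `maybe.map(|d| sink.push(d))`, which used Option::map
// only for its side effect and discarded the resulting Option<()>.
fn add_optional(sink: &mut Vec<i32>, maybe: Option<i32>) {
    if let Some(dictionary) = maybe {
        sink.push(dictionary);
    }
}

fn main() {
    let mut sink = Vec::new();
    add_optional(&mut sink, Some(3));
    add_optional(&mut sink, None);
    assert_eq!(sink, vec![3]);
}
```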
[arrow] 02/07: ARROW-8423: [Rust] [Parquet] Serialize Arrow schema metadata
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch rust-parquet-arrow-writer in repository https://gitbox.apache.org/repos/asf/arrow.git commit 12b11b29293b6ace9cb99cffa93fd1b74b4849be Author: Neville Dipale AuthorDate: Tue Aug 18 18:39:37 2020 +0200 ARROW-8423: [Rust] [Parquet] Serialize Arrow schema metadata This will allow preserving Arrow-specific metadata when writing or reading Parquet files created from C++ or Rust. If the schema can't be deserialised, the normal Parquet > Arrow schema conversion is performed. Closes #7917 from nevi-me/ARROW-8243 Authored-by: Neville Dipale Signed-off-by: Neville Dipale --- rust/parquet/Cargo.toml| 3 +- rust/parquet/src/arrow/arrow_writer.rs | 27 ++- rust/parquet/src/arrow/mod.rs | 4 + rust/parquet/src/arrow/schema.rs | 306 - rust/parquet/src/file/properties.rs| 6 +- 5 files changed, 290 insertions(+), 56 deletions(-) diff --git a/rust/parquet/Cargo.toml b/rust/parquet/Cargo.toml index 50d7c34..60e43c9 100644 --- a/rust/parquet/Cargo.toml +++ b/rust/parquet/Cargo.toml @@ -40,6 +40,7 @@ zstd = { version = "0.5", optional = true } chrono = "0.4" num-bigint = "0.3" arrow = { path = "../arrow", version = "2.0.0-SNAPSHOT", optional = true } +base64 = { version = "*", optional = true } [dev-dependencies] rand = "0.7" @@ -52,4 +53,4 @@ arrow = { path = "../arrow", version = "2.0.0-SNAPSHOT" } serde_json = { version = "1.0", features = ["preserve_order"] } [features] -default = ["arrow", "snap", "brotli", "flate2", "lz4", "zstd"] +default = ["arrow", "snap", "brotli", "flate2", "lz4", "zstd", "base64"] diff --git a/rust/parquet/src/arrow/arrow_writer.rs b/rust/parquet/src/arrow/arrow_writer.rs index 0c1c490..1ca8d50 100644 --- a/rust/parquet/src/arrow/arrow_writer.rs +++ b/rust/parquet/src/arrow/arrow_writer.rs @@ -24,6 +24,7 @@ use arrow::datatypes::{DataType as ArrowDataType, SchemaRef}; use arrow::record_batch::RecordBatch; use arrow_array::Array; +use 
super::schema::add_encoded_arrow_schema_to_metadata; use crate::column::writer::ColumnWriter; use crate::errors::{ParquetError, Result}; use crate::file::properties::WriterProperties; @@ -53,17 +54,17 @@ impl ArrowWriter { pub fn try_new( writer: W, arrow_schema: SchemaRef, -props: Option>, +props: Option, ) -> Result { let schema = crate::arrow::arrow_to_parquet_schema(_schema)?; -let props = match props { -Some(props) => props, -None => Rc::new(WriterProperties::builder().build()), -}; +// add serialized arrow schema +let mut props = props.unwrap_or_else(|| WriterProperties::builder().build()); +add_encoded_arrow_schema_to_metadata(_schema, props); + let file_writer = SerializedFileWriter::new( writer.try_clone()?, schema.root_schema_ptr(), -props, +Rc::new(props), )?; Ok(Self { @@ -495,7 +496,7 @@ mod tests { use arrow::record_batch::{RecordBatch, RecordBatchReader}; use crate::arrow::{ArrowReader, ParquetFileArrowReader}; -use crate::file::reader::SerializedFileReader; +use crate::file::{metadata::KeyValue, reader::SerializedFileReader}; use crate::util::test_common::get_temp_file; #[test] @@ -584,7 +585,7 @@ mod tests { ) .unwrap(); -let mut file = get_temp_file("test_arrow_writer.parquet", &[]); +let mut file = get_temp_file("test_arrow_writer_binary.parquet", &[]); let mut writer = ArrowWriter::try_new(file.try_clone().unwrap(), Arc::new(schema), None) .unwrap(); @@ -674,8 +675,16 @@ mod tests { ) .unwrap(); +let props = WriterProperties::builder() +.set_key_value_metadata(Some(vec![KeyValue { +key: "test_key".to_string(), +value: Some("test_value".to_string()), +}])) +.build(); + let file = get_temp_file("test_arrow_writer_complex.parquet", &[]); -let mut writer = ArrowWriter::try_new(file, Arc::new(schema), None).unwrap(); +let mut writer = +ArrowWriter::try_new(file, Arc::new(schema), Some(props)).unwrap(); writer.write().unwrap(); writer.close().unwrap(); } diff --git a/rust/parquet/src/arrow/mod.rs b/rust/parquet/src/arrow/mod.rs index 8499481..2b012fb 
100644 --- a/rust/parquet/src/arrow/mod.rs +++ b/rust/parquet/src/arrow/mod.rs @@ -58,6 +58,10 @@ pub mod schema; pub use self::arrow_reader::ArrowReader; pub use self::arrow_reader::ParquetFileArrowReader; +pub use self::arrow_writer::ArrowWriter; pub use self::schema::{ arrow_to_parquet_schema, parquet_to_arrow_schema, parquet_to_arrow_schema_by_columns, }; + +/// Schema metadata key used to store
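The metadata round-trip this commit enables — store the base64-encoded IPC schema under a well-known metadata key, and fall back to the normal Parquet-to-Arrow conversion when it is absent or can't be deserialized — can be sketched with strings standing in for real schemas. The `ARROW:schema` key matches the one the Arrow implementations use for this purpose; the decode helper below is hypothetical:

```rust
use std::collections::HashMap;

// Stand-in for base64 + IPC schema decoding; fails on empty input.
fn decode_ipc_schema(encoded: &str) -> Option<String> {
    if encoded.is_empty() { None } else { Some(format!("decoded:{}", encoded)) }
}

fn schema_from_metadata(meta: &HashMap<String, String>) -> String {
    meta.get("ARROW:schema") // key under which the encoded schema is stored
        .and_then(|encoded| decode_ipc_schema(encoded))
        // If absent or undecodable, fall back to Parquet -> Arrow conversion.
        .unwrap_or_else(|| "schema converted from Parquet".to_string())
}

fn main() {
    let mut meta = HashMap::new();
    meta.insert("ARROW:schema".to_string(), "AAAA".to_string());
    assert_eq!(schema_from_metadata(&meta), "decoded:AAAA");
    assert_eq!(schema_from_metadata(&HashMap::new()), "schema converted from Parquet");
}
```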
[arrow] 04/07: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types
This is an automated email from the ASF dual-hosted git repository. nevime pushed a commit to branch rust-parquet-arrow-writer in repository https://gitbox.apache.org/repos/asf/arrow.git commit 58855a888438fe2063db8499659d83fe1bb91b66 Author: Carol (Nichols || Goulding) AuthorDate: Sat Oct 3 02:34:38 2020 +0200 ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types In this commit, I: - Extracted a `build_field` function for some code shared between `schema_to_fb` and `schema_to_fb_offset` that needed to change - Uncommented the dictionary field from the Arrow schema roundtrip test and add a dictionary field to the IPC roundtrip test - If a field is a dictionary field, call `add_dictionary` with the dictionary field information on the flatbuffer field, building the dictionary as [the C++ code does][cpp-dictionary] and describe with the same comment - When getting the field type for a dictionary field, use the `value_type` as [the C++ code does][cpp-value-type] and describe with the same comment The tests pass because the Parquet -> Arrow conversion for dictionaries is [already supported][parquet-to-arrow]. 
[cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440
[cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667
[parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127

Closes #8291 from carols10cents/rust-parquet-arrow-writer

Authored-by: Carol (Nichols || Goulding)
Signed-off-by: Neville Dipale
---
 rust/arrow/src/datatypes.rs      |   4 +-
 rust/arrow/src/ipc/convert.rs    | 105 ++-
 rust/parquet/src/arrow/schema.rs |  20
 3 files changed, 93 insertions(+), 36 deletions(-)

diff --git a/rust/arrow/src/datatypes.rs b/rust/arrow/src/datatypes.rs
index 0d05f82..c647af6 100644
--- a/rust/arrow/src/datatypes.rs
+++ b/rust/arrow/src/datatypes.rs
@@ -189,8 +189,8 @@ pub struct Field {
     name: String,
     data_type: DataType,
     nullable: bool,
-    dict_id: i64,
-    dict_is_ordered: bool,
+    pub(crate) dict_id: i64,
+    pub(crate) dict_is_ordered: bool,
 }
 
 pub trait ArrowNativeType:

diff --git a/rust/arrow/src/ipc/convert.rs b/rust/arrow/src/ipc/convert.rs
index 7a5795d..8f429bf 100644
--- a/rust/arrow/src/ipc/convert.rs
+++ b/rust/arrow/src/ipc/convert.rs
@@ -34,18 +34,8 @@ pub fn schema_to_fb(schema: &Schema) -> FlatBufferBuilder {
     let mut fields = vec![];
     for field in schema.fields() {
-        let fb_field_name = fbb.create_string(field.name().as_str());
-        let field_type = get_fb_field_type(field.data_type(), &mut fbb);
-        let mut field_builder = ipc::FieldBuilder::new(&mut fbb);
-        field_builder.add_name(fb_field_name);
-        field_builder.add_type_type(field_type.type_type);
-        field_builder.add_nullable(field.is_nullable());
-        match field_type.children {
-            None => {}
-            Some(children) => field_builder.add_children(children),
-        };
-        field_builder.add_type_(field_type.type_);
-        fields.push(field_builder.finish());
+        let fb_field = build_field(&mut fbb, field);
+        fields.push(fb_field);
     }
 
     let mut custom_metadata = vec![];
@@ -80,18 +70,8 @@ pub fn schema_to_fb_offset<'a: 'b, 'b>(
 ) -> WIPOffset<ipc::Schema<'b>> {
     let mut fields = vec![];
     for field in schema.fields() {
-        let fb_field_name = fbb.create_string(field.name().as_str());
-        let field_type = get_fb_field_type(field.data_type(), fbb);
-        let mut field_builder = ipc::FieldBuilder::new(fbb);
-        field_builder.add_name(fb_field_name);
-        field_builder.add_type_type(field_type.type_type);
-        field_builder.add_nullable(field.is_nullable());
-        match field_type.children {
-            None => {}
-            Some(children) => field_builder.add_children(children),
-        };
-        field_builder.add_type_(field_type.type_);
-        fields.push(field_builder.finish());
+        let fb_field = build_field(fbb, field);
+        fields.push(fb_field);
     }
 
     let mut custom_metadata = vec![];
@@ -333,6 +313,38 @@ pub(crate) struct FBFieldType<'b> {
     pub(crate) children: Option<WIPOffset<Vector<'b, ForwardsUOffset<ipc::Field<'b>>>>>,
 }
 
+/// Create an IPC Field from an Arrow Field
+pub(crate) fn build_field<'a: 'b, 'b>(
+    fbb: &mut FlatBufferBuilder<'a>,
+    field: &Field,
+) -> WIPOffset<ipc::Field<'b>> {
+    let fb_field_name = fbb.create_string(field.name().as_str());
+    let field_type = get_fb_field_type(field.data_type(), fbb);
+
+    let fb_dictionary = if let Dictionary(index_type, _) = field.data_type() {
+        Some(get_fb_dictionary(
+            index_type,
+            field.dict_id,
+
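The commit above encodes a rule borrowed from the C++ implementation: when serializing a dictionary field to the IPC flatbuffer, the field's type is the dictionary's *value* type, while the index type is recorded separately on the field's dictionary encoding. A minimal self-contained sketch of that rule follows; `DataType` here is a deliberately pared-down hypothetical stand-in for arrow's enum, not the real type.

```rust
// Sketch of the "use the value type for dictionary fields" rule from
// get_fb_field_type / build_field in rust/arrow/src/ipc/convert.rs.

#[derive(Clone, PartialEq, Debug)]
enum DataType {
    Int32,
    Utf8,
    /// Dictionary(index_type, value_type), mirroring arrow's shape.
    Dictionary(Box<DataType>, Box<DataType>),
}

/// For a dictionary field, the serialized IPC type is the type of the
/// dictionary values; any other type is serialized as itself. The index
/// type would be emitted separately via the field's dictionary encoding.
fn serialized_type(data_type: &DataType) -> &DataType {
    match data_type {
        DataType::Dictionary(_index_type, value_type) => &**value_type,
        other => other,
    }
}

fn main() {
    let dict = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
    // A Dictionary(Int32, Utf8) field is serialized with type Utf8.
    assert_eq!(*serialized_type(&dict), DataType::Utf8);
    // Non-dictionary types pass through unchanged.
    assert_eq!(*serialized_type(&DataType::Int32), DataType::Int32);
}
```

This split is what lets readers reconstruct `Dictionary(index_type, value_type)` on the way back in: the value type comes from the field's type, and the index type from the dictionary encoding metadata.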
[arrow] branch rust-parquet-arrow-writer updated (bd3c714 -> f70e6db)
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch rust-parquet-arrow-writer
in repository https://gitbox.apache.org/repos/asf/arrow.git.

 omit bd3c714 ARROW-10225: [Rust] [Parquet] Fix null comparison in roundtrip
 omit 12add42 ARROW-10168: [Rust] [Parquet] Schema roundtrip - use Arrow schema from Parquet metadata when available
 omit 7bfff71 ARROW-10191: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes
 omit d748efa ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types
 omit 2ac525e ARROW-10095: [Rust] Update rust-parquet-arrow-writer branch's encode_arrow_schema with ipc changes
 omit 0a92daa ARROW-8423: [Rust] [Parquet] Serialize Arrow schema metadata
 omit 84e3576 ARROW-8289: [Rust] Parquet Arrow writer with nested support
  add 70ae161 ARROW-10290: [C++] List POP_BACK is not available in older CMake versions
  add f07a415 ARROW-10263: [C++][Compute] Improve variance kernel numerical stability
  add c5280a5 ARROW-10293: [Rust] [DataFusion] Fixed benchmarks
  add 818593f ARROW-10295: [Rust] [DataFusion] Replace Rc> by Box<> in accumulators.
  add becf329 ARROW-10289: [Rust] Read dictionaries in IPC streams
  add ea29f65 ARROW-10292: [Rust] [DataFusion] Simplify merge
  add 249adb4 ARROW-10270: [R] Fix CSV timestamp_parsers test on R-devel
  add ac14e91 ARROW-9479: [JS] Fix Table.from for zero-item serialized tables, Table.empty for schemas containing compound types (List, FixedSizeList, Map)
  add ed8b1bc ARROW-10145: [C++][Dataset] Assert integer overflow in partitioning falls back to string
  add 35ace39 ARROW-10174: [Java] Fix reading/writing dict structs
  new 4e6a836 ARROW-8289: [Rust] Parquet Arrow writer with nested support
  new 12b11b2 ARROW-8423: [Rust] [Parquet] Serialize Arrow schema metadata
  new ebe8172 ARROW-10095: [Rust] Update rust-parquet-arrow-writer branch's encode_arrow_schema with ipc changes
  new 58855a8 ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types
  new de95e84 ARROW-10191: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes
  new e1b613b ARROW-10168: [Rust] [Parquet] Schema roundtrip - use Arrow schema from Parquet metadata when available
  new f70e6db ARROW-10225: [Rust] [Parquet] Fix null comparison in roundtrip

This update added new revisions after undoing existing revisions. That is to say, some revisions that were in the old version of the branch are not in the new version. This situation occurs when a user --force pushes a change and generates a repository containing something like this:

 * -- * -- B -- O -- O -- O (bd3c714)
            \
             N -- N -- N refs/heads/rust-parquet-arrow-writer (f70e6db)

You should already have received notification emails for all of the O revisions, and so the following emails describe only the N revisions from the common base, B.

Any revisions marked "omit" are not gone; other references still refer to them. Any revisions marked "discard" are gone forever.

The 7 revisions listed above as "new" are entirely new to this repository and will be described in separate emails.
The revisions listed as "add" were already present in the repository and have only been added to this reference.

Summary of changes:
 .github/workflows/r.yml                            |  21 ++-
 cpp/src/arrow/compute/kernels/aggregate_test.cc    |  44 --
 cpp/src/arrow/compute/kernels/aggregate_var_std.cc |  21 +--
 cpp/src/arrow/dataset/partition_test.cc            |  12 ++
 cpp/src/arrow/flight/CMakeLists.txt                |   4 +-
 .../arrow/vector/util/DictionaryUtility.java       |  20 ++-
 .../arrow/vector/ipc/TestArrowReaderWriter.java    |  96
 .../vector/testing/ValueVectorDataPopulator.java   |  32
 js/src/data.ts                                     |   6 +-
 r/tests/testthat/test-csv.R                        |   4 +-
 rust/arrow/src/ipc/reader.rs                       | 174 +
 rust/datafusion/benches/aggregate_query_sql.rs     |  14 +-
 rust/datafusion/benches/math_query_sql.rs          |  36 +++--
 rust/datafusion/benches/sort_limit_query_sql.rs    |  17 +-
 rust/datafusion/examples/simple_udaf.rs            |   4 +-
 rust/datafusion/src/execution/context.rs           |  10 +-
 rust/datafusion/src/physical_plan/aggregates.rs    |   4 +-
 .../src/physical_plan/distinct_expressions.rs      |  17 +-
 rust/datafusion/src/physical_plan/expressions.rs   |  34 ++--
 .../datafusion/src/physical_plan/hash_aggregate.rs |  36 ++---
 rust/datafusion/src/physical_plan/limit.rs         |   3 +-
 rust/datafusion/src/physical_plan/merge.rs         |  54 ++-
 rust/datafusion/src/physical_plan/mod.rs           |   4 +-
 rust/datafusion/src/physical_plan/planner.rs       |   5 +-
 rust/datafusion/src/physical_plan/sort.rs          |   2
[arrow] branch master updated (ed8b1bc -> 35ace39)
This is an automated email from the ASF dual-hosted git repository.

liyafan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.

 from ed8b1bc ARROW-10145: [C++][Dataset] Assert integer overflow in partitioning falls back to string
  add 35ace39 ARROW-10174: [Java] Fix reading/writing dict structs

No new revisions were added by this update.

Summary of changes:
 .../arrow/vector/util/DictionaryUtility.java     | 20 +++--
 .../arrow/vector/ipc/TestArrowReaderWriter.java  | 96 ++
 .../vector/testing/ValueVectorDataPopulator.java | 32
 3 files changed, 141 insertions(+), 7 deletions(-)
[arrow] branch master updated (ac14e91 -> ed8b1bc)
This is an automated email from the ASF dual-hosted git repository. jorisvandenbossche pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git. from ac14e91 ARROW-9479: [JS] Fix Table.from for zero-item serialized tables, Table.empty for schemas containing compound types (List, FixedSizeList, Map) add ed8b1bc ARROW-10145: [C++][Dataset] Assert integer overflow in partitioning falls back to string No new revisions were added by this update. Summary of changes: cpp/src/arrow/dataset/partition_test.cc | 12 1 file changed, 12 insertions(+)
[arrow] branch master updated (249adb4 -> ac14e91)
This is an automated email from the ASF dual-hosted git repository. bhulette pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git. from 249adb4 ARROW-10270: [R] Fix CSV timestamp_parsers test on R-devel add ac14e91 ARROW-9479: [JS] Fix Table.from for zero-item serialized tables, Table.empty for schemas containing compound types (List, FixedSizeList, Map) No new revisions were added by this update. Summary of changes: js/src/data.ts | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
[GitHub] [arrow-site] jorisvandenbossche commented on pull request #79: 2.0.0 release blog post
jorisvandenbossche commented on pull request #79:
URL: https://github.com/apache/arrow-site/pull/79#issuecomment-709488684

I did a first pass on adding a few python notes

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-site] lidavidm commented on a change in pull request #79: 2.0.0 release blog post
lidavidm commented on a change in pull request #79:
URL: https://github.com/apache/arrow-site/pull/79#discussion_r505707474

## File path: _posts/2020-10-15-2.0.0-release.md

@@ -0,0 +1,92 @@
+---
+layout: post
+title: "Apache Arrow 2.0.0 Release"
+date: "2020-10-14 00:00:00 -0600"
+author: pmc
+categories: [release]
+---
+
+The Apache Arrow team is pleased to announce the 2.0.0 release. This covers
+over XX months of development work and includes [**XX resolved issues**][1]
+from [**XX distinct contributors**][2]. See the Install Page to learn how to
+get the libraries for your platform.
+
+The release notes below are not exhaustive and only expose selected highlights
+of the release. Many other bugfixes and improvements have been made: we refer
+you to the [complete changelog][3].
+
+## Community
+
+## Columnar Format Notes
+
+## Arrow Flight RPC notes
+For Arrow Flight, 2.0.0 mostly brings bugfixes. In Java, some memory leaks in `FlightStream` and `DoPut` have been addressed. In C++ and Python, a deadlock has been fixed in an edge case.

Review comment:
```suggestion
For Arrow Flight, 2.0.0 mostly brings bugfixes. In Java, some memory leaks in `FlightStream` and `DoPut` have been addressed. In C++ and Python, a deadlock has been fixed in an edge case. Additionally, when supported by gRPC, TLS verification can be disabled.
```

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-site] liyafan82 commented on a change in pull request #79: 2.0.0 release blog post
liyafan82 commented on a change in pull request #79:
URL: https://github.com/apache/arrow-site/pull/79#discussion_r504527232

## File path: _posts/2020-10-15-2.0.0-release.md

@@ -0,0 +1,84 @@
+---
+layout: post
+title: "Apache Arrow 2.0.0 Release"
+date: "2020-10-14 00:00:00 -0600"
+author: pmc
+categories: [release]
+---
+
+The Apache Arrow team is pleased to announce the 2.0.0 release. This covers
+over XX months of development work and includes [**XX resolved issues**][1]
+from [**XX distinct contributors**][2]. See the Install Page to learn how to
+get the libraries for your platform.
+
+The release notes below are not exhaustive and only expose selected highlights
+of the release. Many other bugfixes and improvements have been made: we refer
+you to the [complete changelog][3].
+
+## Community
+
+## Columnar Format Notes
+
+## Arrow Flight RPC notes
+
+## C++ notes
+
+## C# notes
+
+## Go notes
+
+## Java notes
+

Review comment:
@nealrichardson Please check if the following notes are reasonable for Java. Thanks for your effort.
```suggestion
The Java package supports a number of new features. Users can validate vectors in a wider range of aspects, if they are willing to take more time. In dictionary encoding, dictionary indices can be expressed as unsigned integers. A framework for data compression has been set up for IPC. The calculation for vector capacity has been simplified, so users should experience notable performance improvements for various `setSafe` methods. Bugs in JDBC adapters, sort algorithms, and ComplexCopier have been resolved to make them more usable.
```

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org