[arrow] branch master updated (18495e0 -> 7189b91)

2020-10-15 Thread emkornfield
This is an automated email from the ASF dual-hosted git repository.

emkornfield pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 18495e0  ARROW-10294: [Java] Resolve problems of DecimalVector APIs on ArrowBufs
 add 7189b91  ARROW-9475: [Java] Clean up usages of BaseAllocator, use BufferAllocator in…

No new revisions were added by this update.

Summary of changes:
 .../java/org/apache/arrow/memory/Accountant.java   |  3 +-
 .../org/apache/arrow/memory/AllocationManager.java | 39 --
 .../org/apache/arrow/memory/BaseAllocator.java | 16 ++---
 .../org/apache/arrow/memory/BufferAllocator.java   | 32 ++
 .../java/org/apache/arrow/memory/BufferLedger.java | 22 ++--
 .../memory/DefaultAllocationManagerFactory.java|  2 +-
 .../memory/DefaultAllocationManagerFactory.java|  2 +-
 .../arrow/memory/NettyAllocationManager.java   |  6 ++--
 .../org/apache/arrow/memory/TestBaseAllocator.java |  2 +-
 .../arrow/memory/TestNettyAllocationManager.java   |  2 +-
 .../memory/DefaultAllocationManagerFactory.java|  2 +-
 .../arrow/memory/UnsafeAllocationManager.java  |  4 +--
 12 files changed, 87 insertions(+), 45 deletions(-)



[arrow] branch master updated (1d10f22 -> 18495e0)

2020-10-15 Thread emkornfield
This is an automated email from the ASF dual-hosted git repository.

emkornfield pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 1d10f22  ARROW-10236: [Rust] Add can_cast_types to arrow cast kernel, use in DataFusion
 add 18495e0  ARROW-10294: [Java] Resolve problems of DecimalVector APIs on ArrowBufs

No new revisions were added by this update.

Summary of changes:
 .../src/main/codegen/data/ValueVectorTypes.tdd  |  2 +-
 .../src/main/codegen/templates/ComplexWriters.java  |  4 ++--
 .../codegen/templates/UnionFixedSizeListWriter.java |  2 +-
 .../src/main/codegen/templates/UnionListWriter.java |  4 ++--
 .../java/org/apache/arrow/vector/DecimalVector.java | 12 ++--
 .../arrow/vector/complex/impl/PromotableWriter.java |  2 +-
 .../apache/arrow/vector/util/DecimalUtility.java|  2 +-
 .../org/apache/arrow/vector/ITTestLargeVector.java  | 21 -
 8 files changed, 34 insertions(+), 15 deletions(-)



[arrow] branch master updated (35ace39 -> 1d10f22)

2020-10-15 Thread nevime
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 35ace39  ARROW-10174: [Java] Fix reading/writing dict structs
 add 1d10f22  ARROW-10236: [Rust] Add can_cast_types to arrow cast kernel, use in DataFusion

No new revisions were added by this update.

Summary of changes:
 rust/arrow/src/compute/kernels/cast.rs   | 478 ++-
 rust/arrow/src/datatypes.rs  |  10 +
 rust/datafusion/src/logical_plan/mod.rs  |  13 +-
 rust/datafusion/src/physical_plan/expressions.rs |  26 +-
 4 files changed, 505 insertions(+), 22 deletions(-)
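
For context, a minimal sketch of how a caller such as a query planner might use the new entry point. This is an illustration, not DataFusion's actual code, and it assumes a recent arrow crate where `can_cast_types` lives next to `cast` in the cast kernel module (the exact path may differ across versions):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::compute::kernels::cast::{can_cast_types, cast};
use arrow::datatypes::DataType;

fn main() {
    let array: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
    // Planning code can reject an unsupported cast up front
    // instead of failing later during execution.
    if can_cast_types(array.data_type(), &DataType::Int64) {
        let widened = cast(&array, &DataType::Int64).unwrap();
        println!("{:?}", widened);
    }
}
```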



[arrow] 04/07: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types

2020-10-15 Thread nevime
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch rust-parquet-arrow-writer
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 58855a888438fe2063db8499659d83fe1bb91b66
Author: Carol (Nichols || Goulding) 
AuthorDate: Sat Oct 3 02:34:38 2020 +0200

ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types

In this commit, I:

- Extracted a `build_field` function for some code shared between
`schema_to_fb` and `schema_to_fb_offset` that needed to change

- Uncommented the dictionary field from the Arrow schema roundtrip test
and add a dictionary field to the IPC roundtrip test

- If a field is a dictionary field, call `add_dictionary` with the
dictionary field information on the flatbuffer field, building the
dictionary as [the C++ code does][cpp-dictionary] and describe with the
same comment

- When getting the field type for a dictionary field, use the `value_type`
as [the C++ code does][cpp-value-type] and describe with the same
comment

The tests pass because the Parquet -> Arrow conversion for dictionaries
is [already supported][parquet-to-arrow].

[cpp-dictionary]: 
https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440
[cpp-value-type]: 
https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667
[parquet-to-arrow]: 
https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127

Closes #8291 from carols10cents/rust-parquet-arrow-writer

Authored-by: Carol (Nichols || Goulding) 
Signed-off-by: Neville Dipale 
---
 rust/arrow/src/datatypes.rs  |   4 +-
 rust/arrow/src/ipc/convert.rs| 105 ++-
 rust/parquet/src/arrow/schema.rs |  20 
 3 files changed, 93 insertions(+), 36 deletions(-)

diff --git a/rust/arrow/src/datatypes.rs b/rust/arrow/src/datatypes.rs
index 0d05f82..c647af6 100644
--- a/rust/arrow/src/datatypes.rs
+++ b/rust/arrow/src/datatypes.rs
@@ -189,8 +189,8 @@ pub struct Field {
 name: String,
 data_type: DataType,
 nullable: bool,
-dict_id: i64,
-dict_is_ordered: bool,
+pub(crate) dict_id: i64,
+pub(crate) dict_is_ordered: bool,
 }
 
 pub trait ArrowNativeType:
diff --git a/rust/arrow/src/ipc/convert.rs b/rust/arrow/src/ipc/convert.rs
index 7a5795d..8f429bf 100644
--- a/rust/arrow/src/ipc/convert.rs
+++ b/rust/arrow/src/ipc/convert.rs
@@ -34,18 +34,8 @@ pub fn schema_to_fb(schema: &Schema) -> FlatBufferBuilder {
 
 let mut fields = vec![];
 for field in schema.fields() {
-let fb_field_name = fbb.create_string(field.name().as_str());
-let field_type = get_fb_field_type(field.data_type(), &mut fbb);
-let mut field_builder = ipc::FieldBuilder::new(&mut fbb);
-field_builder.add_name(fb_field_name);
-field_builder.add_type_type(field_type.type_type);
-field_builder.add_nullable(field.is_nullable());
-match field_type.children {
-None => {}
-Some(children) => field_builder.add_children(children),
-};
-field_builder.add_type_(field_type.type_);
-fields.push(field_builder.finish());
+let fb_field = build_field(&mut fbb, field);
+fields.push(fb_field);
 }
 
 let mut custom_metadata = vec![];
@@ -80,18 +70,8 @@ pub fn schema_to_fb_offset<'a: 'b, 'b>(
 ) -> WIPOffset<ipc::Schema<'b>> {
 let mut fields = vec![];
 for field in schema.fields() {
-let fb_field_name = fbb.create_string(field.name().as_str());
-let field_type = get_fb_field_type(field.data_type(), fbb);
-let mut field_builder = ipc::FieldBuilder::new(fbb);
-field_builder.add_name(fb_field_name);
-field_builder.add_type_type(field_type.type_type);
-field_builder.add_nullable(field.is_nullable());
-match field_type.children {
-None => {}
-Some(children) => field_builder.add_children(children),
-};
-field_builder.add_type_(field_type.type_);
-fields.push(field_builder.finish());
+let fb_field = build_field(fbb, field);
+fields.push(fb_field);
 }
 
 let mut custom_metadata = vec![];
@@ -333,6 +313,38 @@ pub(crate) struct FBFieldType<'b> {
 pub(crate) children: Option<WIPOffset<Vector<'b, ForwardsUOffset<ipc::Field<'b>>>>>,
 }
 
+/// Create an IPC Field from an Arrow Field
+pub(crate) fn build_field<'a: 'b, 'b>(
+fbb: &mut FlatBufferBuilder<'a>,
+field: &Field,
+) -> WIPOffset<ipc::Field<'b>> {
+let fb_field_name = fbb.create_string(field.name().as_str());
+let field_type = get_fb_field_type(field.data_type(), fbb);
+
+let fb_dictionary = if let Dictionary(index_type, _) = field.data_type() {
+Some(get_fb_dictionary(
+index_type,
+field.dict_id,
+
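As a side note, the rule the commit message describes (a dictionary field's IPC type is its value type, while the index type travels with the dictionary encoding) can be shown with the arrow crate's public `Field`/`DataType` API alone. This is a self-contained illustration, not the flatbuffer-builder code from the diff above:

```rust
use arrow::datatypes::{DataType, Field};

fn main() {
    let dict_type =
        DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
    let field = Field::new("d", dict_type, true);

    // Mirrors the rule above: the IPC field type comes from the value type,
    // while the index type goes into the field's dictionary encoding.
    if let DataType::Dictionary(index_type, value_type) = field.data_type() {
        println!("index type: {:?}", index_type); // Int32
        println!("IPC field type: {:?}", value_type); // Utf8
    }
}
```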

[arrow] 07/07: ARROW-10225: [Rust] [Parquet] Fix null comparison in roundtrip

2020-10-15 Thread nevime
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch rust-parquet-arrow-writer
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit f70e6db575cce746d2a4cd1c9e5a99629c27926c
Author: Neville Dipale 
AuthorDate: Thu Oct 8 17:08:59 2020 +0200

ARROW-10225: [Rust] [Parquet] Fix null comparison in roundtrip

Closes #8388 from nevi-me/ARROW-10225

Authored-by: Neville Dipale 
Signed-off-by: Neville Dipale 
---
 rust/parquet/src/arrow/arrow_writer.rs | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/rust/parquet/src/arrow/arrow_writer.rs 
b/rust/parquet/src/arrow/arrow_writer.rs
index 40e2553..a17e424 100644
--- a/rust/parquet/src/arrow/arrow_writer.rs
+++ b/rust/parquet/src/arrow/arrow_writer.rs
@@ -724,7 +724,11 @@ mod tests {
 assert_eq!(expected_data.offset(), actual_data.offset());
 assert_eq!(expected_data.buffers(), actual_data.buffers());
 assert_eq!(expected_data.child_data(), actual_data.child_data());
-assert_eq!(expected_data.null_bitmap(), actual_data.null_bitmap());
+// Null counts should be the same, not necessarily bitmaps
+// A null bitmap is optional if an array has no nulls
+if expected_data.null_count() != 0 {
+assert_eq!(expected_data.null_bitmap(), 
actual_data.null_bitmap());
+}
 }
 }
 
@@ -1001,7 +1005,7 @@ mod tests {
 }
 
 #[test]
-#[ignore] // Binary support isn't correct yet - null_bitmap doesn't match
+#[ignore] // Binary support isn't correct yet - buffers don't match
 fn binary_single_column() {
 let one_vec: Vec = (0..SMALL_SIZE as u8).collect();
 let many_vecs: Vec<_> = 
std::iter::repeat(one_vec).take(SMALL_SIZE).collect();
@@ -1026,7 +1030,6 @@ mod tests {
 }
 
 #[test]
-#[ignore] // String support isn't correct yet - null_bitmap doesn't match
 fn string_single_column() {
 let raw_values: Vec<_> = (0..SMALL_SIZE).map(|i| 
i.to_string()).collect();
 let raw_strs = raw_values.iter().map(|s| s.as_str());
@@ -1035,7 +1038,6 @@ mod tests {
 }
 
 #[test]
-#[ignore] // Large string support isn't correct yet - null_bitmap doesn't 
match
 fn large_string_single_column() {
 let raw_values: Vec<_> = (0..SMALL_SIZE).map(|i| 
i.to_string()).collect();
 let raw_strs = raw_values.iter().map(|s| s.as_str());
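
The guard in this patch generalizes; here is a minimal standalone sketch of the same comparison logic written against the stable `Array` trait methods. `assert_same_nulls` is a hypothetical helper for illustration, not part of the test suite:

```rust
use arrow::array::{Array, Int32Array};

// Mirrors the patched test logic: null counts must always agree, but
// validity is only compared slot-by-slot when nulls exist, because an
// all-valid array may simply omit its null bitmap.
fn assert_same_nulls(expected: &dyn Array, actual: &dyn Array) {
    assert_eq!(expected.null_count(), actual.null_count());
    if expected.null_count() != 0 {
        for i in 0..expected.len() {
            assert_eq!(expected.is_valid(i), actual.is_valid(i));
        }
    }
}

fn main() {
    let a = Int32Array::from(vec![Some(1), None, Some(3)]);
    let b = Int32Array::from(vec![Some(1), None, Some(3)]);
    assert_same_nulls(&a, &b);
}
```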



[arrow] 02/07: ARROW-8423: [Rust] [Parquet] Serialize Arrow schema metadata

2020-10-15 Thread nevime
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch rust-parquet-arrow-writer
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 12b11b29293b6ace9cb99cffa93fd1b74b4849be
Author: Neville Dipale 
AuthorDate: Tue Aug 18 18:39:37 2020 +0200

ARROW-8423: [Rust] [Parquet] Serialize Arrow schema metadata

This will allow preserving Arrow-specific metadata when writing or reading 
Parquet files created from C++ or Rust.
If the schema can't be deserialised, the normal Parquet > Arrow schema 
conversion is performed.

Closes #7917 from nevi-me/ARROW-8243

Authored-by: Neville Dipale 
Signed-off-by: Neville Dipale 
---
 rust/parquet/Cargo.toml|   3 +-
 rust/parquet/src/arrow/arrow_writer.rs |  27 ++-
 rust/parquet/src/arrow/mod.rs  |   4 +
 rust/parquet/src/arrow/schema.rs   | 306 -
 rust/parquet/src/file/properties.rs|   6 +-
 5 files changed, 290 insertions(+), 56 deletions(-)

diff --git a/rust/parquet/Cargo.toml b/rust/parquet/Cargo.toml
index 50d7c34..60e43c9 100644
--- a/rust/parquet/Cargo.toml
+++ b/rust/parquet/Cargo.toml
@@ -40,6 +40,7 @@ zstd = { version = "0.5", optional = true }
 chrono = "0.4"
 num-bigint = "0.3"
 arrow = { path = "../arrow", version = "2.0.0-SNAPSHOT", optional = true }
+base64 = { version = "*", optional = true }
 
 [dev-dependencies]
 rand = "0.7"
@@ -52,4 +53,4 @@ arrow = { path = "../arrow", version = "2.0.0-SNAPSHOT" }
 serde_json = { version = "1.0", features = ["preserve_order"] }
 
 [features]
-default = ["arrow", "snap", "brotli", "flate2", "lz4", "zstd"]
+default = ["arrow", "snap", "brotli", "flate2", "lz4", "zstd", "base64"]
diff --git a/rust/parquet/src/arrow/arrow_writer.rs 
b/rust/parquet/src/arrow/arrow_writer.rs
index 0c1c490..1ca8d50 100644
--- a/rust/parquet/src/arrow/arrow_writer.rs
+++ b/rust/parquet/src/arrow/arrow_writer.rs
@@ -24,6 +24,7 @@ use arrow::datatypes::{DataType as ArrowDataType, SchemaRef};
 use arrow::record_batch::RecordBatch;
 use arrow_array::Array;
 
+use super::schema::add_encoded_arrow_schema_to_metadata;
 use crate::column::writer::ColumnWriter;
 use crate::errors::{ParquetError, Result};
 use crate::file::properties::WriterProperties;
@@ -53,17 +54,17 @@ impl ArrowWriter {
 pub fn try_new(
 writer: W,
 arrow_schema: SchemaRef,
-props: Option<Rc<WriterProperties>>,
+props: Option<WriterProperties>,
 ) -> Result<Self> {
 let schema = crate::arrow::arrow_to_parquet_schema(&arrow_schema)?;
-let props = match props {
-Some(props) => props,
-None => Rc::new(WriterProperties::builder().build()),
-};
+// add serialized arrow schema
+let mut props = props.unwrap_or_else(|| 
WriterProperties::builder().build());
+add_encoded_arrow_schema_to_metadata(&arrow_schema, &mut props);
+
 let file_writer = SerializedFileWriter::new(
 writer.try_clone()?,
 schema.root_schema_ptr(),
-props,
+Rc::new(props),
 )?;
 
 Ok(Self {
@@ -495,7 +496,7 @@ mod tests {
 use arrow::record_batch::{RecordBatch, RecordBatchReader};
 
 use crate::arrow::{ArrowReader, ParquetFileArrowReader};
-use crate::file::reader::SerializedFileReader;
+use crate::file::{metadata::KeyValue, reader::SerializedFileReader};
 use crate::util::test_common::get_temp_file;
 
 #[test]
@@ -584,7 +585,7 @@ mod tests {
 )
 .unwrap();
 
-let mut file = get_temp_file("test_arrow_writer.parquet", &[]);
+let mut file = get_temp_file("test_arrow_writer_binary.parquet", &[]);
 let mut writer =
 ArrowWriter::try_new(file.try_clone().unwrap(), Arc::new(schema), 
None)
 .unwrap();
@@ -674,8 +675,16 @@ mod tests {
 )
 .unwrap();
 
+let props = WriterProperties::builder()
+.set_key_value_metadata(Some(vec![KeyValue {
+key: "test_key".to_string(),
+value: Some("test_value".to_string()),
+}]))
+.build();
+
 let file = get_temp_file("test_arrow_writer_complex.parquet", &[]);
-let mut writer = ArrowWriter::try_new(file, Arc::new(schema), 
None).unwrap();
+let mut writer =
+ArrowWriter::try_new(file, Arc::new(schema), Some(props)).unwrap();
 writer.write().unwrap();
 writer.close().unwrap();
 }
diff --git a/rust/parquet/src/arrow/mod.rs b/rust/parquet/src/arrow/mod.rs
index 8499481..2b012fb 100644
--- a/rust/parquet/src/arrow/mod.rs
+++ b/rust/parquet/src/arrow/mod.rs
@@ -58,6 +58,10 @@ pub mod schema;
 
 pub use self::arrow_reader::ArrowReader;
 pub use self::arrow_reader::ParquetFileArrowReader;
+pub use self::arrow_writer::ArrowWriter;
 pub use self::schema::{
 arrow_to_parquet_schema, parquet_to_arrow_schema, 
parquet_to_arrow_schema_by_columns,
 };
+
+/// Schema metadata key used to store 
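
The mechanism reads back as a lookup with a fallback. A schematic sketch of that flow follows; the `Schema` stand-in and both helper functions are stubs for illustration, and only the "ARROW:schema" metadata key is the real convention shared with the C++ implementation:

```rust
use std::collections::HashMap;

#[derive(Debug)]
struct Schema(String); // stand-in for arrow::datatypes::Schema

// Stub: the real code base64-decodes the value and parses the IPC flatbuffer.
fn decode_ipc_schema(encoded: &str) -> Option<Schema> {
    (!encoded.is_empty()).then(|| Schema("from embedded IPC schema".into()))
}

// Stub: the plain Parquet -> Arrow type conversion used as a fallback.
fn schema_from_parquet_types() -> Schema {
    Schema("from Parquet type conversion".into())
}

fn arrow_schema(file_metadata: &HashMap<String, String>) -> Schema {
    file_metadata
        .get("ARROW:schema")
        .and_then(|enc| decode_ipc_schema(enc))
        .unwrap_or_else(schema_from_parquet_types)
}

fn main() {
    let mut meta = HashMap::new();
    meta.insert("ARROW:schema".to_string(), "base64...".to_string());
    println!("{:?}", arrow_schema(&meta));
}
```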

[arrow] 05/07: ARROW-10191: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes

2020-10-15 Thread nevime
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch rust-parquet-arrow-writer
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit de95e847bc82cbd28b3963edc843baaa10bb99ab
Author: Carol (Nichols || Goulding) 
AuthorDate: Tue Oct 6 08:44:26 2020 -0600

ARROW-10191: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all 
supported Arrow DataTypes

Note that this PR goes to the rust-parquet-arrow-writer branch, not master.

Inspired by tests in cpp/src/parquet/arrow/arrow_reader_writer_test.cc

These perform round-trip Arrow -> Parquet -> Arrow of a single RecordBatch 
with a single column of values of each of the supported data types and some of 
the unsupported ones.

Tests that currently fail are either marked with `#[should_panic]` (if the
reason they fail is because of a panic) or `#[ignore]` (if the reason
they fail is because the values don't match).

I am comparing the RecordBatch's column's data before and after the round 
trip directly; I'm not sure that this is appropriate or not because for some 
data types, the `null_bitmap` isn't matching and I'm not sure if it's supposed 
to or not.

So I would love advice on that front, and I would love to know if these 
tests are useful or not!

Closes #8330 from carols10cents/roundtrip-tests

Lead-authored-by: Carol (Nichols || Goulding) 
Co-authored-by: Neville Dipale 
Signed-off-by: Andy Grove 
---
 rust/arrow/src/compute/kernels/cast.rs |  20 +-
 rust/parquet/src/arrow/array_reader.rs | 102 +---
 rust/parquet/src/arrow/arrow_writer.rs | 413 -
 rust/parquet/src/arrow/converter.rs|  25 +-
 4 files changed, 523 insertions(+), 37 deletions(-)

diff --git a/rust/arrow/src/compute/kernels/cast.rs 
b/rust/arrow/src/compute/kernels/cast.rs
index 08c6a2b..30180ca 100644
--- a/rust/arrow/src/compute/kernels/cast.rs
+++ b/rust/arrow/src/compute/kernels/cast.rs
@@ -356,11 +356,27 @@ pub fn cast(array: &ArrayRef, to_type: &DataType) -> Result<ArrayRef> {
 
 // temporal casts
 (Int32, Date32(_)) => cast_array_data::<Date32Type>(array, to_type.clone()),
-(Int32, Time32(_)) => cast_array_data::<Time32SecondType>(array, to_type.clone()),
+(Int32, Time32(unit)) => match unit {
+TimeUnit::Second => {
+cast_array_data::<Time32SecondType>(array, to_type.clone())
+}
+TimeUnit::Millisecond => {
+cast_array_data::<Time32MillisecondType>(array, to_type.clone())
+}
+_ => unreachable!(),
+},
 (Date32(_), Int32) => cast_array_data::<Int32Type>(array, to_type.clone()),
 (Time32(_), Int32) => cast_array_data::<Int32Type>(array, to_type.clone()),
 (Int64, Date64(_)) => cast_array_data::<Date64Type>(array, to_type.clone()),
-(Int64, Time64(_)) => cast_array_data::<Time64MicrosecondType>(array, to_type.clone()),
+(Int64, Time64(unit)) => match unit {
+TimeUnit::Microsecond => {
+cast_array_data::<Time64MicrosecondType>(array, to_type.clone())
+}
+TimeUnit::Nanosecond => {
+cast_array_data::<Time64NanosecondType>(array, to_type.clone())
+}
+_ => unreachable!(),
+},
 (Date64(_), Int64) => cast_array_data::<Int64Type>(array, to_type.clone()),
 (Time64(_), Int64) => cast_array_data::<Int64Type>(array, to_type.clone()),
 (Date32(DateUnit::Day), Date64(DateUnit::Millisecond)) => {
diff --git a/rust/parquet/src/arrow/array_reader.rs 
b/rust/parquet/src/arrow/array_reader.rs
index 14bf7d2..4fbc54d 100644
--- a/rust/parquet/src/arrow/array_reader.rs
+++ b/rust/parquet/src/arrow/array_reader.rs
@@ -35,9 +35,10 @@ use crate::arrow::converter::{
 BinaryArrayConverter, BinaryConverter, BoolConverter, 
BooleanArrayConverter,
 Converter, Date32Converter, FixedLenBinaryConverter, 
FixedSizeArrayConverter,
 Float32Converter, Float64Converter, Int16Converter, Int32Converter, 
Int64Converter,
-Int8Converter, Int96ArrayConverter, Int96Converter, 
TimestampMicrosecondConverter,
-TimestampMillisecondConverter, UInt16Converter, UInt32Converter, 
UInt64Converter,
-UInt8Converter, Utf8ArrayConverter, Utf8Converter,
+Int8Converter, Int96ArrayConverter, Int96Converter, 
Time32MillisecondConverter,
+Time32SecondConverter, Time64MicrosecondConverter, 
Time64NanosecondConverter,
+TimestampMicrosecondConverter, TimestampMillisecondConverter, 
UInt16Converter,
+UInt32Converter, UInt64Converter, UInt8Converter, Utf8ArrayConverter, 
Utf8Converter,
 };
 use crate::arrow::record_reader::RecordReader;
 use crate::arrow::schema::parquet_to_arrow_field;
@@ -196,11 +197,27 @@ impl ArrayReader for PrimitiveArrayReader 
{
 .convert(self.record_reader.cast::()),
 _ => Err(general_err!("No conversion from parquet type to 
arrow type for date with unit {:?}", unit)),
 }
-(ArrowType::Time32(_), PhysicalType::INT32) => {
-
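
The effect of the cast.rs change above is visible from the caller's side. A small sketch, assuming a recent arrow release with the same kernel path:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::compute::kernels::cast::cast;
use arrow::datatypes::{DataType, TimeUnit};

fn main() {
    let ints: ArrayRef = Arc::new(Int32Array::from(vec![1_000, 2_000]));
    // Before this change, Int32 -> Time32 ignored the requested unit;
    // now each TimeUnit is dispatched to its own target type.
    let secs = cast(&ints, &DataType::Time32(TimeUnit::Second)).unwrap();
    let millis = cast(&ints, &DataType::Time32(TimeUnit::Millisecond)).unwrap();
    assert_eq!(secs.data_type(), &DataType::Time32(TimeUnit::Second));
    assert_eq!(millis.data_type(), &DataType::Time32(TimeUnit::Millisecond));
}
```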

[arrow] 06/07: ARROW-10168: [Rust] [Parquet] Schema roundtrip - use Arrow schema from Parquet metadata when available

2020-10-15 Thread nevime
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch rust-parquet-arrow-writer
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit e1b613b1ec9239bf58d5882081aeeb75fa06c3d3
Author: Carol (Nichols || Goulding) 
AuthorDate: Thu Oct 8 00:16:42 2020 +0200

ARROW-10168: [Rust] [Parquet] Schema roundtrip - use Arrow schema from 
Parquet metadata when available

@nevi-me This is one commit on top of 
https://github.com/apache/arrow/pull/8330 that I'm opening to get some feedback 
from you on about whether this will help with ARROW-10168. I *think* this will 
bring the Rust implementation more in line with C++, but I'm not certain.

I tried removing the `#[ignore]` attributes from the `LargeArray` and 
`LargeUtf8` tests, but they're still failing because the schemas don't match 
yet-- it looks like [this 
code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638)
 will need to be changed as well.

That `build_array_reader` function's code looks very similar to the code 
I've changed here, is there a possibility for the code to be shared or is there 
a reason they're separate?

Closes #8354 from carols10cents/schema-roundtrip

Lead-authored-by: Carol (Nichols || Goulding) 
Co-authored-by: Neville Dipale 
Signed-off-by: Neville Dipale 
---
 rust/arrow/src/ipc/convert.rs   |   4 +-
 rust/parquet/src/arrow/array_reader.rs  | 106 +
 rust/parquet/src/arrow/arrow_reader.rs  |  36 --
 rust/parquet/src/arrow/arrow_writer.rs  |   4 +-
 rust/parquet/src/arrow/converter.rs |  52 +++-
 rust/parquet/src/arrow/mod.rs   |   3 +-
 rust/parquet/src/arrow/record_reader.rs |   1 +
 rust/parquet/src/arrow/schema.rs| 205 +++-
 8 files changed, 338 insertions(+), 73 deletions(-)

diff --git a/rust/arrow/src/ipc/convert.rs b/rust/arrow/src/ipc/convert.rs
index 8f429bf..a02b6c4 100644
--- a/rust/arrow/src/ipc/convert.rs
+++ b/rust/arrow/src/ipc/convert.rs
@@ -334,7 +334,9 @@ pub(crate) fn build_field<'a: 'b, 'b>(
 
 let mut field_builder = ipc::FieldBuilder::new(fbb);
 field_builder.add_name(fb_field_name);
-fb_dictionary.map(|dictionary| field_builder.add_dictionary(dictionary));
+if let Some(dictionary) = fb_dictionary {
+field_builder.add_dictionary(dictionary)
+}
 field_builder.add_type_type(field_type.type_type);
 field_builder.add_nullable(field.is_nullable());
 match field_type.children {
diff --git a/rust/parquet/src/arrow/array_reader.rs 
b/rust/parquet/src/arrow/array_reader.rs
index 4fbc54d..40df284 100644
--- a/rust/parquet/src/arrow/array_reader.rs
+++ b/rust/parquet/src/arrow/array_reader.rs
@@ -29,16 +29,20 @@ use arrow::array::{
 Int16BufferBuilder, StructArray,
 };
 use arrow::buffer::{Buffer, MutableBuffer};
-use arrow::datatypes::{DataType as ArrowType, DateUnit, Field, IntervalUnit, 
TimeUnit};
+use arrow::datatypes::{
+DataType as ArrowType, DateUnit, Field, IntervalUnit, Schema, TimeUnit,
+};
 
 use crate::arrow::converter::{
 BinaryArrayConverter, BinaryConverter, BoolConverter, 
BooleanArrayConverter,
 Converter, Date32Converter, FixedLenBinaryConverter, 
FixedSizeArrayConverter,
 Float32Converter, Float64Converter, Int16Converter, Int32Converter, 
Int64Converter,
-Int8Converter, Int96ArrayConverter, Int96Converter, 
Time32MillisecondConverter,
-Time32SecondConverter, Time64MicrosecondConverter, 
Time64NanosecondConverter,
-TimestampMicrosecondConverter, TimestampMillisecondConverter, 
UInt16Converter,
-UInt32Converter, UInt64Converter, UInt8Converter, Utf8ArrayConverter, 
Utf8Converter,
+Int8Converter, Int96ArrayConverter, Int96Converter, 
LargeBinaryArrayConverter,
+LargeBinaryConverter, LargeUtf8ArrayConverter, LargeUtf8Converter,
+Time32MillisecondConverter, Time32SecondConverter, 
Time64MicrosecondConverter,
+Time64NanosecondConverter, TimestampMicrosecondConverter,
+TimestampMillisecondConverter, UInt16Converter, UInt32Converter, 
UInt64Converter,
+UInt8Converter, Utf8ArrayConverter, Utf8Converter,
 };
 use crate::arrow::record_reader::RecordReader;
 use crate::arrow::schema::parquet_to_arrow_field;
@@ -612,6 +616,7 @@ impl ArrayReader for StructArrayReader {
 /// Create array reader from parquet schema, column indices, and parquet file 
reader.
 pub fn build_array_reader<T>(
 parquet_schema: SchemaDescPtr,
+arrow_schema: Schema,
 column_indices: T,
 file_reader: Rc<dyn FileReader>,
 ) -> Result<Box<dyn ArrayReader>>
@@ -650,13 +655,19 @@ where
 fields: filtered_root_fields,
 };
 
-ArrayReaderBuilder::new(Rc::new(proj), Rc::new(leaves), file_reader)
-.build_array_reader()
+ArrayReaderBuilder::new(
+Rc::new(proj),
+Rc::new(arrow_schema),
+Rc::new(leaves),
+file_reader,
+)
+.build_array_reader()
 }
 
 

[arrow] 03/07: ARROW-10095: [Rust] Update rust-parquet-arrow-writer branch's encode_arrow_schema with ipc changes

2020-10-15 Thread nevime
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch rust-parquet-arrow-writer
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit ebe81729da4e983ec0f580ea0582059c0237d41c
Author: Carol (Nichols || Goulding) 
AuthorDate: Fri Sep 25 17:54:11 2020 +0200

ARROW-10095: [Rust] Update rust-parquet-arrow-writer branch's 
encode_arrow_schema with ipc changes

Note that this PR is deliberately filed against the 
rust-parquet-arrow-writer branch, not master!!

Hi!  I'm looking to help out with the rust-parquet-arrow-writer branch, 
and I just pulled it down and it wasn't compiling because in 
75f804efbfe367175fef5a2238d9cd2d30ed3afe, `schema_to_bytes` was changed to take 
`IpcWriteOptions` and to return `EncodedData`. This updates 
`encode_arrow_schema` to use those changes, which should get this branch 
compiling and passing tests again.

I'm kind of guessing which JIRA ticket this should be associated with; 
honestly I think this commit can just be squashed with 
https://github.com/apache/arrow/commit/8f0ed91469f2e569472edaa3b69ffde051088555 
next time this branch gets rebased.

Please let me know if I should change anything, I'm happy to!

Closes #8274 from carols10cents/update-with-ipc-changes

Authored-by: Carol (Nichols || Goulding) 
Signed-off-by: Neville Dipale 
---
 rust/parquet/src/arrow/arrow_writer.rs | 2 +-
 rust/parquet/src/arrow/schema.rs   | 8 +---
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/rust/parquet/src/arrow/arrow_writer.rs 
b/rust/parquet/src/arrow/arrow_writer.rs
index 1ca8d50..e0ad207 100644
--- a/rust/parquet/src/arrow/arrow_writer.rs
+++ b/rust/parquet/src/arrow/arrow_writer.rs
@@ -22,7 +22,7 @@ use std::rc::Rc;
 use arrow::array as arrow_array;
 use arrow::datatypes::{DataType as ArrowDataType, SchemaRef};
 use arrow::record_batch::RecordBatch;
-use arrow_array::Array;
+use arrow_array::{Array, PrimitiveArrayOps};
 
 use super::schema::add_encoded_arrow_schema_to_metadata;
 use crate::column::writer::ColumnWriter;
diff --git a/rust/parquet/src/arrow/schema.rs b/rust/parquet/src/arrow/schema.rs
index d4cfe1f..d5a0ff9 100644
--- a/rust/parquet/src/arrow/schema.rs
+++ b/rust/parquet/src/arrow/schema.rs
@@ -27,6 +27,7 @@ use std::collections::{HashMap, HashSet};
 use std::rc::Rc;
 
 use arrow::datatypes::{DataType, DateUnit, Field, Schema, TimeUnit};
+use arrow::ipc::writer;
 
 use crate::basic::{LogicalType, Repetition, Type as PhysicalType};
 use crate::errors::{ParquetError::ArrowError, Result};
@@ -120,15 +121,16 @@ fn get_arrow_schema_from_metadata(encoded_meta: &str) -> Option<Schema> {
 
 /// Encodes the Arrow schema into the IPC format, and base64 encodes it
 fn encode_arrow_schema(schema: &Schema) -> String {
-let mut serialized_schema = arrow::ipc::writer::schema_to_bytes(schema);
+let options = writer::IpcWriteOptions::default();
+let mut serialized_schema = arrow::ipc::writer::schema_to_bytes(schema, &options);
 
 // manually prepending the length to the schema as arrow uses the legacy IPC format
 // TODO: change after addressing ARROW-9777
-let schema_len = serialized_schema.len();
+let schema_len = serialized_schema.ipc_message.len();
 let mut len_prefix_schema = Vec::with_capacity(schema_len + 8);
 len_prefix_schema.append(&mut vec![255u8, 255, 255, 255]);
 len_prefix_schema.append((schema_len as u32).to_le_bytes().to_vec().as_mut());
-len_prefix_schema.append(&mut serialized_schema);
+len_prefix_schema.append(&mut serialized_schema.ipc_message);
 
 base64::encode(&len_prefix_schema)
 }
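
The length-prefixing above is easy to state on its own: a 0xFFFFFFFF continuation marker, then the message length as a little-endian u32, then the IPC message bytes. A standalone sketch of just that framing, using only the standard library:

```rust
fn frame_legacy_ipc(ipc_message: &[u8]) -> Vec<u8> {
    let mut framed = Vec::with_capacity(ipc_message.len() + 8);
    framed.extend_from_slice(&[0xFF, 0xFF, 0xFF, 0xFF]); // continuation marker
    framed.extend_from_slice(&(ipc_message.len() as u32).to_le_bytes());
    framed.extend_from_slice(ipc_message);
    framed
}

fn main() {
    let framed = frame_legacy_ipc(b"flatbuffer bytes here");
    assert_eq!(&framed[..4], [0xFF; 4]);
    let len = u32::from_le_bytes([framed[4], framed[5], framed[6], framed[7]]);
    assert_eq!(len, 21);
}
```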



[arrow] 01/07: ARROW-8289: [Rust] Parquet Arrow writer with nested support

2020-10-15 Thread nevime
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch rust-parquet-arrow-writer
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 4e6a836b42b064a50582bcc9d6cfca2b7e77a46a
Author: Neville Dipale 
AuthorDate: Thu Aug 13 18:47:34 2020 +0200

ARROW-8289: [Rust] Parquet Arrow writer with nested support

**Note**: I started making changes to #6785, and ended up deviating a lot, 
so I opted for making a new draft PR in case my approach is not suitable.
___

This is a draft to implement an arrow writer for parquet. It supports the 
following (no complete test coverage yet):

* writing primitives except for booleans and binary
* nested structs
* null values (via definition levels)

It does not yet support:

- Boolean arrays (have to be handled differently from numeric values)
- Binary arrays
- Dictionary arrays
- Union arrays (are they even possible?)

I have only added a test by creating a nested schema, which I tested on 
pyarrow.

```jupyter
# schema of test_complex.parquet

a: int32 not null
b: int32
c: struct<d: double, e: struct<f: float>> not null
  child 0, d: double
  child 1, e: struct<f: float>
  child 0, f: float
```

This PR potentially addresses:

* https://issues.apache.org/jira/browse/ARROW-8289
* https://issues.apache.org/jira/browse/ARROW-8423
* https://issues.apache.org/jira/browse/ARROW-8424
* https://issues.apache.org/jira/browse/ARROW-8425

And I would like to propose either opening new JIRAs for the above 
incomplete items, or renaming the last 3 above.

___

**Help Needed**

I'm implementing the definition and repetition levels on first principle 
from an old Parquet blog post from the Twitter engineering blog. It's likely 
that I'm not getting some concepts correct, so I would appreciate help with:

* Checking if my logic is correct
* Guidance or suggestions on how to more efficiently extract levels from 
arrays
* Adding tests - I suspect we might need a lot of tests, so far we only 
test writing 1 batch, so I don't know how paging would work when writing a 
large enough file

I also don't know if the various encoding levels (dictionary, RLE, etc.) 
and compression levels are applied automagically, or if that'd be something we 
need to explicitly enable.

CC @sunchao @sadikovi @andygrove @paddyhoran

Might be of interest to @mcassels @maxburke

Closes #7319 from nevi-me/arrow-parquet-writer

Lead-authored-by: Neville Dipale 
Co-authored-by: Max Burke 
Co-authored-by: Andy Grove 
Co-authored-by: Max Burke 
Signed-off-by: Neville Dipale 
---
 rust/parquet/src/arrow/arrow_writer.rs | 682 +
 rust/parquet/src/arrow/mod.rs  |   5 +-
 rust/parquet/src/schema/types.rs   |   6 +-
 3 files changed, 691 insertions(+), 2 deletions(-)

diff --git a/rust/parquet/src/arrow/arrow_writer.rs 
b/rust/parquet/src/arrow/arrow_writer.rs
new file mode 100644
index 000..0c1c490
--- /dev/null
+++ b/rust/parquet/src/arrow/arrow_writer.rs
@@ -0,0 +1,682 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Contains writer which writes arrow data into parquet data.
+
+use std::rc::Rc;
+
+use arrow::array as arrow_array;
+use arrow::datatypes::{DataType as ArrowDataType, SchemaRef};
+use arrow::record_batch::RecordBatch;
+use arrow_array::Array;
+
+use crate::column::writer::ColumnWriter;
+use crate::errors::{ParquetError, Result};
+use crate::file::properties::WriterProperties;
+use crate::{
+data_type::*,
+file::writer::{FileWriter, ParquetWriter, RowGroupWriter, 
SerializedFileWriter},
+};
+
+/// Arrow writer
+///
+/// Writes Arrow `RecordBatch`es to a Parquet writer
+pub struct ArrowWriter<W: ParquetWriter> {
+/// Underlying Parquet writer
+writer: SerializedFileWriter<W>,
+/// A copy of the Arrow schema.
+///
+/// The schema is used to verify that each record batch written has the 
correct schema
+arrow_schema: SchemaRef,
+}
+
+impl<W: ParquetWriter> ArrowWriter<W> {
+/// Try to create a new Arrow writer
+///
+/// The writer will 

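The definition-level bookkeeping the commit message describes can be reduced to a toy example for the flat nullable case. This illustrates the Dremel encoding rule, not the writer's actual code: for a flat, nullable column the maximum definition level is 1, a present value is encoded at level 1, and a null at level 0.

```rust
// Toy definition-level computation for a flat, nullable column.
fn definition_levels(values: &[Option<i32>]) -> Vec<i16> {
    values
        .iter()
        .map(|v| if v.is_some() { 1 } else { 0 })
        .collect()
}

fn main() {
    let column = [Some(1), None, Some(3)];
    assert_eq!(definition_levels(&column), vec![1, 0, 1]);
}
```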
[arrow] 03/07: ARROW-10095: [Rust] Update rust-parquet-arrow-writer branch's encode_arrow_schema with ipc changes

2020-10-15 Thread nevime
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch rust-parquet-arrow-writer
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit ebe81729da4e983ec0f580ea0582059c0237d41c
Author: Carol (Nichols || Goulding) 
AuthorDate: Fri Sep 25 17:54:11 2020 +0200

ARROW-10095: [Rust] Update rust-parquet-arrow-writer branch's 
encode_arrow_schema with ipc changes

Note that this PR is deliberately filed against the 
rust-parquet-arrow-writer branch, not master!!

Hi!  I'm looking to help out with the rust-parquet-arrow-writer branch, 
and I just pulled it down and it wasn't compiling because in 
75f804efbfe367175fef5a2238d9cd2d30ed3afe, `schema_to_bytes` was changed to take 
`IpcWriteOptions` and to return `EncodedData`. This updates 
`encode_arrow_schema` to use those changes, which should get this branch 
compiling and passing tests again.

I'm kind of guessing which JIRA ticket this should be associated with; 
honestly I think this commit can just be squashed with 
https://github.com/apache/arrow/commit/8f0ed91469f2e569472edaa3b69ffde051088555 
next time this branch gets rebased.

Please let me know if I should change anything, I'm happy to!

Closes #8274 from carols10cents/update-with-ipc-changes

Authored-by: Carol (Nichols || Goulding) 
Signed-off-by: Neville Dipale 
---
 rust/parquet/src/arrow/arrow_writer.rs | 2 +-
 rust/parquet/src/arrow/schema.rs   | 8 +---
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/rust/parquet/src/arrow/arrow_writer.rs 
b/rust/parquet/src/arrow/arrow_writer.rs
index 1ca8d50..e0ad207 100644
--- a/rust/parquet/src/arrow/arrow_writer.rs
+++ b/rust/parquet/src/arrow/arrow_writer.rs
@@ -22,7 +22,7 @@ use std::rc::Rc;
 use arrow::array as arrow_array;
 use arrow::datatypes::{DataType as ArrowDataType, SchemaRef};
 use arrow::record_batch::RecordBatch;
-use arrow_array::Array;
+use arrow_array::{Array, PrimitiveArrayOps};
 
 use super::schema::add_encoded_arrow_schema_to_metadata;
 use crate::column::writer::ColumnWriter;
diff --git a/rust/parquet/src/arrow/schema.rs b/rust/parquet/src/arrow/schema.rs
index d4cfe1f..d5a0ff9 100644
--- a/rust/parquet/src/arrow/schema.rs
+++ b/rust/parquet/src/arrow/schema.rs
@@ -27,6 +27,7 @@ use std::collections::{HashMap, HashSet};
 use std::rc::Rc;
 
 use arrow::datatypes::{DataType, DateUnit, Field, Schema, TimeUnit};
+use arrow::ipc::writer;
 
 use crate::basic::{LogicalType, Repetition, Type as PhysicalType};
 use crate::errors::{ParquetError::ArrowError, Result};
@@ -120,15 +121,16 @@ fn get_arrow_schema_from_metadata(encoded_meta: ) -> 
Option {
 
 /// Encodes the Arrow schema into the IPC format, and base64 encodes it
 fn encode_arrow_schema(schema: ) -> String {
-let mut serialized_schema = arrow::ipc::writer::schema_to_bytes();
+let options = writer::IpcWriteOptions::default();
+let mut serialized_schema = arrow::ipc::writer::schema_to_bytes(, 
);
 
 // manually prepending the length to the schema as arrow uses the legacy 
IPC format
 // TODO: change after addressing ARROW-9777
-let schema_len = serialized_schema.len();
+let schema_len = serialized_schema.ipc_message.len();
 let mut len_prefix_schema = Vec::with_capacity(schema_len + 8);
 len_prefix_schema.append( vec![255u8, 255, 255, 255]);
 len_prefix_schema.append((schema_len as 
u32).to_le_bytes().to_vec().as_mut());
-len_prefix_schema.append( serialized_schema);
+len_prefix_schema.append( serialized_schema.ipc_message);
 
 base64::encode(_prefix_schema)
 }



[arrow] 07/07: ARROW-10225: [Rust] [Parquet] Fix null comparison in roundtrip

2020-10-15 Thread nevime
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch rust-parquet-arrow-writer
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit f70e6db575cce746d2a4cd1c9e5a99629c27926c
Author: Neville Dipale 
AuthorDate: Thu Oct 8 17:08:59 2020 +0200

ARROW-10225: [Rust] [Parquet] Fix null comparison in roundtrip

Closes #8388 from nevi-me/ARROW-10225

Authored-by: Neville Dipale 
Signed-off-by: Neville Dipale 
---
 rust/parquet/src/arrow/arrow_writer.rs | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/rust/parquet/src/arrow/arrow_writer.rs 
b/rust/parquet/src/arrow/arrow_writer.rs
index 40e2553..a17e424 100644
--- a/rust/parquet/src/arrow/arrow_writer.rs
+++ b/rust/parquet/src/arrow/arrow_writer.rs
@@ -724,7 +724,11 @@ mod tests {
 assert_eq!(expected_data.offset(), actual_data.offset());
 assert_eq!(expected_data.buffers(), actual_data.buffers());
 assert_eq!(expected_data.child_data(), actual_data.child_data());
-assert_eq!(expected_data.null_bitmap(), actual_data.null_bitmap());
+// Null counts should be the same, not necessarily bitmaps
+// A null bitmap is optional if an array has no nulls
+if expected_data.null_count() != 0 {
+assert_eq!(expected_data.null_bitmap(), 
actual_data.null_bitmap());
+}
 }
 }
 
@@ -1001,7 +1005,7 @@ mod tests {
 }
 
 #[test]
-#[ignore] // Binary support isn't correct yet - null_bitmap doesn't match
+#[ignore] // Binary support isn't correct yet - buffers don't match
 fn binary_single_column() {
 let one_vec: Vec = (0..SMALL_SIZE as u8).collect();
 let many_vecs: Vec<_> = 
std::iter::repeat(one_vec).take(SMALL_SIZE).collect();
@@ -1026,7 +1030,6 @@ mod tests {
 }
 
 #[test]
-#[ignore] // String support isn't correct yet - null_bitmap doesn't match
 fn string_single_column() {
 let raw_values: Vec<_> = (0..SMALL_SIZE).map(|i| 
i.to_string()).collect();
 let raw_strs = raw_values.iter().map(|s| s.as_str());
@@ -1035,7 +1038,6 @@ mod tests {
 }
 
 #[test]
-#[ignore] // Large string support isn't correct yet - null_bitmap doesn't 
match
 fn large_string_single_column() {
 let raw_values: Vec<_> = (0..SMALL_SIZE).map(|i| 
i.to_string()).collect();
 let raw_strs = raw_values.iter().map(|s| s.as_str());



[arrow] 05/07: ARROW-10191: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes

2020-10-15 Thread nevime
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch rust-parquet-arrow-writer
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit de95e847bc82cbd28b3963edc843baaa10bb99ab
Author: Carol (Nichols || Goulding) 
AuthorDate: Tue Oct 6 08:44:26 2020 -0600

ARROW-10191: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all 
supported Arrow DataTypes

Note that this PR goes to the rust-parquet-arrow-writer branch, not master.

Inspired by tests in cpp/src/parquet/arrow/arrow_reader_writer_test.cc

These perform round-trip Arrow -> Parquet -> Arrow of a single RecordBatch 
with a single column of values of each the supported data types and some of the 
unsupported ones.

Tests that currently fail are either marked with `#[should_panic]` (if the
reason they fail is because of a panic) or `#[ignore]` (if the reason
they fail is because the values don't match).

I am comparing the RecordBatch's column's data before and after the round 
trip directly; I'm not sure that this is appropriate or not because for some 
data types, the `null_bitmap` isn't matching and I'm not sure if it's supposed 
to or not.

So I would love advice on that front, and I would love to know if these 
tests are useful or not!

Closes #8330 from carols10cents/roundtrip-tests

Lead-authored-by: Carol (Nichols || Goulding) 
Co-authored-by: Neville Dipale 
Signed-off-by: Andy Grove 
---
 rust/arrow/src/compute/kernels/cast.rs |  20 +-
 rust/parquet/src/arrow/array_reader.rs | 102 +---
 rust/parquet/src/arrow/arrow_writer.rs | 413 -
 rust/parquet/src/arrow/converter.rs|  25 +-
 4 files changed, 523 insertions(+), 37 deletions(-)

diff --git a/rust/arrow/src/compute/kernels/cast.rs 
b/rust/arrow/src/compute/kernels/cast.rs
index 08c6a2b..30180ca 100644
--- a/rust/arrow/src/compute/kernels/cast.rs
+++ b/rust/arrow/src/compute/kernels/cast.rs
@@ -356,11 +356,27 @@ pub fn cast(array: , to_type: ) -> 
Result {
 
 // temporal casts
 (Int32, Date32(_)) => cast_array_data::(array, 
to_type.clone()),
-(Int32, Time32(_)) => cast_array_data::(array, 
to_type.clone()),
+(Int32, Time32(unit)) => match unit {
+TimeUnit::Second => {
+cast_array_data::(array, to_type.clone())
+}
+TimeUnit::Millisecond => {
+cast_array_data::(array, 
to_type.clone())
+}
+_ => unreachable!(),
+},
 (Date32(_), Int32) => cast_array_data::(array, 
to_type.clone()),
 (Time32(_), Int32) => cast_array_data::(array, 
to_type.clone()),
 (Int64, Date64(_)) => cast_array_data::(array, 
to_type.clone()),
-(Int64, Time64(_)) => cast_array_data::(array, 
to_type.clone()),
+(Int64, Time64(unit)) => match unit {
+TimeUnit::Microsecond => {
+cast_array_data::(array, 
to_type.clone())
+}
+TimeUnit::Nanosecond => {
+cast_array_data::(array, to_type.clone())
+}
+_ => unreachable!(),
+},
 (Date64(_), Int64) => cast_array_data::(array, 
to_type.clone()),
 (Time64(_), Int64) => cast_array_data::(array, 
to_type.clone()),
 (Date32(DateUnit::Day), Date64(DateUnit::Millisecond)) => {
diff --git a/rust/parquet/src/arrow/array_reader.rs 
b/rust/parquet/src/arrow/array_reader.rs
index 14bf7d2..4fbc54d 100644
--- a/rust/parquet/src/arrow/array_reader.rs
+++ b/rust/parquet/src/arrow/array_reader.rs
@@ -35,9 +35,10 @@ use crate::arrow::converter::{
 BinaryArrayConverter, BinaryConverter, BoolConverter, 
BooleanArrayConverter,
 Converter, Date32Converter, FixedLenBinaryConverter, 
FixedSizeArrayConverter,
 Float32Converter, Float64Converter, Int16Converter, Int32Converter, 
Int64Converter,
-Int8Converter, Int96ArrayConverter, Int96Converter, 
TimestampMicrosecondConverter,
-TimestampMillisecondConverter, UInt16Converter, UInt32Converter, 
UInt64Converter,
-UInt8Converter, Utf8ArrayConverter, Utf8Converter,
+Int8Converter, Int96ArrayConverter, Int96Converter, 
Time32MillisecondConverter,
+Time32SecondConverter, Time64MicrosecondConverter, 
Time64NanosecondConverter,
+TimestampMicrosecondConverter, TimestampMillisecondConverter, 
UInt16Converter,
+UInt32Converter, UInt64Converter, UInt8Converter, Utf8ArrayConverter, 
Utf8Converter,
 };
 use crate::arrow::record_reader::RecordReader;
 use crate::arrow::schema::parquet_to_arrow_field;
@@ -196,11 +197,27 @@ impl ArrayReader for PrimitiveArrayReader 
{
 .convert(self.record_reader.cast::()),
 _ => Err(general_err!("No conversion from parquet type to 
arrow type for date with unit {:?}", unit)),
 }
-(ArrowType::Time32(_), PhysicalType::INT32) => {
-

[arrow] 06/07: ARROW-10168: [Rust] [Parquet] Schema roundtrip - use Arrow schema from Parquet metadata when available

2020-10-15 Thread nevime
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch rust-parquet-arrow-writer
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit e1b613b1ec9239bf58d5882081aeeb75fa06c3d3
Author: Carol (Nichols || Goulding) 
AuthorDate: Thu Oct 8 00:16:42 2020 +0200

ARROW-10168: [Rust] [Parquet] Schema roundtrip - use Arrow schema from 
Parquet metadata when available

@nevi-me This is one commit on top of 
https://github.com/apache/arrow/pull/8330 that I'm opening to get some feedback 
from you on about whether this will help with ARROW-10168. I *think* this will 
bring the Rust implementation more in line with C++, but I'm not certain.

I tried removing the `#[ignore]` attributes from the `LargeArray` and 
`LargeUtf8` tests, but they're still failing because the schemas don't match 
yet-- it looks like [this 
code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638)
 will need to be changed as well.

That `build_array_reader` function's code looks very similar to the code 
I've changed here, is there a possibility for the code to be shared or is there 
a reason they're separate?

Closes #8354 from carols10cents/schema-roundtrip

Lead-authored-by: Carol (Nichols || Goulding) 
Co-authored-by: Neville Dipale 
Signed-off-by: Neville Dipale 
---
 rust/arrow/src/ipc/convert.rs   |   4 +-
 rust/parquet/src/arrow/array_reader.rs  | 106 +
 rust/parquet/src/arrow/arrow_reader.rs  |  36 --
 rust/parquet/src/arrow/arrow_writer.rs  |   4 +-
 rust/parquet/src/arrow/converter.rs |  52 +++-
 rust/parquet/src/arrow/mod.rs   |   3 +-
 rust/parquet/src/arrow/record_reader.rs |   1 +
 rust/parquet/src/arrow/schema.rs| 205 +++-
 8 files changed, 338 insertions(+), 73 deletions(-)

diff --git a/rust/arrow/src/ipc/convert.rs b/rust/arrow/src/ipc/convert.rs
index 8f429bf..a02b6c4 100644
--- a/rust/arrow/src/ipc/convert.rs
+++ b/rust/arrow/src/ipc/convert.rs
@@ -334,7 +334,9 @@ pub(crate) fn build_field<'a: 'b, 'b>(
 
 let mut field_builder = ipc::FieldBuilder::new(fbb);
 field_builder.add_name(fb_field_name);
-fb_dictionary.map(|dictionary| field_builder.add_dictionary(dictionary));
+if let Some(dictionary) = fb_dictionary {
+field_builder.add_dictionary(dictionary)
+}
 field_builder.add_type_type(field_type.type_type);
 field_builder.add_nullable(field.is_nullable());
 match field_type.children {
diff --git a/rust/parquet/src/arrow/array_reader.rs 
b/rust/parquet/src/arrow/array_reader.rs
index 4fbc54d..40df284 100644
--- a/rust/parquet/src/arrow/array_reader.rs
+++ b/rust/parquet/src/arrow/array_reader.rs
@@ -29,16 +29,20 @@ use arrow::array::{
 Int16BufferBuilder, StructArray,
 };
 use arrow::buffer::{Buffer, MutableBuffer};
-use arrow::datatypes::{DataType as ArrowType, DateUnit, Field, IntervalUnit, 
TimeUnit};
+use arrow::datatypes::{
+DataType as ArrowType, DateUnit, Field, IntervalUnit, Schema, TimeUnit,
+};
 
 use crate::arrow::converter::{
 BinaryArrayConverter, BinaryConverter, BoolConverter, 
BooleanArrayConverter,
 Converter, Date32Converter, FixedLenBinaryConverter, 
FixedSizeArrayConverter,
 Float32Converter, Float64Converter, Int16Converter, Int32Converter, 
Int64Converter,
-Int8Converter, Int96ArrayConverter, Int96Converter, 
Time32MillisecondConverter,
-Time32SecondConverter, Time64MicrosecondConverter, 
Time64NanosecondConverter,
-TimestampMicrosecondConverter, TimestampMillisecondConverter, 
UInt16Converter,
-UInt32Converter, UInt64Converter, UInt8Converter, Utf8ArrayConverter, 
Utf8Converter,
+Int8Converter, Int96ArrayConverter, Int96Converter, 
LargeBinaryArrayConverter,
+LargeBinaryConverter, LargeUtf8ArrayConverter, LargeUtf8Converter,
+Time32MillisecondConverter, Time32SecondConverter, 
Time64MicrosecondConverter,
+Time64NanosecondConverter, TimestampMicrosecondConverter,
+TimestampMillisecondConverter, UInt16Converter, UInt32Converter, 
UInt64Converter,
+UInt8Converter, Utf8ArrayConverter, Utf8Converter,
 };
 use crate::arrow::record_reader::RecordReader;
 use crate::arrow::schema::parquet_to_arrow_field;
@@ -612,6 +616,7 @@ impl ArrayReader for StructArrayReader {
 /// Create array reader from parquet schema, column indices, and parquet file 
reader.
 pub fn build_array_reader(
 parquet_schema: SchemaDescPtr,
+arrow_schema: Schema,
 column_indices: T,
 file_reader: Rc,
 ) -> Result>
@@ -650,13 +655,19 @@ where
 fields: filtered_root_fields,
 };
 
-ArrayReaderBuilder::new(Rc::new(proj), Rc::new(leaves), file_reader)
-.build_array_reader()
+ArrayReaderBuilder::new(
+Rc::new(proj),
+Rc::new(arrow_schema),
+Rc::new(leaves),
+file_reader,
+)
+.build_array_reader()
 }
 
 

[arrow] 02/07: ARROW-8423: [Rust] [Parquet] Serialize Arrow schema metadata

2020-10-15 Thread nevime
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch rust-parquet-arrow-writer
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 12b11b29293b6ace9cb99cffa93fd1b74b4849be
Author: Neville Dipale 
AuthorDate: Tue Aug 18 18:39:37 2020 +0200

ARROW-8423: [Rust] [Parquet] Serialize Arrow schema metadata

This will allow preserving Arrow-specific metadata when writing or reading 
Parquet files created from C++ or Rust.
If the schema can't be deserialised, the normal Parquet > Arrow schema 
conversion is performed.
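
The mechanics are compact enough to sketch. Below is a minimal, illustrative
version of the idea, assuming the `ARROW:schema` metadata key convention
shared with the C++ implementation; `ipc_serialize` is a stand-in for the IPC
flatbuffer encoding, and the commit's real entry point is
`add_encoded_arrow_schema_to_metadata` in `parquet::arrow::schema`:

```rust
use arrow::datatypes::Schema;
use parquet::file::metadata::KeyValue;

// Sketch only: build the key-value metadata entry that carries the schema.
fn encoded_schema_kv(
    schema: &Schema,
    ipc_serialize: impl Fn(&Schema) -> Vec<u8>, // stand-in for the IPC encoding
) -> KeyValue {
    let ipc_bytes = ipc_serialize(schema);
    KeyValue {
        // Same metadata key the C++ implementation reads and writes.
        key: "ARROW:schema".to_string(),
        // Base64-encoded, hence the new optional dependency in Cargo.toml below.
        value: Some(base64::encode(&ipc_bytes)),
    }
}
```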

Closes #7917 from nevi-me/ARROW-8243

Authored-by: Neville Dipale 
Signed-off-by: Neville Dipale 
---
 rust/parquet/Cargo.toml|   3 +-
 rust/parquet/src/arrow/arrow_writer.rs |  27 ++-
 rust/parquet/src/arrow/mod.rs  |   4 +
 rust/parquet/src/arrow/schema.rs   | 306 -
 rust/parquet/src/file/properties.rs|   6 +-
 5 files changed, 290 insertions(+), 56 deletions(-)

diff --git a/rust/parquet/Cargo.toml b/rust/parquet/Cargo.toml
index 50d7c34..60e43c9 100644
--- a/rust/parquet/Cargo.toml
+++ b/rust/parquet/Cargo.toml
@@ -40,6 +40,7 @@ zstd = { version = "0.5", optional = true }
 chrono = "0.4"
 num-bigint = "0.3"
 arrow = { path = "../arrow", version = "2.0.0-SNAPSHOT", optional = true }
+base64 = { version = "*", optional = true }
 
 [dev-dependencies]
 rand = "0.7"
@@ -52,4 +53,4 @@ arrow = { path = "../arrow", version = "2.0.0-SNAPSHOT" }
 serde_json = { version = "1.0", features = ["preserve_order"] }
 
 [features]
-default = ["arrow", "snap", "brotli", "flate2", "lz4", "zstd"]
+default = ["arrow", "snap", "brotli", "flate2", "lz4", "zstd", "base64"]
diff --git a/rust/parquet/src/arrow/arrow_writer.rs 
b/rust/parquet/src/arrow/arrow_writer.rs
index 0c1c490..1ca8d50 100644
--- a/rust/parquet/src/arrow/arrow_writer.rs
+++ b/rust/parquet/src/arrow/arrow_writer.rs
@@ -24,6 +24,7 @@ use arrow::datatypes::{DataType as ArrowDataType, SchemaRef};
 use arrow::record_batch::RecordBatch;
 use arrow_array::Array;
 
+use super::schema::add_encoded_arrow_schema_to_metadata;
 use crate::column::writer::ColumnWriter;
 use crate::errors::{ParquetError, Result};
 use crate::file::properties::WriterProperties;
@@ -53,17 +54,17 @@ impl<W: 'static + ParquetWriter> ArrowWriter<W> {
 pub fn try_new(
 writer: W,
 arrow_schema: SchemaRef,
-props: Option<Rc<WriterProperties>>,
+props: Option<WriterProperties>,
) -> Result<Self> {
let schema = crate::arrow::arrow_to_parquet_schema(&arrow_schema)?;
-let props = match props {
-Some(props) => props,
-None => Rc::new(WriterProperties::builder().build()),
-};
+// add serialized arrow schema
+let mut props = props.unwrap_or_else(|| 
WriterProperties::builder().build());
+add_encoded_arrow_schema_to_metadata(&arrow_schema, &mut props);
+
 let file_writer = SerializedFileWriter::new(
 writer.try_clone()?,
 schema.root_schema_ptr(),
-props,
+Rc::new(props),
 )?;
 
 Ok(Self {
@@ -495,7 +496,7 @@ mod tests {
 use arrow::record_batch::{RecordBatch, RecordBatchReader};
 
 use crate::arrow::{ArrowReader, ParquetFileArrowReader};
-use crate::file::reader::SerializedFileReader;
+use crate::file::{metadata::KeyValue, reader::SerializedFileReader};
 use crate::util::test_common::get_temp_file;
 
 #[test]
@@ -584,7 +585,7 @@ mod tests {
 )
 .unwrap();
 
-let mut file = get_temp_file("test_arrow_writer.parquet", &[]);
+let mut file = get_temp_file("test_arrow_writer_binary.parquet", &[]);
 let mut writer =
 ArrowWriter::try_new(file.try_clone().unwrap(), Arc::new(schema), 
None)
 .unwrap();
@@ -674,8 +675,16 @@ mod tests {
 )
 .unwrap();
 
+let props = WriterProperties::builder()
+.set_key_value_metadata(Some(vec![KeyValue {
+key: "test_key".to_string(),
+value: Some("test_value".to_string()),
+}]))
+.build();
+
 let file = get_temp_file("test_arrow_writer_complex.parquet", &[]);
-let mut writer = ArrowWriter::try_new(file, Arc::new(schema), 
None).unwrap();
+let mut writer =
+ArrowWriter::try_new(file, Arc::new(schema), Some(props)).unwrap();
 writer.write().unwrap();
 writer.close().unwrap();
 }
diff --git a/rust/parquet/src/arrow/mod.rs b/rust/parquet/src/arrow/mod.rs
index 8499481..2b012fb 100644
--- a/rust/parquet/src/arrow/mod.rs
+++ b/rust/parquet/src/arrow/mod.rs
@@ -58,6 +58,10 @@ pub mod schema;
 
 pub use self::arrow_reader::ArrowReader;
 pub use self::arrow_reader::ParquetFileArrowReader;
+pub use self::arrow_writer::ArrowWriter;
 pub use self::schema::{
 arrow_to_parquet_schema, parquet_to_arrow_schema, 
parquet_to_arrow_schema_by_columns,
 };
+
+/// Schema metadata key used to store 

[arrow] 01/07: ARROW-8289: [Rust] Parquet Arrow writer with nested support

2020-10-15 Thread nevime
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch rust-parquet-arrow-writer
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 4e6a836b42b064a50582bcc9d6cfca2b7e77a46a
Author: Neville Dipale 
AuthorDate: Thu Aug 13 18:47:34 2020 +0200

ARROW-8289: [Rust] Parquet Arrow writer with nested support

**Note**: I started making changes to #6785, and ended up deviating a lot, 
so I opted for making a new draft PR in case my approach is not suitable.
___

This is a draft to implement an arrow writer for parquet. It supports the 
following (no complete test coverage yet):

* writing primitives except for booleans and binary
* nested structs
* null values (via definition levels)

It does not yet support:

- Boolean arrays (have to be handled differently from numeric values)
- Binary arrays
- Dictionary arrays
- Union arrays (are they even possible?)

I have only added a test by creating a nested schema, which I tested on 
pyarrow.

```jupyter
# schema of test_complex.parquet

a: int32 not null
b: int32
c: struct<d: double, e: struct<f: float>> not null
  child 0, d: double
  child 1, e: struct<f: float>
      child 0, f: float
```
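
For a sense of the intended API, here is a minimal usage sketch mirroring the
tests in this PR; the schema, values, and file path are illustrative, and the
`parquet::arrow::ArrowWriter` import assumes the `pub use` added in commit
02/07 above:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn write_one_batch() {
    // Single non-nullable Int32 column.
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef],
    )
    .unwrap();

    let file = std::fs::File::create("example.parquet").unwrap();
    // `None` falls back to the default WriterProperties.
    let mut writer = ArrowWriter::try_new(file, schema, None).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();
}
```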

This PR potentially addresses:

* https://issues.apache.org/jira/browse/ARROW-8289
* https://issues.apache.org/jira/browse/ARROW-8423
* https://issues.apache.org/jira/browse/ARROW-8424
* https://issues.apache.org/jira/browse/ARROW-8425

And I would like to propose either opening new JIRAs for the above 
incomplete items, or renaming the last 3 above.

___

**Help Needed**

I'm implementing the definition and repetition levels from first principles, 
based on an old Parquet post on the Twitter engineering blog. It's likely 
that I'm not getting some concepts right, so I would appreciate help with:

* Checking if my logic is correct
* Guidance or suggestions on how to more efficiently extract levels from 
arrays
* Adding tests - I suspect we might need a lot of tests; so far we only 
test writing one batch, so I don't know how paging would behave when writing 
a large enough file

I also don't know if the various encoding levels (dictionary, RLE, etc.) 
and compression levels are applied automagically, or if that'd be something we 
need to explicitly enable.
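
To make the level logic easier to review, here is a worked example in the 
Dremel style that post describes, for an assumed nullable `List<Int32>` 
column with nullable items (illustration only, not code from this PR):

```rust
// Parquet shape of an optional list<optional int32> column:
//   optional group a (LIST) { repeated group list { optional int32 element } }
// Max definition level is 3: list present = 1, list slot present = 2,
// value non-null = 3. Repetition level 1 means "continues the same list".
//
// Rows:    [1, 2]      null      [3, null]
// values:   1   2       -         3    -
// def:      3   3       0         3    2
// rep:      0   1       0         0    1
//
// An empty list [] would emit one entry with def = 1, rep = 0 and no value.
```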

CC @sunchao @sadikovi @andygrove @paddyhoran

Might be of interest to @mcassels @maxburke

Closes #7319 from nevi-me/arrow-parquet-writer

Lead-authored-by: Neville Dipale 
Co-authored-by: Max Burke 
Co-authored-by: Andy Grove 
Co-authored-by: Max Burke 
Signed-off-by: Neville Dipale 
---
 rust/parquet/src/arrow/arrow_writer.rs | 682 +
 rust/parquet/src/arrow/mod.rs  |   5 +-
 rust/parquet/src/schema/types.rs   |   6 +-
 3 files changed, 691 insertions(+), 2 deletions(-)

diff --git a/rust/parquet/src/arrow/arrow_writer.rs 
b/rust/parquet/src/arrow/arrow_writer.rs
new file mode 100644
index 000..0c1c490
--- /dev/null
+++ b/rust/parquet/src/arrow/arrow_writer.rs
@@ -0,0 +1,682 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Contains writer which writes arrow data into parquet data.
+
+use std::rc::Rc;
+
+use arrow::array as arrow_array;
+use arrow::datatypes::{DataType as ArrowDataType, SchemaRef};
+use arrow::record_batch::RecordBatch;
+use arrow_array::Array;
+
+use crate::column::writer::ColumnWriter;
+use crate::errors::{ParquetError, Result};
+use crate::file::properties::WriterProperties;
+use crate::{
+data_type::*,
+file::writer::{FileWriter, ParquetWriter, RowGroupWriter, 
SerializedFileWriter},
+};
+
+/// Arrow writer
+///
+/// Writes Arrow `RecordBatch`es to a Parquet writer
+pub struct ArrowWriter<W: ParquetWriter> {
+/// Underlying Parquet writer
+writer: SerializedFileWriter<W>,
+/// A copy of the Arrow schema.
+///
+/// The schema is used to verify that each record batch written has the 
correct schema
+arrow_schema: SchemaRef,
+}
+
+impl<W: 'static + ParquetWriter> ArrowWriter<W> {
+/// Try to create a new Arrow writer
+///
+/// The writer will 

[arrow] 04/07: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types

2020-10-15 Thread nevime
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a commit to branch rust-parquet-arrow-writer
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 58855a888438fe2063db8499659d83fe1bb91b66
Author: Carol (Nichols || Goulding) 
AuthorDate: Sat Oct 3 02:34:38 2020 +0200

ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types

In this commit, I:

- Extracted a `build_field` function for some code shared between
`schema_to_fb` and `schema_to_fb_offset` that needed to change

- Uncommented the dictionary field from the Arrow schema roundtrip test
and add a dictionary field to the IPC roundtrip test

- If a field is a dictionary field, call `add_dictionary` with the
dictionary field information on the flatbuffer field, building the
dictionary as [the C++ code does][cpp-dictionary] and describe with the
same comment

- When getting the field type for a dictionary field, use the `value_type`
as [the C++ code does][cpp-value-type] and describe with the same
comment

The tests pass because the Parquet -> Arrow conversion for dictionaries
is [already supported][parquet-to-arrow].

[cpp-dictionary]: 
https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440
[cpp-value-type]: 
https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667
[parquet-to-arrow]: 
https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127
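
A hedged sketch of the roundtrip this enables, mirroring the shape of the 
existing convert.rs tests rather than quoting them; it assumes the 
flatbuffers-generated `ipc::get_root_as_schema` accessor, and the dictionary 
id defaults since `Field::new` does not expose it:

```rust
use arrow::datatypes::{DataType, Field, Schema};
use arrow::ipc;
use arrow::ipc::convert::{fb_to_schema, schema_to_fb};

fn dictionary_schema_roundtrip() {
    let schema = Schema::new(vec![Field::new(
        "d",
        // Int32 indices into a Utf8 dictionary.
        DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)),
        true,
    )]);

    // Schema -> flatbuffer -> Schema should now preserve the dictionary field.
    let fb = schema_to_fb(&schema);
    let ipc_schema = ipc::get_root_as_schema(fb.finished_data());
    let roundtripped = fb_to_schema(ipc_schema);
    assert_eq!(schema, roundtripped);
}
```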

Closes #8291 from carols10cents/rust-parquet-arrow-writer

Authored-by: Carol (Nichols || Goulding) 
Signed-off-by: Neville Dipale 
---
 rust/arrow/src/datatypes.rs  |   4 +-
 rust/arrow/src/ipc/convert.rs| 105 ++-
 rust/parquet/src/arrow/schema.rs |  20 
 3 files changed, 93 insertions(+), 36 deletions(-)

diff --git a/rust/arrow/src/datatypes.rs b/rust/arrow/src/datatypes.rs
index 0d05f82..c647af6 100644
--- a/rust/arrow/src/datatypes.rs
+++ b/rust/arrow/src/datatypes.rs
@@ -189,8 +189,8 @@ pub struct Field {
 name: String,
 data_type: DataType,
 nullable: bool,
-dict_id: i64,
-dict_is_ordered: bool,
+pub(crate) dict_id: i64,
+pub(crate) dict_is_ordered: bool,
 }
 
 pub trait ArrowNativeType:
diff --git a/rust/arrow/src/ipc/convert.rs b/rust/arrow/src/ipc/convert.rs
index 7a5795d..8f429bf 100644
--- a/rust/arrow/src/ipc/convert.rs
+++ b/rust/arrow/src/ipc/convert.rs
@@ -34,18 +34,8 @@ pub fn schema_to_fb(schema: &Schema) -> FlatBufferBuilder {
 
 let mut fields = vec![];
 for field in schema.fields() {
-let fb_field_name = fbb.create_string(field.name().as_str());
-let field_type = get_fb_field_type(field.data_type(), &mut fbb);
-let mut field_builder = ipc::FieldBuilder::new(&mut fbb);
-field_builder.add_name(fb_field_name);
-field_builder.add_type_type(field_type.type_type);
-field_builder.add_nullable(field.is_nullable());
-match field_type.children {
-None => {}
-Some(children) => field_builder.add_children(children),
-};
-field_builder.add_type_(field_type.type_);
-fields.push(field_builder.finish());
-let fb_field = build_field(&mut fbb, field);
+fields.push(fb_field);
 }
 
 let mut custom_metadata = vec![];
@@ -80,18 +70,8 @@ pub fn schema_to_fb_offset<'a: 'b, 'b>(
) -> WIPOffset<ipc::Schema<'b>> {
 let mut fields = vec![];
 for field in schema.fields() {
-let fb_field_name = fbb.create_string(field.name().as_str());
-let field_type = get_fb_field_type(field.data_type(), fbb);
-let mut field_builder = ipc::FieldBuilder::new(fbb);
-field_builder.add_name(fb_field_name);
-field_builder.add_type_type(field_type.type_type);
-field_builder.add_nullable(field.is_nullable());
-match field_type.children {
-None => {}
-Some(children) => field_builder.add_children(children),
-};
-field_builder.add_type_(field_type.type_);
-fields.push(field_builder.finish());
+let fb_field = build_field(fbb, field);
+fields.push(fb_field);
 }
 
 let mut custom_metadata = vec![];
@@ -333,6 +313,38 @@ pub(crate) struct FBFieldType<'b> {
pub(crate) children: Option<WIPOffset<Vector<'b, ForwardsUOffset<ipc::Field<'b>>>>>,
 }
 
+/// Create an IPC Field from an Arrow Field
+pub(crate) fn build_field<'a: 'b, 'b>(
+fbb: &mut FlatBufferBuilder<'a>,
+field: &Field,
+) -> WIPOffset<ipc::Field<'b>> {
+let fb_field_name = fbb.create_string(field.name().as_str());
+let field_type = get_fb_field_type(field.data_type(), fbb);
+
+let fb_dictionary = if let Dictionary(index_type, _) = field.data_type() {
+Some(get_fb_dictionary(
+index_type,
+field.dict_id,
+

[arrow] branch rust-parquet-arrow-writer updated (bd3c714 -> f70e6db)

2020-10-15 Thread nevime
This is an automated email from the ASF dual-hosted git repository.

nevime pushed a change to branch rust-parquet-arrow-writer
in repository https://gitbox.apache.org/repos/asf/arrow.git.


omit bd3c714  ARROW-10225: [Rust] [Parquet] Fix null comparison in roundtrip
omit 12add42  ARROW-10168: [Rust] [Parquet] Schema roundtrip - use Arrow 
schema from Parquet metadata when available
omit 7bfff71  ARROW-10191: [Rust] [Parquet] Add roundtrip Arrow -> Parquet 
tests for all supported Arrow DataTypes
omit d748efa  ARROW-8426: [Rust] [Parquet] Add support for writing 
dictionary types
omit 2ac525e  ARROW-10095: [Rust] Update rust-parquet-arrow-writer branch's 
encode_arrow_schema with ipc changes
omit 0a92daa  ARROW-8423: [Rust] [Parquet] Serialize Arrow schema metadata
omit 84e3576  ARROW-8289: [Rust] Parquet Arrow writer with nested support
 add 70ae161  ARROW-10290: [C++] List POP_BACK is not available in older 
CMake versions
 add f07a415  ARROW-10263: [C++][Compute] Improve variance kernel numerical 
stability
 add c5280a5  ARROW-10293: [Rust] [DataFusion] Fixed benchmarks
add 818593f  ARROW-10295 [Rust] [DataFusion] Replace Rc<RefCell<>> by Box<> 
in accumulators.
 add becf329  ARROW-10289: [Rust] Read dictionaries in IPC streams
 add ea29f65  ARROW-10292: [Rust] [DataFusion] Simplify merge
 add 249adb4  ARROW-10270: [R] Fix CSV timestamp_parsers test on R-devel
 add ac14e91  ARROW-9479: [JS] Fix Table.from for zero-item serialized 
tables, Table.empty for schemas containing compound types (List, FixedSizeList, 
Map)
 add ed8b1bc  ARROW-10145: [C++][Dataset] Assert integer overflow in 
partitioning falls back to string
 add 35ace39  ARROW-10174: [Java] Fix reading/writing dict structs
 new 4e6a836  ARROW-8289: [Rust] Parquet Arrow writer with nested support
 new 12b11b2  ARROW-8423: [Rust] [Parquet] Serialize Arrow schema metadata
 new ebe8172  ARROW-10095: [Rust] Update rust-parquet-arrow-writer branch's 
encode_arrow_schema with ipc changes
 new 58855a8  ARROW-8426: [Rust] [Parquet] Add support for writing 
dictionary types
 new de95e84  ARROW-10191: [Rust] [Parquet] Add roundtrip Arrow -> Parquet 
tests for all supported Arrow DataTypes
 new e1b613b  ARROW-10168: [Rust] [Parquet] Schema roundtrip - use Arrow 
schema from Parquet metadata when available
 new f70e6db  ARROW-10225: [Rust] [Parquet] Fix null comparison in roundtrip

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (bd3c714)
\
 N -- N -- N   refs/heads/rust-parquet-arrow-writer (f70e6db)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

The 7 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .github/workflows/r.yml|  21 ++-
 cpp/src/arrow/compute/kernels/aggregate_test.cc|  44 --
 cpp/src/arrow/compute/kernels/aggregate_var_std.cc |  21 +--
 cpp/src/arrow/dataset/partition_test.cc|  12 ++
 cpp/src/arrow/flight/CMakeLists.txt|   4 +-
 .../arrow/vector/util/DictionaryUtility.java   |  20 ++-
 .../arrow/vector/ipc/TestArrowReaderWriter.java|  96 
 .../vector/testing/ValueVectorDataPopulator.java   |  32 
 js/src/data.ts |   6 +-
 r/tests/testthat/test-csv.R|   4 +-
 rust/arrow/src/ipc/reader.rs   | 174 +
 rust/datafusion/benches/aggregate_query_sql.rs |  14 +-
 rust/datafusion/benches/math_query_sql.rs  |  36 +++--
 rust/datafusion/benches/sort_limit_query_sql.rs|  17 +-
 rust/datafusion/examples/simple_udaf.rs|   4 +-
 rust/datafusion/src/execution/context.rs   |  10 +-
 rust/datafusion/src/physical_plan/aggregates.rs|   4 +-
 .../src/physical_plan/distinct_expressions.rs  |  17 +-
 rust/datafusion/src/physical_plan/expressions.rs   |  34 ++--
 .../datafusion/src/physical_plan/hash_aggregate.rs |  36 ++---
 rust/datafusion/src/physical_plan/limit.rs |   3 +-
 rust/datafusion/src/physical_plan/merge.rs |  54 ++-
 rust/datafusion/src/physical_plan/mod.rs   |   4 +-
 rust/datafusion/src/physical_plan/planner.rs   |   5 +-
 rust/datafusion/src/physical_plan/sort.rs  |   2 

[arrow] branch master updated (ed8b1bc -> 35ace39)

2020-10-15 Thread liyafan
This is an automated email from the ASF dual-hosted git repository.

liyafan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from ed8b1bc  ARROW-10145: [C++][Dataset] Assert integer overflow in 
partitioning falls back to string
 add 35ace39  ARROW-10174: [Java] Fix reading/writing dict structs

No new revisions were added by this update.

Summary of changes:
 .../arrow/vector/util/DictionaryUtility.java   | 20 +++--
 .../arrow/vector/ipc/TestArrowReaderWriter.java| 96 ++
 .../vector/testing/ValueVectorDataPopulator.java   | 32 
 3 files changed, 141 insertions(+), 7 deletions(-)



[arrow] branch master updated (ac14e91 -> ed8b1bc)

2020-10-15 Thread jorisvandenbossche
This is an automated email from the ASF dual-hosted git repository.

jorisvandenbossche pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from ac14e91  ARROW-9479: [JS] Fix Table.from for zero-item serialized 
tables, Table.empty for schemas containing compound types (List, FixedSizeList, 
Map)
 add ed8b1bc  ARROW-10145: [C++][Dataset] Assert integer overflow in 
partitioning falls back to string

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/dataset/partition_test.cc | 12 
 1 file changed, 12 insertions(+)



[arrow] branch master updated (249adb4 -> ac14e91)

2020-10-15 Thread bhulette
This is an automated email from the ASF dual-hosted git repository.

bhulette pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git.


from 249adb4  ARROW-10270: [R] Fix CSV timestamp_parsers test on R-devel
 add ac14e91  ARROW-9479: [JS] Fix Table.from for zero-item serialized 
tables, Table.empty for schemas containing compound types (List, FixedSizeList, 
Map)

No new revisions were added by this update.

Summary of changes:
 js/src/data.ts | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)



[GitHub] [arrow-site] jorisvandenbossche commented on pull request #79: 2.0.0 release blog post

2020-10-15 Thread GitBox


jorisvandenbossche commented on pull request #79:
URL: https://github.com/apache/arrow-site/pull/79#issuecomment-709488684


   I did a first pass on adding a few python notes



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow-site] lidavidm commented on a change in pull request #79: 2.0.0 release blog post

2020-10-15 Thread GitBox


lidavidm commented on a change in pull request #79:
URL: https://github.com/apache/arrow-site/pull/79#discussion_r505707474



##
File path: _posts/2020-10-15-2.0.0-release.md
##
@@ -0,0 +1,92 @@
+---
+layout: post
+title: "Apache Arrow 2.0.0 Release"
+date: "2020-10-14 00:00:00 -0600"
+author: pmc
+categories: [release]
+---
+
+
+
+
+
+The Apache Arrow team is pleased to announce the 2.0.0 release. This covers
+over XX months of development work and includes [**XX resolved issues**][1]
+from [**XX distinct contributors**][2]. See the Install Page to learn how to
+get the libraries for your platform.
+
+The release notes below are not exhaustive and only expose selected highlights
+of the release. Many other bugfixes and improvements have been made: we refer
+you to the [complete changelog][3].
+
+## Community
+
+
+
+## Columnar Format Notes
+
+## Arrow Flight RPC notes
+For Arrow Flight, 2.0.0 mostly brings bugfixes. In Java, some memory leaks in 
`FlightStream` and `DoPut` have been addressed. In C++ and Python, a deadlock 
has been fixed in an edge case.

Review comment:
   ```suggestion
   For Arrow Flight, 2.0.0 mostly brings bugfixes. In Java, some memory leaks 
in `FlightStream` and `DoPut` have been addressed. In C++ and Python, a 
deadlock has been fixed in an edge case. Additionally, when supported by gRPC, 
TLS verification can be disabled.
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow-site] liyafan82 commented on a change in pull request #79: 2.0.0 release blog post

2020-10-15 Thread GitBox


liyafan82 commented on a change in pull request #79:
URL: https://github.com/apache/arrow-site/pull/79#discussion_r504527232



##
File path: _posts/2020-10-15-2.0.0-release.md
##
@@ -0,0 +1,84 @@
+---
+layout: post
+title: "Apache Arrow 2.0.0 Release"
+date: "2020-10-14 00:00:00 -0600"
+author: pmc
+categories: [release]
+---
+
+
+
+
+
+The Apache Arrow team is pleased to announce the 2.0.0 release. This covers
+over XX months of development work and includes [**XX resolved issues**][1]
+from [**XX distinct contributors**][2]. See the Install Page to learn how to
+get the libraries for your platform.
+
+The release notes below are not exhaustive and only expose selected highlights
+of the release. Many other bugfixes and improvements have been made: we refer
+you to the [complete changelog][3].
+
+## Community
+
+
+
+## Columnar Format Notes
+
+## Arrow Flight RPC notes
+
+## C++ notes
+
+## C# notes
+
+## Go notes
+
+## Java notes
+

Review comment:
   @nealrichardson Please check if the following notes are reasonable for 
Java. Thanks for your effort.
   
   ```suggestion
   The Java package supports a number of new features. 
   Users can now validate vectors against a wider range of criteria, if they 
are willing to spend more time on validation. 
   In dictionary encoding, dictionary indices can be expressed as unsigned 
integers.
   A framework for data compression has been set up for IPC.
   
   The calculation of vector capacity has been simplified, so users should 
see notable performance improvements in the various 'setSafe' methods.
   
   Bugs in the JDBC adapter, sort algorithms, and ComplexCopier have been 
fixed to make them more usable.
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org