[GitHub] [arrow] nevi-me commented on a change in pull request #9240: ARROW-10766: [Rust] [Parquet] Compute nested list definitions

2021-01-21 Thread GitBox


nevi-me commented on a change in pull request #9240:
URL: https://github.com/apache/arrow/pull/9240#discussion_r562158835



##
File path: rust/parquet/src/arrow/array_reader.rs
##
@@ -912,11 +912,36 @@ impl ArrayReader for ListArrayReader {
 ));
 }
 
-// Need to remove from the values array the nulls that represent null lists rather than null items
-// null lists have def_level = 0
+// List definitions can be encoded as 4 values:
+// - n + 0: the list slot is null
+// - n + 1: the list slot is not null, but is empty (i.e. [])
+// - n + 2: the list slot is not null, but its child is empty (i.e. [ null ])
+// - n + 3: the list slot is not null, and its child is not empty
+// Where n is the max definition level of the list's parent.
+// If a Parquet schema's only leaf is the list, then n = 0.
+
+// TODO: ARROW-10391 - add a test case with a non-nullable child, check if max is 3
+let list_field_type = match self.get_data_type() {
+ArrowType::List(field)
+| ArrowType::FixedSizeList(field, _)
+| ArrowType::LargeList(field) => field,
+_ => {
+// Panic: this is safe as we only write lists from list datatypes
+unreachable!()
Review comment:
   They don't make any difference: `unreachable!()` calls `panic!()` with a message about unreachable code being reached, so it's probably a more descriptive panic.
   
   I thought that marking a condition as `unreachable!()` lets the compiler optimise out that condition, but it seems like only its `unsafe` equivalent does.
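   
   A minimal standalone sketch of the distinction above (not code from this PR): `unreachable!()` panics with a descriptive message if the arm is ever reached, while its `unsafe` counterpart, `std::hint::unreachable_unchecked()`, is the hint the optimiser is actually allowed to exploit (and is undefined behaviour if reached).
   
   ```rust
   // Sketch only: contrasting the safe and unsafe "unreachable" markers.
   fn safe_variant(x: u8) -> &'static str {
       match x % 2 {
           0 => "even",
           1 => "odd",
           // Panics with "internal error: entered unreachable code" if ever hit.
           _ => unreachable!(),
       }
   }
   
   fn unsafe_variant(x: u8) -> &'static str {
       match x % 2 {
           0 => "even",
           1 => "odd",
           // Undefined behaviour if ever reached; this is the form the
           // optimiser can use to eliminate the branch entirely.
           _ => unsafe { std::hint::unreachable_unchecked() },
       }
   }
   
   fn main() {
       assert_eq!(safe_variant(4), "even");
       assert_eq!(unsafe_variant(7), "odd");
   }
   ```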





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nevi-me commented on a change in pull request #9240: ARROW-10766: [Rust] [Parquet] Compute nested list definitions

2021-01-17 Thread GitBox


nevi-me commented on a change in pull request #9240:
URL: https://github.com/apache/arrow/pull/9240#discussion_r559273034



##
File path: rust/parquet/src/arrow/array_reader.rs
##
@@ -1369,13 +1394,12 @@ impl<'a> TypeVisitor>, &'a ArrayReaderBuilderContext
 let item_reader_type = item_reader.get_data_type().clone();
 
 match item_reader_type {
-ArrowType::List(_)
-| ArrowType::FixedSizeList(_, _)
-| ArrowType::Struct(_)
-| ArrowType::Dictionary(_, _) => Err(ArrowError(format!(
-"reading List({:?}) into arrow not supported yet",
-item_type
-))),
+ArrowType::FixedSizeList(_, _) | ArrowType::Dictionary(_, _) => {

Review comment:
   I need to revert this.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nevi-me commented on a change in pull request #9240: ARROW-10766: [Rust] [Parquet] Compute nested list definitions

2021-01-17 Thread GitBox


nevi-me commented on a change in pull request #9240:
URL: https://github.com/apache/arrow/pull/9240#discussion_r559270842



##
File path: rust/parquet/src/arrow/levels.rs
##
@@ -20,12 +20,12 @@
 //! Contains the algorithm for computing definition and repetition levels.
 //! The algorithm works by tracking the slots of an array that should ultimately be populated when
 //! writing to Parquet.
-//! Parquet achieves nesting through definition levels and repetition levels \[1\].
+//! Parquet achieves nesting through definition levels and repetition levels [1].

Review comment:
   rebase gone bad. I'll fix this

##
File path: rust/parquet/src/arrow/levels.rs
##
@@ -294,40 +242,1158 @@ impl LevelInfo {
 // Need to check for these cases not implemented in C++:
 // - "Writing DictionaryArray with nested dictionary type not 
yet supported"
 // - "Writing DictionaryArray with null encoded in dictionary 
type not yet supported"
-vec![Self {
-definition: self.get_primitive_def_levels(array, field),
-repetition: self.repetition.clone(),
-definition_mask: self.definition_mask.clone(),
+// vec![self.get_primitive_def_levels(array, field, array_mask)]
+vec![self.calculate_child_levels(
+array_offsets,
+array_mask,
+false,
+field.is_nullable(),
+)]
+}
+}
+}
+
+/// Calculate child/leaf array levels.
+///
+/// The algorithm works by incrementing definitions of array values based on whether:
+/// - a value is optional or required (is_nullable)
+/// - a list value is repeated + optional or required (is_list)
+///
+/// *Examples:*

Review comment:
   I'll review this. I shouldn't have 2 *Examples:* headings.
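   
   For orientation only, a toy sketch of the two increments listed in this doc comment; the name `max_def_level` is made up and this is not the crate's `calculate_child_levels`:
   
   ```rust
   // Toy illustration of the increment rule described above: a definition level
   // grows by 1 when the value is optional, and a list adds up to 2 more
   // (1 for the repeated list level, 1 if the list slot itself is nullable).
   fn max_def_level(parent_def: i16, is_list: bool, list_nullable: bool, child_nullable: bool) -> i16 {
       let mut def = parent_def;
       if is_list {
           def += 1; // the repeated list level
           if list_nullable {
               def += 1; // the list slot may itself be null
           }
       }
       if child_nullable {
           def += 1; // the value (or list child) may be null
       }
       def
   }
   
   fn main() {
       // Nullable primitive at the root: max definition level 1.
       assert_eq!(max_def_level(0, false, false, true), 1);
       // Nullable list of nullable values at the root: max definition level 3.
       assert_eq!(max_def_level(0, true, true, true), 3);
   }
   ```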

##
File path: rust/parquet/src/arrow/levels.rs
##
@@ -294,40 +242,1158 @@ impl LevelInfo {
 // Need to check for these cases not implemented in C++:
 // - "Writing DictionaryArray with nested dictionary type not 
yet supported"
 // - "Writing DictionaryArray with null encoded in dictionary 
type not yet supported"
-vec![Self {
-definition: self.get_primitive_def_levels(array, field),
-repetition: self.repetition.clone(),
-definition_mask: self.definition_mask.clone(),
+// vec![self.get_primitive_def_levels(array, field, array_mask)]
+vec![self.calculate_child_levels(
+array_offsets,
+array_mask,
+false,
+field.is_nullable(),
+)]
+}
+}
+}
+
+/// Calculate child/leaf array levels.
+///
+/// The algorithm works by incrementing definitions of array values based on whether:
+/// - a value is optional or required (is_nullable)
+/// - a list value is repeated + optional or required (is_list)
+///
+/// *Examples:*
+///
+/// A record batch always starts at a populated definition = level 0.
+/// When a batch only has a primitive, i.e. `>, column `a`
+/// can only have a maximum level of 1 if it is not null.
+/// If it is not null, we increment by 1, such that the null slots will = level 1.
+/// The above applies to types that have no repetition (anything not a list or map).
+///
+/// If a batch has lists, then we increment by up to 2 levels:
+/// - 1 level for the list
+/// - 1 level if the list itself is nullable
+///
+/// A list's child then gets incremented using the above rules.
+///
+/// A special case is when at the root of the schema. We always increment the
+/// level regardless of whether the child is nullable or not. If we do not do
+/// this, we could have a non-nullable array having a definition of 0.
+///
+/// *Examples*
+///
+/// A batch with only a primitive that's non-nullable. ``:
+/// * We don't increment the definition level as the array is not optional.
+/// * This would leave us with a definition of 0, so the special case applies.
+/// * The definition level becomes 1.
+///
+/// A batch with only a primitive that's nullable. ``:
+/// * The definition level becomes 1, as we increment it once.
+///
+/// A batch with a single non-nullable list (both list and child not null):
+/// * We calculate the level twice, for the list, and for the child.
+/// * At the list, the level becomes 1, where 0 indicates that the list is
+///  empty, and 1 says it's not (determined through offsets).
+/// * At the primitive level

Review comment:
   I didn't finish this
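   
   To make the encoding described earlier in this thread concrete, here is a minimal standalone sketch, not the crate's implementation; the helper `list_def_levels` is hypothetical, it assumes a nullable list of nullable values with n = 0 (the list is the schema's only leaf), and it ignores repetition levels entirely:
   
   ```rust
   // Sketch of the 4-value list definition encoding for n = 0:
   //   0 -> the list slot is null
   //   1 -> the list slot is not null, but empty
   //   2 -> the list slot is not null, its child value is null
   //   3 -> the list slot is not null, and its child value is not null
   fn list_def_levels(lists: &[Option<Vec<Option<i32>>>]) -> Vec<i16> {
       let mut levels = Vec::new();
       for slot in lists {
           match slot {
               None => levels.push(0),                             // null list
               Some(items) if items.is_empty() => levels.push(1),  // empty list
               Some(items) => {
                   for item in items {
                       // one definition level per child value
                       levels.push(if item.is_some() { 3 } else { 2 });
                   }
               }
           }
       }
       levels
   }
   
   fn main() {
       // [[1, null], null, [], [2]]
       let input = vec![
           Some(vec![Some(1), None]),
           None,
           Some(vec![]),
           Some(vec![Some(2)]),
       ];
       assert_eq!(list_def_levels(&input), vec![3, 2, 0, 1, 3]);
   }
   ```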

##
File path: rust/parquet/src/arrow/levels.rs
##