nevi-me commented on a change in pull request #8792:
URL: https://github.com/apache/arrow/pull/8792#discussion_r532205047



##########
File path: rust/parquet/src/arrow/arrow_writer.rs
##########
@@ -423,25 +313,64 @@ fn write_leaf(
     Ok(written as i64)
 }
 
-/// A struct that represents definition and repetition levels.
-/// Repetition levels are only populated if the parent or current leaf is repeated
-#[derive(Debug)]
-struct Levels {
-    definition: Vec<i16>,
-    repetition: Option<Vec<i16>>,
-}
-
 /// Compute nested levels of the Arrow array, recursing into lists and structs
-fn get_levels(
+/// Returns a list of `LevelInfo`, where each level is for nested primitive arrays.
+///
+/// The algorithm works by eagerly incrementing non-null values, and decrementing
+/// when a value is null.
+///
+/// *Examples:*
+///
+/// A record batch always starts at a populated definition = level 1.
+/// When a batch only has a primitive, i.e. `<batch<primitive[a]>>`, column `a`
+/// can only have a maximum level of 1 if it is not null.
+/// If it is null, we decrement by 1, such that the null slots will = level 0.
+///
+/// If a batch has nested arrays (list, struct, union, etc.), then the incrementing
+/// takes place.
+/// A `<batch<struct[a]<primitive[b]>>>` will have up to 2 levels (if nullable).
+/// When calculating levels for `a`, if the struct slot is not empty, we
+/// increment by 1, such that we'd have `[2, 2, 2]` if all 3 slots are not null.
+/// If there is an empty slot, we decrement, leaving us with `[2, 0, 2]` as the
+/// null slot effectively means that no record is populated for the row altogether.
+///
+/// *Lists*
+///
+/// TODO
+///
+/// *Non-nullable arrays*
+///
+/// If an array is non-nullable, this is accounted for when converting the Arrow
+/// schema to a Parquet schema.
+/// When dealing with `<batch<primitive[_]>>` there is no issue, as the maximum
+/// level will always be = 1.
+///
+/// When dealing with nested types, the logic becomes a bit complicated.
+/// A non-nullable struct, `<batch<struct{non-null}[a]<primitive[b]>>>`, will only
+/// have 1 maximum level, where 0 means `b` is null, and 1 means `b` is not null.
+///
+/// We account for the above by checking if the `Field` is nullable, and adjusting
+/// the [inc|dec]rement accordingly.
+fn calculate_array_levels(

Review comment:
       I've moved it there, along with the primitive levels function
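
       For anyone reading along, here is a minimal sketch of the struct example from the doc comment above. It is not the code in this PR: `sketch_definition_levels` and its parameters are hypothetical names, validity is modelled as plain `bool` slices rather than Arrow null bitmaps, and the child `b` is assumed to be nullable. It only reproduces the `[2, 2, 2]`, `[2, 0, 2]` and non-nullable-struct cases described above.

```rust
/// Hypothetical helper (not part of the PR): definition levels for a nullable
/// primitive `b` nested in a struct `a`, i.e. `<batch<struct[a]<primitive[b]>>>`.
///
/// `struct_valid[i]` / `child_valid[i]` are `true` when slot `i` is non-null.
/// `struct_nullable` mirrors the nullability of the struct's `Field`.
fn sketch_definition_levels(
    struct_valid: &[bool],
    struct_nullable: bool,
    child_valid: &[bool],
) -> Vec<i16> {
    assert_eq!(struct_valid.len(), child_valid.len());
    struct_valid
        .iter()
        .zip(child_valid.iter())
        .map(|(&s, &c)| {
            // A null struct slot means nothing is populated for the row.
            if struct_nullable && !s {
                return 0;
            }
            let mut level: i16 = 0;
            // Only a nullable struct contributes a level of its own.
            if struct_nullable {
                level += 1;
            }
            // The (assumed nullable) child contributes a level when non-null.
            if c {
                level += 1;
            }
            level
        })
        .collect()
}
```

       With a nullable struct and slot 1 null this gives `[2, 0, 2]`; with a non-nullable struct the maximum level drops to 1, so a null `b` in slot 1 gives `[1, 0, 1]`.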



