nevi-me commented on a change in pull request #8792:
URL: https://github.com/apache/arrow/pull/8792#discussion_r532205047
##########
File path: rust/parquet/src/arrow/arrow_writer.rs
##########
@@ -423,25 +313,64 @@ fn write_leaf(
     Ok(written as i64)
 }
 
-/// A struct that represents definition and repetition levels.
-/// Repetition levels are only populated if the parent or current leaf is repeated
-#[derive(Debug)]
-struct Levels {
-    definition: Vec<i16>,
-    repetition: Option<Vec<i16>>,
-}
-
 /// Compute nested levels of the Arrow array, recursing into lists and structs
-fn get_levels(
+/// Returns a list of `LevelInfo`, where each level is for nested primitive arrays.
+///
+/// The algorithm works by eagerly incrementing non-null values, and decrementing
+/// when a value is null.
+///
+/// *Examples:*
+///
+/// A record batch always starts at a populated definition = level 1.
+/// When a batch only has a primitive, i.e. `<batch<primitive[a]>>`, column `a`
+/// can only have a maximum level of 1 if it is not null.
+/// If it is null, we decrement by 1, such that the null slots will = level 0.
+///
+/// If a batch has nested arrays (list, struct, union, etc.), then the incrementing
+/// takes place.
+/// A `<batch<struct[a]<primitive[b]>>>` will have up to 2 levels (if nullable).
+/// When calculating levels for `a`, if the struct slot is not empty, we
+/// increment by 1, such that we'd have `[2, 2, 2]` if all 3 slots are not null.
+/// If there is an empty slot, we decrement, leaving us with `[2, 0, 2]` as the
+/// null slot effectively means that no record is populated for the row altogether.
+///
+/// *Lists*
+///
+/// TODO
+///
+/// *Non-nullable arrays*
+///
+/// If an array is non-nullable, this is accounted for when converting the Arrow
+/// schema to a Parquet schema.
+/// When dealing with `<batch<primitive[_]>>` there is no issue, as the maximum
+/// level will always be = 1.
+///
+/// When dealing with nested types, the logic becomes a bit complicated.
+/// A non-nullable struct, `<batch<struct{non-null}[a]<primitive[b]>>>`, will only
+/// have 1 maximum level, where 0 means `b` is null, and 1 means `b` is not null.
+///
+/// We account for the above by checking if the `Field` is nullable, and adjusting
+/// the [inc|dec]rement accordingly.
+fn calculate_array_levels(

Review comment:
I've moved it there, along with the primitive levels function

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
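
For readers following the doc comment in the diff, here is a minimal standalone sketch of the definition levels its examples describe. It is not the PR's `calculate_array_levels`: the `Layer` type and the `definition_levels` function are hypothetical names invented for illustration, validity is modelled with plain `bool` slices rather than Arrow arrays, lists and repetition levels are left out, and levels count up from 0 instead of starting at 1 and decrementing. It also assumes a non-nullable layer never has an invalid slot.

```rust
/// One nesting layer on the path to a leaf (hypothetical type, not from the PR).
/// `nullable` mirrors the field's nullability; `valid[row]` is false for a null slot.
struct Layer<'a> {
    nullable: bool,
    valid: &'a [bool],
}

/// Compute one definition level per row for a leaf reached through `layers`,
/// listed outermost first with the leaf itself last.
/// Each nullable layer that is populated at a row adds 1 to that row's level.
fn definition_levels(layers: &[Layer<'_>], num_rows: usize) -> Vec<i16> {
    (0..num_rows)
        .map(|row| {
            let mut level = 0i16;
            for layer in layers {
                if !layer.valid[row] {
                    // A null at this layer means nothing deeper is defined,
                    // so the level stays at whatever the outer layers reached.
                    break;
                }
                if layer.nullable {
                    // Only nullable layers contribute a level; non-nullable
                    // layers are always present and add nothing.
                    level += 1;
                }
            }
            level
        })
        .collect()
}
```

Under these assumptions, a nullable struct with validity `[true, false, true]` wrapping a fully valid nullable primitive gives `[2, 0, 2]`, and a non-nullable struct wrapping a primitive with validity `[true, false, true]` gives `[1, 0, 1]`, in line with the two struct examples in the doc comment.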