jhorstmann commented on code in PR #10101:
URL: https://github.com/apache/arrow-rs/pull/10101#discussion_r3394923151


##########
parquet/src/arrow/array_reader/mod.rs:
##########
@@ -85,6 +85,66 @@ pub use struct_array::StructArrayReader;
 ///
 /// Data can either be read in batches using [`ArrayReader::next_batch`] or
 /// incrementally using [`ArrayReader::read_records`] and [`ArrayReader::skip_records`].
+///
+/// # Definition and repetition levels
+///
+/// Parquet encodes nesting, nulls, and empty lists using *definition* and
+/// *repetition* levels, based on the [Dremel paper]. Some example nested
+/// readers are:
+/// * [`ListArrayReader`]
+/// * [`FixedSizeListArrayReader`]
+/// * [`MapArrayReader`]
+/// * [`StructArrayReader`]
+///
+/// Each nested reader accesses the levels via [`ArrayReader::get_def_levels`]
+/// and [`ArrayReader::get_rep_levels`] and uses them to reconstruct nulls,
+/// empty lists, and list boundaries.
+///
+/// Each nested reader is built with a definition level `D` and a repetition
+/// level `R` taken from its [`ParquetField`] (see its `def_level` / `rep_level`
+/// fields). Given a child's level pair `(d, r)`, the two levels are interpreted
+/// as follows.
+///
+/// **Definition level** — how "present" the value is at this level:
+///
+/// ```text
+/// ┌───────────────────────────┬────────────────────────────────────┐
+/// │           State           │             def level (d)          │
+/// ├───────────────────────────┼────────────────────────────────────┤
+/// │ present, with a value     │ d >= D                             │
+/// │ present but empty (list)  │ d == D - 1                         │
+/// │ null                      │ d <= D - 2   ← "lower still"       │
+/// └───────────────────────────┴────────────────────────────────────┘
+/// ```
+///
+/// Note that not every reader uses all three states:
+/// * a non-nullable list has no `null` state — only `d >= D` (has values) vs
+///   the empty `d == D - 1`;
+/// * a [`StructArrayReader`] has no `empty` state — only present `d >= D` vs
+///   null `d < D`.
+///
+/// **Repetition level** — where a value attaches relative to this reader's list:

Review Comment:
   I'm now wondering whether there can actually be inconsistencies between 
repetition and definition levels, which might lead to different implementations 
interpreting the same file differently. Consider the following level data for a 
required list with required items:
   
   Rep levels: [0, 1]
   Def levels: [0, 1]
   
   The first row should be an empty list according to its def level, but the 
second rep level then indicates a continuation of, or an insert into, the 
previously started list.
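
   To make the inconsistency concrete, here is a minimal sketch of a check 
that would flag the level data above. It assumes a single-level required list 
with required items (max def level 1, max rep level 1). The helper name 
`first_inconsistent_level` is hypothetical and is not part of arrow-rs or 
this PR.

```rust
/// Hypothetical helper (not an arrow-rs API): for a single-level required
/// list with required items, returns the index of the first level pair that
/// contradicts the row it belongs to.
fn first_inconsistent_level(
    rep_levels: &[i16],
    def_levels: &[i16],
    max_def: i16,
) -> Option<usize> {
    assert_eq!(rep_levels.len(), def_levels.len());
    // Whether the row currently being read was opened as an empty list.
    let mut row_is_empty = false;
    for (i, (&r, &d)) in rep_levels.iter().zip(def_levels).enumerate() {
        if r == 0 {
            // Rep level 0 starts a new row; def < max_def marks it as empty.
            row_is_empty = d < max_def;
        } else if row_is_empty || d < max_def {
            // Either a continuation of a row already marked empty, or an
            // "empty list" marker in the middle of a non-empty row.
            return Some(i);
        }
    }
    None
}

fn main() {
    // Level data from the comment: an empty row followed by a continuation.
    assert_eq!(first_inconsistent_level(&[0, 1], &[0, 1], 1), Some(1));
    // Consistent data: an empty row, then a separate row with two items.
    assert_eq!(first_inconsistent_level(&[0, 0, 1], &[0, 1, 1], 1), None);
}
```

   Whether a reader should reject such data, or pick one of the two 
interpretations, is the open question here; the sketch only detects the 
mismatch.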


