[GitHub] [arrow-rs] nevi-me commented on a change in pull request #1477: Test and fix for bug writing a null list.

GitBox Wed, 23 Mar 2022 10:22:03 -0700


nevi-me commented on a change in pull request #1477:
URL: https://github.com/apache/arrow-rs/pull/1477#discussion_r833530557




##########
File path: parquet/src/arrow/levels.rs
##########
@@ -1822,4 +1821,22 @@ mod tests {
         // filter_array_indices should return the same indices in this case.
         assert_eq!(level1.filter_array_indices(), level2.filter_array_indices());
     }
+
+    #[test]
+    fn test_filter_indices_for_lists() {

Review comment:
       The complexity here is my fault. Let me see if I can interpret the `LevelInfo`.
   
   A nullable list has a maximum definition level of 3 if its child array is also nullable.
   
   So the def levels are:
   * 0: null list value `null`
   * 1: empty list `[]`
   * 2: list with null `[null]`
   * 3: list with value `[1]`
   
   There's a comment about this earlier in this file 
(https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/levels.rs#L402-L407).
   In this case, `l1 = 0`, `l2 = 1`, `l3 = 2`. With `l3`, whether the value is null (`null` or `1` above) is determined by the child.
   
   If the child is a non-null array, then `def = 3` isn't needed as it'd always be true, so we'd only have `def.max() == 2`. Following the same pattern, if the list is non-null as well, then we only need 2 def levels:
   
   * 0: empty list
   * 1: list with value
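
As a minimal sketch of the four-value case above (nullable list, nullable child), the mapping from a def level to its meaning could be written like this. `ListSlot` and `classify` are illustrative names only and are not part of the parquet crate.

```rust
/// Meaning of one definition level for a nullable list with a nullable child.
/// Hypothetical type, used only to illustrate the mapping discussed above.
#[derive(Debug, PartialEq)]
enum ListSlot {
    NullList,  // def 0: the list itself is `null`
    EmptyList, // def 1: `[]`
    NullItem,  // def 2: `[null]`
    Value,     // def 3: `[1]`
}

/// Map a definition level to its meaning; levels above 3 cannot occur here.
fn classify(def: i16) -> Option<ListSlot> {
    match def {
        0 => Some(ListSlot::NullList),
        1 => Some(ListSlot::EmptyList),
        2 => Some(ListSlot::NullItem),
        3 => Some(ListSlot::Value),
        _ => None,
    }
}
```

With a non-null child the `Value` arm would move down to def 2 and the `NullItem` case would disappear.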
   
   The repetition levels let us know how nested the list is, so 0 is the start 
of a new list item, and 1 is a continuation.
   
   If we have repetition levels like `[0, 1, 2, 1, 0, 2]` plus their definitions, we can construct the offsets from them (meaning that `LevelInfo` is inefficient, as we don't really need its `array_offsets`), as follows:
   
   Offsets: every repetition value of 0 starts a new record, but for each 0 we have to check whether the definition marks that record as empty.
   
   So we start with 2 list values:
   
   ```
   [0, 1, 2, 1]
   [0, 2]
   ```
   
   Then we expand further to:
   
   ```
   [ [0], [1, 2], [1] ] -> 3 nested values in the list slot
   [ [0, 2] ] -> 1 nested value in the list slot
   ```
   
   Using the offsets on your test (with reps shown):
   
   ```
   1. [0, 1, 2] (rep: [0, 1, 1])
   2. []        (rep: [0])
   3. [3, 4, 5] (rep: [0, 1, 1])
   ```
   
   So in this case `filter_array_indices` is giving us the indices that have values, hence the `[0, 1, 2, 3, 4, 5]` result.
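
A rough sketch of how both the offsets and the value indices fall out of the levels for this example follows. It assumes a single-level list where an entry with `def == max_def` holds a value and any lower def is an empty (or null) list slot that occupies no child slot. `offsets_and_indices` is a made-up name, and this is not the actual `filter_array_indices` implementation.

```rust
/// Derive list offsets and child value indices from repetition and
/// definition levels. Illustrative only; assumes a single-level list where
/// only entries at `max_def` occupy a slot in the child array.
fn offsets_and_indices(reps: &[i16], defs: &[i16], max_def: i16) -> (Vec<i32>, Vec<usize>) {
    let mut offsets = vec![0i32];
    let mut indices = Vec::new();
    let mut child_len = 0usize;
    for (i, (&rep, &def)) in reps.iter().zip(defs).enumerate() {
        // A repetition level of 0 starts a new record, which closes the previous one.
        if rep == 0 && i > 0 {
            offsets.push(child_len as i32);
        }
        // Only fully defined entries contribute a child value.
        if def == max_def {
            indices.push(child_len);
            child_len += 1;
        }
    }
    // Close the final record, if there was one.
    if !reps.is_empty() {
        offsets.push(child_len as i32);
    }
    (offsets, indices)
}
```

For the three lists above, with reps `[0, 1, 1, 0, 0, 1, 1]`, defs `[1, 1, 1, 0, 1, 1, 1]` and `max_def = 1`, this yields offsets `[0, 3, 3, 6]` and indices `[0, 1, 2, 3, 4, 5]`.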
   _______
   
   So maybe a good approach for null list values is to add definition = 2. I'd expect any index with a def of 2 or less to be excluded.
   
   I'll be able to have a look at this PR tomorrow morning GMT, so I can tweak it or answer questions in detail then, if you don't mind, @novemberkilo.




