nevi-me commented on a change in pull request #1477:
URL: https://github.com/apache/arrow-rs/pull/1477#discussion_r833530557
##########
File path: parquet/src/arrow/levels.rs
##########
@@ -1822,4 +1821,22 @@ mod tests {
// filter_array_indices should return the same indices in this case.
assert_eq!(level1.filter_array_indices(),
level2.filter_array_indices());
}
+
+ #[test]
+ fn test_filter_indices_for_lists() {
Review comment:
That's my fault on the complexity here. Let's see if I can interpret the
`LevelInfo`.
A nullable list will have a maximum definition level of 3 if its child array
is also nullable. So the def levels are:
* 0: null list value `null`
* 1: empty list `[]`
* 2: list with null `[null]`
* 3: list with value `[1]`
Here's a comment earlier in this file
(https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/levels.rs#L402-L407).
In this case, `l1 = 0`, `l2 = 1`, `l3 = 2`. At `l3`, whether the value is null
or present (`null` or `1` above) gets determined by the child.
If the child is a non-nullable array, then `def = 3` isn't needed as it'd always
be true, so we'd only have `def.max() == 2`. With the same pattern, if the list
is also non-nullable, then we only need 2 levels:
* 0: empty list
* 1: list with value
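To make the counting above concrete, here is a rough sketch for a single-level list. `Slot`, `max_def_level` and `classify` are hypothetical names for illustration, not anything that exists in `levels.rs`; the only assumption is the rule described above, where the list itself always contributes one level and each nullable layer adds one more.

```rust
/// What one definition level means for a slot of a single-level list.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
enum Slot {
    NullList,  // `null`
    EmptyList, // `[]`
    NullItem,  // `[null]`
    Value,     // `[1]`
}

/// Maximum definition level: 1 for the list itself, plus 1 for each
/// nullable layer (the list and/or its child).
fn max_def_level(list_nullable: bool, item_nullable: bool) -> i16 {
    1 + list_nullable as i16 + item_nullable as i16
}

/// Interpret a definition level, walking the layers from the outside in.
/// Returns `None` if `def` is outside the valid range for this schema.
fn classify(def: i16, list_nullable: bool, item_nullable: bool) -> Option<Slot> {
    let mut level = 0;
    if list_nullable {
        if def == level {
            return Some(Slot::NullList);
        }
        level += 1;
    }
    if def == level {
        return Some(Slot::EmptyList);
    }
    level += 1;
    if item_nullable {
        if def == level {
            return Some(Slot::NullItem);
        }
        level += 1;
    }
    if def == level {
        Some(Slot::Value)
    } else {
        None
    }
}
```

With both layers nullable this gives the 0 to 3 mapping listed earlier; with both non-nullable it collapses to 0 for an empty list and 1 for a value.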
The repetition levels let us know how nested the list is, so 0 is the start
of a new list item, and 1 is a continuation.
If we have repetition like `[0, 1, 2, 1, 0, 2]` plus its definitions, we
can construct the offsets from the levels alone (meaning that `LevelInfo` is
inefficient, as we don't really need its `array_offsets`), as follows:
Offsets = every repetition value that's 0 starts a new record, but for each 0
we have to check whether the definition marks that record as empty.
So we start with 2 list values:
```
[0, 1, 2, 1]
[0, 2]
```
Then we expand further to:
```
[ [0], [1, 2], [1] ] -> 3 nested values in the list slot
[ [0, 2] ] -> 1 nested value in the list slot
```
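The two-step expansion above can be sketched in one pass. This is only an illustration for a list nested two deep (max repetition level 2); `group_rep_levels` is a made-up helper, and it groups the repetition levels themselves, the same way the examples above print them, rather than real values.

```rust
/// Group repetition levels of a two-level nested list into
/// records -> inner lists -> entries.
/// rep 0 starts a new record, rep 1 starts a new inner list in the
/// current record, and anything higher continues the current inner list.
fn group_rep_levels(reps: &[i16]) -> Vec<Vec<Vec<i16>>> {
    let mut records: Vec<Vec<Vec<i16>>> = Vec::new();
    for &rep in reps {
        // A well-formed stream starts with 0; treat a stray leading
        // non-zero level as the start of a record as well.
        if rep == 0 || records.is_empty() {
            records.push(vec![vec![rep]]);
            continue;
        }
        let record = records.last_mut().expect("checked non-empty above");
        if rep == 1 {
            record.push(vec![rep]);
        } else {
            record
                .last_mut()
                .expect("every record holds at least one inner list")
                .push(rep);
        }
    }
    records
}
```

For `[0, 1, 2, 1, 0, 2]` this yields two records, the first with 3 inner lists and the second with 1, matching the expansion above.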
Using the offsets on your test (with reps shown):
```
1. [0, 1, 2] (rep: [0, 1, 1])
2. [] (rep: [0])
3. [3, 4, 5] (rep: [0, 1, 1])
```
So in this case `filter_array_indices` is giving us the indices that have
values, hence the `[0, 1, 2, 3, 4, 5]` result.
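As a sketch of how the offsets for a case like this could come from the levels alone: the helper below is hypothetical (`list_offsets` and `min_value_def` are not part of `LevelInfo`), and it assumes a single-level list where any slot with a definition level of at least `min_value_def` occupies one child slot.

```rust
/// Build Arrow-style list offsets from repetition and definition levels.
/// `min_value_def` is the smallest definition level at which a slot
/// actually contributes a child element (an assumption of this sketch).
fn list_offsets(reps: &[i16], defs: &[i16], min_value_def: i16) -> Vec<i32> {
    assert_eq!(reps.len(), defs.len(), "levels must have equal length");
    let mut offsets = vec![0i32];
    let mut count = 0;
    for (i, (&rep, &def)) in reps.iter().zip(defs).enumerate() {
        // rep == 0 starts a new record, which closes the previous one.
        if rep == 0 && i > 0 {
            offsets.push(count);
        }
        // Empty or null lists have a low definition level and add no child.
        if def >= min_value_def {
            count += 1;
        }
    }
    // Close the last record, if there was one.
    if !reps.is_empty() {
        offsets.push(count);
    }
    offsets
}
```

Assuming the three lists above have reps `[0, 1, 1, 0, 0, 1, 1]` and defs `[2, 2, 2, 1, 2, 2, 2]` (values at def 2, the empty list at def 1), calling this with `min_value_def = 2` gives offsets `[0, 3, 3, 6]`.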
_______
So maybe a good approach for null list values is to add definition = 2. I'd
expect any index that has a def of 2 or less to be excluded.
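Roughly what I have in mind, as a sketch only: `filter_indices_above` is a made-up helper, not the existing `filter_array_indices`, and the indices it returns are positions in the definition-level slice rather than child array indices.

```rust
/// Keep only the positions whose definition level is strictly above
/// `cutoff`; everything at or below the cutoff is excluded.
fn filter_indices_above(defs: &[i16], cutoff: i16) -> Vec<usize> {
    defs.iter()
        .enumerate()
        .filter_map(|(i, &def)| if def > cutoff { Some(i) } else { None })
        .collect()
}
```

With a cutoff of 2, only the positions with def 3 (an actual value in the doubly nullable case) would survive.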
I'll be able to have a look at this PR tomorrow morning GMT, so I can tweak
it or answer questions in detail then if you don't mind @novemberkilo