[GitHub] [arrow-rs] tustvold commented on a change in pull request #1185: fix: Fix a bug in how filter indices are calculated

GitBox Sat, 15 Jan 2022 14:18:32 -0800


tustvold commented on a change in pull request #1185:
URL: https://github.com/apache/arrow-rs/pull/1185#discussion_r785352427




##########
File path: parquet/src/arrow/levels.rs
##########
@@ -759,13 +759,6 @@ impl LevelInfo {
     /// Given a level's information, calculate the offsets required to index 
an array correctly.
     pub(crate) fn filter_array_indices(&self) -> Vec<usize> {
         // happy path if not dealing with lists
-        let is_nullable = match self.level_type {
-            LevelType::Primitive(is_nullable) => is_nullable,
-            _ => panic!(

Review comment:
       I think this assertion is still important - this method will return 
nonsense if called on something that isn't a leaf

##########
File path: parquet/src/arrow/levels.rs
##########
@@ -780,17 +773,25 @@ impl LevelInfo {
                 })
                 .collect();
         }
+
         let mut filtered = vec![];
-        // remove slots that are false from definition_mask
+        let mut definition_levels = self.definition.iter();
         let mut index = 0;
-        self.definition.iter().for_each(|def| {
-            if *def == self.max_definition {
-                filtered.push(index);
-            }
-            if *def >= self.max_definition - is_nullable as i16 {
-                index += 1;
+
+        for len in self.array_offsets.windows(2).map(|s| s[1] - s[0]) {

Review comment:
       I'm not hugely familiar with this code, but the below is my attempt to 
describe what is going on here.
   
   The parquet writer iterates through all the leaf arrays in a potentially 
nested arrow array, producing a list of `LevelInfo` that it then writes. Each 
`LevelInfo` identifies for that leaf, what repetition and definition levels to 
write. This method is then used to compute which non-null indices from the 
source primitive array to write that correspond to this information.
   
   It therefore iterates through the definition levels and where `def == 
self.max_definition` it records that index as one that needs to be written to 
parquet. 
   
   The problem with the previous logic was it assumed that only a null or value 
at the leaf level, i.e. `def >= max_def - is_nullable` would increase the index 
position in the source primitive array. 
   
   Whilst this is the case for a parent nullable ListArray, it is not for a 
parent nullable StructArray. Critically, **a null in a ListArray may not take 
up space in its child array, but a null in a StructArray does**. 
   
   By using the `array_offsets` instead of the nesting level, we can handle 
both cases by distinguishing list array nulls that don't take up space  :+1:




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-rs] tustvold commented on a change in pull request #1185: fix: Fix a bug in how filter indices are calculated

Reply via email to