[GitHub] [arrow-rs] tustvold commented on a change in pull request #1448: Fix Parquet reader for null list

GitBox Mon, 21 Mar 2022 06:21:19 -0700


tustvold commented on a change in pull request #1448:
URL: https://github.com/apache/arrow-rs/pull/1448#discussion_r831080215




##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -885,6 +885,9 @@ fn remove_indices(
                 Ok(Arc::new(StructArray::from((new_columns, valid.finish()))))
             }
         }
+        ArrowType::Null => {

Review comment:
       Probably beyond the scope of this PR, but it occurs to me that this 
`remove_indices` function is basically an inverse of the arrow `take` kernel. I 
wonder if we could flip the construction of `null_list_indices` to be instead 
the non-null positions - and just use a regular take kernel?
   
   I might have a play with this :thinking: 

##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -1984,12 +1990,14 @@ impl<'a> ArrayReaderBuilder {
                     field = self.arrow_schema.field_with_name(part).ok();

Review comment:
       Not part of this PR, but I wonder why this doesn't just initialize 
`field` to be `self.arrow_schema.field_with_name(parts[1]).ok()` and then do 
`skip(2)` given it has already done the length check above...

##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -1984,12 +1990,14 @@ impl<'a> ArrayReaderBuilder {
                     field = self.arrow_schema.field_with_name(part).ok();
                 } else if let Some(f) = field {
                     if let ArrowType::Struct(fields) = f.data_type() {
-                        field = fields.iter().find(|f| f.name() == part)
+                        field = fields.iter().find(|f| f.name() == part);

Review comment:
       Perhaps a match block might be clearer?

##########
File path: parquet/src/arrow/arrow_reader.rs
##########
@@ -1210,4 +1210,19 @@ mod tests {
         assert_eq!(get_dict(&batches[3]), get_dict(&batches[4]));
         assert_eq!(get_dict(&batches[4]), get_dict(&batches[5]));
     }
+
+    #[test]
+    fn test_read_null_list() {
+        let testdata = arrow::util::test_util::parquet_test_data();
+        let path = format!("{}/null_list.parquet", testdata);

Review comment:
       I will have a play with creating a test that generates a parquet file, 
so that we can get this PR in without waiting for parquet-testing. I will also 
file a ticket to start a discussion on a faster way to get parquet test files 
checked in, without relying on an upstream repo 

##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -1984,12 +1990,14 @@ impl<'a> ArrayReaderBuilder {
                     field = self.arrow_schema.field_with_name(part).ok();
                 } else if let Some(f) = field {
                     if let ArrowType::Struct(fields) = f.data_type() {
-                        field = fields.iter().find(|f| f.name() == part)
+                        field = fields.iter().find(|f| f.name() == part);
+                    } else if let ArrowType::List(list_field) = f.data_type() {

Review comment:
       If I'm not mistaken this change is necessary to support a StructArray 
containing a ListArray in general, without being specific to nulls?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-rs] tustvold commented on a change in pull request #1448: Fix Parquet reader for null list

Reply via email to