Re: [I] Support parquet statistics for struct columns [arrow-datafusion]

via GitHub Fri, 05 Apr 2024 17:19:29 -0700


edmondop commented on issue #8334:
URL: https://github.com/apache/arrow-datafusion/issues/8334#issuecomment-2040815727


   I found this in the statistics code:
   
   ```rust
   /// Looks up the parquet column by name
   ///
   /// Returns the parquet column index and the corresponding arrow field
   pub(crate) fn parquet_column<'a>(
       parquet_schema: &SchemaDescriptor,
       arrow_schema: &'a Schema,
       name: &str,
   ) -> Option<(usize, &'a FieldRef)> {
       let (root_idx, field) = arrow_schema.fields.find(name)?;
       if field.data_type().is_nested() {
           // Nested fields are not supported and require non-trivial logic
           // to correctly walk the parquet schema accounting for the
           // logical type rules - <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md>
           //
           // For example a ListArray could correspond to anything from 1 to 3 levels
           // in the parquet schema
           return None;
       }
   
       // This could be made more efficient (#TBD)
       let parquet_idx = (0..parquet_schema.columns().len())
           .find(|x| parquet_schema.get_column_root_idx(*x) == root_idx)?;
       Some((parquet_idx, field))
   }
   ```
   
   `git blame` shows @alamb as an author of those lines... I'll look into the
   rules. I suppose aggregation functions will need to be updated to walk the
   schema correctly?
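
To make "walking the schema" concrete, here is a rough sketch of how a struct path could be resolved to a flat leaf column index, which is roughly what the parquet column index would have to be for a struct child. It is a toy model only: `Node`, `leaf`, `group`, `leaf_count`, `leaf_index` and `example_schema` are hypothetical names, not arrow or parquet APIs. It assumes leaves are numbered depth-first, and it ignores lists and maps, which need the extra level handling described in LogicalTypes.md.

```rust
/// Toy schema node (hypothetical, not the arrow or parquet types).
/// A node with no children is treated as a leaf column.
struct Node {
    name: &'static str,
    children: Vec<Node>,
}

fn leaf(name: &'static str) -> Node {
    Node { name, children: Vec::new() }
}

fn group(name: &'static str, children: Vec<Node>) -> Node {
    Node { name, children }
}

/// Number of leaf columns underneath `node` (1 for a leaf itself).
fn leaf_count(node: &Node) -> usize {
    if node.children.is_empty() {
        1
    } else {
        node.children.iter().map(leaf_count).sum()
    }
}

/// Resolves a path such as ["s", "y"] to a flat, depth-first leaf index.
/// Returns None if the path does not exist or stops at a struct.
fn leaf_index(fields: &[Node], path: &[&str]) -> Option<usize> {
    let (first, rest) = path.split_first()?;
    let mut offset = 0;
    for node in fields {
        if node.name == *first {
            return if rest.is_empty() {
                // The path ends here: only a leaf maps to a single column.
                node.children.is_empty().then_some(offset)
            } else {
                // Descend into the struct and shift by the leaves seen so far.
                leaf_index(&node.children, rest).map(|i| offset + i)
            };
        }
        // Skip every leaf that belongs to this sibling.
        offset += leaf_count(node);
    }
    None
}

/// Example schema: a, s { x, y }, b
fn example_schema() -> Vec<Node> {
    vec![
        leaf("a"),
        group("s", vec![leaf("x"), leaf("y")]),
        leaf("b"),
    ]
}
```

With this layout, `leaf_index(&example_schema(), &["s", "y"])` resolves to leaf 2, while `&["s"]` on its own yields `None`, since a struct has no single column of statistics. That mirrors the early `return None` for nested fields in the snippet above.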


