alamb commented on issue #8456: URL: https://github.com/apache/arrow-datafusion/issues/8456#issuecomment-1846240716
Thank you for the report @anlihust (and the reproducer).

This sounds similar to https://github.com/apache/arrow-datafusion/issues/8335, which we fixed recently. Maybe additional code somewhere incorrectly maps column names to parquet columns.

In particular, using `parquet_column` is needed to find the correct index of the column in the parquet file:

```
/// Lookups up the parquet column by name
///
/// Returns the parquet column index and the corresponding arrow field
pub(crate) fn parquet_column<'a>(
    parquet_schema: &SchemaDescriptor,
    arrow_schema: &'a Schema,
    name: &str,
) -> Option<(usize, &'a FieldRef)> {
    let (root_idx, field) = arrow_schema.fields.find(name)?;
    if field.data_type().is_nested() {
        // Nested fields are not supported and require non-trivial logic
        // to correctly walk the parquet schema accounting for the
        // logical type rules - <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md>
        //
        // For example a ListArray could correspond to anything from 1 to 3 levels
        // in the parquet schema
        return None;
    }

    // This could be made more efficient (#TBD)
    let parquet_idx = (0..parquet_schema.columns().len())
        .find(|x| parquet_schema.get_column_root_idx(*x) == root_idx)?;
    Some((parquet_idx, field))
}
```
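To illustrate why the arrow field index cannot be used directly as the parquet column index, here is a minimal std-only sketch. It does not use the arrow or parquet crates: `TopLevelField`, `primitive`, `nested`, `leaf_root_indices`, and `parquet_leaf_index` are hypothetical names made up for this example, and the "one top-level field owns N leaf columns" model is a simplification of the real parquet schema.

```rust
/// Simplified stand-in for a top-level arrow field.
/// (Hypothetical type for illustration, not the arrow-rs API.)
#[derive(Debug, Clone, PartialEq)]
pub struct TopLevelField {
    pub name: String,
    /// Number of leaf (primitive) columns this field occupies in the parquet file.
    pub leaf_count: usize,
    /// Whether the field is a nested type (struct, list, map, ...).
    pub is_nested: bool,
}

/// Builds a primitive field, which always occupies exactly one leaf column.
pub fn primitive(name: &str) -> TopLevelField {
    TopLevelField { name: name.to_string(), leaf_count: 1, is_nested: false }
}

/// Builds a nested field that occupies `leaf_count` leaf columns.
pub fn nested(name: &str, leaf_count: usize) -> TopLevelField {
    TopLevelField { name: name.to_string(), leaf_count, is_nested: true }
}

/// For every parquet leaf column, returns the index of the top-level field
/// it belongs to. This plays the role of `get_column_root_idx` in the
/// quoted code (assumption: leaves are laid out in field order).
pub fn leaf_root_indices(fields: &[TopLevelField]) -> Vec<usize> {
    fields
        .iter()
        .enumerate()
        .flat_map(|(root_idx, f)| std::iter::repeat(root_idx).take(f.leaf_count))
        .collect()
}

/// Mirrors the shape of the quoted lookup: find the field by name, give up
/// on nested fields, then find the first leaf whose root is that field.
pub fn parquet_leaf_index(fields: &[TopLevelField], name: &str) -> Option<usize> {
    let root_idx = fields.iter().position(|f| f.name == name)?;
    if fields[root_idx].is_nested {
        // Same policy as the quoted code: nested fields are not supported.
        return None;
    }
    leaf_root_indices(fields).iter().position(|&r| r == root_idx)
}
```

With the fields `id` (primitive), `point` (nested, two leaves), and `value` (primitive), `value` is arrow field 2 but parquet leaf column 3, so code that passes the arrow index straight through would read one of the leaves of `point` instead.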
