alamb commented on a change in pull request #1622:
URL: https://github.com/apache/arrow-datafusion/pull/1622#discussion_r789859868
########## File path: datafusion/src/physical_plan/file_format/parquet.rs ##########

@@ -385,9 +388,33 @@ fn build_row_group_predicate(
     }
 }

+// Map projections from the schema which merges all file schemas to projections on a particular
+// file
+fn map_projections(
+    merged_schema: &Schema,
+    file_schema: &Schema,
+    projections: &[usize],
+) -> Vec<usize> {
+    if merged_schema.fields().len() == file_schema.fields().len() {
+        projections.to_vec()
+    } else {
+        let mut mapped: Vec<usize> = vec![];
+        for idx in projections {
+            let field = merged_schema.field(*idx);
+            if let Ok(mapped_idx) = file_schema.index_of(field.name().as_str()) {

Review comment:
   I think this should probably raise an error if it is not possible to map the data type.

########## File path: datafusion/src/physical_plan/file_format/parquet.rs ##########

@@ -473,6 +536,69 @@ mod tests {
         schema::types::SchemaDescPtr,
     };

+    #[tokio::test]

Review comment:
   Thanks! This is a great start on testing.

   I was thinking there is no reason to use checked-in parquet files for this test -- we can create the files as part of the test. I went ahead and coded this up as https://github.com/coralogix/arrow-datafusion/pull/1, as the scaffolding was a bit annoying.

   With that code I suggest we test:
   1. columns in different orders in different files (e.g. one file has (`a`, `b`, `c`) columns and another has (`b`, `a`))
   2. projection (which I do see is covered here)
   3. columns with the same name but different types
   4. columns covering different subsets of the data, e.g. (`a`, `b`) and (`b`, `c`)
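   A rough sketch of what a "generate the files in the test" setup could look like follows. This is not the scaffolding from the coralogix PR linked above; the `write_batch` helper, file names, and schemas are made up for illustration, it assumes `tempfile` and the `parquet` crate's `arrow` feature are available as dev-dependencies, and the actual `ParquetExec` wiring is left out because it depends on the exact DataFusion APIs in this branch:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::{Int32Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use tempfile::TempDir;

// Hypothetical helper: write a single RecordBatch out as a parquet file in `dir`.
fn write_batch(dir: &TempDir, name: &str, batch: &RecordBatch) -> std::path::PathBuf {
    let path = dir.path().join(name);
    let file = File::create(&path).unwrap();
    let mut writer = ArrowWriter::try_new(file, batch.schema(), None).unwrap();
    writer.write(batch).unwrap();
    writer.close().unwrap();
    path
}

#[test]
fn scan_files_with_different_column_orders() {
    let dir = TempDir::new().unwrap();

    // File 1 has columns (a, b); file 2 has the same columns in the opposite order.
    let schema1 = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Int32, true),
        Field::new("b", DataType::Utf8, true),
    ]));
    let batch1 = RecordBatch::try_new(
        schema1,
        vec![
            Arc::new(Int32Array::from(vec![1, 2])),
            Arc::new(StringArray::from(vec!["x", "y"])),
        ],
    )
    .unwrap();

    let schema2 = Arc::new(Schema::new(vec![
        Field::new("b", DataType::Utf8, true),
        Field::new("a", DataType::Int32, true),
    ]));
    let batch2 = RecordBatch::try_new(
        schema2,
        vec![
            Arc::new(StringArray::from(vec!["z"])),
            Arc::new(Int32Array::from(vec![3])),
        ],
    )
    .unwrap();

    let path1 = write_batch(&dir, "part-1.parquet", &batch1);
    let path2 = write_batch(&dir, "part-2.parquet", &batch2);

    // A real test would build a ParquetExec over both files here and assert that
    // the merged output exposes consistent (a, b) columns; that wiring is omitted
    // because it depends on the version of the scan configuration APIs.
    assert!(path1.exists() && path2.exists());
}
```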
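   On the first comment above (raising an error when the data type cannot be mapped), here is a minimal sketch of what a fallible `map_projections` might look like. The assumptions are mine, not the PR's: a column that is simply missing from the file schema is still skipped (so it can be back-filled with nulls), while a name that resolves to a different data type is rejected with a simple exact-match check; the function name and the choice of `DataFusionError::Execution` are illustrative only:

```rust
use arrow::datatypes::Schema;
use datafusion::error::{DataFusionError, Result};

// Hypothetical fallible variant of `map_projections`: columns absent from the file
// schema are skipped, but a column whose data type differs between the merged
// schema and the file schema is reported as an error instead of being silently
// dropped.
fn try_map_projections(
    merged_schema: &Schema,
    file_schema: &Schema,
    projections: &[usize],
) -> Result<Vec<usize>> {
    let mut mapped = Vec::with_capacity(projections.len());
    for idx in projections {
        let field = merged_schema.field(*idx);
        if let Ok(mapped_idx) = file_schema.index_of(field.name().as_str()) {
            let file_field = file_schema.field(mapped_idx);
            if file_field.data_type() == field.data_type() {
                mapped.push(mapped_idx);
            } else {
                return Err(DataFusionError::Execution(format!(
                    "column '{}' has type {:?} in the file but type {:?} in the merged schema",
                    field.name(),
                    file_field.data_type(),
                    field.data_type(),
                )));
            }
        }
    }
    Ok(mapped)
}
```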