alamb commented on a change in pull request #1622:
URL: https://github.com/apache/arrow-datafusion/pull/1622#discussion_r789859868
########## File path: datafusion/src/physical_plan/file_format/parquet.rs ##########

@@ -385,9 +388,33 @@ fn build_row_group_predicate(
     }
 }

+// Map projections from the schema which merges all file schemas to projections on a particular
+// file
+fn map_projections(
+    merged_schema: &Schema,
+    file_schema: &Schema,
+    projections: &[usize],
+) -> Vec<usize> {
+    if merged_schema.fields().len() == file_schema.fields().len() {
+        projections.to_vec()
+    } else {
+        let mut mapped: Vec<usize> = vec![];
+        for idx in projections {
+            let field = merged_schema.field(*idx);
+            if let Ok(mapped_idx) = file_schema.index_of(field.name().as_str()) {

Review comment:
   I think this should probably raise an error if it is not possible to map the data type.

########## File path: datafusion/src/physical_plan/file_format/parquet.rs ##########

@@ -473,6 +536,69 @@ mod tests {
         schema::types::SchemaDescPtr,
     };

+    #[tokio::test]

Review comment:
   Thanks! This is a great start on testing.

   I was thinking there is no reason to use checked-in parquet files for this test -- we can create the files as part of the test. I went ahead and coded this up as https://github.com/coralogix/arrow-datafusion/pull/1, as the scaffolding was a bit annoying.

   With that code I suggest we test:
   1. columns in different orders in different files (e.g. one file has (`a`, `b`, `c`) columns and another has (`b`, `a`))
   2. projection (which I do see is covered here)
   3. columns with the same name but different types
   4. columns covering different subsets of the data, e.g. (`a`, `b`) and (`b`, `c`)
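   A rough sketch of what a "generate the files in the test" setup could look like follows. This is not the scaffolding from the coralogix PR linked above; the `write_batch` helper, file names, and schemas are made up for illustration, it assumes `tempfile` and the `parquet` crate's `arrow` feature are available as dev-dependencies, and the actual `ParquetExec` wiring is left out because it depends on the exact DataFusion APIs in this branch:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::{Int32Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use tempfile::TempDir;

// Hypothetical helper: write a single RecordBatch out as a parquet file in `dir`.
fn write_batch(dir: &TempDir, name: &str, batch: &RecordBatch) -> std::path::PathBuf {
    let path = dir.path().join(name);
    let file = File::create(&path).unwrap();
    let mut writer = ArrowWriter::try_new(file, batch.schema(), None).unwrap();
    writer.write(batch).unwrap();
    writer.close().unwrap();
    path
}

#[test]
fn scan_files_with_different_column_orders() {
    let dir = TempDir::new().unwrap();

    // File 1 has columns (a, b); file 2 has the same columns in the opposite order.
    let schema1 = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Int32, true),
        Field::new("b", DataType::Utf8, true),
    ]));
    let batch1 = RecordBatch::try_new(
        schema1,
        vec![
            Arc::new(Int32Array::from(vec![1, 2])),
            Arc::new(StringArray::from(vec!["x", "y"])),
        ],
    )
    .unwrap();

    let schema2 = Arc::new(Schema::new(vec![
        Field::new("b", DataType::Utf8, true),
        Field::new("a", DataType::Int32, true),
    ]));
    let batch2 = RecordBatch::try_new(
        schema2,
        vec![
            Arc::new(StringArray::from(vec!["z"])),
            Arc::new(Int32Array::from(vec![3])),
        ],
    )
    .unwrap();

    let path1 = write_batch(&dir, "part-1.parquet", &batch1);
    let path2 = write_batch(&dir, "part-2.parquet", &batch2);

    // A real test would build a ParquetExec over both files here and assert that
    // the merged output exposes consistent (a, b) columns; that wiring is omitted
    // because it depends on the version of the scan configuration APIs.
    assert!(path1.exists() && path2.exists());
}
```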
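   On the first comment above (raising an error when the data type cannot be mapped), here is a minimal sketch of what a fallible `map_projections` might look like. The assumptions are mine, not the PR's: a column that is simply missing from the file schema is still skipped (so it can be back-filled with nulls), while a name that resolves to a different data type is rejected with a simple exact-match check; the function name and the choice of `DataFusionError::Execution` are illustrative only:

```rust
use arrow::datatypes::Schema;
use datafusion::error::{DataFusionError, Result};

// Hypothetical fallible variant of `map_projections`: columns absent from the file
// schema are skipped, but a column whose data type differs between the merged
// schema and the file schema is reported as an error instead of being silently
// dropped.
fn try_map_projections(
    merged_schema: &Schema,
    file_schema: &Schema,
    projections: &[usize],
) -> Result<Vec<usize>> {
    let mut mapped = Vec::with_capacity(projections.len());
    for idx in projections {
        let field = merged_schema.field(*idx);
        if let Ok(mapped_idx) = file_schema.index_of(field.name().as_str()) {
            let file_field = file_schema.field(mapped_idx);
            if file_field.data_type() == field.data_type() {
                mapped.push(mapped_idx);
            } else {
                return Err(DataFusionError::Execution(format!(
                    "column '{}' has type {:?} in the file but type {:?} in the merged schema",
                    field.name(),
                    file_field.data_type(),
                    field.data_type(),
                )));
            }
        }
    }
    Ok(mapped)
}
```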