Dan Hecht has posted comments on this change. Change subject: IMPALA-2853: introduce RESOLVE_PARQUET_BY_NAME query option ......................................................................
Patch Set 2: (3 comments) http://gerrit.cloudera.org:8080/#/c/2384/2/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 2028: nit: remove blank line. Line 2057: } I think the thing that makes this function difficult to read is that all the cases are kind of interleaved together. For example, we check for top-level table 4 different types (using different methods). What do you think about breaking the cases out separately? I think that makes the code easier to follow and require less explanation. Something like: DCHECK_LT(next_idx, path.size()); int file_idx; int table_idx = path[next_idx]; if (state_->query_options().resolve_parquet_by_name) { if (next_idx == 0) { // Top level tables. Resolve by name by finding a child with the same name as the // column. DCHECK_LT(table_idx, scan_node_->hdfs_table()->col_descs().size()); const string& name = scan_node_->hdfs_table()->col_descs()[table_idx].name(); file_idx = FindChildWithName(node, name); } else if (col_type->type == TYPE_STRUCT) { // For structs, resolve by name finds the child with the same name as the // struct's field. DCHECK_LT(table_idx, col_type->field_names.size()); const string& name = col_type->field_names[table_idx]; file_idx = FindChildWithName(node, name); } else if (col_type->type == TYPE_ARRAY) { // Arrays have only one child in the file. DCHECK_EQ(table_idx, SchemaPathConstants::ARRAY_ITEM); file_idx = table_idx; } else { DCHECK_EQ(col_type->type, TYPE_MAP); // Maps have two ordered values. // QUESTION: are the child names required to be "key" and "value"? // If so, maybe this could be resolved by name too? DCHECK(table_idx == MAP_KEY || table_idx == MAP_VALUE); fild_idx = table_idx; } } else { // Resolution by ordinal. if (next_idx == 0) { // For top level tables the first index in a path includes the table's partition keys. file_idx = table_idx - scan_node_->num_partition_keys(); } else { file_idx = table_idx; } } http://gerrit.cloudera.org:8080/#/c/2384/2/be/src/service/query-options.h File be/src/service/query-options.h: Line 77: resolve_parquet_by_name let me think about the two vs one query option issue a bit more. but maybe call it parquet_resolve_by_name so that all parquet resolved options are nicely prefixed with 'parquet'. -- To view, visit http://gerrit.cloudera.org:8080/2384 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id0c715ea23792b2a6872610839a40532aabbb5a6 Gerrit-PatchSet: 2 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Skye Wanderman-Milne <[email protected]> Gerrit-Reviewer: Dan Hecht <[email protected]> Gerrit-Reviewer: Michael Ho <[email protected]> Gerrit-Reviewer: Silvius Rus <[email protected]> Gerrit-Reviewer: Skye Wanderman-Milne <[email protected]> Gerrit-HasComments: Yes
