[Impala-CR](cdh5-trunk) IMPALA-2853: introduce RESOLVE_PARQUET_BY_NAME query option

Dan Hecht (Code Review) Tue, 15 Mar 2016 18:34:25 -0700

Dan Hecht has posted comments on this change.

Change subject: IMPALA-2853: introduce RESOLVE_PARQUET_BY_NAME query option
......................................................................



Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/2384/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 2028: 
nit: remove blank line.


Line 2057:   }
I think the thing that makes this function difficult to read is that all the 
cases are kind of interleaved together.  For example, we check for top-level 
table 4 different types (using different methods).  What do you think about 
breaking the cases out separately?  I think that makes the code easier to 
follow and require less explanation. Something like:

  DCHECK_LT(next_idx, path.size());

  int file_idx;
  int table_idx = path[next_idx];
  if (state_->query_options().resolve_parquet_by_name) {
    if (next_idx == 0) {
      // Top level tables. Resolve by name by finding a child with the same 
name as the
      // column.
      DCHECK_LT(table_idx, scan_node_->hdfs_table()->col_descs().size());
      const string& name = 
scan_node_->hdfs_table()->col_descs()[table_idx].name();
      file_idx = FindChildWithName(node, name);
    } else if (col_type->type == TYPE_STRUCT) {
      // For structs, resolve by name finds the child with the same name as the
      // struct's field.
      DCHECK_LT(table_idx, col_type->field_names.size());
      const string& name = col_type->field_names[table_idx];
      file_idx = FindChildWithName(node, name);
    } else if (col_type->type == TYPE_ARRAY) {
      // Arrays have only one child in the file.
      DCHECK_EQ(table_idx, SchemaPathConstants::ARRAY_ITEM);
      file_idx = table_idx;
    } else {
      DCHECK_EQ(col_type->type, TYPE_MAP);
      // Maps have two ordered values.
      // QUESTION: are the child names required to be "key" and "value"?
      // If so, maybe this could be resolved by name too?
      DCHECK(table_idx == MAP_KEY || table_idx == MAP_VALUE);
      fild_idx = table_idx;
    }
  } else {
    // Resolution by ordinal.
    if (next_idx == 0) {
      // For top level tables the first index in a path includes the table's 
partition keys.
      file_idx = table_idx - scan_node_->num_partition_keys();
    } else {
      file_idx = table_idx;
    }
  }


http://gerrit.cloudera.org:8080/#/c/2384/2/be/src/service/query-options.h
File be/src/service/query-options.h:

Line 77: resolve_parquet_by_name
let me think about the two vs one query option issue a bit more. but maybe call 
it parquet_resolve_by_name so that all parquet resolved options are nicely 
prefixed with 'parquet'.


-- 
To view, visit http://gerrit.cloudera.org:8080/2384
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id0c715ea23792b2a6872610839a40532aabbb5a6
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Skye Wanderman-Milne <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Silvius Rus <[email protected]>
Gerrit-Reviewer: Skye Wanderman-Milne <[email protected]>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) IMPALA-2853: introduce RESOLVE_PARQUET_BY_NAME query option

Reply via email to