[Impala-CR](cdh5-trunk) PREVIEW do parquet schema resolution by name instead of ordinal

Skye Wanderman-Milne (Code Review) Tue, 15 Mar 2016 17:32:12 -0700

Skye Wanderman-Milne has posted comments on this change.

Change subject: PREVIEW do parquet schema resolution by name instead of ordinal
......................................................................



Patch Set 1:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/2384/1/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 193:       
resolve_by_name_(state->query_options().resolve_parquet_by_name),
> let's just use the query option directly rather than having more state to k
Done


Line 1994:   if (resolve_by_name_ && (next_idx == 0 || col_type->type == 
TYPE_STRUCT)) {
> why is it that arrays and maps don't also take this path?  a comment explai
Added a comment.

I'm not sure how to refactor them, I think they're expressing sufficiently 
different conditions that it's easiest/clearest to just spell them out. Maybe 
I'm missing something though.

One thing that would make this and L1945 slightly simpler is changing the BE to 
have a single type tree rooted with a TYPE_STRUCT representing the whole table, 
rather than having to special-case the top-level table columns. I was thinking 
of proposing that after this patch goes in.


Line 2005:                          
scan_node_->hdfs_table()->col_descs()[table_idx].name()
         :                          : col_type->field_names[table_idx];
> nit: wrong indentation.
Done


Line 2010: "File $0 does not contain path '$1' 
> May be we can factor this string out or if you take the recommendation belo
Done


Line 2019:   if (node->children.size() <= file_idx) {
         :       VLOG_FILE << Substitute(
         :           "File '$0' does not contain path '$1' (resolving by 
ordinal)", filename(),
         :           PrintPath(path));
         :       *missing_field = true;
         :       return NULL;
         :     }
> It seems that this code block can be factored out to the common path. It ap
Good idea, done.


http://gerrit.cloudera.org:8080/#/c/2384/1/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 425: resolve_by_name
> query option
Removed this member


Line 598: its 
> nit: it's
Done


http://gerrit.cloudera.org:8080/#/c/2384/1/be/src/service/query-options.h
File be/src/service/query-options.h:

Line 77:   QUERY_OPT_FN(resolve_parquet_by_name, RESOLVE_PARQUET_BY_NAME);
> how about making this an enum (i.e. position, name, and eventually field-id
I think field-id will be a separate option (or unconditional default) on top of 
this. Once we have field ids, I think we'll want to always use them if they 
exist in the file + table metadata. However, for legacy files that don't 
contain ids, you'll still need to resolve by name or ordinal.


-- 
To view, visit http://gerrit.cloudera.org:8080/2384
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id0c715ea23792b2a6872610839a40532aabbb5a6
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Skye Wanderman-Milne <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Silvius Rus <[email protected]>
Gerrit-Reviewer: Skye Wanderman-Milne <[email protected]>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) PREVIEW do parquet schema resolution by name instead of ordinal

Reply via email to