[jira] [Commented] (IMPALA-9410) Support resolving ORC file columns by names

Quanlong Huang (Jira) Mon, 16 Mar 2020 20:50:27 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-9410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17060613#comment-17060613
 ]


Quanlong Huang commented on IMPALA-9410:
----------------------------------------

Hive resolves nested struct fields (but not table-level columns) by name, and 
the match is case-sensitive. As a result, Impala and Hive return inconsistent 
results for this query:
{code:sql}
$ beeline -u jdbc:hive2://localhost:11050 -e "select nested_struct.a from functional_orc_def.complextypestbl"
+-------+
|   a   |
+-------+
| NULL  |
| NULL  |
| NULL  |
| NULL  |
| NULL  |
| NULL  |
| NULL  |
| -1    |
+-------+
$ bin/impala-shell.sh -q "select nested_struct.a from functional_orc_def.complextypestbl"
+-----------------+
| nested_struct.a |
+-----------------+
| -1              |
| 1               |
| NULL            |
| NULL            |
| NULL            |
| NULL            |
| NULL            |
| 7               |
+-----------------+
{code}
Table functional_orc_def.complextypestbl contains two files: nullable.orc and 
nonnullable.orc. The results differ because the "nested_struct" column in 
nullable.orc has a subcolumn "A" but no "a", so Hive's case-sensitive lookup of 
"a" finds no match in that file and returns NULL for its rows. Here are the 
schemas of the two ORC files:
{code}
nullable.orc:
struct<id:bigint,int_array:array<int>,int_array_Array:array<array<int>>,int_map:map<string,int>,int_Map_Array:array<map<string,int>>,nested_struct:struct<A:int,b:array<int>,C:struct<d:array<array<struct<E:int,F:string>>>>,g:map<string,struct<H:struct<i:array<double>>>>>>

nonnullable.orc:
struct<ID:bigint,Int_Array:array<int>,int_array_array:array<array<int>>,Int_Map:map<string,int>,int_map_array:array<map<string,int>>,nested_Struct:struct<a:int,B:array<int>,c:struct<D:array<array<struct<e:int,f:string>>>>,G:map<string,struct<h:struct<i:array<double>>>>>>
{code}
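
To make the mismatch concrete, here is a minimal sketch of a case-sensitive 
field lookup over the "nested_struct" fields of the two files. The field lists 
are copied from the schemas above; the helper name 
find_field_case_sensitive is hypothetical and only illustrates the matching 
rule, not Hive's actual code.

```python
# Field names of the nested_struct column, copied from the two file schemas.
NESTED_STRUCT_FIELDS = {
    "nullable.orc": ["A", "b", "C", "g"],
    "nonnullable.orc": ["a", "B", "c", "G"],
}


def find_field_case_sensitive(fields, name):
    """Return the position of `name` in `fields`, or None if no exact match.

    Hypothetical helper: it only models an exact (case-sensitive) comparison.
    """
    for pos, field in enumerate(fields):
        if field == name:
            return pos
    return None


if __name__ == "__main__":
    for file_name, fields in NESTED_STRUCT_FIELDS.items():
        pos = find_field_case_sensitive(fields, "a")
        status = "unresolved (reads as NULL)" if pos is None else f"field #{pos}"
        print(f"{file_name}: nested_struct.a -> {status}")
```

Under this rule only nonnullable.orc resolves "a", which would be consistent 
with Hive returning a single non-NULL row.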

Impala currently resolves ORC columns by index. Before we support resolving 
columns by name, we need to define the expected behavior for nested columns: 
case-sensitive or case-insensitive matching. 
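
The choices under discussion can be sketched side by side. This is only an 
illustration under assumptions: the Resolution enum, its member names, and 
resolve_field are hypothetical and are not Impala's implementation or an 
existing query option. Picking the first hit when several names match 
case-insensitively is also an assumption; the real behavior for that case 
would need to be defined too.

```python
from enum import Enum


class Resolution(Enum):
    """Hypothetical resolution modes, for illustration only."""
    POSITION = "position"                  # current behavior: match by index
    NAME = "name"                          # match by name, case-sensitive
    NAME_IGNORE_CASE = "name_ignore_case"  # match by name, case-insensitive


def resolve_field(file_fields, table_pos, table_name, mode):
    """Return the index of the matching file field, or None if unresolved."""
    if mode is Resolution.POSITION:
        # The name is ignored; only the position in the table schema counts.
        return table_pos if 0 <= table_pos < len(file_fields) else None
    if mode is Resolution.NAME:
        matches = [i for i, f in enumerate(file_fields) if f == table_name]
    else:
        wanted = table_name.lower()
        matches = [i for i, f in enumerate(file_fields) if f.lower() == wanted]
    # Assumption: take the first match if more than one field qualifies.
    return matches[0] if matches else None


if __name__ == "__main__":
    # nested_struct fields of nullable.orc, copied from its schema.
    nullable_fields = ["A", "b", "C", "g"]
    for mode in Resolution:
        print(mode.value, resolve_field(nullable_fields, 0, "a", mode))
```

With the fields of nullable.orc, only the case-sensitive mode fails to resolve 
"a", which matches the difference between the Hive and Impala results.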

> Support resolving ORC file columns by names
> -------------------------------------------
>
>                 Key: IMPALA-9410
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9410
>             Project: IMPALA
>          Issue Type: New Feature
>            Reporter: Quanlong Huang
>            Priority: Major
>
> Currently we resolve ORC file columns by indices. We should provide a query 
> option like PARQUET_FALLBACK_SCHEMA_RESOLUTION for Parquet (IMPALA-2835) to 
> resolve ORC file columns by names.
> Note that Hive only writes column names to ORC files starting with Hive 2.x 
> (HIVE-4243). For older versions of Hive, the column names in ORC files are 
> something like _col0, _col1, ..., _col99. So this feature is only required 
> when deployed with Hive 2+.



