[
https://issues.apache.org/jira/browse/IMPALA-9410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17060613#comment-17060613
]
Quanlong Huang commented on IMPALA-9410:
----------------------------------------
Hive resolves nested struct columns (but not table-level columns) by name, and
the match is case-sensitive. As a result, Impala and Hive return inconsistent
results for this case:
{code:sql}
$ beeline -u jdbc:hive2://localhost:11050 -e "select nested_struct.a from functional_orc_def.complextypestbl"
+-------+
|   a   |
+-------+
| NULL  |
| NULL  |
| NULL  |
| NULL  |
| NULL  |
| NULL  |
| NULL  |
| -1    |
+-------+
$ bin/impala-shell.sh -q "select nested_struct.a from functional_orc_def.complextypestbl"
+-----------------+
| nested_struct.a |
+-----------------+
| -1              |
| 1               |
| NULL            |
| NULL            |
| NULL            |
| NULL            |
| NULL            |
| 7               |
+-----------------+
{code}
Table functional_orc_def.complextypestbl contains two files: nullable.orc and
nonnullable.orc. The results differ because the "nested_struct" column in
nullable.orc has a subcolumn named "A" but none named "a", so Hive's
case-sensitive lookup finds no match in that file and returns NULL for its
rows. Here are the schemas of the two ORC files:
{code}
nullable.orc:
struct<id:bigint,int_array:array<int>,int_array_Array:array<array<int>>,int_map:map<string,int>,int_Map_Array:array<map<string,int>>,nested_struct:struct<A:int,b:array<int>,C:struct<d:array<array<struct<E:int,F:string>>>>,g:map<string,struct<H:struct<i:array<double>>>>>>
nonnullable.orc:
struct<ID:bigint,Int_Array:array<int>,int_array_array:array<array<int>>,Int_Map:map<string,int>,int_map_array:array<map<string,int>>,nested_Struct:struct<a:int,B:array<int>,c:struct<D:array<array<struct<e:int,f:string>>>>,G:map<string,struct<h:struct<i:array<double>>>>>>
{code}
Impala currently resolves ORC columns by index. Before we support resolving
columns by name, we need to define the expected behavior for nested columns:
whether name matching is case-sensitive or not.
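To make the three candidate behaviors concrete, here is a minimal Python sketch that resolves the subcolumn "a" against the top-level field names of nested_struct from the two schemas above. The function names and the table-level field list are hypothetical and only for illustration; this is not Impala or Hive code.

```python
# Illustrative sketch only: the field names are copied from the nested_struct
# schemas above; the helper functions are hypothetical.

# Top-level fields of nested_struct, in file order, for each ORC file.
NULLABLE_FIELDS = ["A", "b", "C", "g"]
NONNULLABLE_FIELDS = ["a", "B", "c", "G"]

# Assumed table-level definition of nested_struct (lower-case names).
TABLE_FIELDS = ["a", "b", "c", "g"]


def resolve_by_index(file_fields, table_fields, wanted):
    """Position-based resolution: reuse the field's position in the table schema."""
    idx = table_fields.index(wanted)
    return idx if idx < len(file_fields) else None


def resolve_by_name(file_fields, wanted, case_sensitive):
    """Name-based resolution: return the position of the matching file field."""
    for idx, name in enumerate(file_fields):
        if name == wanted:
            return idx
        if not case_sensitive and name.lower() == wanted.lower():
            return idx
    return None  # unresolved; a reader would likely produce NULL for this column


if __name__ == "__main__":
    for label, fields in [("nullable.orc", NULLABLE_FIELDS),
                          ("nonnullable.orc", NONNULLABLE_FIELDS)]:
        print(label,
              "index:", resolve_by_index(fields, TABLE_FIELDS, "a"),
              "name (case-sensitive):", resolve_by_name(fields, "a", True),
              "name (case-insensitive):", resolve_by_name(fields, "a", False))
```

In this sketch, only the case-sensitive name lookup fails to resolve "a" in nullable.orc, which matches the NULL rows in the Hive output above. Index-based and case-insensitive lookups both resolve it to the first field.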
> Support resolving ORC file columns by names
> -------------------------------------------
>
> Key: IMPALA-9410
> URL: https://issues.apache.org/jira/browse/IMPALA-9410
> Project: IMPALA
> Issue Type: New Feature
> Reporter: Quanlong Huang
> Priority: Major
>
> Currently we resolve ORC file columns by index. We should provide a query
> option, like PARQUET_FALLBACK_SCHEMA_RESOLUTION for Parquet (IMPALA-2835), to
> resolve ORC file columns by name.
> Note that Hive writes column names to ORC files only starting with Hive 2.x
> (HIVE-4243). For older versions of Hive, the column names in ORC files are
> placeholders such as _col0, _col1, ..., _col99. So this feature is only
> required when deployed with Hive 2+.