Jainik Vora created HIVE-24066:
----------------------------------
Summary: Hive query on parquet data should identify if column is
not present in file schema and show NULL value instead of Exception
Key: HIVE-24066
URL: https://issues.apache.org/jira/browse/HIVE-24066
Project: Hive
Issue Type: Bug
Components: Hive
Affects Versions: 2.3.5
Reporter: Jainik Vora
I created a Hive table containing columns with struct data types:
{code:java}
CREATE EXTERNAL TABLE abc_dwh.table_on_parquet (
`context` struct<`app`:struct<`build`:string, `name`:string,
`namespace`:string, `version`:string>, `screen`:struct<`height`:bigint,
`width`:bigint>, `timezone`:string>,
`messageid` string,
`timestamp` string,
`userid` string)
PARTITIONED BY (year string, month string, day string, hour string)
STORED as PARQUET
LOCATION 's3://abc/xyz';
{code}
All columns are nullable, so the Parquet files read by the table don't always
contain every column. If any file in a partition lacks the "context.app"
struct and "context.app.version" is queried, Hive throws the exception below.
The same happens for "context.screen".
{code:java}
Caused by: java.io.IOException: java.lang.RuntimeException: Primitive type app should not doesn't match type app[version]
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:379)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:203)
    ... 25 more
Caused by: java.lang.RuntimeException: Primitive type app should not doesn't match type app[version]
    at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.projectLeafTypes(DataWritableReadSupport.java:330)
    at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.projectLeafTypes(DataWritableReadSupport.java:322)
    at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getProjectedSchema(DataWritableReadSupport.java:249)
    at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:379)
    at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:84)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:75)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
    at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:376)
    ... 26 more
{code}
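The stack trace points at DataWritableReadSupport.projectLeafTypes, which throws as soon as a requested column path cannot be resolved against a file's schema. As a rough illustration of the current behavior versus the NULL fallback this issue asks for, here is a minimal Python sketch; schemas are modeled as nested dicts and all names are hypothetical, so this is only an illustration, not Hive's actual code:

{code:python}
# Toy model of Parquet schema projection. A file schema is a nested dict;
# a missing key means that column is absent from the file. Hypothetical
# code for illustration only -- not Hive's implementation.

def project_leaf_strict(file_schema, path):
    """Current behavior: raise when any segment of the path is missing."""
    node = file_schema
    for part in path:
        if not isinstance(node, dict) or part not in node:
            raise RuntimeError("column %s not in file schema" % ".".join(path))
        node = node[part]
    return node

def project_leaf_nullable(file_schema, path):
    """Requested behavior: treat a missing column as NULL (None)."""
    node = file_schema
    for part in path:
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

# A file written without the context.app struct, like the ones in the
# affected partition:
file_schema = {
    "context": {
        "screen": {"height": "bigint", "width": "bigint"},
        "timezone": "string",
    },
    "messageid": "string",
}

print(project_leaf_nullable(file_schema, ["context", "app", "version"]))  # None
{code}
With the nullable variant, a file lacking "context.app" would simply surface NULL for "context.app.version", matching what selecting context.app already shows.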
Querying "context.app" itself returns NULL as expected:
{code:java}
hive> select context.app from abc_dwh.table_on_parquet where year=2020 and
month='07' and day=26 and hour='03' limit 5;
OK
NULL
NULL
NULL
NULL
NULL
{code}
As a workaround, I tried querying "context.app.version" only when "context.app"
is not null, but that gave the same error. *To verify the CASE-based null
check, I ran the query below: it should return "0" for every row, but it
returned "1".* The distinct value of context.app for the partition is NULL,
which rules out differences caused by the LIMIT in the earlier select. Running
the same query in SparkSQL returns the correct result.
{code:java}
hive> select case when context.app is null then 0 else 1 end status from
abc_dwh.table_on_parquet where year=2020 and month='07' and day=26 and
hour='03' limit 5;
OK
1
1
1
1
1
{code}
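For comparison, Spark's Parquet reader effectively resolves each read against a schema that covers all files in the scan, which is why the same queries succeed in SparkSQL. A hedged sketch of that idea, again with nested-dict schemas and a hypothetical helper (not Spark's actual code):

{code:python}
# Sketch of schema union across Parquet files: files that lack
# context.app still expose the column and read it as NULL. Hypothetical
# helper for illustration only.

def merge_schemas(schemas):
    """Recursively union a list of nested-dict schemas."""
    merged = {}
    for schema in schemas:
        for name, node in schema.items():
            if isinstance(node, dict):
                merged[name] = merge_schemas([merged.get(name, {}), node])
            else:
                merged.setdefault(name, node)
    return merged

file_a = {"context": {"app": {"version": "string"}, "timezone": "string"}}
file_b = {"context": {"screen": {"height": "bigint"}, "timezone": "string"}}

merged = merge_schemas([file_a, file_b])
# merged now contains context.app, context.screen, and context.timezone,
# so a row from file_b can read context.app.version as NULL instead of
# failing at record-reader creation.
{code}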
Hive Version used: 2.3.5-amzn-0 (on AWS EMR)
--
This message was sent by Atlassian Jira
(v8.3.4#803005)