Ganesha Shreedhara created HIVE-25494:
--------------------------------------
Summary: Hive query fails with IndexOutOfBoundsException when a
struct type column's field is missing in parquet file schema but present in
table schema
Key: HIVE-25494
URL: https://issues.apache.org/jira/browse/HIVE-25494
Project: Hive
Issue Type: Bug
Reporter: Ganesha Shreedhara
Attachments: test-struct.parquet

When a struct type column's field is missing in the parquet file schema but present in the table schema, and columns are accessed by name, the requestedSchema that Hive sends to the Parquet storage layer still contains a type for the missing field, because we always add the type as a primitive type when a field is missing from the file schema ([Ref|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java#L130]).

On the Parquet side, this missing field gets pruned, and since the field belongs to a struct type, this ends up creating a GroupColumnIO without any children. That causes the query to fail with an IndexOutOfBoundsException; the stack trace, followed by a simplified sketch of the projection fallback, is given below.
{code:java}
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file test-struct.parquet
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:98)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
    at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
    at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
    ... 15 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.rangeCheck(ArrayList.java:657)
    at java.util.ArrayList.get(ArrayList.java:433)
    at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
    at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
    at org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102)
    at org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97)
    at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:277)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101)
    at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
    at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
{code}
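
For clarity, below is a minimal, hypothetical sketch of the name-based projection fallback described above. It is not the actual DataWritableReadSupport code; the method name projectByName, the top-level-only handling and the BINARY placeholder type are illustrative assumptions.
{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Type;
import org.apache.parquet.schema.Types;

public class ProjectionFallbackSketch {

  /**
   * Hypothetical, simplified illustration of the behaviour described above:
   * build the requested schema from the table columns, keeping the file's type
   * when the column exists in the file and falling back to an optional
   * primitive placeholder when it does not. Per the description above, the
   * real projection also applies to struct members, which is how a missing
   * struct field still ends up in the requestedSchema as a primitive type.
   */
  static MessageType projectByName(MessageType fileSchema, List<String> tableColumns) {
    List<Type> projected = new ArrayList<>();
    for (String col : tableColumns) {
      if (fileSchema.containsField(col)) {
        // Column present in the file: keep the file's type.
        projected.add(fileSchema.getType(col));
      } else {
        // Column missing from the file: still added, but as an optional primitive.
        projected.add(Types.optional(PrimitiveTypeName.BINARY).named(col));
      }
    }
    return new MessageType(fileSchema.getName(), projected);
  }
}
{code}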
Steps to reproduce:
{code:java}
CREATE TABLE parquet_struct_test(
  `parent` struct<extracol:string> COMMENT '',
  `toplevel` string COMMENT '')
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
-- Use the attached test-struct.parquet data file to load data to this table
LOAD DATA LOCAL INPATH 'test-struct.parquet' INTO TABLE parquet_struct_test;
hive> select parent.extracol, toplevel from parquet_struct_test;
OK
Failed with exception java.io.IOException:org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://${host}/user/hive/warehouse/parquet_struct_test/test-struct.parquet
{code}
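
To make the Parquet-side behaviour concrete for this repro, here is a small, hypothetical sketch (the field name existingfield, the schema names and the class name are invented for illustration; the real schemas are the one in test-struct.parquet and the requestedSchema built by Hive). It builds the column IO tree via ColumnIOFactory, the same way InternalParquetRecordReader does, for a file whose parent struct does not contain extracol: per the description above, the extracol leaf is pruned, parent becomes a GroupColumnIO with no children, and constructing a record reader on top of it then fails with the IndexOutOfBoundsException shown in the stack trace.
{code:java}
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.PrimitiveColumnIO;
import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Type.Repetition;

public class EmptyGroupColumnIOSketch {

  public static void main(String[] args) {
    // File schema: `parent` is a struct, but it does not contain `extracol`.
    // (The field name `existingfield` is made up; the attached test-struct.parquet
    // has its own field names.)
    MessageType fileSchema = new MessageType("file_schema",
        new GroupType(Repetition.OPTIONAL, "parent",
            new PrimitiveType(Repetition.OPTIONAL, PrimitiveTypeName.BINARY, "existingfield")),
        new PrimitiveType(Repetition.OPTIONAL, PrimitiveTypeName.BINARY, "toplevel"));

    // Requested schema of the kind Hive sends for `select parent.extracol, toplevel`:
    // `extracol` is present (as a primitive placeholder) even though the file
    // does not have it.
    MessageType requestedSchema = new MessageType("hive_schema",
        new GroupType(Repetition.OPTIONAL, "parent",
            new PrimitiveType(Repetition.OPTIONAL, PrimitiveTypeName.BINARY, "extracol")),
        new PrimitiveType(Repetition.OPTIONAL, PrimitiveTypeName.BINARY, "toplevel"));

    // Build the column IO tree the way InternalParquetRecordReader does.
    MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(requestedSchema, fileSchema);

    // Only `toplevel` survives as a leaf; `parent` is a GroupColumnIO with no
    // children, which is what later trips GroupColumnIO.getFirst() with
    // IndexOutOfBoundsException when the record reader is constructed.
    for (PrimitiveColumnIO leaf : columnIO.getLeaves()) {
      System.out.println(leaf.getColumnDescriptor());
    }
  }
}
{code}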
Same query works fine in the following scenarios:
1) Accessing parquet file columns by index instead of names
{code:java}
hive> set parquet.column.index.access=true;
hive> select parent.extracol, toplevel from parquet_struct_test;
OK
NULL	toplevel
{code}
2) When VectorizedParquetRecordReader is used
{code:java}
hive> set hive.fetch.task.conversion=none;
hive> select parent.extracol, toplevel from parquet_struct_test;
Query ID = hadoop_20210831154424_19aa6f7f-ab72-4c1e-ae37-4f985e72fce9
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1630412697229_0031)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED