[jira] [Updated] (HIVE-25494) Hive query fails with IndexOutOfBoundsException when a struct type column's field is missing in parquet file schema but present in table schema

Ganesha Shreedhara (Jira) Tue, 31 Aug 2021 23:35:17 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-25494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ganesha Shreedhara updated HIVE-25494:
--------------------------------------
    Description: 
When a field of a struct type column is missing from the Parquet file schema 
but present in the table schema, and columns are accessed by name, the 
requestedSchema that Hive sends to the Parquet storage layer still contains a 
type for the missing field, because we always add a primitive type when a field 
is missing from the file schema ([Ref|#L130]). On the Parquet side, this missing 
field gets pruned, and since the field belongs to a struct type, this ends up 
creating a GroupColumnIO without any children. This causes the query to fail 
with an IndexOutOfBoundsException; the stack trace is given below.

 
{code:java}
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file test-struct.parquet
 at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
 at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:98)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
 at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
 at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
 ... 15 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
 at java.util.ArrayList.rangeCheck(ArrayList.java:657)
 at java.util.ArrayList.get(ArrayList.java:433)
 at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
 at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
 at org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102)
 at org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97)
 at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:277)
 at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135)
 at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101)
 at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
 at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101)
 at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
 at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
 {code}
 

Steps to reproduce:

 
{code:java}
CREATE TABLE parquet_struct_test(
`parent` struct<child:string,extracol:string> COMMENT '',
`toplevel` string COMMENT '')
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
 
-- Use the attached test-struct.parquet data file to load data to this table

LOAD DATA LOCAL INPATH 'test-struct.parquet' INTO TABLE parquet_struct_test;

hive> select parent.extracol, toplevel from parquet_struct_test;
OK
Failed with exception java.io.IOException:org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://${host}/user/hive/warehouse/parquet_struct_test/test-struct.parquet
{code}
 

The same query works fine in the following scenarios:

1) Accessing parquet file columns by index instead of names
{code:java}
hive> set parquet.column.index.access=true;
hive>  select parent.extracol, toplevel from parquet_struct_test;
OK
NULL toplevel{code}
 

2) When VectorizedParquetRecordReader is used
{code:java}
hive> set hive.fetch.task.conversion=none;
hive> select parent.extracol, toplevel from parquet_struct_test;
Query ID = hadoop_20210831154424_19aa6f7f-ab72-4c1e-ae37-4f985e72fce9
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1630412697229_0031)
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 3.06 s
----------------------------------------------------------------------------------------------
OK
NULL toplevel{code}
 

3) Create a copy of the same table and run the same query on the newly created 
table. 
{code:java}
hive> create table parquet_struct_test_copy like parquet_struct_test;
OK
hive> insert into parquet_struct_test_copy select * from parquet_struct_test;
Query ID = hadoop_20210831154709_954d0abf-d713-498e-8696-27fb9c457dc8
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1630412697229_0031)
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 3.81 s
----------------------------------------------------------------------------------------------
Loading data to table default.parquet_struct_test_copy
OK
hive> select parent.extracol, toplevel from parquet_struct_test_copy;
OK
NULL toplevel{code}
 

Also, this issue does not occur when only the missing struct field is selected, 
or when all the fields in the table are selected. It occurs only when the 
missing struct field is selected together with another existing column.
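
The pattern of failing and working queries is consistent with the getFirst 
recursion in the stack trace. The following is a toy model only, not the real 
org.apache.parquet.io classes: the names EmptyGroupSketch, Node and queryFails 
are hypothetical, and the file is assumed to contain parent.child and toplevel 
but not parent.extracol. It sketches why an empty group is only hit when some 
other leaf column exists and asks for the first leaf of the tree.

```java
import java.util.ArrayList;
import java.util.List;

class EmptyGroupSketch {

    /** Toy column node; a stand-in, not org.apache.parquet.io.ColumnIO. */
    static final class Node {
        final boolean leaf;
        final List<Node> children = new ArrayList<>();

        Node(boolean leaf) {
            this.leaf = leaf;
        }

        /** Mimics the getFirst recursion: a group delegates to its first child. */
        Node first() {
            return leaf ? this : children.get(0).first();
        }
    }

    /**
     * Builds the toy column tree for the selected columns and reports whether
     * looking up the first leaf throws. Assumption: parent.extracol is absent
     * from the file, so it is pruned and never added to the tree.
     */
    public static boolean queryFails(boolean child, boolean extracol, boolean toplevel) {
        Node root = new Node(false);
        List<Node> leaves = new ArrayList<>();
        if (child || extracol) {
            Node parent = new Node(false);
            root.children.add(parent);
            if (child) {
                Node c = new Node(true);
                parent.children.add(c);
                leaves.add(c);
            }
            // extracol is pruned here, which can leave parent without children.
        }
        if (toplevel) {
            Node t = new Node(true);
            root.children.add(t);
            leaves.add(t);
        }
        try {
            // Each leaf asks the tree for its first leaf (an isFirst-style check).
            for (Node ignored : leaves) {
                root.first();
            }
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(queryFails(false, true, true));  // true: extracol + toplevel
        System.out.println(queryFails(false, true, false)); // false: extracol only
        System.out.println(queryFails(true, true, true));   // false: all fields
    }
}
```

In this model, selecting only parent.extracol leaves no leaf columns, so the 
empty group is never visited, and selecting all fields keeps parent.child in 
the group, so it is not empty. Whether the real reader behaves this way for the 
same reason is an assumption drawn from the stack trace.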

 


> Hive query fails with IndexOutOfBoundsException when a struct type column's 
> field is missing in parquet file schema but present in table schema
> -----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-25494
>                 URL: https://issues.apache.org/jira/browse/HIVE-25494
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Ganesha Shreedhara
>            Priority: Major
>         Attachments: test-struct.parquet
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
