[jira] [Created] (HIVE-25495) Upgrade to JLine3

2021-09-01 Thread David Mollitor (Jira)
David Mollitor created HIVE-25495:
-------------------------------------

 Summary: Upgrade to JLine3
 Key: HIVE-25495
 URL: https://issues.apache.org/jira/browse/HIVE-25495
 Project: Hive
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


JLine 2 was discontinued a long while ago.  Hadoop uses JLine 3, so Hive
should match.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-25494) Hive query fails with IndexOutOfBoundsException when a struct type column's field is missing in parquet file schema but present in table schema

2021-09-01 Thread Ganesha Shreedhara (Jira)
Ganesha Shreedhara created HIVE-25494:
-------------------------------------

 Summary: Hive query fails with IndexOutOfBoundsException when a 
struct type column's field is missing in parquet file schema but present in 
table schema
 Key: HIVE-25494
 URL: https://issues.apache.org/jira/browse/HIVE-25494
 Project: Hive
  Issue Type: Bug
Reporter: Ganesha Shreedhara
 Attachments: test-struct.parquet

When a struct type column's field is missing in the parquet file schema but present
in the table schema, and columns are accessed by name, the requestedSchema
sent from Hive to the Parquet storage layer has a type even for the missing field,
since we always add the type as a primitive type if a field is missing in the file schema
([Ref|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java#L130]).
On the Parquet side, this missing field gets pruned, and since the field belongs
to a struct type, it ends up creating a GroupColumnIO without any children. This
causes the query to fail with an IndexOutOfBoundsException; the stack trace is given below.

 
{code:java}
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file test-struct.parquet
 at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
 at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:98)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
 at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
 at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
 ... 15 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
 at java.util.ArrayList.rangeCheck(ArrayList.java:657)
 at java.util.ArrayList.get(ArrayList.java:433)
 at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
 at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
 at org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102)
 at org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97)
 at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:277)
 at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135)
 at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101)
 at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
 at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101)
 at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
 at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
{code}
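The IndexOutOfBoundsException at the bottom of the trace comes from GroupColumnIO.getFirst(), which returns the first element of the node's children list. A minimal stand-alone sketch of why a pruned, child-less group node fails (the GroupColumnIO class below is a simplified stand-in for illustration, not the real parquet-mr class):

```java
import java.util.ArrayList;
import java.util.List;

public class EmptyGroupSketch {
    // Simplified stand-in for org.apache.parquet.io.GroupColumnIO:
    // getFirst() effectively does children.get(0).
    static class GroupColumnIO {
        final List<Object> children = new ArrayList<>();

        Object getFirst() {
            // Throws IndexOutOfBoundsException when every field of the
            // struct was pruned on the Parquet side, leaving no children.
            return children.get(0);
        }
    }

    public static void main(String[] args) {
        GroupColumnIO emptyStruct = new GroupColumnIO(); // all fields pruned
        try {
            emptyStruct.getFirst();
            System.out.println("no exception");
        } catch (IndexOutOfBoundsException e) {
            System.out.println("IndexOutOfBoundsException, as in the trace above");
        }
    }
}
```

This is why the failure only appears for struct columns: a missing primitive column is simply read as NULL, but a group node with zero children breaks the record-reader construction.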
 

Steps to reproduce:

 
{code:java}
CREATE TABLE parquet_struct_test(
`parent` struct<extracol:string> COMMENT '',
`toplevel` string COMMENT '')
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
 
-- Use the attached test-struct.parquet data file to load data to this table

LOAD DATA LOCAL INPATH 'test-struct.parquet' INTO TABLE parquet_struct_test;

hive> select parent.extracol, toplevel from parquet_struct_test;
OK
Failed with exception java.io.IOException:org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://${host}/user/hive/warehouse/parquet_struct_test/test-struct.parquet
{code}
 

Same query works fine in the following scenarios:


1) Accessing parquet file columns by index instead of names
{code:java}
hive> set parquet.column.index.access=true;
hive> select parent.extracol, toplevel from parquet_struct_test;
OK
NULL	toplevel
{code}
 

2) When VectorizedParquetRecordReader is used
{code:java}
hive> set hive.fetch.task.conversion=none;
hive> select parent.extracol, toplevel from parquet_struct_test;
Query ID = hadoop_20210831154424_19aa6f7f-ab72-4c1e-ae37-4f985e72fce9
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1630412697229_0031)
--
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED