[jira] [Comment Edited] (HIVE-22670) ArrayIndexOutOfBoundsException when vectorized reader is used for reading a parquet file

Ganesha Shreedhara (Jira) Tue, 31 Mar 2020 03:36:17 -0700


    [ 
https://issues.apache.org/jira/browse/HIVE-22670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071663#comment-17071663
 ]


Ganesha Shreedhara edited comment on HIVE-22670 at 3/31/20, 10:35 AM:
----------------------------------------------------------------------

[~pvary] The data I have is valid. The Hive query worked fine on the 
same dataset in version 2.1.1. We started seeing this issue in version 2.3 
because vectorization support for Parquet was added in that version 
(Ref: HIVE-14826). The query works fine with vectorization disabled. 
The same issue was also reported by other users in SPARK-16334. This fix is 
the same as the one made for SPARK-16334 (Ref: 
[https://github.com/apache/spark/pull/14941/files]). The performance impact 
seems to be small, as per [this 
comment|https://github.com/apache/spark/pull/14941#issuecomment-244487305]. 

Please let me know if there is a better way to fix this to avoid any side 
effects. 


was (Author: ganeshas):
[~pvary] The data I have is a valid one. Hive query was working fine on the 
same dataset in 2.1.1 version. We started seeing this issue in 2.3 version 
because the vectorization support for parquet data type was added in 2.3 
version (Ref: HIVE-14826). The query works fine with vectorization disabled. 
Also, the same issue was reported by other users in SPARK-16334. This fix is 
same as the one done in SPARK-16334 (Ref: 
[Spark_PR|[https://github.com/apache/spark/pull/14941/files]]). The performance 
impact seems to be less as per [this 
comment|[https://github.com/apache/spark/pull/14941#issuecomment-244487305].] 

Please let me know if there is a better way to fix this to avoid any side 
effects. 

> ArrayIndexOutOfBoundsException when vectorized reader is used for reading a 
> parquet file
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-22670
>                 URL: https://issues.apache.org/jira/browse/HIVE-22670
>             Project: Hive
>          Issue Type: Bug
>          Components: Parquet, Vectorization
>    Affects Versions: 3.1.2, 2.3.6
>            Reporter: Ganesha Shreedhara
>            Assignee: Ganesha Shreedhara
>            Priority: Major
>         Attachments: HIVE-22670.1.patch, HIVE-22670.2.patch
>
>
> An ArrayIndexOutOfBoundsException is thrown while decoding the dictionaryIds 
> of a row group in a Parquet file with vectorization enabled. 
> *Exception stack trace:*
> {code:java}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
>  at org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:122)
>  at org.apache.hadoop.hive.ql.io.parquet.vector.ParquetDataColumnReaderFactory$DefaultParquetDataColumnReader.readString(ParquetDataColumnReaderFactory.java:95)
>  at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedPrimitiveColumnReader.decodeDictionaryIds(VectorizedPrimitiveColumnReader.java:467)
>  at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedPrimitiveColumnReader.readBatch(VectorizedPrimitiveColumnReader.java:68)
>  at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:410)
>  at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.next(VectorizedParquetRecordReader.java:353)
>  at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.next(VectorizedParquetRecordReader.java:92)
>  at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:365)
>  ... 24 more{code}
>  
> This issue seems to be caused by re-using the same dictionary column vector 
> while reading consecutive row groups. It looks like a corner-case bug that 
> occurs for a certain distribution of dictionary-encoded and plain-encoded 
> data when the underlying bit-packed dictionary data is read into a 
> column-vector-based data structure. 
> A similar issue was reported in Spark (Ref: 
> https://issues.apache.org/jira/browse/SPARK-16334).
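
The reuse described above can be illustrated with a toy model. This is only a sketch under stated assumptions: DictionaryReuseSketch, Page, readBuggy and readGuarded are hypothetical names, not part of Hive, Parquet or Spark, and the guarded variant is not the attached patch. It assumes that the second row group is plain-encoded with an empty dictionary while dictionary ids from the first row group are still sitting in a reused buffer.

```java
import java.util.Arrays;

// Toy model only: every class and method here is hypothetical,
// not the Hive, Parquet, or Spark reader code.
final class DictionaryReuseSketch {

    /** One row group's worth of a string column, in one of two encodings. */
    static final class Page {
        final String[] dictionary;  // empty for a plain-encoded page
        final int[] dictionaryIds;  // null for a plain-encoded page
        final String[] plainValues; // null for a dictionary-encoded page

        private Page(String[] dictionary, int[] dictionaryIds, String[] plainValues) {
            this.dictionary = dictionary;
            this.dictionaryIds = dictionaryIds;
            this.plainValues = plainValues;
        }

        static Page dictionaryEncoded(String[] dictionary, int[] ids) {
            return new Page(dictionary, ids, null);
        }

        static Page plainEncoded(String[] values) {
            return new Page(new String[0], null, values);
        }

        boolean isDictionaryEncoded() {
            return dictionaryIds != null;
        }
    }

    // Id buffer that survives from one row group to the next (3 values per page).
    private final int[] reusedIds = new int[3];

    /** Buggy variant: decodes whatever ids are in the reused buffer. */
    String[] readBuggy(Page page) {
        if (page.isDictionaryEncoded()) {
            System.arraycopy(page.dictionaryIds, 0, reusedIds, 0, reusedIds.length);
        }
        String[] out = new String[reusedIds.length];
        for (int i = 0; i < out.length; i++) {
            // For a plain page the ids are stale and the dictionary is empty.
            out[i] = page.dictionary[reusedIds[i]];
        }
        return out;
    }

    /** Guarded variant: only decodes ids that belong to the current page. */
    String[] readGuarded(Page page) {
        if (!page.isDictionaryEncoded()) {
            return page.plainValues.clone();
        }
        String[] out = new String[page.dictionaryIds.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = page.dictionary[page.dictionaryIds[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        DictionaryReuseSketch reader = new DictionaryReuseSketch();
        Page rowGroup1 = Page.dictionaryEncoded(new String[] {"a", "b"}, new int[] {0, 1, 0});
        Page rowGroup2 = Page.plainEncoded(new String[] {"x", "y", "z"});

        System.out.println(Arrays.toString(reader.readBuggy(rowGroup1)));
        try {
            reader.readBuggy(rowGroup2);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("stale ids hit an empty dictionary: " + e);
        }
        System.out.println(Arrays.toString(reader.readGuarded(rowGroup2)));
    }
}
```

In this model the failure depends on the order and encoding of the row groups, which would be consistent with the bug only showing up for certain data distributions.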



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
