[
https://issues.apache.org/jira/browse/HIVE-22670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071630#comment-17071630
]
Peter Vary commented on HIVE-22670:
-----------------------------------
[~ganeshas]: I have two concerns with this patch:
* The main goal of vectorization is to have a tight loop; every unnecessary
addition to the code makes it slower. We really need a justification for adding
any new statement - that is why I asked for example data. Is it possible that
your data is invalid, and other readers simply do not fail on it because of
their different approach?
* Vectorization reuses batches. If we do not reset the data in case of null
values, other parts of the code might end up reading wrong values from the
reused batch (see the sketch below). This is very hard to test, since it
usually surfaces only in a race condition.
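A minimal sketch of the batch-reuse hazard (plain Java with made-up types, not
the actual Hive column vector classes): when a null value leaves the payload
slot untouched, any consumer that misses the null flag reads whatever the
previous batch left there.
{code:java}
// Minimal sketch of the batch-reuse hazard (hypothetical types, not Hive's API).
public class BatchReuseSketch {
  static long[] vector = new long[4];       // payload slots, reused across batches
  static boolean[] isNull = new boolean[4]; // null flags, also reused

  // Fills the reused arrays for the next batch. For a null value only the
  // flag is set; the payload slot still holds whatever the previous batch
  // left there, so any consumer that misses the flag reads stale data.
  static void fill(Long[] batch) {
    for (int i = 0; i < batch.length; i++) {
      isNull[i] = (batch[i] == null);
      if (batch[i] != null) {
        vector[i] = batch[i];
      }
      // else: vector[i] is deliberately not reset, mirroring the concern above
    }
  }

  public static void main(String[] args) {
    fill(new Long[] {1L, 2L, 3L, 4L});    // batch 1
    fill(new Long[] {5L, null, 7L, 8L});  // batch 2: slot 1 is null
    // A consumer that forgets to check isNull[1] reads 2 (from batch 1)
    // instead of treating the slot as null.
    System.out.println("isNull[1]=" + isNull[1] + ", vector[1]=" + vector[1]);
  }
}
{code}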
Thanks,
Peter
> ArrayIndexOutOfBoundsException when vectorized reader is used for reading a
> parquet file
> ----------------------------------------------------------------------------------------
>
> Key: HIVE-22670
> URL: https://issues.apache.org/jira/browse/HIVE-22670
> Project: Hive
> Issue Type: Bug
> Components: Parquet, Vectorization
> Affects Versions: 3.1.2, 2.3.6
> Reporter: Ganesha Shreedhara
> Assignee: Ganesha Shreedhara
> Priority: Major
> Attachments: HIVE-22670.1.patch, HIVE-22670.2.patch
>
>
> An ArrayIndexOutOfBoundsException is thrown while decoding the dictionaryIds
> of a row group in a Parquet file with vectorization enabled.
> *Exception stack trace:*
> {code:java}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
>  at org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:122)
>  at org.apache.hadoop.hive.ql.io.parquet.vector.ParquetDataColumnReaderFactory$DefaultParquetDataColumnReader.readString(ParquetDataColumnReaderFactory.java:95)
>  at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedPrimitiveColumnReader.decodeDictionaryIds(VectorizedPrimitiveColumnReader.java:467)
>  at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedPrimitiveColumnReader.readBatch(VectorizedPrimitiveColumnReader.java:68)
>  at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:410)
>  at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.next(VectorizedParquetRecordReader.java:353)
>  at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.next(VectorizedParquetRecordReader.java:92)
>  at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:365)
>  ... 24 more
> {code}
>
> This issue seems to be caused by reusing the same dictionary column vector
> while reading consecutive row groups. It looks like a corner-case bug that
> occurs for a certain distribution of dictionary- and plain-encoded data when
> the underlying bit-packed dictionary data is read into a column-vector based
> data structure. A similar issue was reported in Spark (Ref:
> https://issues.apache.org/jira/browse/SPARK-16334).
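> A deliberately simplified sketch of that failure mode (made-up data, not the
> actual reader code): dictionary ids that are valid for the new row group can
> index past the end of a smaller dictionary carried over from the previous one.
> {code:java}
> // Hypothetical data: ids from row group 2 are decoded against the stale,
> // smaller dictionary kept from row group 1, which throws
> // ArrayIndexOutOfBoundsException.
> public class StaleDictionarySketch {
>   public static void main(String[] args) {
>     String[] dictGroup1 = {"a", "b"};            // row group 1: 2 entries
>     String[] dictGroup2 = {"a", "b", "c", "d"};  // row group 2: 4 entries
>     int[] idsGroup2 = {0, 3, 2};                 // valid against dictGroup2
>
>     String[] reused = dictGroup1;                // stale vector gets reused
>     for (int id : idsGroup2) {
>       System.out.println(reused[id]);            // reused[3] -> AIOOBE
>     }
>   }
> }
> {code}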