[ https://issues.apache.org/jira/browse/HIVE-19479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergey Shelukhin updated HIVE-19479:
------------------------------------
Description:
The PositionProvider offset is not updated correctly and an error like this may
happen:
{noformat}
Caused by: java.lang.IllegalArgumentException: Seek in LENGTH to 541 is outside of the data
    at org.apache.orc.impl.InStream$UncompressedStream.seek(InStream.java:161)
    at org.apache.orc.impl.InStream$UncompressedStream.seek(InStream.java:123)
    at org.apache.orc.impl.RunLengthIntegerReaderV2.seek(RunLengthIntegerReaderV2.java:331)
    at org.apache.hadoop.hive.ql.io.orc.encoded.EncodedTreeReaderFactory$StringStreamReader.seek(EncodedTreeReaderFactory.java:298)
    at org.apache.hadoop.hive.ql.io.orc.encoded.EncodedTreeReaderFactory$StringStreamReader.seek(EncodedTreeReaderFactory.java:258)
    at org.apache.hadoop.hive.llap.io.decode.OrcEncodedDataConsumer.repositionInStreams(OrcEncodedDataConsumer.java:250)
    at org.apache.hadoop.hive.llap.io.decode.OrcEncodedDataConsumer.decodeBatch(OrcEncodedDataConsumer.java:134)
    at org.apache.hadoop.hive.llap.io.decode.OrcEncodedDataConsumer.decodeBatch(OrcEncodedDataConsumer.java:62)
{noformat}
We found that this happens when ORC writes an unusual stream combination: the
data stream for a row group (RG) has no values (all of its rows are null), but
the length stream still contains values (zeros) for the same rows. That is
technically a valid ORC file, although writing the zeros serves no purpose.
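The failure mode can be illustrated with a small, self-contained Java sketch.
This is a toy model, not the Hive or ORC code and not the attached patch: the
classes ListPositionProvider and ToyStream, the seekColumn helper, the position
layout (one recorded offset for the data stream followed by one for the length
stream), and the numbers other than 541 are all assumptions made for
illustration. It shows one plausible way a skipped, empty data stream can leave
the position provider pointing at the wrong entry, so that the length stream is
asked to seek to an offset that belongs to another stream.

```java
class PositionProviderSketch {

    /** Hypothetical stand-in for a position provider: returns recorded positions in order. */
    static final class ListPositionProvider {
        private final long[] positions;
        private int index = 0;

        ListPositionProvider(long... positions) {
            this.positions = positions;
        }

        long getNext() {
            return positions[index++];
        }
    }

    /** Hypothetical stream that rejects seeks beyond the bytes it holds. */
    static final class ToyStream {
        final String name;
        final long length;
        long offset = 0;

        ToyStream(String name, long length) {
            this.name = name;
            this.length = length;
        }

        void seek(long target) {
            if (target > length) {
                throw new IllegalArgumentException(
                    "Seek in " + name + " to " + target + " is outside of the data");
            }
            offset = target;
        }
    }

    /**
     * Repositions both streams of a toy string column. When the data stream is
     * empty its seek is skipped; consumeSkipped decides whether the position
     * recorded for it is still consumed (assumed correct behavior) or left in
     * the provider (assumed buggy behavior).
     */
    static void seekColumn(ListPositionProvider pp, ToyStream data,
                           ToyStream length, boolean consumeSkipped) {
        if (data.length > 0) {
            data.seek(pp.getNext());
        } else if (consumeSkipped) {
            pp.getNext(); // discard the entry that belongs to the empty data stream
        }
        length.seek(pp.getNext());
    }

    /** Runs the scenario once and reports either the final offset or the error message. */
    static String run(boolean consumeSkipped) {
        // Assumed index entry: data stream position 541, length stream position 12.
        ListPositionProvider pp = new ListPositionProvider(541L, 12L);
        ToyStream data = new ToyStream("DATA", 0);      // no values for this RG
        ToyStream length = new ToyStream("LENGTH", 40); // zeros were still written
        try {
            seekColumn(pp, data, length, consumeSkipped);
            return "LENGTH seek ok at " + length.offset;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // position for the empty stream is not consumed
        System.out.println(run(true));  // position for the empty stream is consumed
    }
}
```

In this model the only difference between the two runs is whether the entry for
the empty data stream is consumed before the length stream is repositioned.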
was:
The PositionProvider offset is not updated correctly and an error like this may
happen:
{noformat}
Caused by: java.lang.IllegalArgumentException: Seek in LENGTH to 541 is outside of the data
    at org.apache.orc.impl.InStream$UncompressedStream.seek(InStream.java:161)
    at org.apache.orc.impl.InStream$UncompressedStream.seek(InStream.java:123)
    at org.apache.orc.impl.RunLengthIntegerReaderV2.seek(RunLengthIntegerReaderV2.java:331)
    at org.apache.hadoop.hive.ql.io.orc.encoded.EncodedTreeReaderFactory$StringStreamReader.seek(EncodedTreeReaderFactory.java:298)
    at org.apache.hadoop.hive.ql.io.orc.encoded.EncodedTreeReaderFactory$StringStreamReader.seek(EncodedTreeReaderFactory.java:258)
    at org.apache.hadoop.hive.llap.io.decode.OrcEncodedDataConsumer.repositionInStreams(OrcEncodedDataConsumer.java:250)
    at org.apache.hadoop.hive.llap.io.decode.OrcEncodedDataConsumer.decodeBatch(OrcEncodedDataConsumer.java:134)
    at org.apache.hadoop.hive.llap.io.decode.OrcEncodedDataConsumer.decodeBatch(OrcEncodedDataConsumer.java:62)
{noformat}
> encoded stream seek is incorrect for 0-length RGs in LLAP IO
> ------------------------------------------------------------
>
> Key: HIVE-19479
> URL: https://issues.apache.org/jira/browse/HIVE-19479
> Project: Hive
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Priority: Major
> Attachments: HIVE-19479.01.patch, HIVE-19479.patch