zhaolong created ORC-1897:
-----------------------------

             Summary: Damaged ORC files cause many different exceptions
                 Key: ORC-1897
                 URL: https://issues.apache.org/jira/browse/ORC-1897
             Project: ORC
          Issue Type: Bug
    Affects Versions: 2.1.2, 1.6.7
            Reporter: zhaolong


We have found many cases of ORC file corruption, and different errors are
reported when reading. Two examples:
 # java.lang.ArrayIndexOutOfBoundsException: 0
 at org.apache.orc.impl.RunLengthIntegerReaderV2.readPatchedBaseValues(RunLengthIntegerReaderV2.java:200)
 at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:70)
 at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
 at org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:373)
 at org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:696)
 at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextVector(TreeReaderFactory.java:2463)
 at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:42)
 at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:72)
 at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1236)
 # java.lang.IllegalArgumentException: Buffer size too small. size = 131072 needed = 471700 in column 1 kind DICTIONARY_DATA
 at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:487)
 at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:531)
 at org.apache.orc.impl.InStream$CompressedStream.available(InStream.java:538)
 at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryStream(TreeReaderFactory.java:1776)
 at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.startStripe(TreeReaderFactory.java:1740)
 at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.startStripe(TreeReaderFactory.java:1491)
 at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.startStripe(TreeReaderFactory.java:2076)
 at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1117)
 at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1154)
 at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1189)
 at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:251)
 at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:851)
 at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:845)

Take the second stack trace as an example. The chunkLength decoded from the
compressed-chunk header should never exceed the compression buffer size, which
in our setup is between 32 KB (Snappy) and 256 KB (LZO, ZLIB), so why is
needed = 471700? We have tested the CPU and memory of the hardware and found
no errors. The EC policy of HDFS is not configured.
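For context, a minimal sketch of how the 3-byte compressed-chunk header is
decoded (an assumption on my side: this mirrors the logic in
InStream.readHeader, where the low bit of the first byte is the "original"
flag and the remaining 23 bits are the chunk length, little-endian). A decoded
length far above the buffer size, such as 471700 vs 131072, indicates a
corrupt or misaligned stream rather than a legitimate chunk:

```java
public class ChunkHeader {
    // Decode the chunk length from the 3 header bytes (b0 first in the stream).
    static int chunkLength(int b0, int b1, int b2) {
        return ((b2 & 0xff) << 15) | ((b1 & 0xff) << 7) | ((b0 & 0xff) >>> 1);
    }

    // The low bit of the first byte marks an uncompressed ("original") chunk.
    static boolean isOriginal(int b0) {
        return (b0 & 1) == 1;
    }

    public static void main(String[] args) {
        // Hypothetical header bytes: encoding 471700 as (length << 1) in
        // three little-endian bytes gives b0=40, b1=101, b2=14.
        int encoded = 471700 << 1;
        int b0 = encoded & 0xff, b1 = (encoded >>> 8) & 0xff, b2 = (encoded >>> 16) & 0xff;
        System.out.println(chunkLength(b0, b1, b2)); // prints 471700
        System.out.println(isOriginal(b0));          // prints false
    }
}
```

So a single flipped or shifted byte in the header is enough to produce the
"Buffer size too small" error seen above.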

So we want to validate the ORC file right after Hive writes it in
FileSinkOperator. However, considering the performance impact, we can only
afford to read ORC metadata, such as stripe sizes, to check whether there is
any problem.
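A metadata-only check along those lines might look like the sketch below,
using the public Reader API (Reader.getStripes, StripeInformation,
Reader.getContentLength). This is only an illustration of the idea, not a
complete validator; the class name and the specific invariants checked are my
own choices, and footer-level checks cannot catch all stream corruption:

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.StripeInformation;

public class OrcQuickCheck {
    // Returns true if the file's stripe metadata looks internally consistent.
    static boolean looksHealthy(String path, Configuration conf) {
        try {
            Reader reader = OrcFile.createReader(new Path(path),
                    OrcFile.readerOptions(conf));
            List<StripeInformation> stripes = reader.getStripes();
            long totalRows = 0;
            for (StripeInformation s : stripes) {
                // Each stripe must lie entirely inside the file's content region.
                if (s.getOffset() + s.getLength() > reader.getContentLength()) {
                    return false;
                }
                totalRows += s.getNumberOfRows();
            }
            // Per-stripe row counts must add up to the footer's row count.
            return totalRows == reader.getNumberOfRows();
        } catch (Exception e) {
            // Any failure to parse the footer/metadata counts as unhealthy.
            return false;
        }
    }
}
```

This only reads the postscript, footer, and stripe directory, so the cost is a
few small reads at the tail of the file rather than a full scan.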

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
