zhaolong created ORC-1897:
-----------------------------

             Summary: Damaged ORC files cause many different exceptions
                 Key: ORC-1897
                 URL: https://issues.apache.org/jira/browse/ORC-1897
             Project: ORC
          Issue Type: Bug
    Affects Versions: 2.1.2, 1.6.7
            Reporter: zhaolong
We have found many cases of ORC file corruption, and errors are reported when reading.

# java.lang.ArrayIndexOutOfBoundsException: 0
at org.apache.orc.impl.RunLengthIntegerReaderV2.readPatchedBaseValues(RunLengthIntegerReaderV2.java:200)
at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:70)
at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
at org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:373)
at org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:696)
at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextVector(TreeReaderFactory.java:2463)
at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:42)
at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:72)
at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1236)

# java.lang.IllegalArgumentException: Buffer size too small. size = 131072 needed = 471700 in column 1 kind DICTIONARY_DATA
at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:487)
at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:531)
at org.apache.orc.impl.InStream$CompressedStream.available(InStream.java:538)
at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryStream(TreeReaderFactory.java:1776)
at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.startStripe(TreeReaderFactory.java:1740)
at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.startStripe(TreeReaderFactory.java:1491)
at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.startStripe(TreeReaderFactory.java:2076)
at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1117)
at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1154)
at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1189)
at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:251)
at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:851)
at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:845)

Take the second exception as an example. The chunkLength should be between 32 KB (Snappy) and 256 KB (LZO, ZLIB), so why is needed = 471700? We have tested the CPU and memory of the hardware, and no error was found. No HDFS erasure-coding (EC) policy is configured. So we want to read the ORC file back after Hive writes it in FileSinkOperator. However, considering the performance impact, we can only read ORC metadata, such as stripe sizes, to check whether there is any problem.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
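The "Buffer size too small" error comes from the 3-byte compression-chunk header that InStream.readHeader decodes. Per the ORC specification, the header is a little-endian 24-bit value: the low bit flags an "original" (stored uncompressed) chunk, and the remaining 23 bits hold the chunk length. A single corrupted header byte can therefore produce an arbitrary length up to 2^23 - 1, which is why a value like needed = 471700 can appear even though no writer ever produces chunks larger than its compression buffer. A minimal sketch of the decoding (the function names here are illustrative, not from the ORC code base):

```python
def parse_chunk_header(header: bytes) -> tuple[int, bool]:
    """Decode the 3-byte ORC compression-chunk header.

    Per the ORC spec the header is little-endian: bit 0 flags an
    uncompressed ("original") chunk, bits 1..23 hold the chunk length.
    """
    if len(header) != 3:
        raise ValueError("ORC chunk header is exactly 3 bytes")
    raw = header[0] | (header[1] << 8) | (header[2] << 16)
    is_original = bool(raw & 1)
    chunk_length = raw >> 1
    return chunk_length, is_original


def chunk_length_is_sane(chunk_length: int, buffer_size: int) -> bool:
    """A chunk can never be larger than the writer's compression buffer."""
    return 0 < chunk_length <= buffer_size


# Worked example from the ORC spec: a 100,000-byte compressed chunk
# has the header bytes [0x40, 0x0d, 0x03].
length, original = parse_chunk_header(bytes([0x40, 0x0D, 0x03]))
print(length, original)  # 100000 False

# The corrupted case from this report: needed = 471700 against a
# 131072-byte (128 KB) buffer fails the sanity check.
print(chunk_length_is_sane(471700, 131072))  # False
```

This is also why the reported value is not bounded by the 32 KB-256 KB range: the reader trusts the 23-bit length field as written on disk, so corruption shows up as an implausible "needed" size rather than a checksum failure.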
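For the cheap post-write check described above, one validation that needs no decompression at all is to inspect the file tail: per the ORC specification, the last byte of an ORC file is the PostScript length, and the serialized PostScript ends with the magic string "ORC" immediately before that byte (assuming a writer that serializes the magic as the PostScript's last field, as current writers do). A stdlib-only sketch with a hypothetical helper name, orc_tail_looks_valid; a fuller check would open the file with the ORC reader API and enumerate stripe metadata (e.g. Reader.getStripes() in Java):

```python
def orc_tail_looks_valid(tail: bytes) -> bool:
    """Cheap sanity check on the last bytes of an ORC file.

    Per the ORC spec, the file's final byte is the PostScript length,
    and the serialized PostScript ends with the magic bytes b"ORC"
    directly before that length byte. This catches truncated or
    tail-corrupted files without reading any stripe data.
    """
    if len(tail) < 4:
        return False
    ps_len = tail[-1]
    # The PostScript must be non-empty and fit in the bytes we have.
    if ps_len == 0 or ps_len + 1 > len(tail):
        return False
    return tail[-4:-1] == b"ORC"


# Synthetic tails for illustration only (not real ORC files):
print(orc_tail_looks_valid(b"\x00" * 20 + b"ORC" + bytes([23])))  # True
print(orc_tail_looks_valid(b"\x00" * 20 + b"XXX" + bytes([23])))  # False
```

This kind of check is O(1) in file size (a single short read at the end), so it fits the constraint of validating files right after FileSinkOperator writes them without a full-scan performance penalty.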