[ https://issues.apache.org/jira/browse/ORC-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17952801#comment-17952801 ]
zhaolong commented on ORC-1897:
-------------------------------

This problem is difficult to reproduce. No test case is available.

> Damaged ORC file raises many different exceptions
> --------------------------------------------------
>
>                 Key: ORC-1897
>                 URL: https://issues.apache.org/jira/browse/ORC-1897
>             Project: ORC
>          Issue Type: Bug
>    Affects Versions: 1.6.7, 2.1.2
>            Reporter: zhaolong
>            Priority: Blocker
>
> We have found many cases of ORC file corruption, and errors are reported when the files are read back.
> # java.lang.ArrayIndexOutOfBoundsException: 0
>   at org.apache.orc.impl.RunLengthIntegerReaderV2.readPatchedBaseValues(RunLengthIntegerReaderV2.java:200)
>   at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:70)
>   at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
>   at org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:373)
>   at org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:696)
>   at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextVector(TreeReaderFactory.java:2463)
>   at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:42)
>   at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:72)
>   at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1236)
> # java.lang.IllegalArgumentException: Buffer size too small. size = 131072 needed = 471700 in column 1 kind DICTIONARY_DATA
>   at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:487)
>   at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:531)
>   at org.apache.orc.impl.InStream$CompressedStream.available(InStream.java:538)
>   at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryStream(TreeReaderFactory.java:1776)
>   at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.startStripe(TreeReaderFactory.java:1740)
>   at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.startStripe(TreeReaderFactory.java:1491)
>   at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.startStripe(TreeReaderFactory.java:2076)
>   at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1117)
>   at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1154)
>   at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1189)
>   at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:251)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:851)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:845)
> Take the second case as an example: the chunkLength should be between 32 KB (snappy) and 256 KB (lzo, zlib), so why is "needed = 471700"? We have tested the CPU and memory of the hardware and found no errors, and no HDFS erasure-coding (EC) policy is configured.
> We therefore want to read the ORC file back right after Hive writes it in FileSinkOperator. However, considering the performance impact, we can only afford to read ORC metadata, such as the stripe sizes, to check whether anything is wrong. Is there any other way to solve the above problem?
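For reference, a minimal sketch of the metadata-only check described above, using the public org.apache.orc API (OrcFile, Reader, StripeInformation). The specific sanity conditions are illustrative assumptions, not rules taken from the ORC specification.

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.StripeInformation;

public class OrcFooterCheck {

  // Reads only the file tail/footer (no stream data is decoded) and checks
  // that the stripe list is self-consistent. This is cheap, but it cannot
  // catch the RLE/decompression errors shown in the stack traces above.
  public static boolean footerLooksSane(Configuration conf, Path file) throws IOException {
    Reader reader = OrcFile.createReader(file, OrcFile.readerOptions(conf));
    long rowsFromStripes = 0;
    long previousEnd = 0;
    for (StripeInformation stripe : reader.getStripes()) {
      // Stripes must be laid out in order without overlapping, and a stripe
      // claiming zero rows or zero data bytes is a strong hint of corruption.
      if (stripe.getOffset() < previousEnd
          || stripe.getNumberOfRows() == 0
          || stripe.getDataLength() == 0) {
        return false;
      }
      previousEnd = stripe.getOffset() + stripe.getIndexLength()
          + stripe.getDataLength() + stripe.getFooterLength();
      rowsFromStripes += stripe.getNumberOfRows();
    }
    // The per-stripe row counts must add up to the footer's total row count.
    return rowsFromStripes == reader.getNumberOfRows();
  }
}
{code}

Because the two exceptions above are thrown while decoding stream data (RunLengthIntegerReaderV2, InStream$CompressedStream), they can only be triggered by actually reading rows back, e.g. iterating reader.rows() over every batch, which is exactly the expensive path the reporter wants to avoid.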