[ https://issues.apache.org/jira/browse/HIVE-29271 ]
Mohamed Ali deleted comment on HIVE-29271:
------------------------------------
was (Author: JIRAUSER311565):
Hi [~tarak271] ,
May I ask if you have identified any workaround for this issue?
> Skip corrupted files while reading an Orc table
> -----------------------------------------------
>
> Key: HIVE-29271
> URL: https://issues.apache.org/jira/browse/HIVE-29271
> Project: Hive
> Issue Type: Improvement
> Components: Hive, HiveServer2
> Reporter: Taraka Rama Rao Lethavadla
> Priority: Major
>
> *Scenario:*
> There are a large number of corrupted files, created by external tools,
> scattered across multiple partitions. When the table is queried, exceptions
> like the one below are thrown:
> {noformat}
> Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).
> at com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:89)
> at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108)
> at org.apache.orc.OrcProto$PostScript.<init>(OrcProto.java:30246)
> at org.apache.orc.OrcProto$PostScript.<init>(OrcProto.java:30210)
> at org.apache.orc.OrcProto$PostScript$1.parsePartialFrom(OrcProto.java:30353)
> at org.apache.orc.OrcProto$PostScript$1.parsePartialFrom(OrcProto.java:30348)
> at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
> at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
> at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
> at org.apache.orc.OrcProto$PostScript.parseFrom(OrcProto.java:30791)
> at org.apache.orc.impl.ReaderImpl.extractPostScript(ReaderImpl.java:644)
> at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:814)
> at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:567)
> at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:61){noformat}
> As a result, it is not possible to query the data in the good files. The only
> option available today is to identify the corrupted files and remove them from
> the table. orc-tools takes a long time to find the corrupt files, as it
> traverses each file sequentially and reports errors for the corrupt ones.
> *Proposal:*
> Spark has a config, *ignoreCorruptFiles*, which allows reading data from the
> remaining files while skipping corrupt ones.
> Can we implement something similar in Hive?
> We could add a flag to enable this feature, disabled by default.
>
> *Issues:*
> If we do not fail the queries, corrupt files may accumulate and cause problems
> later, such as inflated table size or incorrect results.
>
> The reason behind this request is that it is very difficult to identify
> faulty/corrupt files in large tables.
> It would also be useful to list all the corrupt files with a simple Hive
> query, so that they can be deleted without disturbing the normal Hive query
> flow that skips them.
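
The proposed behavior can be sketched as follows. This is a minimal, self-contained illustration, not Hive code: the class and method names are hypothetical, and a magic-byte check stands in for the real ORC footer parse that throws InvalidProtocolBufferException. A real integration would live in the ORC input format / split generation path.

```java
import java.util.ArrayList;
import java.util.List;

public class SkipCorruptSketch {
    // Stand-in for the footer parse: a valid "file" here must start with
    // the ORC magic bytes "ORC". (Hypothetical check for illustration only.)
    static boolean looksValid(byte[] contents) {
        return contents.length >= 3
                && contents[0] == 'O' && contents[1] == 'R' && contents[2] == 'C';
    }

    public static void main(String[] args) {
        byte[][] files = {
                "ORC...good data...".getBytes(),
                "\0\0garbage".getBytes(),   // simulates an invalid-tag footer
                "ORC...more good data...".getBytes(),
        };
        List<Integer> corrupt = new ArrayList<>();
        for (int i = 0; i < files.length; i++) {
            if (!looksValid(files[i])) {
                corrupt.add(i);             // record the corrupt file...
                continue;                   // ...and skip it instead of failing
            }
            // ... hand the file to the normal ORC reader here ...
        }
        // The collected list is what a "list all corrupt files" query could expose.
        System.out.println("skipped corrupt files: " + corrupt);
    }
}
```

With the flag disabled (the proposed default), the `continue` would instead rethrow, preserving today's fail-fast behavior.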
--
This message was sent by Atlassian Jira
(v8.20.10#820010)