[ https://issues.apache.org/jira/browse/HIVE-29271 ]


    Mohamed Ali deleted comment on HIVE-29271:
    ------------------------------------

was (Author: JIRAUSER311565):
Hi [~tarak271] ,

May I ask if you have identified any workaround for this issue?

> Skip corrupted files while reading an Orc table
> -----------------------------------------------
>
>                 Key: HIVE-29271
>                 URL: https://issues.apache.org/jira/browse/HIVE-29271
>             Project: Hive
>          Issue Type: Improvement
>          Components: Hive, HiveServer2
>            Reporter: Taraka Rama Rao Lethavadla
>            Priority: Major
>
> *Scenario:*
> There are a large number of corrupted files scattered across multiple 
> partitions; they were created by external tools. When the table is queried, 
> exceptions like the following are thrown:
> {noformat}
> Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol 
> message contained an invalid tag (zero).
>     at 
> com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:89)
>     at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108)
>     at org.apache.orc.OrcProto$PostScript.<init>(OrcProto.java:30246)
>     at org.apache.orc.OrcProto$PostScript.<init>(OrcProto.java:30210)
>     at 
> org.apache.orc.OrcProto$PostScript$1.parsePartialFrom(OrcProto.java:30353)
>     at 
> org.apache.orc.OrcProto$PostScript$1.parsePartialFrom(OrcProto.java:30348)
>     at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
>     at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
>     at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
>     at org.apache.orc.OrcProto$PostScript.parseFrom(OrcProto.java:30791)
>     at org.apache.orc.impl.ReaderImpl.extractPostScript(ReaderImpl.java:644)
>     at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:814)
>     at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:567)
>     at 
> org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:61){noformat}
> As a result, it is not possible to query even the data in the good files. 
> The only workaround available today is to identify the corrupted files in 
> the table and remove them. orc-tools takes a long time to find the corrupt 
> files because it traverses each file sequentially and reports an error for 
> each corrupt file.
> *Proposal:*
> Spark has a config, *ignoreCorruptFiles*, which lets a query read the 
> remaining files while skipping corrupt ones.
> Could something similar be implemented in Hive?
> The feature could be guarded by a flag that is disabled by default.
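> For reference, the Spark setting mentioned above is 
> {{spark.sql.files.ignoreCorruptFiles}} (disabled by default); for 
> illustration, enabling it looks like:
> {noformat}
> spark.sql.files.ignoreCorruptFiles=true
> {noformat}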
>  
> *Issues:*
> If queries no longer fail, corrupt files may silently accumulate and cause 
> problems later, such as inflated table size or incorrect results.
>  
> The motivation for this request is that it is very difficult to identify 
> faulty/corrupt files in a large table.
> It would also be helpful to be able to list all corrupt files with a simple 
> Hive query, so that they can be deleted without disturbing the actual query 
> flow that skips them.
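Pending such a feature, one way to triage a table like this is to scan its directories and flag files that fail a basic ORC sanity check. The sketch below is a hypothetical standalone helper (the class and method names are not part of Hive or ORC): it only verifies the 3-byte "ORC" magic at the start of each file, which is necessary but not sufficient; a thorough check would open each file with the ORC reader (e.g. `OrcFile.createReader` from `org.apache.orc`) and catch the resulting parse exception, as in the stack trace above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

public class OrcCorruptionScanner {
    // Per the ORC file format, every valid file begins with the magic "ORC".
    private static final byte[] ORC_MAGIC = "ORC".getBytes(StandardCharsets.US_ASCII);

    /** Walks dir recursively and returns regular files that fail the header check. */
    public static List<Path> findCorruptOrcFiles(Path dir) throws IOException {
        List<Path> corrupt = new ArrayList<>();
        try (Stream<Path> files = Files.walk(dir)) {
            for (Path p : (Iterable<Path>) files.filter(Files::isRegularFile)::iterator) {
                if (!looksLikeOrc(p)) {
                    corrupt.add(p);
                }
            }
        }
        return corrupt;
    }

    // A file shorter than the magic, with different leading bytes, or that
    // cannot be read at all is treated as corrupt by this cheap check.
    private static boolean looksLikeOrc(Path p) {
        byte[] header = new byte[ORC_MAGIC.length];
        try (InputStream in = Files.newInputStream(p)) {
            int read = in.readNBytes(header, 0, header.length);
            return read == header.length && Arrays.equals(header, ORC_MAGIC);
        } catch (IOException e) {
            return false;
        }
    }
}
```

The resulting list could then be fed to a delete step, leaving only readable files behind; this mirrors what the proposed flag would do at query time, but as an offline cleanup pass.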



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
