[jira] [Updated] (HIVE-17404) Orc split generation cache does not handle files without file tail

Aditya Shah (JIRA) Sun, 17 Mar 2019 22:10:46 -0700


     [ https://issues.apache.org/jira/browse/HIVE-17404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Aditya Shah updated HIVE-17404:
-------------------------------
          Attachment: HIVE-17404.patch
    Target Version/s:   (was: 3.0.0, 2.4.0)
              Status: Patch Available  (was: Open)

I have submitted a patch that adds a check for the ORC bytes in the OrcTail 
before putting it in the local cache. This issue arose because HIVE-16133 
minimized the tail data stored in the cache. As a result, extractTails is 
called to rebuild the OrcTail when it is used, which in turn triggers the 
footer check and results in an error being thrown. For old ORC files, when the 
tail is not present, that check looks at the head of the file for the “ORC” 
text; in this call only the tail is available, so the check throws an exception.
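
The idea of guarding the cache can be sketched roughly as follows. This is only 
an illustration under assumptions, not the code in the attached patch: the 
class TailCacheSketch and its methods are hypothetical names, the cache is a 
plain HashMap rather than Hive's LocalCache, and the check assumes the usual 
ORC layout in which the last byte of the file is the postscript length and the 
three bytes before it are the "ORC" magic.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: only cache a serialized tail if it carries the ORC magic itself.
final class TailCacheSketch {
    // The magic text that newer ORC files store at the end of the postscript.
    private static final byte[] MAGIC = "ORC".getBytes(StandardCharsets.US_ASCII);

    // Stand-in for the local cache, keyed by file path (an assumption for illustration).
    private final Map<String, ByteBuffer> cache = new HashMap<>();

    // Returns true if the bytes just before the final length byte spell "ORC".
    static boolean tailHasOrcMagic(ByteBuffer tail) {
        int fullLength = MAGIC.length + 1; // magic plus the trailing postscript-length byte
        if (tail == null || tail.remaining() < fullLength) {
            return false;
        }
        int magicStart = tail.limit() - fullLength;
        for (int i = 0; i < MAGIC.length; i++) {
            // Absolute get: does not move the buffer's position.
            if (tail.get(magicStart + i) != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Caches the tail only when it can later be validated without reading the file head.
    boolean putIfValid(String path, ByteBuffer tail) {
        if (!tailHasOrcMagic(tail)) {
            return false; // caller falls back to reading the footer from the file
        }
        cache.put(path, tail.duplicate());
        return true;
    }

    ByteBuffer get(String path) {
        return cache.get(path);
    }
}
```

With a guard of this kind, a tail from an old file that lacks the magic is 
simply never cached, so a later rebuild from the cached bytes does not reach 
the head-of-file check.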

cc [~prasanth_j] [~rajesh.balamohan] [~andrewom]

> Orc split generation cache does not handle files without file tail
> ------------------------------------------------------------------
>
>                 Key: HIVE-17404
>                 URL: https://issues.apache.org/jira/browse/HIVE-17404
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 3.0.0, 2.4.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Aditya Shah
>            Priority: Critical
>         Attachments: HIVE-17404.patch
>
>
> Some old files do not have an ORC FileTail. If the file tail does not exist, 
> split generation should fall back to the old way of storing footers. 
> Otherwise, split generation can fail with exceptions like the one below:
> {code}
> ORC split generation failed with exception: Malformed ORC file. Invalid postscript length 9
>       at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1735)
>       at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1822)
>       at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:450)
>       at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:569)
>       at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:196)
>       at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
>       at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
>       at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
>       at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.orc.FileFormatException: Malformed ORC file. Invalid postscript length 9
>       at org.apache.orc.impl.ReaderImpl.ensureOrcFooter(ReaderImpl.java:297)
>       at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:470)
>       at org.apache.hadoop.hive.ql.io.orc.LocalCache.getAndValidate(LocalCache.java:103)
>       at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$ETLSplitStrategy.getSplits(OrcInputFormat.java:804)
>       at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$ETLSplitStrategy.runGetSplitsSync(OrcInputFormat.java:922)
>       at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$ETLSplitStrategy.generateSplitWork(OrcInputFormat.java:891)
>       at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.scheduleSplits(OrcInputFormat.java:1763)
>       at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1707)
>       ... 15 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
