[jira] [Work logged] (HIVE-23758) OrcInputFormat.getSargColumnNames might be more failsafe in case of schema mismatch

ASF GitHub Bot (Jira) Thu, 25 Jun 2020 08:40:24 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-23758?focusedWorklogId=451141&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-451141
 ]


ASF GitHub Bot logged work on HIVE-23758:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 25/Jun/20 15:39
            Start Date: 25/Jun/20 15:39
    Worklog Time Spent: 10m 
      Work Description: pgaref commented on pull request #1174:
URL: https://github.com/apache/hive/pull/1174#issuecomment-649631270


   Hey @abstractdog -- I was just looking at this issue, which is quite 
interesting as the second bucket file is totally empty.
   It seems that this could be a compaction leftover of some sort(?) 
   
   Regarding the fix: even though it helps with this particular case, it's not 
solving the underlying problem, which is essentially a schema mismatch -- for 
example, if the wrong bucket file had the same column count (or a smaller one) 
it would still be read (potentially leading to wrong results).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 451141)
    Time Spent: 20m  (was: 10m)

> OrcInputFormat.getSargColumnNames might be more failsafe in case of schema 
> mismatch
> -----------------------------------------------------------------------------------
>
>                 Key: HIVE-23758
>                 URL: https://issues.apache.org/jira/browse/HIVE-23758
>             Project: Hive
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: orc_dump.log
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> There was a customer case where a bucket file was somehow placed into a 
> partition directory that contained another bucket file with a valid acid 
> schema (refer to [^orc_dump.log] for details), and the query failed during 
> split generation with the error below at [this 
> line|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L497]
> {code}
> Caused by: java.lang.RuntimeException: ORC split generation failed with 
> exception: java.lang.IndexOutOfBoundsException: Index: 6, Size: 6
>         at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1871)
>         at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1959)
>         at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:532)
>         at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:789)
>         at 
> org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
>         at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
>         at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
>         at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
>         at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
>         at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
>         ... 4 more
> Caused by: java.util.concurrent.ExecutionException: 
> java.lang.IndexOutOfBoundsException: Index: 6, Size: 6
>         at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>         at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1865)
>         ... 17 more
> Caused by: java.lang.IndexOutOfBoundsException: Index: 6, Size: 6
>         at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>         at java.util.ArrayList.get(ArrayList.java:429)
>         at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSargColumnNames(OrcInputFormat.java:482)
>         at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.extractNeededColNames(OrcInputFormat.java:539)
>         at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.extractNeededColNames(OrcInputFormat.java:534)
>         at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.access$2900(OrcInputFormat.java:158)
>         at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.callInternal(OrcInputFormat.java:1556)
>         at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.access$2700(OrcInputFormat.java:1337)
>         at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1522)
>         at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1519)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1519)
>         at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1337)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> {code}
> I haven't figured out the origin of the second, empty file, but a sanity 
> check on the length helped to get past this issue by ignoring that file 
> during split generation. I'm about to try this out in the first version of 
> the pull request.
> In the tez app logs after the patch:
> {code}
> 2020-06-24 13:13:21,331 [WARN] [ORC_GET_SPLITS #2] |orc.OrcInputFormat|: 
> possible schema mismatch, asked for column with index:6. column but there is 
> only 6 types defined (isOriginal: false, originalColumnNames.length: 1), 
> cannot get sarg col names...
> 2020-06-24 13:13:21,331 [WARN] [ORC_GET_SPLITS #2] |orc.OrcInputFormat|: 
> Skipping split elimination for 
> hdfs://ns1/warehouse/tablespace/managed/hive/bdaa28846/cda_date=20200601/cda_job_name=core_base/base_0000001/bucket_00001
>  as column names is null
> {code}
> where bucket_00001 was the second, problematic file, so the patch helped 
> split generation recover from this strange state...
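
The length sanity check described above could look roughly like the following sketch. It is not the code from the pull request: the class name, the `ids` helper and the simplified `List<List<Integer>>` stand-in for the ORC footer type list are all hypothetical, the `includedColumns` handling is left out, and the root column index of 6 for acid files is an assumption taken from the "index:6" in the warning above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of a guarded SARG column-name lookup.
final class SargColumnNamesSketch {

    // Assumption: in an acid file the user row struct is type id 6
    // (type 0 is the outer struct, types 1-5 are the acid metadata fields).
    static final int ACID_ROOT_COLUMN = 6;

    // Small helper to build a list of child type ids.
    static List<Integer> ids(Integer... values) {
        return Arrays.asList(values);
    }

    /**
     * subtypesByTypeId.get(t) holds the child type ids of type t.
     * Returns null instead of throwing when the file's type list does not
     * fit the expected layout, so the caller can skip split elimination.
     */
    static String[] sargColumnNames(String[] neededColumnNames,
                                    List<List<Integer>> subtypesByTypeId,
                                    boolean isOriginal) {
        int rootColumn = isOriginal ? 0 : ACID_ROOT_COLUMN;

        // The sanity check: the root column must exist in the type list.
        if (rootColumn >= subtypesByTypeId.size()) {
            return null;
        }

        String[] columnNames = new String[subtypesByTypeId.size() - rootColumn];
        int next = 0;
        for (int columnId : subtypesByTypeId.get(rootColumn)) {
            int slot = columnId - rootColumn;
            // Extra defensive checks, also an assumption of this sketch.
            if (slot < 0 || slot >= columnNames.length
                    || next >= neededColumnNames.length) {
                return null;
            }
            columnNames[slot] = neededColumnNames[next++];
        }
        return columnNames;
    }
}
```

With only 6 types defined and a non-original (acid) read, the sketch returns null rather than raising IndexOutOfBoundsException. As the pull request comment points out, a check like this only catches files whose type list is too short; a misplaced file with enough types would still be read.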



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
