[ https://issues.apache.org/jira/browse/HIVE-26147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alessandro Solimando updated HIVE-26147:
----------------------------------------
    Description: 
When the _hive.acid.key.index_ metadata is missing from an acid ORC file, 
_OrcRawRecordMerger_ throws the following exception:

{noformat}
Caused by: java.lang.NullPointerException
        at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.discoverKeyBounds(OrcRawRecordMerger.java:795) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:1053) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:2096) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1991) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:769) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:335) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:560) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:529) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        at org.apache.hadoop.hive.ql.Driver.getFetchingTableResults(Driver.java:719) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:671) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        at org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:233) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:489) ~[hive-service-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        ... 24 more
{noformat}

For the NPE to occur, the ORC file must have more than one stripe, and the 
read must either start beyond the first stripe or stop before the last one. 
These are the two cases in which the code dereferences the key index:

{code:java}
    if (firstStripe != 0) {
      minKey = keyIndex[firstStripe - 1];
    }
    if (!isTail) {
      maxKey = keyIndex[firstStripe + stripeCount - 1];
    }
{code}
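
To make the two conditions concrete, here is a minimal, self-contained sketch that mirrors the branches quoted above. It is not Hive code: the class _KeyBoundsTrigger_ and the helper _dereferencesKeyIndex_ are hypothetical names introduced for illustration, and only _firstStripe_ and _isTail_ come from the snippet.

```java
/** Illustrative only: mirrors the two branches quoted above; not the Hive implementation. */
final class KeyBoundsTrigger {

  /**
   * Returns true when the quoted code would read keyIndex,
   * which is when a null keyIndex would cause the NPE.
   */
  static boolean dereferencesKeyIndex(int firstStripe, boolean isTail) {
    return firstStripe != 0 || !isTail;
  }

  public static void main(String[] args) {
    // Read starts at stripe 0 and reaches the tail: keyIndex is never touched.
    System.out.println(dereferencesKeyIndex(0, true));   // false
    // Read starts at stripe 0 but stops before the tail: maxKey lookup reads keyIndex.
    System.out.println(dereferencesKeyIndex(0, false));  // true
    // Read starts beyond the first stripe: minKey lookup reads keyIndex.
    System.out.println(dereferencesKeyIndex(1, true));   // true
  }
}
```

Under the assumption that a read over a single-stripe file always starts at stripe 0 and reaches the tail, neither branch runs, which would be consistent with single-stripe files never failing.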

However, when the issue was originally detected, the NPE was triggered even by 
a simple "select *" over a table whose ORC files lacked the 
_hive.acid.key.index_ metadata, although it never failed for ORC files with a 
single stripe. The file was generated after a major compaction of acid and 
non-acid data.

If "select *" does not trigger the NPE, either filter on the values of the row 
returned by "select * from $table limit 1", or try different values until the 
read lands in the situation described above, using a filter like this:

{code:sql}
select * from $table where c = $value
{code}

_OrcRawRecordMerger_ should simply leave the min and max keys as "null" when 
the _hive.acid.key.index_ metadata is missing.
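
One possible shape for that behaviour is sketched below. This is a simplified stand-in and not necessarily how the actual fix is implemented: the class _NullSafeKeyBounds_ and the method _discoverBounds_ are hypothetical, and plain _Long_ values are assumed in place of Hive's record keys so that the example is self-contained.

```java
import java.util.Arrays;

/** Illustrative only: a simplified, null-safe version of the bounds lookup. */
final class NullSafeKeyBounds {

  /**
   * Returns {minKey, maxKey}. An entry is null when that bound is unknown,
   * which includes the case where the key index itself is missing.
   */
  static Long[] discoverBounds(Long[] keyIndex, int firstStripe,
                               int stripeCount, boolean isTail) {
    Long minKey = null;
    Long maxKey = null;
    // Only consult the index when it exists; otherwise both bounds stay null.
    if (keyIndex != null) {
      if (firstStripe != 0) {
        minKey = keyIndex[firstStripe - 1];
      }
      if (!isTail) {
        maxKey = keyIndex[firstStripe + stripeCount - 1];
      }
    }
    return new Long[] {minKey, maxKey};
  }

  public static void main(String[] args) {
    Long[] index = {10L, 20L, 30L};
    // Index present, read covers only the middle stripe of three.
    System.out.println(Arrays.toString(discoverBounds(index, 1, 1, false))); // [10, 20]
    // Index missing: no exception, both bounds are left as null.
    System.out.println(Arrays.toString(discoverBounds(null, 1, 1, false)));  // [null, null]
  }
}
```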

  was:
When _hive.acid.key.index_ is missing for an acid ORC file _OrcRawRecordMerger_ 
throws as follows:

{noformat}
Caused by: java.lang.NullPointerException
        at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.discoverKeyBounds(OrcRawRecordMerger.java:795) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:1053) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:2096) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1991) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:769) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:335) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:560) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:529) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        at org.apache.hadoop.hive.ql.Driver.getFetchingTableResults(Driver.java:719) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:671) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        at org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:233) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:489) ~[hive-service-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
        ... 24 more
{noformat}

For this situation to happen, the ORC file must have more than one stripe, and 
the offset of the element to seek should locate it beyond the first stripe 
but before the tail one, as the code clearly suggests:

{code:java}
    if (firstStripe != 0) {
      minKey = keyIndex[firstStripe - 1];
    }
    if (!isTail) {
      maxKey = keyIndex[firstStripe + stripeCount - 1];
    }
{code}

However, in the context of the detection of the original issue, the NPE was 
triggered even by a simple "select *" over a table with ORC files missing the 
_hive.acid.key.index_ metadata information, but it was never failing for ORC 
files with a single stripe. The file was generated after a major compaction of 
acid and non-acid data.

In order to force an offset located in a stripe in the middle, one can use the 
following query, knowing in what stripe a particular value exists:

{code:sql}
select * from $table where c = $value
{code}

_OrcRawRecordMerger_ should simply leave as "null" the min and max keys when 
the _hive.acid.key.index_ metadata is missing.


> OrcRawRecordMerger throws NPE when hive.acid.key.index is missing for an acid 
> file
> ----------------------------------------------------------------------------------
>
>                 Key: HIVE-26147
>                 URL: https://issues.apache.org/jira/browse/HIVE-26147
>             Project: Hive
>          Issue Type: Bug
>          Components: ORC, Transactions
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
