[jira] [Updated] (HIVE-16812) VectorizedOrcAcidRowBatchReader doesn't filter delete events

Eugene Koifman (JIRA) Mon, 17 Sep 2018 11:54:06 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-16812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Eugene Koifman updated HIVE-16812:
----------------------------------
    Description: 
The constructor of VectorizedOrcAcidRowBatchReader has
{noformat}
    // Clone readerOptions for deleteEvents.
    Reader.Options deleteEventReaderOptions = readerOptions.clone();
    // Set the range on the deleteEventReaderOptions to 0 to INTEGER_MAX because
    // we always want to read all the delete delta files.
    deleteEventReaderOptions.range(0, Long.MAX_VALUE);
{noformat}

This is suboptimal since base and deltas are sorted by ROW__ID.  So for each 
split of base we can find the min/max ROW__ID and only load events from the 
delete deltas that are in the [min,max] range.  This will reduce the number of 
delete events we load in memory (to no more than there are rows in the split).
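
A minimal sketch of the range filter described above. It is illustrative only and not Hive code: {{RowKey}} and {{DeleteEventFilter}} are hypothetical names, and the key is assumed to order by transaction id, then bucket, then row id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a ROW__ID (assumption: ordered by txnId, bucket, rowId).
final class RowKey implements Comparable<RowKey> {
    final long txnId;
    final int bucket;
    final long rowId;

    RowKey(long txnId, int bucket, long rowId) {
        this.txnId = txnId;
        this.bucket = bucket;
        this.rowId = rowId;
    }

    @Override
    public int compareTo(RowKey other) {
        int c = Long.compare(txnId, other.txnId);
        if (c != 0) {
            return c;
        }
        c = Integer.compare(bucket, other.bucket);
        if (c != 0) {
            return c;
        }
        return Long.compare(rowId, other.rowId);
    }
}

// Hypothetical helper: keeps only the delete events relevant to one split.
final class DeleteEventFilter {
    // Returns the delete events whose key lies in the inclusive range [min, max].
    // Assumes deleteEvents is sorted ascending, so the scan can stop at the
    // first key greater than max.
    static List<RowKey> inRange(List<RowKey> deleteEvents, RowKey min, RowKey max) {
        List<RowKey> kept = new ArrayList<>();
        for (RowKey key : deleteEvents) {
            if (key.compareTo(max) > 0) {
                break;
            }
            if (key.compareTo(min) >= 0) {
                kept.add(key);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<RowKey> events = Arrays.asList(
            new RowKey(1, 0, 0), new RowKey(1, 0, 5),
            new RowKey(1, 0, 9), new RowKey(2, 0, 1));
        List<RowKey> kept = inRange(events, new RowKey(1, 0, 3), new RowKey(1, 0, 9));
        System.out.println("delete events kept for this split: " + kept.size());
    }
}
```

Because the kept list can never hold more keys than fall inside the split's [min,max] range, memory use is bounded by the split rather than by the total size of the delete deltas.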

When we support sorting on PK, the same should apply, but we'd need to make 
sure to store PKs in the ORC index.

See {{OrcRawRecordMerger.discoverKeyBounds()}}

{{hive.acid.key.index}} in the ORC footer has an index of ROW__IDs, so we 
should know min/max easily for any file written by {{OrcRecordUpdater}}.
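
A rough sketch of how key bounds could be derived from such an index. It assumes (not confirmed in this issue) that the index is serialized as semicolon-separated {{txnId,bucket,rowId}} triples, one per stripe, each holding the last key of that stripe. {{KeyIndexSketch}} and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the serialized format below is an assumption.
final class KeyIndexSketch {
    // Parses "txn,bucket,row;txn,bucket,row;" into one long[3] per stripe.
    static List<long[]> parse(String index) {
        List<long[]> lastKeys = new ArrayList<>();
        for (String entry : index.split(";")) {
            if (entry.isEmpty()) {
                continue;
            }
            String[] parts = entry.split(",");
            if (parts.length != 3) {
                throw new IllegalArgumentException("bad index entry: " + entry);
            }
            lastKeys.add(new long[] {
                Long.parseLong(parts[0]),
                Long.parseLong(parts[1]),
                Long.parseLong(parts[2])
            });
        }
        return lastKeys;
    }

    // Inclusive upper bound for a split ending at stripe lastStripe:
    // the last key of that stripe.
    static long[] maxKey(List<long[]> lastKeys, int lastStripe) {
        return lastKeys.get(lastStripe);
    }

    // Exclusive lower bound for a split starting at stripe firstStripe:
    // the last key of the preceding stripe, or null if there is none.
    static long[] exclusiveMinKey(List<long[]> lastKeys, int firstStripe) {
        return firstStripe == 0 ? null : lastKeys.get(firstStripe - 1);
    }

    public static void main(String[] args) {
        List<long[]> keys = parse("1,0,99;1,0,199;2,0,49;");
        long[] max = maxKey(keys, 1);
        System.out.println("max key of stripe 1: " + max[0] + "," + max[1] + "," + max[2]);
    }
}
```

With bounds of this kind, no rows of the base file need to be read to find min/max; only the footer metadata is consulted.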

  was:
The constructor of VectorizedOrcAcidRowBatchReader has
{noformat}
    // Clone readerOptions for deleteEvents.
    Reader.Options deleteEventReaderOptions = readerOptions.clone();
    // Set the range on the deleteEventReaderOptions to 0 to INTEGER_MAX because
    // we always want to read all the delete delta files.
    deleteEventReaderOptions.range(0, Long.MAX_VALUE);
{noformat}

This is suboptimal since base and deltas are sorted by ROW__ID.  So for each 
split of base we can find the min/max ROW__ID and only load events from the 
delete deltas that are in the [min,max] range.  This will reduce the number of 
delete events we load in memory (to no more than there are rows in the split).

When we support sorting on PK, the same should apply, but we'd need to make 
sure to store PKs in the ORC index.

See OrcRawRecordMerger.discoverKeyBounds()


> VectorizedOrcAcidRowBatchReader doesn't filter delete events
> ------------------------------------------------------------
>
>                 Key: HIVE-16812
>                 URL: https://issues.apache.org/jira/browse/HIVE-16812
>             Project: Hive
>          Issue Type: Improvement
>          Components: Transactions
>    Affects Versions: 2.3.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>            Priority: Critical



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
