[ https://issues.apache.org/jira/browse/IMPALA-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zoltán Borók-Nagy updated IMPALA-9512:
--------------------------------------
Description:
A minor compaction can merge several delta directories into a single delta
directory. The current directory filtering algorithm needs to be modified to
handle minor-compacted directories and to prefer them over the plain delta
directories they replace.
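A minimal sketch of such a filtering step, written in Python for brevity and not
taken from the Impala code base. It assumes directory names of the form
delta_<minWriteId>_<maxWriteId>[_<statementId>], and it treats any delta whose
write id range lies inside a wider delta's range as superseded by a
minor-compacted directory. The names DeltaDir, parse_delta and filter_deltas
are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed naming scheme: delta_<minWriteId>_<maxWriteId>[_<statementId>]
DELTA_RE = re.compile(r"^delta_(\d+)_(\d+)(?:_(\d+))?$")


class DeltaDir(NamedTuple):
    name: str
    min_write_id: int
    max_write_id: int


def parse_delta(name: str) -> Optional[DeltaDir]:
    """Return the parsed delta directory, or None if the name does not match."""
    m = DELTA_RE.match(name)
    if m is None:
        return None
    return DeltaDir(name, int(m.group(1)), int(m.group(2)))


def filter_deltas(names: List[str]) -> List[str]:
    """Keep only delta dirs that are not covered by a wider delta dir.

    Names that are not delta directories are ignored in this sketch.
    """
    deltas = [d for d in map(parse_delta, names) if d is not None]
    kept = []
    for d in deltas:
        covered = any(
            o.min_write_id <= d.min_write_id
            and d.max_write_id <= o.max_write_id
            and (o.min_write_id, o.max_write_id)
            != (d.min_write_id, d.max_write_id)
            for o in deltas
        )
        if not covered:
            kept.append(d.name)
    return kept
```

With the directories delta_0000001_0000001_0000, delta_0000002_0000002_0000 and
delta_0000001_0000010_0000, only the last one is kept, because its range 1..10
covers the other two.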
In addition, in minor-compacted directories we need to filter out the rows we
are not allowed to see. E.g. we can have the following delta directory:
{noformat}
full_acid/delta_0000001_0000010_0000/0000 # minWriteId: 1
                                          # maxWriteId: 10
{noformat}
So this delta dir contains rows with write ids between 1 and 10. But the
validWriteIdList might only allow us to see write ids less than 5. Therefore we
need to check the ACID write id column (named originalTransaction) of each row
to decide whether the row is valid.
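The per-row check could look roughly like the Python sketch below. It is an
illustration only: ValidWriteIds is a simplified, hypothetical stand-in for the
validWriteIdList (a high watermark plus a set of write ids that are at or below
it but still not visible), and each row is modeled as an
(originalTransaction, payload) tuple.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, Iterator, Tuple

Row = Tuple[int, str]  # (originalTransaction, payload); payload is a placeholder


@dataclass(frozen=True)
class ValidWriteIds:
    """Simplified, hypothetical model of a valid write id list."""
    high_watermark: int
    # Write ids <= high_watermark that must still be hidden (assumption).
    invalid: FrozenSet[int] = field(default_factory=frozenset)

    def is_valid(self, write_id: int) -> bool:
        return write_id <= self.high_watermark and write_id not in self.invalid


def visible_rows(rows: Iterable[Row], valid: ValidWriteIds) -> Iterator[Row]:
    """Yield only the rows whose write id is visible to the reader."""
    for row in rows:
        if valid.is_valid(row[0]):
            yield row
```

For the example above, "write ids less than 5" corresponds to
ValidWriteIds(high_watermark=4), so rows written with ids 5 to 10 are dropped.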
There are several ways to optimize this. E.g. based on the min/max write ids of
the delta directory and the validWriteIdList, we can decide whether we need to
validate the rows at all. Or, when we reach the high watermark (the max valid
write id), we can stop the scanner, since rows are ordered by record ID.
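Both optimizations can be sketched as follows. This is a hedged Python
illustration, not the actual scanner: the validWriteIdList is reduced to a high
watermark and a set of invisible write ids, the early stop assumes the rows
arrive sorted by write id in ascending order, and the names Plan,
plan_for_delta and scan_until_watermark are hypothetical.

```python
from enum import Enum
from typing import FrozenSet, Iterable, Iterator, Tuple

Row = Tuple[int, str]  # (originalTransaction, payload); payload is a placeholder


class Plan(Enum):
    SKIP_DIR = "skip_dir"            # no row in the directory can be visible
    READ_ALL = "read_all"            # every row is visible, no per-row check
    VALIDATE_ROWS = "validate_rows"  # mixed, check each row


def plan_for_delta(min_id: int, max_id: int, hwm: int,
                   invalid: FrozenSet[int]) -> Plan:
    """Decide from the directory's write id range whether rows need checking."""
    if min_id > hwm:
        return Plan.SKIP_DIR
    if max_id <= hwm and not any(min_id <= w <= max_id for w in invalid):
        return Plan.READ_ALL
    return Plan.VALIDATE_ROWS


def scan_until_watermark(rows: Iterable[Row], hwm: int,
                         invalid: FrozenSet[int]) -> Iterator[Row]:
    """Validate rows and stop at the first write id above the high watermark.

    Assumes rows are sorted by write id in ascending order.
    """
    for write_id, payload in rows:
        if write_id > hwm:
            break  # all later rows have even larger write ids
        if write_id not in invalid:
            yield (write_id, payload)
```

For delta_0000001_0000010_0000 and a high watermark of 4, plan_for_delta
returns Plan.VALIDATE_ROWS, and the scan can stop at the first row whose write
id is 5 or greater.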
> Milestone 2: Validate each row against the valid write id list
> --------------------------------------------------------------
>
> Key: IMPALA-9512
> URL: https://issues.apache.org/jira/browse/IMPALA-9512
> Project: IMPALA
> Issue Type: Sub-task
> Reporter: Zoltán Borók-Nagy
> Priority: Major
> Labels: impala-acid