[jira] [Updated] (PHOENIX-7473) Eliminating index maintenance for CDC index

Kadir Ozdemir (Jira) Mon, 25 Nov 2024 22:53:07 -0800


     [ 
https://issues.apache.org/jira/browse/PHOENIX-7473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Kadir Ozdemir updated PHOENIX-7473:
-----------------------------------
    Description: 
A CDC index is a log of row keys, with one index row for each data table row 
mutation. These index rows are ordered by mutation timestamp for a given row. 
This index is used for capturing recent changes to a data table. It stores 
changes only for the max lookback period and is used only for CDC queries.
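
As a rough illustration of the paragraph above, here is a toy in-memory model 
of such a log. It is not Phoenix code: the class name, the method names, the 
MAX_LOOKBACK_MS constant, and the explicit expire() call are all assumptions 
made for this sketch, and it orders entries globally by timestamp for 
simplicity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model of a CDC index as a timestamp-ordered log of data row keys.
// All names here are hypothetical and chosen only for illustration.
final class CdcIndexSketch {
    // Stand-in for the max lookback period (assumed value).
    static final long MAX_LOOKBACK_MS = 1000L;

    // Mutation timestamp -> data row keys mutated at that timestamp.
    private final TreeMap<Long, List<String>> log = new TreeMap<>();

    // Append one log entry for a data table row mutation.
    void recordMutation(long timestamp, String dataRowKey) {
        log.computeIfAbsent(timestamp, t -> new ArrayList<>()).add(dataRowKey);
    }

    // Drop every entry older than the lookback window ending at 'now'.
    void expire(long now) {
        log.headMap(now - MAX_LOOKBACK_MS, false).clear();
    }

    // Row keys mutated in [from, to), in timestamp order.
    List<String> changedRowKeys(long from, long to) {
        List<String> result = new ArrayList<>();
        for (List<String> keys : log.subMap(from, true, to, false).values()) {
            result.addAll(keys);
        }
        return result;
    }
}
```

In this model a CDC query is just a range read over the log, and old entries 
disappear through expiry rather than through explicit deletes.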

For regular indexes we do index maintenance such that if a data table row is 
deleted, we also delete the corresponding index row. This is especially needed 
for covered indexes for correctness as we use the index alone to serve the 
queries.

For uncovered indexes, this delete is not necessary for correctness but is 
needed for performance: it avoids scanning deleted rows again and again and 
attempting to scan the corresponding deleted data table rows. However, none of 
these reasons really applies to CDC indexes. Since CDC index table rows expire 
quickly, we do not really need to delete them. It is also expected that a CDC 
index row is scanned only once.

For CDC indexes, we add an extra delete marker for each deleted row, so that 
the row has two delete markers: one whose embedded row timestamp equals the 
delete operation timestamp, and another whose embedded row timestamp equals 
the timestamp of the latest put operation on this row. However, we need only 
the former to include the delete operation in the correct order.
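
The difference between the two behaviours can be sketched as follows. This is 
a simplified model, not the Phoenix implementation: the index row key layout 
(zero-padded timestamp, then the data row key) and all class and method names 
are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the delete markers written to a CDC index for one deleted row.
final class CdcDeleteMarkerSketch {
    // Hypothetical index row key: embedded timestamp followed by data row key.
    static String indexRowKey(long embeddedTimestamp, String dataRowKey) {
        return String.format("%013d|%s", embeddedTimestamp, dataRowKey);
    }

    // Behaviour described in the text: two delete markers per deleted row.
    static List<String> markersWithMaintenance(String dataRowKey,
                                               long latestPutTs,
                                               long deleteTs) {
        List<String> markers = new ArrayList<>();
        // Places the delete operation at its own position in the log.
        markers.add(indexRowKey(deleteTs, dataRowKey));
        // Embeds the latest put timestamp; the text says this one is not needed.
        markers.add(indexRowKey(latestPutTs, dataRowKey));
        return markers;
    }

    // Proposed behaviour: keep only the marker at the delete timestamp.
    static List<String> markersWithoutMaintenance(String dataRowKey,
                                                  long deleteTs) {
        List<String> markers = new ArrayList<>();
        markers.add(indexRowKey(deleteTs, dataRowKey));
        return markers;
    }
}
```

In this model the marker that fixes the position of the delete in the log is 
the same in both variants, so dropping the second marker does not change the 
order in which changes are seen.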

For the same reasons, we do not need to read repair index rows for CDC indexes 
either. Read repair is currently done to delete orphan index rows. These rows 
occur if an index put succeeds but the corresponding data put does not.
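
A minimal sketch of what read repair does with orphan index rows is shown 
below. It assumes, for illustration only, that a scan checks each index row 
against the data table and skips rows with no matching data row. The class, 
the method, and the map-based tables are hypothetical and do not reflect 
Phoenix's actual code.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of scanning an index with and without read repair.
final class ReadRepairSketch {
    // 'index' maps index row key -> data row key; 'dataTable' holds live data
    // row keys. Returns the data row keys that the scan can serve.
    static List<String> scan(Map<String, String> index,
                             Set<String> dataTable,
                             boolean readRepair) {
        List<String> served = new ArrayList<>();
        Iterator<Map.Entry<String, String>> it = index.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, String> entry = it.next();
            boolean orphan = !dataTable.contains(entry.getValue());
            if (orphan) {
                if (readRepair) {
                    it.remove(); // read repair: delete the orphan index row
                }
                continue;        // an orphan is never served either way
            }
            served.add(entry.getValue());
        }
        return served;
    }
}
```

Under these assumptions both modes return the same rows. Read repair only 
saves later scans from meeting the orphan again, which matters little for 
rows that expire quickly and are expected to be scanned once.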

Eliminating index maintenance will improve index performance and simplify its 
code. 

  was:
A CDC index is a log of row keys, with one index row for each data table row 
mutation. These index rows are ordered by mutation timestamp for a given row. 
This index is used for capturing recent changes to a data table. It stores 
changes only for the max lookback period and is used only for CDC queries.

For regular indexes we do index maintenance such that if a data table row is 
deleted, we also delete the corresponding index row. This is especially needed 
for covered indexes for correctness as we use the index alone to serve the 
queries.

For uncovered indexes, this delete is not necessary for correctness but is 
needed for performance: it avoids scanning deleted rows again and again and 
attempting to scan the corresponding deleted data table rows. However, none of 
these reasons really applies to CDC indexes. Since CDC index table rows expire 
quickly, we do not really delete them. It is also expected that an index row 
is scanned once.

In fact, for CDC indexes we add an extra delete marker for each deleted row, 
so that the row has two delete markers: one whose embedded row timestamp 
equals the delete operation timestamp, and another whose embedded row 
timestamp equals the timestamp of the latest put operation on this row. 
However, we need only the former for reporting the delete operation in the 
correct order.

For the same reasons, we do not need to read repair index rows either. Read 
repair is currently done to delete orphan index rows. These rows occur if an 
index put succeeds but the corresponding data put does not.

Eliminating index maintenance will improve index performance and simplify its 
code. 


> Eliminating index maintenance for CDC index
> -------------------------------------------
>
>                 Key: PHOENIX-7473
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-7473
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Kadir Ozdemir
>            Assignee: Kadir Ozdemir
>            Priority: Major
>
> A CDC index is a log of row keys, with one index row for each data table row 
> mutation. These index rows are ordered by mutation timestamp for a given 
> row. This index is used for capturing recent changes to a data table. It 
> stores changes only for the max lookback period and is used only for CDC 
> queries.
>
> For regular indexes we do index maintenance such that if a data table row is 
> deleted, we also delete the corresponding index row. This is especially 
> needed for covered indexes for correctness as we use the index alone to serve 
> the queries.
>
> For uncovered indexes, this delete is not necessary for correctness but is 
> needed for performance: it avoids scanning deleted rows again and again and 
> attempting to scan the corresponding deleted data table rows. However, none 
> of these reasons really applies to CDC indexes. Since CDC index table rows 
> expire quickly, we do not really need to delete them. It is also expected 
> that a CDC index row is scanned only once.
>
> For CDC indexes, we add an extra delete marker for each deleted row, so that 
> the row has two delete markers: one whose embedded row timestamp equals the 
> delete operation timestamp, and another whose embedded row timestamp equals 
> the timestamp of the latest put operation on this row. However, we need only 
> the former to include the delete operation in the correct order.
>
> For the same reasons, we do not need to read repair index rows for CDC 
> indexes either. Read repair is currently done to delete orphan index rows. 
> These rows occur if an index put succeeds but the corresponding data put 
> does not.
>
> Eliminating index maintenance will improve index performance and simplify its 
> code. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
