Kadir Ozdemir created PHOENIX-7473:
--------------------------------------
Summary: Eliminating index maintenance for CDC index
Key: PHOENIX-7473
URL: https://issues.apache.org/jira/browse/PHOENIX-7473
Project: Phoenix
Issue Type: Improvement
Reporter: Kadir Ozdemir
Assignee: Kadir Ozdemir
A CDC index is a log of row keys, one for each data table row mutation. These index
rows are ordered by mutation timestamp for a given row. This index is used for
capturing recent changes to a data table. It only stores the changes for the
max lookback period. It is only used for CDC queries.
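
The behavior described above can be illustrated with a small in-memory model.
This is only a sketch and not Phoenix code: the class CdcLogSketch, its methods,
and the use of a sorted map are assumptions made for illustration. It records
one entry per mutation, keeps entries in timestamp order, and drops entries
older than a max lookback window.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory model of a CDC-style change log; not Phoenix code.
public class CdcLogSketch {
    // Mutation timestamp -> row keys mutated at that timestamp, in timestamp order.
    private final TreeMap<Long, List<String>> log = new TreeMap<>();
    private final long maxLookbackMs;

    public CdcLogSketch(long maxLookbackMs) {
        this.maxLookbackMs = maxLookbackMs;
    }

    // Record one log entry for a mutation of rowKey at timestamp ts.
    public void record(String rowKey, long ts) {
        log.computeIfAbsent(ts, k -> new ArrayList<>()).add(rowKey);
    }

    // Drop entries whose timestamp is older than (now - maxLookbackMs).
    public void expire(long now) {
        log.headMap(now - maxLookbackMs, false).clear();
    }

    // Return the row keys changed in [startTs, endTs], in timestamp order.
    public List<String> changedRowKeys(long startTs, long endTs) {
        List<String> result = new ArrayList<>();
        for (List<String> keys : log.subMap(startTs, true, endTs, true).values()) {
            result.addAll(keys);
        }
        return result;
    }
}
```

In this model a row key appears once per mutation, so a row mutated twice is
listed twice, and entries simply age out rather than being deleted one by one.
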
For regular indexes, we do index maintenance so that if a data table row is
deleted, we also delete the corresponding index row. This is especially needed
for correctness with covered indexes, as we use the index alone to serve
queries.
For uncovered indexes, this delete is not necessary for correctness but is
needed for performance: it avoids scanning deleted index rows again and again,
and attempting to scan the corresponding deleted data table rows. However, none
of these reasons really applies to CDC indexes. Since CDC index table rows
expire quickly, we do not really need to delete them. It is also expected that
an index row is scanned only once.
In fact, for CDC indexes we add an extra delete marker for each deleted row, so
that there are two delete markers: one with the embedded row timestamp value
equal to the delete operation timestamp, and the other with the embedded row
timestamp value equal to the latest put operation timestamp of this
row. However, we need only the former for reporting the delete operation in
the correct order.
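
The ordering argument above can be sketched as follows. This is a simplified
model and not Phoenix code: the class DeleteReportSketch, the Entry type, and
the Kind enum are hypothetical names. It assumes each log entry carries the
timestamp embedded in its key and that changes are reported by sorting on that
timestamp, so a delete entry embedding the delete operation timestamp is enough
to place the delete after the puts that preceded it.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of change reporting; not Phoenix code.
public class DeleteReportSketch {
    public enum Kind { PUT, DELETE }

    // One change-log entry; embeddedTs stands for the timestamp embedded in its key.
    public static final class Entry {
        public final String rowKey;
        public final long embeddedTs;
        public final Kind kind;

        public Entry(String rowKey, long embeddedTs, Kind kind) {
            this.rowKey = rowKey;
            this.embeddedTs = embeddedTs;
            this.kind = kind;
        }
    }

    // Return a copy of the entries sorted by embedded timestamp.
    public static List<Entry> report(List<Entry> entries) {
        List<Entry> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparingLong((Entry e) -> e.embeddedTs));
        return sorted;
    }
}
```

With puts at timestamps 10 and 20 and a delete at 30, the single delete entry
with embedded timestamp 30 is reported last. A second marker embedding
timestamp 20 would not change the reported order in this model.
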
For the same reasons, we do not need to read repair index rows either. Read
repair is currently done to delete orphan index rows. These rows occur if
an index put succeeds but the corresponding data put does not.
Eliminating index maintenance will improve index performance and simplify its
code.