Viraj Jasani created PHOENIX-7878:
-------------------------------------

             Summary: CDC perf improvement - skip redundant cell versions on data table scans
                 Key: PHOENIX-7878
                 URL: https://issues.apache.org/jira/browse/PHOENIX-7878
             Project: Phoenix
          Issue Type: Improvement
            Reporter: Viraj Jasani


When a CDC query runs with pre, post, and/or change scopes, it scans the data 
table to reconstruct each change event (the change image plus the pre-image 
and, on the consumer path, the full data-row state). Today that data table scan 
is a raw, all-versions scan, so for every data row we read back _every_ version 
of _every_ column - even though, for a given batch of changes, we only need two 
cells per column per change: the cell at the change timestamp, and the most 
recent cell just below it (the pre-image). On rows that are updated frequently 
this means we read, transfer, and process far more cells than the event 
reconstruction actually uses, which adds CPU, memory, and network overhead to 
CDC reads.

The purpose of this Jira is to add a new CDCVersionFilter, in addition to 
SkipScanFilter, on the data table scans. For each row, the filter is given the 
set of change timestamps from the current batch and keeps only the cells that 
matter: the cell at each change timestamp, the first cell below each change 
timestamp (the pre-image), and all DeleteFamily markers (needed for deletion 
tracking). All other cells are skipped to avoid redundant data transfer.
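
The selection rule described above can be illustrated with a small standalone 
sketch. This is not the proposed CDCVersionFilter and does not use the HBase 
Filter API; the class CdcVersionFilterSketch, its simplified Cell type, and the 
keepRelevant method are hypothetical names introduced only for illustration. 
It assumes the versions of one column arrive newest first, as in a raw scan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

class CdcVersionFilterSketch {

    /** Hypothetical stand-in for an HBase cell: a column version or a DeleteFamily marker. */
    static final class Cell {
        final long timestamp;
        final boolean deleteFamily;

        Cell(long timestamp, boolean deleteFamily) {
            this.timestamp = timestamp;
            this.deleteFamily = deleteFamily;
        }
    }

    /**
     * Keeps, for one column of one row, the cell at each change timestamp, the
     * first cell below each change timestamp (the pre-image), and every
     * DeleteFamily marker. Input must be ordered newest first.
     */
    static List<Cell> keepRelevant(List<Cell> cellsNewestFirst,
                                   NavigableSet<Long> changeTimestamps) {
        List<Cell> kept = new ArrayList<>();
        Long previousTs = null; // timestamp of the previous (newer) non-marker cell
        for (Cell cell : cellsNewestFirst) {
            if (cell.deleteFamily) {
                kept.add(cell); // always kept for deletion tracking
                continue;
            }
            long ts = cell.timestamp;
            if (changeTimestamps.contains(ts)) {
                kept.add(cell); // cell written at a change timestamp
            } else {
                // Smallest change timestamp strictly above this cell, if any.
                Long nextChange = changeTimestamps.higher(ts);
                // Pre-image only if no newer cell sits between this cell and that change.
                boolean firstBelow = nextChange != null
                        && (previousTs == null || previousTs >= nextChange);
                if (firstBelow) {
                    kept.add(cell);
                }
            }
            previousTs = ts;
        }
        return kept;
    }

    /** Extracts the timestamps of the given cells, preserving order. */
    static List<Long> timestamps(List<Cell> cells) {
        List<Long> result = new ArrayList<>();
        for (Cell cell : cells) {
            result.add(cell.timestamp);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Cell> cells = new ArrayList<>();
        for (long ts : new long[] {100, 90, 80, 70, 60, 50}) {
            cells.add(new Cell(ts, false));
        }
        NavigableSet<Long> changes = new TreeSet<>(Arrays.asList(90L, 60L));
        // Versions 100 and 70 are skipped; prints [90, 80, 60, 50]
        System.out.println(timestamps(keepRelevant(cells, changes)));
    }
}
```

In this example, six versions of a column shrink to four for a batch with two 
change timestamps. A real filter would make the same decision cell by cell on 
the region server, so the skipped versions are never sent to the client.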



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
