[ https://issues.apache.org/jira/browse/PHOENIX-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Viraj Jasani reassigned PHOENIX-7878:
-------------------------------------
Assignee: Viraj Jasani
> CDC perf improvement - skip redundant cell versions on data table scans
> -----------------------------------------------------------------------
>
> Key: PHOENIX-7878
> URL: https://issues.apache.org/jira/browse/PHOENIX-7878
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Major
>
> When a CDC query runs with pre, post, and/or change scopes, it scans the data
> table to reconstruct each change event (the change image plus the pre-image,
> and for the consumer path the full data-row state). Today that data table
> scan is a raw, all-versions scan, so for every data row we read back _every_
> version of _every_ column - even though, for a given batch of changes, we
> only need two cells per column per change: the cell at the change timestamp,
> and the most recent cell just below it (the pre-image). On rows that are
> updated frequently this means we read, transfer, and process far more cells
> than the event reconstruction actually uses, which adds CPU, memory, and
> network overhead to CDC reads.
> The purpose of this Jira is to add a new CDCVersionFilter, in addition to
> the SkipScanFilter, on the data table scans. For each row, the filter is
> given the set of change timestamps from the current batch and keeps only the
> cells that matter: the cell at each change timestamp, the first cell below
> each change timestamp (the pre-image), and all DeleteFamily markers (needed
> for deletion tracking). All other cells are skipped to avoid redundant data
> transfer.
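
The selection rule described above can be illustrated with a small,
HBase-free sketch. It is not the actual CDCVersionFilter: the class, the
Version type, and the method name are hypothetical, and it assumes the
versions of a single column arrive newest first, as they would on a raw scan.
DeleteFamily markers are modeled as a simple flag and always kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical model of the version-selection rule; names are illustrative only.
final class CdcVersionSelection {

    // Simplified stand-in for a cell: a timestamp plus a DeleteFamily flag.
    static final class Version {
        final long timestamp;
        final boolean deleteFamily;

        Version(long timestamp, boolean deleteFamily) {
            this.timestamp = timestamp;
            this.deleteFamily = deleteFamily;
        }
    }

    // Returns the timestamps of the cells to keep for one column.
    // Assumes newestFirst is ordered by descending timestamp.
    static List<Long> keptTimestamps(List<Version> newestFirst,
                                     NavigableSet<Long> changeTimestamps) {
        List<Long> kept = new ArrayList<>();
        // Timestamp of the previous (newer) non-marker cell seen in this column.
        long newerTs = Long.MAX_VALUE;
        for (Version v : newestFirst) {
            if (v.deleteFamily) {
                // Markers are always kept and do not count as a column version.
                kept.add(v.timestamp);
                continue;
            }
            // Smallest change timestamp strictly above this cell, if any.
            Long nextChange = changeTimestamps.higher(v.timestamp);
            boolean isChangeCell = changeTimestamps.contains(v.timestamp);
            // First cell below a change timestamp: no other cell of this
            // column lies strictly between this cell and that change.
            boolean isPreImage = nextChange != null && nextChange <= newerTs;
            if (isChangeCell || isPreImage) {
                kept.add(v.timestamp);
            }
            newerTs = v.timestamp;
        }
        return kept;
    }

    public static void main(String[] args) {
        NavigableSet<Long> changes = new TreeSet<>(Arrays.asList(90L, 60L));
        List<Version> column = new ArrayList<>();
        for (long ts : new long[] {100, 90, 80, 70, 60, 50}) {
            column.add(new Version(ts, false));
        }
        System.out.println(keptTimestamps(column, changes)); // prints [90, 80, 60, 50]
    }
}
```

In this example the versions at 100 and 70 are skipped: 100 is newer than
every change in the batch, and 70 is neither a change cell nor the first cell
below one. A real filter would work on HBase cells inside the scan rather
than on an in-memory list.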
--
This message was sent by Atlassian Jira
(v8.20.10#820010)