Viraj Jasani created PHOENIX-7456:
-------------------------------------

             Summary: Change Data Capture for Phoenix Stream
                 Key: PHOENIX-7456
                 URL: https://issues.apache.org/jira/browse/PHOENIX-7456
             Project: Phoenix
          Issue Type: New Feature
            Reporter: Viraj Jasani


Apache Phoenix provides Change Data Capture (CDC) with PHOENIX-7001. The CDC
design in Phoenix leverages the write-optimized Uncovered Index as well as the
Max Lookback feature. Changes are captured as a time-ordered sequence of
row-level modification events, with an option to include the pre-image and
post-image with every modification.

Since CDC uses an uncovered index on PHOENIX_ROW_TIMESTAMP(), the CDC
consumer needs to provide the time range for which records are to be read from
the CDC index. With this approach, an HBase table region split has no impact on
consumer queries, because the index regionserver scans as many data table
regions as were involved in table data modifications in the given time range.
As a consequence, the consumer has no way to consume only the change events of
a specific table region (partition). This matters especially for cloud-native
applications that consume Change Stream records: they need to determine how
many compute units (memory, CPU, IO, etc.) to allocate according to the number
of data table regions involved in the given time range. That number changes
over time: as a region grows beyond a certain limit, HBase considers splitting
it based on the split policy. The default split policy is based on region size
growth. For instance, regions are not allowed to grow beyond 10 GB by default.
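
To make the sizing concern concrete, here is a minimal sketch in Python. It is purely illustrative and is not Phoenix code: the names `ChangeEvent`, `regions_in_range`, and `workers_needed`, the `region_id` field, and the fixed workers-per-region rule are all assumptions made for this example. It only shows why a consumer would want to know which regions contributed changes in a time range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    timestamp_ms: int  # time of the row modification
    region_id: str     # hypothetical id of the data table region that took the write
    row_key: str


def regions_in_range(events, start_ms, end_ms):
    """Return the region ids with at least one change in [start_ms, end_ms)."""
    return {e.region_id for e in events if start_ms <= e.timestamp_ms < end_ms}


def workers_needed(events, start_ms, end_ms, workers_per_region=1):
    """Assumed sizing rule: a fixed number of workers per involved region."""
    return len(regions_in_range(events, start_ms, end_ms)) * workers_per_region


# Sample events; the values are made up for illustration.
events = [
    ChangeEvent(1000, "region-a", "row1"),
    ChangeEvent(1500, "region-b", "row7"),
    ChangeEvent(2500, "region-a", "row2"),
    ChangeEvent(3000, "region-c", "row9"),
]
```

With these sample events, a consumer reading the range from 1000 up to (but not including) 3000 would see changes from two regions and could size its compute for two partitions instead of guessing.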

The proposed solution requires a new framework that introduces streaming
concepts for Phoenix CDC. The solution needs to provide one active stream for
any given table on which CDC is enabled by the client or consumer.

*Change Stream:*

Phoenix Stream captures a time-ordered sequence of row-level modifications in
any table and stores this information in a log for up to the TTL window (24
hours by default). Client applications can access this log and view the
changes in near-real time, optionally including how the data appeared before
and after each row was modified.
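
The idea of a change log with optional before and after images can be sketched as follows. This is a toy in-memory model in Python, not the Phoenix implementation: `StreamRecord`, `ChangeLog`, and their fields are hypothetical names, and the sketch assumes callers supply timestamps in increasing order.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StreamRecord:
    timestamp_ms: int
    row_key: str
    pre_image: Optional[dict]   # row before the change; None if the row did not exist
    post_image: Optional[dict]  # row after the change; None if the row was deleted


class ChangeLog:
    """Toy table that records every modification as a stream record."""

    def __init__(self):
        self._rows = {}    # current state of the table, keyed by row key
        self.records = []  # change records in the order they were applied

    def upsert(self, timestamp_ms, row_key, values):
        before = self._rows.get(row_key)
        after = {**(before or {}), **values}  # new dict, so 'before' is never mutated
        self._rows[row_key] = after
        self.records.append(StreamRecord(timestamp_ms, row_key, before, after))

    def delete(self, timestamp_ms, row_key):
        before = self._rows.pop(row_key, None)
        self.records.append(StreamRecord(timestamp_ms, row_key, before, None))
```

A consumer that only needs to know which rows changed can ignore both images, while one that must compute a diff can compare `pre_image` with `post_image` for each record.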

*Stream Partitions:*

Stream records are organized into groups, or partitions. Each partition acts as 
a container for multiple stream records and contains information required for 
accessing and iterating through these records. The stream records within a 
partition are removed automatically after the TTL window.
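
A partition as a container with TTL-based removal might look like the sketch below. It is written in Python under stated assumptions: `StreamPartition`, `expire`, and `read` are hypothetical names, the 24-hour constant mirrors the default TTL mentioned in this description, and expiry is triggered explicitly here rather than automatically.

```python
from collections import deque

DAY_MS = 24 * 60 * 60 * 1000  # assumed default TTL of 24 hours, in milliseconds


class StreamPartition:
    """Toy container for the stream records of one partition."""

    def __init__(self, partition_id, ttl_ms=DAY_MS):
        self.partition_id = partition_id
        self.ttl_ms = ttl_ms
        self._records = deque()  # (timestamp_ms, payload), appended in time order

    def append(self, timestamp_ms, payload):
        self._records.append((timestamp_ms, payload))

    def expire(self, now_ms):
        """Drop records whose timestamp is older than the TTL window."""
        cutoff = now_ms - self.ttl_ms
        while self._records and self._records[0][0] < cutoff:
            self._records.popleft()

    def read(self, after_ms=-1):
        """Return records with a timestamp strictly greater than after_ms."""
        return [r for r in self._records if r[0] > after_ms]
```

A consumer would remember the timestamp of the last record it processed and pass it as `after_ms` on the next call, so it can iterate through a partition incrementally.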

*Note:* PHOENIX-7425 introduces a partition id when generating the index
mutations for CDC.

Detailed design doc: 
[https://docs.google.com/document/d/157KXONp4ZoWmhUBsy-lK3LNAanCsuJ3cb7qYTmozgZo/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
