Viraj Jasani created PHOENIX-7456:
-------------------------------------
Summary: Change Data Capture for Phoenix Stream
Key: PHOENIX-7456
URL: https://issues.apache.org/jira/browse/PHOENIX-7456
Project: Phoenix
Issue Type: New Feature
Reporter: Viraj Jasani
Apache Phoenix provides Change Data Capture (CDC) with PHOENIX-7001. The CDC
design in Phoenix leverages the write-optimized Uncovered Index as well as Max
Lookback features. The changes are captured in the time-ordered event of row
level modifications with an option to provide pre-image and post-image with
every modification.
Since the CDC uses an uncovered index on PHOENIX_ROW_TIMESTAMP(), the CDC
consumer needs to provide time duration for which the records are expected to
be read from the CDC Index. In this approach, any event of HBase table region
split does not have any impact on the consumer queries as the index
regionserver performs the scans on multiple data table regions depending on how
many regions have been involved with the table data modifications in the given
time range. Therefore, the consumer does not have the option to consume only
the given table region (partition) specific change events. It is specifically
important for the cloud native applications that consume Change Stream records
to be able to identify how much compute units (memory, CPU, IO etc) needs to be
allocated according to the number of data table regions involved for the given
time range. As the region size grows beyond a certain limit, HBase considers
splitting the given region based on the split policy. The default split policy
is based on region size growth. For instance, regions are not allowed to grow
beyond 10 GB by default.
The proposed solution requires a new framework introducing the Streaming
concepts for Phoenix CDC. The solution needs to provide one active stream for
the given table on which the CDC is enabled by the client or consumer.
*Change Stream:*
Phoenix Stream captures a time-ordered sequence of row-level modifications in
any table and stores this information in a log for up to TTL window (24 hour by
default). Client applications can access this log and view the changes with an
optional support of how the data appeared before and after the row is modified,
in near-real time.
*Stream Partitions:*
Stream records are organized into groups, or partitions. Each partition acts as
a container for multiple stream records and contains information required for
accessing and iterating through these records. The stream records within a
partition are removed automatically after the TTL window.
*Note:* PHOENIX-7425 introduces partition id while generating the index
mutations for the CDC.
Detailed design doc:
[https://docs.google.com/document/d/157KXONp4ZoWmhUBsy-lK3LNAanCsuJ3cb7qYTmozgZo/edit?usp=sharing]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)