Viraj Jasani created PHOENIX-7456:
-------------------------------------

Summary: Change Data Capture for Phoenix Stream
Key: PHOENIX-7456
URL: https://issues.apache.org/jira/browse/PHOENIX-7456
Project: Phoenix
Issue Type: New Feature
Reporter: Viraj Jasani

Apache Phoenix provides Change Data Capture (CDC) with PHOENIX-7001. The CDC design in Phoenix leverages the write-optimized Uncovered Index as well as the Max Lookback feature. Changes are captured as a time-ordered sequence of row-level modification events, with an option to provide the pre-image and post-image with every modification.

Since CDC uses an uncovered index on PHOENIX_ROW_TIMESTAMP(), the CDC consumer must provide the time range for which records are to be read from the CDC index. In this approach, an HBase region split has no impact on consumer queries, because the index regionserver scans as many data table regions as were involved in table data modifications during the given time range. As a consequence, the consumer has no option to consume only the change events of a specific table region (partition). This matters especially for cloud-native applications that consume Change Stream records, which need to determine how many compute units (memory, CPU, IO, etc.) to allocate according to the number of data table regions involved in the given time range.

As a region grows beyond a certain limit, HBase considers splitting it based on the split policy. The default split policy is based on region size growth. For instance, regions are not allowed to grow beyond 10 GB by default.

The proposed solution requires a new framework that introduces streaming concepts for Phoenix CDC. The solution needs to provide one active stream for any table on which CDC is enabled by the client or consumer.

*Change Stream:* Phoenix Stream captures a time-ordered sequence of row-level modifications in any table and stores this information in a log for up to the TTL window (24 hours by default).
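As a rough illustration of the Change Stream idea only (this is not Phoenix code, and every class, field, and method name below is hypothetical), a time-ordered change log with optional pre/post-images and a TTL window could be modeled in memory like this:

```python
from bisect import insort
from dataclasses import dataclass, field
from typing import Optional

# 24-hour retention window, matching the default stated in the description.
DEFAULT_TTL_MS = 24 * 60 * 60 * 1000


@dataclass(order=True)
class ChangeRecord:
    """One row-level modification (hypothetical shape, for illustration)."""
    timestamp_ms: int                       # the only field used for ordering
    row_key: str = field(compare=False)
    pre_image: Optional[dict] = field(default=None, compare=False)
    post_image: Optional[dict] = field(default=None, compare=False)


class ChangeLog:
    """Toy in-memory log that keeps records sorted by timestamp."""

    def __init__(self, ttl_ms: int = DEFAULT_TTL_MS):
        self.ttl_ms = ttl_ms
        self._records: list[ChangeRecord] = []

    def append(self, record: ChangeRecord) -> None:
        # insort keeps the list time-ordered even for out-of-order arrivals.
        insort(self._records, record)

    def expire(self, now_ms: int) -> None:
        # Drop every record that has fallen out of the TTL window.
        cutoff = now_ms - self.ttl_ms
        self._records = [r for r in self._records if r.timestamp_ms >= cutoff]

    def read(self, start_ms: int, end_ms: int) -> list[ChangeRecord]:
        # Consumers read by time range: start inclusive, end exclusive.
        return [r for r in self._records if start_ms <= r.timestamp_ms < end_ms]
```
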
Client applications can access this log and view the changes in near-real time, optionally including how the data appeared before and after the row was modified.

*Stream Partitions:* Stream records are organized into groups, or partitions. Each partition acts as a container for multiple stream records and contains the information required for accessing and iterating through these records. The stream records within a partition are removed automatically after the TTL window.

*Note:* PHOENIX-7425 introduces the partition id while generating the index mutations for CDC.

Detailed design doc: [https://docs.google.com/document/d/157KXONp4ZoWmhUBsy-lK3LNAanCsuJ3cb7qYTmozgZo/edit?usp=sharing]

--
This message was sent by Atlassian Jira (v8.20.10#820010)