[ https://issues.apache.org/jira/browse/PHOENIX-7425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stephen Yuan Jiang reassigned PHOENIX-7425:
-------------------------------------------

    Assignee: Kadir Ozdemir

> Partitioned CDC Index for eliminating salting
> ---------------------------------------------
>
>                 Key: PHOENIX-7425
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-7425
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Kadir Ozdemir
>            Assignee: Kadir Ozdemir
>            Priority: Major
>
> The CDC feature (PHOENIX-7001) uses a global uncovered index based on time
> (PHOENIX_ROW_TIMESTAMP()). Such an index will likely create hot spotting
> during writes. This is because the same index region will keep getting
> updated, as the row key of the index table is PHOENIX_ROW_TIMESTAMP()
> + data table row key.
> The same hot spotting can happen during reads, as only a small subset of
> index regions is used for a given time range. For example, the most recent
> changes will be retrieved through one or two index regions.
> To address these hot spotting issues, PHOENIX-7001 suggests salting the index.
> There are three main issues with salting.
> The first one is that the number of salt buckets is static and needs to be
> determined when the index is created.
> The second is that salting does not work well with batch writes, as it breaks
> a batch of writes into separate mini batches, one for each salt bucket. This
> leads to using more client threads and server RPC handlers, one for each salt
> bucket.
> The last issue is that the salt buckets are not visible to applications, and
> thus applications cannot take advantage of the parallelism that comes with
> salting during reads. For example, there is no way for applications to use
> multiple threads, one thread for each salt bucket, for their queries.
> To address all these issues that come with salting, this Jira introduces a
> built-in function for CDC indexes called PARTITION_ID(). PARTITION_ID() will
> be the prefix of an index row key (= PARTITION_ID() + PHOENIX_ROW_TIMESTAMP()
> + data table row key).
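
To make the difference between the two key layouts concrete, here is a minimal Python sketch. It is only a toy model under stated assumptions, not Phoenix code: the partition ids ("region-0", ...), the timestamps, the split points, and the helper names are all hypothetical, and real HBase region boundaries are chosen differently. It only illustrates why a timestamp-first key sends concurrent writes to one index key range, while a partition-id-first key spreads them out.

```python
from bisect import bisect_right
from collections import Counter


def region_of(key, split_points):
    """Return the index of the key range (toy "index region") holding `key`.

    Region i covers [split_points[i-1], split_points[i]); tuples compare
    lexicographically, which mimics a composite row key.
    """
    return bisect_right(split_points, key)


def writes_per_region(keys, split_points):
    """Count how many index writes land in each toy index region."""
    return Counter(region_of(k, split_points) for k in keys)


# Hypothetical workload: 3 data table regions, 4 writes each, all within
# the same short time window. The ids and timestamps are made up.
writes = [
    (f"region-{r}", 1_700_000_000_000 + i, f"row-{r}-{i}")
    for r in range(3)
    for i in range(4)
]

# Layout 1: PHOENIX_ROW_TIMESTAMP() + data table row key.
ts_keys = [(ts, row) for _, ts, row in writes]
# Assumed split points: existing index regions were split on older timestamps.
ts_splits = [(1_600_000_000_000,), (1_650_000_000_000,)]

# Layout 2: PARTITION_ID() + PHOENIX_ROW_TIMESTAMP() + data table row key.
part_keys = [(pid, ts, row) for pid, ts, row in writes]
# Assumed split points: one index key range per partition id.
part_splits = [("region-1",), ("region-2",)]

ts_load = writes_per_region(ts_keys, ts_splits)
part_load = writes_per_region(part_keys, part_splits)

print(dict(ts_load))    # {2: 12}  every write hits the newest key range
print(dict(part_load))  # {0: 4, 1: 4, 2: 4}  writes spread by partition id
```

In this toy model, all twelve timestamp-first writes fall into the last key range, which is the hot spot described above, whereas the partition-id-first keys are distributed according to the data table region that produced them.
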
> PARTITION_ID() will identify the data table region of the data table row
> key. PARTITION_ID() can be the encoded name of the data table region.
> Like PHOENIX_ROW_TIMESTAMP(), PARTITION_ID() can be used in CDC index queries.
> By including PARTITION_ID() in the row key of an index table, we essentially
> create the effect of a local index, such that all index mutations for a given
> data table region are written to one index region determined by the
> PARTITION_ID(). However, this approach avoids the local index problem of
> having to copy index rows when a data table region splits.
> It is worthwhile to note that even if we attempted to use a local index as
> the CDC index, applications could not directly query individual local index
> regions, whereas global indexes with PARTITION_ID() make this possible. The
> PARTITION_ID() creates a new class of global indexes that can be called
> partitioned global indexes. These will likely be the new local indexes for
> Phoenix.
> The partitioned CDC indexes will eliminate the need for salting CDC indexes.
> Adding the partition id will increase the row key size of the CDC index. This
> will not be an issue for storage footprint, as the partition id will be the
> row key prefix and it will be compressed using row key prefix encoding or
> block compression.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)