[ 
https://issues.apache.org/jira/browse/PHOENIX-7425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen Yuan Jiang reassigned PHOENIX-7425:
-------------------------------------------

    Assignee: Kadir Ozdemir

> Partitioned CDC Index for eliminating salting
> ---------------------------------------------
>
>                 Key: PHOENIX-7425
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-7425
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Kadir Ozdemir
>            Assignee: Kadir Ozdemir
>            Priority: Major
>
> The CDC future (PHOENX-7001) uses a global uncovered time 
> (PHOENIX_ROW_TIMESTAM()) based index. Such an index will likely create hot 
> spotting during writes. This is because the same index region will keep 
> getting updated as the row key of the index table isĀ  PHOENIX_ROW_TIMESTAMP() 
> + data table row key.
> The same hot spotting can happen during reads as a small subset of index 
> regions can be used for a given time range. For example, the most recent 
> changes will be retrieved through one or two index regions.
> To address these hot spotting issues, PHOENX-7001 suggests salting the index. 
> There are three main issues with salting.
> The first one is that the number of salt buckets is static and needs to be 
> determined when the index is created.
> The second is that salting does not work well with batch writes as it results 
> in breaking a batch of writes into separate mini batches, one for each salt 
> bucket. This leads to using more client threads and server RPC handlers, one 
> for each salt bucket.
> The last issue is that the salt buckets are not visible to applications and 
> thus they cannot take advantage of the parallelism that comes with salting 
> during reads. For example, there is no way for applications to use multiple 
> threads, one thread for each salt bucket, for their queries.
> To address all these issues that come with salting, this Jira introduces a 
> built-in function for CDC indexes called PARTITION_ID(). PARTITION_ID() will 
> be the prefix of an index row key (= PARTITION_ID() + PHOENIX_ROW_TIMESTAMP() 
> + data table row key). PARTITION_ID() will identify the data table region of 
> the data table row key. PARTITION_ID() can be the encoded name of the data 
> table region.
> Like PHOENIX_ROW_TIMESTAMP(), PARTITION_ID() can be used in CDC index queries.
> By including PARTITION_ID() in the row key of an index table, we essentially 
> create the effect of local index such that all index mutations for a given 
> data table region are written to one index region determined by the 
> PARITION_ID(). However, here we will not have the local index problem with 
> region splits where copying index rows during data table region splits is 
> required.
> It is worthwhile to note that even if we attempt to use local index as CDC 
> index, applications cannot directly query individual local index regions, 
> which will be available with global indexes with PARTITION_ID(). The 
> PARTITION_ID() creates a new class of global indexes that can be called 
> partitioned global indexes. These will likely be the new local indexes for 
> Phoenix.
> The partitioned CDC indexes will eliminate the need for salting CDC indexes.
> Adding partition id will increase the row key size of the CDC index. This 
> will not be an issue for storage footprint as the partition id will be the 
> row key prefix and it will be compressed using row kew prefix encoding or 
> block compression.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to