[ 
https://issues.apache.org/jira/browse/PHOENIX-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kadir OZDEMIR updated PHOENIX-5791:
-----------------------------------
    Description: 
IndexTool verification generates an expected list of index mutations from the 
data table rows and uses this list to check whether index table rows are 
consistent with the data table. To do so, it performs the following steps:
 # The data table rows are scanned with a raw scan. This raw scan is configured 
to read all versions of rows. 
 # For each scanned row, the cells that are scanned are grouped into two sets: 
put and delete. The put set is the set of put cells and the delete set is the 
set of delete cells.
 # The put and delete sets for a given row are further grouped based on their 
timestamps into put and delete mutations such that all the cells in a mutation 
have the same timestamp. 
 # The put and delete mutations are then sorted within a single list. Mutations 
in this list are sorted in ascending order of their timestamp. 
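
The steps above can be sketched roughly as follows. This is only an 
illustrative model in Python, not the IndexTool code: the Cell type and the 
build_mutations function are hypothetical names, the input is assumed to be 
the cells already returned by the raw scan for one row, and the relative order 
of a put and a delete mutation with the same timestamp is not specified in the 
description and is arbitrary here.

```python
from collections import defaultdict
from typing import List, NamedTuple, Optional, Tuple


class Cell(NamedTuple):
    """Hypothetical stand-in for one cell returned by the raw scan."""
    column: str
    timestamp: int
    kind: str                      # "put" or "delete"
    value: Optional[bytes] = None


Mutation = Tuple[int, str, List[Cell]]  # (timestamp, kind, cells)


def build_mutations(cells: List[Cell]) -> List[Mutation]:
    """Turn the cells of a single data table row into a sorted mutation list."""
    # Step 2: split the scanned cells into a put set and a delete set.
    puts = [c for c in cells if c.kind == "put"]
    deletes = [c for c in cells if c.kind == "delete"]

    mutations: List[Mutation] = []
    for kind, group in (("put", puts), ("delete", deletes)):
        # Step 3: cells sharing a timestamp form one mutation.
        by_timestamp = defaultdict(list)
        for cell in group:
            by_timestamp[cell.timestamp].append(cell)
        for timestamp, ts_cells in by_timestamp.items():
            mutations.append((timestamp, kind, ts_cells))

    # Step 4: one list, ascending by timestamp (ties keep insertion order).
    mutations.sort(key=lambda m: m[0])
    return mutations
```

For example, two put cells at timestamp 10, one put cell at timestamp 20, and 
one delete cell at timestamp 20 yield three mutations: one put mutation at 10, 
followed by one put and one delete mutation at 20.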

The above process assumes that for each data table update, the index table will 
be updated with the correct index row key. However, this assumption does not 
hold in the presence of concurrent updates.

From the consistent indexing design (PHOENIX-5156) perspective, two or more 
pending updates from different batches on the same data row are concurrent if 
and only if for all of these updates the data table row state is read from 
HBase under the row lock and for none of them the row lock has been acquired 
the second time for updating the data table. In other words, all of them are 
in the first update phase concurrently. For concurrent updates, the first two 
update phases are done but the last update phase is skipped. This means the 
data table row will be updated by these updates but the corresponding index 
table rows will be left with the unverified status. Then, the read repair 
process will repair these unverified index rows during scans.

In addition to leaving index rows unverified, concurrent updates may 
generate index rows with incorrect row keys. For example, consider that an 
application issues the very first two upserts on the same row concurrently 
and the second update does not include one or more of the indexed columns. When 
these updates arrive concurrently at IndexRegionObserver, the existing row 
state would be found null for both of these updates. This means the index 
updates will be generated solely from the pending updates. The partial upsert 
with missing indexed columns will generate an index row by assuming the missing 
indexed columns have null values, and this assumption may not be true as the 
other concurrent upsert may have non-null values for the indexed columns. 

Since expected index mutations are derived from the data table row after these 
concurrent mutations are applied, the expected list would not match with the 
actual list of index mutations.  
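
The mismatch can be illustrated with a small model. This is a simplified 
sketch under stated assumptions, not the Phoenix implementation: the function 
names are hypothetical, an upsert is modeled as a dict of column values, and 
an index row key is modeled as the tuple of indexed column values followed by 
the data row key.

```python
from typing import Dict, List, Optional, Tuple

Upsert = Dict[str, str]
IndexKey = Tuple[Optional[str], ...]


def index_row_key(indexed_columns: List[str], row_state: Upsert,
                  data_row_key: str) -> IndexKey:
    """Model an index row key; a missing indexed column is treated as null."""
    return tuple(row_state.get(c) for c in indexed_columns) + (data_row_key,)


def actual_index_keys(indexed_columns: List[str], data_row_key: str,
                      concurrent_upserts: List[Upsert]) -> List[IndexKey]:
    """Concurrent case: each upsert finds no existing row state, so its index
    row is generated solely from its own (possibly partial) columns."""
    return [index_row_key(indexed_columns, u, data_row_key)
            for u in concurrent_upserts]


def expected_index_keys(indexed_columns: List[str], data_row_key: str,
                        upserts_by_timestamp: List[Upsert]) -> List[IndexKey]:
    """Verification case (assumed model): keys are derived from the data row
    state after each mutation is applied in timestamp order."""
    state: Upsert = {}
    keys = []
    for upsert in upserts_by_timestamp:
        state.update(upsert)
        keys.append(index_row_key(indexed_columns, state, data_row_key))
    return keys


indexed = ["A", "B"]
upserts = [{"A": "a1", "B": "b1"},   # full upsert
           {"A": "a2"}]              # partial upsert, indexed column B missing

actual = actual_index_keys(indexed, "r1", upserts)
expected = expected_index_keys(indexed, "r1", upserts)
```

In this model the partial upsert produces the index row key 
("a2", None, "r1"), while the key derived from the data table row is 
("a2", "b1", "r1"), so the two lists differ in their second entry.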

 

  was:
IndexTool verification generates an expected list of index mutations from the 
data table rows and uses this list to check whether index table rows are 
consistent with the data table. To do so, it performs the following steps:
 # The data table rows are scanned with a raw scan. This raw scan is configured 
to read all versions of rows. 
 # For each scanned row, the cells that are scanned are grouped into two sets: 
put and delete. The put set is the set of put cells and the delete set is the 
set of delete cells.
 # The put and delete sets for a given row are further grouped based on their 
timestamps into put and delete mutations such that all the cells in a mutation 
have the same timestamp. 
 # The put and delete mutations are then sorted within a single list. Mutations 
in this list are sorted in ascending order of their timestamp. 

The above process assumes that for each data table update, the index table will 
be updated with the correct index row key. However, this assumption does not 
hold in the presence of concurrent updates.

From the consistent indexing design (PHOENIX-5156) perspective, two or more 
pending updates from different batches on the same data row are concurrent if 
and only if for all of these updates the data table row state is read from 
HBase under the row lock and for none of them the row lock has been acquired 
the second time for updating the data table. In other words, all of them are 
in the first update phase concurrently. For concurrent updates, the first two 
update phases are done but the last update phase is skipped. This means the 
data table row will be updated by these updates but the corresponding index 
table rows will be left with the unverified status. Then, the read repair 
process will repair these unverified index rows during scans.

In addition to leaving index rows unverified, concurrent updates may 
generate index rows with incorrect row keys. For example, consider that an 
application issues the very first two upserts on the same row concurrently 
and the second update does not include one or more of the indexed columns. When 
these updates arrive concurrently at IndexRegionObserver, the existing row 
state would be found null for both of these updates. This means the index 
updates will be generated solely from the pending updates. The partial upsert 
with missing indexed columns will generate an index row by assuming the missing 
indexed columns have null values, and this assumption may not be true as the 
other concurrent upsert may have non-null values for the indexed columns. 

Since expected index mutations are derived from the data table row after these 
concurrent mutations are applied, the expected list would not match with the 
actual list of index mutations. Please note that this does not pose a 
correctness issue as these index rows are unverified and the data table row key 
can be correctly extracted from them by the read repair process even though 
their index row keys are incorrect. 

 


> Eliminate false invalid row detection due to concurrent updates 
> ----------------------------------------------------------------
>
>                 Key: PHOENIX-5791
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-5791
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: Kadir OZDEMIR
>            Assignee: Kadir OZDEMIR
>            Priority: Major
>
> IndexTool verification generates an expected list of index mutations from the 
> data table rows and uses this list to check whether index table rows are 
> consistent with the data table. To do so, it performs the following steps:
>  # The data table rows are scanned with a raw scan. This raw scan is 
> configured to read all versions of rows. 
>  # For each scanned row, the cells that are scanned are grouped into two 
> sets: put and delete. The put set is the set of put cells and the delete set 
> is the set of delete cells.
>  # The put and delete sets for a given row are further grouped based on their 
> timestamps into put and delete mutations such that all the cells in a 
> mutation have the same timestamp. 
>  # The put and delete mutations are then sorted within a single list. 
> Mutations in this list are sorted in ascending order of their timestamp. 
> The above process assumes that for each data table update, the index table 
> will be updated with the correct index row key. However, this assumption does 
> not hold in the presence of concurrent updates.
> From the consistent indexing design (PHOENIX-5156) perspective, two or more 
> pending updates from different batches on the same data row are concurrent if 
> and only if for all of these updates the data table row state is read from 
> HBase under the row lock and for none of them the row lock has been acquired 
> the second time for updating the data table. In other words, all of them are 
> in the first update phase concurrently. For concurrent updates, the first two 
> update phases are done but the last update phase is skipped. This means the 
> data table row will be updated by these updates but the corresponding index 
> table rows will be left with the unverified status. Then, the read repair 
> process will repair these unverified index rows during scans.
> In addition to leaving index rows unverified, concurrent updates may 
> generate index rows with incorrect row keys. For example, consider that an 
> application issues the very first two upserts on the same row concurrently 
> and the second update does not include one or more of the indexed columns. 
> When these updates arrive concurrently at IndexRegionObserver, the existing 
> row state would be found null for both of these updates. This means the index 
> updates will be generated solely from the pending updates. The partial upsert 
> with missing indexed columns will generate an index row by assuming the missing 
> indexed columns have null values, and this assumption may not be true as the 
> other concurrent upsert may have non-null values for the indexed columns. 
> Since expected index mutations are derived from the data table row after 
> these concurrent mutations are applied, the expected list would not match 
> with the actual list of index mutations.  
>  



