[
https://issues.apache.org/jira/browse/PHOENIX-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078812#comment-17078812
]
Kadir OZDEMIR commented on PHOENIX-5528:
----------------------------------------
The approaches proposed so far do not adequately address the problem.
I am going to describe another approach which I have discussed with
[~abhishek.chouhan] and we think that it addresses the problem.
As I mentioned before, and as the summary of this issue implies, the root
cause of returning multiple index rows for a given data table row is that
pending mutations are visible to scans. There is no requirement for these
mutations to be visible to scans, since their completion status has not yet
been returned to the client. In fact, they should not be visible to scans, so
we can safely exclude them.
Based on this observation, we can get a maxTs value from each of the data table
regions. Here, maxTs is the highest timestamp value assigned to a committed
data table mutation since the table region was opened on its current region
server. If the region server has not committed a mutation on this table region
since the open, then maxTs is the region server's current time at the moment
the table region was opened on it.
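The maxTs bookkeeping described above can be sketched as follows. This is
only an illustration under assumptions: RegionMaxTsTracker and its methods are
hypothetical names, not existing Phoenix or HBase classes, and the sketch
relies on the region server clock not going backward, so that taking the
maximum of the open time and the committed timestamps matches the definition.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-region tracker; not an existing Phoenix or HBase class. */
final class RegionMaxTsTracker {
    // Starts at the region open time and only ever increases.
    private final AtomicLong maxTs;

    /** @param regionOpenTimeMs region server clock reading when the region was opened */
    RegionMaxTsTracker(long regionOpenTimeMs) {
        this.maxTs = new AtomicLong(regionOpenTimeMs);
    }

    /** Called after a data table mutation with this timestamp has been committed. */
    void onCommitted(long mutationTs) {
        // Keep the larger of the current value and the new timestamp.
        maxTs.accumulateAndGet(mutationTs, Math::max);
    }

    /** Value a region would report when asked for its maxTs. */
    long getMaxTs() {
        return maxTs.get();
    }
}
```

A region that has committed nothing since it was opened reports its open time;
otherwise it reports the highest committed timestamp.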
After collecting these maxTs values, we take the maximum of them and use it as
the maxTs value for all the scans to be done on the index table regions. Before
starting a scan, we ask the data table regions to cancel their pending
mutations that are still in the first update phase and whose timestamps are
lower than maxTs. Here, cancelling means that these mutations do not move to
the second update phase (i.e., the data table update phase) and an IO exception
is returned to the HBase client, which will retry these mutations. When these
mutations are retried, they will get new timestamps. Here, we assume that
region server clocks do not go backward.
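To make the cancellation rule concrete, here is a minimal in-memory sketch.
All names (PendingMutation, PendingMutationRegistry, the phase names) are
hypothetical and chosen for illustration; the actual change would live in the
Phoenix indexing coprocessor code, which this does not attempt to reproduce.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

/** Hypothetical model of one in-flight mutation; not a Phoenix class. */
final class PendingMutation {
    enum Phase { INDEX_PRE_UPDATE, DATA_TABLE_UPDATE }

    final long ts;
    Phase phase = Phase.INDEX_PRE_UPDATE;
    boolean cancelled = false;

    PendingMutation(long ts) { this.ts = ts; }
}

/** Hypothetical per-region registry of pending mutations. */
final class PendingMutationRegistry {
    private final List<PendingMutation> pending = new ArrayList<>();

    /** The scan-wide maxTs is the maximum of the per-region values. */
    static long globalMaxTs(Collection<Long> perRegionMaxTs) {
        return Collections.max(perRegionMaxTs);
    }

    /** Register a mutation that has started its first update phase. */
    synchronized PendingMutation begin(long ts) {
        PendingMutation m = new PendingMutation(ts);
        pending.add(m);
        return m;
    }

    /** Mark first-phase mutations with ts lower than maxTs as cancelled. */
    synchronized int cancelBelow(long maxTs) {
        int count = 0;
        for (PendingMutation m : pending) {
            if (!m.cancelled && m.phase == PendingMutation.Phase.INDEX_PRE_UPDATE
                    && m.ts < maxTs) {
                m.cancelled = true;
                count++;
            }
        }
        return count;
    }

    /** Gate before the data table update; cancelled mutations fail here. */
    synchronized void enterDataTablePhase(PendingMutation m) throws IOException {
        if (m.cancelled) {
            pending.remove(m);
            // The client is expected to retry, which assigns a new timestamp.
            throw new IOException("mutation cancelled by scan; retry");
        }
        m.phase = PendingMutation.Phase.DATA_TABLE_UPDATE;
    }
}
```

Mutations that have already entered the data table update phase are left
alone; only those still in the first phase with a timestamp below maxTs are
rejected.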
The cost of this approach is two sets of parallel RPCs to the data table
regions that would need to be scanned if the query were executed on the data
table. (The query is actually executed on the index table.) In the worst case,
all data table regions are contacted.
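The two rounds of parallel RPCs could look roughly like the sketch below. It
assumes a hypothetical DataRegionClient interface standing in for whatever RPC
or coprocessor endpoint would actually be used; InMemoryRegion is a stub so
the sketch runs on its own, and none of these names exist in Phoenix or HBase.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.stream.Collectors;

/** Hypothetical client-side view of one data table region. */
interface DataRegionClient {
    long getMaxTs();                       // round 1
    void cancelPendingBelow(long maxTs);   // round 2
}

/** Stub region used only to make the sketch runnable. */
final class InMemoryRegion implements DataRegionClient {
    private final long maxTs;
    volatile long cancelledBelow = -1L;

    InMemoryRegion(long maxTs) { this.maxTs = maxTs; }

    @Override public long getMaxTs() { return maxTs; }

    @Override public void cancelPendingBelow(long ts) { cancelledBelow = ts; }
}

final class IndexScanPreparer {
    /** Runs both rounds and returns the maxTs to use for the index scans. */
    static long prepare(List<? extends DataRegionClient> regions, ExecutorService pool) {
        // Round 1: collect maxTs from every region in parallel.
        List<CompletableFuture<Long>> round1 = regions.stream()
                .map(r -> CompletableFuture.supplyAsync(r::getMaxTs, pool))
                .collect(Collectors.toList());
        long maxTs = round1.stream()
                .mapToLong(CompletableFuture::join)
                .max()
                .orElseThrow(() -> new IllegalArgumentException("no regions"));

        // Round 2: ask every region to cancel first-phase mutations below maxTs.
        CompletableFuture<?>[] round2 = regions.stream()
                .map(r -> CompletableFuture.runAsync(() -> r.cancelPendingBelow(maxTs), pool))
                .toArray(CompletableFuture<?>[]::new);
        CompletableFuture.allOf(round2).join();
        return maxTs;
    }
}
```

The second round cannot start until the first has finished, because the
cancellation threshold is the maximum over all regions; that dependency is why
the cost is two sets of RPCs rather than one.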
> Race condition in index verification causes multiple index rows to be
> returned for single data table row
> --------------------------------------------------------------------------------------------------------
>
> Key: PHOENIX-5528
> URL: https://issues.apache.org/jira/browse/PHOENIX-5528
> Project: Phoenix
> Issue Type: Bug
> Reporter: Vincent Poon
> Assignee: Kadir OZDEMIR
> Priority: Major
> Attachments: PHOENIX-5528.master.001.patch
>
>
> Warning: This is an artificially generated scenario that likely has a very
> low probability of happening in practice. But a race condition nevertheless.
> Unfortunately I don't have a test case, but was able to produce this by
> debugging a local regionserver and adding breakpoints at the right places to
> produce the ordering here.
> The core problem is that when we do an update to the data table, we produce
> two unverified index rows at first. When we scan both of these index rows
> and attempt to verify via rebuilding the data table row, we cannot guarantee
> that both verifications happen before the data table update, or both happen
> after the data table update.
> I use multiple index regions here to demonstrate, but I believe it could
> happen within a single region as well.
> Steps:
> 1) Create a test table with "pk" and "indexed_val" columns, and a global
> index on "indexed_val".
> 2) upsert into test values ('test_pk', 'test_val');
> 3) Split the index table on 'test_pk':
> hbase shell: split 'test_index', 'test_pk'.
> This creates two regions, call them regionA and regionB (which holds the
> existing index row)
> 4) Start an update: upsert into test values ('test_pk', 'new_val');
> The first thing the indexing code does is create two unverified index
> rows: one is a new version of the existing index row, and the other is for
> the new indexed value.
> We pause the thread after this is done, before the row locks and data
> table write happen.
> 5) select indexed_val from test;
> This scans both the index regions in parallel. Each scan picks up an
> unverified row in its region. We pause in GlobalIndexChecker.
> Let the regionB scan proceed. It will attempt to rebuild the data table
> row. The data table still has 'test_val' as the indexed value. The rebuild
> succeeds.
> The scan on regionA is still paused.
> 6) The original update proceeds to update the data table indexed value to
> 'new_val'.
> 7) The scan on regionA proceeds, and attempts to rebuild the data table row.
> The rebuild succeeds with 'new_val' as the indexed value.
> 8) Both 'test_val' and 'new_val' are returned to the client, because both
> rebuilds succeeded.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)