[ https://issues.apache.org/jira/browse/PHOENIX-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17076765#comment-17076765 ]

Kadir OZDEMIR commented on PHOENIX-5528:
----------------------------------------

[~gjacoby], I think there is a simpler solution to this problem than what I 
suggested before. Instead of trying to answer your questions (which I may not 
be able to answer adequately), I would like you and the others ([~vincentpoon], 
[~abhishek.chouhan], [~larsh]) to consider the following:

The main reason for returning multiple index rows for the same data row is 
that scans can return mutations that are added while the scan is in progress. 
One relatively straightforward way to skip mutations added after a scan starts 
is to run the scan against a point-in-time image of each table region (I am 
not referring to HBase snapshots here). This point-in-time effect is achieved 
as follows. For each table region, the time range for the scan on the region 
is set to [0, maxTs+1], where maxTs is (1) greater than or equal to the 
maximum timestamp used in the table region, and (2) less than the timestamp 
that will be assigned to the next mutation. For data table rows, timestamps 
are assigned to mutations on the server side using the server wall clock. The 
timestamps for index table mutations are copied from their data table 
mutations.
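
To make the effect concrete, here is a small illustrative sketch (plain 
Python, not Phoenix code) of how bounding a scan's time range at maxTs yields 
a point-in-time view that excludes in-flight mutations:

```python
# Illustrative sketch: a scan restricted to the time range [0, maxTs + 1]
# skips mutations written after the scan started, giving a point-in-time
# view of the region. The tuples below are stand-ins for HBase cells.

def point_in_time_scan(cells, max_ts):
    """Return only the cells whose timestamp is <= max_ts.

    `cells` is a list of (row_key, timestamp, value) tuples; max_ts is the
    highest timestamp already used in the region when the scan starts.
    """
    # HBase time ranges are half-open [min, max), so [0, maxTs + 1)
    # includes every cell with timestamp <= maxTs.
    return [c for c in cells if 0 <= c[1] < max_ts + 1]

# Cells present when the scan starts (so maxTs = 100) ...
existing = [("row1", 90, "test_val"), ("row2", 100, "other_val")]
# ... plus a concurrent update that lands mid-scan with a later timestamp.
concurrent = [("row1", 105, "new_val")]

snapshot = point_in_time_scan(existing + concurrent, max_ts=100)
# Only the pre-scan cells survive; the in-flight mutation is skipped.
```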

One way to determine the maxTs value for a table region is to scan the index 
table region for its maximum timestamp each time the region opens on a region 
server. After that, the maxTs value is maintained and updated in memory based 
on the timestamps of new mutations. Finally, the scan time range is set on 
the server side using the maxTs value.
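
A minimal sketch of that bookkeeping (again illustrative Python, not the 
actual region server code; the class and method names are mine):

```python
# Illustrative sketch: tracking a region's maxTs in memory. On region open,
# an initial scan finds the largest timestamp already present; afterwards
# every new mutation bumps the counter, and scans use it as their bound.

class RegionMaxTs:
    def __init__(self, existing_timestamps):
        # Seed maxTs from an initial scan of the region (this scan is the
        # startup-time drawback discussed in the comment).
        self.max_ts = max(existing_timestamps, default=0)

    def on_mutation(self, ts):
        # Keep maxTs current as new mutations arrive.
        self.max_ts = max(self.max_ts, ts)

    def scan_time_range(self):
        # Half-open [0, maxTs + 1): covers every cell up to maxTs.
        return (0, self.max_ts + 1)

region = RegionMaxTs([90, 100])   # initial scan saw timestamps 90 and 100
region.on_mutation(105)           # a new mutation advances maxTs to 105
# region.scan_time_range() is now (0, 106)
```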

The above solution has one drawback: the impact of the initial scan on table 
region startup time. One way to address this is to skip the scan and instead 
use an approximation for the initial maxTs value.

Suppose we can assume that the wall clock is a mostly monotonically 
increasing counter that occasionally goes backward, that when it goes 
backward it is because of clock skew between servers, and that the skew is 
less than deltaT, the time it takes a region to restart. Then, when the table 
region starts, it sets maxTs to the current wall clock time minus deltaT.
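
The approximation itself is just arithmetic; a sketch, with a placeholder 
deltaT value (the value itself is exactly the open question below, not a 
recommendation):

```python
# Illustrative sketch: seed maxTs without the startup scan, assuming any
# clock skew between servers is smaller than deltaT (the time a region
# takes to restart). DELTA_T_MS is a hypothetical placeholder value.
import time

DELTA_T_MS = 30_000  # hypothetical skew bound, in milliseconds

def initial_max_ts(now_ms=None):
    """Approximate the region's initial maxTs as (wall clock - deltaT)."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms - DELTA_T_MS

# Any timestamp another server assigned before (now - deltaT) is already
# covered by this initial bound, given the skew assumption above.
```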

Does this make sense to you? If so, what would be a good value for deltaT?
 

> Race condition in index verification causes multiple index rows to be 
> returned for single data table row
> --------------------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-5528
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-5528
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Vincent Poon
>            Assignee: Kadir OZDEMIR
>            Priority: Major
>         Attachments: PHOENIX-5528.master.001.patch
>
>
> Warning: This is an artificially generated scenario that likely has a very 
> low probability of happening in practice.  But a race condition nevertheless. 
>  Unfortunately I don't have a test case, but was able to produce this by 
> debugging a local regionserver and adding breakpoints at the right places to 
> produce the ordering here.
> The core problem is that when we do an update to the data table, we produce 
> two unverified index rows at first.  When we scan both of these index rows 
> and attempt to verify via rebuilding the data table row, we cannot guarantee 
> that both verifications happen before the data table update, or both happen 
> after the data table update.
> I use multiple index regions here to demonstrate, but I believe it could 
> happen within a single region as well.
> Steps:
> 1) Create a test table with "pk" and "indexed_val" columns, and a global 
> index on "indexed_val".
> 2) upsert into test values ('test_pk', 'test_val');
> 3) Split the index table on 'test_pk':
>    hbase shell: split 'test_index', 'test_pk'.
>    This creates two regions, call them regionA and regionB (which holds the 
> existing index row)
> 4) start an update: upsert into test values ('test_pk', 'new_val');
>    The first thing the indexing code does is create two unverified index 
> rows: one is a new version of the existing index row, and the other is for 
> the new indexed value.
>    We pause the thread after this is done, before the row locks and data 
> table write happen.
> 5) select indexed_val from test;
>    This scans both the index regions in parallel.  Each scan picks up an 
> unverified row in its region.  We pause in GlobalIndexChecker.
>    Let the regionB scan proceed.  It will attempt to rebuild the data table 
> row.  The data table still has 'test_val' as the indexed value.  The rebuild 
> succeeds.
>    The scan on regionA is still paused.
> 6) The original update proceeds to update the data table indexed value to 
> 'new_val'.
> 7) The scan on regionA proceeds and attempts to rebuild the data table row. 
>  The rebuild succeeds with 'new_val' as the indexed value.
> 8) Both 'test_val' and 'new_val' are returned to the client, because both 
> rebuilds succeeded.
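
The interleaving in the quoted scenario can be sketched in a few lines (an 
illustrative Python simulation of the race, not the actual verification 
code):

```python
# Illustrative sketch of the race: two unverified index rows are each
# verified by rebuilding from the data table, but one verification sees
# the data row before the update and the other after, so both index rows
# pass and a single data row yields two results.

data_table = {"test_pk": "test_val"}

def verify(index_row_value):
    # An unverified index row passes if the data table currently holds
    # the same indexed value (i.e. the rebuild "succeeds").
    return data_table["test_pk"] == index_row_value

results = []
# regionB's scan verifies the old index row while the update is paused ...
if verify("test_val"):
    results.append("test_val")
# ... then the paused upsert updates the data table ...
data_table["test_pk"] = "new_val"
# ... and regionA's scan verifies the new index row afterwards.
if verify("new_val"):
    results.append("new_val")

# Both rows verified: the client sees both 'test_val' and 'new_val'.
```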



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
