[ 
https://issues.apache.org/jira/browse/HBASE-12319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203690#comment-14203690
 ] 

Jeffrey Zhong commented on HBASE-12319:
---------------------------------------

Since this issue may cause data loss or inconsistent reads, I marked it as 
critical. The symptom is that a region open doesn't wait for the previous 
region close to complete, so the newly opened region may not open all store 
files if the previous region close flushes more data to disk.

The testOpenCloseRacing failure after the fix is a test issue. During the 
test, the region is opened twice, so after the fix the region is opened on 
another RS while AM still returns the first RS the region was previously 
assigned to. Before the fix, the test case doesn't wait for the previous 
region open cancel to complete, so it can see the second region assignment 
immediately. If you put a sleep after the final assertion in the test case, 
you will see the meta location updated again by the previously canceled 
region open. Below is the log after I put a 60-second sleep after the final 
assert; you can see that region ff976daf00708ecad200b113349fc4b4 was already 
in "OPEN" state and still got another "OPENED", which was from the previous 
assignment.

{noformat}
2014-11-08 12:48:32,238 DEBUG [FifoRpcScheduler.handler1-thread-2] 
master.AssignmentManager(4077): Got transition OPENED for 
{ff976daf00708ecad200b113349fc4b4 state=PENDING_OPEN, ts=1415479712217, 
server=10.10.8.224,55613,1415479709023} from 10.10.8.224,55613,1415479709023
…
2014-11-08 12:48:32,936 DEBUG [FifoRpcScheduler.handler1-thread-4] 
master.AssignmentManager(4077): Got transition OPENED for 
{ff976daf00708ecad200b113349fc4b4 state=OPEN, ts=1415479712238, 
server=10.10.8.224,55613,1415479709023} from 10.10.8.224,55609,1415479708922
{noformat}
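
To make the mismatch in the second log entry easier to spot, here is a minimal sketch of a check that compares the server recorded in the region state with the server that sent the transition. StaleTransitionCheck and isFromDifferentServer are hypothetical names for illustration only; they are not part of HBase, and the regular expression assumes the exact DEBUG format quoted above, joined onto one line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of HBase. */
final class StaleTransitionCheck {
    // Assumes the "server=<name>} from <name>" tail of the DEBUG line quoted above.
    private static final Pattern TAIL =
        Pattern.compile("server=([^}\\s]+)\\}\\s+from\\s+(\\S+)");

    /** Returns true when the reporting server differs from the server in the region state. */
    static boolean isFromDifferentServer(String logLine) {
        Matcher m = TAIL.matcher(logLine);
        if (!m.find()) {
            throw new IllegalArgumentException("unrecognized log line: " + logLine);
        }
        return !m.group(1).equals(m.group(2));
    }

    public static void main(String[] args) {
        String first = "Got transition OPENED for {ff976daf00708ecad200b113349fc4b4 "
            + "state=PENDING_OPEN, ts=1415479712217, server=10.10.8.224,55613,1415479709023} "
            + "from 10.10.8.224,55613,1415479709023";
        String second = "Got transition OPENED for {ff976daf00708ecad200b113349fc4b4 "
            + "state=OPEN, ts=1415479712238, server=10.10.8.224,55613,1415479709023} "
            + "from 10.10.8.224,55609,1415479708922";
        System.out.println(isFromDifferentServer(first));  // false
        System.out.println(isFromDifferentServer(second)); // true
    }
}
```

The second entry is flagged because the OPENED report comes from port 55609 while the region state already points at port 55613.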

The v2 patch amends the test case and makes sure that the region-opening 
cleanupFailedOpen waits for the region close before 
NotServingRegionException is returned. Thanks.
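
For readers who want the ordering spelled out, below is a rough sketch of the general idea of waiting for a region close to finish before anyone is told the region is gone. It is not the code in the patch: RegionCloseGate and its methods are hypothetical names, and the real change lives in the regionserver open/close handlers.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch, not HBase code: callers wait for an in-flight close of a region. */
final class RegionCloseGate {
    // One latch per region that currently has a close in flight.
    private final ConcurrentHashMap<String, CountDownLatch> closing = new ConcurrentHashMap<>();

    /** Record that a close has started for the region. */
    void closeStarted(String encodedRegionName) {
        closing.putIfAbsent(encodedRegionName, new CountDownLatch(1));
    }

    /** Record that the close has finished, including any final flush. */
    void closeFinished(String encodedRegionName) {
        CountDownLatch latch = closing.remove(encodedRegionName);
        if (latch != null) {
            latch.countDown();
        }
    }

    /** Returns true once no close is in flight; false on timeout or interrupt. */
    boolean awaitClose(String encodedRegionName, long timeoutMs) {
        CountDownLatch latch = closing.get(encodedRegionName);
        if (latch == null) {
            return true;
        }
        try {
            return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

In this sketch, a path that is about to report the region as not served would call awaitClose first, so that a later open elsewhere sees every store file the close flushed.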


> Inconsistencies during region recovery due to close/open of a region during 
> recovery
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-12319
>                 URL: https://issues.apache.org/jira/browse/HBASE-12319
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.7, 0.99.1
>            Reporter: Devaraj Das
>            Assignee: Jeffrey Zhong
>            Priority: Critical
>             Fix For: 2.0.0, 0.98.9, 0.99.2
>
>         Attachments: HBASE-12319.patch
>
>
> In one of my test runs, I saw the following:
> {noformat}
> 2014-10-14 13:45:30,782 DEBUG 
> [StoreOpener-51af4bd23dc32a940ad2dd5435f00e1d-1] regionserver.HStore: loaded 
> hdfs://hor9n01.gq1.ygridcore.net:8020/apps/hbase/data/data/default/IntegrationTestIngest/51af4bd23dc32a940ad2dd5435f00e1d/test_cf/d6df5cfe15ca41d68c619489fbde4d04,
>  isReference=false, isBulkLoadResult=false, seqid=141197, majorCompaction=true
> 2014-10-14 13:45:30,788 DEBUG [RS_OPEN_REGION-hor9n01:60020-1] 
> regionserver.HRegion: Found 3 recovered edits file(s) under 
> hdfs://hor9n01.gq1.ygridcore.net:8020/apps/hbase/data/data/default/IntegrationTestIngest/51af4bd23dc32a940ad2dd5435f00e1d
> .............
> .............
> 2014-10-14 13:45:31,916 WARN  [RS_OPEN_REGION-hor9n01:60020-1] 
> regionserver.HRegion: Null or non-existent edits file: 
> hdfs://hor9n01.gq1.ygridcore.net:8020/apps/hbase/data/data/default/IntegrationTestIngest/51af4bd23dc32a940ad2dd5435f00e1d/recovered.edits/0000000000000198080
> {noformat}
> The above logs are from a regionserver, say RS2. From the initial analysis it 
> seemed like the master asked a certain regionserver to open the region (let's 
> say RS1) and for some reason asked it to close soon after. The open was still 
> proceeding on RS1 but the master reassigned the region to RS2. This also 
> started the recovery but it ended up seeing an inconsistent view of the 
> recovered-edits files (it reports missing files as per the logs above) since 
> the first regionserver (RS1) deleted some files after it completed the 
> recovery. When RS2 really opens the region, it might not see the recent data 
> that was written by flushes on hor9n10 during the recovery process. Reads of 
> that data would have inconsistencies.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
