[ 
https://issues.apache.org/jira/browse/HBASE-8940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707861#comment-13707861
 ] 

stack commented on HBASE-8940:
------------------------------

Chatting w/ Jimmy Friday afternoon, there is a little hole where we have 
published the region as available on regionserver X (it has been set in .META. 
and the master has been informed the region is open via a znode transition) 
but the region may not yet have been added to the regionserver's online 
regions set (that is done after we move the region from OPENING to OPENED up 
in zk).  If a client comes in during that window, which is possible when all 
threads are running in the one jvm as we have in unit tests, the client will 
get NotServingRegionException (or RegionOpeningException, which is a subclass, 
in this case)... which is the 'truth' in that we have not yet put the region 
online.  A client will usually retry, but in this case, in merge, there is no 
retry since it is the regionserver itself making a call on itself; there is no 
client.  Adding retries inside the regionserver seems wrong.  The regionserver 
knows its own state.
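
To make the ordering concrete, here is a toy model of the window.  It is only a 
sketch: OpenWindowSketch, publish, addToOnline and getRegion are made-up names, 
not HBase classes or methods, and IllegalStateException stands in for 
NotServingRegionException.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of the open-region window; none of these names are HBase APIs. */
public class OpenWindowSketch {
    // Stands in for the OPENED znode / .META. entry that master and clients see.
    final Set<String> published = ConcurrentHashMap.newKeySet();
    // Stands in for the regionserver's online regions set.
    final Set<String> online = ConcurrentHashMap.newKeySet();

    /** Step 1 of the open: the OPENING to OPENED transition becomes visible. */
    void publish(String region) {
        published.add(region);
    }

    /** Step 2 of the open: only now can the regionserver serve the region. */
    void addToOnline(String region) {
        online.add(region);
    }

    /** Stands in for the lookup a merge request does on the regionserver. */
    String getRegion(String region) {
        if (!online.contains(region)) {
            // In HBase this would surface as NotServingRegionException or a subclass.
            throw new IllegalStateException("Region is being opened: " + region);
        }
        return region;
    }

    /** Returns true if a lookup made between step 1 and step 2 fails. */
    public static boolean lookupFailsInsideWindow(String region) {
        OpenWindowSketch rs = new OpenWindowSketch();
        rs.publish(region);
        try {
            rs.getRegion(region); // arrives after publish, before addToOnline
            return false;
        } catch (IllegalStateException expected) {
            return true;
        } finally {
            rs.addToOnline(region); // the open completes just after the lookup
        }
    }
}
```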

Talking w/ Jimmy, changing the order in the regionserver so that we online the 
region before we 'publish' via znode could open us up to races where a region 
could be open in more than one place.  So it is safer to leave things as they 
are as regards region onlining and instead just fix the tests, or better, 
HBaseTestingUtility; i.e. not start the merge until we are sure the regions 
are online.
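
Roughly the kind of wait I have in mind, as a generic sketch rather than the 
actual patch.  WaitForOnlineSketch and waitFor are hypothetical names, and the 
condition is whatever "is this region online on the regionserver" check the 
test can make.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper; not part of HBaseTestingUtility. */
public class WaitForOnlineSketch {
    /**
     * Polls the condition every intervalMs until it holds or timeoutMs passes.
     * Returns true if the condition held, false on timeout or interrupt.
     */
    public static boolean waitFor(long timeoutMs, long intervalMs,
                                  BooleanSupplier condition) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

A test would call this once per region before requesting the merge and fail 
fast if it returns false, so the merge is never sent while a region is still 
inside the window described above.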

Let me put up a patch.

                
> TestRegionMergeTransactionOnCluster#testWholesomeMerge may fail due to race 
> in opening region
> ---------------------------------------------------------------------------------------------
>
>                 Key: HBASE-8940
>                 URL: https://issues.apache.org/jira/browse/HBASE-8940
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>         Attachments: 8940-v1.txt
>
>
> From 
> http://54.241.6.143/job/HBase-TRUNK-Hadoop-2/org.apache.hbase$hbase-server/395/testReport/org.apache.hadoop.hbase.regionserver/TestRegionMergeTransactionOnCluster/testWholesomeMerge/
>  :
> {code}
> 2013-07-11 09:33:44,154 INFO  [AM.ZK.Worker-pool-2-thread-2] 
> master.RegionStates(309): Offlined 3ffefd878a234031675de6b2c70b2ead from 
> ip-10-174-118-204.us-west-1.compute.internal,60498,1373535184820
> 2013-07-11 09:33:44,154 INFO  [AM.ZK.Worker-pool-2-thread-2] 
> master.AssignmentManager$4(1223): The master has opened 
> testWholesomeMerge,testRow0020,1373535210125.3ffefd878a234031675de6b2c70b2ead.
>  that was online on 
> ip-10-174-118-204.us-west-1.compute.internal,59210,1373535184884
> 2013-07-11 09:33:44,182 DEBUG [RS_OPEN_REGION-ip-10-174-118-204:59210-1] 
> zookeeper.ZKAssign(862): regionserver:59210-0x13fcd13a20c0002 Successfully 
> transitioned node 3ffefd878a234031675de6b2c70b2ead from RS_ZK_REGION_OPENING 
> to RS_ZK_REGION_OPENED
> 2013-07-11 09:33:44,182 INFO  
> [MASTER_TABLE_OPERATIONS-ip-10-174-118-204:39405-0] 
> handler.DispatchMergingRegionHandler(154): Failed send MERGE REGIONS RPC to 
> server ip-10-174-118-204.us-west-1.compute.internal,59210,1373535184884 for 
> region 
> testWholesomeMerge,,1373535210124.efcb10dcfa250e31bfd50dc6c7049f32.,testWholesomeMerge,testRow0020,1373535210125.3ffefd878a234031675de6b2c70b2ead.,
>  focible=false, org.apache.hadoop.hbase.exceptions.RegionOpeningException: 
> Region is being opened: 3ffefd878a234031675de6b2c70b2ead
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2566)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3862)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.mergeRegions(HRegionServer.java:3649)
>       at 
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:14400)
>       at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2124)
>       at 
> org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1831)
> 2013-07-11 09:33:44,182 DEBUG [RS_OPEN_REGION-ip-10-174-118-204:59210-1] 
> handler.OpenRegionHandler(373): region transitioned to opened in zookeeper: 
> {ENCODED => 3ffefd878a234031675de6b2c70b2ead, NAME => 
> 'testWholesomeMerge,testRow0020,1373535210125.3ffefd878a234031675de6b2c70b2ead.',
>  STARTKEY => 'testRow0020', ENDKEY => 'testRow0040'}, server: 
> ip-10-174-118-204.us-west-1.compute.internal,59210,1373535184884
> 2013-07-11 09:33:44,183 DEBUG [RS_OPEN_REGION-ip-10-174-118-204:59210-1] 
> handler.OpenRegionHandler(186): Opened 
> testWholesomeMerge,testRow0020,1373535210125.3ffefd878a234031675de6b2c70b2ead.
>  on server:ip-10-174-118-204.us-west-1.compute.internal,59210,1373535184884
> {code}
> We can see that the MASTER_TABLE_OPERATIONS thread couldn't get region 
> 3ffefd878a234031675de6b2c70b2ead because the RS_OPEN_REGION thread finished 
> opening the region 1 millisecond later.
> One solution is to retry the operation when receiving RegionOpeningException.

