[
https://issues.apache.org/jira/browse/HBASE-8940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707861#comment-13707861
]
stack commented on HBASE-8940:
------------------------------
Chatting w/ Jimmy Friday afternoon: there is a small window in which we have
published the region as available on regionserver X (it has been set in .META.
and the master has been told of the open via the znode transition) but the
region has not yet been added to the regionserver's online regions set. That
add is done after we move the region from OPENING to OPENED up in zk. If a
client comes in during that window, which is possible when all threads are
running in the one jvm as we have in unit tests, the client will get
NotServingRegionException (or RegionOpeningException, which is a subclass, in
this case). That is the 'truth' in that we have not yet put the region online.
A client will usually retry, but in this case, in merge, there is no retry
since it is the regionserver itself making a call on itself; there is no
client. Adding retries inside the regionserver seems wrong. The regionserver
knows its own state.
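To make the window concrete, here is a toy model of the ordering described
above. It is only a sketch, not HBase code: the class OpenOrderSketch, its two
sets, and its method names are all made up for illustration, and
IllegalStateException stands in for RegionOpeningException.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only; every name here is hypothetical and none of it is HBase code.
final class OpenOrderSketch {
    // Stands in for ".META. updated and znode moved to OPENED".
    private final Set<String> publishedOpened = new HashSet<>();
    // Stands in for the regionserver's online regions set.
    private final Set<String> onlineRegions = new HashSet<>();

    void publishOpened(String region) { publishedOpened.add(region); }

    void addToOnlineRegions(String region) { onlineRegions.add(region); }

    boolean isPublished(String region) { return publishedOpened.contains(region); }

    /** Fails for any region not yet in the online set, published or not. */
    String getRegion(String region) {
        if (!onlineRegions.contains(region)) {
            throw new IllegalStateException("Region is being opened: " + region);
        }
        return region;
    }

    public static void main(String[] args) {
        OpenOrderSketch rs = new OpenOrderSketch();
        String region = "3ffefd878a234031675de6b2c70b2ead";

        rs.publishOpened(region);       // step 1: the open is visible to others
        try {
            rs.getRegion(region);       // a caller arriving inside the window
        } catch (IllegalStateException e) {
            System.out.println("in the window: " + e.getMessage());
        }
        rs.addToOnlineRegions(region);  // step 2: region joins the online set
        System.out.println("after onlining: " + rs.getRegion(region));
    }
}
```

Between step 1 and step 2 the region is published but a lookup still fails,
which is the state the merge call runs into.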
Talking w/ Jimmy, changing the order so that we online the region in the
regionserver before we 'publish' via the znode could open us up to races where
a region is open in more than one place. It is safer to leave region onlining
as it is and instead just fix the tests, or better, HBaseTestingUtility; i.e.
do not start the merge until the regions are for sure online.
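A rough sketch of that kind of wait follows. It is not the patch and not
HBaseTestingUtility API: OnlineWait and waitFor are hypothetical names, and a
plain Set stands in for the regionserver's online regions.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BooleanSupplier;

// Hypothetical test helper; not the HBASE-8940 patch, not HBaseTestingUtility.
final class OnlineWait {
    private OnlineWait() {}

    /**
     * Polls the condition until it holds or the timeout elapses.
     * Returns true if the condition held, false on timeout or interrupt.
     */
    static boolean waitFor(BooleanSupplier condition, long timeoutMs, long intervalMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the regionserver's online regions set.
        Set<String> onlineRegions = ConcurrentHashMap.newKeySet();
        String region = "3ffefd878a234031675de6b2c70b2ead";

        // Simulates the open handler adding the region shortly after publishing.
        Thread opener = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            onlineRegions.add(region);
        });
        opener.start();

        // The test would only request the merge once this returns true.
        boolean online = waitFor(() -> onlineRegions.contains(region), 5_000, 10);
        opener.join();
        System.out.println("online before merge: " + online);
    }
}
```

The point of putting the wait in the test side is that the regionserver's
ordering stays untouched; only the caller that starts the merge is delayed.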
Let me put up a patch.
> TestRegionMergeTransactionOnCluster#testWholesomeMerge may fail due to race
> in opening region
> ---------------------------------------------------------------------------------------------
>
> Key: HBASE-8940
> URL: https://issues.apache.org/jira/browse/HBASE-8940
> Project: HBase
> Issue Type: Bug
> Reporter: Ted Yu
> Assignee: Ted Yu
> Attachments: 8940-v1.txt
>
>
> From
> http://54.241.6.143/job/HBase-TRUNK-Hadoop-2/org.apache.hbase$hbase-server/395/testReport/org.apache.hadoop.hbase.regionserver/TestRegionMergeTransactionOnCluster/testWholesomeMerge/
> :
> {code}
> 2013-07-11 09:33:44,154 INFO [AM.ZK.Worker-pool-2-thread-2]
> master.RegionStates(309): Offlined 3ffefd878a234031675de6b2c70b2ead from
> ip-10-174-118-204.us-west-1.compute.internal,60498,1373535184820
> 2013-07-11 09:33:44,154 INFO [AM.ZK.Worker-pool-2-thread-2]
> master.AssignmentManager$4(1223): The master has opened
> testWholesomeMerge,testRow0020,1373535210125.3ffefd878a234031675de6b2c70b2ead.
> that was online on
> ip-10-174-118-204.us-west-1.compute.internal,59210,1373535184884
> 2013-07-11 09:33:44,182 DEBUG [RS_OPEN_REGION-ip-10-174-118-204:59210-1]
> zookeeper.ZKAssign(862): regionserver:59210-0x13fcd13a20c0002 Successfully
> transitioned node 3ffefd878a234031675de6b2c70b2ead from RS_ZK_REGION_OPENING
> to RS_ZK_REGION_OPENED
> 2013-07-11 09:33:44,182 INFO
> [MASTER_TABLE_OPERATIONS-ip-10-174-118-204:39405-0]
> handler.DispatchMergingRegionHandler(154): Failed send MERGE REGIONS RPC to
> server ip-10-174-118-204.us-west-1.compute.internal,59210,1373535184884 for
> region
> testWholesomeMerge,,1373535210124.efcb10dcfa250e31bfd50dc6c7049f32.,testWholesomeMerge,testRow0020,1373535210125.3ffefd878a234031675de6b2c70b2ead.,
> focible=false, org.apache.hadoop.hbase.exceptions.RegionOpeningException:
> Region is being opened: 3ffefd878a234031675de6b2c70b2ead
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2566)
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3862)
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.mergeRegions(HRegionServer.java:3649)
> at
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:14400)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2124)
> at
> org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1831)
> 2013-07-11 09:33:44,182 DEBUG [RS_OPEN_REGION-ip-10-174-118-204:59210-1]
> handler.OpenRegionHandler(373): region transitioned to opened in zookeeper:
> {ENCODED => 3ffefd878a234031675de6b2c70b2ead, NAME =>
> 'testWholesomeMerge,testRow0020,1373535210125.3ffefd878a234031675de6b2c70b2ead.',
> STARTKEY => 'testRow0020', ENDKEY => 'testRow0040'}, server:
> ip-10-174-118-204.us-west-1.compute.internal,59210,1373535184884
> 2013-07-11 09:33:44,183 DEBUG [RS_OPEN_REGION-ip-10-174-118-204:59210-1]
> handler.OpenRegionHandler(186): Opened
> testWholesomeMerge,testRow0020,1373535210125.3ffefd878a234031675de6b2c70b2ead.
> on server:ip-10-174-118-204.us-west-1.compute.internal,59210,1373535184884
> {code}
> We can see that the MASTER_TABLE_OPERATIONS thread couldn't get region
> 3ffefd878a234031675de6b2c70b2ead because the RS_OPEN_REGION thread finished
> opening the region 1 millisecond later.
> One solution is to retry the operation when receiving RegionOpeningException.