[
https://issues.apache.org/jira/browse/HBASE-17306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746222#comment-15746222
]
Josh Elser commented on HBASE-17306:
------------------------------------
bq. Shortly before the test failure, the server was shut down:
Was this shutdown/restart caused by ChaosMonkey? My worry is that your fix
would just retry very quickly and fail 3 times, leaving us with the same
problem. It looks like roughly 5 minutes went by before the RS was restarted.
I'm not familiar enough with the RSGroups feature: are groups defined by
hostname or the actual ServerName (hostname+port+timestamp)?
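To make the distinction in that question concrete, here is a minimal sketch of why it matters. `ServerId` is a hypothetical stand-in written for illustration, not HBase's own ServerName class; it only assumes the "hostname,port,startcode" form that appears in the master log quoted below. A restarted RegionServer keeps its hostname and port but gets a new start code, so the two identities compare as different even though they name the same host.

```java
import java.util.Objects;

// Hypothetical illustration only; this is NOT HBase's ServerName class.
final class ServerId {
    final String hostname;
    final int port;
    final long startCode;

    ServerId(String hostname, int port, long startCode) {
        this.hostname = hostname;
        this.port = port;
        this.startCode = startCode;
    }

    // Parses the "hostname,port,startcode" form seen in the master log.
    static ServerId parse(String s) {
        String[] parts = s.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected host,port,startcode: " + s);
        }
        return new ServerId(parts[0].trim(),
                Integer.parseInt(parts[1].trim()),
                Long.parseLong(parts[2].trim()));
    }

    String hostAndPort() {
        return hostname + ":" + port;
    }

    // True when both identities refer to the same host and port,
    // regardless of which process incarnation (start code) they name.
    boolean sameHostAndPort(ServerId other) {
        return hostname.equals(other.hostname) && port == other.port;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ServerId)) {
            return false;
        }
        ServerId other = (ServerId) o;
        return sameHostAndPort(other) && startCode == other.startCode;
    }

    @Override
    public int hashCode() {
        return Objects.hash(hostname, port, startCode);
    }
}
```

With the two entries from the log, the pre-restart and post-restart identities are unequal under full comparison but match on host and port, which is why it matters which of the two a group membership check uses.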
I would think it would be more reliable to stop ChaosMonkey (or whatever
process is stopping RegionServers) before trying to restore the cluster back to
"normal". Granted, we could still run into this in the normal case, but if
RSGroups requires the server to be online to change groups, I can't come up
with a way to fix the test, since for correctness we would have to block until
the server came back online.
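For what "block until the server came back online" could look like, here is a rough sketch of a bounded polling wait. Everything in it is an assumption for illustration: `WaitFor`, its parameters, and the idea of passing the "is the server online" check as a callback are hypothetical and not taken from the test or from HBase's API. It is a bounded wait rather than a fixed number of quick retries, which is the difference that matters when the restart takes minutes.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of HBase: polls a condition until it holds
// or the timeout expires.
final class WaitFor {
    private WaitFor() {
    }

    // Returns true if the condition became true before the deadline,
    // false on timeout or if the waiting thread was interrupted.
    static boolean waitFor(long timeoutMs, long intervalMs, BooleanSupplier condition) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMs <= 0) {
                return false;
            }
            try {
                // Never sleep past the deadline.
                Thread.sleep(Math.min(intervalMs, Math.max(1L, remainingMs)));
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and give up.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

In a test, the condition would be whatever check reports the server as online again, and the timeout would need to be longer than the observed restart gap; the process stopping RegionServers would be halted first so the condition cannot flip back after the wait returns.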
> IntegrationTestRSGroup#testRegionMove may fail due to region server not online
> ------------------------------------------------------------------------------
>
> Key: HBASE-17306
> URL: https://issues.apache.org/jira/browse/HBASE-17306
> Project: HBase
> Issue Type: Test
> Reporter: Ted Yu
> Priority: Minor
> Attachments: 17306.v1.txt
>
>
> {code}
> 2016-12-13 05:26:57,965|INFO|MainThread|machine.py:145 - run()|2)
> testRegionMove(org.apache.hadoop.hbase.rsgroup.IntegrationTestRSGroup)
> 2016-12-13 05:26:57,965|INFO|MainThread|machine.py:145 -
> run()|org.apache.hadoop.hbase.constraint.ConstraintException:
> org.apache.hadoop.hbase.constraint.ConstraintException:
> Server ctr-e77-1481596162056-0240-01-000005.a.com:16020 is not an online
> server in default group.
> 2016-12-13 05:26:57,966|INFO|MainThread|machine.py:145 - run()|at
> org.apache.hadoop.hbase.rsgroup.RSGroupAdminServer.moveServers(RSGroupAdminServer.java:135)
> 2016-12-13 05:26:57,966|INFO|MainThread|machine.py:145 - run()|at
> org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint.moveServers(RSGroupAdminEndpoint.java:169)
> 2016-12-13 05:26:57,966|INFO|MainThread|machine.py:145 - run()|at
> org.apache.hadoop.hbase.protobuf.generated.RSGroupAdminProtos$RSGroupAdminService.callMethod(RSGroupAdminProtos.java:11136)
> 2016-12-13 05:26:57,966|INFO|MainThread|machine.py:145 - run()|at
> org.apache.hadoop.hbase.master.MasterRpcServices.execMasterService(MasterRpcServices.java:679)
> 2016-12-13 05:26:57,966|INFO|MainThread|machine.py:145 - run()|at
> org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2
> {code}
> Shortly before the test failure, the server was shut down:
> {code}
> 2016-12-13 05:21:25,428 INFO
> [MASTER_SERVER_OPERATIONS-ctr-e77-1481596162056-0240-01-000008:20000-4]
> handler.ServerShutdownHandler: Finished processing of shutdown of
> ctr-e77-1481596162056-0240-01-000005.a.com,16020,1481606309159
> ...
> 2016-12-13 05:26:57,935 INFO
> [RpcServer.FifoWFPBQ.priority.handler=19,queue=1,port=20000]
> master.ServerManager: Registering
> server=ctr-e77-1481596162056-0240-01-000005.hwx.site,16020,1481606803303
> 2016-12-13 05:27:06,219 DEBUG [main-EventThread]
> zookeeper.RegionServerTracker: Added tracking of RS
> /hbase-secure/rs/ctr-e77-1481596162056-0240-01-000005.a.com,16020,1481606803303
> {code}
> The registration of the new server (start code 1481606803303) happened shortly
> after the test failure.