[jira] [Commented] (HBASE-17306) IntegrationTestRSGroup#testRegionMove may fail due to region server not online

Josh Elser (JIRA) Tue, 13 Dec 2016 12:48:18 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-17306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746222#comment-15746222
 ]


Josh Elser commented on HBASE-17306:
------------------------------------

bq. Shortly before the test failure, the server was shutdown:

Was this shutdown/restart due to ChaosMonkey? My worry is that your fix 
would just retry very quickly and fail 3 times, leaving us with the same 
problem. It looks like 5 minutes went by before the RS was restarted.

I'm not familiar enough with the RSGroups feature: are groups defined by 
hostname or the actual ServerName (hostname+port+timestamp)?
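
To make the distinction concrete, here is a minimal standalone sketch. It is not HBase's actual ServerName class; ServerIdentity and its methods are hypothetical names used only for illustration. It parses the "host,port,startcode" form that appears in the master log quoted below and shows that the instance before the restart and the one after it share an address but differ as full server names. Note that the exception message in the description prints only the host:port form.

```java
/**
 * Standalone illustration only; ServerIdentity is a hypothetical stand-in,
 * not HBase's org.apache.hadoop.hbase.ServerName.
 */
final class ServerIdentity {
    final String host;
    final int port;
    final long startCode;

    ServerIdentity(String host, int port, long startCode) {
        this.host = host;
        this.port = port;
        this.startCode = startCode;
    }

    /** Parses the "host,port,startcode" form seen in the master log. */
    static ServerIdentity parse(String s) {
        String[] parts = s.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected host,port,startcode: " + s);
        }
        return new ServerIdentity(parts[0], Integer.parseInt(parts[1]), Long.parseLong(parts[2]));
    }

    String hostAndPort() {
        return host + ":" + port;
    }

    /** True when both identities point at the same host and port. */
    boolean sameAddress(ServerIdentity other) {
        return host.equals(other.host) && port == other.port;
    }

    /** True only when the start code matches as well, i.e. the same process instance. */
    boolean sameInstance(ServerIdentity other) {
        return sameAddress(other) && startCode == other.startCode;
    }

    public static void main(String[] args) {
        // Both values are taken from the log excerpt in the issue description.
        ServerIdentity before =
            parse("ctr-e77-1481596162056-0240-01-000005.a.com,16020,1481606309159");
        ServerIdentity after =
            parse("ctr-e77-1481596162056-0240-01-000005.a.com,16020,1481606803303");

        System.out.println("same host:port -> " + before.sameAddress(after));
        System.out.println("same instance  -> " + before.sameInstance(after));
    }
}
```

If groups are keyed by host:port, the restarted server would still match its old entry; if they are keyed by the full name, it would not. Which of the two applies to RSGroups is exactly the open question.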

I would think it would be more reliable to stop CM (or whatever process is 
stopping RegionServers) before trying to restore the cluster back to "normal". 
Granted, we could still run into this in the normal case, but, if RSGroups 
requires the server to be online to change groups, I'm not coming up with a way 
to fix the test (as we would have to block until the server came back online 
for correctness).
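
As a rough sketch of what "block until the server came back online" could look like, compared with a few immediate retries: the helper below is a generic, self-contained polling loop with a deadline. OnlineWait, waitFor and the serverOnline condition are hypothetical names, and nothing here calls HBase; in the test, the condition would presumably be something like "the server appears in the master's list of online servers", and ChaosMonkey would be stopped first.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Hypothetical helper; a sketch of deadline-based polling, not HBase test code. */
final class OnlineWait {

    /**
     * Polls the condition every intervalMs until it holds or timeoutMs elapses.
     * Returns true if the condition held before the deadline, false otherwise.
     */
    static boolean waitFor(long timeoutMs, long intervalMs, BooleanSupplier condition)
            throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            long remainingMs = Math.max(1, TimeUnit.NANOSECONDS.toMillis(remainingNanos));
            Thread.sleep(Math.min(intervalMs, remainingMs));
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated server that only reports "online" from the 10th check onward.
        AtomicInteger checks = new AtomicInteger();
        BooleanSupplier serverOnline = () -> checks.incrementAndGet() >= 10;

        // Three back-to-back retries: all happen while the server is still down.
        boolean quickRetries = false;
        for (int i = 0; i < 3 && !quickRetries; i++) {
            quickRetries = serverOnline.getAsBoolean();
        }

        // Deadline-based wait: keeps checking until the server is back or time runs out.
        boolean boundedWait = waitFor(5_000, 10, serverOnline);

        System.out.println("3 immediate retries succeeded: " + quickRetries);
        System.out.println("bounded wait succeeded: " + boundedWait);
    }
}
```

The timeout would need to cover the observed gap (roughly 5 minutes here) for this to help, which is why stopping whatever is killing RegionServers first seems like the more reliable half of the fix.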

> IntegrationTestRSGroup#testRegionMove may fail due to region server not online
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-17306
>                 URL: https://issues.apache.org/jira/browse/HBASE-17306
>             Project: HBase
>          Issue Type: Test
>            Reporter: Ted Yu
>            Priority: Minor
>         Attachments: 17306.v1.txt
>
>
> {code}
> 2016-12-13 05:26:57,965|INFO|MainThread|machine.py:145 - run()|2) 
> testRegionMove(org.apache.hadoop.hbase.rsgroup.IntegrationTestRSGroup)
> 2016-12-13 05:26:57,965|INFO|MainThread|machine.py:145 - 
> run()|org.apache.hadoop.hbase.constraint.ConstraintException: 
> org.apache.hadoop.hbase.constraint.ConstraintException: 
> Server ctr-e77-1481596162056-0240-01-000005.a.com:16020 is not an online 
> server in default group.
> 2016-12-13 05:26:57,966|INFO|MainThread|machine.py:145 - run()|at 
> org.apache.hadoop.hbase.rsgroup.RSGroupAdminServer.moveServers(RSGroupAdminServer.java:135)
> 2016-12-13 05:26:57,966|INFO|MainThread|machine.py:145 - run()|at 
> org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint.moveServers(RSGroupAdminEndpoint.java:169)
> 2016-12-13 05:26:57,966|INFO|MainThread|machine.py:145 - run()|at 
> org.apache.hadoop.hbase.protobuf.generated.RSGroupAdminProtos$RSGroupAdminService.callMethod(RSGroupAdminProtos.java:11136)
> 2016-12-13 05:26:57,966|INFO|MainThread|machine.py:145 - run()|at 
> org.apache.hadoop.hbase.master.MasterRpcServices.execMasterService(MasterRpcServices.java:679)
> 2016-12-13 05:26:57,966|INFO|MainThread|machine.py:145 - run()|at 
> org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2
> {code}
> Shortly before the test failure, the server was shutdown:
> {code}
> 2016-12-13 05:21:25,428 INFO  
> [MASTER_SERVER_OPERATIONS-ctr-e77-1481596162056-0240-01-000008:20000-4] 
> handler.ServerShutdownHandler: Finished processing of shutdown of 
> ctr-e77-1481596162056-0240-01-000005.a.com,16020,1481606309159
> ...
> 2016-12-13 05:26:57,935 INFO  
> [RpcServer.FifoWFPBQ.priority.handler=19,queue=1,port=20000] 
> master.ServerManager: Registering 
> server=ctr-e77-1481596162056-0240-01-000005.hwx.site,16020,1481606803303
> 2016-12-13 05:27:06,219 DEBUG [main-EventThread] 
> zookeeper.RegionServerTracker: Added tracking of RS 
> /hbase-secure/rs/ctr-e77-1481596162056-0240-01-000005.a.com,16020,1481606803303
> {code}
> The registration of the new server (start code 1481606803303) happened shortly 
> after the test failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
