[jira] [Comment Edited] (HBASE-20632) Failure of RSes belonging to RSgroup for System tables makes the cluster unavailable

Andrew Purtell (JIRA) Wed, 23 May 2018 18:54:51 -0700

    [ https://issues.apache.org/jira/browse/HBASE-20632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16488297#comment-16488297 ]


Andrew Purtell edited comment on HBASE-20632 at 5/24/18 1:53 AM:
-----------------------------------------------------------------

bq. It would be better if system tables are not restricted to a rsgroup so that 
they can be serviced by any available region server in the cluster making it 
more available.

No, that breaks the intent of the design. -1 to this, please, or at least make 
it configurable, and off by default.

But if all servers in the system RSgroup go down, and then one or more come 
back, and still the system tables are not redeployed, that is a bug.
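
To make the two behaviors above concrete, here is a minimal sketch of the placement rule being discussed. It is not HBase code: the class name, method name, and the allowFallback switch are all hypothetical and exist only for illustration. It assumes servers are identified by plain host:port strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration only; this is not an HBase class. */
final class SystemTablePlacement {
    private SystemTablePlacement() {}

    /**
     * Returns the servers eligible to host a system table region.
     *
     * @param groupServers  servers assigned to the system tables' rsgroup
     * @param onlineServers servers currently alive in the cluster
     * @param allowFallback hypothetical opt-in switch; false models "off by default"
     */
    static List<String> candidates(Set<String> groupServers,
                                   Set<String> onlineServers,
                                   boolean allowFallback) {
        // Prefer live servers inside the rsgroup; TreeSet keeps the result deterministic.
        Set<String> inGroup = new TreeSet<>(groupServers);
        inGroup.retainAll(onlineServers);
        if (!inGroup.isEmpty() || !allowFallback) {
            // An empty list means: leave the regions unassigned and retry
            // as soon as a member of the rsgroup comes back online.
            return new ArrayList<>(inGroup);
        }
        // Opt-in only: the whole group is down, so any live server may be used.
        return new ArrayList<>(new TreeSet<>(onlineServers));
    }
}
```

With the switch off, a group whose servers are all down yields an empty candidate list, and the same call yields the group's server again once it is back online; that second case is the redeployment that is expected to happen.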



was (Author: apurtell):
bq. It would be better if system tables are not restricted to a rsgroup so that 
they can be serviced by any available region server in the cluster making it 
more available.

No, that breaks the intent of the design. -1 to this, please

But if all servers in the system RSgroup go down, and then one or more come 
back, and still the system tables are not redeployed, that is a bug.


> Failure of RSes belonging to RSgroup for System tables makes the cluster 
> unavailable
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-20632
>                 URL: https://issues.apache.org/jira/browse/HBASE-20632
>             Project: HBase
>          Issue Type: Bug
>          Components: master, regionserver
>    Affects Versions: 3.0.0
>            Reporter: Biju Nair
>            Assignee: Ted Yu
>            Priority: Critical
>         Attachments: 20632.v1.txt
>
>
> This was reproduced on a local (non-HDFS) cluster with the following steps:
>  * Start a single node cluster and start an additional RS using 
> {{local-regionservers.sh}}
>  * Through hbase shell add a new rs group
>  * 
> {noformat}
> hbase(main):001:0> add_rsgroup 'test_rsgroup'
> Took 0.5503 seconds
> hbase(main):002:0> list_rsgroups
> NAME            SERVER / TABLE
> test_rsgroup
> default         server dob2-r3n13:16020
>                 server dob2-r3n13:16022
>                 table hbase:meta
>                 table hbase:acl
>                 table hbase:quota
>                 table hbase:namespace
>                 table hbase:rsgroup
> 2 row(s)
> Took 0.0419 seconds{noformat}
>  * Move one of the region servers to the new {{rsgroup}}
>  * 
> {noformat}
> hbase(main):004:0> move_servers_rsgroup 'test_rsgroup',['dob2-r3n13:16020']
> Took 6.4894 seconds
> hbase(main):005:0> exit{noformat}
>  * Stop the regionserver which is left in the {{default}} rsgroup
>  * 
> {noformat}
> local-regionservers.sh stop 2{noformat}
> The cluster becomes unusable, even if the region server is restarted or 
> all the services are brought down and back up.
> In version {{1.1.x}}, the cluster recovers fine. It looks like {{meta}} is 
> assigned to a {{dummy}} regionserver and, when the regionserver is restarted, 
> it gets assigned. The following is what the {{master}} UI shows while the 
> {{rs}} is down:
> {noformat}
> 1588230740    hbase:meta,,1.1588230740 state=PENDING_OPEN, ts=Wed May 23 18:24:01 EDT 2018 (1s ago), server=localhost,1,1{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
