[jira] [Commented] (HBASE-6721) RegionServer Group based Assignment

Enis Soztutar (JIRA) Fri, 30 Oct 2015 17:17:44 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983663#comment-14983663
 ]


Enis Soztutar commented on HBASE-6721:
--------------------------------------

Finally got around to testing the v15 patch on the 1.1 code base with a 7-node 
cluster. Here are my test notes. Nothing too concerning, but we have to address 
some of these in the patch. This is the configuration to add to enable groups: 
{code}
    <property>
      <name>hbase.coprocessor.master.classes</name>
      <value>org.apache.hadoop.hbase.group.GroupAdminEndpoint</value>
    </property>
    <property>
      <name>hbase.master.loadbalancer.class</name>
      <value>org.apache.hadoop.hbase.group.GroupBasedLoadBalancer</value>
    </property>
{code}


1. Need to add this diff so that the new PB files get compiled when building 
with -Pcompile-protobuf: 
{code}
diff --git hbase-protocol/pom.xml hbase-protocol/pom.xml
index 8034576..d352373 100644
--- hbase-protocol/pom.xml
+++ hbase-protocol/pom.xml
@@ -180,6 +180,8 @@
                           <include>ErrorHandling.proto</include>
                           <include>Filter.proto</include>
                           <include>FS.proto</include>
+                          <include>Group.proto</include>
+                          <include>GroupAdmin.proto</include>
                           <include>HBase.proto</include>
                           <include>HFile.proto</include>
                           <include>LoadBalancer.proto</include>
{code}

2. NPE in group shell commands with nonexisting groups: 
{code}
hbase(main):015:0* balance_group 'nonexisting' 

ERROR: java.io.IOException
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2156)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
        at org.apache.hadoop.hbase.group.GroupAdminServer.groupGetRegionsInTransition(GroupAdminServer.java:412)
        at org.apache.hadoop.hbase.group.GroupAdminServer.balanceGroup(GroupAdminServer.java:348)
        at org.apache.hadoop.hbase.group.GroupAdminEndpoint.balanceGroup(GroupAdminEndpoint.java:229)
        at org.apache.hadoop.hbase.protobuf.generated.GroupAdminProtos$GroupAdminService.callMethod(GroupAdminProtos.java:11156)
        at org.apache.hadoop.hbase.master.MasterRpcServices.execMasterService(MasterRpcServices.java:666)
        at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:51121)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
{code}

and 

{code}
hbase(main):030:0> get_group 'nonexisting'
GROUP INFORMATION
Servers:

ERROR: undefined method `getServers' for nil:NilClass

Here is some help for this command:
Get a region server group's information.

Example:

  hbase> get_group 'default'
{code}

and 

{code}
hbase(main):077:0* move_group_tables 'nonexisting'

ERROR: undefined method `each' for nil:NilClass

Here is some help for this command:
Reassign tables from one group to another.

  hbase> move_group_tables 'dest',['table1','table2']
{code}

and 
{code}
hbase(main):173:0* move_group_servers 'nonexisting'

ERROR: undefined method `each' for nil:NilClass

Here is some help for this command:
Reassign a region server from one group to another.

  hbase> move_group_servers 'dest',['server1:port','server2:port']
{code}
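
For the nonexisting-group cases above, a guard along these lines on the lookup path would probably be enough to turn the NPE and the nil errors into a readable message. This is only a sketch: {{GroupRegistrySketch}} and its methods are made-up names for illustration, not classes from the patch.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stand-in for the group bookkeeping; not the patch's actual class. */
final class GroupRegistrySketch {
    // Group name -> servers in that group ("host:port" strings).
    private final Map<String, Set<String>> serversByGroup = new HashMap<>();

    void addGroup(String name) {
        serversByGroup.putIfAbsent(name, new TreeSet<>());
    }

    /** Returns the servers of an existing group, or fails with a descriptive error. */
    Set<String> getServers(String groupName) throws IOException {
        Set<String> servers = serversByGroup.get(groupName);
        if (servers == null) {
            // Fail early instead of letting a null travel into balance/move logic.
            throw new IOException("Group does not exist: " + groupName);
        }
        return servers;
    }

    public static void main(String[] args) {
        GroupRegistrySketch registry = new GroupRegistrySketch();
        registry.addGroup("default");
        try {
            registry.getServers("nonexisting");
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The shell wrappers would still need their own nil checks, but an error like this from the server side at least tells the user what went wrong.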

3. Group names should be restricted to alphanumeric characters only. This one is 
pretty easy, but important. The following caused the master to abort, and the 
master cannot restart after this point (without manually removing the rsgroup 
entry from the table, which you cannot do without a master). I had to nuke HDFS 
and ZK to start over. 
{code}
hbase(main):033:0> add_group 'a-/:*'

ERROR: java.io.IOException: Failed to write to groupZNode
        at org.apache.hadoop.hbase.group.GroupInfoManagerImpl.flushConfig(GroupInfoManagerImpl.java:419)
        at org.apache.hadoop.hbase.group.GroupInfoManagerImpl.addGroup(GroupInfoManagerImpl.java:152)
        at org.apache.hadoop.hbase.group.GroupAdminServer.addGroup(GroupAdminServer.java:298)
        at org.apache.hadoop.hbase.group.GroupAdminEndpoint.addGroup(GroupAdminEndpoint.java:197)
        at org.apache.hadoop.hbase.protobuf.generated.GroupAdminProtos$GroupAdminService.callMethod(GroupAdminProtos.java:11146)
        at org.apache.hadoop.hbase.master.MasterRpcServices.execMasterService(MasterRpcServices.java:666)
        at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:51121)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase-unsecure/groupInfo/a-/:*
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.createNonSequential(RecoverableZooKeeper.java:575)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.create(RecoverableZooKeeper.java:554)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:1261)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:1250)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:1233)
        at org.apache.hadoop.hbase.group.GroupInfoManagerImpl.flushConfig(GroupInfoManagerImpl.java:408)
{code}
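
Something like the check below, done before anything is persisted, would likely avoid this. The trace suggests the group name ends up as a znode name, so a '/' in it makes the path point under a parent that does not exist. This is a sketch under that assumption; {{GroupNameValidatorSketch}} and the exact rule (ASCII letters and digits only, non-empty) are my illustration, not what the patch does today.

```java
import java.util.regex.Pattern;

/** Hypothetical validator for group names; names here are illustrative only. */
final class GroupNameValidatorSketch {
    // One or more ASCII letters or digits; rejects '/', ':', '*', '-', and the empty string.
    private static final Pattern ALPHANUMERIC = Pattern.compile("[a-zA-Z0-9]+");

    static boolean isValid(String name) {
        return name != null && ALPHANUMERIC.matcher(name).matches();
    }

    /** Throws before any state is written to the rsgroup table or ZooKeeper. */
    static void checkGroupName(String name) {
        if (!isValid(name)) {
            throw new IllegalArgumentException(
                "Group name must be non-empty and alphanumeric: " + name);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("group2"));
        System.out.println(isValid("a-/:*"));
    }
}
```

The important part is the ordering: the name has to be rejected before the first write, otherwise a bad entry can still land in the rsgroup table even if the ZK write fails.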
 
4. {{get_table_group}} and {{get_server_group}} shell commands do not work: 
{code}
hbase(main):019:0* get_table_group 'nonexisting'

ERROR: undefined local variable or method `s' for #<Hbase::GroupAdmin:0x64518270>

Here is some help for this command:
Get the group name the given table is a member of.

  hbase> get_table_group 'myTable'

 
hbase(main):022:0* get_server_group 'server'

ERROR: undefined local variable or method `s' for #<Hbase::GroupAdmin:0x64518270>

Here is some help for this command:
Get the group name the given region server is a member of.

  hbase> get_server_group 'server1:port1
{code}

5. {{move_group_servers}} and {{move_group_tables}} report the expected number 
of arguments as 1, although it should be 2: 
{code}
hbase(main):033:0* move_group_servers 

ERROR: wrong number of arguments (0 for 1)

Here is some help for this command:
Reassign a region server from one group to another.

  hbase> move_group_servers 'dest',['server1:port','server2:port']
{code}

6. Adding a server without a port throws an error, but with no explanation 
(this one is minor, not that important). 
{code}
hbase(main):070:0> move_group_servers 'group2', ['os-enis-hbase-oct27-a-3.novalocal']

ERROR: 

Here is some help for this command:
Reassign a region server from one group to another.

  hbase> move_group_servers 'dest',['server1:port','server2:port']
{code}
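
For the missing-port case, a small parsing helper with an explicit message would make the error self-explanatory. The sketch below assumes servers are passed as 'host:port' strings, as the shell help shows; {{ServerAddressSketch}} and {{parseHostAndPort}} are hypothetical names, not part of the patch.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

/** Hypothetical helper that parses 'host:port' and explains what is wrong otherwise. */
final class ServerAddressSketch {
    static Map.Entry<String, Integer> parseHostAndPort(String address) {
        if (address == null) {
            throw new IllegalArgumentException("Server address must not be null");
        }
        int colon = address.lastIndexOf(':');
        // No colon, empty host, or empty port: tell the user the expected format.
        if (colon <= 0 || colon == address.length() - 1) {
            throw new IllegalArgumentException(
                "Expected 'host:port' but got '" + address + "'");
        }
        String host = address.substring(0, colon);
        int port;
        try {
            port = Integer.parseInt(address.substring(colon + 1));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Invalid port in '" + address + "'", e);
        }
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("Port out of range in '" + address + "'");
        }
        return new SimpleImmutableEntry<>(host, port);
    }

    public static void main(String[] args) {
        try {
            parseHostAndPort("os-enis-hbase-oct27-a-3.novalocal");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```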

7. From all of the above, it is clear that we need unit tests covering the new 
shell commands. 

Other than these, the feature is working as expected. Defining groups and 
moving servers and tables all work. Regions get reassigned according to their 
groups. Restarting the cluster keeps assignments, etc. 

Some more findings: 
Test 1: 
Killed the last regionserver of a group; all 15 regions are in FAILED_OPEN 
state. 
 - Restarted the master, regions still in FAILED_OPEN state (which is expected).
 - Added a new server to the group which had no remaining servers, regions 
still in FAILED_OPEN state (this is probably due to how assignment works: we 
give up after 10 retries and wait for manual assignment or a master restart).
 - Started the region server that was killed before, still in FAILED_OPEN.
 - A master restart reassigned these regions. 

Test 2: 
Tried to move all servers to a single group. Correctly handles last server in 
the default group by not allowing it to change. 

Test 3: 
Killed the last server in the default group, while all system tables are in the 
default group (and hence on that server). 
 -> hbase:meta was always in PENDING_OPEN on the bogus server localhost,1,1. 
 -> Upon restarting the killed server, meta and other tables in the default 
group (including the rsgroup table) got reassigned. 
 As a side note, not having enough servers in the group that hosts the meta or 
rsgroup table seems like a very good way of shooting yourself in the foot. 
However, as discussed before, this may be needed for strong isolation. 


- Adding a non-existing server to a group is not allowed. 
- Checked JMX
- Adding group information for tables and regionservers to the master UI would 
be helpful. We can leave this to a follow-up. 
- Obviously there should be a follow-up to add at least some basic 
documentation to the book on how to enable, configure, and use RS groups. 






> RegionServer Group based Assignment
> -----------------------------------
>
>                 Key: HBASE-6721
>                 URL: https://issues.apache.org/jira/browse/HBASE-6721
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Francis Liu
>            Assignee: Francis Liu
>              Labels: hbase-6721
>         Attachments: 6721-master-webUI.patch, HBASE-6721 
> GroupBasedLoadBalancer Sequence Diagram.xml, HBASE-6721-DesigDoc.pdf, 
> HBASE-6721-DesigDoc.pdf, HBASE-6721-DesigDoc.pdf, HBASE-6721-DesigDoc.pdf, 
> HBASE-6721_0.98_2.patch, HBASE-6721_10.patch, HBASE-6721_11.patch, 
> HBASE-6721_12.patch, HBASE-6721_13.patch, HBASE-6721_14.patch, 
> HBASE-6721_15.patch, HBASE-6721_8.patch, HBASE-6721_9.patch, 
> HBASE-6721_9.patch, HBASE-6721_94.patch, HBASE-6721_94.patch, 
> HBASE-6721_94_2.patch, HBASE-6721_94_3.patch, HBASE-6721_94_3.patch, 
> HBASE-6721_94_4.patch, HBASE-6721_94_5.patch, HBASE-6721_94_6.patch, 
> HBASE-6721_94_7.patch, HBASE-6721_98_1.patch, HBASE-6721_98_2.patch, 
> HBASE-6721_hbase-6721_addendum.patch, HBASE-6721_trunk.patch, 
> HBASE-6721_trunk.patch, HBASE-6721_trunk.patch, HBASE-6721_trunk1.patch, 
> HBASE-6721_trunk2.patch, balanceCluster Sequence Diagram.svg, 
> hbase-6721-v15-branch-1.1.patch, immediateAssignments Sequence Diagram.svg, 
> randomAssignment Sequence Diagram.svg, retainAssignment Sequence Diagram.svg, 
> roundRobinAssignment Sequence Diagram.svg
>
>
> In multi-tenant deployments of HBase, it is likely that a RegionServer will 
> be serving out regions from a number of different tables owned by various 
> client applications. Being able to group a subset of running RegionServers 
> and assign specific tables to it, provides a client application a level of 
> isolation and resource allocation.
> The proposal essentially is to have an AssignmentManager which is aware of 
> RegionServer groups and assigns tables to region servers based on groupings. 
> Load balancing will occur on a per group basis as well. 
> This is essentially a simplification of the approach taken in HBASE-4120. See 
> attached document.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
