[
https://issues.apache.org/jira/browse/HBASE-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983663#comment-14983663
]
Enis Soztutar commented on HBASE-6721:
--------------------------------------
Finally got around to testing the v15 patch on the 1.1 code base with a 7-node
cluster. Here are my test notes. Nothing too concerning, but we have to address
some of these in the patch. This is the configuration to add to enable groups:
{code}
<property>
<name>hbase.coprocessor.master.classes</name>
<value>org.apache.hadoop.hbase.group.GroupAdminEndpoint</value>
</property>
<property>
<name>hbase.master.loadbalancer.class</name>
<value>org.apache.hadoop.hbase.group.GroupBasedLoadBalancer</value>
</property>
{code}
1. Need to add this diff so that the new PB files get compiled when building
with -Pcompile-protobuf:
{code}
diff --git hbase-protocol/pom.xml hbase-protocol/pom.xml
index 8034576..d352373 100644
--- hbase-protocol/pom.xml
+++ hbase-protocol/pom.xml
@@ -180,6 +180,8 @@
<include>ErrorHandling.proto</include>
<include>Filter.proto</include>
<include>FS.proto</include>
+ <include>Group.proto</include>
+ <include>GroupAdmin.proto</include>
<include>HBase.proto</include>
<include>HFile.proto</include>
<include>LoadBalancer.proto</include>
{code}
2. NPE in group shell commands with nonexisting groups:
{code}
hbase(main):015:0* balance_group 'nonexisting'
ERROR: java.io.IOException
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2156)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hbase.group.GroupAdminServer.groupGetRegionsInTransition(GroupAdminServer.java:412)
at org.apache.hadoop.hbase.group.GroupAdminServer.balanceGroup(GroupAdminServer.java:348)
at org.apache.hadoop.hbase.group.GroupAdminEndpoint.balanceGroup(GroupAdminEndpoint.java:229)
at org.apache.hadoop.hbase.protobuf.generated.GroupAdminProtos$GroupAdminService.callMethod(GroupAdminProtos.java:11156)
at org.apache.hadoop.hbase.master.MasterRpcServices.execMasterService(MasterRpcServices.java:666)
at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:51121)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
{code}
and
{code}
hbase(main):030:0> get_group 'nonexisting'
GROUP INFORMATION
Servers:
ERROR: undefined method `getServers' for nil:NilClass
Here is some help for this command:
Get a region server group's information.
Example:
hbase> get_group 'default'
{code}
and
{code}
hbase(main):077:0* move_group_tables 'nonexisting'
ERROR: undefined method `each' for nil:NilClass
Here is some help for this command:
Reassign tables from one group to another.
hbase> move_group_tables 'dest',['table1','table2']
{code}
and
{code}
hbase(main):173:0* move_group_servers 'nonexisting'
ERROR: undefined method `each' for nil:NilClass
Here is some help for this command:
Reassign a region server from one group to another.
hbase> move_group_servers 'dest',['server1:port','server2:port']
{code}
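For the failures in item 2, one option is to check for a missing group on the server side and fail with a descriptive error before anything dereferences the result. The following is only a minimal sketch under assumptions: {{GroupLookupSketch}}, {{addGroup}} and {{getServersOrThrow}} are hypothetical names used for illustration and are not the classes or methods in the v15 patch.

```java
import java.io.IOException;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical stand-in for the group lookup; not the patch's actual classes. */
final class GroupLookupSketch {
    // Group name -> servers in that group (in-memory only, for illustration).
    private final Map<String, Set<String>> serversByGroup = new TreeMap<>();

    void addGroup(String name) {
        serversByGroup.putIfAbsent(name, new TreeSet<>());
    }

    /** Returns the servers of a group, or fails with a clear message instead of an NPE. */
    Set<String> getServersOrThrow(String groupName) throws IOException {
        Set<String> servers = serversByGroup.get(groupName);
        if (servers == null) {
            throw new IOException("Group does not exist: " + groupName);
        }
        return servers;
    }

    public static void main(String[] args) {
        GroupLookupSketch groups = new GroupLookupSketch();
        groups.addGroup("default");
        try {
            groups.getServersOrThrow("nonexisting");
        } catch (IOException e) {
            // The caller sees the group name in the error rather than a bare NPE.
            System.out.println("ERROR: " + e.getMessage());
        }
    }
}
```

The shell wrappers would still need their own nil checks, but a server-side error like this would at least give them a message to print.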
3. Group names should be restricted to alphanumeric characters only. This one is
pretty easy, but important. The following caused the master to abort, and the
master cannot restart after this point (without manually removing the rsgroup
entry from the table, which you cannot do without a master). I had to nuke HDFS
and ZK to start over.
{code}
hbase(main):033:0> add_group 'a-/:*'
ERROR: java.io.IOException: Failed to write to groupZNode
at org.apache.hadoop.hbase.group.GroupInfoManagerImpl.flushConfig(GroupInfoManagerImpl.java:419)
at org.apache.hadoop.hbase.group.GroupInfoManagerImpl.addGroup(GroupInfoManagerImpl.java:152)
at org.apache.hadoop.hbase.group.GroupAdminServer.addGroup(GroupAdminServer.java:298)
at org.apache.hadoop.hbase.group.GroupAdminEndpoint.addGroup(GroupAdminEndpoint.java:197)
at org.apache.hadoop.hbase.protobuf.generated.GroupAdminProtos$GroupAdminService.callMethod(GroupAdminProtos.java:11146)
at org.apache.hadoop.hbase.master.MasterRpcServices.execMasterService(MasterRpcServices.java:666)
at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:51121)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase-unsecure/groupInfo/a-/:*
at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.createNonSequential(RecoverableZooKeeper.java:575)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.create(RecoverableZooKeeper.java:554)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:1261)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:1250)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:1233)
at org.apache.hadoop.hbase.group.GroupInfoManagerImpl.flushConfig(GroupInfoManagerImpl.java:408)
{code}
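The name restriction in item 3 could be enforced before anything is written to the rsgroup table or to ZK. A minimal sketch, assuming "alphanumeric" means ASCII letters and digits with at least one character; {{GroupNameValidator}} and its methods are hypothetical names, not part of the patch.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the v15 patch. */
final class GroupNameValidator {
    // Assumption: only ASCII letters and digits are allowed, and the name is non-empty.
    private static final Pattern VALID_NAME = Pattern.compile("[a-zA-Z0-9]+");

    private GroupNameValidator() {}

    static boolean isValid(String name) {
        return name != null && VALID_NAME.matcher(name).matches();
    }

    /** Throws before any state is persisted if the name is not acceptable. */
    static void checkGroupName(String name) {
        if (!isValid(name)) {
            throw new IllegalArgumentException(
                "Group name must contain only alphanumeric characters: " + name);
        }
    }

    public static void main(String[] args) {
        checkGroupName("group2"); // accepted
        try {
            checkGroupName("a-/:*"); // the name that broke the master above
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Running the check at the start of the add-group path would keep a name like the one above from ever reaching the znode path.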
4. {{get_table_group}} and {{get_server_group}} shell commands do not work
{code}
hbase(main):019:0* get_table_group 'nonexisting'
ERROR: undefined local variable or method `s' for #<Hbase::GroupAdmin:0x64518270>
Here is some help for this command:
Get the group name the given table is a member of.
hbase> get_table_group 'myTable'
hbase(main):022:0* get_server_group 'server'
ERROR: undefined local variable or method `s' for #<Hbase::GroupAdmin:0x64518270>
Here is some help for this command:
Get the group name the given region server is a member of.
hbase> get_server_group 'server1:port1
{code}
5. The usage error for {{move_group_servers}} and {{move_group_tables}} reports
1 expected argument, although it should be 2:
{code}
hbase(main):033:0* move_group_servers
ERROR: wrong number of arguments (0 for 1)
Here is some help for this command:
Reassign a region server from one group to another.
hbase> move_group_servers 'dest',['server1:port','server2:port']
{code}
6. Adding a server without a port throws an error, but with no explanation (this
one is minor, not that important).
{code}
hbase(main):070:0> move_group_servers 'group2', ['os-enis-hbase-oct27-a-3.novalocal']
ERROR:
Here is some help for this command:
Reassign a region server from one group to another.
hbase> move_group_servers 'dest',['server1:port','server2:port']
{code}
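For item 6, the empty ERROR could be replaced by a message that names the expected format. A minimal sketch, assuming servers are always given as 'host:port' as in the help text; {{ServerAddressParser}} is a hypothetical name and this is not how the patch parses addresses.

```java
import java.net.InetSocketAddress;

/** Hypothetical parser for illustration; not the patch's address handling. */
final class ServerAddressParser {
    private ServerAddressParser() {}

    /** Parses 'host:port' and explains what is wrong when the input does not match. */
    static InetSocketAddress parse(String address) {
        if (address == null) {
            throw new IllegalArgumentException("Server address is missing; expected 'host:port'");
        }
        int colon = address.lastIndexOf(':');
        if (colon <= 0 || colon == address.length() - 1) {
            throw new IllegalArgumentException(
                "Invalid server address '" + address + "'; expected 'host:port'");
        }
        String host = address.substring(0, colon);
        int port;
        try {
            port = Integer.parseInt(address.substring(colon + 1));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                "Invalid port in server address '" + address + "'; expected 'host:port'", e);
        }
        if (port < 0 || port > 65535) {
            throw new IllegalArgumentException(
                "Port out of range in server address '" + address + "'");
        }
        // Unresolved on purpose: validating the format should not trigger a DNS lookup.
        return InetSocketAddress.createUnresolved(host, port);
    }

    public static void main(String[] args) {
        try {
            parse("os-enis-hbase-oct27-a-3.novalocal"); // no port, as in the report above
        } catch (IllegalArgumentException e) {
            System.out.println("ERROR: " + e.getMessage());
        }
    }
}
```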
7. From all of the above, it is clear that we need unit tests covering the new
shell commands.
Other than these, the feature is working as expected. Defining groups and moving
servers and tables all work. Regions get reassigned according to their groups.
Restarting the cluster keeps assignments, etc.
Some more findings:
Test 1:
Killed the last regionserver of a group, all 15 regions are in FAILED_OPEN
state.
- restarted the master, regions still in FAILED_OPEN state (which is expected)
- Added a new server to the group which had no remaining servers, regions
still in FAILED_OPEN state (this is probably due to how assignment works, we
give up after 10 retries and wait for manual assignment or master restart)
- Started the region server that was killed before, still in FAILED_OPEN
- Master restart reassigned these regions.
Test 2:
Tried to move all servers to a single group. The last server in the default
group is handled correctly: it is not allowed to move.
Test 3:
Killed the last server in the default group, while all system tables are in the
default group (and hence on that server).
-> hbase:meta was always in PENDING_OPEN on the bogus server localhost,1,1.
-> Upon restarting the killed server, meta and other tables in the default
group (including the rsgroup table) got reassigned.
As a side note, not having enough servers in the group that has the meta or
rsgroup table seems like a very good way of shooting yourself in the foot.
However, as discussed before, this may be needed for strong isolation.
- Adding a non-existing server to a group is not allowed.
- Checked JMX
- Adding group information for tables and regionserver to the master UI would
be helpful. We can leave this to a follow up.
- Obviously there should be a follow up to add at least some basic
documentation to the book on how to enable, configure and use RS groups.
> RegionServer Group based Assignment
> -----------------------------------
>
> Key: HBASE-6721
> URL: https://issues.apache.org/jira/browse/HBASE-6721
> Project: HBase
> Issue Type: New Feature
> Reporter: Francis Liu
> Assignee: Francis Liu
> Labels: hbase-6721
> Attachments: 6721-master-webUI.patch, HBASE-6721
> GroupBasedLoadBalancer Sequence Diagram.xml, HBASE-6721-DesigDoc.pdf,
> HBASE-6721-DesigDoc.pdf, HBASE-6721-DesigDoc.pdf, HBASE-6721-DesigDoc.pdf,
> HBASE-6721_0.98_2.patch, HBASE-6721_10.patch, HBASE-6721_11.patch,
> HBASE-6721_12.patch, HBASE-6721_13.patch, HBASE-6721_14.patch,
> HBASE-6721_15.patch, HBASE-6721_8.patch, HBASE-6721_9.patch,
> HBASE-6721_9.patch, HBASE-6721_94.patch, HBASE-6721_94.patch,
> HBASE-6721_94_2.patch, HBASE-6721_94_3.patch, HBASE-6721_94_3.patch,
> HBASE-6721_94_4.patch, HBASE-6721_94_5.patch, HBASE-6721_94_6.patch,
> HBASE-6721_94_7.patch, HBASE-6721_98_1.patch, HBASE-6721_98_2.patch,
> HBASE-6721_hbase-6721_addendum.patch, HBASE-6721_trunk.patch,
> HBASE-6721_trunk.patch, HBASE-6721_trunk.patch, HBASE-6721_trunk1.patch,
> HBASE-6721_trunk2.patch, balanceCluster Sequence Diagram.svg,
> hbase-6721-v15-branch-1.1.patch, immediateAssignments Sequence Diagram.svg,
> randomAssignment Sequence Diagram.svg, retainAssignment Sequence Diagram.svg,
> roundRobinAssignment Sequence Diagram.svg
>
>
> In multi-tenant deployments of HBase, it is likely that a RegionServer will
> be serving out regions from a number of different tables owned by various
> client applications. Being able to group a subset of running RegionServers
> and assign specific tables to it, provides a client application a level of
> isolation and resource allocation.
> The proposal essentially is to have an AssignmentManager which is aware of
> RegionServer groups and assigns tables to region servers based on groupings.
> Load balancing will occur on a per group basis as well.
> This is essentially a simplification of the approach taken in HBASE-4120. See
> attached document.