Enis Soztutar created HBASE-13605:
-------------------------------------
Summary: RegionStates should not keep its list of dead servers
Key: HBASE-13605
URL: https://issues.apache.org/jira/browse/HBASE-13605
Project: HBase
Issue Type: Bug
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Fix For: 2.0.0, 1.0.2, 1.1.1
As mentioned in
https://issues.apache.org/jira/browse/HBASE-9514?focusedCommentId=13769761&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13769761
and HBASE-12844, we should have only one source of cluster membership.
The list of dead servers that RegionStates keeps, together with its own liveness
check (ServerManager.isServerReachable()), has again caused an assignment problem
in a test cluster: RegionStates "thinks" that the server is dead and that SSH
(ServerShutdownHandler) will handle the region assignment. However, the RS is not
dead at all; it is living happily and never gets a zk expiry, a
YouAreDeadException, or anything else. This leaves the regions unassigned in
OFFLINE state.
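To make the failure mode concrete, here is a minimal, self-contained sketch of how two membership views can diverge. All class and method names below (Membership, TrackerWithOwnDeadList, TrackerSingleSource) are hypothetical and are not the actual HBase classes; the sketch only assumes that a failed reachability probe gets recorded in a private dead list, as described above.

```java
import java.util.HashSet;
import java.util.Set;

public class MembershipSketch {

    /** Hypothetical authoritative membership view (plays the ServerManager role). */
    static class Membership {
        private final Set<String> online = new HashSet<>();

        void register(String server) { online.add(server); }

        // Called only on a real expiry, e.g. a lost zk session.
        void expire(String server) { online.remove(server); }

        boolean isOnline(String server) { return online.contains(server); }
    }

    /** Hypothetical tracker that keeps its own dead list: a second source of truth. */
    static class TrackerWithOwnDeadList {
        private final Set<String> dead = new HashSet<>();

        // A single failed probe marks the server dead, even if it never expires.
        void onProbeFailed(String server) { dead.add(server); }

        boolean shouldAssignTo(String server) { return !dead.contains(server); }
    }

    /** Hypothetical tracker that defers to the single membership view. */
    static class TrackerSingleSource {
        private final Membership membership;

        TrackerSingleSource(Membership membership) { this.membership = membership; }

        boolean shouldAssignTo(String server) { return membership.isOnline(server); }
    }

    public static void main(String[] args) {
        String rs = "rs1.example.com,16020,1"; // made-up server name

        Membership membership = new Membership();
        membership.register(rs);

        TrackerWithOwnDeadList split = new TrackerWithOwnDeadList();
        split.onProbeFailed(rs); // transient failure; the server never expires

        TrackerSingleSource single = new TrackerSingleSource(membership);

        System.out.println("membership says online: " + membership.isOnline(rs));    // true
        System.out.println("own-dead-list tracker assigns: " + split.shouldAssignTo(rs));   // false
        System.out.println("single-source tracker assigns: " + single.shouldAssignTo(rs));  // true
    }
}
```

In the split-view case nothing ever removes the server from the private dead list, because the authoritative view never sees an expiry, so regions targeted at that server would wait indefinitely for a shutdown handler that is never triggered.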
Master assigning the region:
{code}
2015-04-20 09:02:25,780 DEBUG [AM.ZK.Worker-pool3-t330] master.RegionStates:
Onlined 77dddcd50c22e56bfff133c0e1f9165b on
os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268 {ENCODED =>
77dddcd50c
{code}
Master then disabled the table, and unassigned the region:
{code}
2015-04-20 09:02:27,158 WARN [ProcedureExecutorThread-1]
zookeeper.ZKTableStateManager: Moving table loadtest_d1 state from DISABLING to
DISABLING
Starting unassign of
loadtest_d1,,1429520544378.77dddcd50c22e56bfff133c0e1f9165b. (offlining),
current state: {77dddcd50c22e56bfff133c0e1f9165b state=OPEN, ts=1429520545780,
server=os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268}
bleProcedure$BulkDisabler-0] master.AssignmentManager: Sent CLOSE to
os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268 for region
loadtest_d1,,1429520544378.77dddcd50c22e56bfff133c0e1f9165b.
2015-04-20 09:02:27,414 INFO [AM.ZK.Worker-pool3-t316] master.RegionStates:
Offlined 77dddcd50c22e56bfff133c0e1f9165b from
os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268
{code}
On table re-enable, AM does not assign the region:
{code}
2015-04-20 09:02:30,415 INFO [ProcedureExecutorThread-3]
balancer.BaseLoadBalancer: Reassigned 25 regions. 25 retained the pre-restart
assignment.
2015-04-20 09:02:30,415 INFO [ProcedureExecutorThread-3]
procedure.EnableTableProcedure: Bulk assigning 25 region(s) across 5 server(s),
retainAssignment=true
l,16000,1429515659726-GeneralBulkAssigner-4] master.RegionStates: Couldn't
reach online server
os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268
l,16000,1429515659726-GeneralBulkAssigner-4] master.AssignmentManager: Updating
the state to OFFLINE to allow to be reassigned by SSH
nmentManager: Skip assigning
loadtest_d1,,1429520544378.77dddcd50c22e56bfff133c0e1f9165b., it is on a dead
but not processed yet server:
os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268
{code}