[
https://issues.apache.org/jira/browse/HBASE-13605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Josh Elser resolved HBASE-13605.
--------------------------------
Resolution: Won't Fix
Re-stumbled onto this one. Let's just won't-fix and wait for the new greatness
coming imminently. Very unlikely someone else will come back and pick this up.
> RegionStates should not keep its list of dead servers
> -----------------------------------------------------
>
> Key: HBASE-13605
> URL: https://issues.apache.org/jira/browse/HBASE-13605
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Reporter: Enis Soztutar
> Assignee: Enis Soztutar
> Priority: Critical
> Fix For: 2.0.0, 1.5.0
>
> Attachments: hbase-13605_v1.patch, hbase-13605_v3-branch-1.1.patch,
> hbase-13605_v4-branch-1.1.patch, hbase-13605_v4-master.patch
>
>
> As mentioned in
> https://issues.apache.org/jira/browse/HBASE-9514?focusedCommentId=13769761&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13769761
> and HBASE-12844 we should have only 1 source of cluster membership.
> The list of dead servers and RegionStates doing its own liveness check
> (ServerManager.isServerReachable()) has caused an assignment problem again in
> a test cluster where the region states "thinks" that the server is dead and
> SSH will handle the region assignment. However the RS is not dead at all,
> living happily, and never gets zk expiry or YouAreDeadException or anything.
> This leaves the list of regions unassigned in OFFLINE state.
> Master assigning the region:
> {code}
> 15-04-20 09:02:25,780 DEBUG [AM.ZK.Worker-pool3-t330] master.RegionStates:
> Onlined 77dddcd50c22e56bfff133c0e1f9165b on
> os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268 {ENCODED =>
> 77dddcd50c
> {code}
> Master then disabled the table, and unassigned the region:
> {code}
> 2015-04-20 09:02:27,158 WARN [ProcedureExecutorThread-1]
> zookeeper.ZKTableStateManager: Moving table loadtest_d1 state from DISABLING
> to DISABLING
> Starting unassign of
> loadtest_d1,,1429520544378.77dddcd50c22e56bfff133c0e1f9165b. (offlining),
> current state: {77dddcd50c22e56bfff133c0e1f9165b state=OPEN,
> ts=1429520545780,
> server=os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268}
> bleProcedure$BulkDisabler-0] master.AssignmentManager: Sent CLOSE to
> os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268 for region
> loadtest_d1,,1429520544378.77dddcd50c22e56bfff133c0e1f9165b.
> 2015-04-20 09:02:27,414 INFO [AM.ZK.Worker-pool3-t316] master.RegionStates:
> Offlined 77dddcd50c22e56bfff133c0e1f9165b from
> os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268
> {code}
> On table re-enable, AM does not assign the region:
> {code}
> 2015-04-20 09:02:30,415 INFO [ProcedureExecutorThread-3]
> balancer.BaseLoadBalancer: Reassigned 25 regions. 25 retained the pre-restart
> assignment.
> 2015-04-20 09:02:30,415 INFO [ProcedureExecutorThread-3]
> procedure.EnableTableProcedure: Bulk assigning 25 region(s) across 5
> server(s), retainAssignment=true
> l,16000,1429515659726-GeneralBulkAssigner-4] master.RegionStates: Couldn't
> reach online server
> os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268
> l,16000,1429515659726-GeneralBulkAssigner-4] master.AssignmentManager:
> Updating the state to OFFLINE to allow to be reassigned by SSH
> nmentManager: Skip assigning
> loadtest_d1,,1429520544378.77dddcd50c22e56bfff133c0e1f9165b., it is on a dead
> but not processed yet server:
> os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268
> {code}
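
The failure mode in the quoted description can be sketched roughly as below. This is a minimal illustration, not HBase code: MembershipSketch, Membership, TrackerWithOwnDeadList, TrackerUsingMembership and isStuck are all hypothetical names invented for the example. It assumes one authority that knows which servers are alive, and contrasts a tracker that keeps its own dead list (fed by a failed reachability probe) with one that only consults that authority.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only: none of these types are HBase classes.
class MembershipSketch {

    enum Decision { ASSIGN, WAIT_FOR_SHUTDOWN_HANDLER }

    /** The single authority on which servers are alive (assumed to be driven by session expiry). */
    static class Membership {
        private final Set<String> online = new HashSet<>();
        void serverJoined(String server) { online.add(server); }
        void serverExpired(String server) { online.remove(server); }
        boolean isOnline(String server) { return online.contains(server); }
    }

    /** Tracker that keeps a private dead list fed by its own reachability probe. */
    static class TrackerWithOwnDeadList {
        private final Set<String> locallyDead = new HashSet<>();
        void probeFailed(String server) { locallyDead.add(server); }
        Decision decide(String lastServer) {
            // Defers to shutdown handling based on local opinion alone.
            return locallyDead.contains(lastServer)
                ? Decision.WAIT_FOR_SHUTDOWN_HANDLER
                : Decision.ASSIGN;
        }
    }

    /** Tracker that defers only when the membership authority says the server is gone. */
    static class TrackerUsingMembership {
        private final Membership membership;
        TrackerUsingMembership(Membership membership) { this.membership = membership; }
        Decision decide(String lastServer) {
            return membership.isOnline(lastServer)
                ? Decision.ASSIGN
                : Decision.WAIT_FOR_SHUTDOWN_HANDLER;
        }
    }

    /** A region is stuck if it waits for shutdown handling that will never be triggered. */
    static boolean isStuck(Decision decision, Membership membership, String server) {
        return decision == Decision.WAIT_FOR_SHUTDOWN_HANDLER && membership.isOnline(server);
    }

    public static void main(String[] args) {
        Membership membership = new Membership();
        membership.serverJoined("rs1");

        // One probe fails, but the server never actually expires.
        TrackerWithOwnDeadList own = new TrackerWithOwnDeadList();
        own.probeFailed("rs1");
        System.out.println("own dead list stuck: "
            + isStuck(own.decide("rs1"), membership, "rs1"));

        TrackerUsingMembership single = new TrackerUsingMembership(membership);
        System.out.println("single source stuck: "
            + isStuck(single.decide("rs1"), membership, "rs1"));
    }
}
```

In this sketch the tracker with its own dead list waits for a shutdown handler even though the authority still considers the server online, which mirrors the regions left in OFFLINE state above. The tracker that consults the single authority cannot reach that state, because it defers only when an expiry has actually been recorded.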