[
https://issues.apache.org/jira/browse/HBASE-17023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stephen Yuan Jiang updated HBASE-17023:
---------------------------------------
Attachment: HBASE-17023.v0-branch-1.1.patch
> Region left unassigned due to AM and SSH each thinking others would do the
> assignment work
> ------------------------------------------------------------------------------------------
>
> Key: HBASE-17023
> URL: https://issues.apache.org/jira/browse/HBASE-17023
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Affects Versions: 1.1.0
> Reporter: Stephen Yuan Jiang
> Assignee: Stephen Yuan Jiang
> Attachments: HBASE-17023.v0-branch-1.1.patch
>
>
> Another AssignmentManager (AM) and ServerShutdownHandler (SSH) issue. It is
> similar to HBASE-13330, except that this time the code path goes through
> ClosedRegionHandler, so we should apply the HBASE-13330 fix to
> ClosedRegionHandler as well. Basically, the AssignmentManager thinks the
> ServerShutdownHandler will assign the region, and the ServerShutdownHandler
> thinks the AssignmentManager will assign it. The region
> (23e0186c4d2b5cc09f25de35fe174417) ultimately never gets assigned. Below is
> an analysis from the logs that captures the flow of events.
> 1. The AssignmentManager had initially assigned this region to
> {{rs42.prod.foo.com,16020,1476293566365}}.
> 2. The {{rs42.prod.foo.com,16020,1476293566365}} stops and sends the CLOSE
> request to master.
> 3. ServerShutdownHandler (SSH) runs to assign this region to
> {{rs44.prod.foo.com,16020,1476294287692}}, but the assignment fails.
> 4. When the master restarted, it scanned meta to learn about the regions in
> the cluster. It found this region still assigned to {{rs42}} in the meta
> record.
> 5. However, {{rs42}} was no longer alive, so the AssignmentManager queued up
> a ServerShutdownHandling task for it (which executes asynchronously).
> 6. In the meantime, the AssignmentManager proceeded to read the RIT nodes
> from ZK. It found that this region was also in RS_ZK_REGION_FAILED_OPEN on
> {{rs44}}.
> 7. The region was moved to CLOSED state:
> {noformat}
> 2016-10-12 17:45:11,637 DEBUG [AM.ZK.Worker-pool2-t6]
> master.AssignmentManager: Handling RS_ZK_REGION_FAILED_OPEN,
> server=rs44.prod.foo.com,16020,1476294287692,
> region=23e0186c4d2b5cc09f25de35fe174417,
> current_state={23e0186c4d2b5cc09f25de35fe174417 state=PENDING_OPEN,
> ts=1476294311564, server=rs44.prod.foo.com,16020,1476294287692}
> 2016-10-12 17:45:11,637 INFO [AM.ZK.Worker-pool2-t6] master.RegionStates:
> Transition {23e0186c4d2b5cc09f25de35fe174417 state=PENDING_OPEN,
> ts=1476294311564, server=rs44.prod.foo.com,16020,1476294287692} to
> {23e0186c4d2b5cc09f25de35fe174417 state=CLOSED, ts=1476294311637,
> server=rs44.prod.foo.com,16020,1476294287692}
> 2016-10-12 17:45:11,637 WARN [AM.ZK.Worker-pool2-t6] master.RegionStates:
> 23e0186c4d2b5cc09f25de35fe174417 moved to CLOSED on
> rs44.prod.foo.com,16020,1476294287692, expected
> rs42.prod.foo.com,16020,1476293566365
> {noformat}
> 8. After that, the AssignmentManager tried to assign it again. However, the
> assignment didn't happen because the ServerShutdownHandling task queued
> earlier had not yet executed:
> {noformat}
> 2016-10-12 17:45:11,637 DEBUG [AM.ZK.Worker-pool2-t6]
> master.AssignmentManager: Found an existing plan for
> table1,3025965238305402_2,1468091325259.23e0186c4d2b5cc09f25de35fe174417.
> destination server is rs44.prod.foo.com,16020,1476294287692 accepted as a
> dest server = false
> 2016-10-12 17:45:11,697 DEBUG [AM.ZK.Worker-pool2-t6]
> master.AssignmentManager: No previous transition plan found (or ignoring an
> existing plan) for
> table1,3025965238305402_2,1468091325259.23e0186c4d2b5cc09f25de35fe174417.;
> generated random
> plan=hri=table1,3025965238305402_2,1468091325259.23e0186c4d2b5cc09f25de35fe174417.,
> src=, dest=rs28.prod.foo.com,16020,1476294291314; 10 (online=11) available
> servers, forceNewPlan=true
> 2016-10-12 17:45:11,697 DEBUG [AM.ZK.Worker-pool2-t6]
> handler.ClosedRegionHandler: Handling CLOSED event for
> 23e0186c4d2b5cc09f25de35fe174417
> 2016-10-12 17:45:11,697 WARN [AM.ZK.Worker-pool2-t6] master.RegionStates:
> 23e0186c4d2b5cc09f25de35fe174417 moved to CLOSED on
> rs44.prod.foo.com,16020,1476294287692, expected
> rs42.prod.foo.com,16020,1476293566365
> 2016-10-12 17:45:11,697 INFO [AM.ZK.Worker-pool2-t6]
> master.AssignmentManager: Skip assigning
> table1,3025965238305402_2,1468091325259.23e0186c4d2b5cc09f25de35fe174417.,
> it's host rs42.prod.foo.com,16020,1476293566365 is dead but not processed yet
> 2016-10-12 17:45:11,884 INFO [MASTER_SERVER_OPERATIONS-server01:16000-3]
> master.RegionStates: Transitioning {23e0186c4d2b5cc09f25de35fe174417
> state=CLOSED, ts=1476294311697, server=rs44.prod.foo.com,16020,1476294287692}
> will be handled by SSH for rs42.prod.foo.com,16020,1476293566365
> {noformat}
> 9. When the ServerShutdownHandling task reached this region, it also
> skipped it. This was because the region was in RIT, and the
> ServerShutdownHandling task assumed that the AssignmentManager would
> assign it as part of handling the RIT nodes:
> {noformat}
> 2016-10-12 17:45:11,892 INFO [MASTER_SERVER_OPERATIONS-server01:16000-3]
> handler.ServerShutdownHandler: Skip assigning region in transition on other
> server{23e0186c4d2b5cc09f25de35fe174417 state=CLOSED, ts=1476294311697,
> server=rs44.prod.foo.com,16020,1476294287692}
> {noformat}
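
The two skip decisions in steps 8 and 9 can be sketched as a toy model. This is only an illustration of the gap, not HBase code and not the attached patch: the class `AssignmentRaceSketch`, the `RegionView` type, and the methods `amAssigns`, `sshAssigns`, and `isOrphaned` are hypothetical names introduced here, and the real checks in AssignmentManager and ServerShutdownHandler involve more state than this.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the mutual skip described above. Hypothetical; not HBase code. */
final class AssignmentRaceSketch {

    /** Hypothetical stand-in for the master's view of a single region. */
    static final class RegionView {
        final String lastHost;       // server the master expects (rs42 in the logs)
        final boolean inTransition;  // true while the region has a RIT entry

        RegionView(String lastHost, boolean inTransition) {
            this.lastHost = lastHost;
            this.inTransition = inTransition;
        }
    }

    /** AM side (step 8): skip when the last host is dead but not yet processed. */
    static boolean amAssigns(RegionView region, Set<String> deadNotProcessed) {
        return !deadNotProcessed.contains(region.lastHost);
    }

    /** SSH side (step 9): skip regions that are already in transition. */
    static boolean sshAssigns(RegionView region) {
        return !region.inTransition;
    }

    /** The region is orphaned when both sides decline to assign it. */
    static boolean isOrphaned(RegionView region, Set<String> deadNotProcessed) {
        return !amAssigns(region, deadNotProcessed) && !sshAssigns(region);
    }

    public static void main(String[] args) {
        String rs42 = "rs42.prod.foo.com,16020,1476293566365";
        Set<String> deadNotProcessed = new HashSet<>();
        deadNotProcessed.add(rs42);

        // State from the logs: CLOSED and in RIT, with rs42 as the expected host.
        RegionView region = new RegionView(rs42, true);

        System.out.println("AM assigns:  " + amAssigns(region, deadNotProcessed));
        System.out.println("SSH assigns: " + sshAssigns(region));
        System.out.println("orphaned:    " + isOrphaned(region, deadNotProcessed));
    }
}
```

In this model each check is reasonable on its own. The region is stranded only when both conditions hold at the same time, which is the situation the logs show.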