[ 
https://issues.apache.org/jira/browse/HBASE-17023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen Yuan Jiang updated HBASE-17023:
---------------------------------------
    Attachment: HBASE-17023.v0-branch-1.1.patch

> Region left unassigned due to AM and SSH each thinking others would do the 
> assignment work
> ------------------------------------------------------------------------------------------
>
>                 Key: HBASE-17023
>                 URL: https://issues.apache.org/jira/browse/HBASE-17023
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 1.1.0
>            Reporter: Stephen Yuan Jiang
>            Assignee: Stephen Yuan Jiang
>         Attachments: HBASE-17023.v0-branch-1.1.patch
>
>
> Another AssignmentManager (AM) and ServerShutdownHandler (SSH) issue. It is 
> similar to HBASE-13330, except that this time the code path goes through 
> ClosedRegionHandler, so the fix from HBASE-13330 should be applied to 
> ClosedRegionHandler as well.
> Basically, the AssignmentManager thinks the ServerShutdownHandler would 
> assign the region and the ServerShutdownHandler thinks that the 
> AssignmentManager would assign the region. The region 
> (23e0186c4d2b5cc09f25de35fe174417) ultimately never gets assigned. Below is 
> an analysis from the logs that captures the flow of events.
> 1. The AssignmentManager had initially assigned this region to 
> {{rs42.prod.foo.com,16020,1476293566365}}.
> 2. The {{rs42.prod.foo.com,16020,1476293566365}} stops and sends the CLOSE 
> request to master.
> 3. The ServerShutdownHandler (SSH) runs and tries to assign this region to 
> {{rs44.prod.foo.com,16020,1476294287692}}, but the assignment fails.
> 4. When the master restarted, it scanned meta to learn about the regions in 
> the cluster. According to the meta record, this region was still assigned to 
> {{rs42}}.
> 5. However, {{rs42}} was no longer alive, so the AssignmentManager queued a 
> ServerShutdownHandling task for it (which executes asynchronously).
> 6. In the meantime, the AssignmentManager proceeded to read the RIT nodes 
> from ZK. It found that this region was also in RS_ZK_REGION_FAILED_OPEN 
> state on {{rs44}}.
> 7. The region was moved to CLOSED state:
> {noformat}
> 2016-10-12 17:45:11,637 DEBUG [AM.ZK.Worker-pool2-t6] 
> master.AssignmentManager: Handling RS_ZK_REGION_FAILED_OPEN, 
> server=rs44.prod.foo.com,16020,1476294287692, 
> region=23e0186c4d2b5cc09f25de35fe174417, 
> current_state={23e0186c4d2b5cc09f25de35fe174417 state=PENDING_OPEN, 
> ts=1476294311564, server=rs44.prod.foo.com,16020,1476294287692}
> 2016-10-12 17:45:11,637 INFO  [AM.ZK.Worker-pool2-t6] master.RegionStates: 
> Transition {23e0186c4d2b5cc09f25de35fe174417 state=PENDING_OPEN, 
> ts=1476294311564, server=rs44.prod.foo.com,16020,1476294287692} to 
> {23e0186c4d2b5cc09f25de35fe174417 state=CLOSED, ts=1476294311637, 
> server=rs44.prod.foo.com,16020,1476294287692}
> 2016-10-12 17:45:11,637 WARN  [AM.ZK.Worker-pool2-t6] master.RegionStates: 
> 23e0186c4d2b5cc09f25de35fe174417 moved to CLOSED on 
> rs44.prod.foo.com,16020,1476294287692, expected 
> rs42.prod.foo.com,16020,1476293566365
> {noformat}
> 8. After that, the AssignmentManager tried to assign the region again. 
> However, the assignment did not happen because the ServerShutdownHandling 
> task queued earlier had not yet executed:
> {noformat}
> 2016-10-12 17:45:11,637 DEBUG [AM.ZK.Worker-pool2-t6] 
> master.AssignmentManager: Found an existing plan for 
> table1,3025965238305402_2,1468091325259.23e0186c4d2b5cc09f25de35fe174417. 
> destination server is rs44.prod.foo.com,16020,1476294287692 accepted as a 
> dest server = false
> 2016-10-12 17:45:11,697 DEBUG [AM.ZK.Worker-pool2-t6] 
> master.AssignmentManager: No previous transition plan found (or ignoring an 
> existing plan) for 
> table1,3025965238305402_2,1468091325259.23e0186c4d2b5cc09f25de35fe174417.; 
> generated random 
> plan=hri=table1,3025965238305402_2,1468091325259.23e0186c4d2b5cc09f25de35fe174417.,
>  src=, dest=rs28.prod.foo.com,16020,1476294291314; 10 (online=11) available 
> servers, forceNewPlan=true
> 2016-10-12 17:45:11,697 DEBUG [AM.ZK.Worker-pool2-t6] 
> handler.ClosedRegionHandler: Handling CLOSED event for 
> 23e0186c4d2b5cc09f25de35fe174417
> 2016-10-12 17:45:11,697 WARN  [AM.ZK.Worker-pool2-t6] master.RegionStates: 
> 23e0186c4d2b5cc09f25de35fe174417 moved to CLOSED on 
> rs44.prod.foo.com,16020,1476294287692, expected 
> rs42.prod.foo.com,16020,1476293566365
> 2016-10-12 17:45:11,697 INFO  [AM.ZK.Worker-pool2-t6] 
> master.AssignmentManager: Skip assigning 
> table1,3025965238305402_2,1468091325259.23e0186c4d2b5cc09f25de35fe174417., 
> it's host rs42.prod.foo.com,16020,1476293566365 is dead but not processed yet
> 2016-10-12 17:45:11,884 INFO  [MASTER_SERVER_OPERATIONS-server01:16000-3] 
> master.RegionStates: Transitioning {23e0186c4d2b5cc09f25de35fe174417 
> state=CLOSED, ts=1476294311697, server=rs44.prod.foo.com,16020,1476294287692} 
> will be handled by SSH for rs42.prod.foo.com,16020,1476293566365
> {noformat}
> 9. When the ServerShutdownHandling task reached this region, it also skipped 
> it. This was because the region was in RIT, and the ServerShutdownHandling 
> task assumed that the AssignmentManager would assign it as part of handling 
> the RIT nodes:
> {noformat}
> 2016-10-12 17:45:11,892 INFO  [MASTER_SERVER_OPERATIONS-server01:16000-3] 
> handler.ServerShutdownHandler: Skip assigning region in transition on other 
> server{23e0186c4d2b5cc09f25de35fe174417 state=CLOSED, ts=1476294311697, 
> server=rs44.prod.foo.com,16020,1476294287692}
> {noformat}
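
The mutual skip described in steps 8 and 9 can be sketched as a small toy model. This is only an illustration under stated assumptions: the class, field, and method names below are hypothetical and are not HBase APIs, and the two predicates are simplified readings of the "Skip assigning" log messages quoted above, not the actual AssignmentManager or ServerShutdownHandler code.

```java
import java.util.Collections;
import java.util.Set;

/** Toy model of the two skip decisions. All names are hypothetical, not HBase APIs. */
final class AssignmentRaceSketch {

    /** What the master knows about one region in this scenario. */
    static final class RegionView {
        final String encodedName;
        final String lastHostFromMeta;   // server recorded in meta (rs42 in the report)
        final String transitionServer;   // server in the RIT state (rs44 in the report)

        RegionView(String encodedName, String lastHostFromMeta, String transitionServer) {
            this.encodedName = encodedName;
            this.lastHostFromMeta = lastHostFromMeta;
            this.transitionServer = transitionServer;
        }
    }

    /** AM side (assumed rule): skip when the meta host is dead but not yet processed. */
    static boolean amWouldAssign(RegionView r, Set<String> deadNotProcessed) {
        return !deadNotProcessed.contains(r.lastHostFromMeta);
    }

    /** SSH side (assumed rule): skip regions that are in transition on another server. */
    static boolean sshWouldAssign(RegionView r, String deadServer) {
        return r.transitionServer == null || r.transitionServer.equals(deadServer);
    }

    /** True when both sides decline, so nobody assigns the region. */
    static boolean leftUnassigned(RegionView r, String deadServer) {
        return !amWouldAssign(r, Collections.singleton(deadServer))
            && !sshWouldAssign(r, deadServer);
    }

    public static void main(String[] args) {
        String rs42 = "rs42.prod.foo.com,16020,1476293566365";
        String rs44 = "rs44.prod.foo.com,16020,1476294287692";
        RegionView region =
            new RegionView("23e0186c4d2b5cc09f25de35fe174417", rs42, rs44);
        System.out.println("left unassigned: " + leftUnassigned(region, rs42));
    }
}
```

With the values from the logs (meta host rs42, RIT server rs44, dead server rs42), both predicates return false in this model, which corresponds to the stuck state reported here.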



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
