[ 
https://issues.apache.org/jira/browse/HBASE-15251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150057#comment-15150057
 ] 

Stephen Yuan Jiang commented on HBASE-15251:
--------------------------------------------

[~claraxiong], what I was trying to say is that I think only the
{{checkWals()}} change helps improve clean restart performance.  The logic you
added, which checks whether a dead server has zero regions, sets
'failover=true' only if a dead server has assigned regions.  Since your
scenario is a 'clean restart', failover would be false and you would go on to
the normal logic.  I don't see any shortcut by which the check would end
earlier and declare that 'failover' is indeed 'false'.
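
To make the point concrete, here is a minimal sketch of the two checks as I
understand them. It is not the actual AssignmentManager code: the class
{{FailoverSketch}}, its method names, and the plain string/path types are all
hypothetical and chosen only for illustration. It shows that the dead-server
check can only turn failover on; when it finds no assigned regions, the
decision simply continues to the next check.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in HBase.
public class FailoverSketch {

    // True only if some dead server still has at least one region assigned.
    static boolean deadServerHasRegions(Set<String> deadServers,
                                        Map<String, List<String>> assignments) {
        for (String server : deadServers) {
            List<String> regions = assignments.get(server);
            if (regions != null && !regions.isEmpty()) {
                return true;
            }
        }
        return false;
    }

    // True if any directory holds at least one non-empty regular file.
    // Empty directories and zero-length files are ignored.
    static boolean hasNonEmptyWals(List<Path> walDirs) throws IOException {
        for (Path dir : walDirs) {
            if (!Files.isDirectory(dir)) {
                continue;
            }
            try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
                for (Path file : files) {
                    if (Files.isRegularFile(file) && Files.size(file) > 0) {
                        return true;
                    }
                }
            }
        }
        return false;
    }

    // Each check can only set failover to true. A "false" from the first
    // check does not end the decision early; it falls through to the next.
    static boolean isFailover(Set<String> deadServers,
                              Map<String, List<String>> assignments,
                              List<Path> walDirs) throws IOException {
        if (deadServerHasRegions(deadServers, assignments)) {
            return true;
        }
        return hasNonEmptyWals(walDirs);
    }
}
```

In a clean restart the first check returns false and the outcome then depends
on the WAL check, which is why I think the {{checkWals()}} change is the part
that matters for that scenario.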

Overall, the patch looks good.  (At this time, you have not uploaded the new
patch that addresses Ted's comments.  Please do so and let the pre-commit
build verify it.)

> During a cluster restart, Hmaster thinks it is a failover by mistake
> --------------------------------------------------------------------
>
>                 Key: HBASE-15251
>                 URL: https://issues.apache.org/jira/browse/HBASE-15251
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 2.0.0, 0.98.15
>            Reporter: Clara Xiong
>            Assignee: Clara Xiong
>         Attachments: HBASE-15251-master.patch
>
>
> We often need to do cluster restart as part of release for a cluster of more
> than 1000 nodes. We have tried our best to get clean shutdown but 50% of the
> time, hmaster still thinks it is a failover. This increases the restart time
> from 5
> min to 30 min and decreases locality from 99% to 5% since we didn't use a 
> locality-aware balancer. We had a bug HBASE-14129 but the fix didn't work. 
> After adding more logging and inspecting the logs, we identified two things 
> that trigger the failover handling:
> 1.  When Hmaster.AssignmentManager detects any dead servers on service 
> manager during joinCluster(), it determines this is a failover without 
> further check. I added a check whether there is even any region assigned to 
> these servers. During a clean restart, the regions are not even assigned.
> 2. When there are some leftover empty folders for log and split directories 
> or empty wal files, it is also treated as a failover. I added a check for 
> that. Although this can be resolved by manual cleanup, it is still too 
> tedious for restarting a large cluster.
> Patch will follow shortly. The fix is tested and used in production now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
