[ https://issues.apache.org/jira/browse/HBASE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047818#comment-14047818 ]

Lars Hofhansl commented on HBASE-10673:
---------------------------------------

Just came across this. Seems serious.

> [replication] Recovering a large number of queues can cause OOME
> ----------------------------------------------------------------
>
>                 Key: HBASE-10673
>                 URL: https://issues.apache.org/jira/browse/HBASE-10673
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 0.94.5
>            Reporter: Ian Friedman
>
> Recently we had an accidental shutdown of approximately 400 RegionServers in 
> our cluster. Our cluster runs master-master replication to our secondary 
> datacenter. Once we got the nodes sorted out and added back to the cluster, 
> we noticed RSes continuing to die from OutOfMemoryErrors. Investigating the 
> heap dumps from the dead RSes, I found them full of WALEdits, along with many 
> ReplicationSource threads spawned by the NodeFailoverWorker to replicate 
> queues for the dead RSes.
> After digging into the code, and with a lot of help from [~jdcryans], we 
> found that when many RSes die simultaneously, the node failover process can 
> leave a very small number of regionservers holding the majority of the 
> queues. This causes a large number of HLogs to be opened and read at once, 
> overflowing the heap.
> As per [~jdcryans]:
> {quote}
> Well, there's a race to grab the queues to recover. The surviving region 
> servers scan the znode listing all the RSes to determine which ones are 
> dead, and each time one finds a dead RS it forks off a NodeFailoverWorker. 
> So if one RS got the notification that a single RS died, but in reality a 
> bunch of them died, that RS could win a lot of those races.
> My guess is that after that one died, its queues were redistributed, and 
> maybe you had another hero RS, but as more and more died the queues 
> eventually became manageable, since a single RS cannot win all the races. 
> Something like this:
> - 400 die, one RS manages to grab 350
> - That RS dies, one other RS manages to grab 300 of the 350 (the other queues 
> going to other region servers)
> - That second RS dies, hero #3 grabs 250
> etc.
> ...
> We don't have the concept of a limit on recovered queues, let alone the 
> concept of respecting one. Aggregating the queues into one is a possible 
> solution, but it would perform poorly, as shown by your use case, where one 
> region server would be responsible for replicating the data of 300 others. 
> It would scale better if the queues were more evenly distributed (and that 
> was the original intent, except that in practice the dynamics are different).
> {quote}
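To make the race concrete, here is a minimal sketch of the queue-claim step, 
written against the raw ZooKeeper API rather than HBase's actual 
ReplicationZookeeper code. The znode layout, the lock name, the 
QueueClaimSketch class, and the MAX_RECOVERED_QUEUES cap are all illustrative 
assumptions; the cap stands in for the missing limit the quote describes.

{code:java}
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

/**
 * Illustrative sketch only: the znode layout, lock name, and cap below are
 * assumptions for this example, not HBase's actual replication code.
 */
public class QueueClaimSketch {
  // Hypothetical cap on recovered queues; per the quote above, 0.94 has none.
  private static final int MAX_RECOVERED_QUEUES = 25;

  private final ZooKeeper zk;
  private final String rsZNode = "/hbase/replication/rs"; // assumed layout
  private int claimed = 0;

  public QueueClaimSketch(ZooKeeper zk) {
    this.zk = zk;
  }

  /** Race every other surviving RS to claim the queues of one dead server. */
  public void recoverQueues(String deadServer)
      throws KeeperException, InterruptedException {
    List<String> queues = zk.getChildren(rsZNode + "/" + deadServer, false);
    for (String queue : queues) {
      if (claimed >= MAX_RECOVERED_QUEUES) {
        return; // the missing safety valve: leave the rest to other survivors
      }
      // The race itself: every surviving RS attempts this same create(); only
      // the first one succeeds, so one fast RS can grab nearly all the queues.
      String lock = rsZNode + "/" + deadServer + "/" + queue + "/lock";
      try {
        zk.create(lock, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
            CreateMode.PERSISTENT);
        claimed++;
        // The winner would now copy the queue under its own znode and spawn a
        // ReplicationSource, which opens and reads the dead server's HLogs:
        // the per-queue memory cost that adds up to the OOME.
      } catch (KeeperException.NodeExistsException e) {
        // Lost the race for this queue; another RS is recovering it.
      }
    }
  }
}
{code}

Because every surviving RS runs the same loop as soon as it notices a death, 
the first RS to start can win nearly every create(), which is exactly the 
hero-RS pile-up described in the quote above.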


