[ https://issues.apache.org/jira/browse/HBASE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047818#comment-14047818 ]
Lars Hofhansl commented on HBASE-10673:
---------------------------------------
Just came across this. Seems serious.
> [replication] Recovering a large number of queues can cause OOME
> ----------------------------------------------------------------
>
> Key: HBASE-10673
> URL: https://issues.apache.org/jira/browse/HBASE-10673
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 0.94.5
> Reporter: Ian Friedman
>
> Recently we had an accidental shutdown of approximately 400 RegionServers in
> our cluster. Our cluster runs master-master replication to our secondary
> datacenter. Once we got the nodes sorted out and added back to the cluster,
> we noticed RSes continuing to die from OutOfMemoryErrors. Investigating the
> heap dumps from the dead RSes, I found them full of WALEdits and of
> ReplicationSource threads spawned by the NodeFailoverWorker to replicate the
> queues of the dead RSes.
> After some digging into the code, and with a lot of help from [~jdcryans],
> we found that when many RSes die simultaneously, the node failover process
> can let a small number of regionservers pick up the majority of the
> recovered queues. This causes a large number of HLogs to be opened and read
> simultaneously, overflowing the heap.
> As per [~jdcryans]:
> {quote}
> Well, there's a race to grab the queues to recover. The surviving region
> servers just scan the znode listing all the RSes to determine which ones are
> dead (and when one finds a dead RS it forks off a NodeFailoverWorker), so if
> one RS got the notification that a single RS died but in reality a bunch of
> them died, it could win a lot of those races.
> My guess is that after that one died, its queues were redistributed, and
> maybe you had another hero RS, but as more and more died the queues
> eventually got manageable, since a single RS cannot win all the races.
> Something like this:
> - 400 die, one RS manages to grab 350
> - That RS dies, one other RS manages to grab 300 of the 350 (the other queues
> going to other region servers)
> - That second RS dies, hero #3 grabs 250
> etc.
> ...
> We don't have the concept of a limit on the number of recovered queues, let
> alone the concept of respecting it. Aggregating the queues into one is one
> solution, but it would perform poorly, as shown by your use case, where one
> region server would be responsible for replicating the data of 300 others.
> It would scale better if the queues were more evenly distributed (and that
> was the original intent, except that in practice the dynamics are different).
> {quote}
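
For illustration, here is a minimal, self-contained sketch of the race described above, in plain Java with no HBase or ZooKeeper classes (the class and queue names are made up, and ConcurrentMap.putIfAbsent stands in for the atomic znode claim): every survivor scans the full list of dead RSes and tries to claim every queue, and whichever thread happens to get scheduled first tends to win the bulk of them, which is the "hero RS" effect.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;

/**
 * Toy model of the recovered-queue race: every surviving RS notices the dead
 * RSes at roughly the same time and races to claim each dead RS's queue.
 * putIfAbsent stands in for the atomic znode move that ZooKeeper arbitrates;
 * nothing here uses real HBase or ZooKeeper classes.
 */
public class QueueGrabRace {

  public static void main(String[] args) throws Exception {
    final int deadServers = 400;   // queues that need to be recovered
    final int survivors = 10;      // surviving RSes racing for them

    // queue name -> id of the RS that won the claim
    final ConcurrentMap<String, Integer> claimed = new ConcurrentHashMap<>();
    final CountDownLatch start = new CountDownLatch(1);
    List<Thread> racers = new ArrayList<>();

    for (int rs = 0; rs < survivors; rs++) {
      final int rsId = rs;
      Thread t = new Thread(() -> {
        try {
          start.await();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        // Each survivor scans the full list of dead RSes and tries to claim
        // every queue; putIfAbsent models the atomic grab.
        for (int q = 0; q < deadServers; q++) {
          claimed.putIfAbsent("deadRS-" + q, rsId);
        }
      });
      racers.add(t);
      t.start();
    }

    start.countDown();
    for (Thread t : racers) {
      t.join();
    }

    // Count how unevenly the 400 queues were distributed.
    int[] perRs = new int[survivors];
    for (int winner : claimed.values()) {
      perRs[winner]++;
    }
    for (int rs = 0; rs < survivors; rs++) {
      System.out.println("RS-" + rs + " recovered " + perRs[rs] + " queues");
    }
  }
}
{code}

On a typical run one thread claims the large majority of the 400 queues, which mirrors the 400 -> 350 -> 300 cascade described in the quote (the exact skew depends on thread scheduling, just as the production race depends on notification timing).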
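Along the same lines, here is a hypothetical sketch of the missing "limit of recovered queues": if each survivor simply refused to claim more than a configurable number of recovered queues, the remainder would be left for other RSes. This only illustrates the idea, not anything HBase does today; BoundedQueueClaimer and maxRecoveredQueues are invented names, and the map again stands in for the ZooKeeper-arbitrated claim.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical guard: stop claiming recovered queues once this RS already
 * owns maxRecoveredQueues of them, leaving the rest for other survivors.
 * Illustration only; this is not existing HBase code.
 */
public class BoundedQueueClaimer {

  private final Map<String, String> claimed;   // queue -> owning RS (stand-in for ZK)
  private final String myRsName;
  private final int maxRecoveredQueues;
  private int recoveredByMe = 0;

  public BoundedQueueClaimer(Map<String, String> claimed, String myRsName,
                             int maxRecoveredQueues) {
    this.claimed = claimed;
    this.myRsName = myRsName;
    this.maxRecoveredQueues = maxRecoveredQueues;
  }

  /** Try to claim one dead RS's queue; refuse once the local cap is reached. */
  public boolean tryClaim(String deadRsQueue) {
    if (recoveredByMe >= maxRecoveredQueues) {
      return false; // leave this queue for another surviving RS
    }
    // putIfAbsent models the atomic znode move that decides the race.
    if (claimed.putIfAbsent(deadRsQueue, myRsName) == null) {
      recoveredByMe++;
      return true;
    }
    return false;
  }

  public static void main(String[] args) {
    Map<String, String> zk = new ConcurrentHashMap<>();
    BoundedQueueClaimer claimer = new BoundedQueueClaimer(zk, "rs-1", 25);
    int won = 0;
    for (int q = 0; q < 400; q++) {
      if (claimer.tryClaim("deadRS-" + q)) {
        won++;
      }
    }
    System.out.println("rs-1 recovered " + won + " of 400 queues"); // at most 25
  }
}
{code}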