Ian Friedman created HBASE-10673:
------------------------------------
Summary: [replication] Recovering a large number of queues can
cause OOME
Key: HBASE-10673
URL: https://issues.apache.org/jira/browse/HBASE-10673
Project: HBase
Issue Type: Bug
Components: Replication
Affects Versions: 0.94.5
Reporter: Ian Friedman
Recently we had an accidental shutdown of approx 400 RegionServers in our
cluster. Our cluster runs master-master replication to our seconday datacenter.
Once we got the nodes sorted out and added back to the cluster, we noticed RSes
continuing to die from OOMExceptions. Investigating the heap dumps from the
dead RSes I found them full of WALEdits, and a lot of ReplicationSource threads
generated by the NodeFailoverWorker replicating a lot of queues for the dead
RSes.
After doing some digging into the code and with a lot of help from [~jdcryans]
we found that it is possible for the node failover process in a very few
regionservers to pick up the majority of queues when a lot of RSes die
simultaneously. This causes a lot of HLogs to be opened and read simultaneously.
as per [~jdcryans]:
{quote}
Well there's a race to grab the queues to recover, and the surviving region
servers are just scanning the znode with all the RS to try to determine which
ones are dead (and when it finds a dead one it forks off a NodeFailoverWorker)
so that if one RS got the notification that one RS died but in reality a bunch
of them died, it could win a lot of those races.
My guess is after that one died, its queues were redistributed, and maybe you
had another hero RS, but as more and more died the queues eventually got
manageable since a single RS cannot win all the races. Something like this:
- 400 die, one RS manages to grab 350
- That RS dies, one other RS manages to grab 300 of the 350 (the other queues
going to other region servers)
- That second RS dies, hero #3 grabs 250
etc.
...
We don't have the concept of a limit of recovered queues, less so the concept
of respecting it. Aggregating the queues into one is one solution, but it would
perform poorly as show with your use case where one region server would be
responsible to replicate the data for 300 others. It would scale better if the
queues were more evenly distributed (and it was the original intent, except
that in practice the dynamics are different).
{quote}
--
This message was sent by Atlassian JIRA
(v6.2#6252)