Memory leak in HBase replication ?
Hi, I am fairly new to HBase. We are setting up an OpenTSDB system here and have just started building the production clusters. We have two datacenters, one on the west coast and one on the east, and we want two active-passive HBase clusters with HBase replication between them. Right now each cluster has 4 nodes (1 master, 3 slaves); we will add more nodes as the load ramps up.

Setup went fine and data started getting replicated from one cluster to the other, but as soon as the load picked up, regionservers on the slave cluster started running out of heap and getting killed. I increased the regionserver heap from the default 1000M to 2000M, but the result was the same. I also updated HBase from the version that came with Hortonworks (hbase-0.94.6.1.3.0.0-107-security) to hbase-0.94.9 - still the same.

The load on the source cluster is still very light. There is one active table - tsdb - and its compressed size is less than 200M. But as soon as I start replication, the usedHeapMB metric on the slave-cluster regionservers starts going up, then full GC kicks in, and eventually the process is killed because -XX:OnOutOfMemoryError=kill -9 %p is set.

I took a heap dump and ran the Eclipse Memory Analyzer, and here is what it reported:

One instance of java.util.concurrent.LinkedBlockingQueue loaded by the system class loader occupies 1,411,643,656 (67.87%) bytes. The instance is referenced by org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server @ 0x7831c37f0, loaded by sun.misc.Launcher$AppClassLoader @ 0x783130980. The memory is accumulated in one instance of java.util.concurrent.LinkedBlockingQueue$Node loaded by the system class loader.

Also, 502,763 instances of org.apache.hadoop.hbase.client.Put, loaded by sun.misc.Launcher$AppClassLoader @ 0x783130980, occupy 244,957,616 (11.78%) bytes.

There is nothing in the logs until full GC kicks in, at which point all hell breaks loose and things start timing out. I did a bunch of searching but came up with nothing.
I could add more RAM to the nodes and increase the heap size, but I suspect that would only delay the point at which the heap fills up. Any help would be appreciated. Limus
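As a rough back-of-envelope check (my own arithmetic on the MAT numbers above, not from the thread), the retained sizes imply an average per-Put footprint of several hundred bytes, and the queue alone accounts for roughly two thirds of even the enlarged 2000M heap:

```python
# Back-of-envelope arithmetic from the MAT report quoted above.
put_bytes_total = 244_957_616      # bytes retained by Put instances
put_count = 502_763                # number of Put instances
avg_put = put_bytes_total / put_count
print(f"average retained size per Put: {avg_put:.0f} bytes")

queue_bytes = 1_411_643_656        # retained size of the LinkedBlockingQueue
heap_bytes = 2_000 * 1024 * 1024   # the enlarged 2000M regionserver heap
print(f"queue alone is {queue_bytes / heap_bytes:.0%} of a 2000M heap")
```

So the queue of pending replication RPCs, not the Puts themselves, is what dominates the heap.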
RE: Memory leak in HBase replication ?
J-D, I have log level org.apache=WARN and there is only the following in the logs before GC happens:

2013-07-17 10:56:45,830 ERROR org.apache.hadoop.hbase.regionserver.metrics.SchemaMetrics: Inconsistent configuration. Previous configuration for using table name in metrics: true, new configuration: false
2013-07-17 10:56:47,395 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library is available

I'll try upping the log level to DEBUG to see if that shows anything, and will run jstack. Thanks, Limus
Re: Memory leak in HBase replication ?
Yeah, WARN won't give us anything; please try to get us a fat log. Post it on pastebin or such. Thx, J-D

On Wed, Jul 17, 2013 at 11:03 AM, Anusauskas, Laimonas lanusaus...@corp.untd.com wrote: J-D, I have log level org.apache=WARN and there is only the following in the logs before GC happens: 2013-07-17 10:56:45,830 ERROR org.apache.hadoop.hbase.regionserver.metrics.SchemaMetrics: Inconsistent configuration. Previous configuration for using table name in metrics: true, new configuration: false 2013-07-17 10:56:47,395 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library is available I'll try upping the log level to DEBUG to see if that shows anything, and will run jstack. Thanks, Limus
RE: Memory leak in HBase replication ?
Ok, here is the log from data node 1: http://pastebin.com/yCYYEG2r And the .out log containing the GC log: http://pastebin.com/wzt1fbTA I started replication around 11:16, and with a 1000M heap it filled up pretty fast. Limus
RE: Memory leak in HBase replication ?
And here is the jstack output. http://pastebin.com/JKnQYqRg
Re: Memory leak in HBase replication ?
1GB is a pretty small heap, and it could be that the default size for logs to replicate is set too high. The default for replication.source.size.capacity is 64MB. Can you set it much lower on your master cluster (on each RS), like 2MB, and see if it makes a difference? The logs and the jstack seem to correlate in that sense. Thx, J-D On Wed, Jul 17, 2013 at 1:40 PM, Anusauskas, Laimonas lanusaus...@corp.untd.com wrote: And here is the jstack output. http://pastebin.com/JKnQYqRg
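For reference, the property J-D mentions would be set in hbase-site.xml on each regionserver of the source (master) cluster; the value is in bytes, so 2MB is 2097152 (a sketch of the change, with the default noted as stated in the thread):

```xml
<!-- hbase-site.xml on each source-cluster regionserver -->
<property>
  <name>replication.source.size.capacity</name>
  <!-- max total size of WAL edits shipped per replication batch; default 64MB -->
  <value>2097152</value>
</property>
```

A regionserver restart (or rolling restart) would be needed for the change to take effect.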
RE: Memory leak in HBase replication ?
Thanks, setting replication.source.size.capacity to 2MB resolved this. I see the heap growing to about 700MB but then going down, and full GC is only triggered occasionally. And while the primary cluster has very little load (100 requests/sec), the standby cluster is now pretty loaded at 5K requests/sec, presumably because it has to replicate all the pending changes. So perhaps this is an issue that happens when the standby cluster goes away for a while and then has to catch up. Really appreciate the help. Limus
Re: Memory leak in HBase replication ?
Yes... your master cluster must have a helluva backlog to replicate :) Seems to make a good argument for lowering the default setting. What do you think? J-D On Wed, Jul 17, 2013 at 3:37 PM, Anusauskas, Laimonas lanusaus...@corp.untd.com wrote: Thanks, setting replication.source.size.capacity to 2MB resolved this. I see the heap growing to about 700MB but then going down, and full GC is only triggered occasionally. And while the primary cluster has very little load (100 requests/sec), the standby cluster is now pretty loaded at 5K requests/sec, presumably because it has to replicate all the pending changes. So perhaps this is an issue that happens when the standby cluster goes away for a while and then has to catch up. Really appreciate the help. Limus
RE: Memory leak in HBase replication ?
I don't know how this works well enough to suggest lowering the default setting - maybe 64MB really helps throughput for other setups? At least there could be a note in the HBase documentation about heap sizing and replication. Ideally there would be throttling of some kind, so that if the target regionserver cannot keep up with replication requests the replication rate is slowed down, and at least the regionserver does not run out of heap. Limus
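The throttling idea above amounts to bounding the in-flight queue so the producer blocks instead of the consumer's heap growing without limit - exactly what the unbounded LinkedBlockingQueue in the heap dump did not do. A minimal illustration of that backpressure pattern in Python (names are illustrative, not HBase APIs):

```python
import queue
import threading
import time

# Bounded queue: at most 4 in-flight batches. put() blocks when the queue
# is full, which is the backpressure an unbounded queue lacks.
batches = queue.Queue(maxsize=4)

def slow_sink():
    """Simulates a sink that applies replicated batches slowly."""
    while True:
        batch = batches.get()
        if batch is None:  # shutdown sentinel
            break
        time.sleep(0.01)   # pretend to apply the batch
        batches.task_done()

t = threading.Thread(target=slow_sink)
t.start()

for i in range(20):
    batches.put(f"batch-{i}")     # blocks whenever 4 batches are queued
    assert batches.qsize() <= 4   # memory use stays bounded throughout

batches.put(None)
t.join()
print("all batches shipped with bounded memory")
```

With the bound in place the fast producer simply waits for the slow sink, trading throughput for a hard cap on memory - which is roughly what lowering replication.source.size.capacity achieved here.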