Memory leak in HBase replication?

2013-07-17 Thread Anusauskas, Laimonas
Hi,

I am fairly new to HBase. We are setting up an OpenTSDB system here and have just 
started building the production clusters. We have 2 datacenters, one on each of 
the west and east coasts, and we want to run 2 active-passive HBase clusters with 
HBase replication between them. Right now each cluster has 4 nodes (1 master, 3 
slaves); we will add more nodes as the load ramps up. Setup went fine and data 
started replicating from one cluster to the other, but as soon as load picked up, 
regionservers on the slave cluster started running out of heap and getting 
killed. I increased the regionserver heap size from the default 1000M to 2000M, 
but the result was the same. I also updated HBase from the version that came 
with Hortonworks (hbase-0.94.6.1.3.0.0-107-security) to hbase-0.94.9 - still 
the same.

The load on the source cluster is still very light. There is one active table - 
tsdb - and its compressed size is less than 200M. But as soon as I start 
replication, the usedHeapMB metric on the slave cluster's regionservers starts 
climbing, then full GC kicks in, and eventually the process is killed because 
-XX:OnOutOfMemoryError=kill -9 %p is set.
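
For reference, this is roughly how those settings look in our conf/hbase-env.sh 
(the variable names are the standard hbase-env.sh ones; the exact quoting here 
is an approximation):

# conf/hbase-env.sh - sketch of the settings described above
export HBASE_HEAPSIZE=2000   # daemon heap in MB (up from the 1000 default)
export HBASE_REGIONSERVER_OPTS="-XX:OnOutOfMemoryError='kill -9 %p'"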

I took a heap dump, ran the Eclipse Memory Analyzer, and here is what it 
reported:

One instance of java.util.concurrent.LinkedBlockingQueue loaded by system 
class loader occupies 1,411,643,656 (67.87%) bytes. The instance is 
referenced by org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server @ 
0x7831c37f0, loaded by sun.misc.Launcher$AppClassLoader @ 0x783130980. The 
memory is accumulated in one instance of 
java.util.concurrent.LinkedBlockingQueue$Node loaded by system class 
loader.

And

502,763 instances of org.apache.hadoop.hbase.client.Put, loaded by 
sun.misc.Launcher$AppClassLoader @ 0x783130980 occupy 244,957,616 (11.78%) 
bytes.
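
If I read that right, the RPC server queues incoming calls on an unbounded 
LinkedBlockingQueue, so when replication RPCs (batches of Puts) arrive faster 
than the handlers drain them, the queue itself eats the heap. A toy program 
(not HBase code, just the same pattern with made-up names) shows the shape of 
the failure:

import java.util.concurrent.LinkedBlockingQueue;

// Toy illustration only (not HBase code): an unbounded LinkedBlockingQueue
// grows without limit when producers outrun the consumer, which is the
// pattern MAT reported against the RPC server's call queue.
public class UnboundedQueueDemo {
    public static void main(String[] args) throws InterruptedException {
        // No capacity argument: bounded only by Integer.MAX_VALUE entries.
        final LinkedBlockingQueue<byte[]> callQueue =
            new LinkedBlockingQueue<byte[]>();

        // Slow "handler" thread: drains one entry per 10 ms.
        Thread handler = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) {
                        callQueue.take();
                        Thread.sleep(10);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        handler.setDaemon(true);
        handler.start();

        // Fast "replication source": 64 KB payloads, as fast as possible.
        // Heap use climbs steadily until the JVM runs out of memory.
        while (true) {
            callQueue.put(new byte[64 * 1024]);
        }
    }
}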

There is nothing in the logs until full GC kicks in, at which point all hell 
breaks loose and things start timing out.

I did a bunch of searching but came up with nothing. I could add more RAM to the 
nodes and increase the heap size, but I suspect that would only prolong the time 
until the heap fills up.

Any help would be appreciated.

Limus


RE: Memory leak in HBase replication?

2013-07-17 Thread Anusauskas, Laimonas
J-D,

I have the log level set to org.apache=WARN and there is only the following in 
the logs before the GC happens:

2013-07-17 10:56:45,830 ERROR 
org.apache.hadoop.hbase.regionserver.metrics.SchemaMetrics: Inconsistent 
configuration. Previous configuration for using table name in metrics: true, 
new configuration: false
2013-07-17 10:56:47,395 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy: 
Snappy native library is available

I'll try upping the log level to DEBUG to see if that shows anything, and I will 
run jstack.

Thanks,

Limus


Re: Memory leak in HBase replication?

2013-07-17 Thread Jean-Daniel Cryans
Yeah, WARN won't give us anything, so please try to get us a fat log. Post
it on pastebin or such.

Thx,

J-D


RE: Memory leak in HBase replication?

2013-07-17 Thread Anusauskas, Laimonas
Ok, here is the log from data node 1:

http://pastebin.com/yCYYEG2r

And the .out log containing the GC log:

http://pastebin.com/wzt1fbTA

I started replication around 11:16, and with a 1000M heap it filled up pretty fast.

Limus


RE: Memory leak in HBase replication?

2013-07-17 Thread Anusauskas, Laimonas
And here is the jstack output. 

http://pastebin.com/JKnQYqRg


Re: Memory leak in HBase replication?

2013-07-17 Thread Jean-Daniel Cryans
1GB is a pretty small heap, and it could be that the default size for logs
to replicate is set too high. The default
for replication.source.size.capacity is 64MB. Can you set it much lower on
your master cluster (on each RS), like 2MB, and see if it makes a
difference?
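
Concretely, that's something like the following in hbase-site.xml on each 
source-side RS (the value is in bytes, so 2MB is 2097152):

<property>
  <name>replication.source.size.capacity</name>
  <value>2097152</value> <!-- 2 MB; the default is 67108864 (64 MB) -->
</property>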

The logs and the jstack seem to correlate in that sense.

Thx,

J-D


RE: Memory leak in HBase replication?

2013-07-17 Thread Anusauskas, Laimonas
Thanks, setting replication.source.size.capacity to 2MB resolved this. I see 
the heap growing to about 700MB but then coming back down, and full GC is only 
triggered occasionally.

And while the primary cluster has very little load (under 100 requests/sec), 
the standby cluster is now pretty loaded at 5K requests/sec, presumably because 
it has to replicate all the pending changes. So perhaps this is an issue that 
shows up when the standby cluster goes away for a while and then has to catch up.

Really appreciate the help.

Limus


Re: Memory leak in HBase replication?

2013-07-17 Thread Jean-Daniel Cryans
Yes... your master cluster must have a helluva backlog to replicate :)

Seems like a good argument for lowering the default setting. What do you
think?

J-D


RE: Memory leak in HBase replication?

2013-07-17 Thread Anusauskas, Laimonas
I don't know how this works well enough to suggest lowering the default 
setting - maybe 64MB really does help throughput for other setups? At the very 
least there could be a note in the HBase documentation about heap sizes and 
replication.

Ideally there would be throttling of some kind, so that if the target 
regionserver cannot keep up with replication requests the replication rate is 
slowed down, and the regionserver at least does not run out of heap space.
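
To illustrate the kind of throttling I mean: a bounded queue gives exactly that 
behavior. This toy (again not HBase code, same pattern as the earlier sketch) 
caps the queue, so the sender blocks instead of the receiver dying:

import java.util.concurrent.LinkedBlockingQueue;

// Toy sketch of the backpressure idea (not HBase code): give the call
// queue a capacity, so a producer that outruns the consumer blocks in
// put() instead of growing the heap until the process is killed.
public class BoundedQueueDemo {
    public static void main(String[] args) throws InterruptedException {
        // Capacity of 256 caps queue memory at about 256 * 64 KB = 16 MB.
        final LinkedBlockingQueue<byte[]> callQueue =
            new LinkedBlockingQueue<byte[]>(256);

        Thread handler = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) {
                        callQueue.take();   // slow consumer
                        Thread.sleep(10);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        handler.setDaemon(true);
        handler.start();

        while (true) {
            // Blocks once 256 entries are pending: the sender is slowed
            // to the handler's pace and heap stays bounded.
            callQueue.put(new byte[64 * 1024]);
        }
    }
}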

Limus