Sorry for the novel...

I've been experiencing some problems with my hbase cluster and hoping someone 
can point me in the right direction.  I have a 40 node cluster running 0.19.0.  
Each node has 4 cores, 8GB (2GB dedicated to the regionserver), and 1TB data 
disk.  The master is on a dedicated machine separate from the namenode and the 
jobtracker.  There is a single table with 4 column families and 3700 regions 
evenly spread across the 40 nodes.  The TTLs match our loading pace well 
enough that we don't typically see many splits anymore.

In trying to troubleshoot some larger issues with bulk loads on this cluster I 
have created a test scenario to try to narrow down the problem based on various 
symptoms.  This test is a map/reduce job that uses the HRegionPartitioner (as 
an easy way to generate some metadata traffic to the master).  I've been 
running this job with various size inputs to gauge the effect of different 
numbers of mappers and have found that as the number of concurrent mappers 
creeps up to what I think are still small numbers (<50 mappers), the 
performance of the master is dramatically impacted.  I'm judging the 
performance here simply by checking the response time of the UI on the master, 
since that has historically been a good indication of when the cluster is 
getting into trouble during our loads (which I'm sure could mean a lot of 
things), although I suppose it's possible the two are unrelated.

The UI normally takes about 5-7 seconds to refresh master.jsp.  Running a job 
with 5 mappers doesn't seem to impact it too much, but a job with 38 mappers 
makes the UI completely unresponsive for anywhere from 30 seconds to several 
minutes during the run.  During this time, there is nothing happening in the 
logs, scans/gets from within the shell continue to work fine, and ganglia/top 
show the box to be virtually idle.  All links off of master.jsp work fine, so I 
presume it's something about the master pulling info from the individual nodes, 
but those UIs are also perfectly responsive.
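
For anyone who wants to reproduce the measurement, here is a minimal sketch of how the master.jsp response time could be sampled while a job runs.  The host name `master-host` and port 60010 are assumptions rather than values from this cluster, and the helper names are hypothetical; adjust them to your own setup.

```python
import time
import urllib.request

# Hypothetical master address; host and port are assumptions, not from this cluster.
MASTER_URL = "http://master-host:60010/master.jsp"


def fetch_master_page(url=MASTER_URL, timeout=300):
    """Fetch the page once and return its size in bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return len(resp.read())


def sample_latency(fetch, samples=5, pause=0.0):
    """Call fetch() repeatedly and return the elapsed seconds for each call."""
    timings = []
    for _ in range(samples):
        start = time.monotonic()
        try:
            fetch()
        except OSError:
            # Timeouts and connection errors still count as a (slow) sample.
            pass
        timings.append(time.monotonic() - start)
        time.sleep(pause)
    return timings


if __name__ == "__main__":
    # Take one sample every 5 seconds while the map/reduce job is running.
    for elapsed in sample_latency(fetch_master_page, samples=10, pause=5.0):
        print(f"master.jsp took {elapsed:.1f}s")
```

Running this once with the cluster idle and once during the 38-mapper job would give numbers that are easier to compare than refreshing the page by hand.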

This same cluster used to run on just 20 nodes without issue, so I'm curious if 
I've crossed some threshold of horizontal scalability or if there is just a 
tuning parameter that I'm missing that might take care of this, or if there is 
something known between 0.19.0 and 0.19.3 that might be a factor.

Thanks

jeremy

