What Ryan said.  Then, can you try the same test after a major compaction?
Does it make a difference?  You can force it in the shell by doing "hbase>
major_compact '.META.'" IIRC (type 'tools' in the shell to get help on
syntax).   What size are your jobs?  Short-lived?  Seconds or minutes?  Each
job needs to build up a cache of region locations.  To do this, it's a trip to
.META.  Longer-lived jobs will save on trips to .META.  Also, take a thread
dump when it's slow ("kill -QUIT PID_OF_MASTER") and send it to us.  Do it a
few times.  We'll take a look-see.
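
If it helps with the "do it a few times" part, here is a rough sketch of a
helper that sends the same signal as "kill -QUIT" several times with a pause
in between.  The function name, the defaults, and the injectable send/sleep
parameters are all made up for illustration; nothing here is part of HBase.
It assumes a Unix box and that you already know the master's PID.

```python
import os
import signal
import time


def request_thread_dumps(pid, count=3, interval_s=10.0,
                         send=os.kill, sleep=time.sleep):
    """Send SIGQUIT to `pid` `count` times, pausing between sends.

    A JVM answers SIGQUIT by printing a thread dump to its stdout and
    then carries on running.  `send` and `sleep` are hypothetical hooks
    so the loop can be dry-run without signalling a real process.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    for i in range(count):
        send(pid, signal.SIGQUIT)
        # No need to wait after the final dump has been requested.
        if i < count - 1:
            sleep(interval_s)
    return count
```

Calling request_thread_dumps(PID_OF_MASTER) while the UI is hung should leave
three dumps, ten seconds apart, wherever the master's stdout is being
written.  Dumps spaced out like this make it easier to see which threads stay
stuck in the same place.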

Should be better in 0.20.0, but there may be a few things we can do in the meantime.

St.Ack

On Mon, Jun 1, 2009 at 5:31 PM, Jeremy Pinkham <[email protected]> wrote:

>
> sorry for the novel...
>
> I've been experiencing some problems with my hbase cluster and hoping
> someone can point me in the right direction.  I have a 40 node cluster
> running 0.19.0.  Each node has 4 cores, 8GB (2GB dedicated to the
> regionserver), and 1TB data disk.  The master is on a dedicated machine
> separate from the namenode and the jobtracker.  There is a single table with
> 4 column families and 3700 regions evenly spread across the 40 nodes.  The
> TTLs match our loading pace well enough that we don't typically see too
> many splits anymore.
>
> In trying to troubleshoot some larger issues with bulk loads on this
> cluster I have created a test scenario to try and narrow the problem based
> on various symptoms.  This test is a map/reduce job that is using the
> HRegionPartitioner (as an easy way to generate some traffic to the master
> for meta data).  I've been running this job with various size inputs to
> gauge the effect of different numbers of mappers and have found that as the
> number of concurrent mappers creeps up to what I think are still small
> numbers (<50 mappers), the performance of the master is dramatically
> impacted.  I'm judging the performance here simply by checking the response
> time of the UI on the master, since that has historically been a good
> indication of when the cluster is getting into trouble during our loads
> (which I'm sure could mean a lot of things), although I suppose it's
> possible the two are unrelated.
>
> The UI normally takes about 5-7 seconds to refresh master.jsp.  Running a
> job with 5 mappers doesn't seem to impact it too much, but a job with 38
> mappers makes the UI completely unresponsive for anywhere from 30 seconds to
> several minutes during the run.  During this time, there is nothing
> happening in the logs, scans/gets from within the shell continue to work
> fine, and ganglia/top show the box to be virtually idle.  All links off of
> master.jsp work fine, so I presume it's something about the master pulling
> info from the individual nodes, but those UIs are also perfectly
> responsive.
>
> This same cluster used to run on just 20 nodes without issue, so I'm
> curious if I've crossed some threshold of horizontal scalability or if there
> is just a tuning parameter that I'm missing that might take care of this, or
> if there is something known between 0.19.0 and 0.19.3 that might be a
> factor.
>
> Thanks
>
> jeremy
>
