On Tue, 2010-02-16 at 16:45 -0600, Stack wrote:
> On Tue, Feb 16, 2010 at 2:25 PM, James Baldassari <ja...@dataxu.com> wrote:
> > On Tue, 2010-02-16 at 14:05 -0600, Stack wrote:
> >> On Tue, Feb 16, 2010 at 10:50 AM, James Baldassari <ja...@dataxu.com> wrote:
> >
> > Whether the keys themselves are evenly distributed is another matter.
> > Our keys are user IDs, and they should be fairly random.  If we do a
> > status 'detailed' in the hbase shell we see the following distribution
> > for the value of "requests" (not entirely sure what this value means):
> > hdfs01: 7078
> > hdfs02: 5898
> > hdfs03: 5870
> > hdfs04: 3807
> >
> That looks like they are evenly distributed.  Requests are how many
> hits a second.  See the UI on master port 60010.  The numbers should
> match.

So the total across all 4 region servers would be 22,653/second?  Hmm,
that doesn't seem too bad.  I guess we just need a little more
throughput...
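
As a quick check of that total, here is a minimal Python sketch that sums the per-server "requests" values from status 'detailed' and shows each server's share. The names `requests` and `share` are only illustrative; the numbers are the ones quoted above.

```python
# Per-server "requests" values as reported by status 'detailed' above.
requests = {"hdfs01": 7078, "hdfs02": 5898, "hdfs03": 5870, "hdfs04": 3807}

# Cluster-wide request count across all four region servers.
total = sum(requests.values())


def share(server):
    """Fraction of the cluster-wide request count handled by one server."""
    return requests[server] / total


print("total:", total)
for name in sorted(requests):
    print(f"{name}: {requests[name]} ({share(name):.1%})")
```

The busiest and the quietest server differ by less than a factor of two in this snapshot.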

> 
> 
> > There are no order of magnitude differences here, and the request count
> > doesn't seem to map to the load on the server.  Right now hdfs02 has a
> > load of 16 while the 3 others have loads between 1 and 2.
> 
> 
> This is interesting.  I went back over your dumps of cache stats above
> and the 'loaded' servers didn't have any attribute there that
> differentiated them from the others.  For example, the number of
> storefiles seemed about the same.
> 
> I wonder what is making for the high load?  Can you figure it out?  Is it
> high CPU use (unlikely)?  Is it then high i/o?  Can you try and figure out
> what's different about the layout under the loaded server and that of
> an unloaded server?  Maybe do a ./bin/hadoop fs -lsr /hbase and see if
> anything jumps out at you.

It's I/O wait that is killing the highly loaded server.  The CPU usage
reported by top is just about the same across all servers (around 100%
on an 8-core node), but one server at any given time has a much higher
load due to I/O.
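
For anyone wanting to put a number on the I/O wait, here is a rough Python sketch, assuming a Linux node where the aggregate `cpu` line of /proc/stat has the usual field order (user, nice, system, idle, iowait, irq, softirq, steal, ...). The function names are hypothetical helpers, not part of any Hadoop or HBase tool. It may also help to remember that the Linux load average counts tasks blocked in uninterruptible I/O, which is why load can be high while CPU usage looks the same everywhere.

```python
import time


def parse_cpu_line(line):
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in fields[1:]]


def iowait_fraction(before, after):
    """Share of CPU time spent in iowait between two 'cpu' line samples."""
    a = parse_cpu_line(before)
    b = parse_cpu_line(after)
    deltas = [y - x for x, y in zip(a, b)]
    # Sum user..steal only; guest time is already included in user/nice.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return deltas[4] / total  # index 4 is the iowait counter


def sample_iowait(interval=1.0, path="/proc/stat"):
    """Sample /proc/stat twice, `interval` seconds apart (Linux only)."""
    with open(path) as f:
        before = f.readline()
    time.sleep(interval)
    with open(path) as f:
        after = f.readline()
    return iowait_fraction(before, after)
```

Calling `sample_iowait()` on each region server at roughly the same time would give a comparable figure per node.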

> 
> If you want to post the above or a loaded server's log to pastebin we'll
> take a looksee.

I'm not really sure what to look for, but maybe someone else will notice
something, so here's the output of hadoop fs -lsr /hbase:
http://pastebin.com/m98096de

And here is today's region server log from hdfs02, which seems to get
hit particularly hard: http://pastebin.com/m1d8a1e5f

Please note that we restarted it several times today, so some of those
errors are probably just due to restarting the region server.

> 
> 
> > Applying HBASE-2180 did not make any measurable difference.  There are
> > no errors in the region server logs.  However, looking at the Hadoop
> > datanode logs, I'm seeing lots of these:
> >
> > 2010-02-16 17:07:54,064 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.24.183.165:50010, storageID=DS-1519453437-10.24.183.165-50010-1265907617548, infoPort=50075, ipcPort=50020):DataXceiver
> > java.io.EOFException
> >        at java.io.DataInputStream.readShort(DataInputStream.java:298)
> >        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
> >        at java.lang.Thread.run(Thread.java:619)
> 
> You upped xceivers on your hdfs cluster?  If you look at the other end of
> the above EOFE, can you see why it died?

Max xceivers = 3072; datanode handler count = 20; region server handler count = 100

I can't find the other end of the EOFException.  I looked in the Hadoop
and HBase logs on the server that is the name node and HBase master, as
well as on the HBase client.
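
One possible way to narrow the search is to see when the errors cluster and compare that with restart times or client activity. The Python sketch below counts EOFException entries per hour in a datanode log. It assumes each ERROR header sits on a single line in the log file (unlike the wrapped copy in the mail) with the exception class on the line right after it; `eof_errors_per_hour` is a hypothetical helper name.

```python
import re
from collections import Counter

# Header of an ERROR entry; captures "YYYY-MM-DD HH" as the hour bucket.
ERROR_HEADER = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2},\d{3} ERROR ")
# Any timestamped entry, used to detect non-ERROR log lines.
ANY_HEADER = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} ")


def eof_errors_per_hour(lines):
    """Count ERROR entries that are followed by java.io.EOFException, per hour."""
    counts = Counter()
    current_hour = None
    for line in lines:
        match = ERROR_HEADER.match(line)
        if match:
            current_hour = match.group(1)
        elif ANY_HEADER.match(line):
            current_hour = None  # a different (non-ERROR) entry started
        elif current_hour and line.strip() == "java.io.EOFException":
            counts[current_hour] += 1
            current_hour = None  # count each entry once
    return counts
```

Passing an open log file (for example `eof_errors_per_hour(open(path))`, with `path` pointing at a datanode log) returns a `Counter` keyed by hour.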

Thanks for all the help!

-James

> 
> 
> >
> > However, I do think it's strange that
> > the load is so unbalanced on the region servers.
> >
> 
> I agree.
> 
> 
> > We're also going to try throwing some more hardware at the problem.
> > We'll set up a new cluster with 16-core, 16G nodes to see if they are
> > better able to handle the large number of client requests.  We might
> > also decrease the block size to 32k or lower.
> >
> Ok.
> 
> >> Should only be a matter if you intend distributing the above.
> >
> > This is probably a topic for a separate thread, but I've never seen a
> > legal definition for the word "distribution."  How does this apply to
> > the SaaS model?
> >
> Fair enough.
> 
> Something is up.  Especially if hbase-2180 made no difference.
> 
> St.Ack
