On Tue, 2010-02-16 at 16:45 -0600, Stack wrote:
> On Tue, Feb 16, 2010 at 2:25 PM, James Baldassari <ja...@dataxu.com> wrote:
> > On Tue, 2010-02-16 at 14:05 -0600, Stack wrote:
> >> On Tue, Feb 16, 2010 at 10:50 AM, James Baldassari <ja...@dataxu.com> wrote:
> >
> > Whether the keys themselves are evenly distributed is another matter.
> > Our keys are user IDs, and they should be fairly random. If we do a
> > status 'detailed' in the hbase shell we see the following distribution
> > for the value of "requests" (not entirely sure what this value means):
> >
> > hdfs01: 7078
> > hdfs02: 5898
> > hdfs03: 5870
> > hdfs04: 3807
>
> That looks like they are evenly distributed. Requests are how many
> hits a second. See the UI on master port 60010. The numbers should
> match.
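
(Aside, in case it helps anyone following the thread: the per-server
"requests" figures that status 'detailed' prints can also be pulled
programmatically. The snippet below is only a rough sketch assuming the
0.20-era client API; class and method names may differ in other HBase
versions.)

// Rough sketch, 0.20-era API assumed: print per-region-server request counts,
// i.e. the "requests" value reported by the shell and the master UI.
import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HServerInfo;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class RequestCounts {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
    ClusterStatus status = admin.getClusterStatus();
    for (HServerInfo server : status.getServerInfo()) {
      // HServerLoad.getNumberOfRequests() backs the "requests" column
      System.out.println(server.getServerAddress().getHostname() + ": "
          + server.getLoad().getNumberOfRequests());
    }
  }
}

Run against the cluster, that should print the same per-server numbers as
the shell output above.
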
So the total across all 4 region servers would be 22,653/second? Hmm,
that doesn't seem too bad. I guess we just need a little more
throughput...

> > There are no order of magnitude differences here, and the request count
> > doesn't seem to map to the load on the server. Right now hdfs02 has a
> > load of 16 while the 3 others have loads between 1 and 2.
>
> This is interesting. I went back over your dumps of cache stats above
> and the 'loaded' servers didn't have any attribute there that
> differentiated it from others. For example, the number of storefiles
> seemed about the same.
>
> I wonder what is making for the high load? Can you figure it? Is it
> high CPU use (unlikely)? Is it then high I/O? Can you try and figure
> out what's different about the layout under the loaded server and that
> of an unloaded server? Maybe do a ./bin/hadoop fs -lsr /hbase and see
> if anything jumps out at you.

It's I/O wait that is killing the highly loaded server. The CPU usage
reported by top is just about the same across all servers (around 100%
on an 8-core node), but one server at any given time has a much higher
load due to I/O.

> If you want to post the above or a loaded server's log to pastebin
> we'll take a looksee.

I'm not really sure what to look for, but maybe someone else will notice
something, so here's the output of hadoop fs -lsr /hbase:
http://pastebin.com/m98096de

And here is today's region server log from hdfs02, which seems to get hit
particularly hard: http://pastebin.com/m1d8a1e5f

Please note that we restarted it several times today, so some of those
errors are probably just due to restarting the region server.

> > Applying HBASE-2180 did not make any measurable difference. There are
> > no errors in the region server logs. However, looking at the Hadoop
> > datanode logs, I'm seeing lots of these:
> >
> > 2010-02-16 17:07:54,064 ERROR
> > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > DatanodeRegistration(10.24.183.165:50010,
> > storageID=DS-1519453437-10.24.183.165-50010-1265907617548, infoPort=50075,
> > ipcPort=50020):DataXceiver
> > java.io.EOFException
> >         at java.io.DataInputStream.readShort(DataInputStream.java:298)
> >         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
> >         at java.lang.Thread.run(Thread.java:619)
>
> You upped xceivers on your hdfs cluster? If you look at the other end
> of the above EOFE, can you see why it died?

Max xceivers = 3072; datanode handler count = 20; region server handler
count = 100.

I can't find the other end of the EOFException. I looked in the Hadoop
and HBase logs on the server that is the name node and HBase master, as
well as on the HBase client.

Thanks for all the help!

-James

> > However, I do think it's strange that the load is so unbalanced on the
> > region servers.
>
> I agree.
>
> > We're also going to try throwing some more hardware at the problem.
> > We'll set up a new cluster with 16-core, 16G nodes to see if they are
> > better able to handle the large number of client requests. We might
> > also decrease the block size to 32k or lower.
>
> Ok.
>
> >> Should only be a matter if you intend distributing the above.
> >
> > This is probably a topic for a separate thread, but I've never seen a
> > legal definition for the word "distribution." How does this apply to
> > the SaaS model?
>
> Fair enough.
>
> Something is up. Especially if hbase-2180 made no difference.
>
> St.Ack
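
(One more aside for the archives: the xceiver and handler settings quoted
above are cluster-side configuration. A sketch of where they live, using
the values mentioned in this thread; note that "xcievers" really is
spelled that way in the Hadoop property name.)

<!-- hdfs-site.xml, on the datanodes -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>3072</value>
</property>
<property>
  <name>dfs.datanode.handler.count</name>
  <value>20</value>
</property>

<!-- hbase-site.xml, on the region servers -->
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>100</value>
</property>

(These are read at startup, so the datanodes and region servers need a
restart after changing them.)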