Nope. We don't do any map reduce. We're only using Hadoop for HBase at the moment.
That one node, hdfs02, still has a load of 16 with around 40% I/O wait and
120% CPU.  The other nodes are all around 66% CPU with 0-1% I/O wait and
loads of 1 to 3.  I don't think all the requests are going to hdfs02 based
on the status 'detailed' output.  It seems like that node is just having a
much harder time getting the data or something.  Maybe we have some
incorrect HDFS setting.  All the configs are identical, though.

-James

On Tue, 2010-02-16 at 17:45 -0600, Dan Washusen wrote:
> You mentioned in a previous email that you have a Task Tracker process
> running on each of the nodes.  Is there any chance there is a map reduce
> job running?
>
> On 17 February 2010 10:31, James Baldassari <ja...@dataxu.com> wrote:
>
> > On Tue, 2010-02-16 at 16:45 -0600, Stack wrote:
> > > On Tue, Feb 16, 2010 at 2:25 PM, James Baldassari <ja...@dataxu.com> wrote:
> > > > On Tue, 2010-02-16 at 14:05 -0600, Stack wrote:
> > > >> On Tue, Feb 16, 2010 at 10:50 AM, James Baldassari <ja...@dataxu.com> wrote:
> > > >
> > > > Whether the keys themselves are evenly distributed is another matter.
> > > > Our keys are user IDs, and they should be fairly random.  If we do a
> > > > status 'detailed' in the hbase shell, we see the following distribution
> > > > for the value of "requests" (not entirely sure what this value means):
> > > > hdfs01: 7078
> > > > hdfs02: 5898
> > > > hdfs03: 5870
> > > > hdfs04: 3807
> > > >
> > > That looks like they are evenly distributed.  Requests are how many
> > > hits a second.  See the UI on master port 60010.  The numbers should
> > > match.
> >
> > So the total across all 4 region servers would be 22,653/second?  Hmm,
> > that doesn't seem too bad.  I guess we just need a little more
> > throughput...
> >
> > > > There are no order of magnitude differences here, and the request
> > > > count doesn't seem to map to the load on the server.  Right now
> > > > hdfs02 has a load of 16 while the 3 others have loads between 1 and 2.
> > > >
> > > This is interesting.  I went back over your dumps of cache stats above,
> > > and the 'loaded' server didn't have any attribute there that
> > > differentiated it from the others.  For example, the number of
> > > storefiles seemed about the same.
> > >
> > > I wonder what is making for the high load?  Can you figure it out?  Is
> > > it high CPU use (unlikely)?  Is it then high I/O?  Can you try to
> > > figure out what's different about the layout under the loaded server
> > > versus an unloaded server?  Maybe do a ./bin/hadoop fs -lsr /hbase and
> > > see if anything jumps out at you.
> >
> > It's I/O wait that is killing the highly loaded server.  The CPU usage
> > reported by top is just about the same across all servers (around 100%
> > on an 8-core node), but one server at any given time has a much higher
> > load due to I/O.
> >
> > > If you want to post the above or a loaded server's log to pastebin,
> > > we'll take a look.
> >
> > I'm not really sure what to look for, but maybe someone else will notice
> > something, so here's the output of hadoop fs -lsr /hbase:
> > http://pastebin.com/m98096de
> >
> > And here is today's region server log from hdfs02, which seems to get
> > hit particularly hard: http://pastebin.com/m1d8a1e5f
> >
> > Please note that we restarted it several times today, so some of those
> > errors are probably just due to restarting the region server.
> >
> > > > Applying HBASE-2180 did not make any measurable difference.  There
> > > > are no errors in the region server logs.  However, looking at the
> > > > Hadoop datanode logs, I'm seeing lots of these:
> > > >
> > > > 2010-02-16 17:07:54,064 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > DatanodeRegistration(10.24.183.165:50010,
> > > > storageID=DS-1519453437-10.24.183.165-50010-1265907617548,
> > > > infoPort=50075, ipcPort=50020):DataXceiver
> > > > java.io.EOFException
> > > >     at java.io.DataInputStream.readShort(DataInputStream.java:298)
> > > >     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
> > > >     at java.lang.Thread.run(Thread.java:619)
> > > >
> > > You upped xceivers on your hdfs cluster?  If you look at the other end
> > > of the above EOFException, can you see why it died?
> >
> > Max xceivers = 3072; datanode handler count = 20; region server handler
> > count = 100.
> >
> > I can't find the other end of the EOFException.  I looked in the Hadoop
> > and HBase logs on the server that is the name node and HBase master, as
> > well as on the HBase client.
> >
> > Thanks for all the help!
> >
> > -James
> >
> > > > However, I do think it's strange that the load is so unbalanced on
> > > > the region servers.
> > > >
> > > I agree.
> > >
> > > > We're also going to try throwing some more hardware at the problem.
> > > > We'll set up a new cluster with 16-core, 16G nodes to see if they are
> > > > better able to handle the large number of client requests.  We might
> > > > also decrease the block size to 32k or lower.
> > > >
> > > Ok.
> > >
> > > >> Should only be a matter if you intend distributing the above.
> > > >
> > > > This is probably a topic for a separate thread, but I've never seen a
> > > > legal definition for the word "distribution."  How does this apply to
> > > > the SaaS model?
> > > >
> > > Fair enough.
> > >
> > > Something is up.  Especially if hbase-2180 made no difference.
> > >
> > > St.Ack
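For reference, the three limits quoted above ("Max xceivers = 3072; datanode
handler count = 20; region server handler count = 100") would normally be set
in hdfs-site.xml and hbase-site.xml.  A minimal sketch, assuming the stock
0.20-era property names (the HDFS one really is spelled "xcievers") rather
than the cluster's actual config files:

    <!-- hdfs-site.xml on each datanode -->
    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>3072</value>
      <!-- upper bound on concurrent DataXceiver threads per datanode -->
    </property>
    <property>
      <name>dfs.datanode.handler.count</name>
      <value>20</value>
      <!-- datanode IPC handler threads -->
    </property>

    <!-- hbase-site.xml on each region server -->
    <property>
      <name>hbase.regionserver.handler.count</name>
      <value>100</value>
      <!-- RPC handler threads per region server -->
    </property>

The HDFS values only take effect after a datanode restart, and the HBase one
after a region server restart.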