The HBase master page displays the aggregate cluster operation rate. One thing about splits: they are data based, so if your hot data has a lot of locality it can drive load unevenly. On a massive data set this is usually mitigated by the sheer size of the data, but tests rarely run against data sets that large.
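To make the locality point concrete, here is a minimal, hypothetical sketch of one common mitigation, salting the row key so that otherwise-adjacent hot keys are spread across regions. The class name, bucket count, and key layout below are invented for illustration and are not from this thread:

import java.io.UnsupportedEncodingException;

// Illustrative only: prefix each row key with a small hash bucket so that keys
// which would otherwise sort next to each other land in different regions.
public class SaltedKey {
    private static final int BUCKETS = 16;   // assumed bucket count

    static byte[] salt(String userId) throws UnsupportedEncodingException {
        int bucket = (userId.hashCode() & 0x7fffffff) % BUCKETS;
        // e.g. "07|user1234" -- the prefix breaks up key locality, at the cost
        // of having to fan scans out across all buckets.
        return String.format("%02d|%s", bucket, userId).getBytes("UTF-8");
    }
}

In this particular thread the keys are already said to be fairly random user IDs, so salting may not buy much here; it mainly matters when hot keys cluster together.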
On Feb 16, 2010 2:26 PM, "James Baldassari" <ja...@dataxu.com> wrote:

On Tue, 2010-02-16 at 14:05 -0600, Stack wrote:
> On Tue, Feb 16, 2010 at 10:50 AM, James Baldassari...

Yes, over time it would balance out, but we were trying to solve the immediate problem of uneven load across the region servers, so we wanted to make sure that all the data was evenly distributed to eliminate that as a potential cause. We did do the export and import, and after that the data was evenly distributed, so that's not the issue. Whether the keys themselves are evenly distributed is another matter. Our keys are user IDs, and they should be fairly random.

If we do a status 'detailed' in the hbase shell we see the following distribution for the value of "requests" (not entirely sure what this value means):

hdfs01: 7078
hdfs02: 5898
hdfs03: 5870
hdfs04: 3807

There are no order-of-magnitude differences here, and the request count doesn't seem to map to the load on the server. Right now hdfs02 has a load of 16 while the other 3 have loads between 1 and 2. Applying HBASE-2180 did not make any measurable difference.

There are no errors in the region server logs. However, looking at the Hadoop datanode logs, I'm seeing lots of these:

2010-02-16 17:07:54,064 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.24.183.165:50010, storageID=DS-1519453437-10.24.183.165-50010-1265907617548, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException
        at java.io.DataInputStream.readShort(DataInputStream.java:298)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
        at java.lang.Thread.run(Thread.java:619)

One thing I should mention is that I'm not sure exactly how many requests/second HBase is servicing now that I've switched to using HBase client pools. HBase may be servicing gets at a relatively high rate, but it's just not keeping up with the incoming requests from our clients. I think there may be some additional client-side optimizations we can make, so I might take some time to look at our client code to see if there are any problems there. However, I do think it's strange that the load is so unbalanced on the region servers.

We're also going to try throwing some more hardware at the problem. We'll set up a new cluster with 16-core, 16G nodes to see if they are better able to handle the large number of client requests. We might also decrease the block size to 32k or lower.

> > > We also increased the max heap size on the region server from 4G to 5G
> > > and decreased the...

This is probably a topic for a separate thread, but I've never seen a legal definition for the word "distribution." How does this apply to the SaaS model?

> > > > > We do have Ganglia with monitoring and alerts. We're not swapping right
> > > > > now, althou...

I wish I knew the answer to this. As I mentioned above, our keys are user IDs. They come in at random times, not in chunks of similar IDs.

-James

> Thanks James,
> St.Ack
>
> > -James
> >
> > On Tue, 2010-02-16 at 12:17 -0600, Stack ...
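On the client-pool point: HTable instances are not safe for concurrent use, so one common pattern is to give each client thread its own instance rather than sharing one. The sketch below is only an illustration of that pattern, not James's actual client code; the table name "users" and the row-key scheme are assumptions.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch only: one HTable per thread, created lazily on first use.
public class PerThreadTable {
    private static final HBaseConfiguration CONF = new HBaseConfiguration();

    private static final ThreadLocal<HTable> TABLE = new ThreadLocal<HTable>() {
        @Override
        protected HTable initialValue() {
            try {
                return new HTable(CONF, "users");   // assumed table name
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    };

    // Fetch the whole row for a user ID using this thread's HTable.
    static Result fetchUser(String userId) throws IOException {
        return TABLE.get().get(new Get(Bytes.toBytes(userId)));
    }
}

Whether this helps depends on where the bottleneck actually is: it only removes client-side contention and does nothing to change the load balance across region servers.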