On Mon, Feb 15, 2010 at 10:05 PM, James Baldassari <ja...@dataxu.com> wrote:
> Applying HBASE-2180 isn't really an option at this
> time because we've been told to stick with the Cloudera distro.
I'm sure they wouldn't mind (smile). It seems to about double throughput.

> If I had to guess, I would say the performance issues start to happen
> around the time the region servers hit max heap size, which occurs
> within minutes of exposing the app to live traffic. Could GC be killing
> us? We use the concurrent collector as suggested. I saw on the
> performance page some mention of limiting the size of the new generation,
> like -XX:NewSize=6m -XX:MaxNewSize=6m. Is that worth trying?

Enable GC logging for a while? See hbase-env.sh. Uncomment this line:

# export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:$HBASE_HOME/logs/gc-hbase.log"

Are you using a recent JVM? 1.6.0_10 or greater? 1.6.0_18 might have issues.

What do CPU and iowait ('wa' in top) look like on these machines,
particularly the loaded one? How many disks are in the machines?

St.Ack
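For reference, a minimal hbase-env.sh sketch pulling together what is discussed above (GC logging plus a capped new generation). It assumes the stock 0.20-era defaults; the 6m figure is just the wiki page's example value, not a tested recommendation, and the heap size is only meant to line up with the roughly 4 GB maxHeap in the stats quoted further down. Adjust paths and sizes for your own cluster.

  # Region server heap in MB; about 4 GB, in line with the maxHeap=4079 in the stats below.
  export HBASE_HEAPSIZE=4000

  # Concurrent (CMS) collector, the stock 0.20-era setting, plus the small new
  # generation mentioned on the performance wiki page (6m is an example value).
  export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:NewSize=6m -XX:MaxNewSize=6m"

  # GC logging, the line Stack points to above:
  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:$HBASE_HOME/logs/gc-hbase.log"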
> Here are the new region server stats along with load averages:
>
> Region Server 1:
> request=0.0, regions=16, stores=16, storefiles=33, storefileIndexSize=4,
> memstoreSize=1, compactionQueueSize=0, usedHeap=2891, maxHeap=4079,
> blockCacheSize=1403878072, blockCacheFree=307135816, blockCacheCount=21107,
> blockCacheHitRatio=84, fsReadLatency=0, fsWriteLatency=0, fsSyncLatency=0
> Load Averages: 10.34, 10.58, 7.08
>
> Region Server 2:
> request=0.0, regions=15, stores=16, storefiles=26, storefileIndexSize=3,
> memstoreSize=1, compactionQueueSize=0, usedHeap=3257, maxHeap=4079,
> blockCacheSize=661765368, blockCacheFree=193741576, blockCacheCount=9942,
> blockCacheHitRatio=77, fsReadLatency=0, fsWriteLatency=0, fsSyncLatency=0
> Load Averages: 1.90, 1.23, 0.98
>
> Region Server 3:
> request=0.0, regions=16, stores=16, storefiles=41, storefileIndexSize=4,
> memstoreSize=4, compactionQueueSize=0, usedHeap=1627, maxHeap=4079,
> blockCacheSize=665117184, blockCacheFree=190389760, blockCacheCount=9995,
> blockCacheHitRatio=70, fsReadLatency=0, fsWriteLatency=0, fsSyncLatency=0
> Load Averages: 2.01, 3.56, 4.18
>
> That first region server is getting hit much harder than the others.
> They're identical machines (8-core), and the distribution of keys should
> be fairly random, so I'm not sure why that would happen. Any other
> ideas or suggestions would be greatly appreciated.
>
> Thanks,
> James
>
> On Mon, 2010-02-15 at 21:51 -0600, Stack wrote:
>> Yeah, I was going to say that if your loading is mostly read, you can
>> probably go up from the 0.2 given over to the cache. I like Dan's
>> suggestion of trying it first on one server, if you can.
>>
>> St.Ack
>>
>> On Mon, Feb 15, 2010 at 5:22 PM, Dan Washusen <d...@reactive.org> wrote:
>> > So roughly 72% of reads use the blocks held in the block cache...
>> >
>> > It would be interesting to see the difference between when it was
>> > working OK and now. Could you try increasing the memory allocated to
>> > one of the region servers and also increasing the "hfile.block.cache.size"
>> > to say '0.4' on the same region server?
>> >
>> > On 16 February 2010 11:54, James Baldassari <ja...@dataxu.com> wrote:
>> >
>> >> Hi Dan. Thanks for your suggestions. I am doing writes at the same
>> >> time as reads, but there are usually many more reads than writes. Here
>> >> are the stats for all three region servers:
>> >>
>> >> Region Server 1:
>> >> request=0.0, regions=15, stores=16, storefiles=34, storefileIndexSize=3,
>> >> memstoreSize=308, compactionQueueSize=0, usedHeap=3096, maxHeap=4079,
>> >> blockCacheSize=705474544, blockCacheFree=150032400, blockCacheCount=10606,
>> >> blockCacheHitRatio=76, fsReadLatency=0, fsWriteLatency=0, fsSyncLatency=0
>> >>
>> >> Region Server 2:
>> >> request=0.0, regions=16, stores=16, storefiles=39, storefileIndexSize=4,
>> >> memstoreSize=225, compactionQueueSize=0, usedHeap=3380, maxHeap=4079,
>> >> blockCacheSize=643172800, blockCacheFree=212334144, blockCacheCount=9660,
>> >> blockCacheHitRatio=69, fsReadLatency=0, fsWriteLatency=0, fsSyncLatency=0
>> >>
>> >> Region Server 3:
>> >> request=0.0, regions=13, stores=13, storefiles=31, storefileIndexSize=4,
>> >> memstoreSize=177, compactionQueueSize=0, usedHeap=1905, maxHeap=4079,
>> >> blockCacheSize=682848608, blockCacheFree=172658336, blockCacheCount=10262,
>> >> blockCacheHitRatio=72, fsReadLatency=0, fsWriteLatency=0, fsSyncLatency=0
>> >>
>> >> The average blockCacheHitRatio is about 72. Is this too low? Anything
>> >> else I can check?
>> >>
>> >> -James
>> >>
>> >> On Mon, 2010-02-15 at 18:16 -0600, Dan Washusen wrote:
>> >> > Maybe the block cache is thrashing?
>> >> >
>> >> > If you are regularly writing data to your tables then it's possible
>> >> > that the block cache is no longer effective. On the region server
>> >> > web UI, check the blockCacheHitRatio value. You want this value to be
>> >> > high (0 - 100). If this value is low, it means that HBase has to go to
>> >> > disk to fetch blocks of data. You can control the amount of JVM memory
>> >> > that HBase allocates to the block cache using the
>> >> > "hfile.block.cache.size" property (the default is 0.2, i.e. 20%).
>> >> >
>> >> > Cheers,
>> >> > Dan
>> >> >
>> >> > On 16 February 2010 10:45, James Baldassari <ja...@dataxu.com> wrote:
>> >> > >
>> >> > > Hi,
>> >> > >
>> >> > > Does anyone have any tips to share regarding optimization for random
>> >> > > read performance? For writes I've found that setting a large write
>> >> > > buffer and setting auto-flush to false on the client side
>> >> > > significantly improved put performance. Are there any similar easy
>> >> > > tweaks to improve random read performance?
>> >> > >
>> >> > > I'm using HBase 0.20.3 in a very read-heavy real-time system with 1
>> >> > > master and 3 region servers. It was working OK for a while, but today
>> >> > > there was a severe degradation in read performance. Restarting Hadoop
>> >> > > and HBase didn't help, and there are no errors in the logs. Read
>> >> > > performance starts off around 1,000-2,000 gets/second but quickly
>> >> > > (within minutes) drops to around 100 gets/second.
>> >> > >
>> >> > > I've already looked at the performance tuning wiki page. On the
>> >> > > server side I've increased hbase.regionserver.handler.count from 10
>> >> > > to 100, but it didn't help. Maybe this is expected because I'm only
>> >> > > using a single client to do reads. I'm working on implementing a
>> >> > > client pool now, but I'm wondering if there are any other settings
>> >> > > on the server or client side that might improve things.
>> >> > >
>> >> > > Thanks,
>> >> > > James
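To make the client-side tweaks above concrete, here is a minimal sketch against the 0.20 client API. The table name, column family, row key, and buffer size are made up for illustration; setAutoFlush(false) plus a larger write buffer is the put-side change James describes, and on the read side the main client-side lever is issuing Gets from several threads, each with its own HTable instance, since HTable is not safe for concurrent use.

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ClientTuningSketch {
      public static void main(String[] args) throws Exception {
          // 0.20-style configuration; picks up hbase-site.xml from the classpath.
          HBaseConfiguration conf = new HBaseConfiguration();

          // Put-side tweaks mentioned above: buffer puts on the client, flush in bulk.
          HTable writer = new HTable(conf, "mytable");   // "mytable" and "cf" are made-up names
          writer.setAutoFlush(false);
          writer.setWriteBufferSize(12 * 1024 * 1024);   // 12 MB, an arbitrary example size
          Put put = new Put(Bytes.toBytes("row-00001"));
          put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
          writer.put(put);                               // buffered on the client
          writer.flushCommits();                         // pushes the buffered puts to the servers

          // Read side: plain random Gets. Use one HTable per reader thread,
          // since HTable is not safe for concurrent use.
          HTable reader = new HTable(conf, "mytable");
          Get get = new Get(Bytes.toBytes("row-00001"));
          Result result = reader.get(get);
          byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));
          System.out.println("value = " + (value == null ? "<none>" : Bytes.toString(value)));
      }
  }

The hbase.regionserver.handler.count bump mentioned above is a server-side property set in hbase-site.xml, so it needs no client-side change; the client pool James mentions is presumably about giving each reader thread its own HTable along these lines.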