On Tue, May 18, 2010 at 9:04 AM, Renaud Delbru <renaud.del...@deri.org> wrote:
>> How come your index is so big? Do you have big keys? Lots of data?
>> Lots of storefiles?
>>
>
> We have 90M rows; each row varies from a few hundred kilobytes to
> 8MB.
>
The index keeps the 'key' that starts each block in an hfile, plus its offset, where the 'key' is a combination of row+column+timestamp (not the value). Are your 'keys' large?

> I have also changed at the same time another parameter,
> hbase.hregion.max.filesize. It was set to 1GB (from a previous test), and I
> switched it back to the default value (256MB).
> So, in the previous tests, there was a small number of region files (like
> 250), but a very large index file size (>500).
>
> In my last test (hregion.max.filesize=256MB, block size=128K), the number of
> region files increased (I now have more than 1000 region files), but the
> index file size is now less than 200.
>
> Do you think hregion.max.filesize could have an impact on the index
> size?
>

Hmm. You have the same amount of data, just spread over more files because you lowered the max filesize (by a factor of 4, so 4x the number of files), so I'd expect the index to be about the same total size.

If you're inclined to do more digging, you can use the hfile tool:

./bin/hbase org.apache.hadoop.hbase.io.hfile.HFile

Run the above with no arguments and you'll get usage. Print out the metadata on your hfiles; it might help you figure out what's going on.

>> Looking in HRegionServer I see that it's calculated so:
>>
>> storefileIndexSizeMB = (int)(store.getStorefilesIndexSize()/1024/1024);
>>
>
> So, storefileIndexSize indicates the number of MB of heap used by the index.
> And, in our case, 500 was excessive given that our region
> server is limited to 1GB of heap.
>

If only 1GB, then yeah, big indices will cause a problem. How many regions per regionserver? Sounds like you have a few? If so, can you add more servers? Or up the RAM in your machines?

Yours,
St.Ack
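P.S. If it helps your sizing, here's a rough back-of-envelope for index heap: one index entry (first key + offset) per block, so index size scales with total store bytes divided by block size, times per-entry size. This is only a sketch; the average key size and per-entry overhead below are illustrative guesses, not measured HBase internals.

```python
# Back-of-envelope estimate of HFile block-index heap usage.
# Assumption: one index entry per block, each entry holding the block's
# first key plus a fixed per-entry overhead (offset, pointers, etc.).

def estimate_index_mb(store_bytes, block_bytes, avg_key_bytes, overhead_bytes=64):
    """Return an estimated index size in MB for a given store size."""
    entries = store_bytes // block_bytes          # one entry per block
    entry_bytes = avg_key_bytes + overhead_bytes  # key + guessed overhead
    return entries * entry_bytes / (1024 * 1024)

# Example: 1 TB of store files, 128 KB blocks, 100-byte keys.
mb = estimate_index_mb(1024 ** 4, 128 * 1024, 100)
print(round(mb))  # -> 1312
```

Note doubling the block size halves the entry count and so roughly halves the index heap, which is why bumping block size up helps when values are small relative to keys.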