On 18/05/10 17:31, Stack wrote:
> On Tue, May 18, 2010 at 9:04 AM, Renaud Delbru <renaud.del...@deri.org> wrote:
>> We have 90M rows; each row varies from a few hundred kilobytes to
>> 8MB.
> Index keeps the 'key' that starts each block in an hfile and its
> offset, where the 'key' is a combination of row+column+timestamp (not
> the value).  Are your 'keys' large?
Our row keys are just plain web document URLs, and the column names are only a few characters, so I would say they are fairly small.
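For example, a hypothetical index entry for one of our rows would be made up of roughly:

  row key:    http://example.org/some/page    (~30 bytes, a URL)
  column:     f:q                             (a few bytes)
  timestamp:  8 bytes

so each index entry key should be on the order of 40-50 bytes, plus the block offset.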
>> I have also changed another parameter at the same time,
>> hbase.hregion.max.filesize. It was set to 1GB (from a previous test), and I
>> switched it back to the default value (256MB).
>> So, in the previous tests, there was a small number of region files (around
>> 250), but a very large index size (>500MB).

>> In my last test (hregion.max.filesize=256MB, block size=128K), the number of
>> region files increased (I now have more than 1000 region files), but the
>> index size is now less than 200MB.

>> Do you think the hregion.max.filesize could have had an impact on the index
>> size?

> Hmm.  You have the same amount of "data", just more files because you
> lowered the max filesize (by a factor of 4, so 4x the number of files), so
> I'd expect the index to be about the same size.
Ok, so it is just the modification of the block size which reduces the index size.
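That would make sense, since there is one index entry per hfile block:

  index entries per store file  ≈  store file size / block size
  256MB / 64KB (the default)    ≈  4096 entries
  256MB / 128KB                 ≈  2048 entries

so, assuming the block size was raised from the 64K default to 128K, the number of index entries (and hence the index heap usage) should roughly halve for the same amount of data.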
> If you're inclined to do more digging, you can use the hfile tool:
>
> ./bin/hbase org.apache.hadoop.hbase.io.hfile.HFile
>
> Run the above and you'll get the usage; it can print out the metadata on hfiles.
> Might help you figure out what's going on.
I'll have a look at this.
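If I read the usage correctly, dumping the metadata of a single store file should be something along the lines of (the -m/-f flags are my guess from the usage output, and the path is a placeholder for one of our hfiles in HDFS):

  ./bin/hbase org.apache.hadoop.hbase.io.hfile.HFile -m -f <path-to-an-hfile>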
>> So, storefileIndexSize indicates the number of MB of heap used by the index.
>> And, in our case, 500MB was excessive given that our region server is
>> limited to 1GB of heap.
> If 1GB only, then yeah, big indices will cause a problem.  How many
> regions per regionserver?  Sounds like you have a few?  If so, can you
> add more servers?  Or up the RAM in your machines?
Yes, we have four nodes, and each node currently holds around 280 regions. We are not able to increase the number of nodes or the RAM for the moment, so our solution was to tune HBase for our setup.

In the end, HBase seems to handle it well: using the new configuration settings, I was able to import our 90M rows in less than 11 hours (using a map-reduce job on the same cluster), while keeping the used heap of the region servers relatively small (300 to 500MB). The region servers now look stable, with a relatively small used heap, even when I use the HBase table as a map-reduce input format.
So, it seems that the memory problem was related to the hfile block size.
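For reference, this is roughly what the two settings look like ('webdocs' and 'content' below are placeholder table/family names, I am setting the block size per column family here, and the hbase.hregion.max.filesize value is just the 256MB default expressed in bytes):

  <!-- hbase-site.xml -->
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>268435456</value>
  </property>

  # hbase shell: 128K block size for the column family
  create 'webdocs', {NAME => 'content', BLOCKSIZE => '131072'}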
--
Renaud Delbru
