Performance of reading rows with a large number of columns

Sammy Yu Sat, 03 Apr 2010 00:41:48 -0700

Hi,
   We've been doing some performance comparisons between different schemas
on HBase 0.20.3.  I have a schema defined as follows:


table:row1: {
   columnfamily:cf1, column:value0001-0100: <cell value>,
   columnfamily:cf1, column:value0101-0200: <cell value>,
   columnfamily:cf1, column:value0201-0300: <cell value>,
   ....
}

Using the Thrift protocol, we call scannerOpen and restrict it to a single
column, such as cf1:value0101-0200.  This works really well when row1 has
only a single column (0.040 seconds).  However, when a row contains 5,000
columns, the query time jumps to 1.8 seconds.  Is HBase deserializing the
entire row when it reads the data from disk, so that limiting the column
doesn't have any effect?  Also, is the solution then to move the column so
that it becomes part of the key?  I think this solution will work, but it
doesn't feel right, as there could be cases
where I want value0101-0200 and value0201-0300 to come back in one row.
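
To make the key-based alternative concrete, here is a minimal sketch in plain
Python.  It does not talk to HBase at all: a dict plus a sorted key list stands
in for the table, and the ":" separator, the helper names, and the inclusive
stop key are all assumptions made for illustration.  It only shows how moving
the column into the key turns the single-column read into a direct lookup, and
how two adjacent column ranges could still be fetched with one range scan and
regrouped into one logical row on the client side.

```python
from bisect import bisect_left, bisect_right

# Hypothetical separator between the original row key and the column range.
SEP = ":"


def to_composite_rows(row_key, wide_row):
    """Split one wide row {qualifier: value} into narrow rows keyed 'row:qualifier'."""
    return {f"{row_key}{SEP}{qualifier}": value for qualifier, value in wide_row.items()}


def range_scan(sorted_keys, table, start_key, stop_key):
    """Return (key, value) pairs with start_key <= key <= stop_key.

    The stop key is inclusive here purely to keep the sketch short.
    """
    lo = bisect_left(sorted_keys, start_key)
    hi = bisect_right(sorted_keys, stop_key)
    return [(key, table[key]) for key in sorted_keys[lo:hi]]


def regroup(pairs):
    """Merge narrow rows back into one logical row per original row key."""
    rows = {}
    for key, value in pairs:
        row_key, qualifier = key.split(SEP, 1)
        rows.setdefault(row_key, {})[qualifier] = value
    return rows


# Placeholder cell values; the real cells are not shown in the schema.
wide = {"value0001-0100": "a", "value0101-0200": "b", "value0201-0300": "c"}
table = to_composite_rows("row1", wide)
keys = sorted(table)

# Single column range: a direct lookup by composite key.
single = table["row1:value0101-0200"]

# Two adjacent column ranges: one range scan, regrouped into a single row.
both = regroup(range_scan(keys, table, "row1:value0101-0200", "row1:value0201-0300"))
```

The trade-off this illustrates is that the "one row" view has to be rebuilt by
the client, which is the part that doesn't feel right to me.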

Thanks,
Sammy
