Re: Performance of reading rows with a large number of columns

Andrew Purtell Sat, 03 Apr 2010 10:46:18 -0700

Sammy,

> Is HBase deserializing the entire row when it reads the
> data from disk


No.

> so limiting the column doesn't have any effect. 

HBase is a column oriented store -- values are grouped independently at the 
store level by column family. 

It appears you are using only one column family, "cf1:". You need to 
distinguish between column families and columns (family:qualifier). When 
processing a row, all values in the requested column family (every 
family:qualifier in that family) will be read in from disk, even if you 
asked for only one qualifier.

To get the effect you desire, you need a schema more like:

  table:row1: {
   columnfamily:cf1, column:value0001-0100: <cell value>,
   columnfamily:cf2, column:value0101-0200: <cell value>,
   columnfamily:cf3, column:value0201-0300: <cell value>,
   ....
  }

Then you can specify in gets and scans which family: or family:qualifier to 
include in the result, and I/O will be performed only on the column families 
selected by your get or scan. 
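
To make the difference concrete, here is a toy model in Python. It is only a
sketch of the behaviour described above, not HBase code: ToyStore, build and
the cells_read counter are made-up names, and the split into 50 families of
100 columns each is an arbitrary choice for illustration.

```python
from collections import defaultdict


class ToyStore:
    """Toy model only: one independent store (a dict) per column family."""

    def __init__(self):
        # family -> {(row, qualifier): value}
        self.stores = defaultdict(dict)
        self.cells_read = 0  # hypothetical counter standing in for I/O

    def put(self, row, column, value):
        family, qualifier = column.split(":", 1)
        self.stores[family][(row, qualifier)] = value

    def get(self, row, column):
        """Read one family:qualifier, touching every cell of that family's row."""
        family, qualifier = column.split(":", 1)
        result = None
        for (r, q), value in self.stores[family].items():
            if r == row:
                self.cells_read += 1  # the whole family is read for this row
                if q == qualifier:
                    result = value
        return result


def build(num_columns, family_of):
    """Fill row1 with num_columns cells; family_of(i) picks the family name."""
    store = ToyStore()
    for i in range(num_columns):
        store.put("row1", f"{family_of(i)}:value{i:04d}", b"x")
    return store


# Layout 1: everything in cf1. Layout 2: 100 columns per family.
one_family = build(5000, lambda i: "cf1")
many_families = build(5000, lambda i: f"cf{i // 100 + 1}")

one_family.get("row1", "cf1:value0150")
many_families.get("row1", "cf2:value0150")

print(one_family.cells_read, many_families.cells_read)  # 5000 100
```

In this model the single-family layout touches all 5,000 cells to return one
value, while the multi-family layout touches only the 100 cells in cf2.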

Hope that helps,

   - Andy


> From: Sammy Yu
> Subject: Performance of reading rows with a large number of columns
> To: hbase-user@hadoop.apache.org
> Date: Saturday, April 3, 2010, 12:41 AM
> Hi,
>    We've been doing some performance comparison between different sets of
> schema on HBase-0.20.3.  I have a schema defined as such
> 
> table:row1: {
>    columnfamily:cf1, column:value0001-0100: <cell value>,
>    columnfamily:cf1, column:value0101-0200: <cell value>,
>    columnfamily:cf1, column:value0201-0300: <cell value>,
>    ....
> }
> 
> Using the Thrift protocol, we are using scannerOpen and limiting it by
> specifying just a single column such as cf1:value0101-0200.  This works
> really well when row1 just has a single column (0.040 seconds).  However,
> when a row contains 5,000 columns, the query time jumps up to 1.8 seconds.
> Is HBase deserializing the entire row when it reads the data from disk, so
> limiting the column doesn't have any effect?  Also, is the solution then
> to move the column so that it becomes part of the key?  I think this
> solution will work; however, it doesn't feel right, as there could be cases
> where I want value0101-0200 and value0101-0200 to come back in one row.
> 
> Thanks,
> Sammy
>
