Sammy,

> Is HBase deserializing the entire row when it reads the
> data from disk
No.

> so limiting the column doesn't have any effect.

HBase is a column-oriented store: values are grouped independently at the store level by column family. It appears you are using only one column family, "cf1:". You need to distinguish between column families and columns (family:qualifier). When processing rows, all values in the column family (every family:qualifier in that family) will be read in from disk. To get the effect you desire, you need a schema more like:

table:row1: {
  columnfamily:cf1, column:value0001-0100: <cell value>,
  columnfamily:cf2, column:value0101-0200: <cell value>,
  columnfamily:cf3, column:value0201-0300: <cell value>,
  ....
}

Then you can specify in gets and scans which family: or family:qualifier to include in the result, and I/O will be performed only on the column families selected by your get or scan.

Hope that helps,

   - Andy

> From: Sammy Yu
> Subject: Performance of reading rows with a large number of columns
> To: hbase-user@hadoop.apache.org
> Date: Saturday, April 3, 2010, 12:41 AM
>
> Hi,
> We've been doing some performance comparison between different sets of
> schema on HBase-0.20.3. I have a schema defined as such
>
> table:row1: {
>   columnfamily:cf1, column:value0001-0100: <cell value>,
>   columnfamily:cf1, column:value0101-0200: <cell value>,
>   columnfamily:cf1, column:value0201-0300: <cell value>,
>   ....
> }
>
> Using the thrift protocol, we are using scannerOpen and
> limiting it by specifying just a single column such as
> cf1:value0101-0200. This works really well when row1 just has a
> single column (0.040 seconds). However when a row contains 5,000
> columns, the query time jumps up to 1.8 seconds.
> Is HBase deserializing the entire row when it reads the data from disk so
> limiting the column doesn't have any effect. Also, is the solution then
> to move the column so that it becomes part of the key?
> I think this solution will work, however it doesn't feel right as there
> could be cases where I want value0101-0200 and value0101-0200 to come back
> in one row.
>
> Thanks,
> Sammy
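
To make the point in the reply concrete, here is a rough toy model in Python. It is not HBase code and does not reflect its real storage format: ToyTable and cells_read_for are hypothetical names, and the only behaviour modelled is the one described above, namely that each column family is stored separately and that reading a row from a family touches every cell of that row in that family. The figure of 50 families is arbitrary, picked to make the arithmetic obvious, not a schema recommendation.

```python
from collections import defaultdict


class ToyTable:
    """Toy model only: one separate store per column family."""

    def __init__(self):
        # family -> row key -> {qualifier: value}
        self.stores = defaultdict(lambda: defaultdict(dict))
        self.cells_read = 0

    def put(self, row, column, value):
        family, qualifier = column.split(":", 1)
        self.stores[family][row][qualifier] = value

    def get(self, row, column):
        """Return one family:qualifier value, counting every cell touched."""
        family, qualifier = column.split(":", 1)
        cells = self.stores[family][row]
        # Assumption taken from the reply: the whole row within the
        # selected family is read, other families are not touched.
        self.cells_read += len(cells)
        return cells.get(qualifier)


def cells_read_for(num_families, num_columns=5000):
    """Spread num_columns over num_families, fetch one column, report cells read."""
    table = ToyTable()
    for i in range(num_columns):
        family = "cf%d" % (i % num_families + 1)
        table.put("row1", "%s:value%04d" % (family, i), b"x")
    target = 100
    table.get("row1", "cf%d:value%04d" % (target % num_families + 1, target))
    return table.cells_read


print(cells_read_for(1))   # 5000: a single family, so the whole row is read
print(cells_read_for(50))  # 100: only the selected family's cells are read
```

In this model, asking for one column out of 5,000 costs 5,000 cell reads when everything lives in cf1, and 100 when the same columns are spread over 50 families, which is the effect the suggested schema is after.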