Hello all, I've read around about table indexing and it seems like there are a couple of approaches, on which I'd like some clarification.
What I have been doing is near full table scans, because I'm doing a lot of aggregation and statistics for our analytics project. We have nearly 7.5GB of data per day to load into HBase. My schema has been row key: timestamp, ColFam1: col1..., ColFam2: col1.... It takes close to 5 hours to load all the data I need from HDFS via MapReduce.

We're currently running HBase on only 3 machines, with about 1.5-2GB of RAM each. We're going to scale out in the next month to two 8-core machines with 30GB of RAM each. With this in mind, I'm now focusing on performance. I'm working on getting LZO compression enabled on all the machines, but I'm more curious about the best way to index.

It seems like there are two strategies: use the tableindexed package, or roll my own, where I'd create a new table keyed by the values of the column I look up in the primary table, with the primary table's row IDs stored in it (see the sketch below). Then, when I do a scan on the main table, I'd grab one value that satisfies my filters and use that value to scan the index table to pull back all the rows that match it.

Does anyone know about the performance of these two approaches, or whether there are others? How do they affect loading? I'd like to load my 7.5GB of data per day in a matter of minutes, not hours, and then be able to query columns in seconds, not tens of minutes.
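To be concrete about the roll-my-own option, here's roughly what I have in mind. This is just a sketch, not code I'm running: the table and column names are made up, and it's written against the 0.20-style client API, so constructors and method names may differ on other versions.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ManualIndexSketch {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable mainTable  = new HTable(conf, "analytics");          // row key = timestamp
    HTable indexTable = new HTable(conf, "analytics_col1_idx"); // row key = value of ColFam1:col1

    // At load time: for each row written to the main table, also write an
    // index entry keyed by the column value, with the main-table row key
    // stored as the qualifier so many rows can hang off one index row.
    byte[] mainRowKey = Bytes.toBytes("20091103120000");
    byte[] colValue   = Bytes.toBytes("some-col1-value");
    Put idx = new Put(colValue);
    idx.add(Bytes.toBytes("ref"), mainRowKey, new byte[0]);
    indexTable.put(idx);

    // At query time: one Get on the index table returns every main-table row
    // key that has this value, then point Gets pull the full rows back.
    Result idxRow = indexTable.get(new Get(colValue));
    if (!idxRow.isEmpty()) {
      for (byte[] qualifier : idxRow.getFamilyMap(Bytes.toBytes("ref")).keySet()) {
        Result mainRow = mainTable.get(new Get(qualifier));
        // ... run the aggregation / stats over mainRow here ...
      }
    }
  }
}

The part I'm least sure about is the load-time cost: every row now means two Puts into two different tables, and I don't know how that (or whatever the tableindexed package does internally) compares against my current MapReduce load.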