Hello all, I've read around about table indexing and it seems like there are a couple of approaches, on which I'd like some clarification.
What I have been doing is near full table scans, because I'm doing a lot of aggregation and statistics for our analytics project. We have nearly 7.5GB of data per day to load into HBase. My schema has been row key: timestamp, ColFam1: col1..., ColFam2: col1.... It takes close to 5 hours to load all the data I need from HDFS via MapReduce.

We're currently running HBase on only 3 machines, with about 1.5-2GB of RAM each. We're going to scale out in the next month to two 8-core machines with 30GB of RAM each. With this in mind, I'm now focusing on performance. I'm working on getting LZO compression enabled on all the machines, but I'm more curious about the best way to index.

It seems like there are two strategies: use the tableindexed package, or roll my own, where I'd create a new table keyed by the values of the column I look up in the primary table, with the primary table's row IDs stored in it (see the sketch below). Then, when I do a scan on the main table, I'd grab one value that satisfies my filters and use that value to scan the index table to pull back all the rows that match it.

Does anyone know about the performance of these two approaches, or whether there are others? How do they affect loading? I'd like to load my 7.5GB of data per day in a matter of minutes, not hours, and then be able to query columns in seconds, not tens of minutes.
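To be concrete about the roll-my-own option, here's roughly what I have in mind. This is just a sketch, not code I'm running: the table and column names are made up, and it's written against the 0.20-style client API, so constructors and method names may differ on other versions.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ManualIndexSketch {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable mainTable  = new HTable(conf, "analytics");          // row key = timestamp
    HTable indexTable = new HTable(conf, "analytics_col1_idx"); // row key = value of ColFam1:col1

    // At load time: for each row written to the main table, also write an
    // index entry keyed by the column value, with the main-table row key
    // stored as the qualifier so many rows can hang off one index row.
    byte[] mainRowKey = Bytes.toBytes("20091103120000");
    byte[] colValue   = Bytes.toBytes("some-col1-value");
    Put idx = new Put(colValue);
    idx.add(Bytes.toBytes("ref"), mainRowKey, new byte[0]);
    indexTable.put(idx);

    // At query time: one Get on the index table returns every main-table row
    // key that has this value, then point Gets pull the full rows back.
    Result idxRow = indexTable.get(new Get(colValue));
    if (!idxRow.isEmpty()) {
      for (byte[] qualifier : idxRow.getFamilyMap(Bytes.toBytes("ref")).keySet()) {
        Result mainRow = mainTable.get(new Get(qualifier));
        // ... run the aggregation / stats over mainRow here ...
      }
    }
  }
}

The part I'm least sure about is the load-time cost: every row now means two Puts into two different tables, and I don't know how that (or whatever the tableindexed package does internally) compares against my current MapReduce load.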