Hi,

I set up a cluster of 4 machines running HBase.
I'm working on a web archiving application that needs random access to
records, with requests of the form:

Record record = getClosestRecord(url, requestedDate);
This method should find the record for the specified URL whose date is nearest
to requestedDate. The requested date will very rarely match an insertion date
exactly.

Each record is made of 10 columns, and each insert is of the form:

insertRecord(url, date, record);

There are several possible designs for my record table:

1. RowKey=url, and all columns of a record are labelled with the same date.
2. RowKey=url, and we use the timestamp and version support of HBase; column
names are column family names (no label).
3. RowKey=url+date, and column names are column family names (no label).

For now I use design 1. To answer getClosestRecord correctly, I have to load an
entire column family for the specified row, find the closest date among its
column labels, and then load the other columns labelled with that date.
I chose this design because I thought I could use the method
HTable.getClosestRowBefore(url, columnFamily:requestedDate) to minimize column
loads. In fact I need both the closest row before and the closest row after to
determine which one is nearest to the requested date, so I don't use
getClosestRowBefore.
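
To illustrate why both neighbours are needed, here is a minimal sketch of the
selection step in plain Java. It does not call HBase at all: it assumes the
dates found in the column labels have already been loaded into a sorted set of
epoch milliseconds, and the class and method names (ClosestDate, closest) are
hypothetical.

```java
import java.util.TreeSet;

public class ClosestDate {
    /**
     * Returns the stored date nearest to the requested one, or null if no
     * date is stored. On a tie the earlier date wins (an arbitrary choice).
     */
    public static Long closest(TreeSet<Long> storedDates, long requested) {
        Long before = storedDates.floor(requested);   // greatest date <= requested
        Long after = storedDates.ceiling(requested);  // smallest date >= requested
        if (before == null) return after;
        if (after == null) return before;
        // Both neighbours exist: keep whichever is nearer to the request.
        return (requested - before <= after - requested) ? before : after;
    }

    public static void main(String[] args) {
        TreeSet<Long> dates = new TreeSet<Long>();
        dates.add(1000L);
        dates.add(2000L);
        dates.add(5000L);
        System.out.println(closest(dates, 2900L)); // prints 2000
        System.out.println(closest(dates, 4000L)); // prints 5000
    }
}
```

A "closest before" lookup alone would return 2000 for the request at 4000,
although 5000 is nearer.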

Design 2 seems to be a good alternative: I would get the same functionality
with the same process, but the date would be stored once per row insert (as
the timestamp) instead of once per column.

Design 3 implies only one insert per row key, but dramatically increases the
number of rows.
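
For the url+date row key, one possible encoding is sketched below. It is only
an assumption about how such a key could be built: the separator, the
zero-padded epoch-millisecond format, and the names RecordKey and rowKey are
all hypothetical. The point is that a fixed-width date keeps the rows of one
URL in chronological order when keys are compared as strings.

```java
import java.util.Locale;

public class RecordKey {
    // Hypothetical separator, assumed never to appear unescaped in a stored URL.
    static final char SEP = ' ';

    /** Builds url + separator + 19-digit zero-padded epoch milliseconds. */
    public static String rowKey(String url, long epochMillis) {
        if (epochMillis < 0) {
            throw new IllegalArgumentException("negative dates are not supported");
        }
        // 19 digits are enough for any non-negative long value.
        return url + SEP + String.format(Locale.ROOT, "%019d", epochMillis);
    }

    public static void main(String[] args) {
        String a = rowKey("http://example.org/", 999L);
        String b = rowKey("http://example.org/", 1000L);
        System.out.println(a.compareTo(b) < 0); // prints true
    }
}
```

With keys of this shape, the two candidates for a request are the nearest row
at or before and the nearest row after url+requestedDate, provided both still
belong to the same URL.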

Which solution gives the best random access time?

Jérôme Thièvre
