Re : Re: Table design question

jthievre Wed, 18 Feb 2009 10:29:40 -0800

> On Wed, Feb 18, 2009 at 2:24 AM, Jérôme Thièvre INA 
> <[email protected]> wrote:
> 
> > Hi,
> >
> > I set up a cluster of 4 machines running hbase.
> > I'm working on a web archiving application that needs random access
> > to records, with requests of the type:
> >
> > Record record = getClosestRecord(url, requestedDate);
> >
> > This method should find the record for the specified url at the date
> > *nearest* to the requestedDate. The requested dates have very little
> > chance of matching an insertion date.
> 
> 
> (wayback machine?)
>


A kind of Wayback Machine, but based on a proxy: we don't rewrite URLs.


> 
> Currently we can only return records at an explicit date or older, not
> newer.
> 
> 
> > Each record is made of 10 columns, and each insert is of the type:
> >
> > insertRecord(url, date, record);
> >
> > There are several possible designs for my record table:
> >
> > 1. RowKey=url, and all columns are labelled with the same date.
> > 2. RowKey=url, and we use the timestamp and version support of hbase;
> >    column names are columnFamily names (no label).
> > 3. RowKey=url+date, and column names are columnFamily names (no label).
> 
> Examples please (I've only had one cup of coffee so far this morning).
> 
> 


Suppose the column families are {'content:', 'type:'} and
I want to insert a new record with url www.google.com at date 20090218:

Case 1:
// The date is carried in the column label of every column.
BatchUpdate update = new BatchUpdate("www.google.com");
update.put("content:20090218", Bytes.toBytes(
    "1ffe36e5b13f28e69c2886f40fd3fcea2ce05d030b508c11d714dead5d69000f"));
update.put("type:20090218", Bytes.toBytes("text/html"));
table.commit(update);

Case 2 (uses hbase versioning):
// The date is carried once, as the timestamp of the update.
BatchUpdate update = new BatchUpdate("www.google.com", toTimestamp(20090218));
update.put("content:", Bytes.toBytes(
    "1ffe36e5b13f28e69c2886f40fd3fcea2ce05d030b508c11d714dead5d69000f"));
update.put("type:", Bytes.toBytes("text/html"));
table.commit(update);

Case 3:
// The date is carried in the row key itself.
BatchUpdate update = new BatchUpdate("www.google....@20090218");
update.put("content:", Bytes.toBytes(
    "1ffe36e5b13f28e69c2886f40fd3fcea2ce05d030b508c11d714dead5d69000f"));
update.put("type:", Bytes.toBytes("text/html"));
table.commit(update);
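
Case 2 relies on a toTimestamp helper that is not spelled out. One possible
sketch is below. It assumes the date is a yyyyMMdd integer and that midnight
UTC is the intended instant; the class name DateStamp is hypothetical, and
java.time is used only for illustration.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

class DateStamp {
    /**
     * Converts a yyyyMMdd date such as 20090218 to epoch milliseconds
     * at 00:00 UTC of that day (assumed convention, not from hbase).
     */
    public static long toTimestamp(int yyyymmdd) {
        LocalDate date = LocalDate.parse(
                String.valueOf(yyyymmdd), DateTimeFormatter.BASIC_ISO_DATE);
        return date.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
    }
}
```

Working in epoch milliseconds rather than raw yyyyMMdd integers matters for
"closest date" comparisons, because differences between yyyyMMdd values are
not proportional to elapsed time across month boundaries.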

> 
> >
> > For now I use method 1. To answer getClosestRecord correctly, this
> > means loading an entire columnFamily for the specified row, finding
> > the closest date among its columns, and then loading the other
> > columns labelled with that closest date.
> > I chose this method because I thought I could use
> > HTable.getClosestRowBefore(url, columnFamily:requestedDate) to minimize
> > column loads, but in fact I need both the closest row before and the
> > closest row after to determine which one has the closest date, so I
> > don't use getClosestRowBefore.
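
The "closest before versus closest after" comparison described above can be
sketched as follows. This is only an illustration under assumptions: the
version dates of one url have already been loaded into memory as
epoch-millisecond longs, and ClosestDate and closest are hypothetical names,
not part of the hbase API.

```java
import java.util.TreeSet;

class ClosestDate {
    /**
     * Returns the stored timestamp nearest to the requested one,
     * or null if nothing is stored. Ties go to the older version.
     */
    public static Long closest(TreeSet<Long> stored, long requested) {
        Long before = stored.floor(requested);   // greatest value <= requested
        Long after = stored.ceiling(requested);  // smallest value >= requested
        if (before == null) return after;        // nothing older (or empty set)
        if (after == null) return before;        // nothing newer
        return (requested - before <= after - requested) ? before : after;
    }
}
```

For stored dates {100, 200, 400}, a request for 290 resolves to 200 and a
request for 310 resolves to 400. Both neighbours are needed, which is why a
"closest before" lookup alone is not enough.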
> >
> > Solution 2 seems to be a good alternative: I could have the same
> > functionality with the same process, but the date would be stored once
> > per row insert (as the timestamp) instead of once per column.
> 
> 
> 
> This seems like a better hbase fit.
> 
> 
> 
> >
> >
> > Solution 3 implies only one insert per row key, but dramatically
> > increases the number of rows.
> >
> 
> Yeah, but you can scan them quickly.  Good for finding date ranges
> (until we enrich the API and allow get/scan between date ranges).
> You'll probably have to do as hbase does internally and use a little
> trick so the newest insert shows first -- rather than last.
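
One common way to get "newest first" ordering with lexicographically sorted
row keys is to store a reversed timestamp in the key. The sketch below is an
assumption about what such a trick could look like for the url+date design,
not a description of hbase internals; ReverseDateKey and rowKey are
hypothetical names, and the "@" separator simply follows the Case 3 example.

```java
class ReverseDateKey {
    /**
     * Builds "url@<19-digit reversed timestamp>". Because the reversed value
     * shrinks as the timestamp grows, newer versions of a url sort first.
     * Assumes a non-negative timestamp in epoch milliseconds.
     */
    public static String rowKey(String url, long timestampMillis) {
        long reversed = Long.MAX_VALUE - timestampMillis;
        // Zero-pad to 19 digits (the width of Long.MAX_VALUE) so that
        // string order matches numeric order.
        return url + "@" + String.format("%019d", reversed);
    }
}
```

With keys built this way, a scan that starts at the key for the requested
date reaches that version or the next older one first, and all versions of
one url stay contiguous.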
> 

We can have thousands of different version dates for some URLs.

Is it possible (or will it be) to load column names without loading the cell
content? Same question for the timestamps.


> St.Ack
> 
> >
> > What is the best solution to ensure the best random access time?
> >
> > Jérôme Thièvre
> >
>
