> On Wed, Feb 18, 2009 at 2:24 AM, Jérôme Thièvre INA
> <[email protected]> wrote:
>
> > Hi,
> >
> > I set up a cluster of 4 machines running HBase.
> > I'm working on a web archiving application that needs random access
> > to records, with requests of the form:
> >
> > Record record = getClosestRecord(url, requestedDate);
> > This method should find the record for the specified URL at the date
> > *nearest* to the requestedDate. The requested dates have very little
> > chance of matching an insertion date.
>
>
> (wayback machine?)
>
Kind of a Wayback Machine, but based on a proxy; we don't rewrite URLs.
>
> Currently we can only return records at an explicit date or older, not
> newer.
>
>
> > Each record is made of 10 columns, and each insert is of the form:
> >
> > insertRecord(url, date, record);
> >
> > There are several possible designs for my record table :
> >
> > 1. RowKey=url, and all columns are labelled with the same date.
> >
> > 2. RowKey=url, and we use the timestamp and version support of HBase;
> > column names are the column family names (no label).
> >
> > 3. RowKey=url+date, and column names are the column family names (no
> > label).
>
> Examples please (I've only had one cup of coffee so far this morning).
>
>
Suppose the column families are: {'content:', 'type:'}
I want to insert a new record with URL www.google.com at date 20090218:
Case 1:
BatchUpdate update = new BatchUpdate("www.google.com");
update.put("content:20090218",
Bytes.toBytes("1ffe36e5b13f28e69c2886f40fd3fcea2ce05d030b508c11d714dead5d69000f"));
update.put("type:20090218", Bytes.toBytes("text/html"));
table.commit(update);
Case 2: implies using HBase versioning
BatchUpdate update = new BatchUpdate("www.google.com", toTimestamp(20090218));
update.put("content:",
Bytes.toBytes("1ffe36e5b13f28e69c2886f40fd3fcea2ce05d030b508c11d714dead5d69000f"));
update.put("type:", Bytes.toBytes("text/html"));
table.commit(update);
Case 3:
BatchUpdate update = new BatchUpdate("www.google....@20090218");
update.put("content:",
Bytes.toBytes("1ffe36e5b13f28e69c2886f40fd3fcea2ce05d030b508c11d714dead5d69000f"));
update.put("type:", Bytes.toBytes("text/html"));
table.commit(update);
>
> >
> > For now, I use method 1. To answer getClosestRecord correctly, this
> > means loading an entire column family for the specified row, finding
> > the closest date within that column family, and then loading the
> > other columns labelled with this closest date.
> > I chose this method because I thought I could use
> > HTable.getClosestRowBefore(url, columnFamily:requestedDate) to
> > minimize column loads, but in fact I need both the closest row before
> > and the closest row after to determine which one is at the closest
> > date, so I don't use getClosestRowBefore.
> >
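To make the before/after comparison concrete, here is a minimal sketch in plain Java, independent of the HBase API. It assumes the dates of one URL have already been loaded into a sorted set as millisecond timestamps; the class and method names (ClosestDate, closest) are made up for the illustration, and breaking ties in favour of the older date is an arbitrary choice.

```java
import java.util.TreeSet;

class ClosestDate {
    /**
     * Returns the stored timestamp nearest to the requested one,
     * or null if the set is empty. Ties go to the older timestamp
     * (an arbitrary choice for this sketch).
     */
    public static Long closest(TreeSet<Long> dates, long requested) {
        Long before = dates.floor(requested);   // greatest stored date <= requested
        Long after = dates.ceiling(requested);  // smallest stored date >= requested
        if (before == null) {
            return after;
        }
        if (after == null) {
            return before;
        }
        // Both neighbours exist: keep whichever is nearer to the request.
        return (requested - before <= after - requested) ? before : after;
    }

    public static void main(String[] args) {
        TreeSet<Long> dates = new TreeSet<>();
        dates.add(1000L);
        dates.add(2000L);
        dates.add(5000L);
        System.out.println(closest(dates, 2900L));
        System.out.println(closest(dates, 3600L));
    }
}
```

Both neighbours are needed because a lookup that only goes backwards would return 2000 for a request at 3600, although 5000 is nearer.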
> > Solution 2 seems to be a good alternative: I could have the same
> > functionality with the same process, but the date would be stored
> > once per row insert (as a timestamp) instead of once per column.
>
>
>
> This seems like a better hbase fit.
>
>
>
> >
> >
> > Solution 3 implies only one insert per row key, but dramatically
> > increases the number of rows.
> >
>
> Yeah, but you can scan them quickly. Good for finding date ranges
> (until we enrich the API and allow get/scan between date ranges).
> You'll probably have to do as HBase does internally: do a little trick
> so the newest insert shows first, rather than last.
>
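If I understand the trick, one common way to make the newest insert sort first is to store (Long.MAX_VALUE - timestamp) in the row key, zero-padded so that lexicographic order matches numeric order. The sketch below is plain Java with made-up names (ReverseKey, rowKey, timestampOf); the '@' separator is borrowed from my case 3 example, and it assumes non-negative millisecond timestamps. It does not claim to be what HBase does internally.

```java
import java.util.Locale;
import java.util.TreeSet;

class ReverseKey {
    /** Builds url + '@' + zero-padded (Long.MAX_VALUE - timestamp). */
    public static String rowKey(String url, long timestampMillis) {
        long reversed = Long.MAX_VALUE - timestampMillis;
        // Long.MAX_VALUE has 19 decimal digits; padding to 19 keeps
        // string order identical to numeric order.
        return url + "@" + String.format(Locale.ROOT, "%019d", reversed);
    }

    /** Recovers the original timestamp from a key built by rowKey. */
    public static long timestampOf(String rowKey) {
        String reversed = rowKey.substring(rowKey.lastIndexOf('@') + 1);
        return Long.MAX_VALUE - Long.parseLong(reversed);
    }

    public static void main(String[] args) {
        // A sorted set stands in for the sorted row keys of a table.
        TreeSet<String> keys = new TreeSet<>();
        keys.add(rowKey("www.google.com", 1000L));
        keys.add(rowKey("www.google.com", 3000L));
        keys.add(rowKey("www.google.com", 2000L));
        for (String key : keys) {
            System.out.println(key + " -> " + timestampOf(key));
        }
    }
}
```

With keys built this way, a scan starting at the key for the requested date reaches the records at that date or older first, newest to oldest.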
We can have thousands of different version dates for some URLs.
Is it possible (or will it be) to load column names without loading the cell content?
Same question for the timestamps?
> St.Ack
>
> >
> > What is the best solution to ensure the best random access time?
> >
> > Jérôme Thièvre
> >
>