Unfortunately, yes, the sentences need to be sorted. I take advantage of the
lexicographical ordering of the sentences for another purpose. Even if I
didn't, how could I generate the prefixes? Do you mean the number prefixes
should be in the range [1, n], where n is the number of rows in the table?
Since I use Hadoop to pull the data in, I can't see a trivial way to generate
number prefixes, but I may be missing something obvious.
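
Is something like the following what you had in mind for the write side? This
is only a rough, untested sketch: instead of tracking the highest prefix, it
derives the prefix from a hash of the sentence itself with a fixed bucket
count, so the parallel map tasks need no coordination. The bucket count, the
input format, and the class name are made up for illustration, the HBase
client calls may differ from the version we actually run, and of course this
destroys the lexicographic ordering I rely on.

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Rough sketch only: prefix each row key with a bucket number derived from a
// hash of the sentence, so parallel map tasks need no coordination. The
// bucket count, input format and column names are placeholders.
public class PrefixedLoadMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  private static final int SALT_BUCKETS = 1024; // the "highest prefix" is then fixed

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Assume each input line is: sentence <TAB> frequency
    String[] fields = value.toString().split("\t", 2);
    String sentence = fields[0];

    int bucket = (sentence.hashCode() & Integer.MAX_VALUE) % SALT_BUCKETS;
    // e.g. "0042|Hello World" -- note this breaks global lexicographic order
    byte[] rowKey = Bytes.toBytes(String.format("%04d|%s", bucket, sentence));

    Put put = new Put(rowKey);
    put.addColumn(Bytes.toBytes("frequency"), Bytes.toBytes("value"),
        Bytes.toBytes(fields[1]));
    context.write(new ImmutableBytesWritable(rowKey), put);
  }
}

A matching sketch of the read side (pick a random prefix and start a scanner
at it) follows after your quoted mail below.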

Jim

On Sat, Jan 10, 2009 at 11:55 AM, Tim Sell <[email protected]> wrote:

> Do the sentences need to be sorted?
> If not, you could use a number prefix on the row key. Keep track of
> the highest prefix and use that range to select a prefix randomly.
> Then start a scanner at that prefix.
>
> ~Tim.
>
> 2009/1/10 Jim Twensky <[email protected]>:
> > Hello,
> >
> > I have an HBase table that contains sentences as row keys and a few
> > numeric values as columns. A simple abstract model of the table looks
> > like the following:
> >
> > ---------------------------------------------------------------------------
> > Sentence     | frequency:value | probability:value-0 | probability:value-2
> > ---------------------------------------------------------------------------
> > Hello World  |        5        |     0.000545321     |     0.002368204
> > ...          |       ...       |         ...         |         ...
> > ---------------------------------------------------------------------------
> >
> >
> > I create the table and load it using Hadoop, and there are hundreds of
> > billions of entries in it. I use this table to solve an optimization
> > problem using a hill climbing/simulated annealing method. Basically, I
> > need to change the likelihood values randomly. For example, I need to
> > change, say, the first 5 rows starting at the 112th row, do some
> > calculations, and so on...
> >
> > Now the problem is, I can't see an easy way to access the n'th row
> > directly. If I were using a traditional RDBMS, I'd add another column
> > and auto-increment it each time I added a new row, but this is not
> > possible since I load the table using Hadoop and there are parallel
> > insertions taking place simultaneously. A quick and dirty way to do
> > this might be adding a new index column after I load and initialize
> > the table, but the table is huge and it doesn't seem right to me.
> > Another bad approach would be to use a scanner starting from the first
> > row and call Scanner.next() n times inside a for loop to access the
> > n'th row, which also seems very slow. Any ideas on how I could do this
> > more efficiently?
> >
> > Thanks in advance,
> > Jim
> >
>
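
For completeness, here is what I understand the read side of your suggestion
to be: pick a random prefix and start a scanner at it. Again, this is only a
rough, untested sketch; it assumes the salted keys from the write-side sketch
above, a newer HBase client API (Connection/Table/Scan) than the one we
actually run, and a made-up table name and batch size.

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Rough sketch: pick a random bucket prefix, start a scanner at that prefix,
// and read a handful of rows. SALT_BUCKETS must match the value used at load
// time; the table name and batch size are placeholders.
public class RandomPrefixReader {

  private static final int SALT_BUCKETS = 1024;

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("sentences"))) {

      int bucket = new Random().nextInt(SALT_BUCKETS);
      Scan scan = new Scan()
          .withStartRow(Bytes.toBytes(String.format("%04d|", bucket)))
          .setLimit(5); // e.g. grab 5 rows starting at a random prefix

      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result result : scanner) {
          byte[] freq = result.getValue(Bytes.toBytes("frequency"),
                                        Bytes.toBytes("value"));
          System.out.println(Bytes.toString(result.getRow()) + " -> "
              + Bytes.toString(freq));
        }
      }
    }
  }
}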
