Do the sentences need to be sorted?
If not, you could use a numeric prefix on the row key. Keep track of
the highest prefix and use that range to select a prefix at random.
Then start a scanner at that prefix.
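
A rough sketch of the idea, in case it helps. This is not the HBase client
API: SortedTable is a hypothetical in-memory stand-in for a table sorted by
row key, and the 12-digit zero-padded prefix, the "|" separator and the
helper names are all assumptions made for illustration. The padding matters
because row keys sort lexicographically, so unpadded numbers would sort out
of order.

```python
import bisect
import random

PREFIX_WIDTH = 12  # assumed width; zero-padding keeps keys in numeric order


def make_key(prefix, sentence):
    """Build a row key such as '000000000042|Hello World'."""
    return f"{prefix:0{PREFIX_WIDTH}d}|{sentence}"


class SortedTable:
    """Hypothetical in-memory stand-in for a table sorted by row key."""

    def __init__(self):
        self._keys = []
        self.max_prefix = -1  # highest prefix written so far

    def put(self, prefix, sentence):
        # Insert the key in sorted position and remember the highest prefix.
        bisect.insort(self._keys, make_key(prefix, sentence))
        self.max_prefix = max(self.max_prefix, prefix)

    def scan(self, start_key, limit):
        """Return up to `limit` keys >= start_key, like a scanner opened there."""
        i = bisect.bisect_left(self._keys, start_key)
        return self._keys[i:i + limit]


def random_rows(table, count, rng=random):
    """Pick a random prefix in [0, max_prefix] and scan `count` rows from it."""
    prefix = rng.randint(0, table.max_prefix)
    start_key = f"{prefix:0{PREFIX_WIDTH}d}"
    return table.scan(start_key, count)
```

If some prefixes are missing, the scan simply starts at the next key that
exists, so gaps in the numbering are tolerated. A scan that starts near the
end of the table may return fewer rows than requested.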

~Tim.

2009/1/10 Jim Twensky <[email protected]>:
> Hello,
>
> I have an HBase table that contains sentences as row keys and a few numeric
> values as columns. A simple abstract model of the table looks like the
> following:
>
> ---------------------------------------------------------------------------
> Sentence    | frequency:value | probability:value-0 | probability:value-2
> ---------------------------------------------------------------------------
> Hello World | 5               | 0.000545321         | 0.002368204
>      .      |        .        |          .          |          .
>      .      |        .        |          .          |          .
>      .      |        .        |          .          |          .
> ---------------------------------------------------------------------------
>
>
> I create the table and load it using Hadoop and there are hundreds of
> billions of entries in it. I use this table to solve an optimization problem
> using a hill climbing/simulated annealing method. Basically, I need to
> change the likelihood values randomly. For example, I need to change say the
> first 5 rows starting at the 112th row and do some calculations and so on...
>
> Now the problem is, I can't see an easy way to access the n'th row
> directly. If I were using a traditional RDBMS, I'd add another column and
> auto-increment it each time I added a new row, but this is not possible since
> I load the table using Hadoop and there are parallel insertions taking
> place simultaneously. A quick and dirty way to do this might be adding a new
> index column after I load and initialize the table, but the table is huge and
> it doesn't seem right to me. Another bad approach would be to use a scanner
> starting from the first row and calling Scanner.next() n times inside a for
> loop to access the n'th row, which also seems very slow. Any ideas on how I
> could do it more efficiently?
>
> Thanks in advance,
> Jim
>
