Unfortunately, yes, the sentences need to be sorted; I take advantage of the lexicographical ordering of the sentences for another purpose. Even if I didn't, how would I generate the prefixes? Do you mean the number prefixes should be in the range [1-n], where n is the number of rows in the table? Since I use Hadoop to pull the data in, I can't see a trivial way to generate such prefixes, but I may be missing something obvious.
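The closest thing I can think of, if the sorting requirement went away, is deriving a prefix from each reduce task's partition number plus a per-task counter, along the lines of the sketch below. This is only a sketch: the class name and the key format are made up, and it assumes the 0.20-style org.apache.hadoop.mapreduce Reducer API.

    // Sketch only: derive the numeric prefix from the reduce task's partition
    // number plus a per-task counter, so no global coordination is needed.
    // Class name and key format are made up; zero-padding keeps HBase's
    // lexicographic key order consistent with the numeric order.
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PrefixKeyReducer extends Reducer<Text, Text, Text, Text> {
        private long counter = 0;  // local to this task, no locking required

        @Override
        protected void reduce(Text sentence, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int taskId = context.getTaskAttemptID().getTaskID().getId();
            String rowKey = String.format("%05d-%012d-%s", taskId, counter++, sentence);
            for (Text value : values) {
                context.write(new Text(rowKey), value);
            }
        }
    }

The catch is that this gives prefixes that are dense within each task but not [1-n] across the whole table, so picking a prefix uniformly at random would not pick a row uniformly at random unless every task writes roughly the same number of rows. (My reading of the scanner side of your suggestion is sketched after the quoted thread below, to check that I understand it.)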
Jim

On Sat, Jan 10, 2009 at 11:55 AM, Tim Sell <[email protected]> wrote:
> Do the sentences need to be sorted?
> If not, you could use a number prefix on the row key. Keep track of
> the highest prefix and use that range to select a prefix randomly.
> Then start a scanner at that prefix.
>
> ~Tim.
>
> 2009/1/10 Jim Twensky <[email protected]>:
> > Hello,
> >
> > I have an HBase table that contains sentences as row keys and a few
> > numeric values as columns. A simple abstract model of the table looks
> > like the following:
> >
> > -----------------------------------------------------------------------------
> > Sentence      | frequency:value | probability:value-0 | probability:value-2
> > -----------------------------------------------------------------------------
> > Hello World   | 5               | 0.000545321         | 0.002368204
> > ...           | ...             | ...                 | ...
> > -----------------------------------------------------------------------------
> >
> > I create the table and load it using Hadoop, and there are hundreds of
> > billions of entries in it. I use this table to solve an optimization
> > problem using a hill climbing/simulated annealing method. Basically, I
> > need to change the likelihood values randomly. For example, I need to
> > change, say, the first 5 rows starting at the 112th row, do some
> > calculations, and so on...
> >
> > Now the problem is, I can't see an easy way to access the n'th row
> > directly. If I were using a traditional RDBMS, I'd add another column
> > and auto-increment it each time I added a new row, but this is not
> > possible since I load the table using Hadoop and there are parallel
> > insertions taking place simultaneously. A quick and dirty way to do
> > this might be adding a new index column after I load and initialize the
> > table, but the table is huge and it doesn't seem right to me. Another
> > bad approach would be to use a scanner starting from the first row and
> > calling Scanner.next() n times inside a for loop to access the n'th
> > row, which also seems very slow. Any ideas on how I could do it more
> > efficiently?
> >
> > Thanks in advance,
> > Jim
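And just to make sure I follow the lookup side of the suggestion: keep track of the highest prefix at load time, pick a prefix at random, and open a scanner starting there. A rough sketch follows; HBase client class and method names vary across versions, so treat this as illustrative Java, and note that the table name "sentences", the tracked maxPrefix value, and the 12-digit prefix width are all made-up placeholders.

    import java.io.IOException;
    import java.util.Random;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RandomPrefixScan {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "sentences");  // placeholder table name

            long maxPrefix = 1000000L;  // highest prefix, tracked separately at load time
            long prefix = (long) (new Random().nextDouble() * maxPrefix);

            // Row keys are assumed to look like "<12-digit zero-padded prefix>-<sentence>",
            // so starting the scan at the bare prefix lands on the first row at
            // or after that prefix.
            Scan scan = new Scan();
            scan.setStartRow(Bytes.toBytes(String.format("%012d-", prefix)));

            ResultScanner scanner = table.getScanner(scan);
            try {
                int seen = 0;
                for (Result r : scanner) {
                    if (++seen > 5) break;  // e.g. touch 5 consecutive rows from here
                    byte[] p0 = r.getValue(Bytes.toBytes("probability"), Bytes.toBytes("value-0"));
                    System.out.printf("%s -> %s%n", Bytes.toString(r.getRow()), Bytes.toString(p0));
                    // ... recompute and write back the probability values here ...
                }
            } finally {
                scanner.close();
                table.close();
            }
        }
    }

The scan itself is cheap once positioned, since HBase seeks directly to the start row; the open question for me is still how to assign the prefixes during the parallel load without breaking the sentence ordering.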
