On Sat, Jul 18, 2009 at 1:30 AM, Miles Osborne <[email protected]> wrote:
> i wish people would consider more often using Hypertable / Hbase etc in
> algorithms: there are times when you want random access and as a
> community, i think we need to put more effort into working out good ways
> to use all the options available.

I think that this is very sage advice. I would actually add Lucene to this
list. By doing an implicit matrix multiply and sparsification, it provides
an extraordinarily useful primitive operation.

> currently as a background task I'm thinking how to Hadoop-ify our Machine
> Translation system; this involves random access to big tables of
> string-value pairs, as well as a few other tables. compressed, these can
> be 20G each and we need to hit these tables 100s of thousands of times
> per sentence we translate.

For translating a single sentence, this is a very plausible design option.

> so, the research questions here then become how to (a) modify the Machine
> Translation decoding procedure to best batch up table requests --Google
> have published on this-- and more interestingly, try to be more clever
> about whether a network request actually needs to be made.

And this problem is also very interesting if you consider how these
operations can be interleaved and reordered if you are translating
thousands of sentences at a time.

Can you rephrase the problem so that the map takes a single sentence and
emits all of the queries it would like to do for that sentence? The map
would also inject all of the tables that you want to query. Then reduce can
group by table key and "perform" the lookups for each key value. Then a
second map-reduce would reorganize the results back to the sentence level.
If your tables exceed memory, or if sorting the batched queries is
asymptotically faster than performing them as individual random lookups,
this could be much faster.

-- 
Ted Dunning, CTO DeepDyve
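P.S. A minimal in-process sketch of the two-pass scheme above, in case it
helps make the data flow concrete. All names here (the toy `PHRASE_TABLE`
and `LM_TABLE`, the function names) are hypothetical stand-ins: in the real
system the tables would be the 20G compressed stores and the two passes
would be two Hadoop jobs, with the shuffles doing the grouping that the
dictionaries simulate here.

```python
from collections import defaultdict

# Hypothetical stand-ins for the big string-value tables; in practice
# these would be 20G compressed tables served from HBase or similar.
PHRASE_TABLE = {"hello": "bonjour", "world": "monde", "good": "bon"}
LM_TABLE = {"hello": -1.2, "world": -2.5, "good": -0.7}
TABLES = {"phrase": PHRASE_TABLE, "lm": LM_TABLE}

def map_emit_queries(sentence_id, sentence):
    """Pass 1 map: emit every lookup this sentence would like to do,
    keyed by (table, query key) so the shuffle groups identical lookups."""
    for word in sentence.split():
        for table_name in TABLES:
            yield (table_name, word), sentence_id

def reduce_perform_lookups(grouped):
    """Pass 1 reduce: each (table, key) group is looked up exactly once,
    then the result is re-emitted once per requesting sentence."""
    for (table_name, key), sentence_ids in grouped.items():
        value = TABLES[table_name].get(key)  # one lookup serves N requests
        for sid in sentence_ids:
            yield sid, (table_name, key, value)

def batch_lookup(sentences):
    # Shuffle for pass 1: group query records by (table, key).
    by_query = defaultdict(list)
    for sid, sent in enumerate(sentences):
        for query_key, sid_out in map_emit_queries(sid, sent):
            by_query[query_key].append(sid_out)
    # Pass 2: regroup the lookup results back to the sentence level.
    by_sentence = defaultdict(list)
    for sid, result in reduce_perform_lookups(by_query):
        by_sentence[sid].append(result)
    return dict(by_sentence)

results = batch_lookup(["hello world", "good world"])
# results[sid] now holds every (table, key, value) that sentence asked for,
# so the decoder can proceed with purely local data.
```

The point of the grouping step is that duplicate lookups across sentences
(here, "world" appearing in both) hit the table only once, and in the real
Hadoop version the sort in the shuffle replaces hundreds of thousands of
random network requests per sentence.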
