On Sat, Jul 18, 2009 at 1:30 AM, Miles Osborne <[email protected]> wrote:
> i wish people would consider more often using Hypertable / Hbase etc in
> algorithms: there are times when you want random access and as a
> community, i think we need to put more effort into working out good ways
> to use all the options available.

I think that this is very sage advice. I would actually add Lucene to this
list. By doing an implicit matrix multiply and sparsification, it provides
an extraordinarily useful primitive operation.

> currently as a background task I'm thinking how to Hadoop-ify our Machine
> Translation system; this involves random access to big tables of
> string-value pairs, as well as a few other tables. compressed, these can
> be 20G each and we need to hit these tables 100s of thousands of times
> per sentence we translate.

For translating a single sentence, this is a very plausible design option.

> so, the research questions here then become how to (a) modify the Machine
> Translation decoding procedure to best batch up table requests --Google
> have published on this-- and more interestingly, try to be more clever
> about whether a network request actually needs to be made.

And this problem is also very interesting if you consider how these
operations can be interleaved and reordered if you are translating
thousands of sentences at a time.

Can you rephrase the problem so that the map takes a single sentence and
emits all of the queries it would like to do for that sentence? The map
would also inject all of the tables that you want to query. Then reduce can
group by table key and "perform" the lookups for each key value. Then a
second map-reduce would reorganize the results back to the sentence level.
If your tables exceed memory, or if sorting the batched queries is
asymptotically faster than performing them as individual random lookups,
this could be much faster.

-- 
Ted Dunning, CTO DeepDyve
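P.S. A minimal in-process sketch of the two-pass scheme above, in case it
helps make the data flow concrete. All names here (the toy `PHRASE_TABLE`
and `LM_TABLE`, the function names) are hypothetical stand-ins: in the real
system the tables would be the 20G compressed stores and the two passes
would be two Hadoop jobs, with the shuffles doing the grouping that the
dictionaries simulate here.

```python
from collections import defaultdict

# Hypothetical stand-ins for the big string-value tables; in practice
# these would be 20G compressed tables served from HBase or similar.
PHRASE_TABLE = {"hello": "bonjour", "world": "monde", "good": "bon"}
LM_TABLE = {"hello": -1.2, "world": -2.5, "good": -0.7}
TABLES = {"phrase": PHRASE_TABLE, "lm": LM_TABLE}

def map_emit_queries(sentence_id, sentence):
    """Pass 1 map: emit every lookup this sentence would like to do,
    keyed by (table, query key) so the shuffle groups identical lookups."""
    for word in sentence.split():
        for table_name in TABLES:
            yield (table_name, word), sentence_id

def reduce_perform_lookups(grouped):
    """Pass 1 reduce: each (table, key) group is looked up exactly once,
    then the result is re-emitted once per requesting sentence."""
    for (table_name, key), sentence_ids in grouped.items():
        value = TABLES[table_name].get(key)  # one lookup serves N requests
        for sid in sentence_ids:
            yield sid, (table_name, key, value)

def batch_lookup(sentences):
    # Shuffle for pass 1: group query records by (table, key).
    by_query = defaultdict(list)
    for sid, sent in enumerate(sentences):
        for query_key, sid_out in map_emit_queries(sid, sent):
            by_query[query_key].append(sid_out)
    # Pass 2: regroup the lookup results back to the sentence level.
    by_sentence = defaultdict(list)
    for sid, result in reduce_perform_lookups(by_query):
        by_sentence[sid].append(result)
    return dict(by_sentence)

results = batch_lookup(["hello world", "good world"])
# results[sid] now holds every (table, key, value) that sentence asked for,
# so the decoder can proceed with purely local data.
```

The point of the grouping step is that duplicate lookups across sentences
(here, "world" appearing in both) hit the table only once, and in the real
Hadoop version the sort in the shuffle replaces hundreds of thousands of
random network requests per sentence.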
