Basically a decoder does a beam search through a vast set of options, guided by various probabilities (those tables I mentioned are mainly language models). It would be very interesting indeed to really open up the decoder in this manner: get MR to do a parallel search over the search graph. My initial guess is that this would be too slow, since the actual decoding can be done in-memory on a typical Dell box (e.g. in 32G). That is per sentence; per document is a different matter.

What I had in mind was something less radical: take a document, split it into shards, translate each shard (as a mapper) and have the reducer recombine the translated sentences (sketched below). Hypertable would be used to serve the language models and phrase tables (string-to-string mappings, specifying the set of possible ways some source fragment gets translated into target fragments).

Language models are interesting in that the vast majority of requests won't be for a valid n-gram (the decoder searches over a space of possible translations, most of which are garbage). So one possible speed-up would be a small in-memory, on-board cache of the table (as a Bloom filter). More interesting still would be to consider the stream of language-model requests and see whether it can itself be modelled (think of this as, say, Bernoulli trials: do I get a hit or not). If there is structure in this stream, then I could design an algorithm to learn when the decoder is in a sequence of misses (exploring some low-probability space of translations) or when it has switched to, say, a high-probability region.
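To make the shard-and-recombine plan concrete, here is a rough sketch against the Hadoop 0.20 "mapreduce" API. The Decoder class is a hypothetical stand-in for our phrase-based decoder, not anything that exists today; everything else is stock Hadoop. With TextInputFormat each record is one source sentence keyed by its byte offset, and a single reducer writes the translations back out in offset order.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ShardTranslate {

  /** Hypothetical stand-in for the real phrase-based decoder. */
  public static class Decoder {
    public Decoder(Configuration conf) {
      // in reality: open the phrase tables and language model here
    }
    public String translate(String source) {
      return source;  // placeholder: a real decoder would run its beam search here
    }
  }

  /** Each input record is one source sentence; the byte-offset key remembers
   *  where the sentence sat in the document. */
  public static class TranslateMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {

    private Decoder decoder;

    @Override
    protected void setup(Context context) {
      decoder = new Decoder(context.getConfiguration());  // load models once per shard
    }

    @Override
    protected void map(LongWritable offset, Text sentence, Context context)
        throws IOException, InterruptedException {
      context.write(offset, new Text(decoder.translate(sentence.toString())));
    }
  }

  /** Run with a single reducer: keys arrive sorted by offset, so writing the
   *  translations back out recombines the document in its original order. */
  public static class RecombineReducer
      extends Reducer<LongWritable, Text, LongWritable, Text> {

    @Override
    protected void reduce(LongWritable offset, Iterable<Text> translations,
                          Context context) throws IOException, InterruptedException {
      for (Text t : translations) {
        context.write(offset, t);
      }
    }
  }
}
```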
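And the Bloom-filter idea, roughly: build the filter offline over the n-grams the language model actually contains, ship it to every node, and consult it before any network request; a filter miss is a guaranteed table miss, so only probable hits (and the occasional false positive) go over the wire. The sketch uses Hadoop's bundled org.apache.hadoop.util.bloom.BloomFilter; the RemoteTable interface standing in for the Hypertable client is hypothetical.

```java
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

/** Gate remote language-model lookups with a local Bloom filter:
 *  if the n-gram is not in the filter it is definitely not in the LM,
 *  so we can answer "no hit" without touching the network. */
public class BloomGatedLanguageModel {

  /** Hypothetical stand-in for the Hypertable (or HBase) client. */
  public interface RemoteTable {
    Double lookup(String ngram);   // null when the n-gram is absent
  }

  private final BloomFilter filter;
  private final RemoteTable table;

  public BloomGatedLanguageModel(RemoteTable table, int bits, int hashes) {
    this.table = table;
    this.filter = new BloomFilter(bits, hashes, Hash.MURMUR_HASH);
  }

  /** Called once, offline, for every n-gram the language model actually contains. */
  public void addKnownNgram(String ngram) {
    filter.add(new Key(ngram.getBytes()));
  }

  /** The decoder's query path: most garbage n-grams die here, locally. */
  public Double probability(String ngram) {
    if (!filter.membershipTest(new Key(ngram.getBytes()))) {
      return null;                   // definite miss: no network request
    }
    return table.lookup(ngram);      // probable hit (or a false positive)
  }
}
```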
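The hit/miss-stream idea is much more speculative. As a strawman, even an exponentially weighted estimate of the recent hit rate would give the decoder a signal that it has wandered into a low-probability region (where lookups could be batched more aggressively or pruned earlier) versus a high-probability one. The decay and threshold constants below are purely illustrative, not tuned values.

```java
/** Strawman for detecting structure in the hit/miss stream: treat each
 *  language-model request as a Bernoulli trial and keep an exponentially
 *  weighted estimate of the hit rate. */
public class HitStreamModel {

  private static final double DECAY = 0.95;       // weight on the past
  private static final double LOW_REGION = 0.05;  // below this, assume a run of misses

  private double hitRate = 0.5;                   // uninformative starting estimate

  /** Record the outcome of one LM request. */
  public void observe(boolean hit) {
    hitRate = DECAY * hitRate + (1.0 - DECAY) * (hit ? 1.0 : 0.0);
  }

  /** True when recent requests have almost all been misses. */
  public boolean inLowProbabilityRegion() {
    return hitRate < LOW_REGION;
  }
}
```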
Miles

2009/7/18 Ted Dunning <[email protected]>

> On Sat, Jul 18, 2009 at 1:30 AM, Miles Osborne <[email protected]> wrote:
>
> > I wish people would consider more often using Hypertable / HBase etc. in
> > algorithms: there are times when you want random access and, as a
> > community, I think we need to put more effort into working out good ways
> > to use all the options available.
>
> I think that this is very sage advice.  I would actually add Lucene to this
> list.  By doing an implicit matrix multiply and sparsification, it provides
> an extraordinarily useful primitive operation.
>
> > Currently, as a background task, I'm thinking how to Hadoop-ify our
> > Machine Translation system; this involves random access to big tables of
> > string-value pairs, as well as a few other tables.  Compressed, these can
> > be 20G each and we need to hit these tables 100s of thousands of times
> > per sentence we translate.
>
> For translating a single sentence, this is a very plausible design option.
>
> > So, the research questions here then become how to (a) modify the Machine
> > Translation decoding procedure to best batch up table requests --Google
> > have published on this-- and, more interestingly, try to be more clever
> > about whether a network request actually needs to be made.
>
> And this problem is also very interesting if you consider how these
> operations can be interleaved and reordered if you are translating
> thousands of sentences at a time.  Can you rephrase the problem so that the
> map takes a single sentence and emits all of the queries it would like to
> do for that sentence?  The map would also inject all of the tables that you
> want to query.  Then reduce can group by table key and "perform" the
> lookups for each key value.  Then a second map-reduce would reorganize the
> results back to the sentence level.
>
> If your tables exceed memory, or if sorting many queries is asymptotically
> faster than issuing them as individual random lookups, this could be much
> faster.
>
> --
> Ted Dunning, CTO
> DeepDyve
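For reference, a rough sketch of the batched-lookup pipeline Ted describes above: the first job's mapper emits one record per table query, tagged with the sentence that asked for it; the reducer sees every request for a given key together and performs the lookup once (or simply joins against the table data, if the tables are injected as extra map inputs). The second job, which regroups the pass-1 output by its sentence-id key, is omitted for brevity. QueryExtractor and Tables are hypothetical helpers, not existing APIs.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BatchedLookups {

  /** Pass 1, map: one source sentence in, one record per table query out.
   *  The output key is the query ("table-name \t ngram-or-phrase"); the value
   *  remembers which sentence asked (here, its byte offset used as an id). */
  public static class EmitQueriesMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable sentenceId, Text sentence, Context context)
        throws IOException, InterruptedException {
      for (String query : QueryExtractor.queriesFor(sentence.toString())) {
        context.write(new Text(query), new Text(sentenceId.toString()));
      }
    }
  }

  /** Pass 1, reduce: all sentences that want the same key arrive together, so
   *  the table is consulted once per distinct key.  Output is keyed by
   *  sentence id, ready for the second job to regroup by sentence. */
  public static class LookupReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text query, Iterable<Text> sentenceIds, Context context)
        throws IOException, InterruptedException {
      String value = Tables.lookup(query.toString());  // hypothetical table access
      if (value == null) return;                       // most queries miss
      for (Text sentenceId : sentenceIds) {
        context.write(sentenceId, new Text(query + "\t" + value));
      }
    }
  }

  /** Hypothetical stand-ins for the decoder's query generator and for
   *  whatever serves the phrase tables / language model. */
  static class QueryExtractor {
    static Iterable<String> queriesFor(String sentence) {
      return java.util.Arrays.asList(sentence.split("\\s+"));  // placeholder: unigrams only
    }
  }
  static class Tables {
    static String lookup(String query) { return null; }        // placeholder
  }
}
```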
