You could probably eliminate phase 2 if the output of phase 1 were stored in a
perfect-hashing table (say, using Hypertable).  This works by storing a
fingerprint for each shingle/count pair (a few bits) and organising the hash
table so that you never get collisions (hence the "perfect" hashing).

The table would be compact and --modulo network latency-- would have O(1)
lookup time for each shingle/count query.
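
(A rough sketch of what I mean, in plain Java rather than on top of Hypertable.
This is not a true perfect hash -- it falls back on linear probing -- but it
shows the fingerprint-plus-count packing and the roughly O(1) lookup; the
16-bit fingerprint, 48-bit count and power-of-two table size are just
illustrative choices:)

// Compact fingerprint table for shingle-hash -> count lookups.  Each slot is
// one long: the high 16 bits hold a fingerprint of the key, the low 48 bits
// the count.  Keys themselves are never stored, so a lookup has a small
// false-positive chance if two keys land on the same slot with the same
// fingerprint.
public class FingerprintCountTable {

    private final long[] slots;   // 0 marks an empty slot
    private final int mask;

    // capacity must be a power of two and should leave headroom over the
    // number of keys, or probing will degrade
    public FingerprintCountTable(int capacityPowerOfTwo) {
        slots = new long[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    private static int fingerprint(long shingleHash) {
        int fp = (int) ((shingleHash >>> 48) & 0xFFFF);
        return fp == 0 ? 1 : fp;              // reserve 0 for "empty"
    }

    public void put(long shingleHash, long count) {
        long entry = ((long) fingerprint(shingleHash) << 48) | (count & 0xFFFFFFFFFFFFL);
        int i = (int) (shingleHash & mask);
        while (slots[i] != 0) i = (i + 1) & mask;   // linear probing
        slots[i] = entry;
    }

    // Returns the stored count, or 0 if the shingle is (almost certainly) absent.
    public long get(long shingleHash) {
        int fp = fingerprint(shingleHash);
        int i = (int) (shingleHash & mask);
        while (slots[i] != 0) {
            if ((int) (slots[i] >>> 48) == fp) return slots[i] & 0xFFFFFFFFFFFFL;
            i = (i + 1) & mask;
        }
        return 0;
    }
}

A real perfect-hash construction (CHD, BDZ and friends) would assign every key
its own slot up front, so even the probing goes away.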

I wish people would more often consider using Hypertable / HBase etc. in
algorithms:  there are times when you want random access, and as a community
I think we need to put more effort into working out good ways to use all the
options available.  Currently, as a background task, I'm thinking about how to
Hadoop-ify our Machine Translation system; this involves random access to
big tables of string-value pairs, as well as a few other tables.
Compressed, these can be 20 GB each, and we need to hit these tables hundreds
of thousands of times per sentence we translate.  So the research questions
here become how to (a) modify the Machine Translation decoding procedure to
best batch up table requests --Google have published on this-- and, more
interestingly, (b) be more clever about whether a network request actually
needs to be made.  I have ideas here.
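
(To make (b) concrete -- and this is only a sketch of one obvious trick, not a
description of our decoder: keep a local Bloom filter over the remote table's
keys, answer the definitely-absent keys locally, and batch whatever survives
into a single request.  The RemoteTable interface and the hash mixing below
are hypothetical stand-ins for whatever store you actually sit on top of:)

import java.util.*;

// Hypothetical handle on a remote key/value store (HBase, Hypertable, ...).
interface RemoteTable {
    Map<String, String> batchGet(Collection<String> keys);  // one network round trip
}

class FilteredBatchLookup {

    private static final int HASHES = 3;   // number of Bloom-filter hash functions

    private final RemoteTable table;
    private final BitSet bloom;
    private final int bits;

    FilteredBatchLookup(RemoteTable table, int bits) {
        this.table = table;
        this.bloom = new BitSet(bits);
        this.bits = bits;
    }

    // Call once, offline, for every key known to exist in the remote table.
    void addKnownKey(String key) {
        for (int i = 0; i < HASHES; i++) bloom.set(index(key, i));
    }

    private int index(String key, int i) {
        // cheap double hashing; good enough for a sketch
        return Math.floorMod(key.hashCode() + i * 0x5bd1e995, bits);
    }

    private boolean maybeContains(String key) {
        for (int i = 0; i < HASHES; i++) if (!bloom.get(index(key, i))) return false;
        return true;   // "possibly present" -- Bloom filters never give false negatives
    }

    // Drops keys that cannot be in the table, then fetches the rest in one batch.
    Map<String, String> lookup(Collection<String> keys) {
        List<String> survivors = new ArrayList<String>();
        for (String k : keys) if (maybeContains(k)) survivors.add(k);
        if (survivors.isEmpty()) return Collections.emptyMap();
        return table.batchGet(survivors);
    }
}

Since the filter has no false negatives, correctness is unaffected; the only
cost of a false positive is an occasional wasted key in the batch.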

Miles



2009/7/18 Ted Dunning <[email protected]>

> Computing min-sets of shingle hashes is definitely the preferred approach.
> I think that a full implementation to find all duplicates in a corpus
> requires more than one map-reduce phase.
>
> phase 1:
> Convert to an inverted index that maps hash => docids for each hash in the
> minset of some document.
>
>    map: (_, document) => for (shingle-hash in min-set(document))
>                              emit(shingle-hash, docid)
>    reduce: (shingle-hash, docid-list) => (shingle-hash, docid-list)
>
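
For concreteness, phase 1 might look roughly like this with the
org.apache.hadoop.mapreduce API.  The shingle width, min-set size and use of
String.hashCode are illustrative choices on my part, not anything prescribed
above; input is assumed to arrive as (docid, text) pairs, e.g. via
KeyValueTextInputFormat:

import java.io.IOException;
import java.util.Arrays;
import java.util.TreeSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MinSetIndex {

    // map: (docid, document) => (shingle-hash, docid) for each hash in the min-set
    public static class MinSetMapper extends Mapper<Text, Text, LongWritable, Text> {

        private static final int SHINGLE_WORDS = 5;   // illustrative
        private static final int MINSET_SIZE = 20;    // illustrative

        @Override
        protected void map(Text docid, Text document, Context ctx)
                throws IOException, InterruptedException {
            String[] words = document.toString().split("\\s+");
            TreeSet<Long> minset = new TreeSet<Long>();   // the MINSET_SIZE smallest hashes
            for (int i = 0; i + SHINGLE_WORDS <= words.length; i++) {
                long h = String.join(" ",
                        Arrays.copyOfRange(words, i, i + SHINGLE_WORDS)).hashCode();
                minset.add(h);
                if (minset.size() > MINSET_SIZE) minset.pollLast();
            }
            for (long h : minset) ctx.write(new LongWritable(h), docid);
        }
    }

    // reduce: (shingle-hash, docids) => (shingle-hash, space-separated docid list)
    public static class DocListReducer extends Reducer<LongWritable, Text, LongWritable, Text> {

        @Override
        protected void reduce(LongWritable shingleHash, Iterable<Text> docids, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder list = new StringBuilder();
            for (Text d : docids) {
                if (list.length() > 0) list.append(' ');
                list.append(d.toString());
            }
            ctx.write(shingleHash, new Text(list.toString()));
        }
    }
}
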
> phase 2:
> Count the number of times that documents share a shingle hash value in their
> minsets.  Duplicate or near duplicate documents will share many shingle
> hashes.  The map explodes a list of documents into all pairs, the combine
> reduces this to counts, the reduce filters this to only output pairs with
> high enough counts.
>
>    map: (shingle-hash, docid-list) => ([doc1, doc2], 1)
>    combine: ([doc1, doc2], list-of-count) => ([doc1, doc2], sum(list-of-count))
>    reduce: ([doc1, doc2], list-of-count) => let n = sum(list-of-count)
>            if (n > threshold) emit (null, [doc1, doc2])
>
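
Again just as a sketch, phase 2 over the output of the job above; the
tab-separated pair encoding and the threshold value are illustrative.  The
pair is put in a canonical order so that (doc1, doc2) and (doc2, doc1) count
together:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SharedHashCount {

    private static final int THRESHOLD = 10;   // illustrative

    // map: (shingle-hash, docid-list) => ([doc1 TAB doc2], 1) for every pair
    public static class PairMapper extends Mapper<Text, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(Text shingleHash, Text docidList, Context ctx)
                throws IOException, InterruptedException {
            String[] docs = docidList.toString().split("\\s+");
            for (int i = 0; i < docs.length; i++) {
                for (int j = i + 1; j < docs.length; j++) {
                    String a = docs[i], b = docs[j];
                    String pair = a.compareTo(b) <= 0 ? a + "\t" + b : b + "\t" + a;
                    ctx.write(new Text(pair), ONE);
                }
            }
        }
    }

    // combine: partial sums, so the full pair explosion never crosses the network
    public static class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text pair, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int n = 0;
            for (IntWritable c : counts) n += c.get();
            ctx.write(pair, new IntWritable(n));
        }
    }

    // reduce: keep only pairs that share enough min-set hashes
    public static class ThresholdReducer extends Reducer<Text, IntWritable, NullWritable, Text> {

        @Override
        protected void reduce(Text pair, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int n = 0;
            for (IntWritable c : counts) n += c.get();
            if (n > THRESHOLD) ctx.write(NullWritable.get(), pair);
        }
    }
}
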
> If you only want to be able to test a single document at a time in near
> real-time, you can adapt this to build a Lucene index out of the minsets of
> each document (rather than the terms).  That Lucene index can be used to
> find duplicates in real-time.
>
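
A sketch of that Lucene variant, written against a recent Lucene API rather
than the 2009 one; the field names, the hex encoding of the hashes, and the
candidate count are illustrative:

import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class MinSetLuceneIndex {

    public static IndexWriter openWriter(String path) throws IOException {
        return new IndexWriter(FSDirectory.open(Paths.get(path)),
                               new IndexWriterConfig(new WhitespaceAnalyzer()));
    }

    // Index one document: its min-set hashes become the "terms" of the document.
    public static void addDocument(IndexWriter writer, String docid, long[] minset)
            throws IOException {
        Document doc = new Document();
        doc.add(new StringField("docid", docid, Field.Store.YES));
        for (long h : minset) {
            doc.add(new StringField("shingle", Long.toHexString(h), Field.Store.NO));
        }
        writer.addDocument(doc);
    }

    // Query with a new document's min-set; documents sharing many hashes score highest.
    public static TopDocs findCandidates(IndexSearcher searcher, long[] queryMinset)
            throws IOException {
        BooleanQuery.Builder query = new BooleanQuery.Builder();
        for (long h : queryMinset) {
            query.add(new TermQuery(new Term("shingle", Long.toHexString(h))),
                      BooleanClause.Occur.SHOULD);
        }
        return searcher.search(query.build(), 20);
    }
}
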
>
> On Fri, Jul 17, 2009 at 12:49 PM, Miles Osborne <[email protected]> wrote:
>
> > http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/CPM%202000.pdf
> >
> > it would be very nice if someone could implement a randomised approach
> > using Hadoop ...
> >
> > (this should be fairly easy to do, since you have to convert each document
> > into a set of shingles --could be done in one mapper-- and then sort these
> > documents, plus some extra twists)
>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>


