Computing min-sets of shingle hashes is definitely the preferred approach.
I think that a full implementation to find all duplicates in a corpus
requires more than one map-reduce phase.
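
To be concrete about what I mean by the min-set, here is a rough Python
sketch.  The shingle size and min-set size are just placeholder values, and
MD5 is only a stand-in for whatever hash you prefer:

    import hashlib

    def shingle_min_set(text, k=5, m=100):
        """Return the m smallest hashes over all k-token shingles of text."""
        tokens = text.split()
        hashes = set()
        for i in range(max(len(tokens) - k + 1, 1)):
            shingle = " ".join(tokens[i:i + k])
            hashes.add(int(hashlib.md5(shingle.encode("utf-8")).hexdigest(), 16))
        return sorted(hashes)[:m]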

phase 1:
Convert the corpus to an inverted index that maps each shingle hash to the
docids of the documents whose min-set contains that hash.

    map: (docid, document) => for (shingle-hash in min-set(document))
                                  emit(shingle-hash, docid)
    reduce: (shingle-hash, docid-list) => emit(shingle-hash, docid-list)
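
Just to pin down the logic, here is the same phase written as plain local
Python (using the shingle_min_set sketch above; this only simulates the map
and reduce in memory, it is not actual Hadoop code):

    from collections import defaultdict

    def phase1(corpus, k=5, m=100):
        """corpus: dict of docid => text.  Returns shingle-hash => [docid, ...]."""
        index = defaultdict(list)
        for docid, text in corpus.items():           # the map step
            for h in shingle_min_set(text, k, m):
                index[h].append(docid)                # grouping by hash plays the shuffle/reduce
        return index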

phase 2:
Count the number of times that documents share a shingle hash value in their
min-sets.  Duplicate or near-duplicate documents will share many shingle
hashes.  The map explodes each docid list into all pairs of documents, the
combine collapses repeated pairs into partial counts, and the reduce sums the
counts and emits only the pairs with high enough counts.

    map: (shingle-hash, docid-list) => for ([doc1, doc2] in pairs(docid-list))
                                           emit([doc1, doc2], 1)
    combine: ([doc1, doc2], list-of-count) => emit([doc1, doc2], sum(list-of-count))
    reduce: ([doc1, doc2], list-of-count) => let n = sum(list-of-count)
                                             if (n > threshold) emit(null, [doc1, doc2])
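
Again as a plain local Python sketch of the same logic (itertools.combinations
does the pair explosion, and the threshold of 20 shared hashes is an arbitrary
placeholder):

    from collections import Counter
    from itertools import combinations

    def phase2(index, threshold=20):
        """index: shingle-hash => [docid, ...].  Returns likely duplicate pairs."""
        counts = Counter()
        for docids in index.values():
            # map: explode each posting list into all document pairs
            for pair in combinations(sorted(set(docids)), 2):
                counts[pair] += 1                     # combine/reduce: sum the 1s
        # reduce: keep pairs sharing more than threshold min-set hashes
        return [pair for pair, n in counts.items() if n > threshold]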

If you only want to be able to test a single document at a time in near
real-time, you can adapt this to build a Lucene index out of the min-sets of
each document (rather than the terms).  That Lucene index can then be used to
find duplicates in real-time.
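
The real-time variant amounts to treating each min-set hash as an indexed
term.  Just as an in-memory stand-in for the Lucene index (reusing the phase 1
index and shingle_min_set from the sketches above, not Lucene itself):

    from collections import Counter

    def find_near_duplicates(index, text, k=5, m=100, threshold=20):
        """Score indexed docs by how many min-set hashes they share with text."""
        counts = Counter()
        for h in shingle_min_set(text, k, m):         # query by the new doc's min-set
            for docid in index.get(h, []):
                counts[docid] += 1
        return [docid for docid, n in counts.items() if n > threshold]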


Fri, Jul 17, 2009 at 12:49 PM, Miles Osborne <[email protected]> wrote:

>
>
> http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/CPM%202000.pdf
>
> it would be very nice if someone could implement a randomised approach
> using
> Hadoop ...
>
> (this should be fairly easy to do, since you have to convert each document
> into a set of shingles --could be done in one mapper-- and then sort these
> documents plus some extra twists)




-- 
Ted Dunning, CTO
DeepDyve
