In this context, does "min-set" refer to this paper: http://portal.acm.org/citation.cfm?id=1226882 ?
> If you only want to be able to test a single document at a time in near
> real-time, you can adapt this to build a Lucene index out of the min-sets
> of each document (rather than the terms). That Lucene index can be used
> to find duplicates in real-time.

Interesting. Thanks!

On Fri, Jul 17, 2009 at 4:01 PM, Ted Dunning <[email protected]> wrote:

> Computing min-sets of shingle hashes is definitely the preferred approach.
> I think that a full implementation to find all duplicates in a corpus
> requires more than one map-reduce phase.
>
> Phase 1:
> Convert to an inverted index that maps hash => docids for each hash in the
> min-set of some document.
>
>   map:    (_, document) => for (shingle-hash in min-set(document))
>                                emit(shingle-hash, docid)
>   reduce: (shingle-hash, docid-list) => (shingle-hash, docid-list)
>
> Phase 2:
> Count the number of times that documents share a shingle-hash value in
> their min-sets. Duplicate or near-duplicate documents will share many
> shingle hashes. The map explodes a list of documents into all pairs, the
> combine reduces this to counts, and the reduce filters this to output
> only the pairs with high enough counts.
>
>   map:     (shingle-hash, docid-list) => ([doc1, doc2], 1)
>   combine: ([doc1, doc2], list-of-counts) => ([doc1, doc2], sum(list-of-counts))
>   reduce:  ([doc1, doc2], list-of-counts) => let n = sum(list-of-counts)
>                                              if (n > threshold) emit(null, [doc1, doc2])
>
> If you only want to be able to test a single document at a time in near
> real-time, you can adapt this to build a Lucene index out of the min-sets
> of each document (rather than the terms). That Lucene index can be used
> to find duplicates in real-time.
>
>
> On Fri, Jul 17, 2009 at 12:49 PM, Miles Osborne <[email protected]> wrote:
>
>> http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/CPM%202000.pdf
>>
>> It would be very nice if someone could implement a randomised approach
>> using Hadoop ...
>>
>> (This should be fairly easy to do, since you have to convert each document
>> into a set of shingles -- could be done in one mapper -- and then sort
>> these documents plus some extra twists.)
>
>
> --
> Ted Dunning, CTO
> DeepDyve
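For concreteness, here is a minimal sketch of the min-set computation the thread takes as given: hash every word k-gram (shingle) of a document and keep the n smallest hashes. The helper names `shingles` and `min_set`, and the choices of MD5, k = 5, and n = 20, are illustrative assumptions, not anything specified in the thread.

```python
import hashlib


def shingles(text, k=5):
    """Yield the word k-grams (shingles) of a document."""
    words = text.split()
    for i in range(len(words) - k + 1):
        yield " ".join(words[i:i + k])


def min_set(text, k=5, n=20):
    """Return the n smallest shingle hashes of a document.

    Illustrative parameters: k-word shingles, an n-element min-set, and
    MD5 as the shingle hash; none of these are fixed by the thread.
    """
    hashes = {int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)
              for s in shingles(text, k)}
    return sorted(hashes)[:n]
```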
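Ted's phase 1, rendered as a single-process simulation rather than a real Hadoop job so the data flow is easy to follow; it reuses the hypothetical `min_set` helper from the sketch above.

```python
from collections import defaultdict


def phase1_inverted_index(docs, k=5, n=20):
    """Build the inverted index hash => docids over min-set hashes.

    docs is a dict of docid -> text. The loop body plays the role of
    the map (emit (shingle-hash, docid)); grouping docids under each
    hash plays the role of the reduce.
    """
    index = defaultdict(list)
    for docid, text in docs.items():
        for h in min_set(text, k, n):
            index[h].append(docid)
    return index
```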
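Phase 2 in the same style: explode each posting list into document pairs, sum the per-pair counts (the combiner's job in the real job), and keep pairs that share enough min-set hashes. The threshold value is an arbitrary placeholder.

```python
from collections import Counter
from itertools import combinations


def phase2_near_duplicates(index, threshold=10):
    """Count shared min-set hashes per document pair and filter.

    index is the hash -> docid-list mapping from phase 1. Sorting each
    posting list keeps the pair keys canonical, mirroring the
    [doc1, doc2] keys in the pseudocode above.
    """
    pair_counts = Counter()
    for docids in index.values():
        for pair in combinations(sorted(docids), 2):
            pair_counts[pair] += 1
    return [pair for pair, n in pair_counts.items() if n > threshold]
```

One practical caveat: a posting list of m documents explodes into m(m-1)/2 pairs, so a very common shingle hash can dominate the job; real implementations often cap or drop oversized posting lists.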
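And the real-time variant Ted mentions, as a toy in-memory stand-in rather than actual Lucene: documents are indexed under their min-set hashes instead of their terms, and a query document is matched by counting overlapping hashes. In real Lucene this would amount to indexing each hash as a term and issuing an OR query over the new document's min-set; the class below only shows the shape of that idea.

```python
from collections import Counter, defaultdict


class MinSetIndex:
    """Toy stand-in for a Lucene index keyed by min-set hashes."""

    def __init__(self):
        self.postings = defaultdict(set)  # shingle-hash -> set of docids

    def add(self, docid, text):
        """Index a document under each hash in its min-set."""
        for h in min_set(text):
            self.postings[h].add(docid)

    def query(self, text, threshold=10):
        """Return docids sharing more than threshold min-set hashes
        with the query document, most-similar first."""
        counts = Counter()
        for h in min_set(text):
            for docid in self.postings[h]:
                counts[docid] += 1
        return [d for d, c in counts.most_common() if c > threshold]
```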
