In this context, does "min-set" refer to this paper: http://portal.acm.org/citation.cfm?id=1226882 ?
> If you only want to be able to test a single document at a time in near
> real-time, you can adapt this to build a Lucene index out of the min-sets
> of each document (rather than the terms). That Lucene index can be used
> to find duplicates in real-time.

Interesting. Thanks!

On Fri, Jul 17, 2009 at 4:01 PM, Ted Dunning <[email protected]> wrote:

> Computing min-sets of shingle hashes is definitely the preferred approach.
> I think that a full implementation to find all duplicates in a corpus
> requires more than one map-reduce phase.
>
> Phase 1:
> Convert to an inverted index that maps hash => docids for each hash in the
> min-set of some document.
>
>   map:    (_, document) => for (shingle-hash in min-set(document))
>                                emit(shingle-hash, docid)
>   reduce: (shingle-hash, docid-list) => (shingle-hash, docid-list)
>
> Phase 2:
> Count the number of times that documents share a shingle-hash value in
> their min-sets. Duplicate or near-duplicate documents will share many
> shingle hashes. The map explodes a list of documents into all pairs, the
> combine reduces this to counts, and the reduce filters this to output
> only the pairs with high enough counts.
>
>   map:     (shingle-hash, docid-list) => ([doc1, doc2], 1)
>   combine: ([doc1, doc2], list-of-counts) => ([doc1, doc2], sum(list-of-counts))
>   reduce:  ([doc1, doc2], list-of-counts) => let n = sum(list-of-counts)
>                                              if (n > threshold) emit(null, [doc1, doc2])
>
> If you only want to be able to test a single document at a time in near
> real-time, you can adapt this to build a Lucene index out of the min-sets
> of each document (rather than the terms). That Lucene index can be used
> to find duplicates in real-time.
>
>
> On Fri, Jul 17, 2009 at 12:49 PM, Miles Osborne <[email protected]> wrote:
>
>> http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/CPM%202000.pdf
>>
>> It would be very nice if someone could implement a randomised approach
>> using Hadoop ...
>>
>> (This should be fairly easy to do, since you have to convert each document
>> into a set of shingles -- could be done in one mapper -- and then sort
>> these documents plus some extra twists.)
>
>
> --
> Ted Dunning, CTO
> DeepDyve
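For concreteness, here is a minimal sketch of the min-set computation the thread takes as given: hash every word k-gram (shingle) of a document and keep the n smallest hashes. The helper names `shingles` and `min_set`, and the choices of MD5, k = 5, and n = 20, are illustrative assumptions, not anything specified in the thread.

```python
import hashlib


def shingles(text, k=5):
    """Yield the word k-grams (shingles) of a document."""
    words = text.split()
    for i in range(len(words) - k + 1):
        yield " ".join(words[i:i + k])


def min_set(text, k=5, n=20):
    """Return the n smallest shingle hashes of a document.

    Illustrative parameters: k-word shingles, an n-element min-set, and
    MD5 as the shingle hash; none of these are fixed by the thread.
    """
    hashes = {int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)
              for s in shingles(text, k)}
    return sorted(hashes)[:n]
```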
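Ted's phase 1, rendered as a single-process simulation rather than a real Hadoop job so the data flow is easy to follow; it reuses the hypothetical `min_set` helper from the sketch above.

```python
from collections import defaultdict


def phase1_inverted_index(docs, k=5, n=20):
    """Build the inverted index hash => docids over min-set hashes.

    docs is a dict of docid -> text. The loop body plays the role of
    the map (emit (shingle-hash, docid)); grouping docids under each
    hash plays the role of the reduce.
    """
    index = defaultdict(list)
    for docid, text in docs.items():
        for h in min_set(text, k, n):
            index[h].append(docid)
    return index
```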
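Phase 2 in the same style: explode each posting list into document pairs, sum the per-pair counts (the combiner's job in the real job), and keep pairs that share enough min-set hashes. The threshold value is an arbitrary placeholder.

```python
from collections import Counter
from itertools import combinations


def phase2_near_duplicates(index, threshold=10):
    """Count shared min-set hashes per document pair and filter.

    index is the hash -> docid-list mapping from phase 1. Sorting each
    posting list keeps the pair keys canonical, mirroring the
    [doc1, doc2] keys in the pseudocode above.
    """
    pair_counts = Counter()
    for docids in index.values():
        for pair in combinations(sorted(docids), 2):
            pair_counts[pair] += 1
    return [pair for pair, n in pair_counts.items() if n > threshold]
```

One practical caveat: a posting list of m documents explodes into m(m-1)/2 pairs, so a very common shingle hash can dominate the job; real implementations often cap or drop oversized posting lists.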
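And the real-time variant Ted mentions, as a toy in-memory stand-in rather than actual Lucene: documents are indexed under their min-set hashes instead of their terms, and a query document is matched by counting overlapping hashes. In real Lucene this would amount to indexing each hash as a term and issuing an OR query over the new document's min-set; the class below only shows the shape of that idea.

```python
from collections import Counter, defaultdict


class MinSetIndex:
    """Toy stand-in for a Lucene index keyed by min-set hashes."""

    def __init__(self):
        self.postings = defaultdict(set)  # shingle-hash -> set of docids

    def add(self, docid, text):
        """Index a document under each hash in its min-set."""
        for h in min_set(text):
            self.postings[h].add(docid)

    def query(self, text, threshold=10):
        """Return docids sharing more than threshold min-set hashes
        with the query document, most-similar first."""
        counts = Counter()
        for h in min_set(text):
            for docid in self.postings[h]:
                counts[docid] += 1
        return [d for d, c in counts.most_common() if c > threshold]
```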
