A direct MapReduce (MR) approach is described here: http://web.jhu.edu/HLTCOE/Publications/acl08_elsayed_pairwise_sim.pdf
This is not particularly efficient; a better approach would be a randomised one, e.g. Broder et al.: http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/CPM%202000.pdf

It would be very nice if someone could implement a randomised approach using Hadoop. This should be fairly easy to do, since you convert each document into a set of shingles (which could be done in one mapper) and then sort these documents, plus some extra twists.

Miles

2009/7/17 Jason Rutherglen <[email protected]>

> I think this comes up fairly often in search apps: duplicate
> documents are indexed (for example, using SimplyHired's search
> there are 20 of the same job listed from different websites). A
> similarity score above a threshold would determine that the documents
> are too similar, are duplicates, and therefore can be removed.
> Is there a recommended Mahout algorithm for this?

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
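The shingling-plus-randomisation idea Miles refers to can be sketched locally before worrying about the Hadoop plumbing. Below is a minimal, non-distributed Python illustration of Broder's technique: each document becomes a set of word k-shingles, a MinHash signature is computed from those shingles, and the fraction of matching signature positions estimates the Jaccard similarity; documents above a chosen threshold would be flagged as near-duplicates. All function names, the shingle size k=4, and the 64-hash signature length are illustrative assumptions, not anything from the thread, and the one-mapper/sort structure Miles mentions is deliberately omitted.

```python
import hashlib

def shingles(text, k=4):
    # Split a document into its set of overlapping word k-shingles.
    # (k=4 is an arbitrary illustrative choice.)
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(shingle_set, num_hashes=64):
    # MinHash: for each of num_hashes seeded hash functions, keep the
    # minimum hash value over the shingle set. The probability that two
    # sets agree at a given position equals their Jaccard similarity.
    signature = []
    for i in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(f"{i}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    # Fraction of matching positions estimates Jaccard similarity.
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

if __name__ == "__main__":
    job1 = "senior java developer wanted for fast growing startup in edinburgh apply now online"
    job2 = "senior java developer wanted for fast growing startup in edinburgh apply today online"
    other = "completely unrelated document about cooking recipes and garden maintenance tips here"

    s1 = minhash_signature(shingles(job1))
    s2 = minhash_signature(shingles(job2))
    s3 = minhash_signature(shingles(other))

    print(estimated_jaccard(s1, s2))  # high: near-duplicate postings
    print(estimated_jaccard(s1, s3))  # near zero: unrelated documents
```

In the MapReduce version Miles sketches, the mapper would emit (shingle-hash, document-id) pairs so the shuffle/sort groups candidate duplicates together, avoiding the all-pairs comparison that makes the direct approach expensive.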
