A direct MapReduce (MR) approach is described here: http://web.jhu.edu/HLTCOE/Publications/acl08_elsayed_pairwise_sim.pdf
This is not particularly efficient; a better approach would be a randomised one, e.g. Broder et al.: http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/CPM%202000.pdf

It would be very nice if someone could implement a randomised approach using Hadoop. This should be fairly easy to do, since you convert each document into a set of shingles (which could be done in one mapper) and then sort these documents, plus some extra twists.

Miles

2009/7/17 Jason Rutherglen <[email protected]>

> I think this comes up fairly often in search apps: duplicate
> documents are indexed (for example, using SimplyHired's search
> there are 20 of the same job listed from different websites). A
> similarity score above a threshold would determine that the documents
> are too similar, are duplicates, and therefore can be removed.
> Is there a recommended Mahout algorithm for this?

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
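The shingling-plus-randomisation idea Miles refers to can be sketched locally before worrying about the Hadoop plumbing. Below is a minimal, non-distributed Python illustration of Broder's technique: each document becomes a set of word k-shingles, a MinHash signature is computed from those shingles, and the fraction of matching signature positions estimates the Jaccard similarity; documents above a chosen threshold would be flagged as near-duplicates. All function names, the shingle size k=4, and the 64-hash signature length are illustrative assumptions, not anything from the thread, and the one-mapper/sort structure Miles mentions is deliberately omitted.

```python
import hashlib

def shingles(text, k=4):
    # Split a document into its set of overlapping word k-shingles.
    # (k=4 is an arbitrary illustrative choice.)
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(shingle_set, num_hashes=64):
    # MinHash: for each of num_hashes seeded hash functions, keep the
    # minimum hash value over the shingle set. The probability that two
    # sets agree at a given position equals their Jaccard similarity.
    signature = []
    for i in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(f"{i}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    # Fraction of matching positions estimates Jaccard similarity.
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

if __name__ == "__main__":
    job1 = "senior java developer wanted for fast growing startup in edinburgh apply now online"
    job2 = "senior java developer wanted for fast growing startup in edinburgh apply today online"
    other = "completely unrelated document about cooking recipes and garden maintenance tips here"

    s1 = minhash_signature(shingles(job1))
    s2 = minhash_signature(shingles(job2))
    s3 = minhash_signature(shingles(other))

    print(estimated_jaccard(s1, s2))  # high: near-duplicate postings
    print(estimated_jaccard(s1, s3))  # near zero: unrelated documents
```

In the MapReduce version Miles sketches, the mapper would emit (shingle-hash, document-id) pairs so the shuffle/sort groups candidate duplicates together, avoiding the all-pairs comparison that makes the direct approach expensive.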
