I think this comes up fairly often in search apps: duplicate documents get indexed (for example, SimplyHired's search lists 20 copies of the same job from different websites). A similarity score above a threshold would mark two documents as too similar, that is, as duplicates, so one of them can be removed. Is there a recommended Mahout algorithm for this?
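To make the threshold idea concrete, here is a minimal sketch in plain Python. It is not Mahout code and not a recommendation of a specific Mahout algorithm; it assumes Jaccard similarity over 3-word shingles as the similarity score, and the names `shingles`, `jaccard`, `dedupe` and the 0.8 threshold are made up for illustration. It compares every document against every kept document, so it only suits small collections.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short text: treat the whole word sequence as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(docs, threshold=0.8, k=3):
    """Keep a document only if its similarity to every kept document
    is at or below the threshold (hypothetical default of 0.8)."""
    kept = []
    kept_shingles = []
    for doc in docs:
        s = shingles(doc, k)
        if all(jaccard(s, t) <= threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


if __name__ == "__main__":
    listings = [
        "Senior Java developer wanted in Boston",
        "Senior Java developer wanted in Boston",
        "Part-time barista needed downtown",
    ]
    for doc in dedupe(listings):
        print(doc)
```

The first listing in each group of near-duplicates is the one that survives; which copy to keep (newest, most complete, preferred source) is a separate decision that this sketch does not address.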
- Finding the similarity of documents using Mahout for dedu... Jason Rutherglen
