I think this comes up fairly often in search apps: duplicate
documents get indexed (for example, a SimplyHired search can
return 20 copies of the same job, listed from different websites).
A similarity score above some threshold would mark two documents
as duplicates, so that all but one of them can be removed.
Is there a recommended Mahout algorithm for this?
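
To make the thresholding idea concrete, here is a rough sketch in
plain Python. It does not use Mahout at all. The word-shingle
Jaccard similarity, the function names (shingles, jaccard, dedupe)
and the 0.8 default threshold are only assumptions chosen for
illustration, and the pairwise comparison is quadratic, so it would
not scale to a real index as written.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts become a single shingle holding all their words.
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(docs, threshold=0.8, k=3):
    """Keep a document only if its similarity to every document
    kept so far is below the threshold (illustrative value)."""
    kept = []
    kept_shingles = []
    for doc in docs:
        s = shingles(doc, k)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


listings = [
    "Senior Java developer wanted in Boston area",
    "Senior Java developer wanted in Boston area",
    "Part time barista needed downtown",
]
unique = dedupe(listings)  # the repeated listing is dropped
```

The first copy seen is the one that survives, so the order of the
input decides which source "wins".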
