I'm looking for a way to filter out duplicate documents from an index (either while indexing, or after the fact). It seems like there should be an approach based on comparing the terms of two documents, but I'm wondering whether other folks (e.g., Nutch) have already come up with a solution to this problem.

Obviously you can compute the Levenshtein distance on the full text, but that is far too computationally intensive to scale. So the goal is to find something workable in a production system. For example, a given NYT article and its printer-friendly version should be deemed the same.
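To make concrete what I mean by "comparing the terms": something like a Jaccard overlap of the two documents' token sets, which is roughly linear in document length rather than the quadratic cost of edit distance. This is only a rough sketch (the tokenizer, the sample strings, and the 0.8 threshold are arbitrary placeholders, not a proposed production design):

    import java.util.HashSet;
    import java.util.Set;

    public class NearDuplicateCheck {

        // Crude tokenizer: lowercase, split on non-alphanumerics.
        static Set<String> termSet(String text) {
            Set<String> terms = new HashSet<>();
            for (String tok : text.toLowerCase().split("[^a-z0-9]+")) {
                if (!tok.isEmpty()) {
                    terms.add(tok);
                }
            }
            return terms;
        }

        // Jaccard similarity: |A ∩ B| / |A ∪ B| over the two term sets.
        static double jaccard(Set<String> a, Set<String> b) {
            Set<String> intersection = new HashSet<>(a);
            intersection.retainAll(b);
            Set<String> union = new HashSet<>(a);
            union.addAll(b);
            return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
        }

        public static void main(String[] args) {
            String article = "The quick brown fox jumps over the lazy dog.";
            String printerFriendly = "Printer friendly: the quick brown fox jumps over the lazy dog";

            double sim = jaccard(termSet(article), termSet(printerFriendly));
            // Flag as duplicate above an (arbitrary) threshold.
            System.out.println("similarity = " + sim + ", duplicate? " + (sim > 0.8));
        }
    }

Even this is O(n^2) over pairs of documents, though, which is why I'm curious whether anyone has a smarter fingerprint- or hash-based scheme that avoids pairwise comparison.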

-Mike


