That kind of fuzzy equality is an area of open research. You need to define acceptable Type I and Type II error rates before you can think about implementations that scale better. Approaches range from identifying document vocabulary and statistics to raw hashing of the input text.
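To make the hashing end of that spectrum concrete, here is a rough, untested sketch (class name, shingle size, and sketch size are all made-up illustration values, not recommendations): each document is reduced to the smallest hashes of its word shingles, and two documents whose sketches overlap heavily are flagged as candidate duplicates. The threshold you pick is exactly the Type I / Type II trade-off mentioned above.

import java.util.*;

// Illustrative hash-based near-duplicate detection: a document's "sketch" is
// the SKETCH_SIZE smallest hashes of its word shingles; heavily overlapping
// sketches suggest near-duplicate documents.
public class ShingleSketch {

    private static final int SHINGLE_SIZE = 5;   // words per shingle (assumed value)
    private static final int SKETCH_SIZE  = 100; // hashes kept per document (assumed value)

    /** Build a sketch: the SKETCH_SIZE smallest hashes of all word shingles. */
    public static SortedSet<Integer> sketch(String text) {
        String[] words = text.toLowerCase().split("\\W+");
        TreeSet<Integer> hashes = new TreeSet<Integer>();
        for (int i = 0; i + SHINGLE_SIZE <= words.length; i++) {
            StringBuilder shingle = new StringBuilder();
            for (int j = i; j < i + SHINGLE_SIZE; j++) {
                shingle.append(words[j]).append(' ');
            }
            hashes.add(shingle.toString().hashCode());
            if (hashes.size() > SKETCH_SIZE) {
                hashes.remove(hashes.last()); // drop the largest, keep only the smallest hashes
            }
        }
        return hashes;
    }

    /** Rough overlap estimate: fraction of the smaller sketch shared with the other. */
    public static double resemblance(SortedSet<Integer> a, SortedSet<Integer> b) {
        if (a.isEmpty() || b.isEmpty()) return 0.0;
        Set<Integer> common = new HashSet<Integer>(a);
        common.retainAll(b);
        return (double) common.size() / Math.min(a.size(), b.size());
    }

    public static void main(String[] args) {
        SortedSet<Integer> s1 = sketch("the quick brown fox jumps over the lazy dog again and again");
        SortedSet<Integer> s2 = sketch("the quick brown fox jumps over the lazy dog again and again today");
        System.out.println("resemblance = " + resemblance(s1, s2));
        // Treat document pairs above some tuned threshold (say 0.9) as duplicates.
    }
}

A full article and its printer-friendly version share almost all of their shingles, so their sketches will overlap even though boilerplate text around the article differs.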

Herb...

-----Original Message-----
From: Michael Giles [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 08, 2004 4:38 PM
To: Lucene Users List
Subject: Filtering out duplicate documents...


Obviously you can compute the Levenshtein distance on the text, but that is
way too computationally intensive to scale.  So the goal is to find
something that would be workable in a production system.  For example, a
given NYT article and its printer-friendly version should be deemed to be
the same.
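(For reference, a minimal dynamic-programming Levenshtein implementation looks roughly like the sketch below; it is illustrative only, not code from this thread. It fills an (n+1) x (m+1) table per comparison, which is why pairwise comparison of long articles this way does not scale.)

// Minimal Levenshtein edit distance, shown only to make the cost argument
// concrete: the nested loops touch every cell of an (n+1) x (m+1) table.
public class Levenshtein {
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }
}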
