That kind of fuzzy equality is an area of open research. You need to define an acceptable rate of Type I and Type II errors (false duplicates vs. missed duplicates) before you can think about implementations that scale better. Approaches range from comparing document vocabulary and term statistics to raw hashing of the input text.
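As a rough illustration of the vocabulary/statistics end of that spectrum, here is a minimal, untested Java sketch (not from this thread; the class name, shingle size, and 0.8 threshold are just placeholders): normalize the text, hash overlapping word shingles, and compare documents by the Jaccard similarity of the resulting hash sets. The threshold you pick is exactly where the Type I / Type II trade-off shows up.

import java.util.HashSet;
import java.util.Set;

public class ShingleSimilarity {

    // Break normalized text into overlapping word shingles of the given size
    // and keep only their hash codes, so the per-document fingerprint is small.
    static Set<Integer> shingleHashes(String text, int shingleSize) {
        String[] words = text.toLowerCase().replaceAll("[^a-z0-9 ]", " ").trim().split("\\s+");
        Set<Integer> hashes = new HashSet<Integer>();
        for (int i = 0; i + shingleSize <= words.length; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < shingleSize; j++) {
                sb.append(words[i + j]).append(' ');
            }
            hashes.add(sb.toString().hashCode());
        }
        return hashes;
    }

    // Jaccard similarity: size of the intersection over size of the union.
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<Integer> intersection = new HashSet<Integer>(a);
        intersection.retainAll(b);
        Set<Integer> union = new HashSet<Integer>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        String article = "The quick brown fox jumps over the lazy dog near the river bank";
        String printerFriendly = "The quick brown fox jumps over the lazy dog near the river bank today";
        double sim = jaccard(shingleHashes(article, 3), shingleHashes(printerFriendly, 3));
        // Documents whose similarity exceeds a tuned threshold (e.g. 0.8 here,
        // purely illustrative) would be treated as duplicates.
        System.out.println("similarity = " + sim);
    }
}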
Herb...

-----Original Message-----
From: Michael Giles [mailto:[EMAIL PROTECTED]
Sent: Monday, March 08, 2004 4:38 PM
To: Lucene Users List
Subject: Filtering out duplicate documents...

Obviously you can compute the Levenshtein distance on the text, but that is far too computationally intensive to scale. So the goal is to find something that would be workable in a production system. For example, a given NYT article and its printer-friendly version should be deemed to be the same.
