Oh, one more option would be to use n-grams and support vector machines,
which may be closer to what you want if by "similar" you mean conceptually
"in the same category" (using training sets, etc.).
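
As a rough illustration of the n-gram half of that idea, here is a minimal
sketch in Python that turns text into character n-gram count vectors, which
is the kind of input a support vector machine would be trained on. The
function names, the choice of character trigrams, and the tiny training
list are assumptions for illustration only; the classifier itself is not
shown.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lower-cased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def to_vector(text, vocabulary, n=3):
    """Map text to a fixed-length count vector over a known n-gram vocabulary."""
    counts = char_ngrams(text, n)
    return [counts[gram] for gram in vocabulary]

# Hypothetical labeled training texts; the vocabulary is every n-gram seen in them.
training = ["lucene query", "lucene index"]
vocabulary = sorted(set().union(*(char_ngrams(t) for t in training)))
vectors = [to_vector(t, vocabulary) for t in training]
```

Each vector has one slot per n-gram in the vocabulary, so documents from the
same category tend to end up with similar vectors.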


-----Original Message-----
From: John Cwikla [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 29, 2003 1:49 PM
To: 'Lucene Users List'
Subject: RE: Find Documents 'Similar' to Another



Depends what "similar" means.

If by similar you mean they contain a lot of the same words/phrases, then
you can probably build a query from those words (although the number of
terms you can have is limited to 32 or 64, I think) and get the matching
documents using Lucene.
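
To make the word-overlap approach concrete, here is a minimal sketch in
Python (not Lucene code) that picks the most frequent words from a document
and joins them into an OR query in Lucene's query-parser syntax. The field
name "contents", the stop-word list, and the default cap of 32 terms are
assumptions for illustration; the cap simply echoes the rough limit
mentioned above.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real setup would reuse the analyzer's own list.
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "in",
              "is", "it", "of", "on", "or", "that", "the", "to", "with"}

def top_terms(text, max_terms=32):
    """Return up to max_terms of the most frequent non-stop words in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:max_terms]]

def similarity_query(text, field="contents", max_terms=32):
    """Join the top terms into an OR query string for the given field."""
    return " OR ".join(f"{field}:{term}" for term in top_terms(text, max_terms))

doc = "Lucene indexes documents. Lucene searches documents by query terms."
print(similarity_query(doc, max_terms=3))
```

The resulting string could then be parsed and run against the index, with
the hits ranked by how many of the terms they share with the source document.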

If by similar you mean the text in some documents is byte-for-byte the same
except for some small deviations, you are probably interested in using a
Nilsimsa signature.
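
To show the general shape of such a signature, here is a simplified sketch
in Python. It is not the actual algorithm named above: it only borrows the
broad idea of hashing character trigrams into 256 buckets, setting one bit
per above-average bucket, and comparing two signatures by counting the bits
that agree. The use of CRC32 as the hash and both function names are
assumptions for illustration.

```python
import zlib

def signature(text, n=3, bits=256):
    """Bucket character n-grams, then set one bit per above-average bucket."""
    buckets = [0] * bits
    for i in range(len(text) - n + 1):
        gram = text[i:i + n].encode("utf-8")
        # CRC32 stands in for the hash function; any stable hash would do here.
        buckets[zlib.crc32(gram) % bits] += 1
    mean = sum(buckets) / bits
    return [1 if count > mean else 0 for count in buckets]

def matching_bits(sig_a, sig_b):
    """Count positions where the two signatures agree (higher means more alike)."""
    return sum(1 for a, b in zip(sig_a, sig_b) if a == b)

original = "The quick brown fox jumps over the lazy dog near the river bank."
tweaked = "The quick brown fox jumps over the lazy cog near the river bank."
other = "Quarterly revenue figures were published in yesterday's annual report."

print(matching_bits(signature(original), signature(tweaked)))
print(matching_bits(signature(original), signature(other)))
```

A small edit only disturbs the few trigrams that overlap it, so near-duplicate
texts keep most of their bits in common, while unrelated texts agree on far
fewer.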

If you have some words/phrases that give you a starting point of documents
to check for similarity, you could use Lucene first and then Nilsimsa second.

If you are talking about conceptual similarity, you probably have a big
research project on your hands.

-----Original Message-----
From: Bruce Ritchie [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 29, 2003 1:41 PM
To: Lucene Users List
Subject: Re: Find Documents 'Similar' to Another


Wirthlin, Rick - Workstream wrote:

> I have a requirement to find documents similar to another.  Can that be
> accomplished using a PhraseQuery, or some other way?

One option I'm looking at to get this functionality is the InfoWrangler
product (www.infowrangler.com), as it does this and seems to be at least
partially based upon Lucene. If other people know of other (available)
options, I'd love to hear of them as well.


Regards,

Bruce Ritchie

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
