John Cwikla wrote:

Depends what "similar" means.

If by similar, you mean they contain alot of the same words/phrases, then
you can probably use
a query (although the number you can have is limited to 32 or 64 I think)
and get documents
using lucene.


I have a demo site that does this.
I thought I posted a mention of this a week ago in a different thread but it seems that
my mailer ate the outgoing msg so I don't think this is a dup....


Anyway, the site is:

http://www.hostmon.com/rfc/index.jsp

I index all RFCs w/ Lucene.

When you're viewing an RFC at a URL like this:

http://www.hostmon.com/rfc/get.jsp?id=1823&s=LDAP%20Security

if you click on 'show similar' then you go to a URL like this:

http://www.hostmon.com/rfc/related.jsp?id=1823&s=LDAP%20Security

the implementation takes the source RFC, forms a query with all words in it, and then
passes this large query to lucene. The display lists docs that are 'similar' to the
source doc, and the little blue bar graph is bigger for more similar docs.


The nature of what's going on is that the more 2 documents have words in common, and the
more they have rare words in common, the more similar they are.


Similarity is not by concept, and even alternate spellings throw off this simple technique.

Site warning: linux JVM just crashed so I upgraded to 1.4.2b which is hopefully more stable.


If by similar you are trying to determine if the text in some documents is
byte/byte the same
except for some small deviations, you are probably interested in using a
nilsima signature.

If you have some words/phrases that give you are starting point of documents
to check for similiarity,
you could use Lucene first, and then nilsima second.

If you are talking about conceptual similarity, you probably have a big
research project on your
hands.


-----Original Message-----
From: Bruce Ritchie [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 29, 2003 1:41 PM
To: Lucene Users List
Subject: Re: Find Documents 'Similar' to Another


Wirthlin, Rick - Workstream wrote:




I have a requirement to find documents similar to another. Can that be


accomplished using a PhraseQuery, or some other way?

One option I'm looking at to get this functionality is the InfoWrangler
product (www.infowrangler.com) as it does this and seems to be at least partially
based upon lucene. If other people know of other (available) options I'd love to hear of them as
well.



Regards,


Bruce Ritchie

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]






--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]



Reply via email to