Sebastian,

There is no simple way of calculating similarity between terms in Lucene.

Normally, documents are represented in the Vector Space Model (VSM), where a weight is associated with each unique term in the document (e.g. term frequency, the number of times the term occurs within the document). This representation is used internally to calculate the similarity between documents, treating a query as a special case of a short document. You can get these term vectors per document with the Lucene API if the index was built with the term vectors option. You can try building a Term x Document matrix by accumulating document term vectors and then applying LSA or co-occurrence-based calculations as a similarity measure, but this may be computationally very expensive with a huge matrix. Some sampling-based techniques have been developed (please contact me directly if you wish to learn more about them).
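
To make the accumulation step concrete, here is a minimal sketch in plain Python. It does not call the Lucene API: the `docs` dictionary is a hypothetical stand-in for the per-document term vectors you would read from an index, and the helper names `term_vectors` and `term_document_matrix` are made up for illustration.

```python
from collections import Counter

# Hypothetical toy corpus standing in for term vectors read from an index.
docs = {
    "d1": "car engine repair",
    "d2": "automobile engine repair",
    "d3": "car dealer",
}


def term_vectors(docs):
    """Map each document id to a {term: frequency} vector."""
    return {doc_id: Counter(text.split()) for doc_id, text in docs.items()}


def term_document_matrix(vectors):
    """Accumulate document term vectors into sparse term -> {doc_id: frequency} rows."""
    matrix = {}
    for doc_id, vec in vectors.items():
        for term, freq in vec.items():
            matrix.setdefault(term, {})[doc_id] = freq
    return matrix


matrix = term_document_matrix(term_vectors(docs))
```

Storing only the non-zero entries keeps the matrix sparse, which matters once the vocabulary and the document count grow.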

Now, regarding your comment about seeing a term as a document: if you transpose the T x D matrix, you can think of a term as a document whose vector representation contains the term's weight in each document, so similar vector space calculations (e.g. cosine-based similarity) can be made between terms. This only looks at first-degree co-occurrence though (i.e. how many documents share the terms) and does not capture semantic transitivity (second or higher degree of co-occurrence), which is very important for determining similarity between terms (e.g. synonyms, which represent the same concept, may be used in different subsets of documents and thus have low first-degree co-occurrence).
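
A small sketch of that difference, again in plain Python rather than Lucene code. The `term_rows` data is a hypothetical toy example, and `cooccurrence_rows` is just one simple way to approximate second-degree co-occurrence (it compares terms by the other terms they co-occur with); it is not LSA.

```python
import math

# Hypothetical term -> {doc_id: weight} rows (each term seen as a "document").
term_rows = {
    "car":        {"d1": 1, "d3": 1},
    "automobile": {"d2": 1},
    "engine":     {"d1": 1, "d2": 1},
    "repair":     {"d1": 1, "d2": 1},
    "dealer":     {"d3": 1},
}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cooccurrence_rows(rows):
    """Build term -> {other_term: shared-document weight} rows."""
    result = {}
    for term, docs_t in rows.items():
        row = {}
        for other, docs_o in rows.items():
            if other == term:
                continue
            shared = sum(w * docs_o.get(d, 0) for d, w in docs_t.items())
            if shared:
                row[other] = shared
        result[term] = row
    return result


# First degree: "car" and "automobile" never share a document.
first_order = cosine(term_rows["car"], term_rows["automobile"])

# Second degree: both co-occur with "engine" and "repair".
second = cooccurrence_rows(term_rows)
second_order = cosine(second["car"], second["automobile"])
```

In this toy data the two synonyms score zero on document overlap but come out clearly similar once they are compared through the terms they share, which is the transitivity effect described above.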

-- Joaquin

Sebastian Menge wrote:

Hi all

Given an index, how can I (if I can) get the similarity between _terms_?
I read somewhere (in an intro to IR) that a term can be seen as a
document. Can I do that with Lucene, and how would one proceed? (A code
snippet would be great.)

Thanks a lot, Sebastian.

BTW: I found Lucene when looking for an LSA component. I already asked
for that on the general list. Other people are also looking for this
(e.g. fidde andersson). I have already been asked whether I got any further.
So it seems that there is demand for such a component. If I were still a
student I would try to extend Lucene to do something like that, but
today I don't have the resources; perhaps another person has.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




