Question: Using Lucene for documents having probabilistically weighted terms (first e-mail is edited)

Seckin Uluskan Wed, 26 Aug 2009 06:31:02 -0700

Dear All,
 
I want to use Lucene for information retrieval over documents that contain 
probabilistically weighted words or fields (fields in the Lucene sense). Below, 
I give a template model as an example to illustrate my problem. Any information 
or advice would be very valuable to me. Thank you very much.
 
A document containing two words will implicitly look something like this:
 
IMPLICIT DOCUMENT
-----------------------------------------------------------------------------
cherry know
-----------------------------------------------------------------------------
This implicit document stands for an explicit document whose words are 
probabilistically weighted. For each word position, the implicit document takes 
the most probable word from the explicit document. For this example, the 
explicit document will be something like this: 
 
EXPLICIT DOCUMENT
--------------------------------------------------------------------------------
cherry(0,83)/chary(0,17)    know(0,76)/now(0,24)
---------------------------------------------------------------------------------
 
I will run my queries against the explicit documents.
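
The relationship between the two forms can be sketched in a few lines of Python. This is only an illustration and not Lucene code: the names `parse_explicit` and `implicit_document` are hypothetical, and the parser assumes the `word(weight)/word(weight)` notation of the example above, with whitespace between word positions and either a decimal comma or a decimal point in the weights.

```python
import re

# Matches one alternative such as "cherry(0,83)" or "cherry(0.83)".
_ALTERNATIVE = re.compile(r"([^\s/()]+)\((\d+(?:[.,]\d+)?)\)")


def parse_explicit(text):
    """Parse an explicit document into one {word: weight} dict per position."""
    positions = []
    for token in text.split():
        alternatives = {}
        for word, weight in _ALTERNATIVE.findall(token):
            # Accept the decimal comma used in the example.
            alternatives[word] = float(weight.replace(",", "."))
        if alternatives:
            positions.append(alternatives)
    return positions


def implicit_document(positions):
    """Keep only the most probable word at each position."""
    return [max(alts, key=alts.get) for alts in positions]


explicit = parse_explicit("cherry(0,83)/chary(0,17)    know(0,76)/now(0,24)")
print(" ".join(implicit_document(explicit)))  # prints: cherry know
```

Keeping every alternative per position, rather than only the winner, is what later allows a query for a less probable word to match the document at all.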
 
The first and easier problem: when we enter the query ‘chary’, Lucene should 
return this document although it does not contain ‘chary’ implicitly, because 
the document has some probability (even if it is small) of containing the 
word ‘chary’.
 
The second and more important problem: when calculating the term frequency of 
a term, we sum the probabilistic weights of its occurrences instead of simply 
counting them. For example, for this document the term frequency of ‘cherry’ 
is ‘0.83’ (instead of ‘1’) and that of ‘chary’ is ‘0.17’.
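
A minimal sketch of this weighted term frequency, again in plain Python rather than Lucene code. The function name `weighted_term_frequencies` and the list-of-dicts representation (one dict of alternatives per word position) are assumptions made for illustration.

```python
from collections import defaultdict


def weighted_term_frequencies(positions):
    """Sum the weight of every alternative at every position, per term."""
    tf = defaultdict(float)
    for alternatives in positions:
        for word, weight in alternatives.items():
            tf[word] += weight
    return dict(tf)


# The explicit document from the example, one dict per word position.
explicit = [
    {"cherry": 0.83, "chary": 0.17},
    {"know": 0.76, "now": 0.24},
]

tf = weighted_term_frequencies(explicit)
print(tf["cherry"], tf["chary"])  # prints: 0.83 0.17
```

A term that is an alternative at several positions would accumulate the weights of all of them, which is the "sum instead of count" behaviour described above.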
 
Solving the second problem has this consequence: when we enter the query 
‘chary’, a document that implicitly contains many occurrences of ‘cherry’ can 
score higher than a document that implicitly contains only a few occurrences 
of ‘chary’.
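
This consequence can be checked with a small numeric sketch. The two documents and the weights of `doc_b` are invented for illustration, and the ranking uses the raw weighted term frequency only; a real scoring function would also apply other factors (such as inverse document frequency or length normalization), which are ignored here.

```python
def weighted_tf(positions, term):
    """Sum the weights of `term` over all word positions of a document."""
    return sum(alternatives.get(term, 0.0) for alternatives in positions)


# Hypothetical documents; the weights of doc_b are invented for illustration.
doc_a = [{"cherry": 0.83, "chary": 0.17}] * 5   # implicitly "cherry" five times
doc_b = [{"chary": 0.60, "cherry": 0.40}]       # implicitly "chary" once

scores = {
    "doc_a": weighted_tf(doc_a, "chary"),  # 5 * 0.17, roughly 0.85
    "doc_b": weighted_tf(doc_b, "chary"),  # 0.60
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # prints: ['doc_a', 'doc_b']
```

Here `doc_a` never contains ‘chary’ implicitly, yet its five low-weight alternatives add up to more than the single higher-weight occurrence in `doc_b`.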
 
Thank you very much again.
