Dear all,

I want to use Lucene for information retrieval over documents that contain probabilistically weighted words (or fields, in Lucene's terms). Below I give a small template model to illustrate my problem. Any information or advice would be very valuable to me. Thank you very much.

A document containing two words will implicitly look something like this:

IMPLICIT DOCUMENT
--------------------------------------------------------------------------------
cherry know
--------------------------------------------------------------------------------

This implicit document stands for an explicit document in which every word carries probabilistic weights. For each word position, the implicit document takes the most probable word from the explicit document. For this example, the explicit document looks like this (weights are written with a decimal comma, so 0,83 means 0.83):

EXPLICIT DOCUMENT
--------------------------------------------------------------------------------
cherry(0,83)/chary(0,17) know(0,76)/now(0,24)
--------------------------------------------------------------------------------

I will run my queries against the explicit documents.

The first and easier problem: when we enter the query ‘chary’, Lucene should return this document even though it does not contain ‘chary’ implicitly. This is because the document has some probability, however small, of containing the word ‘chary’.

The second and more important problem: when calculating term frequencies, we sum the probabilistic weights of the words instead of simply counting occurrences. For example, in this document the term frequency of ‘cherry’ is 0.83 (instead of 1) and that of ‘chary’ is 0.17.

Solving the second problem has the following consequence: for the query ‘chary’, a document that implicitly contains many occurrences of ‘cherry’ can score higher than a document that implicitly contains only a few occurrences of ‘chary’.

Thank you very much again.
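P.S. To make the weighted term frequency concrete, here is a minimal sketch in plain Python. It is not Lucene code and does not use any Lucene API; the function names (`parse_explicit_document`, `implicit_document`, `expected_term_frequencies`) and the two extra sample documents at the bottom are made up purely for illustration. It assumes the explicit document format shown above, with alternatives separated by `/` and the weight in parentheses.

```python
from collections import defaultdict


def parse_explicit_document(text):
    """Parse tokens such as 'cherry(0,83)/chary(0,17)' into a list with one
    {word: probability} dict per word position (hypothetical helper)."""
    positions = []
    for token in text.split():
        alternatives = {}
        for alternative in token.split("/"):
            word, _, rest = alternative.partition("(")
            # Accept both decimal comma (0,83) and decimal point (0.83).
            alternatives[word] = float(rest.rstrip(")").replace(",", "."))
        positions.append(alternatives)
    return positions


def implicit_document(positions):
    """Keep only the most probable word at each position."""
    return [max(alts, key=alts.get) for alts in positions]


def expected_term_frequencies(positions):
    """Sum the probabilistic weights per word instead of counting words."""
    tf = defaultdict(float)
    for alts in positions:
        for word, probability in alts.items():
            tf[word] += probability
    return dict(tf)


example = parse_explicit_document("cherry(0,83)/chary(0,17) know(0,76)/now(0,24)")
print(implicit_document(example))          # ['cherry', 'know']
print(expected_term_frequencies(example))

# Invented documents: one with many likely 'cherry' positions, and one with
# a single likely 'chary' position.
many_cherry = parse_explicit_document(" ".join(["cherry(0,83)/chary(0,17)"] * 8))
few_chary = parse_explicit_document("chary(0,60)/cherry(0,40)")
print(expected_term_frequencies(many_cherry)["chary"]
      > expected_term_frequencies(few_chary)["chary"])
```

With these invented numbers, the document that implicitly contains only ‘cherry’ accumulates a larger weighted frequency for ‘chary’ than the document that implicitly contains ‘chary’ once, which is the consequence described above. How to feed such weights into Lucene's own scoring is exactly what I am asking about.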
