Hello, I'm using lucene to retrieve relevant segments of a corpus based on a given query. Every segment is represented as Document in the indexing. Once the relevant segments are retrieved, I search for a Regex in them to capture the requested information. It can happens that I found the regex in two or more documents retrieved by lucene. Thus, I would like to show a confidence score to each captured Regex. To make simple, I means if the regex is found in the first retrieved document and in fourth, the one retrieved in the first document is more certain to be relevant since the its document was better scored than other one.
To show a confidence score, I tried to use scores returned by Lucene. But, the lucene score are arbitrary and not normalized. So it's not relevant to use. I wonder how we can have an arbitrary score (>1) when the default scoring system in lucene is based on cosine measure. In a simple case, the cosine score between two document vectors (obtained by tf-idf) is between 0 and 1. Since the lucene score is arbitrary and not normalized, if I want to verify which query is more relevant to retrieve a document, how can I compare the score of the first retrieved documents corresponding to each query ? Is there any way to give a confidence score to each retrieved document which make the comparison of the results of the different search using the different queries possible? Many thank Best regards. -- Hassan Saneifar, PhD. student, University Montpellier II LIRMM laboratory 161 rue Ada 34392 Montpellier, France Tel: +33 (0)467 41 85 41 Fax: +33 (0)467 41 85 00 URL: http://www.lirmm.fr/~saneifar -- Hassan Saneifar, R&D engineer Satin IP Technologies Tel: +33 (0)467 13 00 86 URL: http://www.satin-ip.com --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org