[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Charlie Zhao (JIRA) Thu, 26 Jul 2007 12:16:29 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515833 ]


Charlie Zhao commented on LUCENE-965:
-------------------------------------



Document length and average document length are speed bottlenecks in Lucene
implementations of some IR models, such as the Axiomatic Retrieval Function we
just saw and a language model I have implemented in Lucene. I say speed rather
than performance because, in my experiments, Lucene's retrieval performance (in
the sense of recall and precision) is relatively low compared with other IR
models. And the kernel of the similarity algorithm has not been updated since
early versions of Lucene. Do general users value recall and precision more than
speed?

How can we conveniently store and retrieve "field length", "document length",
"average document length", etc.? Could they be stored as payload data at the
document level and index level, so that we can avoid the corresponding overhead
at query time?

I used to obtain the field length from TermFreqVector's getTermFrequencies()
(size() only returns the number of unique terms). But could I instead just
invert that field's norm value to recover its length as (1/norm)^2? That might
be faster. Can someone confirm this?
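
To make the (1/norm)^2 idea concrete, here is a minimal sketch in plain Java
with no Lucene dependency. It assumes the default length norm of
1/sqrt(numTerms) and no field or document boosts; the class and method names
are hypothetical. Note that Lucene stores each norm in a single byte and folds
boosts into it, so a length recovered from a stored norm would only be
approximate.

```java
// Sketch only: assumes lengthNorm = 1/sqrt(numTerms) and no boosts.
// Class and method names are hypothetical, not part of Lucene.
final class NormLengthSketch {

    /** Length norm under the assumed default: 1 / sqrt(numTerms). */
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** Inverts the norm: numTerms = (1 / norm)^2, rounded to the nearest integer. */
    static long approxLength(float norm) {
        double inverse = 1.0 / norm;
        return Math.round(inverse * inverse);
    }

    public static void main(String[] args) {
        // Round-trip a few field lengths through the unquantized float norm.
        for (int length : new int[] {1, 4, 100, 2500}) {
            System.out.println(length + " -> " + approxLength(lengthNorm(length)));
        }
    }
}
```

The round trip is exact here only because the norm is kept as a full float; a
norm read back from the index has already lost precision.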

BTW, I need help understanding the claim of "a small constant factor to the
cost of reading them" in Doug's comment. The average norm does not give us the
average field length. We need to recover each individual field length to get
the average field length, which involves a great deal of floating-point
operations. Did I miss something?
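
A small worked example of why the average norm cannot simply be inverted,
again assuming norm = 1/sqrt(length) with no boosts and using hypothetical
names. For two fields of length 1 and 100 the true average length is 50.5, but
inverting the average of their norms gives roughly 3.3.

```java
// Sketch only: assumes norm = 1/sqrt(length) and no boosts.
// Class and method names are hypothetical, not part of Lucene.
final class AverageNormSketch {

    /** True average field length, computed from the individual lengths. */
    static double averageLength(int[] lengths) {
        double sum = 0.0;
        for (int length : lengths) {
            sum += length;
        }
        return sum / lengths.length;
    }

    /** Averages the norms first, then inverts the average as (1/avgNorm)^2. */
    static double lengthFromAverageNorm(int[] lengths) {
        double normSum = 0.0;
        for (int length : lengths) {
            normSum += 1.0 / Math.sqrt(length);
        }
        double averageNorm = normSum / lengths.length;
        double inverse = 1.0 / averageNorm;
        return inverse * inverse;
    }

    public static void main(String[] args) {
        int[] lengths = {1, 100};
        System.out.println("true average length:     " + averageLength(lengths));
        System.out.println("from inverted avg. norm: " + lengthFromAverageNorm(lengths));
    }
}
```

Because 1/sqrt(x) is not linear, the two quantities agree only when all fields
have the same length, so each length has to be recovered individually before
averaging.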

Can we store the "document length" (with multiple fields) and "average document
length" as payload data at the document level and index level respectively? The
current payload is designed at the term level, is that right? If we want to
store something at the document and index level, do we necessarily have to
change the Lucene file format?





> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art 
> retrieval function, to replace the default similarity function in Lucene. We 
> compared the performance of these two functions and reported the results at 
> http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is 
> much better than the default function. The axiomatic retrieval function is 
> able to find more relevant documents and users can see more relevant 
> documents in the top-ranked documents. Incorporating such a state-of-the-art 
> retrieval function could improve the search performance of all the 
> applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, 
> TermScorer and TermQuery.java.  However, many test cases are hand coded to 
> test whether the implementation of the default function is correct. Thus, I 
> also made the modification to many test files to make the new retrieval 
> function pass those cases. In fact, we found that some old test cases are not 
> reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 
> xx w2 yy w3". 
> The second document should be more relevant than the first one, because it 
> has more occurrences of the query term "w3". But the original test case would 
> require us to rank the first document higher than the second one, which is 
> not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

