On Thursday 26 February 2009 02:21:41 Koren Krupko wrote:
>
> Hello Lucene Developers!
>
> My name is Koren Krupko. I'm quite new to Lucene but I do have experience in
> research in the fields of information retrieval. After reviewing Lucene's
> capabilities I understand that one of its major strengths is its scalability
> (as opposed to other frameworks such as Lemur). However, the retrieval and
> scoring models used by Lucene are based upon the rather obsolete traditional
> Vector Space Model. I'm interested in adding newer, state of the art,
> retrieval models based on the notion of Language Models (see
> http://www.nabble.com/file/p22215790/LM-review.pdf for more
> details).
> During the last years, retrieval systems based on LM have outperformed their
> VSM based counterparts consistently in well recognized competitions such as
> TREC. Thus, in order to make Lucene more attractive to IR researchers, I
> would like to implement the following LM scoring functions using both
> Jelinek-Mercer and Dirichlet priors smoothing functions: Query Likelihood,
> KL-Divergence and Cross Entropy.
> Integrating Language Models into Lucene in addition to its proven
> performance capabilities and ease of use, will undoubtedly advance Lucene
> into becoming the leading open source IR framework.
>
> Assuming the usage of an Inverted Index holding posting lists, in order to
> implement basic LM scoring functions, I need the following information
> available during query time:
> 1. For each term in the inverted index –
>    a. Frequency in every document.
>    b. Frequency in the corpus.
> 2. For each document – its size.
> 3. Total size of the corpus.
> As I understand, 1a is implemented in Lucene but the problem is getting 1b,
> 2 and 3 since these details are not calculated during indexing. As I see it,
> one could use the Payload to store document size.
The field size is encoded in the norms.

> However, adding the Corpus
> statistics requires fundamental changes in the index file format. From first
> glance, this addition isn't substantial space-wise since all we need is one
> more parameter per term. My eventual goal is to build a more complete and
> comprehensive index once that will allow running multiple sessions of
> retrieval using different scoring models later.
> I did a survey of the forum but didn't find anything similar to my ideas
> (the closest I got was https://issues.apache.org/jira/browse/LUCENE-965). I
> also understand that there are thoughts regarding changing the index format
> in the future ("flexible indexing" -
> https://issues.apache.org/jira/browse/LUCENE-1458).
>
> My questions are:
> 1. Has anyone tried to do something similar in the past?

This is a term scorer that simply divides term frequency by field length:
https://issues.apache.org/jira/browse/LUCENE-293
A better field length encoding would be welcome, but it's a start.

> 2. Is anyone working on something similar at the moment?

Me, not any more, but that's for other reasons than the qualities of LM.

> 3. Do you think LM can/should become a part of official future Lucene
> versions?

A contrib module with an alternative set of scorers would be a nice goal,
for example starting from the one referenced above.

> 4. How would you recommend implementing the index additions with minimal
> changes as a temporary patch?

No need for a temporary patch, just create a separate issue for each
index addition, and see what happens.

Regards,
Paul Elschot
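
For reference, the four statistics listed in the quoted message (1a, 1b, 2
and 3) are everything the two smoothed query-likelihood scorers need. The
following is only a minimal sketch of that point, assuming tokenized
in-memory documents rather than a Lucene index. All function names and the
default values for lam and mu are hypothetical choices made for illustration
and are not part of any Lucene API. Here lam is taken to be the weight of the
collection model, which is one of two common conventions.

```python
import math
from collections import Counter


def collection_stats(docs):
    """Gather the statistics 1a, 1b, 2 and 3 from tokenized documents."""
    doc_tf = [Counter(d) for d in docs]   # 1a: term frequency per document
    corpus_tf = Counter()                 # 1b: term frequency in the corpus
    for tf in doc_tf:
        corpus_tf.update(tf)
    doc_len = [len(d) for d in docs]      # 2: size of each document
    corpus_len = sum(doc_len)             # 3: total size of the corpus
    return doc_tf, corpus_tf, doc_len, corpus_len


def jelinek_mercer(query, tf, dlen, corpus_tf, corpus_len, lam=0.5):
    """Log query likelihood with Jelinek-Mercer (linear) smoothing."""
    score = 0.0
    for t in query:
        p_c = corpus_tf[t] / corpus_len
        if p_c == 0.0:
            continue  # assumption: terms unseen in the corpus are skipped
        p_d = tf[t] / dlen if dlen else 0.0
        score += math.log((1.0 - lam) * p_d + lam * p_c)
    return score


def dirichlet(query, tf, dlen, corpus_tf, corpus_len, mu=2000.0):
    """Log query likelihood with Dirichlet prior smoothing."""
    score = 0.0
    for t in query:
        p_c = corpus_tf[t] / corpus_len
        if p_c == 0.0:
            continue  # assumption: terms unseen in the corpus are skipped
        score += math.log((tf[t] + mu * p_c) / (dlen + mu))
    return score


# Toy usage: rank two tiny documents for a one-term query.
docs = [["lucene", "index", "index"], ["language", "model", "index"]]
doc_tf, corpus_tf, doc_len, corpus_len = collection_stats(docs)
query = ["index"]
jm_scores = [jelinek_mercer(query, doc_tf[i], doc_len[i], corpus_tf, corpus_len)
             for i in range(len(docs))]
dir_scores = [dirichlet(query, doc_tf[i], doc_len[i], corpus_tf, corpus_len)
              for i in range(len(docs))]
```

Both scorers read only the per-document term frequency, the document length,
the corpus term frequency and the corpus length, so a single index that
stores these values could serve several scoring models at query time.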