There is a Lucene LM implementation, intended for research purposes only, at
http://ilps.science.uva.nl/resources/lm-lucene
It is a very old implementation, but it may be useful to you.

jose

On Thu, Feb 26, 2009 at 9:25 AM, Paul Elschot <paul.elsc...@xs4all.nl> wrote:
> On Thursday 26 February 2009 02:21:41 Koren Krupko wrote:
> >
> > Hello Lucene Developers!
> >
> > My name is Koren Krupko. I'm quite new to Lucene but I do have experience
> > in research in the fields of information retrieval. After reviewing
> > Lucene's capabilities I understand that one of its major strengths is its
> > scalability (as opposed to other frameworks such as Lemur). However, the
> > retrieval and scoring models used by Lucene are based upon the rather
> > obsolete traditional Vector Space Model. I'm interested in adding newer,
> > state of the art, retrieval models based on the notion of Language Models
> > (see http://www.nabble.com/file/p22215790/LM-review.pdf LM-review.pdf for
> > more details).
> > During the last years, retrieval systems based on LM have outperformed
> > their VSM based counterparts consistently in well recognized competitions
> > such as TREC. Thus, in order to make Lucene more attractive to IR
> > researchers, I would like to implement the following LM scoring functions
> > using both Jelinek-Mercer and Dirichlet priors smoothing functions: Query
> > Likelihood, KL-Divergence and Cross Entropy.
> > Integrating Language Models into Lucene in addition to its proven
> > performance capabilities and ease of use, will undoubtedly advance Lucene
> > into becoming the leading open source IR framework.
> >
> > Assuming the usage of an Inverted Index holding posting lists, in order to
> > implement basic LM scoring functions, I need the following information
> > available during query time:
> > 1. For each term in the inverted index:
> >    a. Frequency in every document.
> >    b. Frequency in the corpus.
> > 2. For each document: its size.
> > 3. Total size of the corpus.
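To illustrate why the four statistics listed above are enough, here is a minimal Python sketch of query-likelihood scoring with the two smoothing methods mentioned in the proposal. It is not Lucene code and not the implementation from the original proposal: the function names, the default values of `lam` and `mu`, the convention that `lam` weights the corpus model, and the choice to skip query terms that never occur in the corpus are all assumptions made for illustration.

```python
import math


def jelinek_mercer(tf, doc_len, cf, corpus_len, lam=0.5):
    """P(t|d) as a linear mix of the document and corpus estimates.

    tf: term frequency in the document (statistic 1a)
    cf: term frequency in the corpus (statistic 1b)
    doc_len: document size (statistic 2)
    corpus_len: total corpus size (statistic 3)
    lam: weight given to the corpus model (assumed convention)
    """
    p_doc = tf / doc_len if doc_len > 0 else 0.0
    p_corpus = cf / corpus_len
    return (1.0 - lam) * p_doc + lam * p_corpus


def dirichlet(tf, doc_len, cf, corpus_len, mu=2000.0):
    """P(t|d) with Dirichlet-prior smoothing; mu acts as a pseudo-count."""
    return (tf + mu * cf / corpus_len) / (doc_len + mu)


def query_log_likelihood(query_terms, doc_tf, doc_len,
                         corpus_cf, corpus_len, smooth):
    """Sum of log P(t|d) over the query terms.

    Terms that never occur in the corpus are skipped (an assumption),
    because their smoothed probability would be zero.
    """
    score = 0.0
    for term in query_terms:
        cf = corpus_cf.get(term, 0)
        if cf == 0:
            continue
        tf = doc_tf.get(term, 0)
        score += math.log(smooth(tf, doc_len, cf, corpus_len))
    return score


# Hypothetical toy statistics, standing in for what the index would supply.
corpus_cf = {"lucene": 2, "index": 1, "model": 3}
corpus_len = 10
doc_a = ({"lucene": 2, "index": 1}, 4)  # (term frequencies, document size)
doc_b = ({"model": 3}, 6)

for name, (doc_tf, doc_len) in (("doc_a", doc_a), ("doc_b", doc_b)):
    for smooth in (jelinek_mercer, dirichlet):
        s = query_log_likelihood(["lucene", "index"], doc_tf, doc_len,
                                 corpus_cf, corpus_len, smooth)
        print(name, smooth.__name__, s)
```

Both smoothing functions take the same four inputs, so an index that exposes 1a, 1b, 2 and 3 at query time could switch between them without being rebuilt.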
> > As I understand, 1a is implemented in Lucene but the problem is getting
> > 1b, 2 and 3 since these details are not calculated during indexing. As I
> > see it, one could use the Payload to store document size.
>
> The field size is encoded in the norms.
>
> > However, adding the Corpus statistics requires fundamental changes in the
> > index file format. From first glance, this addition isn't substantial
> > space-wise since all we need is one more parameter per term. My eventual
> > goal is to build a more complete and comprehensive index once that will
> > allow running multiple sessions of retrieval using different scoring
> > models later.
> > I did a survey of the forum but didn't find anything similar to my ideas
> > (the closest I got was https://issues.apache.org/jira/browse/LUCENE-965).
> > I also understand that there are thoughts regarding changing the index
> > format in the future ("flexible indexing" -
> > https://issues.apache.org/jira/browse/LUCENE-1458).
> >
> > My questions are:
> > 1. Has anyone tried to do something similar in the past?
>
> This is a term scorer that simply divides term frequency by field length:
> https://issues.apache.org/jira/browse/LUCENE-293
> A better field length encoding would be welcome, but it's a start.
>
> > 2. Is anyone working on something similar at the moment?
>
> Me, not any more, but that's for other reasons than the qualities of LM.
>
> > 3. Do you think LM can/should become a part of official future Lucene
> > versions?
>
> A contrib module with an alternative set of scorers would be a nice goal,
> for example starting from the one referenced above.
>
> > 4. How would you recommend implementing the index additions with minimal
> > changes as a temporary patch?
>
> No need for a temporary patch, just create a separate issue for each index
> addition, and see what happens.
>
> Regards,
> Paul Elschot

--
José Ramón Pérez Agüera
Dept. de Ingeniería del Software e Inteligencia Artificial
Despacho 411 tlf. 913947599
Facultad de Informática
Universidad Complutense de Madrid