Re: Integrating Language Models into Lucene

José Ramón Pérez Agüera Fri, 27 Feb 2009 00:34:30 -0800

There is a Lucene LM implementation, meant for research purposes only, at

http://ilps.science.uva.nl/resources/lm-lucene

It is a very old implementation, but it may be useful to you.

jose

On Thu, Feb 26, 2009 at 9:25 AM, Paul Elschot <[email protected]> wrote:
> On Thursday 26 February 2009 02:21:41 Koren Krupko wrote:
>> Hello Lucene Developers!
>>
>> My name is Koren Krupko. I'm quite new to Lucene but I do have experience in
>> research in the fields of information retrieval. After reviewing Lucene's
>> capabilities I understand that one of its major strengths is its scalability
>> (as opposed to other frameworks such as Lemur). However, the retrieval and
>> scoring models used by Lucene are based upon the rather obsolete traditional
>> Vector Space Model. I'm interested in adding newer, state of the art,
>> retrieval models based on the notion of Language Models (see
>> http://www.nabble.com/file/p22215790/LM-review.pdf LM-review.pdf for more
>> details).
>> During the last years, retrieval systems based on LM have outperformed their
>> VSM based counterparts consistently in well recognized competitions such as
>> TREC. Thus, in order to make Lucene more attractive to IR researchers, I
>> would like to implement the following LM scoring functions using both
>> Jelinek-Mercer and Dirichlet priors smoothing functions: Query Likelihood,
>> KL-Divergence and Cross Entropy.
>> Integrating Language Models into Lucene in addition to its proven
>> performance capabilities and ease of use, will undoubtedly advance Lucene
>> into becoming the leading open source IR framework.
>>
>> Assuming the usage of an Inverted Index holding posting lists, in order to
>> implement basic LM scoring functions, I need the following information
>> available during query time:
>> 1. For each term in the inverted index –
>>    a. Frequency in every document.
>>    b. Frequency in the corpus.
>> 2. For each document – its size.
>> 3. Total size of the corpus.
>> As I understand, 1a is implemented in Lucene but the problem is getting 1b,
>> 2 and 3 since these details are not calculated during indexing. As I see it,
>> one could use the Payload to store document size.
>
> The field size is encoded in the norms.
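
To make the list above concrete, here is a minimal sketch (plain Python, not
Lucene code) of query likelihood scoring with Jelinek-Mercer and Dirichlet
smoothing, computed from nothing more than statistics 1a, 1b, 2 and 3. All
function names, the toy documents, and the default values of lam and mu are
assumptions made for illustration; skipping query terms that never occur in
the corpus is also just one possible choice.

```python
import math
from collections import Counter


def collect_stats(docs):
    """Gather the four statistics from a list of tokenized, non-empty documents."""
    term_doc_freq = [Counter(d) for d in docs]   # 1a: term frequency per document
    corpus_freq = Counter()                      # 1b: term frequency in the corpus
    for counts in term_doc_freq:
        corpus_freq.update(counts)
    doc_len = [len(d) for d in docs]             # 2: size of each document
    corpus_len = sum(doc_len)                    # 3: total size of the corpus
    return term_doc_freq, corpus_freq, doc_len, corpus_len


def jelinek_mercer(tf, dl, cf, cl, lam=0.5):
    """p(t|d) = (1 - lam) * tf/|d| + lam * cf/|C|  (assumes dl > 0)."""
    return (1.0 - lam) * (tf / dl) + lam * (cf / cl)


def dirichlet(tf, dl, cf, cl, mu=2000.0):
    """p(t|d) = (tf + mu * cf/|C|) / (|d| + mu)."""
    return (tf + mu * cf / cl) / (dl + mu)


def query_log_likelihood(query, doc_id, stats, smooth):
    """Sum of log p(t|d) over the query terms, using the given smoothing."""
    term_doc_freq, corpus_freq, doc_len, corpus_len = stats
    score = 0.0
    for term in query:
        cf = corpus_freq.get(term, 0)
        if cf == 0:
            continue  # assumption of this sketch: ignore terms unseen in the corpus
        tf = term_doc_freq[doc_id].get(term, 0)
        score += math.log(smooth(tf, doc_len[doc_id], cf, corpus_len))
    return score


docs = [["lucene", "index", "lucene"], ["language", "model", "index"]]
stats = collect_stats(docs)
for doc_id in range(len(docs)):
    print(doc_id,
          query_log_likelihood(["lucene", "index"], doc_id, stats, jelinek_mercer),
          query_log_likelihood(["lucene", "index"], doc_id, stats, dirichlet))
```

Both smoothing functions take the same four numbers, which is why an index
that exposes 1b, 2 and 3 alongside 1a would be enough to try either one at
query time.
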
>
>> However, adding the Corpus statistics requires fundamental changes in the
>> index file format. From first glance, this addition isn't substantial
>> space-wise since all we need is one more parameter per term. My eventual
>> goal is to build a more complete and comprehensive index once that will
>> allow running multiple sessions of retrieval using different scoring models
>> later.
>> I did a survey of the forum but didn't find anything similar to my ideas
>> (the closest I got was https://issues.apache.org/jira/browse/LUCENE-965). I
>> also understand that there are thoughts regarding changing the index format
>> in the future ("flexible indexing" -
>> https://issues.apache.org/jira/browse/LUCENE-1458).
>>
>> My questions are:
>> 1. Has anyone tried to do something similar in the past?
>
> This is a term scorer that simply divides term frequency by field length:
>
> https://issues.apache.org/jira/browse/LUCENE-293
>
> A better field length encoding would be welcome, but it's a start.
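
A rough illustration of why the field length encoding matters for a scorer
that divides term frequency by field length: when the length is stored in a
compact form, only an approximation can be recovered at query time. The
quantizer below is a hypothetical stand-in (it keeps the top three bits of
the length) and is not Lucene's actual norm encoding; the function names are
also made up for this sketch.

```python
def quantize_length(n, mantissa_bits=3):
    """Hypothetical lossy encoding: keep only the top `mantissa_bits` bits of n."""
    if n <= 0:
        return 0
    shift = max(n.bit_length() - mantissa_bits, 0)
    return (n >> shift) << shift


def tf_over_length(tf, length):
    """Score of a single term: term frequency divided by field length."""
    return tf / length if length > 0 else 0.0


for true_len in (7, 100, 1000):
    approx_len = quantize_length(true_len)
    print(true_len, approx_len,
          tf_over_length(3, true_len), tf_over_length(3, approx_len))
```

Short fields survive the round trip unchanged, while longer fields are
rounded down, so their tf/length score comes out somewhat too high.
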
>
>> 2. Is anyone working on something similar at the moment?
>
> Me, not any more, but that's for other reasons than the qualities of LM.
>
>> 3. Do you think LM can/should become a part of official future Lucene
>> versions?
>
> A contrib module with an alternative set of scorers would be a nice goal,
> for example starting from the one referenced above.
>
>> 4. How would you recommend implementing the index additions with minimal
>> changes as a temporary patch?
>
> No need for a temporary patch, just create a separate issue for each index
> addition, and see what happens.
>
> Regards,
>
> Paul Elschot



-- 
José Ramón Pérez Agüera

Dept. de Ingeniería del Software e Inteligencia Artificial
Despacho 411 tlf. 913947599
Facultad de Informática
Universidad Complutense de Madrid

