Re: Integrating Language Models into Lucene

Grant Ingersoll Thu, 26 Feb 2009 04:42:06 -0800

I think there is a group in the Netherlands that has open sourced a version of Lucene using Language Models.

I'd certainly welcome alternate implementations. There have been many, many discussions about "flexible indexing" (http://www.lucidimagination.com/search/?q=flexible+indexing, and I know there are a bunch of related JIRA issues too) on the list here that you might look at. In fact, several people have made some progress towards it, such that we are getting close to being able to more easily plug in different scoring models. With flex. indexing, you should be able to do #3 below, and I believe all the others are already possible.




On Feb 25, 2009, at 8:21 PM, Koren Krupko wrote:

Hello Lucene Developers!
My name is Koren Krupko. I'm quite new to Lucene but I do have research experience in the field of information retrieval. After reviewing Lucene's capabilities I understand that one of its major strengths is its scalability (as opposed to other frameworks such as Lemur). However, the retrieval and scoring models used by Lucene are based upon the rather obsolete traditional Vector Space Model. I'm interested in adding newer, state of the art, retrieval models based on the notion of Language Models (see LM-review.pdf at http://www.nabble.com/file/p22215790/LM-review.pdf for more details).
In recent years, retrieval systems based on LM have consistently outperformed their VSM-based counterparts in well-recognized competitions such as TREC. Thus, in order to make Lucene more attractive to IR researchers, I would like to implement the following LM scoring functions using both Jelinek-Mercer and Dirichlet prior smoothing: Query Likelihood, KL-Divergence and Cross Entropy.
Integrating Language Models into Lucene, in addition to its proven performance capabilities and ease of use, will undoubtedly advance Lucene into becoming the leading open source IR framework.
Assuming the usage of an Inverted Index holding posting lists, in order to implement basic LM scoring functions, I need the following information available during query time:
1.      For each term in the inverted index –
a.      Frequency in every document.
b.      Frequency in the corpus.
2.      For each document – its size.
3.      Total size of the corpus.
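
As a rough illustration of how the four statistics listed above (1a, 1b, 2 and 3) feed into query-likelihood scoring, here is a minimal Java sketch. The class and method names (LmScoringSketch, jelinekMercer, dirichlet, queryLogLikelihood) are hypothetical and are not Lucene APIs. The statistics are passed in as plain arguments rather than read from an index, and lambda and mu are smoothing parameters chosen by the caller. Skipping query terms that never occur in the corpus is an assumption of this sketch, not something the original message specifies.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of smoothed language-model scoring; not a Lucene API. */
final class LmScoringSketch {

    /**
     * Jelinek-Mercer smoothing:
     * p(t|d) = (1 - lambda) * tf/|d| + lambda * cf/|C|
     * tf = term frequency in the document (1a), cf = term frequency in the corpus (1b),
     * docLen = document size (2), corpusLen = total corpus size (3).
     */
    static double jelinekMercer(long tf, long docLen, long cf, long corpusLen, double lambda) {
        double pDoc = docLen == 0 ? 0.0 : (double) tf / docLen;
        double pCorpus = (double) cf / corpusLen;
        return (1.0 - lambda) * pDoc + lambda * pCorpus;
    }

    /**
     * Dirichlet prior smoothing:
     * p(t|d) = (tf + mu * cf/|C|) / (|d| + mu)
     */
    static double dirichlet(long tf, long docLen, long cf, long corpusLen, double mu) {
        double pCorpus = (double) cf / corpusLen;
        return (tf + mu * pCorpus) / (docLen + mu);
    }

    /**
     * Query log-likelihood of one document under Dirichlet smoothing:
     * the sum over query terms of log p(t|d).
     */
    static double queryLogLikelihood(List<String> query,
                                     Map<String, Long> docTermFreqs,
                                     long docLen,
                                     Map<String, Long> corpusTermFreqs,
                                     long corpusLen,
                                     double mu) {
        double score = 0.0;
        for (String term : query) {
            long cf = corpusTermFreqs.getOrDefault(term, 0L);
            if (cf == 0) {
                continue; // assumption: ignore terms unseen in the whole corpus
            }
            long tf = docTermFreqs.getOrDefault(term, 0L);
            score += Math.log(dirichlet(tf, docLen, cf, corpusLen, mu));
        }
        return score;
    }
}
```

Note that both smoothing functions need the corpus-level values (1b and 3) even for a term that is absent from the document, which is why those statistics must be reachable at query time.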
As I understand, 1a is implemented in Lucene but the problem is getting 1b, 2 and 3 since these details are not calculated during indexing. As I see it, one could use the Payload to store document size. However, adding the Corpus statistics requires fundamental changes in the index file format. At first glance, this addition isn't substantial space-wise since all we need is one more parameter per term. My eventual goal is to build a more complete and comprehensive index once that will allow running multiple sessions of retrieval using different scoring models later.
I did a survey of the forum but didn't find anything similar to my ideas (the closest I got was https://issues.apache.org/jira/browse/LUCENE-965). I also understand that there are thoughts regarding changing the index format in the future ("flexible indexing" - https://issues.apache.org/jira/browse/LUCENE-1458).

My questions are:
1.      Has anyone tried to do something similar in the past?
2.      Is anyone working on something similar at the moment?
3.      Do you think LM can/should become a part of official future Lucene
versions?
4.      How would you recommend implementing the index additions with minimal
changes as a temporary patch?

Koren

--
View this message in context: 
http://www.nabble.com/Integrating-Language-Models-into-Lucene-tp22215790p22215790.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:

http://www.lucidimagination.com/search


