Hello Lucene Developers!

My name is Koren Krupko. I'm quite new to Lucene, but I do have research
experience in the field of information retrieval. After reviewing Lucene's
capabilities, I understand that one of its major strengths is its scalability
(as opposed to other frameworks such as Lemur). However, the retrieval and
scoring models used by Lucene are based on the rather dated traditional
Vector Space Model (VSM). I'm interested in adding newer, state-of-the-art
retrieval models based on the notion of Language Models (LM); see
LM-review.pdf (http://www.nabble.com/file/p22215790/LM-review.pdf) for more
details.
In recent years, retrieval systems based on LM have consistently outperformed
their VSM-based counterparts in well-recognized competitions such as TREC.
Thus, in order to make Lucene more attractive to IR researchers, I would like
to implement the following LM scoring functions, each with both Jelinek-Mercer
and Dirichlet prior smoothing: Query Likelihood, KL-Divergence and Cross
Entropy.
Integrating Language Models into Lucene, on top of its proven performance and
ease of use, would go a long way toward making Lucene the leading open source
IR framework.
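
For reference, the two smoothing methods and the Query Likelihood score are
usually written as below. The notation is an assumption made here for
illustration: c(w, d) is the count of term w in document d, |d| is the
document size, cf(w) is the count of w in the whole corpus, |C| is the corpus
size, and lambda and mu are the smoothing parameters. Some papers swap the
roles of lambda and (1 - lambda).

```latex
\begin{align*}
p(w \mid C) &= \frac{cf(w)}{|C|}
  && \text{(collection model)} \\
p_{\lambda}(w \mid d) &= (1 - \lambda)\,\frac{c(w, d)}{|d|} + \lambda\, p(w \mid C)
  && \text{(Jelinek-Mercer, } 0 \le \lambda \le 1\text{)} \\
p_{\mu}(w \mid d) &= \frac{c(w, d) + \mu\, p(w \mid C)}{|d| + \mu}
  && \text{(Dirichlet prior, } \mu > 0\text{)} \\
\log p(q \mid d) &= \sum_{w \in q} c(w, q)\, \log p(w \mid d)
  && \text{(Query Likelihood)}
\end{align*}
```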

Assuming an inverted index holding posting lists, implementing the basic LM
scoring functions requires the following information at query time:
1. For each term in the inverted index:
   a. Its frequency in every document.
   b. Its frequency in the corpus.
2. For each document: its size (total number of term occurrences).
3. The total size of the corpus (total number of term occurrences).
As I understand it, 1a is already implemented in Lucene, but the problem is
getting 1b, 2 and 3, since these details are not calculated during indexing.
As I see it, one could use the Payload to store document size. However, adding
the corpus statistics requires fundamental changes to the index file format.
At first glance, this addition isn't substantial space-wise, since all we need
is one more parameter per term. My eventual goal is to build a more
comprehensive index once, which will then allow running multiple retrieval
sessions with different scoring models later.
I surveyed the forum but didn't find anything similar to my ideas (the
closest I found was https://issues.apache.org/jira/browse/LUCENE-965). I also
understand that there are thoughts about changing the index format in the
future ("flexible indexing",
https://issues.apache.org/jira/browse/LUCENE-1458).
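
To make the role of each statistic concrete, here is a minimal sketch of
smoothed Query Likelihood scoring driven only by items 1a, 1b, 2 and 3. It is
an illustration under assumptions, not a proposed patch: the class, field and
method names are hypothetical, nothing here is Lucene API, and the statistics
are held in plain in-memory maps rather than read from an index.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: LM scoring from the four statistics; not Lucene API. */
final class LmScoringSketch {

    /** Per-document statistics (items 1a and 2). */
    static final class DocStats {
        final Map<String, Integer> termFreq; // 1a: term frequency in this document
        final long docSize;                  // 2: total term occurrences in this document

        DocStats(Map<String, Integer> termFreq, long docSize) {
            this.termFreq = termFreq;
            this.docSize = docSize;
        }
    }

    /** Corpus-wide statistics (items 1b and 3). */
    static final class CorpusStats {
        final Map<String, Long> corpusFreq; // 1b: term frequency in the whole corpus
        final long corpusSize;              // 3: total term occurrences in the corpus

        CorpusStats(Map<String, Long> corpusFreq, long corpusSize) {
            this.corpusFreq = corpusFreq;
            this.corpusSize = corpusSize;
        }
    }

    /** Collection model p(w|C) = cf(w) / |C|; 0 for terms absent from the corpus. */
    static double collectionProb(String term, CorpusStats c) {
        Long cf = c.corpusFreq.get(term);
        return (cf == null || c.corpusSize == 0) ? 0.0 : (double) cf / c.corpusSize;
    }

    /** Jelinek-Mercer: (1 - lambda) * tf/|d| + lambda * p(w|C). */
    static double jelinekMercer(String term, DocStats d, CorpusStats c, double lambda) {
        int tf = d.termFreq.getOrDefault(term, 0);
        double pDoc = d.docSize == 0 ? 0.0 : (double) tf / d.docSize;
        return (1.0 - lambda) * pDoc + lambda * collectionProb(term, c);
    }

    /** Dirichlet prior: (tf + mu * p(w|C)) / (|d| + mu), with mu > 0. */
    static double dirichlet(String term, DocStats d, CorpusStats c, double mu) {
        int tf = d.termFreq.getOrDefault(term, 0);
        return (tf + mu * collectionProb(term, c)) / (d.docSize + mu);
    }

    /**
     * Query log-likelihood with Dirichlet smoothing. Repeated query terms are
     * summed once per occurrence. Terms absent from the corpus are skipped,
     * which is one possible convention, chosen here to avoid log(0).
     */
    static double queryLogLikelihoodDirichlet(List<String> query, DocStats d,
                                              CorpusStats c, double mu) {
        double score = 0.0;
        for (String term : query) {
            if (collectionProb(term, c) == 0.0) {
                continue;
            }
            score += Math.log(dirichlet(term, d, c, mu));
        }
        return score;
    }
}
```

As a worked example, take a term that occurs 3 times in a document of size
100 and 10 times in a corpus of size 1000, so p(w|C) = 0.01. Jelinek-Mercer
with lambda = 0.5 gives 0.5 * 0.03 + 0.5 * 0.01 = 0.02, and Dirichlet with
mu = 100 gives (3 + 1) / (100 + 100) = 0.02. Neither value can be computed
without 1b, 2 and 3.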

My questions are:
1. Has anyone tried to do something similar in the past?
2. Is anyone working on something similar at the moment?
3. Do you think LM can/should become part of future official Lucene
   releases?
4. How would you recommend implementing the index additions with minimal
   changes, as a temporary patch?

Koren
