So, yes. You can do this kind of retrieval using Lucene. Setting aside the details of the Liu and Croft method, the basic idea is that the observed words in a document can be augmented by means of a hierarchical language model. At the top there is a corpus language model describing the gross characteristics of the language in question. Below that are clusters, each with its own model that is informed by both the corpus model and the details of the documents in the cluster. Finally, there are document language models informed by the contents of the document, the contents of the cluster, and possibly also the corpus model.
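To make the three levels concrete, here is a minimal sketch that treats the hierarchy as a plain linear interpolation of maximum-likelihood unigram models. This is an illustration under assumptions, not necessarily the exact Liu and Croft formulation: the function names, the mixing weights `lam` and `beta`, and the toy documents are all made up for the example, and the corpus is assumed to contain every document, including the clustered ones.

```python
from collections import Counter

def ml_model(tokens):
    """Maximum-likelihood unigram model: relative frequency of each word."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def smoothed_doc_model(doc, cluster_docs, corpus_docs, lam=0.6, beta=0.7):
    """Interpolate the document, cluster and corpus models.

    lam and beta are hypothetical mixing weights picked for illustration.
    Assumes corpus_docs includes doc and every document in cluster_docs,
    so the corpus vocabulary covers all words.
    """
    p_doc = ml_model(doc)
    p_cluster = ml_model([w for d in cluster_docs for w in d])
    p_corpus = ml_model([w for d in corpus_docs for w in d])
    return {
        w: lam * p_doc.get(w, 0.0)
        + (1 - lam) * (beta * p_cluster.get(w, 0.0) + (1 - beta) * p_corpus[w])
        for w in p_corpus
    }

def derived_words(doc, model, k=3):
    """Top-k words the model finds likely but the document never uses."""
    seen = set(doc)
    unseen = [(w, p) for w, p in model.items() if w not in seen]
    # Sort by descending probability, breaking ties alphabetically.
    return [w for w, _ in sorted(unseen, key=lambda x: (-x[1], x[0]))[:k]]

# Toy data: two documents in one cluster plus one unrelated document.
cluster = [["lucene", "index", "search"], ["lucene", "query", "search"]]
other = [["kmeans", "cluster", "centroid"]]
corpus = cluster + other
doc = cluster[0]

model = smoothed_doc_model(doc, cluster, corpus)
print(derived_words(doc, model, k=1))
```

Words that never occur in the document still receive probability mass from its cluster and from the corpus, and words from the same cluster rank above words that only appear elsewhere in the corpus.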
The document language model can be used to derive words that the author might well have used had they written at greater length. You can index these derived words just as easily as the words that actually appear. You may be able to get away with Lucene's native ability to modify weights on words, or you might need to use the flexibility of the scoring system to build your own scorer.

Also, depending on the scale of your corpus, you might benefit from the Apache Mahout project's k-means clustering.

If you allow probabilistic membership in multiple clusters, then this language model probably reduces to either PLSI or LDA (I can't say which without detailed analysis). You could also start with those models and do the augmented indexing trick. Mahout has a reasonably nice implementation of LDA as well as k-means.

My guess is that the hierarchical nature of the language model actually has gain even with LDA, so you might want to do conventional clustering, LDA on the clusters, then LDA on the contents of each cluster.

So the answer is yes.

On Sat, Nov 27, 2010 at 1:19 AM, vermansi <verma...@gmail.com> wrote:

> Cluster-Based Retrieval Using Language Models
> <http://www.google.co.in/url?sa=t&source=web&cd=1&ved=0CCIQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.83.4177%26rep%3Drep1%26type%3Dpdf&ei=8szwTPaOMJHIuAO3lIH6DQ&usg=AFQjCNEiQCxvKNZMfGKk6pRtdLaqIY847g&sig2=GE_yyn_ow9KQojgwnZ2ACw>
> by X Liu - 2004
>
> I hope this helps .. sorry for constantly giving incomplete information
>
> Regards
> Manisha