Hi Andy,

I am also very interested in such approaches. I have tried a hack to simulate the effects of LSI in a Lucene index. As you suggested, I extracted the term frequencies from the index, constructed a term/document matrix, and performed SVD on it. I then multiplied the resulting values by a constant factor to simulate term frequencies in the LSI space (that is, I created a new field "lsi" in the documents and added the words with their corresponding frequencies). This is a pretty nasty hack, though, and I would appreciate it if anyone could suggest a good way of applying LSI to Lucene.
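To make the steps above concrete, here is a minimal sketch in plain Python. It is not the code I actually ran and it does not touch the Lucene API. The names `rank_k_approx`, `lsi_field_text` and `scale` are made up for illustration, the power iteration stands in for a real SVD routine, and clamping negative weights to zero is only one assumed way of dealing with them.

```python
import math
import random


def rank_k_approx(matrix, k, iters=200, seed=0):
    """Approximate the rank-k truncated SVD reconstruction of `matrix`
    (rows = terms, columns = documents) by power iteration with deflation.
    A stand-in for a proper SVD library, good enough for tiny examples."""
    rng = random.Random(seed)
    m, n = len(matrix), len(matrix[0])
    residual = [[float(x) for x in row] for row in matrix]
    approx = [[0.0] * n for _ in range(m)]
    for _ in range(k):
        # Arbitrary positive start vector for the right singular vector.
        v = [rng.random() + 0.1 for _ in range(n)]
        for _ in range(iters):
            u = [sum(residual[i][j] * v[j] for j in range(n)) for i in range(m)]
            w = [sum(residual[i][j] * u[i] for i in range(m)) for j in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # residual is (numerically) zero
                break
            v = [x / norm for x in w]
        # residual * v equals sigma times the left singular vector.
        u = [sum(residual[i][j] * v[j] for j in range(n)) for i in range(m)]
        for i in range(m):
            for j in range(n):
                approx[i][j] += u[i] * v[j]
                residual[i][j] -= u[i] * v[j]
    return approx


def lsi_field_text(terms, approx, doc, scale=10):
    """Build the text of a hypothetical "lsi" field for document `doc`:
    each term is repeated round(weight * scale) times, so the indexer
    sees an integer term frequency. Negative weights are clamped to 0
    (an assumption; they cannot be expressed as repetitions)."""
    words = []
    for i, term in enumerate(terms):
        count = max(0, round(approx[i][doc] * scale))
        words.extend([term] * count)
    return " ".join(words)


if __name__ == "__main__":
    terms = ["car", "auto", "flower"]
    # Columns are documents; entries are raw term frequencies.
    tf = [[2, 0, 0],
          [1, 2, 0],
          [0, 0, 3]]
    reduced = rank_k_approx(tf, k=2)
    for doc in range(3):
        print(doc, lsi_field_text(terms, reduced, doc))
```

The string returned by `lsi_field_text` would be indexed as the extra field, so the integer tf that Lucene stores approximates the scaled LSI weight. A larger `scale` loses less precision to rounding but makes the field longer.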
Are there any plans to include LSI as a Lucene feature in the future?

Regards,
Tarjei

On 11/15/05, Andy Liu <[EMAIL PROTECTED]> wrote:
> I'm currently experimenting with latent semantic indexing techniques and
> Lucene. I need to extract term frequencies from a Lucene index and construct
> a document/term matrix, then subsequently perform some mathematical
> algorithms on this matrix which produce float and potentially negative term
> frequency values. Extracting the tf's from the Lucene index is easy. The
> hard part is importing the modified tf's back into the index, since in
> Lucene, tf's are stored as integer values.
>
> Anybody that knows the Lucene codebase well have any tips? Has anybody even
> tried performing LSI on a Lucene index?
>
> Andy