Hi Gregor, I do not come from your area, so I don't understand everything you are writing about, but from what you write, it looks like you are interested in the new flexible indexing coming with Lucene 4.0 (aka Lucene trunk)? Currently, flexible indexing only allows modifying the term dictionary and posting lists (the 4-dim Enum API in Lucene), but in the future we will also allow modifying the index format of stored fields and term vectors. We have already started on patches that allow per-field/document statistics for BM25 scoring.
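To illustrate why BM25 scoring needs per-field/document statistics in the index, here is a minimal sketch of one common variant of the Okapi BM25 term score. It is not Lucene code and does not reflect the patches mentioned above; the class name `Bm25Sketch`, the method names, and all parameters are hypothetical. The point is only that the formula needs the document's field length and the average field length, which the index has to store somewhere.

```java
// Hypothetical, self-contained sketch of a BM25 term score (not a Lucene API).
public final class Bm25Sketch {

    private Bm25Sketch() {}

    // Inverse document frequency; this smoothed form stays positive for docFreq < docCount.
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Score of one term in one document.
    //   tf        : term frequency in the document's field
    //   docLen    : length of that field in this document (a per-document statistic)
    //   avgDocLen : average field length over the collection (a per-field statistic)
    //   k1, b     : the usual BM25 tuning parameters (e.g. 1.2 and 0.75)
    static double score(double tf, double docLen, double avgDocLen,
                        long docCount, long docFreq, double k1, double b) {
        // Length normalization: long documents are penalized when b > 0.
        double norm = k1 * (1.0 - b + b * docLen / avgDocLen);
        return idf(docCount, docFreq) * tf * (k1 + 1.0) / (tf + norm);
    }
}
```

With `b > 0`, the same term frequency scores higher in a short document than in a long one; with `b = 0`, the document length drops out entirely.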
Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

> -----Original Message-----
> From: Gregor Heinrich [mailto:[email protected]]
> Sent: Friday, November 19, 2010 8:50 AM
> To: [email protected]
> Subject: lucene.index.*: extending Lucene to store topic model data ?
>
> Dear list -- a question on potential storage of data originating from "topic
> models" like LSA (latent semantic analysis) and LDA (latent Dirichlet allocation).
> Packages like Mahout or SemanticVectors allow extraction of latent topics from
> an existing Lucene corpus. They don't have the storage of the actual latent
> concepts integrated into Lucene's efficient backend. So storing those data
> within Lucene's segments may be a benefit for them.
>
> My question: In the IndexWriter backend, is there any reasonable way you can
> think of to store extra information after segments have been created but
> before a commit() ? (This way any IndexSearcher/Reader always sees a
> consistent index.) Further, after the optimize() step, another modification of the
> extra information in index should be possible.
>
> Example scenario: An IndexWriter.preCommit() starts the LDA algorithm from
> the information in the index and stores topic related data with the segments
> currently active for indexing, but in extra files. The extra files contain
> document-specific topic float vectors as well as segment-global float vectors.
> During commit(), the extra files are merged with the segments (which involves
> some math processing again). At the end of the indexing process, the LDA
> algorithm is rerun, improving the topic model globally, thus again modifying
> the extra files.
>
> What may be a point of departure? Adding a modified TermVector-like storage
> approach and hooking it to extended Segment* classes?
>
> Best regards
>
> gregor
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
