lucene.index.*: extending Lucene to store topic model data?

Gregor Heinrich Thu, 18 Nov 2010 23:44:55 -0800

Dear list, a question on storing data that originates from "topic models" such as LSA (latent semantic analysis) and LDA (latent Dirichlet allocation). Packages like Mahout or SemanticVectors can extract latent topics from an existing Lucene corpus, but they do not integrate storage of the actual latent concepts into Lucene's efficient backend. Storing those data within Lucene's segments may therefore be a benefit for them.

My question: in the IndexWriter backend, is there any reasonable way you can think of to store extra information after segments have been created but before a commit()? (This way, any IndexSearcher/Reader always sees a consistent index.) Further, after the optimize() step, another modification of the extra information in the index should be possible.

Example scenario: an IndexWriter.preCommit() starts the LDA algorithm from the information in the index and stores topic-related data with the segments currently active for indexing, but in extra files. The extra files contain document-specific topic float vectors as well as segment-global float vectors. During commit(), the extra files are merged with the segments (which again involves some math processing). At the end of the indexing process, the LDA algorithm is rerun, improving the topic model globally and thus again modifying the extra files.
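
To make the extra-file idea more concrete, here is a minimal sketch that is independent of Lucene and uses only the Java standard library. Everything in it is an assumption for illustration: the class name TopicSidecar, the two method names, and the file layout (document count, topic count, one segment-global float vector, then one fixed-width float vector per document). It is not an existing Lucene API or file format.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical per-segment "extra file" holding topic vectors.
// Assumed layout (big-endian):
//   int numDocs, int numTopics,
//   float[numTopics] segment-global vector,
//   numDocs rows of float[numTopics] document vectors.
final class TopicSidecar {

    // Writes the segment-global vector followed by one row per document.
    static void write(Path file, float[] global, float[][] docTopics) throws IOException {
        for (float[] row : docTopics) {
            if (row.length != global.length) {
                throw new IllegalArgumentException("all vectors must have the same number of topics");
            }
        }
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeInt(docTopics.length);
            out.writeInt(global.length);
            for (float g : global) {
                out.writeFloat(g);
            }
            for (float[] row : docTopics) {
                for (float v : row) {
                    out.writeFloat(v);
                }
            }
        }
    }

    // Reads the topic vector of one document by its segment-local number.
    static float[] readDocVector(Path file, int docId) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            int numDocs = raf.readInt();
            int numTopics = raf.readInt();
            if (docId < 0 || docId >= numDocs) {
                throw new IndexOutOfBoundsException("docId out of range: " + docId);
            }
            // Skip the 8-byte header, the global vector, and the preceding rows.
            long offset = 8L + 4L * numTopics + 4L * numTopics * docId;
            raf.seek(offset);
            float[] vector = new float[numTopics];
            for (int i = 0; i < numTopics; i++) {
                vector[i] = raf.readFloat();
            }
            return vector;
        }
    }
}
```

Fixed-width rows are chosen here so that a document's vector can be fetched with a single seek. How such files would be merged along with the segments, and how the segment-global vectors would be recombined, depends on the topic model and is left open in this sketch.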

What may be a point of departure? Adding a modified TermVector-like storage approach and hooking it into extended Segment* classes?


Best regards

gregor



