Hi Uwe -- thanks for this great hint. Is it considered stable enough to throw corpora of 100 MB or more of raw text at it?

ps -- sorry for staying cryptic about the actual application. I tried to abstract its relation to Lucene... Basically it's about automatically associating queries and documents with groups of related terms (topics) and thus improving recall. I wrote an introductory note about this stuff that gives an overview and cites much of the original literature: http://www.arbylon.net/publications/text-est2.pdf .

All the best

gregor


On 11/19/10 9:07 AM, Uwe Schindler wrote:
Hi Gregor,

I do not come from your area, so I don't understand all the stuff you are
writing about, but from what you write it looks like you are interested in
the new flexible indexing coming with Lucene 4.0 aka Lucene trunk. Currently
flexible indexing only allows you to modify the term dictionary and posting
lists (the 4-dim enum API in Lucene), but in the future we will also allow
modifying the index format of stored fields/term vectors. There are already
first patches that allow per-field/document statistics for BM25 scoring.
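
For reference, a minimal sketch of that enumeration chain (fields -> terms ->
postings), written against the Lucene 4.0 API as it eventually shipped; the
exact trunk signatures at the time of this mail may have differed slightly:

import java.io.IOException;

import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

public class FlexEnumWalk {
    // Walk every field, every term, and every posting of a reader.
    static void walk(IndexReader reader) throws IOException {
        Fields fields = MultiFields.getFields(reader);
        for (String field : fields) {
            Terms terms = fields.terms(field);
            TermsEnum te = terms.iterator(null);
            BytesRef term;
            while ((term = te.next()) != null) {
                DocsEnum docs = te.docs(null, null);
                int doc;
                while ((doc = docs.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                    int freq = docs.freq(); // doc id and term frequency for this posting
                }
            }
        }
    }
}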

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

-----Original Message-----
From: Gregor Heinrich [mailto:[email protected]]
Sent: Friday, November 19, 2010 8:50 AM
To: [email protected]
Subject: lucene.index.*: extending Lucene to store topic model data ?

Dear list -- a question on potential storage of data originating from "topic
models" like LSA (latent semantic analysis) and LDA (latent Dirichlet
allocation).
Packages like Mahout or SemanticVectors allow extraction of latent topics
from an existing Lucene corpus, but they do not have storage of the actual
latent concepts integrated into Lucene's efficient backend. So storing those
data within Lucene's segments may be a benefit for them.

My question: in the IndexWriter backend, is there any reasonable way you can
think of to store extra information after segments have been created but
before a commit()? (This way any IndexSearcher/Reader always sees a
consistent index.) Further, after the optimize() step, another modification
of the extra information in the index should be possible.
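
One existing hook that comes close is IndexWriter's two-phase commit:
prepareCommit() flushes and syncs the new segments without publishing them,
so extra data could be computed and written between prepareCommit() and
commit(). A minimal sketch of that sequence, where writeTopicData() is a
hypothetical placeholder for the topic-model step:

import java.io.IOException;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

// Sketch only: compute and store topic data between the two commit phases.
public class TopicCommit {

    void indexAndAttachTopics(IndexWriter writer, Directory dir) throws IOException {
        // ... addDocument() calls happen before this point ...

        // Phase 1: flush and sync the new segments without publishing them.
        writer.prepareCommit();

        // Run the topic model over the flushed segments and write the results
        // as extra files into the same Directory (hypothetical helper below).
        writeTopicData(dir);

        // Phase 2: publish the new segments together with the extra files.
        writer.commit();
    }

    // Hypothetical placeholder for the LDA/LSA step described in the scenario.
    void writeTopicData(Directory dir) throws IOException {
        // e.g. dir.createOutput(...) with per-document topic vectors
    }
}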

Example scenario: an IndexWriter.preCommit() starts the LDA algorithm from
the information in the index and stores topic-related data with the segments
currently active for indexing, but in extra files. The extra files contain
document-specific topic float vectors as well as segment-global float
vectors. During commit(), the extra files are merged with the segments
(which involves some math processing again). At the end of the indexing
process, the LDA algorithm is rerun, improving the topic model globally and
thus again modifying the extra files.
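
For the extra files themselves, one low-level option is to write them through
Lucene's Directory/IndexOutput abstraction so they live next to the segment
files and travel with the index. A rough sketch, assuming per-document float
vectors, a caller-supplied file name, and a made-up layout (this is not an
existing Lucene file format; the calls use the 3.x-era Directory API, which
trunk later extended with an IOContext argument):

import java.io.IOException;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

// Sketch: store K-dimensional topic vectors for all documents in one sidecar
// file inside the index Directory. File name and layout are made up.
public class TopicVectorFile {

    static void write(Directory dir, String name, float[][] docTopics) throws IOException {
        IndexOutput out = dir.createOutput(name);
        try {
            out.writeVInt(docTopics.length);     // number of documents
            out.writeVInt(docTopics[0].length);  // number of topics K
            for (float[] vec : docTopics) {
                for (float v : vec) {
                    out.writeInt(Float.floatToIntBits(v));
                }
            }
        } finally {
            out.close();
        }
    }

    static float[][] read(Directory dir, String name) throws IOException {
        IndexInput in = dir.openInput(name);
        try {
            int numDocs = in.readVInt();
            int numTopics = in.readVInt();
            float[][] docTopics = new float[numDocs][numTopics];
            for (int d = 0; d < numDocs; d++) {
                for (int k = 0; k < numTopics; k++) {
                    docTopics[d][k] = Float.intBitsToFloat(in.readInt());
                }
            }
            return docTopics;
        } finally {
            in.close();
        }
    }
}

Nothing in such a sidecar file stays consistent across segment merges by
itself; keeping it in sync is exactly where hooks into the Segment*/merge
machinery would be needed.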

What may be a point of departure? Adding a modified TermVector-like storage
approach and hooking it into extended Segment* classes?

Best regards

gregor


