List,

I am trying to incorporate the Latent Dirichlet Allocation (LDA) topic model into Lucene. Briefly, the LDA model extracts topics (distributions over words) from a set of documents, and then represents each document as a vector of topic weights. For example, documents could be represented as:

    d1 = (0, 0.5, 0, 0.5)
    d2 = (1, 0, 0, 0)

This means that document d1 contains topics 2 and 4, and document d2 contains topic 1. I.e.,

    P(z1, d1) = 0
    P(z2, d1) = 0.5
    P(z3, d1) = 0
    P(z4, d1) = 0.5
    P(z1, d2) = 1
    P(z2, d2) = 0
    ...

Also, topics are represented by the probability that a term appears in that topic, so we also have a set of vectors:

    z1 = (0, 0, .02, ...)

meaning that topic z1 does not contain terms 1 or 2, but does contain term 3. I.e.,

    P(t1, z1) = 0
    P(t2, z1) = 0
    P(t3, z1) = .02
    ...

Then, the similarity between a query and a document is computed as:

    Sim(query q, doc d) = sum_{t in q} sum_{z} P(t, z) * P(z, d)

Basically, for each term in the query and each topic, see how relevant that term is to that topic, and how relevant that topic is to the document.

I've been thinking about how to do this in Lucene. Assume I already have the topics and the topic vectors for each document. I know that I need to write my own Similarity class that extends DefaultSimilarity. I need to override tf(), queryNorm(), coord(), and computeNorm() to all return a constant 1, so that they have no effect. Then, I can override idf() to compute the Sim equation above.

Seems simple enough. However, I have a few practical issues:

- Storing the topic vectors for each document. Can I store these in the index somehow? If so, how do I retrieve them later in my CustomSimilarity class?

- Changing the Boolean model. Instead of computing the similarity only on documents that contain at least one of the query terms (the default behavior), I need to compute the similarity on all of the documents. (This is the whole idea behind LDA: you don't need an exact term match for there to be a similarity.) I understand that this will result in a performance hit, but I do not see a way around it.

- Turning off fieldNorm(). How can I set the field norm for each doc to a constant 1?

Any help is greatly appreciated.

Steve
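
P.S. To make the scoring concrete, here is a minimal sketch of the Sim equation in plain Java. It does not use Lucene at all. The class name LdaScorer, the array layout, zero-based term and topic indices, and every sample number other than d1, d2, and z1 = (0, 0, .02) are made up for illustration. It scores every document, not just those that match a query term, which is the behavior I am after.

```java
import java.util.Arrays;

/** Hypothetical helper (not a Lucene class) that evaluates the Sim equation directly. */
final class LdaScorer {
    // termGivenTopic[z][t] holds P(t, z); topicGivenDoc[d][z] holds P(z, d).
    private final double[][] termGivenTopic;
    private final double[][] topicGivenDoc;

    LdaScorer(double[][] termGivenTopic, double[][] topicGivenDoc) {
        this.termGivenTopic = termGivenTopic;
        this.topicGivenDoc = topicGivenDoc;
    }

    /** Sim(q, d) = sum over query terms t and topics z of P(t, z) * P(z, d). */
    double sim(int[] queryTerms, int doc) {
        double score = 0.0;
        for (int t : queryTerms) {
            for (int z = 0; z < termGivenTopic.length; z++) {
                score += termGivenTopic[z][t] * topicGivenDoc[doc][z];
            }
        }
        return score;
    }

    /** Scores every document, including those sharing no term with the query. */
    double[] scoreAll(int[] queryTerms) {
        double[] scores = new double[topicGivenDoc.length];
        for (int d = 0; d < scores.length; d++) {
            scores[d] = sim(queryTerms, d);
        }
        return scores;
    }

    public static void main(String[] args) {
        // Four topics over a three-term vocabulary. Only z1 comes from the
        // example above; z2..z4 are invented values.
        double[][] termGivenTopic = {
            {0.0, 0.0, 0.02},  // z1
            {0.1, 0.0, 0.0},   // z2 (assumed)
            {0.0, 0.3, 0.0},   // z3 (assumed)
            {0.05, 0.2, 0.0},  // z4 (assumed)
        };
        double[][] topicGivenDoc = {
            {0.0, 0.5, 0.0, 0.5},  // d1
            {1.0, 0.0, 0.0, 0.0},  // d2
        };
        LdaScorer scorer = new LdaScorer(termGivenTopic, topicGivenDoc);
        // Query made of terms t1 and t3 (indices 0 and 2).
        System.out.println(Arrays.toString(scorer.scoreAll(new int[] {0, 2})));
    }
}
```

With these numbers d1 scores roughly 0.075 and d2 roughly 0.02. Note that the inner sum needs the whole topic vector of the document being scored, which is why I am asking how to get at per-document data from inside a Similarity.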