Robin Anil wrote:
> 
> http://www.lucidimagination.com/search/document/3ae15062f35420cf/lda_for_multi_label_classification_was_mahout_book
> 
> David gave me a very nice paper which talks about tag-document
> correlation. If you start with named labels, it ends up being a naive
> Bayes classifier.
> 

One caveat on this: it reduces to NB only when there is exactly one observed
label per document. Otherwise you have to do some kind of inference to
figure out which words belong to which label.
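
To make the single-label collapse concrete, here is a minimal sketch (mine,
not Mahout code): with exactly one observed label per document, every word's
topic assignment is forced to that label, so "training" is just the naive
Bayes sufficient statistic, i.e. per-label word counts.

import java.util.HashMap;
import java.util.Map;

public class SingleLabelCounts {
  // counts.get(label).get(word) = how often `word` appears in documents
  // observed with `label`; exactly the NB training statistic.
  private final Map<String, Map<String, Integer>> counts =
      new HashMap<String, Map<String, Integer>>();

  public void observe(String label, String[] docWords) {
    Map<String, Integer> wordCounts = counts.get(label);
    if (wordCounts == null) {
      wordCounts = new HashMap<String, Integer>();
      counts.put(label, wordCounts);
    }
    // No inference needed: the single label answers "which label does
    // this word belong to?" for every word in the document.
    for (String word : docWords) {
      Integer c = wordCounts.get(word);
      wordCounts.put(word, c == null ? 1 : c + 1);
    }
  }
}

With more than one label per document, that per-word assignment is latent,
which is where the inference comes in.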


Robin Anil wrote:
> 
> On Mon, Jan 11, 2010 at 2:23 AM, Grant Ingersoll
> <[email protected]>wrote:
> 
>> A couple of things strike me about LDA, and I wanted to hear others'
>> thoughts:
>>
>> 1. The LDA implementation (and this seems to be reinforced by my
>> reading on topic models in general) is that the topics themselves
>> don't have "names".  I can see why this is difficult (in some ways,
>> you're summarizing a summary), but I am curious whether anyone has
>> done any work on such a thing, as without names it still requires a
>> fair amount of effort by the human to infer what the topics are.  I
>> suppose you could just pick the top few terms, but it seems like a
>> common phrase or something would go further.  Also, I believe someone
>> in the past mentioned some more recent work by Blei and Lafferty
>> (Blei and Lafferty. Visualizing Topics with Multi-Word Expressions.
>> stat (2009) vol. 1050 pp. 6) to alleviate that.
> 
> It's a big problem. David Blei's students Jonathan Chang and Jordan
> Boyd-Graber have another paper out called "Reading Tea Leaves: How Humans
> Interpret Topic Models" at NIPS this year, which I haven't had a chance
> to read yet but which might shed some light. Usually the top-k words
> serve as a pretty good summary of a topic, particularly if you've
> stop-worded out useless words.
> 
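
(A quick aside on those top-k summaries: a minimal sketch, assuming you
already have p(word|topic) for a single topic as a word-to-probability
map. It drops stop words and keeps the k most probable words.)

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TopicTopWords {
  public static List<String> topK(Map<String, Double> pWordGivenTopic,
                                  Set<String> stopWords, int k) {
    List<Map.Entry<String, Double>> entries =
        new ArrayList<Map.Entry<String, Double>>(pWordGivenTopic.entrySet());
    // Sort by probability, descending.
    Collections.sort(entries, new Comparator<Map.Entry<String, Double>>() {
      public int compare(Map.Entry<String, Double> a,
                         Map.Entry<String, Double> b) {
        return Double.compare(b.getValue(), a.getValue());
      }
    });
    List<String> top = new ArrayList<String>();
    for (Map.Entry<String, Double> e : entries) {
      if (stopWords.contains(e.getKey())) {
        continue; // stop-word out useless words
      }
      top.add(e.getKey());
      if (top.size() == k) {
        break;
      }
    }
    return top;
  }
}
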
>>
>> 2. We get the words in the topic, but how do we know which documents have
>> those topics?  I think, based on reading the paper, that the answer is
>> "You
>> don't get to know", but I'm not sure.
>>
> If I am correct, you do get to know: based on the words in the document,
> you can tell which of those un-labelled topics are in the document, with
> an affinity score for each. You can sort them or do some form of
> significance testing to filter out the insignificant ones.
> 

So, the output of what we have implemented at the moment doesn't give you
p(topic|document), but this is actually really easy, and could be done in
about 20 minutes to an hour. LDAInference (called in the Mapper, which is
basically the E-step) does all of the necessary work to learn
p(topic|document), but it then just outputs sufficient statistics for
p(word|topic). If instead we had a different Mapper that output
<DOC-ID,p(topic|document) \forall topic>, you'd have that.

That much is probably about 20 lines of logical code, along with the usual
mess of Hadoop boilerplate. If you want it, I'll code it up.
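
Roughly, the interesting part would look something like this (an untested
sketch, not what is in Mahout today; inferTopicDistribution is a
placeholder for the per-document E-step that LDAInference already performs
inside the existing Mapper):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DocTopicMapper extends Mapper<Text, Text, Text, Text> {

  @Override
  protected void map(Text docId, Text docWords, Context context)
      throws IOException, InterruptedException {
    // Run inference for this document; the existing Mapper already does
    // this work, it just never writes the result out.
    double[] pTopicGivenDoc = inferTopicDistribution(docWords.toString());

    // Emit <DOC-ID, p(topic|document) \forall topic> as "topic:prob" pairs.
    StringBuilder out = new StringBuilder();
    for (int topic = 0; topic < pTopicGivenDoc.length; topic++) {
      if (topic > 0) {
        out.append(' ');
      }
      out.append(topic).append(':').append(pTopicGivenDoc[topic]);
    }
    context.write(docId, new Text(out.toString()));
  }

  // Placeholder: wire up LDAInference here.
  private double[] inferTopicDistribution(String doc) {
    throw new UnsupportedOperationException("hook up LDAInference");
  }
}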

-- David
