RE: LDA [was RE: Taste on Mahout]

Daniel Kluesing Sun, 08 Jun 2008 14:22:07 -0700

Just a note that LDA and CRFs are two different algorithms. They both
fall under the general class of graphical models but otherwise solve
different problems.

LDA is trained on unlabeled, unordered data using a bag-of-words
assumption; CRFs are trained on labeled, sequential data with Markov
assumptions. (CRFs are sort of a generalization of HMMs.)
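
To make the difference in input concrete, here is a minimal sketch in
Python. The sentence and the part-of-speech labels are made up for
illustration and are not taken from any of the tools discussed in this
thread.

```python
from collections import Counter

# Hypothetical toy data, invented for this illustration.
doc = ["the", "cat", "sat", "on", "the", "mat"]
shuffled = ["mat", "the", "on", "cat", "the", "sat"]

# LDA-style input: a bag of words. Word order is discarded, so any
# permutation of the document gives the same training example.
bag = Counter(doc)
same_bag = bag == Counter(shuffled)

# CRF-style input: an ordered sequence of (token, label) pairs.
# Reordering the sequence gives a different training example.
labels = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]
tagged = list(zip(doc, labels))
order_matters = tagged != list(reversed(tagged))
```

Here `same_bag` and `order_matters` both come out true: the bag of
words cannot tell the two orderings apart, while the labeled sequence
can.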

But CRFs are neat in their own right. Biology folks would get excited
about a good CRF implementation.

-----Original Message-----
From: Robin Anil [mailto:[EMAIL PROTECTED] 
Sent: Saturday, June 07, 2008 4:36 AM
To: [email protected]
Subject: Re: LDA [was RE: Taste on Mahout]

Hi,
     There are some LDA/CRF implementations available online. They
might prove useful when writing the code.

* GibbsLDA++ <http://gibbslda.sourceforge.net/>: A C/C++
  implementation of Latent Dirichlet Allocation (LDA) using Gibbs
  sampling for parameter estimation and inference. GibbsLDA++ is fast
  and is designed to analyze hidden/latent topic structures of
  large-scale (text) data collections.

* CRFTagger <http://crftagger.sourceforge.net/>: A Java-based
  Conditional Random Fields Part-of-Speech (POS) Tagger for English. The
  model was trained on sections 01..24 of the WSJ corpus, using section
  00 as the development test set (accuracy of 97.00%). Tagging speed:
  500 sentences / second.

* CRFChunker <http://crfchunker.sourceforge.net/>: A Java-based
  Conditional Random Fields Phrase Chunker (Phrase Chunking Tool) for
  English. The model was trained on sections 01..24 of the WSJ corpus,
  using section 00 as the development test set (F1-score of 95.77).
  Chunking speed: 700 sentences / second.

* JTextPro <http://jtextpro.sourceforge.net/>: A Java-based text
  processing tool that includes sentence boundary detection (using a
  maximum entropy classifier), word tokenization (following the Penn
  convention), part-of-speech tagging (using CRFTagger), and phrase
  chunking (using CRFChunker).

* JWebPro <http://jwebpro.sourceforge.net/>: A Java-based tool that can
  interact with Google search via Google Web APIs and then process the
  returned Web documents in a couple of ways. The outputs of JWebPro can
  serve as inputs for natural language processing, information
  retrieval, information extraction, Web data mining, online social
  network extraction/analysis, and ontology development applications.

* JVnSegmenter <http://jvnsegmenter.sourceforge.net/>: A Java-based,
  open-source Vietnamese word segmentation tool. The segmentation model
  in this tool was trained on about 8,000 labeled sentences using
  FlexCRFs. It would be useful for the Vietnamese NLP community.

* FlexCRFs <http://flexcrfs.sourceforge.net/>: Flexible Conditional
  Random Fields (including PCRFs, a parallel version of FlexCRFs).

* CRF++ <http://crfpp.sourceforge.net/>: Yet Another CRF toolkit.

Robin

On Thu, Jun 5, 2008 at 9:59 PM, Ted Dunning <[EMAIL PROTECTED]>
wrote:

> The Buntine and Jakulin paper is also useful reading.  I would avoid
> fancy stuff like Rao-Blackwellization to start.
>
> http://citeseer.ist.psu.edu/750239.html
>
> The Gibbs sampling approach is, at its heart, very simple in that
> most of the math devolves into sampling discrete hidden variables from
> simple distributions and then counting the results as if they were
> observed.
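
The "sample, then count" loop described above can be sketched as a
collapsed Gibbs sampler, one common form of Gibbs sampling for LDA. This
is a minimal illustration, not the code from GibbsLDA++ or
LdaGibbsSampler.java. The function name, the toy corpus, and the
hyperparameter defaults (alpha, beta, sweeps) are assumptions made for
the example.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              sweeps=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n[d][k]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n[k][w]
    topic_total = [0] * num_topics                              # n[k]

    # Start from a random topic for every token and count the assignments.
    assignments = []
    for d, doc in enumerate(docs):
        z = [rng.randrange(num_topics) for _ in doc]
        assignments.append(z)
        for w, k in zip(doc, z):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take this token's current topic out of the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Unnormalized probability of each topic for this token,
                # given every other token's current assignment.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]

                # Sample the hidden topic, then count it as if observed.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return assignments, doc_topic, topic_word


# Hypothetical toy corpus: word ids 0-1 and 2-3 tend to co-occur.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
assignments, doc_topic, topic_word = gibbs_lda(docs, num_topics=2,
                                               vocab_size=4)
```

The only state is three count tables and one topic id per token, which
is why the method is simple to implement.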
>
> On Thu, Jun 5, 2008 at 5:49 AM, Goel, Ankur <[EMAIL PROTECTED]>
> wrote:
>
> > It draws reference from a Java implementation,
> > http://www.arbylon.net/projects/LdaGibbsSampler.java
> > which is a single-class version of LDA using Gibbs sampling with
> > slightly better code documentation.
> > I am trying to understand the code while reading the paper you
> > suggested, "Distributed Inference for Latent Dirichlet Allocation".
> >
> > -----Original Message-----
> > From: Daniel Kluesing [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, June 04, 2008 8:31 PM
> > To: [email protected]
> > Subject: RE: LDA [was RE: Taste on Mahout]
> >
> > Ted may have a better one, but in my quick poking around at things 
> > http://gibbslda.sourceforge.net/ looks to be a good implementation 
> > of the Gibbs sampling approach.
> >
> > -----Original Message-----
> > From: Goel, Ankur [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, June 04, 2008 4:58 AM
> > To: [email protected]
> > Subject: RE: LDA [was RE: Taste on Mahout]
> >
> > Ted, do you have a sequential version of an LDA implementation that
> > can be used for reference?
> > If yes, can you please post it on Jira? Should we open a new Jira
> > issue or use MAHOUT-30 for this?
> >
> > -----Original Message-----
> > From: Ted Dunning [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, May 27, 2008 11:50 AM
> > To: [email protected]
> > Subject: Re: LDA [was RE: Taste on Mahout]
> >
> > Chris Bishop's book has a very clear exposition of the relationship 
> > between the variational techniques and EM.  Very good reading.
> >
> > On Mon, May 26, 2008 at 10:13 PM, Goel, Ankur 
> > <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Daniel/Ted,
> > >      Thanks for the interesting pointers to more information on
> > > LDA and EM.
> > > I am going through the docs to visualize and understand how the
> > > LDA approach would work for my specific case.
> > >
> > > Once I have some idea, I can volunteer to work on the Map-Reduce
> > > side of things, as this is something that will benefit both my
> > > project and the community.
> > >
> > > Looking forward to sharing more ideas/information on this :-)
> > >
> > > Regards
> > > -Ankur
> > >
> > > -----Original Message-----
> > > From: Ted Dunning [mailto:[EMAIL PROTECTED]
> > > Sent: Tuesday, May 27, 2008 6:59 AM
> > > To: [email protected]
> > > Subject: Re: LDA [was RE: Taste on Mahout]
> > >
> > > Those are both new to me.  Both look interesting.  My own
> > > experience is that the simplicity of Gibbs sampling makes it
> > > very much more attractive for implementation.  Also, since it is
> > > (nearly) trivially parallelizable, it is more likely we will get a
> > > useful implementation right off the bat.
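
One way to read "(nearly) trivially parallelizable", sketched here as an
assumption rather than as the design proposed in this message: each
worker runs a Gibbs sweep over its own partition of the documents
against a private copy of the topic-word counts, and a reduce step then
folds every worker's changes back into the shared counts. The function
name and the toy numbers are hypothetical. The result is approximate,
because workers do not see each other's updates during a sweep.

```python
def merge_topic_word_counts(global_counts, local_counts_per_worker):
    """Fold each worker's count changes back into the shared counts.

    global_counts is the topic-by-word table every worker started from;
    each entry of local_counts_per_worker is that table after one
    worker's sweep over its own documents.
    """
    merged = [row[:] for row in global_counts]
    for local in local_counts_per_worker:
        for k, row in enumerate(local):
            for w, value in enumerate(row):
                # Add only what this worker changed during its sweep.
                merged[k][w] += value - global_counts[k][w]
    return merged


# Hypothetical example: 2 topics, 2 words, 2 workers.
global_counts = [[2, 1], [0, 3]]
worker_a = [[3, 1], [0, 2]]   # moved one token of word 0/1 between topics
worker_b = [[2, 0], [1, 3]]
merged = merge_topic_word_counts(global_counts, [worker_a, worker_b])
# merged == [[3, 0], [1, 2]]; the total token count (6) is unchanged.
```

In MapReduce terms, the map phase would be the per-partition sweep and
the reduce phase would be this merge, repeated once per iteration.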
> > >
> > > On Mon, May 26, 2008 at 5:49 PM, Daniel Kluesing 
> > > <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > (Hijacking the thread to discuss ways to implement LDA)
> > > >
> > > > Had you seen
> > > > http://books.nips.cc/papers/files/nips20/NIPS2007_0672.pdf
> > > > ?
> > > >
> > > > Their hierarchical distributed LDA formulation uses Gibbs
> > > > sampling and fits into mapreduce.
> > > >
> > > > http://www.cs.berkeley.edu/~jawolfe/pubs/08-icml-em.pdf gives a
> > > > mapreduce formulation for the variational EM method.
> > > >
> > > > I'm still chewing on them, but my first impression is that the
> > > > EM approach would give better performance on bigger data sets.
> > > > Opposing views welcome.
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > ted
> >
>
>
>
> --
> ted
>

--
Robin Anil
4th Year Dual Degree Student
Department of Computer Science & Engineering IIT Kharagpur

