Here's a follow-up proposal (submitted to GSoC's site. I will add it to
the wiki, but I'm having trouble accessing it right now)

Thanks!

-- David

Title/Summary: Distributed Latent Dirichlet Allocation

Student: David Hall

Student e-mail: d...@cs.stanford.edu


Student Major: Symbolic Systems / Computer Science

Student Degree: MS/PhD

Student Graduation:  Stanford '09 / Berkeley '14


Organization: Hadoop

Assigned Mentor: Grant Ingersoll


Abstract:

Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning
algorithm for automatically and jointly clustering words into "topics"
and documents into mixtures of topics, and it has been successfully
applied to model change in scientific fields over time (Griffiths and
Steyvers, 2004; Hall et al., 2008). In this project, I propose to
implement a distributed variant of Latent Dirichlet Allocation using
MapReduce, and, time permitting, to investigate extensions of LDA and
possibly more efficient algorithms for distributed inference.

Detailed Description:

A topic model is, roughly, a hierarchical Bayesian model that
associates with each document a probability distribution over
"topics", which are in turn distributions over words. For instance, a
topic in a collection of newswire might include words about "sports",
such as "baseball", "home run", and "player", and a document about
steroid use in baseball might draw on the "sports", "drugs", and
"politics" topics. Note that the labels "sports", "drugs", and
"politics" are post-hoc labels assigned by a human; the algorithm
itself only associates words with probabilities. The task of
parameter estimation in these models is to learn both what these
topics are and which documents employ them in what proportions.
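
Concretely, the (smoothed) generative story from Blei et al. (2003),
written roughly in that paper's notation for K topics and a document
d of length N_d, is:

  \beta_k  \sim \mathrm{Dirichlet}(\eta)               % topic k: a distribution over the vocabulary
  \theta_d \sim \mathrm{Dirichlet}(\alpha)             % document d: a mixture over the K topics
  z_{d,n}  \sim \mathrm{Multinomial}(\theta_d)         % topic assignment for the n-th word of d
  w_{d,n}  \sim \mathrm{Multinomial}(\beta_{z_{d,n}})  % the observed word itself

Parameter estimation amounts to recovering the \beta_k (the topics)
and the \theta_d (each document's topic proportions).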

One of the promises of unsupervised learning algorithms like Latent
Dirichlet Allocation (LDA; Blei et al., 2003) is the ability to take a
massive collection of documents and condense it down into a
collection of easily understandable topics. However, none of the
available open-source implementations of LDA and related topic models
are distributed, which hampers their utility. This project seeks to
correct this shortcoming.

In the literature, there have been several proposals for parallelizing
LDA. Newman et al. (2007) proposed an "approximate" LDA in which each
processor gets its own subset of the documents to run Gibbs sampling
over. However, Gibbs sampling is slow, and it is stochastic by its
very nature, which makes results hard to reproduce across runs.
Instead, I propose to follow Nallapati et al. (2007) and use a
variational approximation that is fast and deterministic.
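
As a rough sketch of how this decomposes over MapReduce (illustrative
Java only; the class and method names below, like LdaEStepSketch,
eStep, and logBeta, are my own placeholders and are not tied to any
particular Hadoop or Mahout API): each mapper runs the per-document
variational E-step and emits expected topic-word counts, and a reducer
sums those counts across documents and renormalizes them into the new
topic-word parameters (the M-step). One Hadoop job corresponds to one
EM iteration.

  public class LdaEStepSketch {

      // Per-document variational E-step for LDA (Blei et al., 2003), roughly as
      // each mapper would run it.  logBeta[k][w] holds the current log p(w | topic k);
      // the returned ss[k][w] are expected topic-word counts for a reducer to sum.
      static double[][] eStep(int[] words, double[][] logBeta, double alpha, int sweeps) {
          int K = logBeta.length, V = logBeta[0].length, N = words.length;
          double[] gamma = new double[K];              // variational Dirichlet over topics
          java.util.Arrays.fill(gamma, alpha + (double) N / K);
          double[][] phi = new double[N][K];           // per-token topic responsibilities
          for (int sweep = 0; sweep < sweeps; sweep++) {
              double[] gammaNew = new double[K];
              java.util.Arrays.fill(gammaNew, alpha);
              for (int n = 0; n < N; n++) {
                  double max = Double.NEGATIVE_INFINITY;
                  for (int k = 0; k < K; k++) {
                      phi[n][k] = logBeta[k][words[n]] + digamma(gamma[k]);
                      max = Math.max(max, phi[n][k]);
                  }
                  double sum = 0.0;
                  for (int k = 0; k < K; k++) {
                      phi[n][k] = Math.exp(phi[n][k] - max);
                      sum += phi[n][k];
                  }
                  for (int k = 0; k < K; k++) {
                      phi[n][k] /= sum;
                      gammaNew[k] += phi[n][k];
                  }
              }
              gamma = gammaNew;
          }
          double[][] ss = new double[K][V];            // expected sufficient statistics
          for (int n = 0; n < N; n++)
              for (int k = 0; k < K; k++)
                  ss[k][words[n]] += phi[n][k];
          return ss;
      }

      // Digamma via the recurrence psi(x) = psi(x+1) - 1/x plus an asymptotic
      // expansion; adequate for the positive arguments that occur here.
      static double digamma(double x) {
          double r = 0.0;
          while (x < 6.0) { r -= 1.0 / x; x += 1.0; }
          double f = 1.0 / (x * x);
          return r + Math.log(x) - 0.5 / x
                   - f * (1.0 / 12.0 - f * (1.0 / 120.0 - f / 252.0));
      }
  }

The reducer's job is then just an elementwise sum of the ss arrays
followed by a row-wise renormalization into the new logBeta; the
driver checks convergence and launches the next iteration's job.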

From there, I would like to extend LDA to either Supervised Topic
Models (Blei & McAuliffe, 2007) or Topics over Time (Wang & McCallum,
2006). The former would enable LDA to be used for basic classification
tasks, and the latter would model topic dynamics explicitly. For
instance, a sports topic is not static: certain words matter more at
certain times, and the prominence of the topic itself rises and falls
around playoff schedules and the like. A basic implementation of
either of these would be a reasonably straightforward extension of
LDA, and either would also demonstrate the flexibility of my
implementation.
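
Schematically, each of these adds a single generated variable to the
model sketched above (this is my shorthand; the cited papers give the
exact parameterizations):

  % Topics over Time (Wang & McCallum, 2006): each token's topic also
  % emits a timestamp from a per-topic Beta distribution.
  t_{d,n} \sim \mathrm{Beta}(\psi_{z_{d,n}})

  % Supervised Topic Models (Blei & McAuliffe, 2007): a per-document
  % response is generated from the empirical topic frequencies
  % \bar{z}_d = \tfrac{1}{N_d} \sum_n z_{d,n} via a generalized linear model.
  y_d \sim \mathrm{GLM}\bigl(\eta^\top \bar{z}_d, \delta\bigr)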

Finally, time permitting, I would like to examine the more efficient
algorithms using junction trees proposed by Wolfe et al. (2008). They
demonstrate substantial speedups over the naive implementation
proposed earlier, but their framework does not fit as easily into the
standard MapReduce architecture as implemented by Hadoop. I
anticipate that this work will be more exploratory in nature, with a
focus on laying the groundwork for improvements rather than on a
polished implementation.

Biography:

I am a graduating master's student in Symbolic Systems at Stanford
University, and I will begin work on my PhD at UC Berkeley next
autumn. I am currently a member of Stanford's Natural Language
Processing group under Dan Jurafsky and Chris Manning. (I will be
working with Dan Klein in the fall.) My research has involved the
application of topic models to modeling and discovering the origins
of scientific "paradigms", breakthroughs that change a scientific
field or even create a new one. More recently, I've worked on
part-of-speech tagging with minimal supervision.


In terms of my experience with Hadoop and MapReduce, I have interned
at Google for two summers, working with MapReduce for the entirety of
both internships. My second summer I worked on Google's Machine
Translation team, and so I am familiar with using MapReduce to
implement large-scale NLP and machine learning algorithms.

More recently, I've been in charge of setting up and maintaining a
small Hadoop cluster in my research group, and I've written a wrapper
library for Hadoop in the Scala Programming Language. The code isn't
quite "release" quality yet, but you can see its work-in-progress
state at http://bugs.scalanlp.org/repositories/show/smr , and I've
written a short blog post about it at
http://scala-blogs.org/2008/09/scalable-language-and-scalable.html . I
have used SMR to implement some research code for a machine learning
approach to author deduplication using variational inference.


Timeline:

Weeks 1-3: Implement the mean-field variational approximation for LDA
(Nallapati et al., 2007)

Weeks 4-5: Generalize LDA, either with Supervised LDA (Blei and
McAuliffe, 2007) for classification or with Topics over Time (Wang
and McCallum, 2006)

Weeks 6-8: Make inference use bandwidth more efficiently (Wolfe et
al., 2008)

Weeks 9-10: Clean up and prepare for the end of GSoC

References:

David M. Blei and J. McAuliffe. Supervised Topic Models. NIPS, 2007.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet
Allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

T. L. Griffiths and M. Steyvers. Finding Scientific Topics.
Proceedings of the National Academy of Sciences, 101(Suppl
1):5228-5235, April 2004.

David L. W. Hall, Daniel Jurafsky, and Christopher D. Manning.
Studying the History of Ideas Using Topic Models. EMNLP, 2008.

Ramesh Nallapati, William Cohen, and John Lafferty. Parallelized
Variational EM for Latent Dirichlet Allocation: An Experimental
Evaluation of Speed and Scalability. ICDM Workshop on High
Performance Data Mining, 2007.

D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed
Inference for Latent Dirichlet Allocation. NIPS, 2007.

Xuerui Wang and Andrew McCallum. Topics over Time: A Non-Markov
Continuous-Time Model of Topical Trends. KDD, 2006.

J. Wolfe, A. Haghighi, and D. Klein. Fully Distributed EM for Very
Large Datasets. ICML, 2008.


Additional Information:


On Tue, Mar 24, 2009 at 12:26 AM, David Hall <d...@cs.stanford.edu> wrote:
> Hi,
>
> I'll begin by saying I'm not 100% sure how this works; I am new to the
> Mahout mailing list, though I have been subscribed to the Hadoop list.
> I have been following the Mahout project with interest for some time,
> however.
>
> I'm David Hall, a graduating master's student in the Natural Language
> Processing group at Stanford University. In the fall, I'll be
> beginning my PhD at UC Berkeley EECS, working on machine learning and
> natural language processing. (Strictly, I should say "most likely"
> because I am waiting on a few details before I make a final decision
> on the school. It will be Berkeley, Stanford, or Carnegie Mellon.) My
> research has primarily involved topic models like Latent Dirichlet
> Allocation and unsupervised learning for part of speech tagging.
>
> In terms of my experience with Hadoop and Map Reduce, I have interned
> at Google for two summers, working with MapReduce for the entirety of
> both internships. My second summer I worked on Google's Machine
> Translation team, and so I am familiar with using Map Reduce to
> implement large-scale NLP and Machine Learning algorithms.
>
> More recently, I've been in charge of setting up and maintaining a
> small Hadoop cluster in my research group, and I've written a wrapper
> library for Hadoop in the Scala Programming Language. The code isn't
> quite "release" quality yet, but you can see its work-in-progress
> state at http://bugs.scalanlp.org/repositories/show/smr , and I've
> written a short blog post about it at
> http://scala-blogs.org/2008/09/scalable-language-and-scalable.html . I
> have used SMR to implement some research code for a machine learning
> approach to author deduplication using variational inference.
>
> This summer, I'd like to help contribute to the Mahout project. I read
> Tijs Zwinkels' proposal, and I think that what I would like to work on
> is sufficiently different from what he would like to do. First, I
> would like to implement Latent Dirichlet Allocation, a popular topic
> mixture model that learns both document clusters and word clusters. I
> would then like to extend it to implement a number of general purpose
> topic models, including Topics over Time, Pachinko Allocation, and
> possibly Supervised Topic Models.
>
> I see that there has been some discussion of LDA on this list, and I'd
> like to weigh in on that discussion, if I may. HD-LDA (Newman et al.,
> 2008), based on Gibbs Sampling, seems to perform quite reasonably,
> though AD-LDA is much easier to implement (and understand!) and
> performs just as well. Gibbs Sampling, however, is not well-suited to
> parallelization in general because Rao-Blackwellization (collapsing)
> is needed to make it converge in a reasonable period, but by its
> nature collapsing removes the independence assumptions that make
> sampling "embarrassingly parallel." For this and other reasons, I
> personally would prefer to follow Wolfe, et al. (2008) and use
> variational techniques to implement these models.
>
> In terms of implementation, I would love to implement these in SMR, if
> you are open to adding a new language, because it offers a 10-fold
> decrease in the number of lines of code for non-trivial algorithms
> with almost no slowdown, and it has support for "staging" phases of
> computation, which is especially useful for the kind of iterative
> algorithms that are needed for machine learning. I would definitely be
> willing to stay in Java-land, though.
>
>  I recognize that this does not constitute a full proposal, but I
> wanted to do a "soft" proposal before working out the details. I am
> also open to expanding my proposal to include anything Bayesian,
> semi-, partially, or un- supervised, and, of course, fun.
>
> As for my availability, I have no constraints on my time this
> summer. I've decided not to pursue a "full" summer internship for
> personal reasons.
>
> Thanks,
> David Hall
>
>
> Partial list of references:
>
> David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet
> Allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
>
> D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed
> Inference for Latent Dirichlet Allocation. NIPS, 2008.
>
> J. Wolfe, A. Haghighi, and D. Klein. Fully Distributed EM for Very
> Large Datasets. ICML, 2008.
>
