Here's a follow-up proposal (submitted to the GSoC site; I will add it to the wiki, but I'm having trouble accessing it right now).
Thanks!
-- David

Title/Summary: Distributed Latent Dirichlet Allocation
Student: David Hall
Student e-mail: d...@cs.stanford.edu
Student Major: Symbolic Systems / Computer Science
Student Degree: MS/PhD
Student Graduation: Stanford '09 / Berkeley '14
Organization: Hadoop
Assigned Mentor: Grant Ingersoll

Abstract:

Latent Dirichlet Allocation (Blei et al., 2003) is a powerful learning algorithm for automatically and jointly clustering words into "topics" and documents into mixtures of topics, and it has been successfully applied to model change in scientific fields over time (Griffiths and Steyvers, 2004; Hall et al., 2008). In this project, I propose to implement a distributed variant of Latent Dirichlet Allocation using MapReduce and, time permitting, to investigate extensions of LDA and possibly more efficient algorithms for distributed inference.

Detailed Description:

A topic model is, roughly, a hierarchical Bayesian model that associates with each document a probability distribution over "topics", which are in turn distributions over words. For instance, a topic in a newswire collection might include words about "sports", such as "baseball", "home run", and "player", and a document about steroid use in baseball might mix the topics "sports", "drugs", and "politics". Note that the labels "sports", "drugs", and "politics" are post-hoc labels assigned by a human; the algorithm itself only associates words with topics probabilistically. The task of parameter estimation in these models is to learn both what the topics are and which documents employ them in what proportions.

One of the promises of unsupervised learning algorithms like Latent Dirichlet Allocation (LDA; Blei et al., 2003) is the ability to take a massive collection of documents and condense it down into a collection of easily understandable topics. However, the available open-source implementations of LDA and related topic models are not distributed, which hampers their utility. This project seeks to correct that shortcoming.

In the literature, there have been several proposals for parallelizing LDA. Newman et al. (2007) proposed an "approximate" LDA in which each processor runs Gibbs sampling over its own subset of the documents. However, Gibbs sampling is slow and stochastic by its very nature, which is not advantageous for repeated runs. Instead, I propose to follow Nallapati et al. (2007) and use a variational approximation that is fast and deterministic.

From there, I would like to extend LDA to either Supervised Topic Models (Blei and McAuliffe, 2007) or Topics over Time (Wang and McCallum, 2006). The former would enable LDA to be used for basic classification tasks, and the latter would model topic dynamics explicitly. For instance, topics are not static: certain words are more important at certain times, and the prominence of a topic like sports rises and falls around playoff schedules and the like. A basic implementation of either of these would be a reasonably straightforward extension of LDA, and either would also demonstrate the flexibility of my implementation.

Finally, time permitting, I would like to examine the more efficient junction-tree-based algorithms proposed by Wolfe et al. (2008). They demonstrate substantial speedups over the naive approach, but their framework does not fit as easily into the standard MapReduce architecture as implemented by Hadoop.
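To make the MapReduce decomposition concrete, here is a minimal sketch of one variational EM iteration for plain LDA, written against Hadoop's mapreduce API. It rests on a few simplifying assumptions that are mine rather than part of the plan above: documents arrive one per line as "wordId:count" pairs with wordId below lda.vocabSize, the previous iteration's topic-word parameters are stubbed with a seeded near-uniform initialization rather than shipped to the mappers (e.g. via the DistributedCache), and all class and configuration names (DistributedLdaSketch, EStepMapper, MStepReducer, lda.*) are hypothetical. Each mapper runs the standard per-document variational updates from Blei et al. (2003) and emits expected topic-word counts; the reducer sums them, which is the unnormalized M-step.

// Sketch of one variational EM iteration for LDA as a Hadoop job.
// Assumptions: one document per input line as "wordId:count" pairs; the
// current topic-word parameters are stubbed here instead of loaded.
import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DistributedLdaSketch {

  // exp(digamma(x)): asymptotic series after shifting x above 6 via digamma(x) = digamma(x + 1) - 1/x.
  static double expDigamma(double x) {
    double shift = 0.0;
    while (x < 6.0) { shift += 1.0 / x; x += 1.0; }
    return Math.exp(Math.log(x) - 1.0 / (2 * x) - 1.0 / (12 * x * x) - shift);
  }

  // E-step: each call handles one document and emits expected topic-word counts keyed by "topic,word".
  public static class EStepMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private int numTopics;
    private double alpha;
    private double[][] beta;  // beta[k][w]: current topic-word probabilities

    @Override
    protected void setup(Context context) {
      numTopics = context.getConfiguration().getInt("lda.numTopics", 20);
      int vocabSize = context.getConfiguration().getInt("lda.vocabSize", 10000);
      alpha = Double.parseDouble(context.getConfiguration().get("lda.alpha", "0.1"));
      // Placeholder: a real job would load beta from the previous M-step output.
      Random rand = new Random(42);
      beta = new double[numTopics][vocabSize];
      for (int k = 0; k < numTopics; k++) {
        double norm = 0.0;
        for (int w = 0; w < vocabSize; w++) { beta[k][w] = 1.0 + 0.01 * rand.nextDouble(); norm += beta[k][w]; }
        for (int w = 0; w < vocabSize; w++) { beta[k][w] /= norm; }
      }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String[] tokens = value.toString().trim().split("\\s+");
      if (tokens.length == 0 || tokens[0].isEmpty()) return;
      int n = tokens.length;
      int[] words = new int[n];
      double[] counts = new double[n];
      double total = 0.0;
      for (int i = 0; i < n; i++) {
        String[] parts = tokens[i].split(":");
        words[i] = Integer.parseInt(parts[0]);
        counts[i] = Double.parseDouble(parts[1]);
        total += counts[i];
      }
      // Per-document variational parameters: gamma over topics, phi over (word, topic).
      double[] gamma = new double[numTopics];
      double[][] phi = new double[n][numTopics];
      for (int k = 0; k < numTopics; k++) { gamma[k] = alpha + total / numTopics; }
      // Fixed-point updates: phi[i][k] ~ beta[k][w_i] * exp(digamma(gamma[k])); gamma[k] = alpha + sum_i c_i * phi[i][k].
      for (int iter = 0; iter < 20; iter++) {
        double[] newGamma = new double[numTopics];
        java.util.Arrays.fill(newGamma, alpha);
        for (int i = 0; i < n; i++) {
          double norm = 0.0;
          for (int k = 0; k < numTopics; k++) {
            phi[i][k] = beta[k][words[i]] * expDigamma(gamma[k]);
            norm += phi[i][k];
          }
          for (int k = 0; k < numTopics; k++) {
            phi[i][k] /= norm;
            newGamma[k] += counts[i] * phi[i][k];
          }
        }
        gamma = newGamma;
      }
      // Emit expected counts; summed over all documents they become the new (unnormalized) beta.
      for (int i = 0; i < n; i++) {
        for (int k = 0; k < numTopics; k++) {
          context.write(new Text(k + "," + words[i]), new DoubleWritable(counts[i] * phi[i][k]));
        }
      }
    }
  }

  // M-step: sum expected counts per (topic, word); per-topic normalization can happen when the next iteration loads the model.
  public static class MStepReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException {
      double sum = 0.0;
      for (DoubleWritable v : values) { sum += v.get(); }
      context.write(key, new DoubleWritable(sum));
    }
  }
}

Because the reduce is a plain sum, the same class could also be registered as a combiner to cut shuffle traffic, and a small driver would re-run this job once per EM iteration until the variational bound stops improving.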
I anticipate that this work will be more exploratory in nature, with a focus on laying the groundwork for improvements more than on producing a polished implementation.

Biography:

I am a graduating master's student in Symbolic Systems at Stanford University, and I will begin my PhD at UC Berkeley next autumn. I am currently a member of Stanford's Natural Language Processing group under Dan Jurafsky and Chris Manning. (I will be working with Dan Klein in the fall.) My research has involved the application of topic models to modeling and discovering the origins of scientific "paradigms": breakthroughs that change a scientific field or even create a new one. More recently, I've worked on minimally unsupervised part-of-speech tagging.

In terms of my experience with Hadoop and MapReduce, I have interned at Google for two summers, working with MapReduce for the entirety of both internships. My second summer I worked on Google's Machine Translation team, so I am familiar with using MapReduce to implement large-scale NLP and machine learning algorithms. More recently, I've been in charge of setting up and maintaining a small Hadoop cluster in my research group, and I've written a wrapper library for Hadoop in the Scala programming language. The code isn't quite "release" quality yet, but you can see its work-in-progress state at http://bugs.scalanlp.org/repositories/show/smr , and I've written a short blog post about it at http://scala-blogs.org/2008/09/scalable-language-and-scalable.html . I have used SMR to implement some research code for a machine learning approach to author deduplication using variational inference.

Timeline:

Weeks 1-3: Implement the mean-field variational approximation for LDA (Nallapati et al., 2007).
Weeks 4-5: Generalize LDA, either with Supervised LDA (Blei and McAuliffe, 2007) for classification or with Topics over Time (Wang and McCallum, 2006).
Weeks 6-8: Make inference use bandwidth more efficiently (Wolfe et al., 2008).
Weeks 9-10: Clean up and prepare for the end of GSoC.

References:

David M. Blei and J. McAuliffe. Supervised Topic Models. NIPS, 2007.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
T. L. Griffiths and M. Steyvers. Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (Suppl. 1):5228-5235, April 2004.
David L. W. Hall, Daniel Jurafsky, and Christopher D. Manning. Studying the History of Ideas Using Topic Models. EMNLP, 2008.
Ramesh Nallapati, William Cohen, and John Lafferty. Parallelized Variational EM for Latent Dirichlet Allocation: An Experimental Evaluation of Speed and Scalability. ICDM Workshop on High Performance Data Mining, 2007.
D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed Inference for Latent Dirichlet Allocation. NIPS, 2007.
Xuerui Wang and Andrew McCallum. Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends. KDD, 2006.
J. Wolfe, A. Haghighi, and D. Klein. Fully Distributed EM for Very Large Datasets. ICML, 2008.

Additional Information:

On Tue, Mar 24, 2009 at 12:26 AM, David Hall <d...@cs.stanford.edu> wrote:
> Hi,
>
> I'll begin by saying I'm not 100% sure how this works; I am new to the Mahout mailing list, though I have been subscribed to the Hadoop list. I have been following the Mahout project with interest for some time, however.
>
> I'm David Hall, a graduating master's student in the Natural Language Processing group at Stanford University.
> In the fall, I'll be beginning my PhD at UC Berkeley EECS, working on machine learning and natural language processing. (Strictly, I should say "most likely", because I am waiting on a few details before I make a final decision on the school. It will be Berkeley, Stanford, or Carnegie Mellon.) My research has primarily involved topic models like Latent Dirichlet Allocation and unsupervised learning for part-of-speech tagging.
>
> In terms of my experience with Hadoop and MapReduce, I have interned at Google for two summers, working with MapReduce for the entirety of both internships. My second summer I worked on Google's Machine Translation team, so I am familiar with using MapReduce to implement large-scale NLP and machine learning algorithms.
>
> More recently, I've been in charge of setting up and maintaining a small Hadoop cluster in my research group, and I've written a wrapper library for Hadoop in the Scala programming language. The code isn't quite "release" quality yet, but you can see its work-in-progress state at http://bugs.scalanlp.org/repositories/show/smr , and I've written a short blog post about it at http://scala-blogs.org/2008/09/scalable-language-and-scalable.html . I have used SMR to implement some research code for a machine learning approach to author deduplication using variational inference.
>
> This summer, I'd like to help contribute to the Mahout project. I read Tijs Zwinkels' proposal, and I think that what I would like to work on is sufficiently different from what he would like to do. First, I would like to implement Latent Dirichlet Allocation, a popular topic mixture model that learns both document clusters and word clusters. I would then like to extend it to implement a number of general-purpose topic models, including Topics over Time, Pachinko Allocation, and possibly Supervised Topic Models.
>
> I see that there has been some discussion of LDA on this list, and I'd like to weigh in on that discussion, if I may. HD-LDA (Newman et al., 2008), based on Gibbs sampling, seems to perform quite reasonably, though AD-LDA is much easier to implement (and understand!) and performs just as well. Gibbs sampling, however, is not well suited to parallelization in general, because Rao-Blackwellization (collapsing) is needed to make it converge in a reasonable period, but by its nature collapsing removes the independence assumptions that make sampling "embarrassingly parallel". For this and other reasons, I personally would prefer to follow Wolfe et al. (2008) and use variational techniques to implement these models.
>
> In terms of implementation, I would love to implement these in SMR, if you are open to adding a new language, because it offers a 10-fold decrease in the number of lines of code for non-trivial algorithms with almost no slowdown, and it has support for "staging" phases of computation, which is especially useful for the kind of iterative algorithms that are needed for machine learning. I would definitely be willing to stay in Java-land, though.
>
> I recognize that this does not constitute a full proposal, but I wanted to make a "soft" proposal before working out the details. I am also open to expanding my proposal to include anything Bayesian, semi-, partially, or un-supervised, and, of course, fun.
>
> As for my time, I have no particular constraints this summer; I've decided not to pursue a "full" summer internship for personal reasons.
>
> Thanks,
> David Hall
>
> Partial list of references:
>
> David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
>
> D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed Inference for Latent Dirichlet Allocation. NIPS, 2008.
>
> J. Wolfe, A. Haghighi, and D. Klein. Fully Distributed EM for Very Large Datasets. ICML, 2008.
>
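To make the point in the quoted message about iterative algorithms concrete: each variational EM iteration maps naturally onto a single Hadoop job, with a small driver chaining jobs. The sketch below is only illustrative and reflects my assumptions rather than anything in the messages above: it reuses the hypothetical EStepMapper and MStepReducer classes from the earlier sketch, hard-codes a fixed iteration count, and omits both the hand-off of the updated model between iterations and any convergence check.

// Hypothetical driver: one MapReduce job per variational EM iteration.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LdaDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("lda.numTopics", 20);
    String corpus = args[0];    // input documents
    String modelDir = args[1];  // per-iteration model output
    int maxIterations = 50;
    for (int i = 0; i < maxIterations; i++) {
      Job job = Job.getInstance(conf, "lda-em-iteration-" + i);
      job.setJarByClass(LdaDriver.class);
      job.setMapperClass(DistributedLdaSketch.EStepMapper.class);
      job.setReducerClass(DistributedLdaSketch.MStepReducer.class);
      // The reduce is a plain sum, so it doubles as a combiner.
      job.setCombinerClass(DistributedLdaSketch.MStepReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(DoubleWritable.class);
      FileInputFormat.addInputPath(job, new Path(corpus));
      FileOutputFormat.setOutputPath(job, new Path(modelDir + "/iteration-" + i));
      // A full implementation would ship iteration i-1's output to the mappers
      // here and stop early once the variational bound stops improving.
      if (!job.waitForCompletion(true)) break;
    }
  }
}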