[ https://issues.apache.org/jira/browse/MAHOUT-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159924#comment-13159924 ]

jirapos...@reviews.apache.org commented on MAHOUT-897:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2944/
-----------------------------------------------------------

(Updated 2011-11-30 08:58:25.710229)


Review request for mahout and Ted Dunning.


Changes
-------

Adds appropriate ASL headers to the new files, adds a bunch of javadocs and some 
TODOs to clean up remaining detritus, and moves MemoryUtil to a more common 
location.

Could use some more review if anyone has the urge, but otherwise this code is 
ready to go.  More updates, and more docs, can come in the future.


Summary
-------

See MAHOUT-897


This addresses bug MAHOUT-897.
    https://issues.apache.org/jira/browse/MAHOUT-897


Diffs (updated)
-----

  trunk/core/src/main/java/org/apache/mahout/clustering/lda/LDADriver.java 1208294
  trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0DocInferenceMapper.java PRE-CREATION
  trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0Driver.java PRE-CREATION
  trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0TopicTermVectorNormalizerMapper.java PRE-CREATION
  trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CachingCVB0Mapper.java PRE-CREATION
  trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CachingCVB0PerplexityMapper.java PRE-CREATION
  trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/InMemoryCollapsedVariationalBayes0.java PRE-CREATION
  trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/ModelTrainer.java PRE-CREATION
  trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/TopicModel.java PRE-CREATION
  trunk/core/src/main/java/org/apache/mahout/common/MemoryUtil.java PRE-CREATION
  trunk/core/src/main/java/org/apache/mahout/common/Pair.java 1208294
  trunk/core/src/main/java/org/apache/mahout/math/DistributedRowMatrixWriter.java PRE-CREATION
  trunk/core/src/main/java/org/apache/mahout/math/MatrixUtils.java PRE-CREATION
  trunk/core/src/main/java/org/apache/mahout/math/stats/Sampler.java PRE-CREATION
  trunk/core/src/test/java/org/apache/mahout/clustering/ClusteringTestUtils.java 1208294
  trunk/core/src/test/java/org/apache/mahout/clustering/lda/TestMapReduce.java 1208294
  trunk/core/src/test/java/org/apache/mahout/clustering/lda/cvb/TestCVBModelTrainer.java PRE-CREATION
  trunk/core/src/test/java/org/apache/mahout/math/stats/SamplerTest.java PRE-CREATION
  trunk/src/conf/driver.classes.props 1208294

Diff: https://reviews.apache.org/r/2944/diff


Testing
-------

mvn clean test


Thanks,

Jake


                
> New implementation for LDA: Collapsed Variational Bayes (0th derivative 
> approximation), with map-side model caching
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-897
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-897
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>              Labels: clustering, lda
>             Fix For: 0.6
>
>         Attachments: MAHOUT-897.diff
>
>
> The current LDA implementation in Mahout suffers from a few issues:
>   1) it's based on the original Variational Bayes E/M training methods of 
> Blei et al (http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf), 
> which are a) significantly more complex to implement/maintain, and b) 
> significantly slower than subsequently discovered techniques
>   2) the entire "current working model" is held in memory in each Mapper, 
> which limits the scalability of the implementation: numTerms in the vocabulary 
> * numTopics * 8 bytes per double must be less than the mapper heap size (a 
> quick worked example follows this list).
>   3) the sufficient statistics which need to be emitted by the mappers scale 
> as numTopics * numNonZeroEntries in the corpus.  Even with judicious use of 
> Combiners (currently implemented), this can get prohibitively expensive in 
> terms of network + disk usage.
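> As a quick worked example of the heap constraint in point 2 (using the 
> vocabulary and topic-count figures mentioned below, purely for illustration): 
> a 500k-term vocabulary with 200 topics needs roughly 
>
>     500,000 terms * 200 topics * 8 bytes  ~=  800 MB 
>
> of heap in every single mapper, before anything else is loaded.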
> In particular, point 3 looks like: a 1B nonzero entry corpus in Mahout would 
> take up about 12GB of RAM in total, but if you wanted 200 topics, you'd be 
> using 2.5TB of disk+network traffic *per E/M iteration*.  Running a moderate 
> 40 iterations, we're talking about 100TB.  Having tried this implementation on 
> a 6B nonzero entry input corpus with 100 topics (500k term vocabulary, so 
> memory wasn't an issue), I've seen this in practice: on our production Hadoop 
> cluster, with many thousands of map slots available, even one iteration was 
> taking more than 3.5 hours to get to 50% completion of the mapper tasks.
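> A rough back-of-the-envelope check of those traffic figures, assuming ~12 
> bytes per emitted (int key, double value) entry (that per-entry size is an 
> assumption, not something measured from the patch): 
>
>     corpus held in RAM:       1e9 entries * 12 bytes               ~=  12 GB 
>     stats per E/M iteration:  1e9 entries * 200 topics * 12 bytes  ~= 2.4 TB 
>     40 iterations:            40 * 2.4 TB                          ~=  96 TB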
> Point 1) was simple to improve: switch from VB to an algorithm labeled CVB0 
> ("Collapsed Variational Bayes, 0th derivative approximation") in Asuncion 
> et al. ( http://www.datalab.uci.edu/papers/uai_2009.pdf ).  I tried many 
> approaches to get the overall distributed side of the algorithm to scale 
> better, originally aiming at removing point 2), but it turned out that point 
> 3) was what kept rearing its ugly head.  The way that YahooLDA ( 
> https://github.com/shravanmn/Yahoo_LDA ) and many others have achieved high 
> scalability is by doing distributed Gibbs sampling, but that requires that 
> you hold onto the model in distributed memory and query it continually via 
> RPC.  This could be done in something like Giraph or Spark, but not in 
> vanilla Hadoop M/R.
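> For reference (this is from that paper, not quoted from the patch), the CVB0 
> update for the topic assignment of a token of word w in document j is roughly 
>
>     gamma_jwk  ∝  (N_wk + eta) * (N_jk + alpha) / (N_k + W * eta) 
>
> where N_wk, N_jk, and N_k are the word-topic, document-topic, and topic counts 
> (each excluding the current token), W is the vocabulary size, and alpha and 
> eta are the Dirichlet smoothing parameters.  The "0th derivative" part means 
> only these expected counts are kept, which avoids the digamma evaluations of 
> standard VB and keeps each update down to a handful of multiplies.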
> The end result was to actually make point 2) even *worse*: instead of 
> relying on Hadoop combiners to aggregate sufficient statistics for the model, 
> you do a full map-side cache of (this mapper's slice of) the next 
> iteration's model, emit nothing in each map() call, emit the entire 
> model at cleanup(), and then the reducer simply sums the sub-models.  This 
> effectively becomes a form of ensemble learning: each mapper learns its own 
> sequential model and emits it, the reducers (one for each topic) sum these 
> models into one, and that combined model is fed out to all the mappers in the 
> next iteration.
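> A minimal sketch of that mapper pattern (this is not the CachingCVB0Mapper 
> from the diff; the class name, config keys, and DenseMatrix-backed model are 
> illustrative assumptions): 
>
>   import java.io.IOException;
>   import org.apache.hadoop.io.IntWritable;
>   import org.apache.hadoop.mapreduce.Mapper;
>   import org.apache.mahout.math.DenseMatrix;
>   import org.apache.mahout.math.Matrix;
>   import org.apache.mahout.math.VectorWritable;
>
>   public class ModelCachingMapperSketch
>       extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {
>
>     // this mapper's slice of the next iteration's model: numTopics x numTerms
>     private Matrix topicTermCounts;
>
>     @Override
>     protected void setup(Context ctx) {
>       int numTopics = ctx.getConfiguration().getInt("num.topics", 20);
>       int numTerms = ctx.getConfiguration().getInt("num.terms", 10000);
>       topicTermCounts = new DenseMatrix(numTopics, numTerms);
>       // the previous iteration's model would also be loaded here
>     }
>
>     @Override
>     protected void map(IntWritable docId, VectorWritable doc, Context ctx) {
>       // run CVB0 inference on this document and fold its expected
>       // topic-term counts into topicTermCounts; nothing is emitted here
>     }
>
>     @Override
>     protected void cleanup(Context ctx) throws IOException, InterruptedException {
>       // emit the entire locally trained sub-model, one row per topic;
>       // one reducer per topic then sums the sub-models from all mappers
>       for (int k = 0; k < topicTermCounts.numRows(); k++) {
>         ctx.write(new IntWritable(k), new VectorWritable(topicTermCounts.viewRow(k)));
>       }
>     }
>   }
>
> The important part is the empty-map()/heavy-cleanup() split: the per-record 
> emission cost from point 3 disappears entirely, at the price of holding the 
> sub-model in each mapper's heap.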
> In its current form, this LDA implementation can churn through about two M/R 
> iterations per hour on the same cluster/data set mentioned above (which makes 
> it at least 15x faster on larger data sets).
> It probably requires a fair amount of documentation / cleanup, but it comes 
> with a nice end-to-end unit test (same as the one added to MAHOUT-399), and 
> also comes with an "in-memory" version of the same algorithm, for smaller 
> datasets (i.e. those which can fit in memory).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
