[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

EntilZha Thu, 26 Feb 2015 22:01:04 -0800

GitHub user EntilZha opened a pull request:

    https://github.com/apache/spark/pull/4807


    [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor LDA for multiple LDA 
algorithms (EM+Gibbs)

    JIRA: https://issues.apache.org/jira/browse/SPARK-5556
    
    As discussed in that issue, it would be great to have multiple LDA 
algorithm options, principally EM (implemented already in #4047) and Gibbs.
    
    ## Goals of PR:
    1. Refactor LDA to allow multiple algorithm options (done)
    2. Refactor Gibbs code here to this interface (mostly done): 
https://github.com/EntilZha/spark/tree/LDA-Refactor/mllib/src/main/scala/org/apache/spark/mllib/topicmodeling
    3. Run the same performance tests that were run for the EM PR, for comparison (todo; initial smaller tests have been run)
    
    At the moment, I am looking for feedback on the refactoring while working 
on putting the Gibbs code in.
    
    ## Summary of Changes:
    The traits below encapsulate everything about an implementation while interfacing with the entry point ```LDA.run``` and ```DistributedLDAModel```.
    ```scala
    private[clustering] trait LearningState {
      def next(): LearningState
      def topicsMatrix: Matrix
      def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
      def logLikelihood: Double
      def logPrior: Double
      def topicDistributions: RDD[(Long, Vector)]
      def globalTopicTotals: LDA.TopicCounts
      def k: Int
      def vocabSize: Int
      def docConcentration: Double
      def topicConcentration: Double
      def deleteAllCheckpoints(): Unit
    }

    trait LearningStateInitializer {
      def initialState(
          docs: RDD[(Long, Vector)],
          k: Int,
          docConcentration: Double,
          topicConcentration: Double,
          randomSeed: Long,
          checkpointInterval: Int): LearningState
    }
    ```
    
    An entire LDA implementation can be captured by an object and a class that extend these traits. Specifically, the ```LearningStateInitializer``` provides the method that returns the initial ```LearningState```, which maintains the algorithm's state.
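    
    A minimal, Spark-free sketch of that object-plus-class pattern follows. It is an illustration only, not code from this PR: the traits are cut down to a few members, documents are a plain `Seq` instead of an `RDD`, and `LdaSketch`, `Doc`, `iteration`, `ToyLDA`, `ToyState`, and `run` are hypothetical names.
    ```scala
    object LdaSketch {
      // Hypothetical stand-in for RDD[(Long, Vector)]: (document id, term counts).
      type Doc = (Long, Array[Double])

      // Reduced version of the trait: only enough members to show the shape.
      trait LearningState {
        def next(): LearningState // advance the algorithm by one iteration
        def k: Int
        def iteration: Int        // hypothetical member, used only for this demo
      }

      trait LearningStateInitializer {
        def initialState(docs: Seq[Doc], k: Int, randomSeed: Long): LearningState
      }

      // The "object" half: builds the first state for one algorithm.
      object ToyLDA extends LearningStateInitializer {
        // The "class" half: immutable state, next() returns a fresh instance.
        private final class ToyState(val k: Int, val iteration: Int) extends LearningState {
          def next(): LearningState = new ToyState(k, iteration + 1)
        }

        def initialState(docs: Seq[Doc], k: Int, randomSeed: Long): LearningState =
          new ToyState(k, 0)
      }

      // Hypothetical driver: it only sees the traits, never the implementation.
      def run(init: LearningStateInitializer, docs: Seq[Doc], k: Int, maxIterations: Int): LearningState = {
        var state = init.initialState(docs, k, randomSeed = 0L)
        for (_ <- 0 until maxIterations) state = state.next()
        state
      }
    }
    ```
    Because the driver depends only on the two traits, swapping algorithms means passing a different initializer; nothing else in the driver changes.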
    
    Lastly, the algorithm can be set via an enum, which is pattern matched to create the corresponding implementation. My thought is that the default algorithm should be whichever performs better.
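    
    As a rough sketch of how such a selection could look, assuming a standard Scala `Enumeration`: the names `AlgorithmSelection`, `Algorithm`, `EMInitializer`, `GibbsInitializer`, `initializerFor`, and the `name` member are hypothetical and are not taken from this PR.
    ```scala
    object AlgorithmSelection {
      // Hypothetical enum of the available algorithms.
      object Algorithm extends Enumeration {
        val EM, Gibbs = Value
      }

      // Reduced stand-in for the initializer trait; `name` exists only for the demo.
      trait LearningStateInitializer {
        def name: String
      }

      object EMInitializer extends LearningStateInitializer { val name = "EM" }
      object GibbsInitializer extends LearningStateInitializer { val name = "Gibbs" }

      // Pattern match on the enum value to pick the matching implementation.
      def initializerFor(algorithm: Algorithm.Value): LearningStateInitializer =
        algorithm match {
          case Algorithm.EM    => EMInitializer
          case Algorithm.Gibbs => GibbsInitializer
        }
    }
    ```
    One caveat with this approach: the compiler does not check matches on `Enumeration` values for exhaustiveness, so a sealed trait with case objects may be worth considering if more algorithms are added later.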
    
    ## Gibbs Implementation
    Old design doc is here:
    Primary Gibbs algorithm from here (mostly notation/math, GraphX based, not 
table based): http://www.cs.ucsb.edu/~mingjia/cs240/doc/273811.pdf
    Implements FastLDA from here: 
http://www.ics.uci.edu/~newman/pubs/fastlda.pdf
    
    ### Specific Points for Feedback
    1. Naming: it's hard, and I'm not sure whether the traits are named appropriately
    2. Similarly, I am reasonably familiar with the Scala type system, but perhaps there are some ninja tricks I don't know that would be helpful
    3. General interface/cleanliness
    4. Should the LearningStates/etc. go within LDA? I think so; thoughts?
    5. Anything else; I'm also learning here.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/EntilZha/spark LDA-pull-request

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4807.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4807
    
----
commit 4bd110385f36927990a980085d727f12fc90fd53
Author: Pedro Rodriguez <[email protected]>
Date:   2015-02-26T07:39:47Z

    updates to LDA and LDAModel to separate implementation from interface for 
LDA algorithms

commit 34d58532b278f95aaab9c250f4a2ed64ea959811
Author: Pedro Rodriguez <[email protected]>
Date:   2015-02-27T05:35:51Z

    refactored tests to make tests succeed

----


