Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4419#issuecomment-87141115
  
    @hhbyyh  Thanks for the updates.
    
    **Builder pattern**: I recommend having OnlineLDA be a class (not an 
object) and using a builder pattern to specify parameters:
    ```
      val lda = new OnlineLDA().setK(50)
      val model = lda.run(myDataset)
    ```
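    
    A rough sketch of what I have in mind is below.  Only ```OnlineLDA```, 
```setK``` and ```run``` come from the snippet above; the default values, the 
getters and the validation are assumptions for illustration, and ```run``` is 
left out since its signature depends on the model API.
    ```scala
      // Hypothetical sketch of a builder-style parameter class.
      // Defaults are placeholders, not recommended values.
      class OnlineLDA {
        private var k: Int = 10
        private var miniBatchFraction: Double = 0.01

        def getK: Int = k

        // Each setter returns this.type so that calls can be chained.
        def setK(k: Int): this.type = {
          require(k > 0, s"k must be positive, got $k")
          this.k = k
          this
        }

        def getMiniBatchFraction: Double = miniBatchFraction

        def setMiniBatchFraction(fraction: Double): this.type = {
          require(fraction > 0.0 && fraction <= 1.0,
            s"miniBatchFraction must be in (0, 1], got $fraction")
          this.miniBatchFraction = fraction
          this
        }
      }
    ```
    Chained usage would then read 
```new OnlineLDA().setK(50).setMiniBatchFraction(0.05)```.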
    
    **Sampling**: It can be useful to make multiple passes over the dataset.  
Instead of taking ```batchNumber```, could this take ```numIterations``` and 
```miniBatchFraction``` (as in GradientDescent)?
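    
    To illustrate the idea, here is a minimal sketch on a local collection.  
```MiniBatchSketch``` and ```miniBatches``` are hypothetical names, not Spark 
API; on an RDD the per-iteration draw would presumably be 
```rdd.sample(withReplacement = false, miniBatchFraction, seed)```.
    ```scala
      import scala.util.Random

      // Hypothetical helper: draws one mini-batch per iteration.
      object MiniBatchSketch {
        def miniBatches[T](
            data: Seq[T],
            numIterations: Int,
            miniBatchFraction: Double,
            seed: Long): Seq[Seq[T]] = {
          require(numIterations > 0, "numIterations must be positive")
          require(miniBatchFraction > 0.0 && miniBatchFraction <= 1.0,
            "miniBatchFraction must be in (0, 1]")
          (1 to numIterations).map { i =>
            // A per-iteration seed keeps each draw reproducible.
            val rng = new Random(seed + i)
            // Keep each element independently with probability miniBatchFraction.
            data.filter(_ => rng.nextDouble() < miniBatchFraction)
          }
        }
      }
    ```
    The expected number of passes over the data is 
```numIterations * miniBatchFraction```, so 200 iterations at a fraction of 0.01 
cover the dataset about twice.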
    
    **Optimizer**: I like that you separated out the optimizer, but I think we 
should keep it private for now.
    
    **Work remaining**: I hope we can get this ready before long, but there are 
some major items remaining.  Some are obvious, but I wanted to list them here:
    * Parameter parity: It would be good to support the same parameters that 
LDA currently supports.
    * API: The main API questions are:
      * Should OnlineLDA be just another optimizer in the current LDA class?  
(I like this, though the ```miniBatchFraction``` parameter would only be used 
for online learning, not batch EM.)
      * If so, then how do we need to adjust the LDAModel API?
    * Testing:
      * Unit tests
      * Scaling tests on a cluster with a large dataset
      * Correctness: It would be great to verify correctness against another 
implementation.  This would require running both implementations without any 
sampling so that the results are deterministic.
    * Coding style: We'll need to do some significant cleanups to fit the Spark 
coding style guidelines.
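    
    On the first API question, one possible shape is sketched below.  All names 
here (```LDAOptimizer```, ```BatchEM```, ```Online```, ```setOptimizer```) are 
hypothetical and only meant to make the discussion concrete.  Putting 
```miniBatchFraction``` on the online optimizer keeps it away from batch EM.
    ```scala
      // Hypothetical sketch; none of these names are existing API.
      sealed trait LDAOptimizer

      case object BatchEM extends LDAOptimizer

      // Online-only parameters live here, so batch EM never sees them.
      final case class Online(miniBatchFraction: Double = 0.05)
          extends LDAOptimizer {
        require(miniBatchFraction > 0.0 && miniBatchFraction <= 1.0,
          s"miniBatchFraction must be in (0, 1], got $miniBatchFraction")
      }

      class LDA {
        private var k: Int = 10
        private var optimizer: LDAOptimizer = BatchEM

        def getK: Int = k

        def setK(k: Int): this.type = {
          require(k > 0, s"k must be positive, got $k")
          this.k = k
          this
        }

        def getOptimizer: LDAOptimizer = optimizer

        def setOptimizer(optimizer: LDAOptimizer): this.type = {
          this.optimizer = optimizer
          this
        }
      }
    ```
    With this shape, existing callers that never set an optimizer would keep 
the batch EM behavior.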
    
    I'll think more about the API issues and update you soon.  Let me know if 
you have opinions about this too.  Thanks!

