Github user hhbyyh commented on the pull request:

    https://github.com/apache/spark/pull/4419#issuecomment-87619172
  
    Thanks for the informative feedback. I sincerely appreciate it when you 
tell me what's recommended and what should be changed. 
    
    ##### 1. First thing is the API. 
    
    One great thing about Online LDA is that it can avoid loading the entire 
corpus, since it only needs to process one mini-batch at a time. Thus I kinda 
feel it's necessary to have an API that supports this usage.
    In the current version, a user can write code like
    ```
    // The corpus does not need to be ready before this point.
    val onlineLDA = new OnlineLDAOptimizer(k, D, vocabSize)
    for (i <- 1 to batchNumber) {
      // Placeholder: convert dynamically or read libsvm directly.
      val batch: RDD[(Long, Vector)] = ???
      onlineLDA.submitMiniBatch(batch)
    }
    ```
    I think this will be especially necessary and helpful for larger datasets, 
since doc2vec at large scale is resource intensive. Having a stream of 
mini-batches, each a `documents: RDD[(Long, Vector)]`, rather than one 
integrated corpus is a key reason why OnlineLDA can handle larger datasets and 
be stream friendly.
    This is why I left the optimizer public. I'd like to know your opinions.
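
    To make the usage pattern concrete, here is a minimal, Spark-free sketch of 
the idea. Everything in it is an assumption for illustration: 
`ToyOnlineOptimizer`, `MiniBatchDemo`, and `loadBatch` are hypothetical names, 
not the classes in this PR, and the "update" only accumulates term counts 
rather than performing the actual Online LDA variational step. It only shows 
that the caller keeps one mini-batch in memory at a time.

    ```scala
    // Hypothetical stand-in for the optimizer; NOT the class in this PR.
    // A document is (id, termCounts), with termCounts indexed by vocabulary id.
    final class ToyOnlineOptimizer(vocabSize: Int) {
      private val termTotals = Array.fill(vocabSize)(0.0)
      private var batchesSeen = 0
      private var docsSeen = 0L

      // Consume one mini-batch and fold it into the running state.
      // The caller can discard the batch as soon as this returns.
      def submitMiniBatch(batch: Seq[(Long, Array[Double])]): this.type = {
        batch.foreach { case (_, counts) =>
          require(counts.length == vocabSize, "term-count vector must have vocabSize entries")
          var i = 0
          while (i < vocabSize) { termTotals(i) += counts(i); i += 1 }
        }
        batchesSeen += 1
        docsSeen += batch.size
        this
      }

      def batches: Int = batchesSeen
      def documents: Long = docsSeen
      def totals: Vector[Double] = termTotals.toVector
    }

    object MiniBatchDemo {
      // Hypothetical loader: stands in for "convert dynamically or read libsvm".
      // Here it just fabricates two one-term documents per batch.
      def loadBatch(i: Int, vocabSize: Int): Seq[(Long, Array[Double])] =
        Seq.tabulate(2) { j =>
          val counts = Array.fill(vocabSize)(0.0)
          counts((i + j) % vocabSize) = 1.0
          ((i * 2 + j).toLong, counts)
        }

      // Stream batchNumber mini-batches through the optimizer, one at a time.
      def run(batchNumber: Int, vocabSize: Int): ToyOnlineOptimizer = {
        val optimizer = new ToyOnlineOptimizer(vocabSize)
        for (i <- 1 to batchNumber) {
          optimizer.submitMiniBatch(loadBatch(i, vocabSize))
        }
        optimizer
      }
    }
    ```

    For example, `MiniBatchDemo.run(3, 4)` submits three batches of two 
documents each, and no batch needs to exist before the loop reaches it.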
    
    ##### 2. Builder Pattern and Parameter parity
    
    Sure, it's doable. Originally I named `OnlineLDAOptimizer` just 
`OnlineLDA`, and then I thought we had talked about an optimizer framework, so 
I changed it. If we can lock down the API, it will be pretty clear how to 
proceed with these details.
    
    ##### 3. About scaling and correctness testing: can you please share a 
recommended dataset?
    
    Thanks a lot.

