GitHub user hhbyyh commented on the pull request:

https://github.com/apache/spark/pull/4419#issuecomment-87619172

Thanks for the informative feedback. I sincerely appreciate it when you tell me what's recommended and what should be changed.

##### 1. First thing is the API.

One great thing about Online LDA is that it can avoid loading the entire corpus, since it only needs to process one mini-batch at a time. So I feel it's necessary to have an API that supports this usage. In the current version, a user can write code like

```
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// The corpus does not need to be ready before this point.
val onlineLDA = new OnlineLDAOptimizer(k, D, vocabSize)
for (i <- 1 to batchNumber) {
  // Build the batch here: convert dynamically or read libsvm directly.
  val batch: RDD[(Long, Vector)] = ???
  onlineLDA.submitMiniBatch(batch)
}
```

I think this will be especially necessary and helpful for larger datasets, since doc2vec at large scale is resource intensive. Having a stream of mini-batches `documents: RDD[(Long, Vector)]` rather than one integrated corpus is a key reason why OnlineLDA can handle larger datasets and be stream-friendly. This is why I left the optimizer public. I'd like to know your opinions.

##### 2. Builder pattern and parameter parity

Sure, it's doable. Originally I named `OnlineLDAOptimizer` just `OnlineLDA`, but then I thought we had talked about an optimizer framework, so I changed it. If we can lock down the API, it will be pretty clear how to proceed with these details.

##### 3. Scaling and correctness testing

Can you please share a recommended dataset?

Thanks a lot.
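To make the mini-batch usage in point 1 concrete, here is a minimal, Spark-free sketch of the pattern. Everything in it is an assumption for illustration: `ToyOnlineOptimizer` and `MiniBatchDemo` are hypothetical names, not classes from this PR, and the update is just a decaying blend of per-term counts rather than real LDA inference. The only point is that `submitMiniBatch` sees one batch at a time, and the batches come from a lazy iterator, so the full corpus is never held in memory.

```scala
// Hypothetical stand-in for the optimizer discussed above. It is NOT the
// OnlineLDAOptimizer from the pull request and uses no Spark types.
final class ToyOnlineOptimizer(vocabSize: Int, tau0: Double = 1.0, kappa: Double = 0.7) {
  private val lambda = Array.fill(vocabSize)(0.0) // running per-term statistics
  private var batchesSeen = 0

  /** Blend one mini-batch into the running state; the batch is not retained. */
  def submitMiniBatch(batch: Seq[(Long, Array[Double])]): this.type = {
    // Decaying step size rho_t = (tau0 + t)^(-kappa); tau0 and kappa are assumed values.
    val rho = math.pow(tau0 + batchesSeen, -kappa)
    // Sum the term counts of every document in this batch.
    val batchSum = Array.fill(vocabSize)(0.0)
    for ((_, counts) <- batch; i <- 0 until vocabSize) batchSum(i) += counts(i)
    // Weighted average of the old state and the new batch statistics.
    for (i <- 0 until vocabSize) lambda(i) = (1 - rho) * lambda(i) + rho * batchSum(i)
    batchesSeen += 1
    this
  }

  def iterations: Int = batchesSeen
  def state: Vector[Double] = lambda.toVector
}

object MiniBatchDemo {
  // Batches are generated lazily, so only one exists in memory at a time.
  // Each batch holds two synthetic documents with a count of 1.0 per term.
  def batches(n: Int, vocabSize: Int): Iterator[Seq[(Long, Array[Double])]] =
    Iterator.tabulate(n) { b =>
      Seq.tabulate(2)(d => ((b * 2 + d).toLong, Array.fill(vocabSize)(1.0)))
    }

  def run(): ToyOnlineOptimizer = {
    val opt = new ToyOnlineOptimizer(vocabSize = 3)
    batches(4, 3).foreach(opt.submitMiniBatch)
    opt
  }
}
```

Calling `MiniBatchDemo.run()` feeds four batches through the optimizer one by one. With an RDD-based API the same loop shape would apply, with each `Seq` replaced by an `RDD[(Long, Vector)]` built on demand.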