Github user hhbyyh commented on the pull request:

    https://github.com/apache/spark/pull/4419#issuecomment-76741787
  
    @jkbradley I was on vacation for the last two weeks. Really appreciate the
detailed comments; I know how time-consuming writing them can be.
    
    * About batch split: in the new commit I use `docId % batchNumber` to split
the documents into `batchNumber` batches. Will that work? I'm not sure I
understand how stochastic gradient descent helps in this case. (A minimal
sketch of the split follows below.)
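
    For concreteness, a sketch of the modulo split I have in mind, assuming
the corpus is an `RDD[(Long, Vector)]` of `(docId, termCounts)` pairs; the
names `splitIntoBatches`, `docs`, and `batchNumber` are illustrative, not
necessarily the exact ones in the commit:

    ```scala
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.linalg.Vector

    // Illustrative sketch: partition the corpus into `batchNumber`
    // mini-batches by taking docId modulo batchNumber.
    def splitIntoBatches(docs: RDD[(Long, Vector)],
                         batchNumber: Int): IndexedSeq[RDD[(Long, Vector)]] =
      (0 until batchNumber).map { b =>
        docs.filter { case (docId, _) => docId % batchNumber == b }
      }
    ```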
    
    * Local vs. distributed models: indeed, the capacity of the current
implementation is limited by the local matrix (lambda needs vocabSize * k <
2^31 - 1 entries). Since online LDA doesn't need to hold the entire corpus,
the number of documents is not a concern. In each `seqOp` of the aggregate,
the matrix involved in the computation is bounded by k * ids, where ids is
the number of distinct terms in the document, so the real problem is the
limitation on lambda. My initial idea is to support a local matrix for now
and add support for a distributed matrix in the future. I'll explore the
upper limit of the current local matrix; the target scale is 100000 (vocab)
* 1000 (topics), with no limit on the number of documents. A quick check of
that bound is sketched below.
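
    As a quick sanity check on that bound, with illustrative values (the
variable names are mine, not from the patch):

    ```scala
    // Back-of-the-envelope check: a dense lambda holds vocabSize * k entries,
    // and a local matrix is capped at Int.MaxValue (2^31 - 1) entries.
    val vocabSize = 100000L // assumed vocabulary size
    val k = 1000L           // assumed number of topics
    val entries = vocabSize * k // 1.0e8 entries
    require(entries < Int.MaxValue,
      s"lambda exceeds local matrix capacity: $entries")
    // At 8 bytes per Double this is about 800 MB, so the 100000 x 1000
    // target fits within the Int limit, though it is memory-heavy on a
    // single machine.
    ```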
    
    I made some changes according to the last two points. I'm not sure how
to fit the current version into the optimization steps; I thought the code
is specific to LDA and would be hard to reuse in other contexts. Is there
any example I can refer to? Thanks a lot.
    