[GitHub] spark pull request: [SPARK-5563][mllib] online lda initial checkin

jkbradley Mon, 02 Mar 2015 12:58:15 -0800

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4419#issuecomment-76820786
  
    Thanks for the updates!  Responding:
    
    > About batch split. I used docId % batchNumber to split documents into 
batchNumber batches in the new commit. Will that work? I'm not sure I 
understand how stochastic gradient descent helps in this case.
    
    That should help distribute the work; it will be good to see numbers about 
whether subsampling speeds things up enough.  (I mentioned SGD because you 
could take a random sample on each iteration, rather than a deterministic 
sample.  You wouldn't be able to use the other SGD code in MLlib, but a random 
sample would effectively be doing mini-batch SGD.  That might be a bit better 
since stochasticity is usually helpful in these non-convex problems.)
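
To make the contrast concrete, here is a minimal sketch of the two batching strategies discussed above. It is plain Python rather than Spark code and is not taken from the PR; the function names, the `fraction` parameter, and the seeding scheme are assumptions made up for illustration.

```python
import random


def deterministic_batch(doc_ids, num_batches, iteration):
    """Deterministic split: keep docs where docId % batchNumber matches this iteration."""
    target = iteration % num_batches
    return [d for d in doc_ids if d % num_batches == target]


def random_batch(doc_ids, fraction, iteration, seed=0):
    """Random split: draw a fresh Bernoulli sample of the corpus on every iteration."""
    # Derive a per-iteration seed so each iteration sees a different,
    # but reproducible, subset of documents (hypothetical seeding scheme).
    rng = random.Random(seed * 1_000_003 + iteration)
    return [d for d in doc_ids if rng.random() < fraction]


if __name__ == "__main__":
    docs = list(range(10))
    for it in range(3):
        print(it, deterministic_batch(docs, 3, it), random_batch(docs, 1.0 / 3, it))
```

With the deterministic split, iteration `i` always sees the same documents as iteration `i + num_batches`; with the random split, the mini-batch composition changes on every pass. On an RDD, the random variant would roughly correspond to `sample(withReplacement = false, fraction, seed)` with a seed that changes per iteration.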
    
    > My initial idea is to support local matrix for now and add support for 
distributed matrix in the future.
    
    That sounds good.  I don't think you need to implement a distributed 
version in this PR, but it will be good to think about to make sure we can 
later generalize to a distributed version without breaking APIs.
    
    > Not sure how to fit the current version to the optimization steps. I 
thought the code was specific to LDA and hard to use in other contexts. Is there 
any example I can refer to?
    
    There's a nice explanation in Section 2.3 of the original paper: [Online 
Learning for Latent Dirichlet 
Allocation](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf).
  I haven't thought carefully about whether restructuring the code this way 
affects computation, but I think it'd be doable.  Don't bother, though, if it 
makes the code harder to understand; the main goal is readability.
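
As a rough illustration of what fitting the code to "optimization steps" could mean, the sketch below writes the online LDA update of the topic parameters `lambda` in two equivalent forms: as a weighted average, and as a gradient-style step with step size `rho_t = (tau0 + t)^(-kappa)`. This is a reading of the paper, not code from the PR; the function names and the default values of `tau0` and `kappa` are illustrative assumptions, and `lam_hat` stands for the per-mini-batch estimate of `lambda` that the E-step would produce.

```python
def learning_rate(t, tau0=1024.0, kappa=0.7):
    """Step size rho_t = (tau0 + t)^(-kappa); defaults are illustrative only."""
    return (tau0 + t) ** (-kappa)


def blend_update(lam, lam_hat, rho):
    """Weighted-average form: lambda <- (1 - rho) * lambda + rho * lambda_hat."""
    return [(1.0 - rho) * l + rho * lh for l, lh in zip(lam, lam_hat)]


def gradient_step_update(lam, lam_hat, rho):
    """Gradient-step form: lambda <- lambda + rho * (lambda_hat - lambda)."""
    return [l + rho * (lh - l) for l, lh in zip(lam, lam_hat)]


if __name__ == "__main__":
    lam = [1.0, 2.0]        # current topic parameters (toy values)
    lam_hat = [3.0, 6.0]    # estimate from the current mini-batch (toy values)
    rho = learning_rate(t=0)
    print(blend_update(lam, lam_hat, rho))
    print(gradient_step_update(lam, lam_hat, rho))
```

The two functions compute the same result up to floating-point rounding; the second form makes the "direction times step size" structure explicit, which is the shape a generic optimizer interface would expect.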
    
    I'll try to make another close pass soon!



