Github user karlhigley commented on the pull request:
https://github.com/apache/spark/pull/1269#issuecomment-57674145
I did notice that the iterations took longer and longer, but wasn't sure if
that was expected or not.
I'm training the model on a dataset with 400k documents and 51M total
words, on a standalone cluster of 3 slaves, each with 4 cores and 8 GB of
memory (12 total executors). Within 10 iterations of RobustPLSA, the size of
the serialized tasks grows to several megabytes. If I switch to the PLSA model
without making any other changes to the driver program, the serialized task
size stays roughly constant (around 60 KB) over the same number of
iterations. In both cases I'm using the default regularizers and have the
perplexity computation between iterations turned off.
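For what it's worth, one common cause of this pattern in iterative Spark jobs is that each iteration's task closure captures model state that grows with the iteration count, so every task ships a larger payload. A minimal plain-Python sketch of the mechanism (hypothetical illustration, not the RobustPLSA code): a callable standing in for a task closure serializes to a size roughly proportional to the state it captures.

```python
import pickle

class Task:
    """Stand-in for a per-partition task closure that captures model state."""
    def __init__(self, model_state):
        # Captured state is serialized and shipped with every task.
        self.model_state = model_state

    def __call__(self, x):
        return x + sum(self.model_state)

# A task capturing a small model vs. one capturing accumulated state.
small_task = Task(list(range(100)))
large_task = Task(list(range(100_000)))

print(len(pickle.dumps(small_task)))  # stays small
print(len(pickle.dumps(large_task)))  # grows with the captured state
```

If that's what is happening here, the usual mitigations are to broadcast the model each iteration (so tasks carry only a broadcast handle) and to checkpoint periodically to truncate lineage, rather than letting the closure accumulate state.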