Github user karlhigley commented on the pull request:
https://github.com/apache/spark/pull/1269#issuecomment-57674145
I did notice that the iterations took longer and longer, but wasn't sure if
that was expected or not.
I'm training the model on a dataset with 400k documents and 51M total
words, on a standalone cluster of 3 slaves, each with 4 cores and 8 GB of
memory (12 total executors). Within 10 iterations of RobustPLSA, the size of
the serialized tasks grows to several megabytes. If I switch to the PLSA model
without making any other changes to the driver program, the serialized task
size stays roughly constant (around 60 KB) over the same number of
iterations. In both cases I'm using the default regularizers and have the
perplexity computation between iterations turned off.
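For what it's worth, one common cause of this pattern in iterative Spark jobs is that each iteration's task closure captures model state that grows with the iteration count, so every task ships a larger payload. A minimal plain-Python sketch of the mechanism (hypothetical illustration, not the RobustPLSA code): a callable standing in for a task closure serializes to a size roughly proportional to the state it captures.

```python
import pickle

class Task:
    """Stand-in for a per-partition task closure that captures model state."""
    def __init__(self, model_state):
        # Captured state is serialized and shipped with every task.
        self.model_state = model_state

    def __call__(self, x):
        return x + sum(self.model_state)

# A task capturing a small model vs. one capturing accumulated state.
small_task = Task(list(range(100)))
large_task = Task(list(range(100_000)))

print(len(pickle.dumps(small_task)))  # stays small
print(len(pickle.dumps(large_task)))  # grows with the captured state
```

If that's what is happening here, the usual mitigations are to broadcast the model each iteration (so tasks carry only a broadcast handle) and to checkpoint periodically to truncate lineage, rather than letting the closure accumulate state.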