Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/4047#issuecomment-72559262
*Another test update*
Large dataset: It's still unclear to me why I cannot run for more than 10
or 15 iterations on the big Wikipedia dataset. Checkpointing on S3 works,
and persisting works. Checkpointing should be limiting the size of the
shuffle files. Sometimes executors run out of disk space, and sometimes
connections die (despite long timeout settings). We'll need more testing to
figure out how to scale this up further, but I think it's good enough for
medium-sized datasets.
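For context, here is a minimal sketch (in plain RDD terms, not the actual checkpointer utility from this PR) of the persist + periodic-checkpoint pattern being tested: checkpointing every few iterations truncates the RDD lineage, so shuffle files from old iterations can be cleaned up instead of accumulating. The checkpoint interval, the S3 path, and the toy computation are all illustrative placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PeriodicCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("periodic-checkpoint-sketch"))
    // Checkpoint files can go to any Hadoop-compatible filesystem, e.g. an S3 bucket.
    sc.setCheckpointDir("s3n://some-bucket/checkpoints") // placeholder path

    val checkpointInterval = 10 // illustrative value
    var counts: RDD[(Int, Long)] =
      sc.parallelize(0 until 1000000).map(i => (i % 1000, 1L))

    for (iter <- 1 to 30) {
      // Each iteration shuffles, so without checkpointing the lineage
      // (and the shuffle files behind it) keeps growing.
      val updated = counts
        .map { case (k, v) => ((k + iter) % 1000, v) }
        .reduceByKey(_ + _)
        .persist(StorageLevel.MEMORY_AND_DISK)

      if (iter % checkpointInterval == 0) {
        updated.checkpoint() // truncate lineage; materialized on the next action
      }
      updated.count() // force computation before dropping the previous iteration
      counts.unpersist(blocking = false)
      counts = updated
    }
    sc.stop()
  }
}
```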
I'm doing some cleanup and writing a test suite for the checkpointer
utility.