[jira] [Updated] (SPARK-5560) LDA EM should scale to more iterations

Joseph K. Bradley (JIRA) Mon, 22 Jun 2015 13:31:54 -0700

     [ https://issues.apache.org/jira/browse/SPARK-5560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Joseph K. Bradley updated SPARK-5560:
-------------------------------------
    Remaining Estimate: 336h
     Original Estimate: 336h

> LDA EM should scale to more iterations
> --------------------------------------
>
>                 Key: SPARK-5560
>                 URL: https://issues.apache.org/jira/browse/SPARK-5560
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (LDA) sometimes fails to run for many iterations 
> on large datasets, even when it is able to run for a few iterations.  It 
> should be able to run for as many iterations as the user likes, with proper 
> persistence and checkpointing.
> Here is an example from a test on 16 workers (EC2 r3.2xlarge) on a big 
> Wikipedia dataset:
> * 100 topics
> * Training set size (documents): 4072243
> * Vocabulary size: 9869422 terms
> * Training set size (tokens): 1041734290
> It runs for about 10-15 iterations before failing, even with a variety of 
> checkpointInterval values and longer timeout settings (up to 5 minutes).  
> The failures range from worker/driver disconnections to workers running out 
> of disk space.  Based on rough calculations, I would not expect workers to 
> run out of memory or disk space.  There was some job imbalance, but not a 
> significant amount.
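
The "rough calculations" mentioned above are not spelled out in the issue. As one possible reading, here is a minimal back-of-the-envelope sketch in Python. It assumes (this is not stated in the issue) that the EM state is one dense vector of topic counts, stored as 8-byte doubles, per document and per vocabulary term. It ignores edges, JVM object overhead, shuffle files and checkpoint data. The function name `vertex_state_bytes` and the constants are illustrative only.

```python
# Back-of-the-envelope estimate of the topic-count state for the reported run.
# Assumption (not from the issue): every document and every vocabulary term
# holds one dense vector of NUM_TOPICS doubles. Edges, object overhead,
# shuffle files and checkpoint data are deliberately ignored.

NUM_TOPICS = 100          # from the issue
NUM_DOCS = 4_072_243      # from the issue
VOCAB_SIZE = 9_869_422    # from the issue
NUM_WORKERS = 16          # from the issue
BYTES_PER_DOUBLE = 8      # assumed storage type


def vertex_state_bytes(num_docs, vocab_size, num_topics,
                       bytes_per_value=BYTES_PER_DOUBLE):
    """Bytes needed for one dense topic-count vector per document and term."""
    return (num_docs + vocab_size) * num_topics * bytes_per_value


total_bytes = vertex_state_bytes(NUM_DOCS, VOCAB_SIZE, NUM_TOPICS)
per_worker_bytes = total_bytes / NUM_WORKERS  # assumes a perfectly even split

print(f"total vertex state: {total_bytes / 1e9:.2f} GB")
print(f"per worker (even split): {per_worker_bytes / 1e9:.2f} GB")
```

Under these assumptions a single copy of the state comes to roughly 11 GB across the cluster, or about 0.7 GB per worker. The estimate says nothing about shuffle or checkpoint data that may accumulate over iterations, so it should be read as a lower bound on the footprint, not as an explanation of the failures.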



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

