Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/4047#issuecomment-70356709
Here are some initial test results. There are two sets of results, since I ran some
tests before the updates from @mengxr and the rest after the updates.
Summary: Iteration times keep getting longer; we will need to work on scalability,
but it at least runs on medium-sized datasets. The updates from @mengxr improve
scalability. I still need to test large numbers of topics.
## How tests were run
I ran using this branch:
[https://github.com/jkbradley/spark/tree/lda-testing].
It includes a little more instrumentation and a Timing script:
[https://github.com/jkbradley/spark/blob/lda-testing/examples/src/main/scala/org/apache/spark/examples/mllib/LDATiming.scala].
I used the collection of stopwords from @dlwh here:
[https://github.com/dlwh/spark/blob/feature/lda/examples/src/main/scala/org/apache/spark/examples/mllib/SimpleLatentDirichletAllocation.scala]
I ran using a (partial) dump of Wikipedia consisting of about 4.7GB of
gzipped text files.
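For context, here is a rough sketch (not the actual LDATiming preprocessing) of the kind of pipeline this implies: tokenize the dump, drop stopwords, keep the most frequent terms as the vocabulary, and turn each document into a sparse term-count vector keyed by a document id. The paths, tokenizer regex, and vocabulary cutoff are placeholders.
```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

object PreprocessSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LDAPreprocessSketch"))

    // Placeholder paths; textFile handles the gzipped input transparently.
    val stopwords = sc.broadcast(sc.textFile("stopwords.txt").collect().toSet)

    // Tokenize each document (here: each input line), dropping stopwords and short tokens.
    val tokenized: RDD[Seq[String]] = sc.textFile("DATADIR").map { line =>
      line.toLowerCase.split("[^a-z]+")
        .filter(t => t.length > 2 && !stopwords.value.contains(t))
        .toSeq
    }.cache()

    // Keep the most frequent terms as the vocabulary (cutoff is a placeholder).
    val vocabSize = 1000000
    val vocab: Map[String, Int] = tokenized
      .flatMap(_.map(term => (term, 1L)))
      .reduceByKey(_ + _)
      .top(vocabSize)(Ordering.by(_._2))
      .map(_._1).zipWithIndex.toMap
    val vocabB = sc.broadcast(vocab)

    // One sparse term-count vector per document, keyed by a document id.
    val documents: RDD[(Long, Vector)] = tokenized.zipWithIndex().map { case (tokens, docId) =>
      val counts = scala.collection.mutable.HashMap.empty[Int, Double]
      tokens.foreach { t =>
        vocabB.value.get(t).foreach(i => counts(i) = counts.getOrElse(i, 0.0) + 1.0)
      }
      (docId, Vectors.sparse(vocabB.value.size, counts.toSeq))
    }
    documents.cache()
    println(s"Training set size: ${documents.count()} documents")
  }
}
```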
My goal is to do a few sets of tests, scaling:
* corpus sizes: 10K, 100K, 1M
* k: 10, 100, 1K, 10K, 100K
* vocabSize: (not run yet)
I ran:
```
# FOR SCALING CORPUS SIZE:
bin/spark-submit --class org.apache.spark.examples.mllib.LDATiming \
  --master spark://MY_EC2_URL:7077 --driver-memory 20g \
  /root/spark-git/examples/target/scala-2.10/spark-examples-1.3.0-SNAPSHOT-hadoop1.0.4.jar \
  --corpusSizes "10000 100000 1000000 -1" --ks "100" --maxIterations 10 \
  --topicSmoothing -1 --termSmoothing -1 --vocabSizes "1000000" \
  --stopwordFile "stopwords.txt" "DATADIR"

# SCALING K:
bin/spark-submit --class org.apache.spark.examples.mllib.LDATiming \
  --master spark://MY_EC2_URL:7077 --driver-memory 20g \
  /root/spark-git/examples/target/scala-2.10/spark-examples-1.3.0-SNAPSHOT-hadoop1.0.4.jar \
  --corpusSizes "-1" --ks "10 100 1000 10000 100000" --maxIterations 10 \
  --topicSmoothing -1 --termSmoothing -1 --vocabSizes "1000000" \
  --stopwordFile "stopwords.txt" "DATADIR"
```
These used a 16-node EC2 cluster of r3.2xlarge machines.
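To make the parameters concrete, here is a minimal sketch (for spark-shell, not the LDATiming script itself) of what a single configuration boils down to against the MLlib LDA API. The concentration setters are an assumption about what the --topicSmoothing / --termSmoothing flags control, and the per-iteration timing instrumentation from the lda-testing branch is omitted.
```
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// documents: (docId, term-count vector) pairs, as in the preprocessing sketch above.
def runOneConfig(documents: RDD[(Long, Vector)], k: Int, maxIterations: Int): Unit = {
  val start = System.nanoTime()
  val ldaModel = new LDA()
    .setK(k)
    .setMaxIterations(maxIterations)
    .setDocConcentration(-1)    // -1 => library default (assumed meaning of --topicSmoothing -1)
    .setTopicConcentration(-1)  // -1 => library default (assumed meaning of --termSmoothing -1)
    .run(documents)
    .asInstanceOf[DistributedLDAModel]
  val elapsed = (System.nanoTime() - start) / 1e9
  println(s"Training time: $elapsed sec")
  // One plausible reading of the "average log likelihood" reported below: per-document average.
  println(s"Training data average log likelihood: ${ldaModel.logLikelihood / documents.count()}")
}
```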
## Results (before recent GC-related updates)
Take-aways: Iteration times keep getting longer. We will need to work on
scalability, but it at least runs on medium-sized datasets.
### Scaling corpus size
```
DATASET
Training set size: 9999 documents
Vocabulary size: 183313 terms
Training set size: 2537186 tokens
Preprocessing time: 53.03116398 sec
Finished training LDA model. Summary:
Training time: 51.766336365 sec
Training data average log likelihood: -2374.150807757336
Training times per iteration (sec):
16.301772584
2.714758941
2.681336067
2.812407396
3.067381155
3.148446287
4.091387595
4.845391099
4.800537527
5.784521297
```
Note that iteration times keep getting longer.
```
DATASET
Training set size: 99657 documents
Vocabulary size: 864755 terms
Training set size: 25372240 tokens
Preprocessing time: 56.172117053 sec
Finished training LDA model. Summary:
Training time: 272.724740335 sec
Training data average log likelihood: -2453.2238815201995
Training times per iteration (sec):
36.27088487
9.239099504
12.899834887
17.887326081
22.548736594
29.705019399
34.532178918
37.132915562
43.264158967
24.167606732
```
```
DATASET
Training set size: 998875 documents
Vocabulary size: 3718603 terms
Training set size: 255269137 tokens
Preprocessing time: 969.582218325 sec
(died)
```
### Scaling k
```
DATASET
Training set size: 4072243 documents
Vocabulary size: 1000000 terms
Training set size: 955849462 tokens
Preprocessing time: 1023.173703836 sec
Finished training LDA model. Summary:
Training time: 734.18870584 sec
Training data average log likelihood: -2487.378006538547
Training times per iteration (sec):
220.962623351
43.31892217
44.65509746
49.119503552
52.24947807
53.822309875
57.582740118
64.41201
70.151547256
72.043746927
```
(larger tests died)
## Results (after recent GC-related updates)
Main take-away: The updates from @mengxr improve scaling. (Notice that the
larger tests no longer die.)
### Scaling corpus size
```
DATASET
Training set size: 9999 documents
Vocabulary size: 549136 terms
Training set size: 2651999 tokens
Preprocessing time: 59.886667054 sec
Finished training LDA model. Summary:
Training time: 87.066636441 sec
Training data average log likelihood: -2987.1021154536284
Training times per iteration (sec):
25.916230397
3.348482537
3.133761688
4.325156952
5.231940702
6.117500071
7.246081989
8.584900244
9.413571911
9.114112645
```
```
DATASET
Training set size: 99657 documents
Vocabulary size: 1000000 terms
Training set size: 24106059 tokens
Preprocessing time: 79.624635494 sec
Finished training LDA model. Summary:
Training time: 295.883936257 sec
Training data average log likelihood: -2608.5841219446515
Training times per iteration (sec):
41.455679987
11.062455643
15.526668004
21.027575262
26.262190857
25.565775147
30.829831734
35.716967684
37.592023917
44.04621023
```
```
DATASET
Training set size: 998875 documents
Vocabulary size: 1000000 terms
Training set size: 235682866 tokens
Preprocessing time: 322.008531951 sec
Finished training LDA model. Summary:
Training time: 1073.726914484 sec
Training data average log likelihood: -2496.418600245705
Training times per iteration (sec):
119.644333858
41.555120562
52.719948261
64.48673763
88.892069695
100.981587858
123.62990158
150.65753992
157.688974275
168.515556567
```
```
DATASET
Training set size: 4072243 documents
Vocabulary size: 1000000 terms
Training set size: 955849462 tokens
Preprocessing time: 1110.123689033 sec
Finished training LDA model. Summary:
Training time: 4781.682695595 sec
Training data average log likelihood: -2483.61085533687
Training times per iteration (sec):
363.747503418
234.396490798
264.977904783
377.257946593
447.054876375
364.207562754
408.152587705
420.5513901
1080.746177241
813.866786165
```