Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/4047#issuecomment-70356709
Here are some initial test results. There are two sets of results, since I ran some
tests before the updates from @mengxr and the rest after the updates.
Summary: Iteration times keep getting longer; we will need to work on scalability,
but it at least runs on medium-sized datasets. The updates from @mengxr improve
scalability. I still need to test large numbers of topics.
## How tests were run
I ran using this branch:
[https://github.com/jkbradley/spark/tree/lda-testing].
It includes a little more instrumentation and a Timing script:
[https://github.com/jkbradley/spark/blob/lda-testing/examples/src/main/scala/org/apache/spark/examples/mllib/LDATiming.scala].
I used the collection of stopwords from @dlwh here:
[https://github.com/dlwh/spark/blob/feature/lda/examples/src/main/scala/org/apache/spark/examples/mllib/SimpleLatentDirichletAllocation.scala]
I ran using a (partial) dump of Wikipedia consisting of about 4.7GB of
gzipped text files.
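For context, here is a rough sketch (not the actual LDATiming preprocessing) of the kind of pipeline this implies: tokenize the dump, drop stopwords, keep the most frequent terms as the vocabulary, and turn each document into a sparse term-count vector keyed by a document id. The paths, tokenizer regex, and vocabulary cutoff are placeholders.
```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

object PreprocessSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LDAPreprocessSketch"))

    // Placeholder paths; textFile handles the gzipped input transparently.
    val stopwords = sc.broadcast(sc.textFile("stopwords.txt").collect().toSet)

    // Tokenize each document (here: each input line), dropping stopwords and short tokens.
    val tokenized: RDD[Seq[String]] = sc.textFile("DATADIR").map { line =>
      line.toLowerCase.split("[^a-z]+")
        .filter(t => t.length > 2 && !stopwords.value.contains(t))
        .toSeq
    }.cache()

    // Keep the most frequent terms as the vocabulary (cutoff is a placeholder).
    val vocabSize = 1000000
    val vocab: Map[String, Int] = tokenized
      .flatMap(_.map(term => (term, 1L)))
      .reduceByKey(_ + _)
      .top(vocabSize)(Ordering.by(_._2))
      .map(_._1).zipWithIndex.toMap
    val vocabB = sc.broadcast(vocab)

    // One sparse term-count vector per document, keyed by a document id.
    val documents: RDD[(Long, Vector)] = tokenized.zipWithIndex().map { case (tokens, docId) =>
      val counts = scala.collection.mutable.HashMap.empty[Int, Double]
      tokens.foreach { t =>
        vocabB.value.get(t).foreach(i => counts(i) = counts.getOrElse(i, 0.0) + 1.0)
      }
      (docId, Vectors.sparse(vocabB.value.size, counts.toSeq))
    }
    documents.cache()
    println(s"Training set size: ${documents.count()} documents")
  }
}
```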
My goal is to do a few sets of tests, scaling:
* corpus sizes: 10K, 100K, 1M
* k: 10, 100, 1K, 10K, 100K
* vocabSize: (not run yet)
I ran:
```
# FOR SCALING CORPUS SIZE:
bin/spark-submit --class org.apache.spark.examples.mllib.LDATiming \
  --master spark://MY_EC2_URL:7077 --driver-memory 20g \
  /root/spark-git/examples/target/scala-2.10/spark-examples-1.3.0-SNAPSHOT-hadoop1.0.4.jar \
  --corpusSizes "10000 100000 1000000 -1" --ks "100" --maxIterations 10 \
  --topicSmoothing -1 --termSmoothing -1 --vocabSizes "1000000" \
  --stopwordFile "stopwords.txt" "DATADIR"

# SCALING K:
bin/spark-submit --class org.apache.spark.examples.mllib.LDATiming \
  --master spark://MY_EC2_URL:7077 --driver-memory 20g \
  /root/spark-git/examples/target/scala-2.10/spark-examples-1.3.0-SNAPSHOT-hadoop1.0.4.jar \
  --corpusSizes "-1" --ks "10 100 1000 10000 100000" --maxIterations 10 \
  --topicSmoothing -1 --termSmoothing -1 --vocabSizes "1000000" \
  --stopwordFile "stopwords.txt" "DATADIR"
```
These used a 16-node EC2 cluster of r3.2xlarge machines.
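To make the parameters concrete, here is a minimal sketch (for spark-shell, not the LDATiming script itself) of what a single configuration boils down to against the MLlib LDA API. The concentration setters are an assumption about what the --topicSmoothing / --termSmoothing flags control, and the per-iteration timing instrumentation from the lda-testing branch is omitted.
```
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// documents: (docId, term-count vector) pairs, as in the preprocessing sketch above.
def runOneConfig(documents: RDD[(Long, Vector)], k: Int, maxIterations: Int): Unit = {
  val start = System.nanoTime()
  val ldaModel = new LDA()
    .setK(k)
    .setMaxIterations(maxIterations)
    .setDocConcentration(-1)    // -1 => library default (assumed meaning of --topicSmoothing -1)
    .setTopicConcentration(-1)  // -1 => library default (assumed meaning of --termSmoothing -1)
    .run(documents)
    .asInstanceOf[DistributedLDAModel]
  val elapsed = (System.nanoTime() - start) / 1e9
  println(s"Training time: $elapsed sec")
  // One plausible reading of the "average log likelihood" reported below: per-document average.
  println(s"Training data average log likelihood: ${ldaModel.logLikelihood / documents.count()}")
}
```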
## Results (before recent GC-related updates)
Take-aways: Iteration times keep getting longer. We will need to work on
scalability, but it at least runs on medium-sized datasets.
### Scaling corpus size
```
DATASET
Training set size: 9999 documents
Vocabulary size: 183313 terms
Training set size: 2537186 tokens
Preprocessing time: 53.03116398 sec
Finished training LDA model. Summary:
Training time: 51.766336365 sec
Training data average log likelihood: -2374.150807757336
Training times per iteration (sec):
16.301772584
2.714758941
2.681336067
2.812407396
3.067381155
3.148446287
4.091387595
4.845391099
4.800537527
5.784521297
```
Note that iteration times keep getting longer.
```
DATASET
Training set size: 99657 documents
Vocabulary size: 864755 terms
Training set size: 25372240 tokens
Preprocessing time: 56.172117053 sec
Finished training LDA model. Summary:
Training time: 272.724740335 sec
Training data average log likelihood: -2453.2238815201995
Training times per iteration (sec):
36.27088487
9.239099504
12.899834887
17.887326081
22.548736594
29.705019399
34.532178918
37.132915562
43.264158967
24.167606732
```
```
DATASET
Training set size: 998875 documents
Vocabulary size: 3718603 terms
Training set size: 255269137 tokens
Preprocessing time: 969.582218325 sec
(died)
```
### Scaling k
```
DATASET
Training set size: 4072243 documents
Vocabulary size: 1000000 terms
Training set size: 955849462 tokens
Preprocessing time: 1023.173703836 sec
Finished training LDA model. Summary:
Training time: 734.18870584 sec
Training data average log likelihood: -2487.378006538547
Training times per iteration (sec):
220.962623351
43.31892217
44.65509746
49.119503552
52.24947807
53.822309875
57.582740118
64.41201
70.151547256
72.043746927
```
(larger tests died)
## Results (after recent GC-related updates)
Main take-away: The updates from @mengxr improve scaling. (Notice that the
larger tests no longer die.)
### Scaling corpus size
```
DATASET
Training set size: 9999 documents
Vocabulary size: 549136 terms
Training set size: 2651999 tokens
Preprocessing time: 59.886667054 sec
Finished training LDA model. Summary:
Training time: 87.066636441 sec
Training data average log likelihood: -2987.1021154536284
Training times per iteration (sec):
25.916230397
3.348482537
3.133761688
4.325156952
5.231940702
6.117500071
7.246081989
8.584900244
9.413571911
9.114112645
```
```
DATASET
Training set size: 99657 documents
Vocabulary size: 1000000 terms
Training set size: 24106059 tokens
Preprocessing time: 79.624635494 sec
Finished training LDA model. Summary:
Training time: 295.883936257 sec
Training data average log likelihood: -2608.5841219446515
Training times per iteration (sec):
41.455679987
11.062455643
15.526668004
21.027575262
26.262190857
25.565775147
30.829831734
35.716967684
37.592023917
44.04621023
```
```
DATASET
Training set size: 998875 documents
Vocabulary size: 1000000 terms
Training set size: 235682866 tokens
Preprocessing time: 322.008531951 sec
Finished training LDA model. Summary:
Training time: 1073.726914484 sec
Training data average log likelihood: -2496.418600245705
Training times per iteration (sec):
119.644333858
41.555120562
52.719948261
64.48673763
88.892069695
100.981587858
123.62990158
150.65753992
157.688974275
168.515556567
```
```
DATASET
Training set size: 4072243 documents
Vocabulary size: 1000000 terms
Training set size: 955849462 tokens
Preprocessing time: 1110.123689033 sec
Finished training LDA model. Summary:
Training time: 4781.682695595 sec
Training data average log likelihood: -2483.61085533687
Training times per iteration (sec):
363.747503418
234.396490798
264.977904783
377.257946593
447.054876375
364.207562754
408.152587705
420.5513901
1080.746177241
813.866786165
```