[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

avulanov Fri, 05 Dec 2014 17:28:12 -0800

Github user avulanov commented on the pull request:

    https://github.com/apache/spark/pull/1379#issuecomment-65879536
  
    @dbtsai Here are the results of my tests:
    - Settings:
         - Spark: latest Spark merged with 
https://github.com/dbtsai/spark/tree/dbtsai-mlor (manual merge) and 
https://github.com/avulanov/spark/tree/annclassifier. The optimizer in MLOR was 
changed to LBFGS so that the comparison with ANN, which also uses LBFGS, is fair.
         - Hadoop 1.2.1, dataset is loaded from hdfs
         - Cluster: 6 machines Xeon 3.3GHz, 16GB RAM, each machine has 2 Spark 
Workers with a maximum of 8GB of RAM and 2GB used, total 16 workers
         - Dataset: mnist8m; classes: 10;  data: 8,100,000 instances; features: 
784; random split 99% train, 1% test
         - Link to the dataset: 
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2
         - Learning settings: 40 iterations, tolerance=1e-4 (both); ANN 
classifier: hidden layer `Array[Int]()` (no hidden layer - the same as 
regression)
    - Result
         - ANN classifier: training time: 00:47:55; accuracy: 0.848
         - MLOR: training time: 01:30:45; accuracy: 0.864
    - Average gradient compute time (`mapPartitionsWithIndex at 
RDDFunctions.scala:108`)
         - ANN classifier: 51 seconds
         - MLOR: 2.1 minutes
    - Average update time (`reduce at RDDFunctions.scala:112`)
         - ANN classifier: 90 ms
         - MLOR: 90 ms
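
    The timings above can be turned into ratios with a few lines of arithmetic. 
This is only a sketch that re-uses the numbers reported in this comment; the 
helper `to_seconds` and the variable names are made up for illustration.

```python
def to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Total training times as reported in the Result list.
ann_train = to_seconds("00:47:55")
mlor_train = to_seconds("01:30:45")
train_speedup = mlor_train / ann_train

# Average gradient compute times: 51 seconds vs 2.1 minutes.
ann_grad = 51.0
mlor_grad = 2.1 * 60
grad_speedup = mlor_grad / ann_grad

# Accuracy gap, expressed as a fraction (multiply by 100 for percentage points).
accuracy_gap = 0.864 - 0.848

print(f"training speedup (MLOR time / ANN time): {train_speedup:.2f}")
print(f"gradient speedup (MLOR time / ANN time): {grad_speedup:.2f}")
print(f"accuracy gap in percentage points: {accuracy_gap * 100:.1f}")
```

    The update step takes 90 ms in both cases, so the gap in total training time 
comes from the gradient computation.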
    
    It seems that ANN is almost 2x faster (with the settings above), though its 
accuracy is 1.6 percentage points lower. The difference in accuracy can be 
explained by the fact that ANN uses a (half) squared error cost function instead 
of cross entropy, and no softmax. Cross entropy and softmax are supposed to be 
better for classification.
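
    To illustrate the cost function point, here is a small self-contained sketch 
that compares the two costs on a single, confidently wrong prediction. It is not 
the code from either branch. It assumes sigmoid output units for the squared 
error case (the comment does not say which activation ANN uses), and the scores 
`z` are invented for the example.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def softmax(z):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def half_squared_error(outputs, target):
    return 0.5 * sum((o - t) ** 2 for o, t in zip(outputs, target))


def cross_entropy(probs, target):
    return -sum(t * math.log(p) for p, t in zip(probs, target) if t > 0)


# Invented pre-activation scores for 3 classes; the true class is index 1,
# so the model is confidently wrong.
z = [4.0, -4.0, 0.0]
target = [0.0, 1.0, 0.0]

# Assumed setup 1: sigmoid outputs with half squared error.
sig_out = [sigmoid(v) for v in z]
se_loss = half_squared_error(sig_out, target)
# Gradient w.r.t. z: (o - t) * sigmoid'(z), with sigmoid'(z) = o * (1 - o).
se_grad = [(o - t) * o * (1.0 - o) for o, t in zip(sig_out, target)]

# Setup 2: softmax outputs with cross entropy.
probs = softmax(z)
ce_loss = cross_entropy(probs, target)
# Gradient w.r.t. z simplifies to (p - t).
ce_grad = [p - t for p, t in zip(probs, target)]

print("squared error gradient on true class:", se_grad[1])
print("cross entropy gradient on true class:", ce_grad[1])
```

    Under these assumptions the squared error gradient on the true class is 
scaled down by the saturated sigmoid, while the cross entropy gradient stays 
close to -1. This is one common argument for preferring softmax with cross 
entropy in classification.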


