Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/166#issuecomment-40668271
  
    I rewrote the two versions of `GradientDescent` to use `Vector` instead of 
`Array`. Lasso is easy to test now, thanks to @mengxr's refactoring of the code.
    
    I ran the tests on a single node in local mode. Note that the original 
version runs 100 iterations, while the other two run 10 iterations with 10 
local iterations.
    
    Latest update:
    
    | Type                | Time   | Last 10 losses      |
    | ------------------- |:------:| -------------------:|
    | original LR         | 346    | 0.6444 - 0.6430     |
    | 1-version LR        | 45     | 0.7082 - 0.6773     |
    | **2-version LR**    | **37** | **0.7070 - 0.6817** |
    | original SVM        | 338    | 0.9468 - 0.9468     |
    | 1-version SVM       | 46     | 0.7861 - 0.7803     |
    | **2-version SVM**   | **34** | **0.7875 - 0.7829** |
    | original Lasso      | 320    | 0.6063 - 0.6063     |
    | 1-version Lasso     | 39     | 0.6131 - 0.2062     |
    | **2-version Lasso** | **32** | **0.6062 - 0.2104** |
    
    The 1-version is not good because it reuses the `Iterator`, which 
inherently stores all elements in a queue and will cause an OOM if a 
partition holds enough data. The 2-version is better, but because of its 
tiny-batch property, its convergence is slightly worse than the 1-version's. 
There is a trade-off between hardware efficiency and statistical efficiency.
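
    To make the trade-off concrete, here is a minimal Scala sketch. It is not the code from this pull request: the object name, the 1-D least-squares model, and both function names are hypothetical, and the two variants only assume what "reuse of `Iterator`" and "tiny-batch" might look like. Variant A uses `Iterator.duplicate` to walk the whole partition once per local iteration; variant B uses `Iterator.grouped` to read the partition once in small groups.

```scala
// Hypothetical sketch; names and the 1-D least-squares model are illustrative only.
object LocalIterationSketch {
  // One labelled point of a 1-D least-squares problem: y ≈ w * x.
  final case class Point(x: Double, y: Double)

  // Mean gradient of 0.5 * (w * x - y)^2 over the points, computed in one pass.
  def gradient(w: Double, points: Iterator[Point]): Double = {
    val (sum, n) = points.foldLeft((0.0, 0L)) {
      case ((s, c), p) => (s + (w * p.x - p.y) * p.x, c + 1L)
    }
    if (n == 0L) 0.0 else sum / n
  }

  // Variant A (assumed reading of "reuse of Iterator"): traverse the whole
  // partition once per local iteration. `duplicate` buffers every element
  // consumed by `now` until `later` consumes it, so after the first pass
  // the full partition is held in memory.
  def reuseIterator(part: Iterator[Point], w0: Double,
                    localIters: Int, stepSize: Double): Double = {
    var rest = part
    var w = w0
    for (_ <- 0 until localIters) {
      val (now, later) = rest.duplicate
      w -= stepSize * gradient(w, now)
      rest = later
    }
    w
  }

  // Variant B (assumed reading of "tiny-batch"): read the partition once in
  // groups of `batchSize` and run all local iterations on each group, so
  // only one small group is in memory at a time.
  def tinyBatches(part: Iterator[Point], w0: Double, localIters: Int,
                  stepSize: Double, batchSize: Int): Double =
    part.grouped(batchSize).foldLeft(w0) { (w, batch) =>
      (0 until localIters).foldLeft(w) { (wi, _) =>
        wi - stepSize * gradient(wi, batch.iterator)
      }
    }

  def main(args: Array[String]): Unit = {
    // Toy data generated from y = 2x, so both variants should approach w = 2.
    val data = Seq(Point(1.0, 2.0), Point(2.0, 4.0), Point(3.0, 6.0), Point(4.0, 8.0))
    println(reuseIterator(data.iterator, 0.0, 10, 0.05))
    println(tinyBatches(data.iterator, 0.0, 10, 0.05, 2))
  }
}
```

    In this sketch, variant A computes every step from the full partition but pays for it in buffered memory, while variant B keeps memory bounded and computes each step from only a few points, which is one way the hardware-versus-statistical efficiency trade-off can show up.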
    
    I ported my code to an independent [git 
repo](https://github.com/yinxusen/gradient_descent_variants) to make 
experiments easier. I'll move it back here soon.


