[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226958#comment-14226958 ]

Aaron Staple edited comment on SPARK-1503 at 11/26/14 11:14 PM:
----------------------------------------------------------------

[~rezazadeh] Thanks for your feedback.

Your point about the communication cost of backtracking is well taken. Just to 
explain where I was coming from with the design proposal: as I was coming up to 
speed on accelerated gradient descent, I came across scattered comments online 
suggesting that acceleration was difficult to implement well, could be finicky, 
etc., especially when compared with standard gradient descent for machine 
learning applications. Wary of those comments, I wrote up the proposal with the 
intention of duplicating TFOCS as closely as possible to start with, leaving 
open the possibility of changes later based on performance. In addition, I’d 
interpreted an earlier comment in this ticket as suggesting that the line 
search be implemented in the same manner as in TFOCS.

I am happy to implement a constant step size first, though. It may also be 
informative to run some performance tests in Spark both with and without 
backtracking. One (basic, not conclusive) data point: running TFOCS’ test_LASSO 
example triggers 97 iterations of the outer AT (Auslender-Teboulle) loop and 
106 iterations of the inner backtracking loop, so for that example the 
backtracking iteration overhead is only about 10%. Keep in mind, though, that 
in Spark removing backtracking entirely would mean only one distributed 
aggregation per iteration rather than two - a large improvement in 
communication cost, assuming the convergence rate remains good.
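
To make the aggregation counting concrete, here is a rough local sketch (plain 
Scala, no Spark; the object name and data are made up for illustration) of a 
Nesterov-style iteration with a constant step size. The point is just that each 
iteration touches the data exactly once to compute the gradient, which in Spark 
would be the single distributed aggregation; a backtracking variant would need 
at least one additional pass per iteration to evaluate the objective at the 
candidate point.

{code:scala}
// Local sketch of accelerated gradient descent with a constant step size,
// minimizing f(x) = 0.5 * ||A x - b||^2. Not Spark code; in MLlib the
// gradient computation below would be a single treeAggregate over the data.
object ConstantStepAGD {
  // gradient of f at x: A^T (A x - b)
  def gradient(a: Array[Array[Double]], b: Array[Double],
               x: Array[Double]): Array[Double] = {
    val resid = a.map(row => row.zip(x).map { case (aij, xj) => aij * xj }.sum)
      .zip(b).map { case (p, bi) => p - bi }
    x.indices.map(j => a.indices.map(i => a(i)(j) * resid(i)).sum).toArray
  }

  def main(args: Array[String]): Unit = {
    val a = Array(Array(1.0, 0.0), Array(0.0, 2.0), Array(1.0, 1.0))
    val b = Array(1.0, 2.0, 2.0)
    val step = 0.1            // would be ~ 1/L given a Lipschitz estimate L
    var x = Array(0.0, 0.0)   // current iterate
    var y = x                 // extrapolated (momentum) point
    var tPrev = 1.0
    for (_ <- 1 to 200) {
      // The only pass over the data in this iteration.
      val g = gradient(a, b, y)
      val xNext = y.zip(g).map { case (yi, gi) => yi - step * gi }
      val t = (1.0 + math.sqrt(1.0 + 4.0 * tPrev * tPrev)) / 2.0
      y = xNext.zip(x).map { case (xn, xo) =>
        xn + ((tPrev - 1.0) / t) * (xn - xo)
      }
      x = xNext
      tPrev = t
    }
    // approaches the least-squares solution [1.0, 1.0]
    println(x.mkString("x = [", ", ", "]"))
  }
}
{code}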

Incidentally, are there any specific learning benchmarks for Spark that you 
would recommend?

I’ll do a bit of research to identify the best ways to manage the Lipschitz 
estimate / step size in the absence of backtracking (for our objective 
functions in particular). I’ve also noticed some references online to 
distributed implementations of accelerated methods. It may be informative to 
learn more about them - if you happen to have heard of any particularly good 
distributed optimizers using acceleration, please let me know.
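
As a first idea there (again just a rough local sketch, not Spark code, and 
assuming a plain least-squares objective, where the Lipschitz constant of the 
gradient is the largest eigenvalue of A^T A): the constant could be estimated 
up front with a few power-iteration passes over the data, each of which would 
be one distributed aggregation in Spark, and then a fixed step size of roughly 
1/L could be used for all subsequent iterations.

{code:scala}
// Local sketch: estimate the Lipschitz constant of the least-squares gradient
// (the largest eigenvalue of A^T A) by power iteration. In Spark, each
// matrix-vector product A^T (A v) would be one aggregation over the rows of A.
object LipschitzEstimate {
  def main(args: Array[String]): Unit = {
    val a = Array(Array(1.0, 0.0), Array(0.0, 2.0), Array(1.0, 1.0))
    var v = Array(1.0, 1.0)   // arbitrary nonzero start vector
    var eig = 0.0
    for (_ <- 1 to 50) {
      val av = a.map(row => row.zip(v).map { case (aij, vj) => aij * vj }.sum)
      val w = v.indices.map(j => a.indices.map(i => a(i)(j) * av(i)).sum).toArray
      // With v normalized to unit length, ||A^T A v|| converges to the
      // largest eigenvalue of A^T A.
      eig = math.sqrt(w.map(wi => wi * wi).sum)
      v = w.map(_ / eig)
    }
    println(s"Lipschitz estimate L ~= $eig, constant step ~= ${1.0 / eig}")
  }
}
{code}

A step size of 1/L (or a bit smaller, to be safe with an inexact estimate) 
should in theory preserve the accelerated convergence rate without any 
per-iteration line search, though the right choice for our actual objective 
functions will need testing.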

Thanks,
Aaron

P.S. Yes, I’ll make sure to follow the LBFGS example so the accelerated 
implementation can be easily substituted.


> Implement Nesterov's accelerated first-order method
> ---------------------------------------------------
>
>                 Key: SPARK-1503
>                 URL: https://issues.apache.org/jira/browse/SPARK-1503
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Aaron Staple
>
> Nesterov's accelerated first-order method is a drop-in replacement for 
> steepest descent but it converges much faster. We should implement this 
> method and compare its performance with existing algorithms, including SGD 
> and L-BFGS.
> TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
> method and its variants on composite objectives.


