[
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226958#comment-14226958
]
Aaron Staple edited comment on SPARK-1503 at 11/26/14 11:09 PM:
----------------------------------------------------------------
[~rezazadeh] Thanks for your feedback.
Your point about the communication cost of backtracking is well taken. Just to
explain where I was coming from with the design proposal: As I was looking to
come up to speed on accelerated gradient descent, I came across scattered comments online suggesting that acceleration was difficult to implement well and could be finicky, especially compared with standard gradient descent for machine learning. Wary of those comments, I wrote the proposal to duplicate TFOCS as closely as possible to start with, leaving room to make changes later based on performance. In addition, I’d interpreted an earlier comment in this ticket as suggesting that line search be implemented in the same manner as in TFOCS.
I am happy to implement a constant step size first, though. It may also be informative to run some performance tests in Spark both with and without backtracking. One (basic, not conclusive) data point I have now: running TFOCS’ test_LASSO example triggers 97 iterations of the outer AT loop and 106 iterations of the inner backtracking loop, so the backtracking iteration overhead for that example is only about 10%. Keep in mind, though, that removing backtracking entirely in Spark would mean only one distributed aggregation per iteration rather than two, which is a large improvement in communication cost, provided the convergence rate remains good.
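To make the communication point concrete, here is a rough sketch (my own illustration, not TFOCS code or a proposed MLlib API) of an accelerated iteration with a fixed step size over an RDD, using a smooth least-squares loss and a FISTA-style Nesterov update rather than the exact AT update TFOCS uses. With no backtracking, each iteration needs just the one distributed pass that computes the gradient at the auxiliary point; a backtracking variant would add at least a second pass to evaluate the objective at the candidate point.
{code:scala}
// Sketch only: Nesterov-style accelerated iterations with a constant step
// size for a smooth least-squares loss. Names and structure are illustrative.
import org.apache.spark.rdd.RDD
import breeze.linalg.{DenseVector => BDV}

def accelerated(data: RDD[(Double, BDV[Double])], init: BDV[Double],
                lipschitz: Double, numIters: Int): BDV[Double] = {
  val n = data.count().toDouble
  val step = 1.0 / lipschitz          // fixed step size from a Lipschitz bound
  var x = init                        // main iterate
  var y = init                        // auxiliary (extrapolation) point
  var tPrev = 1.0
  for (_ <- 1 to numIters) {
    // The single distributed aggregation per iteration: gradient of
    // (1/2n) * sum_i ((a_i dot w) - b_i)^2 evaluated at the auxiliary point y.
    val grad = data
      .map { case (b, a) => a * ((a dot y) - b) }
      .reduce(_ + _) * (1.0 / n)
    val xNext = y - grad * step       // gradient step (a prox step if composite)
    val t = (1.0 + math.sqrt(1.0 + 4.0 * tPrev * tPrev)) / 2.0
    y = xNext + (xNext - x) * ((tPrev - 1.0) / t)  // Nesterov extrapolation
    x = xNext
    tPrev = t
  }
  x
}
{code}
A backtracking version would wrap the gradient step in a loop that also aggregates the loss at xNext, which is where the second (or third, on a failed check) distributed pass comes from.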
Incidentally, are there any specific learning benchmarks for Spark that you would recommend?
I’ll do a bit of research to identify the best ways to manage the Lipschitz estimate / step size in the absence of backtracking (for our objective functions in particular). I’ve also noticed some references online to distributed implementations of accelerated methods; it may be informative to learn more about them. If you’ve heard of any particularly good distributed optimizers using acceleration, please let me know.
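One concrete option I have in mind for the step size (my own sketch, not something already in the proposal): for a least-squares objective f(w) = (1/2n) ||Aw - b||^2 the gradient is Lipschitz with constant L = lambda_max(A^T A) / n, so a few distributed power-iteration passes over the data give an estimate of L, and the fixed step size is then 1 / L.
{code:scala}
// Sketch only: estimate the Lipschitz constant of a least-squares gradient
// via power iteration on A^T A, then use step = 1 / L. Names are illustrative.
import org.apache.spark.rdd.RDD
import breeze.linalg.{norm, DenseVector => BDV}

def estimateLipschitz(features: RDD[BDV[Double]], dim: Int, iters: Int = 20): Double = {
  val n = features.count().toDouble
  var v = BDV.fill(dim)(1.0 / math.sqrt(dim))
  var eig = 0.0
  for (_ <- 1 to iters) {
    // One distributed pass per power-iteration step: computes (A^T A) v.
    val av = features.map(a => a * (a dot v)).reduce(_ + _)
    eig = norm(av)
    v = av * (1.0 / eig)
  }
  eig / n  // estimate of L = lambda_max(A^T A) / n; step size = 1 / L
}
{code}
Backtracking in TFOCS effectively adapts this estimate every iteration, so a fixed estimate trades some robustness for the single aggregation per iteration discussed above.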
Thanks,
Aaron
PS Yes, I’ll make sure to follow the LBFGS example so the accelerated implementation can be easily substituted.
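For the record, the substitution point I have in mind looks like the sketch below: as LBFGS does, the accelerated optimizer would implement MLlib’s Optimizer trait so callers can swap it in without other changes. The class name and constructor parameters here are placeholders, not a final API.
{code:scala}
// Sketch of the plug-in point, following the LBFGS example.
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.optimization.{Gradient, Optimizer, Updater}
import org.apache.spark.rdd.RDD

class AcceleratedGradientDescent(gradient: Gradient, updater: Updater)
  extends Optimizer {
  override def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
    // The accelerated (AT / Nesterov) iterations would live here, structured
    // like LBFGS.optimize and GradientDescent.optimize.
    initialWeights  // placeholder return
  }
}
{code}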
> Implement Nesterov's accelerated first-order method
> ---------------------------------------------------
>
> Key: SPARK-1503
> URL: https://issues.apache.org/jira/browse/SPARK-1503
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Xiangrui Meng
> Assignee: Aaron Staple
>
> Nesterov's accelerated first-order method is a drop-in replacement for
> steepest descent but it converges much faster. We should implement this
> method and compare its performance with existing algorithms, including SGD
> and L-BFGS.
> TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's
> method and its variants on composite objectives.