[
https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15819964#comment-15819964
]
Weichen Xu edited comment on SPARK-10078 at 1/12/17 4:59 AM:
-------------------------------------------------------------
[~debasish83] I think that making VF-LBFGS/VF-OWLQN support a generic
distributed vector interface (moved into breeze?), so that they can run on
multiple distributed platforms (not only Spark), will make it difficult to
optimize them for the Spark platform.
When we implemented VF-LBFGS/VF-OWLQN on Spark, we found that many
optimizations need to combine Spark features closely with the optimizer
algorithm. An abstract interface for distributed vectors (for example,
vector space operators such as dot, add, and scale, plus persist/unpersist
operators and so on) does not seem to be enough.
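To make the discussion concrete, here is a minimal sketch of what such a
platform-neutral interface might look like. The names `DistributedVector` and
`LocalVector` and all method signatures are hypothetical, used only for
illustration; they are not an existing Breeze or Spark API. The in-memory
implementation is there only to show that the trait is usable.

```scala
// Hypothetical platform-neutral interface (an assumption for illustration).
trait DistributedVector[V <: DistributedVector[V]] {
  def dot(other: V): Double     // inner product
  def add(other: V): V          // element-wise sum
  def scale(alpha: Double): V   // multiply every element by alpha
  def persist(): V              // ask the platform to cache this vector
  def unpersist(): Unit         // release the cache
}

// Toy single-machine implementation backed by an Array.
final case class LocalVector(values: Array[Double])
    extends DistributedVector[LocalVector] {

  def dot(other: LocalVector): Double = {
    require(values.length == other.values.length, "dimension mismatch")
    values.indices.map(i => values(i) * other.values(i)).sum
  }

  def add(other: LocalVector): LocalVector = {
    require(values.length == other.values.length, "dimension mismatch")
    LocalVector(values.zip(other.values).map { case (a, b) => a + b })
  }

  def scale(alpha: Double): LocalVector = LocalVector(values.map(_ * alpha))

  // Caching is a no-op for a local array.
  def persist(): LocalVector = this
  def unpersist(): Unit = ()
}
```

Note that nothing in this sketch says when `persist` actually takes effect,
or how an operator could return a side value such as a sum collected while
the vector is being computed.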
Here are two simple problems that show the complexity of designing a general
interface:
1. Look at this VF-OWLQN implementation based on Spark:
https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeOWLQN.scala
OWLQN internally computes the pseudo-gradient for L1 regularization. In the
function `calculateComponentWithL1`, while computing the pseudo-gradient
from RDDs, the code also uses an accumulator (a Spark-specific feature) to
calculate the adjusted fnValue. Would the abstract interface then have to
contain something like Spark's `accumulator`?
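The idea in point 1 can be sketched roughly as follows. This is not the code
of `calculateComponentWithL1`; it is a simplified illustration that assumes
the coordinates and the gradient are stored as two `RDD[Double]` with
identical partitioning, and the names `L1PseudoGradientSketch`,
`pseudoGradient`, and `adjust` are hypothetical. It needs Spark 2.0 or later
on the classpath.

```scala
import org.apache.spark.rdd.RDD

object L1PseudoGradientSketch {
  // Standard OWL-QN pseudo-gradient for one coordinate:
  // x is the coordinate, g the smooth-loss gradient, l1 the L1 weight.
  def pseudoGradient(x: Double, g: Double, l1: Double): Double =
    if (x > 0) g + l1
    else if (x < 0) g - l1
    else if (g + l1 < 0) g + l1
    else if (g - l1 > 0) g - l1
    else 0.0

  // Returns the cached pseudo-gradient and fnValue + l1 * sum(|x_i|).
  // Assumes x and grad have the same number of partitions and the same
  // number of elements per partition (required by RDD.zip).
  def adjust(x: RDD[Double], grad: RDD[Double],
             fnValue: Double, l1: Double): (RDD[Double], Double) = {
    val penalty = x.sparkContext.doubleAccumulator("l1Penalty")
    val pg = x.zip(grad).map { case (xi, gi) =>
      penalty.add(l1 * math.abs(xi)) // side output collected in the same pass
      pseudoGradient(xi, gi, l1)
    }.persist()
    // Run exactly one action so the accumulator is filled and pg is cached.
    // Updates made inside a transformation can be applied more than once if
    // tasks are re-executed, so treat this as a sketch of the pattern only.
    pg.count()
    (pg, fnValue + penalty.sum)
  }
}
```

The point is that the pseudo-gradient and the penalty term come out of a
single pass over the data, which a plain dot/add/scale interface has no way
to express.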
2. The persist/unpersist/checkpoint problem in Spark. Because Spark evaluates
lazily, an improper persist/unpersist/checkpoint order may cause serious
problems (RDD recomputation, checkpoints that take no effect, and so on).
To see this complexity, take a look at the VF-LBFGS implementation on
Spark:
https://github.com/yanboliang/spark-vlbfgs/blob/master/src/main/scala/org/apache/spark/ml/optim/VectorFreeLBFGS.scala
It uses the pattern "persist the current step's RDDs, then unpersist the
previous step's RDDs", like many other algorithms in Spark MLlib. The
difficulty is that Spark always computes lazily: when you call persist on an
RDD, it is not persisted immediately; persistence is postponed until an
action is called on the RDD. If the internal code calls `unpersist` too
early, an RDD that has not been computed yet, and so has not actually been
persisted (although the persist API was called), has already lost the cached
RDDs it depends on. In that situation the whole RDD lineage has to be
recomputed.
This behavior may be quite different from other distributed platforms, so can
a general interface really handle this problem correctly and still remain
efficient?
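A minimal sketch of the ordering described in point 2, assuming Spark is on
the classpath. The object name `PersistOrderSketch`, the function `iterate`,
and the toy update `_ * 0.5` are hypothetical stand-ins for a real optimizer
step; only `persist`, `count`, and `unpersist` are actual RDD methods.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object PersistOrderSketch {
  def iterate(sc: SparkContext, steps: Int): RDD[Double] = {
    var prev: RDD[Double] = sc.parallelize(Seq.fill(1000)(1.0), 4).persist()
    prev.count() // action: prev is now actually cached

    for (_ <- 1 to steps) {
      // persist() only marks next for caching; nothing is computed yet.
      val next = prev.map(_ * 0.5).persist()
      // The action computes next from the cached prev and caches the result.
      next.count()
      // Only now is it safe to drop prev. Calling unpersist before the
      // action above would force next to be rebuilt from the full lineage.
      prev.unpersist()
      prev = next
    }
    prev
  }
}
```

The correct order depends on knowing where the actions are, which is exactly
the kind of platform detail a generic `persist`/`unpersist` pair would hide.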
Given the problems listed above (which are only a small part of them), in my
opinion breeze can provide the following base classes and/or abstract
interfaces:
* FirstOrderMinimizer
* DiffFunction interface
* LineSearch implementations (including StrongWolfeLineSearch and
BacktrackingLineSearch)
* DistributedVector abstract interface
*BUT* the core logic of VF-LBFGS and VF-OWLQN (based on VF-LBFGS) should be
implemented in Spark MLlib, for better optimization.
> Vector-free L-BFGS
> ------------------
>
> Key: SPARK-10078
> URL: https://issues.apache.org/jira/browse/SPARK-10078
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: Xiangrui Meng
> Assignee: Yanbo Liang
>
> This is to implement a scalable version of vector-free L-BFGS
> (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf).
> Design document:
> https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)