[
https://issues.apache.org/jira/browse/SPARK-14880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ahmed Mahran updated SPARK-14880:
---------------------------------
External issue URL:
https://github.com/mashin-io/rich-spark/blob/master/main/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala
(was:
https://github.com/mashin-io/rich-spark/blob/master/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala)
> Parallel Gradient Descent with less map-reduce shuffle overhead
> ---------------------------------------------------------------
>
> Key: SPARK-14880
> URL: https://issues.apache.org/jira/browse/SPARK-14880
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Ahmed Mahran
> Labels: performance
>
> The current implementation of (Stochastic) Gradient Descent performs one
> map-reduce shuffle per iteration. Moreover, as the sampling fraction gets
> smaller, the algorithm becomes shuffle-bound rather than CPU-bound.
> {code}
> (1 to numIterations or convergence) {
>   rdd
>     .sample(fraction)
>     .map(Gradient)
>     .reduce(Update)
> }
> {code}
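> As a concrete illustration, here is a minimal runnable sketch of that
> per-iteration pattern. The squared-loss gradient helper and all names here
> are hypothetical stand-ins for mllib's Gradient/Updater classes; the actual
> implementation is GradientDescent.runMiniBatchSGD, which likewise samples
> and then aggregates once per iteration:
> {code}
> import breeze.linalg.DenseVector
> import org.apache.spark.rdd.RDD
>
> object SGDSketches {
>
>   // Hypothetical squared-loss gradient for one example (label y, features x);
>   // stands in for mllib's Gradient/Updater pair.
>   def gradient(w: DenseVector[Double], x: DenseVector[Double], y: Double): DenseVector[Double] =
>     x * ((w dot x) - y)
>
>   def runPerIteration(
>       data: RDD[(Double, DenseVector[Double])],
>       numIterations: Int,
>       fraction: Double,
>       stepSize: Double,
>       dim: Int): DenseVector[Double] = {
>     val w = DenseVector.zeros[Double](dim)
>     for (i <- 1 to numIterations) {
>       val currentW = w.copy  // shipped with the closure; mllib broadcasts it instead
>       // Each iteration launches a full job: sample, per-example gradients,
>       // and a cluster-wide aggregation back to the driver.
>       val (gradSum, count) = data
>         .sample(withReplacement = false, fraction, 42L + i)
>         .treeAggregate((DenseVector.zeros[Double](dim), 0L))(
>           seqOp = { case ((g, c), (y, x)) => (g + gradient(currentW, x, y), c + 1L) },
>           combOp = { case ((g1, c1), (g2, c2)) => (g1 + g2, c1 + c2) })
>       if (count > 0) w -= gradSum * (stepSize / count.toDouble)
>     }
>     w
>   }
> }
> {code}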
> A more performant variation requires only one map-reduce pass regardless of
> the number of iterations. A local mini-batch SGD could be run on each
> partition, and the results then averaged. This is based on Zinkevich et al.
> (Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J. Smola. "Parallelized
> Stochastic Gradient Descent." In Advances in Neural Information Processing
> Systems, 2010,
> http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf).
> {code}
> rdd
>   .shuffle()
>   .mapPartitions((1 to numIterations or convergence) {
>     iter.sample(fraction).map(Gradient).reduce(Update)
>   })
>   .reduce(Average)
> {code}
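> Here is a corresponding sketch of the single-shuffle variant, written as
> another method on the hypothetical SGDSketches object above so it can reuse
> the same gradient helper. repartition plays the role of shuffle(), the inner
> loop samples from an in-memory array so no Spark job runs per inner
> iteration, and the per-partition solutions are averaged at the end:
> {code}
> def runSingleShuffle(
>     data: RDD[(Double, DenseVector[Double])],
>     numIterations: Int,
>     fraction: Double,
>     stepSize: Double,
>     dim: Int): DenseVector[Double] = {
>   val (wSum, parts) = data
>     .repartition(data.partitions.length)  // the one up-front shuffle
>     .mapPartitionsWithIndex { (pid, iter) =>
>       val examples = iter.toArray         // assumes a partition fits in memory
>       val rnd = new scala.util.Random(42L + pid)
>       val local = DenseVector.zeros[Double](dim)
>       // Local mini-batch SGD: no cluster communication inside this loop.
>       for (_ <- 1 to numIterations) {
>         val batch = examples.filter(_ => rnd.nextDouble() < fraction)
>         if (batch.nonEmpty) {
>           val g = batch.map { case (y, x) => gradient(local, x, y) }.reduce(_ + _)
>           local -= g * (stepSize / batch.length.toDouble)
>         }
>       }
>       Iterator((local, 1L))
>     }
>     .reduce { case ((w1, c1), (w2, c2)) => (w1 + w2, c1 + c2) }  // Average
>   wSum / parts.toDouble
> }
> {code}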
> A higher-level iteration could enclose the above variation, shuffling the
> data before the local mini-batches and feeding the average weights from the
> last iteration back in. This allows more variability in the sampling of the
> mini-batches, with the possibility of covering the whole dataset. Here is a
> Spark-based implementation:
> https://github.com/mashin-io/rich-spark/blob/master/main/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala
> {code}
> (1 to numIterations1 or convergence) {
>   rdd
>     .shuffle()
>     .mapPartitions((1 to numIterations2 or convergence) {
>       iter.sample(fraction).map(Gradient).reduce(Update)
>     })
>     .reduce(Average)
> }
> {code}
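> And a sketch of the enclosing loop under the same assumptions (again a
> hypothetical method on SGDSketches, not the linked ParallelSGD.scala itself),
> broadcasting each round's average back as the starting point for the next:
> {code}
> def runIterative(
>     data: RDD[(Double, DenseVector[Double])],
>     numIterations1: Int,
>     numIterations2: Int,
>     fraction: Double,
>     stepSize: Double,
>     dim: Int): DenseVector[Double] = {
>   var w = DenseVector.zeros[Double](dim)
>   for (outer <- 1 to numIterations1) {
>     val wBc = data.context.broadcast(w)     // feed the average back to all partitions
>     val (wSum, parts) = data
>       .repartition(data.partitions.length)  // fresh shuffle per outer round
>       .mapPartitionsWithIndex { (pid, iter) =>
>         val examples = iter.toArray
>         val rnd = new scala.util.Random(1000L * outer + pid)
>         val local = wBc.value.copy          // start from the fed-back average
>         for (_ <- 1 to numIterations2) {
>           val batch = examples.filter(_ => rnd.nextDouble() < fraction)
>           if (batch.nonEmpty) {
>             val g = batch.map { case (y, x) => gradient(local, x, y) }.reduce(_ + _)
>             local -= g * (stepSize / batch.length.toDouble)
>           }
>         }
>         Iterator((local, 1L))
>       }
>       .reduce { case ((w1, c1), (w2, c2)) => (w1 + w2, c1 + c2) }
>     w = wSum / parts.toDouble
>   }
>   w
> }
> {code}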
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)