Xiangrui, I posted a note on my JIRA for MiniBatch KMeans about the same problem -- sampling running in O(n).
Can you elaborate on ways to get more efficient sampling? I think this will be important for a variety of stochastic algorithms.

RJ

On Tue, Aug 26, 2014 at 12:54 PM, Xiangrui Meng <men...@gmail.com> wrote:
> miniBatchFraction uses RDD.sample to get the mini-batch, and sample
> still needs to visit the elements one after another. So it is not
> efficient if the task is not computation heavy, and this is why
> setMiniBatchFraction is marked as experimental. If we can detect that
> the partition iterator is backed by an ArrayBuffer, maybe we can do a
> skip iterator to skip elements. -Xiangrui
>
> On Tue, Aug 26, 2014 at 8:15 AM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> > Hi, RJ
> >
> > https://github.com/avulanov/spark/blob/neuralnetwork/mllib/src/main/scala/org/apache/spark/mllib/classification/NeuralNetwork.scala
> >
> > Unit tests are in the same branch.
> >
> > Alexander
> >
> > From: RJ Nowling [mailto:rnowl...@gmail.com]
> > Sent: Tuesday, August 26, 2014 6:59 PM
> > To: Ulanov, Alexander
> > Cc: dev@spark.apache.org
> > Subject: Re: Gradient descent and runMiniBatchSGD
> >
> > Hi Alexander,
> >
> > Can you post a link to the code?
> >
> > RJ
> >
> > On Tue, Aug 26, 2014 at 6:53 AM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> > Hi,
> >
> > I've implemented a back propagation algorithm using the Gradient class and a
> > simple update using the Updater class. Then I run the algorithm with MLlib's
> > GradientDescent class. I have trouble scaling out this implementation.
> > I thought that if I partitioned my data across the workers, performance would
> > increase, because each worker would run a step of gradient descent on its
> > partition of the data. But this does not happen, and each worker seems to
> > process all the data (if miniBatchFraction == 1.0, as in MLlib's logistic
> > regression implementation). For me, this doesn't make sense, because then a
> > single worker would provide the same performance. Could someone elaborate on
> > this and correct me if I am wrong? How can I scale out the algorithm with
> > many workers?
> >
> > Best regards, Alexander
> >
> > --
> > em rnowl...@gmail.com
> > c 954.496.2314

--
em rnowl...@gmail.com
c 954.496.2314
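
As a rough illustration of the skip-iterator idea Xiangrui describes: when a partition is backed by an indexed collection such as an ArrayBuffer, a Bernoulli sample with fraction p can jump between selected indices by geometrically distributed gaps instead of testing every element, so the expected work is proportional to the number of sampled elements rather than the partition size. The skipSample helper below is only a sketch of that idea, not an existing Spark API:

import scala.collection.mutable.ArrayBuffer
import scala.util.Random

// Sketch only: Bernoulli sampling over an indexed, in-memory collection
// (e.g. an ArrayBuffer-backed partition). Instead of flipping a coin per
// element, draw the geometrically distributed gap to the next selected
// index and jump straight to it.
def skipSample[T](data: ArrayBuffer[T], fraction: Double,
                  rng: Random = new Random): Iterator[T] = {
  require(fraction > 0.0 && fraction <= 1.0, "fraction must be in (0, 1]")
  new Iterator[T] {
    // Number of skipped elements before the next selected one is
    // Geometric(fraction); inverse-CDF sampling gives the gap directly.
    private def gap(): Int =
      if (fraction == 1.0) 0
      else {
        val u = 1.0 - rng.nextDouble()                    // u in (0, 1]
        math.floor(math.log(u) / math.log1p(-fraction)).toInt
      }
    private var idx = gap()                               // first selected index
    def hasNext: Boolean = idx < data.length
    def next(): T = {
      val elem = data(idx)
      idx += 1 + gap()                                    // jump to the next selected index
      elem
    }
  }
}

// Example: expect roughly 10 of 1000 elements without 1000 coin flips.
val sampled = skipSample(ArrayBuffer.range(0, 1000), 0.01).toSeq

The same trick does not help when the partition is only reachable through a one-pass iterator, which is why Xiangrui frames it as detecting an ArrayBuffer-backed partition first.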
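
On Alexander's scaling question, the usual pattern for a distributed mini-batch step is that each worker samples and processes only its own partition, the per-partition gradient sums are combined, and the driver applies a single update. The sketch below assumes that pattern; miniBatchStep, the (features, label) pair representation, and the squared-loss gradient are hypothetical stand-ins, not MLlib's Gradient and Updater classes:

import org.apache.spark.rdd.RDD

// Minimal sketch of one distributed mini-batch SGD step. Every task samples
// rows from its own partition, partial gradient sums are combined with
// treeAggregate, and the driver applies a single weight update.
def miniBatchStep(
    data: RDD[(Array[Double], Double)],   // (features, label) pairs
    weights: Array[Double],
    miniBatchFraction: Double,
    stepSize: Double,
    seed: Long): Array[Double] = {

  val n = weights.length

  // Each task works on a sample of its own partition; even with
  // miniBatchFraction == 1.0 a worker processes only its partition,
  // never the whole dataset.
  val (gradSum, count) = data
    .sample(withReplacement = false, miniBatchFraction, seed)
    .treeAggregate((new Array[Double](n), 0L))(
      seqOp = { case ((sum, c), (x, y)) =>
        // gradient of squared loss for one example: (w.x - y) * x
        val pred = x.zip(weights).map { case (a, b) => a * b }.sum
        var i = 0
        while (i < n) { sum(i) += (pred - y) * x(i); i += 1 }
        (sum, c + 1L)
      },
      combOp = { case ((s1, c1), (s2, c2)) =>
        var i = 0
        while (i < n) { s1(i) += s2(i); i += 1 }
        (s1, c1 + c2)
      })

  // One step on the driver using the averaged mini-batch gradient.
  if (count == 0L) weights
  else weights.zip(gradSum).map { case (w, g) => w - stepSize * g / count }
}

The point relevant to the thread: the per-worker cost is driven by partition size (times the sampling fraction), so adding workers shrinks each partition, but the sampling pass itself still visits every element of the partition, which is the O(n) issue RJ and Xiangrui are discussing.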