Also, another idea: many algorithms that use sampling tend to do so multiple times. It may be beneficial to allow a transformation to a representation that is more efficient for multiple rounds of sampling.
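
To make the idea concrete, here is a rough sketch (assuming a plain RDD[Double] and sampling with replacement for simplicity; toSamplingFriendly and sampleRound are made-up names, not existing API). Each partition is materialized into an array once, and every later round of sampling just indexes into it:

import scala.util.Random
import org.apache.spark.rdd.RDD

// Materialize each partition into an array once, so later rounds of
// sampling can index into it instead of re-scanning the iterator.
def toSamplingFriendly(data: RDD[Double]): RDD[Array[Double]] =
  data.mapPartitions(it => Iterator.single(it.toArray)).cache()

// One round of mini-batch sampling: pick random indices within each
// partition's array, so the cost is O(batch size) rather than O(n).
def sampleRound(arrays: RDD[Array[Double]], fraction: Double, seed: Long): RDD[Double] =
  arrays.mapPartitionsWithIndex { (pid, it) =>
    val rng = new Random(seed ^ pid)
    it.flatMap { arr =>
      if (arr.isEmpty) Iterator.empty
      else {
        val k = math.max(1, (arr.length * fraction).toInt)
        Iterator.fill(k)(arr(rng.nextInt(arr.length)))
      }
    }
  }

The same cached, array-backed RDD could then serve every iteration of SGD or mini-batch k-means without paying the O(n) scan each time.
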
On Tue, Aug 26, 2014 at 4:36 PM, RJ Nowling <[email protected]> wrote:
> Xiangrui,
>
> I posted a note on my JIRA for MiniBatch KMeans about the same problem --
> sampling running in O(n).
>
> Can you elaborate on ways to get more efficient sampling? I think this
> will be important for a variety of stochastic algorithms.
>
> RJ
>
>
> On Tue, Aug 26, 2014 at 12:54 PM, Xiangrui Meng <[email protected]> wrote:
>
>> miniBatchFraction uses RDD.sample to get the mini-batch, and sample
>> still needs to visit the elements one after another. So it is not
>> efficient if the task is not computation-heavy, and this is why
>> setMiniBatchFraction is marked as experimental. If we can detect that
>> the partition iterator is backed by an ArrayBuffer, maybe we can do a
>> skip iterator to skip elements. -Xiangrui
>>
>> On Tue, Aug 26, 2014 at 8:15 AM, Ulanov, Alexander
>> <[email protected]> wrote:
>> > Hi, RJ
>> >
>> > https://github.com/avulanov/spark/blob/neuralnetwork/mllib/src/main/scala/org/apache/spark/mllib/classification/NeuralNetwork.scala
>> >
>> > Unit tests are in the same branch.
>> >
>> > Alexander
>> >
>> > From: RJ Nowling [mailto:[email protected]]
>> > Sent: Tuesday, August 26, 2014 6:59 PM
>> > To: Ulanov, Alexander
>> > Cc: [email protected]
>> > Subject: Re: Gradient descent and runMiniBatchSGD
>> >
>> > Hi Alexander,
>> >
>> > Can you post a link to the code?
>> >
>> > RJ
>> >
>> > On Tue, Aug 26, 2014 at 6:53 AM, Ulanov, Alexander <[email protected]> wrote:
>> > Hi,
>> >
>> > I've implemented the back-propagation algorithm using the Gradient class
>> > and a simple update using the Updater class. Then I ran the algorithm with
>> > MLlib's GradientDescent class. I have trouble scaling out this
>> > implementation. I thought that if I partitioned my data across the
>> > workers, performance would increase, because each worker would run a step
>> > of gradient descent on its own partition of the data. But this does not
>> > happen, and each worker seems to process all of the data (if
>> > miniBatchFraction == 1.0, as in MLlib's logistic regression
>> > implementation). For me, this doesn't make sense, because then a single
>> > Worker would provide the same performance. Could someone elaborate on
>> > this and correct me if I am wrong? How can I scale out the algorithm with
>> > many Workers?
>> >
>> > Best regards, Alexander
>> >
>> >
>> > --
>> > em [email protected]
>> > c 954.496.2314
>
>
> --
> em [email protected]
> c 954.496.2314

--
em [email protected]
c 954.496.2314
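
P.S. For reference, here is roughly what I picture for the skip iterator Xiangrui mentions above (again just a sketch, not existing API; skipSample is a made-up name, it assumes the partition data is exposed as an IndexedSeq such as an ArrayBuffer, and it samples with replacement for simplicity):

import scala.util.Random

// Jump directly to pre-drawn positions in the backing IndexedSeq instead
// of visiting every element; sorting keeps the access pattern sequential.
def skipSample[T](backing: IndexedSeq[T], fraction: Double, seed: Long): Iterator[T] = {
  val rng = new Random(seed)
  val k = (backing.length * fraction).toInt
  val positions = Array.fill(k)(rng.nextInt(backing.length)).sorted
  positions.iterator.map(backing(_))
}

This only pays off when the partition data really is array-backed, which is exactly the case Xiangrui describes detecting.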
