Note that we can do this in DataFrames and use Catalyst to push Sample down beneath Projection :)
On Mon, Apr 6, 2015 at 12:42 PM, Xiangrui Meng <men...@gmail.com> wrote:
> The gap sampling is triggered when the sampling probability is small
> and the directly underlying storage has constant-time lookups, in
> particular, ArrayBuffer. This is a very strict requirement. If an RDD is
> cached in memory, we use ArrayBuffer to store its elements, and
> rdd.sample will trigger gap sampling. However, if we call rdd2 =
> rdd.map(x => x), we can no longer tell whether the storage is backed
> by an ArrayBuffer, and hence gap sampling is not enabled. We could
> use Scala's drop(k) and let Scala decide whether this is an O(1)
> operation or an O(k) operation, but unfortunately, due to a Scala bug,
> this can become an O(k^2) operation, so we didn't use that approach.
> Please check the comments in the gap sampling PR.
>
> For SGD, I think we should either assume the input data is randomized
> or randomize the input data (and eat this one-time cost), then do
> mini-batches sequentially. The key is to balance the batch size against
> the communication cost of model updates.
>
> Best,
> Xiangrui
>
> On Mon, Apr 6, 2015 at 10:38 AM, Ulanov, Alexander
> <alexander.ula...@hp.com> wrote:
> > Batch size impacts convergence, so a bigger batch means more iterations.
> > There are some approaches to deal with this (such as
> > http://www.cs.cmu.edu/~muli/file/minibatch_sgd.pdf), but they need to be
> > implemented and tested.
> >
> > Nonetheless, could you share your thoughts on reducing this overhead in
> > Spark (or a possible workaround)? Sorry for repeating it, but I think
> > this is crucial for MLlib in Spark, because Spark is intended for larger
> > amounts of data. Machine learning with bigger data usually requires SGD
> > (vs. batch GD), SGD requires a lot of updates, and "Spark overhead"
> > times "many updates" equals an impractical time needed for learning.
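For readers unfamiliar with the technique being discussed: the idea behind gap sampling can be sketched in plain Scala. Instead of flipping a Bernoulli coin for every element, one draws the size of the next gap from a geometric distribution and jumps straight to the next accepted element, which is cheap when the storage supports O(1) positional access. This is only an illustrative sketch (hypothetical names, not Spark's actual GapSamplingIterator):

```scala
import scala.util.Random

object GapSamplingSketch {
  // Sample each element with probability p by jumping over
  // geometrically distributed gaps instead of testing every element.
  def gapSample[T](data: IndexedSeq[T], p: Double, rng: Random): Vector[T] = {
    val out = Vector.newBuilder[T]
    var i = 0
    while (i < data.length) {
      // Number of elements skipped before the next accepted one:
      // floor(log(u) / log(1 - p)) for u uniform in (0, 1).
      val gap = (math.log(rng.nextDouble()) / math.log(1.0 - p)).toInt
      i += gap
      if (i < data.length) { out += data(i); i += 1 }
    }
    out.result()
  }

  def main(args: Array[String]): Unit = {
    val data = (0 until 1000000).toVector
    val sample = gapSample(data, 0.01, new Random(42))
    println(sample.length) // roughly 10000 (about 1% of 1,000,000)
  }
}
```

The point Xiangrui makes is that the `data(i)` lookup (or equivalently a `drop(gap)`) must be O(1) for this to pay off, which is why the optimization is only enabled when the underlying storage is known to be an ArrayBuffer.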
> >
> > From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
> > Sent: Sunday, April 05, 2015 7:13 PM
> > To: Ulanov, Alexander
> > Cc: shiva...@eecs.berkeley.edu; Joseph Bradley; dev@spark.apache.org
> > Subject: Re: Stochastic gradient descent performance
> >
> > Yeah, a simple way to estimate the time for an iterative algorithm is
> > the number of iterations required * the time per iteration. The time per
> > iteration will depend on the batch size, the computation required, and
> > the fixed overheads I mentioned before. The number of iterations of
> > course depends on the convergence rate for the problem being solved.
> >
> > Thanks
> > Shivaram
> >
> > On Thu, Apr 2, 2015 at 2:19 PM, Ulanov, Alexander
> > <alexander.ula...@hp.com> wrote:
> > Hi Shivaram,
> >
> > It sounds really interesting! With this time we can estimate whether it
> > is worth running an iterative algorithm on Spark. For example, for SGD
> > on Imagenet (450K samples) we would spend 450K * 50 ms = 6.25 hours to
> > traverse all the data one example at a time, not counting the data
> > loading, computation, and update times. One may need to traverse all the
> > data a number of times to converge. Let's say this number is equal to
> > the batch size. So, we remain with 6.25 hours of overhead. Is that
> > reasonable?
> >
> > Best regards, Alexander
> >
> > From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
> > Sent: Thursday, April 02, 2015 1:26 PM
> > To: Joseph Bradley
> > Cc: Ulanov, Alexander; dev@spark.apache.org
> > Subject: Re: Stochastic gradient descent performance
> >
> > I haven't looked closely at the sampling issues, but regarding the
> > aggregation latency, there are fixed overheads (in local and distributed
> > mode) with the way aggregation is done in Spark. Launching a stage of
> > tasks, fetching outputs from the previous stage, etc.
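Shivaram's estimate (total time = iterations * time per iteration) can be written out with the numbers quoted in the thread. The 50 ms per-iteration overhead is the thread's assumed figure, not a measured constant:

```scala
object SgdOverheadEstimate {
  def main(args: Array[String]): Unit = {
    val numExamples = 450000       // Imagenet training-set size from the thread
    val overheadPerIterSec = 0.050 // assumed fixed per-iteration overhead (50 ms)

    // One pass with one example per iteration: iterations * time per iteration.
    val onePassHours = numExamples * overheadPerIterSec / 3600.0
    println(onePassHours) // 6.25

    // With mini-batches of size b and b passes over the data, the total
    // number of iterations (and hence the fixed overhead) is unchanged:
    val b = 64
    val totalIters = (numExamples.toDouble / b) * b
    println(totalIters * overheadPerIterSec / 3600.0) // still 6.25
  }
}
```

This makes Alexander's point concrete: the fixed per-iteration overhead dominates regardless of how the same number of gradient updates is grouped into batches.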
> > all have overhead, so I would say it's not efficient / recommended to
> > run stages where the computation is less than 500 ms or so. You could
> > increase your batch size based on this, and hopefully that will help.
> >
> > Regarding reducing these overheads by an order of magnitude: it is a
> > challenging problem given the architecture of Spark -- I have some ideas
> > for this, but they are very much at a research stage.
> >
> > Thanks
> > Shivaram
> >
> > On Thu, Apr 2, 2015 at 12:00 PM, Joseph Bradley <jos...@databricks.com> wrote:
> > When you say "It seems that instead of sample it is better to shuffle
> > data and then access it sequentially by mini-batches," are you sure that
> > holds true for a big dataset in a cluster? As far as implementing it, I
> > haven't looked carefully at GapSamplingIterator (in RandomSampler.scala)
> > myself, but it looks like it could be modified to be deterministic.
> >
> > Hopefully someone else can comment on aggregation in local mode. I'm not
> > sure how much effort has gone into optimizing for local mode.
> >
> > Joseph
> >
> > On Thu, Apr 2, 2015 at 11:33 AM, Ulanov, Alexander
> > <alexander.ula...@hp.com> wrote:
> >
> >> Hi Joseph,
> >>
> >> Thank you for the suggestion!
> >>
> >> It seems that instead of sample it is better to shuffle the data and
> >> then access it sequentially by mini-batches. Could you suggest how to
> >> implement it?
> >>
> >> With regard to aggregate (reduce), I am wondering why it works so
> >> slowly in local mode? Could you elaborate on this? I understand that in
> >> cluster mode the network speed will kick in, and then one can blame it.
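The "shuffle once, then read sequentially" idea under discussion can be sketched in plain Scala (hypothetical names; in Spark one would instead shuffle the RDD once, e.g. by sorting on a random key, cache it, and iterate over partitions):

```scala
import scala.util.Random

object ShuffleOnceMiniBatches {
  def main(args: Array[String]): Unit = {
    val rng = new Random(7)
    val data = (0 until 60000).toVector

    // Pay the randomization cost once up front...
    val shuffled = rng.shuffle(data)

    // ...then read mini-batches sequentially instead of calling
    // sample() on every SGD step.
    val batchSize = 60
    val batches = shuffled.grouped(batchSize).toVector

    println(batches.length)      // 1000 mini-batches
    println(batches.head.length) // 60
  }
}
```

Joseph's caveat still applies: sequential reads of a pre-shuffled dataset are not statistically identical to independent random samples per step, and whether the trade-off holds for a large clustered dataset is exactly the open question in this thread.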
> >>
> >> Best regards, Alexander
> >>
> >> From: Joseph Bradley [mailto:jos...@databricks.com]
> >> Sent: Thursday, April 02, 2015 10:51 AM
> >> To: Ulanov, Alexander
> >> Cc: dev@spark.apache.org
> >> Subject: Re: Stochastic gradient descent performance
> >>
> >> It looks like SPARK-3250 was applied to the sample() which
> >> GradientDescent uses, and that should kick in for your
> >> minibatchFraction <= 0.4. Based on your numbers, aggregation seems like
> >> the main issue, though I hesitate to optimize aggregation based on
> >> local tests for data sizes that small.
> >>
> >> The first thing I'd check for is unnecessary object creation; I'd also
> >> profile in a cluster or a larger-data setting.
> >>
> >> On Wed, Apr 1, 2015 at 10:09 AM, Ulanov, Alexander
> >> <alexander.ula...@hp.com> wrote:
> >>
> >> Sorry for bothering you again, but I think this is an important issue
> >> for the applicability of SGD in Spark MLlib. Could the Spark developers
> >> please comment on it?
> >>
> >> -----Original Message-----
> >> From: Ulanov, Alexander
> >> Sent: Monday, March 30, 2015 5:00 PM
> >> To: dev@spark.apache.org
> >> Subject: Stochastic gradient descent performance
> >>
> >> Hi,
> >>
> >> It seems to me that there is an overhead in the "runMiniBatchSGD"
> >> function of MLlib's "GradientDescent". In particular, "sample" and
> >> "treeAggregate" might take time that is an order of magnitude greater
> >> than the actual gradient computation. For the mnist dataset of 60K
> >> instances, with minibatch size = 0.001 (i.e. 60 samples), it takes
> >> 0.15 s to sample and 0.3 s to aggregate in local mode with 1 data
> >> partition on a Core i5 processor. The actual gradient computation takes
> >> 0.002 s.
> >> I searched through the Spark Jira and found that there was recently an
> >> update for more efficient sampling (SPARK-3250) that is already
> >> included in the Spark codebase. Is there a way to reduce the sampling
> >> time and the local treeReduce by an order of magnitude?
> >>
> >> Best regards, Alexander
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: dev-h...@spark.apache.org
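Measurements like the ones quoted above are easy to reproduce with a small timing helper. The harness below is a hypothetical sketch: the timed bodies are local stand-ins, whereas in the real setting one would time `rdd.sample(false, miniBatchFraction)` and the `treeAggregate` gradient sum inside runMiniBatchSGD:

```scala
object StepTimer {
  // Run a block and return (result, elapsed seconds).
  def timed[A](body: => A): (A, Double) = {
    val t0 = System.nanoTime()
    val result = body
    val elapsedSec = (System.nanoTime() - t0) / 1e9
    (result, elapsedSec)
  }

  def main(args: Array[String]): Unit = {
    val data = Array.fill(60000)(scala.util.Random.nextDouble())

    // Local stand-ins for the sample and aggregate steps from the thread.
    val (sampled, tSample) =
      timed { data.filter(_ => scala.util.Random.nextDouble() < 0.001) }
    val (sum, tAgg) = timed { data.sum }

    println(f"sample: ${sampled.length} elements in $tSample%.4f s")
    println(f"aggregate: sum=$sum%.2f in $tAgg%.4f s")
  }
}
```

Comparing the two elapsed times against the cost of the gradient computation itself is what surfaced the order-of-magnitude gap reported at the start of the thread.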