It looks like SPARK-3250 was applied to the sample() call that GradientDescent uses, and that optimization should kick in for your miniBatchFraction <= 0.4. Based on your numbers, aggregation seems like the main issue, though I hesitate to optimize aggregation based on local tests on data that small.
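If you want to separate the two costs before optimizing, a rough harness like the one below might help. This is entirely my sketch -- `time` and `profileMiniBatch` are made-up names, and it assumes your data is an RDD[(Double, Vector)] of (label, features) pairs. It caches the sampled RDD and forces it with count(), so the second timing isolates treeAggregate from the sampling itself:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Crude wall-clock timer: runs `body`, prints elapsed seconds, returns result.
def time[T](label: String)(body: => T): T = {
  val t0 = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - t0) / 1e9}%.3f s")
  result
}

def profileMiniBatch(data: RDD[(Double, Vector)],
                     miniBatchFraction: Double,
                     seed: Long): Unit = {
  // Measure sampling alone; cache + count() materializes the sample here.
  val sampled = time("sample") {
    val s = data.sample(false, miniBatchFraction, seed).cache()  // withReplacement = false
    s.count()
    s
  }
  // A trivial aggregation over the cached sample, so this measures
  // treeAggregate's scheduling/shipping overhead rather than any gradient math.
  time("treeAggregate") {
    sampled.treeAggregate(0.0)(
      seqOp = (acc, _) => acc + 1.0,
      combOp = (a, b) => a + b)
  }
}

I'd run that with cluster-like partition counts too; single-partition local timings mostly measure fixed per-job overhead, which is why I'd be wary of tuning against them.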
The first thing I'd check for is unnecessary object creation, and then to profile in a cluster or larger-data setting.

On Wed, Apr 1, 2015 at 10:09 AM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> Sorry for bothering you again, but I think this is an important issue
> for the applicability of SGD in Spark MLlib. Could the Spark developers
> please comment on it?
>
> -----Original Message-----
> From: Ulanov, Alexander
> Sent: Monday, March 30, 2015 5:00 PM
> To: dev@spark.apache.org
> Subject: Stochastic gradient descent performance
>
> Hi,
>
> It seems to me that there is an overhead in the "runMiniBatchSGD" function
> of MLlib's "GradientDescent". In particular, "sample" and "treeAggregate"
> can take an order of magnitude longer than the actual gradient computation.
> For the mnist dataset of 60K instances with minibatch fraction = 0.001
> (i.e. 60 samples), it takes 0.15 s to sample and 0.3 s to aggregate, in
> local mode with 1 data partition on a Core i5 processor. The actual
> gradient computation takes 0.002 s. I searched through the Spark Jira and
> found a recent update for more efficient sampling (SPARK-3250) that is
> already included in the Spark codebase. Is there a way to reduce the
> sampling time and the local treeReduce by an order of magnitude?
>
> Best regards, Alexander
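For anyone following along, the structure being measured is roughly the following. This is a simplified paraphrase of the per-iteration body of GradientDescent.runMiniBatchSGD, not the exact MLlib source (miniBatchGradient and the 42 + iter seed are my inventions), with the reported timings marked where they would arise:

import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.optimization.Gradient
import org.apache.spark.rdd.RDD

def miniBatchGradient(data: RDD[(Double, Vector)],
                      gradient: Gradient,
                      weights: Vector,
                      miniBatchFraction: Double,
                      iter: Int): (BDV[Double], Double, Long) = {
  val bcWeights = data.context.broadcast(weights)
  val dim = weights.size
  // Overhead #1: sampling the minibatch (the 0.15 s reported above).
  data.sample(false, miniBatchFraction, 42 + iter)  // withReplacement = false
    // Overhead #2: task scheduling plus aggregation (the 0.3 s above).
    .treeAggregate((BDV.zeros[Double](dim), 0.0, 0L))(
      seqOp = { case ((gradSum, lossSum, count), (label, features)) =>
        // The gradient math itself -- 0.002 s for the whole minibatch here.
        // Note this overload allocates a fresh gradient vector per example;
        // MLlib's actual code passes a cumGradient accumulator to avoid that,
        // which is the kind of object creation I'd look at first.
        val (g, loss) = gradient.compute(features, label, bcWeights.value)
        gradSum += new BDV(g.toArray)
        (gradSum, lossSum + loss, count + 1L)
      },
      combOp = { case ((g1, l1, c1), (g2, l2, c2)) =>
        (g1 + g2, l1 + l2, c1 + c2)
      })
}

At 60 samples per minibatch the seqOp work is negligible, so nearly all the per-iteration time sits in the sample and treeAggregate machinery around it.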