It looks like SPARK-3250 was applied to the sample() method that
GradientDescent uses, and it should kick in for any minibatchFraction <= 0.4,
which covers your 0.001.  Based on your numbers, aggregation seems like the
main issue, though I hesitate to optimize aggregation based on local tests
with data sizes that small.
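
For reference, my understanding is that SPARK-3250 switches low-fraction
Bernoulli sampling to "gap sampling": instead of one random draw per element,
you draw how many elements to skip before the next hit.  Below is a minimal
stand-alone sketch of that idea in plain Python.  It is not Spark's code; the
function names are made up for illustration, and it assumes a list with random
access rather than an iterator.

```python
import math
import random


def bernoulli_sample(items, fraction, rng):
    """Baseline: one random draw per element, so cost is O(len(items))."""
    return [x for x in items if rng.random() < fraction]


def gap_sample(items, fraction, rng):
    """Gap sampling sketch (hypothetical helper, not Spark's implementation).

    Draws the number of elements to skip from a geometric distribution,
    so the number of random draws is about fraction * len(items).
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    log_q = math.log(1.0 - fraction)  # negative
    picked = []
    i = -1
    n = len(items)
    while True:
        u = 1.0 - rng.random()  # uniform in (0, 1], keeps log(u) finite
        # Number of rejected elements before the next accepted one.
        gap = int(math.log(u) / log_q)
        i += gap + 1
        if i >= n:
            break
        picked.append(items[i])
    return picked


if __name__ == "__main__":
    rng = random.Random(42)
    data = list(range(60000))  # same size as the MNIST training set
    batch = gap_sample(data, 0.001, rng)
    print(len(batch))  # around 60 on average, varies with the seed
```

With a fraction of 0.001 on 60K elements this makes on the order of 60 random
draws instead of 60,000, which is why sampling itself should not dominate once
that code path is active.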

The first things I'd do are check for unnecessary object creation and
profile on a cluster or with a larger data set.

On Wed, Apr 1, 2015 at 10:09 AM, Ulanov, Alexander <alexander.ula...@hp.com>
wrote:

> Sorry for bothering you again, but I think this is an important issue
> for the applicability of SGD in Spark MLlib. Could the Spark developers
> please comment on it?
>
> -----Original Message-----
> From: Ulanov, Alexander
> Sent: Monday, March 30, 2015 5:00 PM
> To: dev@spark.apache.org
> Subject: Stochastic gradient descent performance
>
> Hi,
>
> It seems to me that there is overhead in the "runMiniBatchSGD" function of
> MLlib's "GradientDescent". In particular, "sample" and "treeAggregate"
> might take an order of magnitude more time than the actual gradient
> computation. For example, for the MNIST dataset of 60K instances with a
> minibatch fraction of 0.001 (i.e. 60 samples), it takes 0.15 s to sample
> and 0.3 s to aggregate in local mode with 1 data partition on a Core i5
> processor. The actual gradient computation takes 0.002 s. I searched the
> Spark JIRA and found a recent update for more efficient sampling
> (SPARK-3250) that is already included in the Spark codebase. Is there a way
> to reduce the sampling time and the local treeReduce time by an order of
> magnitude?
>
> Best regards, Alexander
