RE: Stochastic gradient descent performance

Ulanov, Alexander Mon, 06 Apr 2015 10:41:00 -0700

Batch size impacts convergence, so bigger batch means more iterations. There 
are some approaches to deal with it (such as 
http://www.cs.cmu.edu/~muli/file/minibatch_sgd.pdf), but they need to be 
implemented and tested.

Nonetheless, could you share your thoughts regarding reducing this overhead in 
Spark (or probably a workaround)? Sorry for repeating it, but I think this is 
crucial for MLlib in Spark, because Spark is intended for bigger amounts of 
data. Machine learning with bigger data usually requires SGD (vs batch GD), SGD 
requires a lot of updates, and “Spark overhead” times “many updates” equals 
impractical time needed for learning.

From: Shivaram Venkataraman [mailto:[email protected]]
Sent: Sunday, April 05, 2015 7:13 PM
To: Ulanov, Alexander
Cc: [email protected]; Joseph Bradley; [email protected]
Subject: Re: Stochastic gradient descent performance

Yeah, a simple way to estimate the time for an iterative algorithm is the number 
of iterations required * time per iteration. The time per iteration will depend 
on the batch size, computation required and the fixed overheads I mentioned 
before. The number of iterations of course depends on the convergence rate for 
the problem being solved.
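
As a rough illustration of this estimate, here is a minimal sketch. The function names are hypothetical, and the model assumes that computation grows linearly with batch size while the fixed overhead stays constant per iteration. The sample numbers in the usage note come from the first message in this thread (0.15 s to sample, 0.3 s to aggregate, 0.002 s of gradient computation for 60 samples).

```python
def estimated_total_seconds(num_iterations, batch_size,
                            fixed_overhead_s, compute_per_sample_s):
    """Total time = iterations * (fixed overhead + compute for one batch)."""
    time_per_iteration = fixed_overhead_s + batch_size * compute_per_sample_s
    return num_iterations * time_per_iteration


def overhead_fraction(batch_size, fixed_overhead_s, compute_per_sample_s):
    """Share of each iteration spent on fixed overhead rather than computation."""
    compute_s = batch_size * compute_per_sample_s
    return fixed_overhead_s / (fixed_overhead_s + compute_s)
```

For example, `overhead_fraction(60, 0.45, 0.002 / 60)` is above 0.99, i.e. with a 60-sample batch almost all of each iteration would be overhead under these assumptions.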

Thanks
Shivaram

On Thu, Apr 2, 2015 at 2:19 PM, Ulanov, Alexander 
<[email protected]> wrote:
Hi Shivaram,

It sounds really interesting! With this figure we can estimate whether it is 
worth running an iterative algorithm on Spark. For example, for SGD on 
Imagenet (450K samples) we will spend 450K*500ms=62.5 hours of overhead to 
traverse all data one example at a time, not counting the data loading, 
computation and update times. One may need to traverse all data a number of 
times to converge. Let’s say this number is equal to the batch size; then the 
total number of updates is still 450K. So, we are left with 62.5 hours of 
overhead. Is it reasonable?
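
The arithmetic can be checked with a short sketch. It assumes a fixed cost of 0.5 s per iteration (the roughly 500 ms figure mentioned in this thread) and counts only that overhead. The function name and the batch size of 100 are hypothetical.

```python
NUM_SAMPLES = 450_000   # dataset size quoted in the message
OVERHEAD_S = 0.5        # assumed fixed cost per iteration, in seconds


def overhead_hours(num_samples, batch_size, passes, overhead_s):
    """Hours of fixed overhead: one iteration per mini-batch, per pass."""
    iterations_per_pass = num_samples / batch_size
    return iterations_per_pass * passes * overhead_s / 3600.0


# One pass over the data, one example per update.
single_example = overhead_hours(NUM_SAMPLES, 1, 1, OVERHEAD_S)

# Batch size 100 with 100 passes: (450K / 100) * 100 = 450K updates again.
batched = overhead_hours(NUM_SAMPLES, 100, 100, OVERHEAD_S)
```

Both values come out to 62.5 hours. When the number of passes equals the batch size, the total number of updates stays at 450K whatever batch size is chosen.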

Best regards, Alexander

From: Shivaram Venkataraman [mailto:[email protected]]
Sent: Thursday, April 02, 2015 1:26 PM
To: Joseph Bradley
Cc: Ulanov, Alexander; [email protected]
Subject: Re: Stochastic gradient descent performance

I haven't looked closely at the sampling issues, but regarding the aggregation 
latency, there are fixed overheads (in local and distributed mode) with the way 
aggregation is done in Spark. Launching a stage of tasks, fetching outputs from 
the previous stage etc. all have overhead, so I would say it's not efficient / 
recommended to run stages where computation is less than 500ms or so. You could 
increase your batch size based on this and hopefully that will help.
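
To make the batch-size suggestion concrete, here is a minimal sketch that picks the smallest batch whose computation reaches a target time. It assumes computation scales linearly with batch size, and it takes the per-sample cost from the figures in the first message of this thread (0.002 s for 60 samples). The helper name is hypothetical.

```python
import math


def min_batch_size(target_compute_s, compute_per_sample_s):
    """Smallest batch size whose compute time is at least target_compute_s."""
    return math.ceil(target_compute_s / compute_per_sample_s)


# Per-sample gradient cost, assuming 0.002 s for a 60-sample batch.
per_sample_s = 0.002 / 60

# Batch size needed for roughly 500 ms of computation per stage.
batch = min_batch_size(0.5, per_sample_s)
```

Under these assumptions the result is about 15,000 samples, which is roughly a quarter of a 60K-instance dataset per iteration.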

Reducing these overheads by an order of magnitude is a challenging problem 
given the architecture in Spark -- I have some ideas for this, but they 
are very much at a research stage.

Thanks
Shivaram

On Thu, Apr 2, 2015 at 12:00 PM, Joseph Bradley 
<[email protected]> wrote:
When you say "It seems that instead of sample it is better to shuffle data
and then access it sequentially by mini-batches," are you sure that holds
true for a big dataset in a cluster?  As far as implementing it, I haven't
looked carefully at GapSamplingIterator (in RandomSampler.scala) myself,
but that looks like it could be modified to be deterministic.
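
For reference, the shuffle-then-iterate idea can be sketched on a single machine as below. This is plain Python rather than Spark code and does not use GapSamplingIterator. The function name and the seed parameter are hypothetical, and the sketch says nothing about whether the approach holds up for a big dataset in a cluster.

```python
import random


def shuffled_minibatches(data, batch_size, seed=0):
    """Shuffle once, then yield consecutive slices of the shuffled order.

    Every example appears exactly once per pass, and a fixed seed makes
    the order deterministic.
    """
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]
```

Calling `list(shuffled_minibatches(list(range(10)), 3))` yields four batches of sizes 3, 3, 3 and 1 that together cover all ten items.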

Hopefully someone else can comment on aggregation in local mode.  I'm not
sure how much effort has gone into optimizing for local mode.

Joseph

On Thu, Apr 2, 2015 at 11:33 AM, Ulanov, Alexander 
<[email protected]> wrote:

>  Hi Joseph,
>
>
>
> Thank you for the suggestion!
>
> It seems that instead of sample it is better to shuffle data and then
> access it sequentially by mini-batches. Could you suggest how to implement
> it?
>
>
>
> With regard to aggregate (reduce), I am wondering why it is so slow in
> local mode. Could you elaborate on this? I do understand that in cluster
> mode the network speed will kick in and then one can blame it.
>
>
>
> Best regards, Alexander
>
>
>
> *From:* Joseph Bradley [mailto:[email protected]]
> *Sent:* Thursday, April 02, 2015 10:51 AM
> *To:* Ulanov, Alexander
> *Cc:* [email protected]
> *Subject:* Re: Stochastic gradient descent performance
>
>
>
> It looks like SPARK-3250 was applied to the sample() which GradientDescent
> uses, and that should kick in for your miniBatchFraction <= 0.4.  Based on
> your numbers, aggregation seems like the main issue, though I hesitate to
> optimize aggregation based on local tests for data sizes that small.
>
>
>
> The first thing I'd check for is unnecessary object creation, and to
> profile in a cluster or larger data setting.
>
>
>
> On Wed, Apr 1, 2015 at 10:09 AM, Ulanov, Alexander <[email protected]> wrote:
>
> Sorry for bothering you again, but I think that it is an important issue
> for the applicability of SGD in Spark MLlib. Could Spark developers please
> comment on it?
>
>
> -----Original Message-----
> From: Ulanov, Alexander
> Sent: Monday, March 30, 2015 5:00 PM
> To: [email protected]
> Subject: Stochastic gradient descent performance
>
> Hi,
>
> It seems to me that there is an overhead in the "runMiniBatchSGD" function of
> MLlib's "GradientDescent". In particular, "sample" and "treeAggregate"
> might take time that is an order of magnitude greater than the actual gradient
> computation. For example, for the mnist dataset of 60K instances with minibatch
> fraction = 0.001 (i.e. 60 samples), it takes 0.15 s to sample and 0.3 s to
> aggregate in local mode with 1 data partition on a Core i5 processor. The
> actual gradient computation takes 0.002 s. I searched through Spark Jira
> and found that there was recently an update for more efficient sampling
> (SPARK-3250) that is already included in the Spark codebase. Is there a way to
> reduce the sampling time and local treeReduce by an order of magnitude?
>
> Best regards, Alexander
>
