Note that we can do this in DataFrames and use Catalyst to push Sample down
beneath Projection :)
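
For illustration, a rough sketch of what that could look like at the
DataFrame level (hypothetical example: `labeledPoints` is an assumed
RDD[LabeledPoint], and the exact plan Catalyst produces depends on the
Spark version):

  import org.apache.spark.sql.SQLContext
  import org.apache.spark.mllib.regression.LabeledPoint

  val sqlContext = new SQLContext(sc)

  // Project first, then sample; the optimizer is free to reorder the
  // Sample and Project operators when it builds the physical plan.
  val df = sqlContext.createDataFrame(labeledPoints)
  val miniBatch = df.select("features", "label").sample(false, 0.001)
  // miniBatch.explain(true) would show where Sample ended up in the plan.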

On Mon, Apr 6, 2015 at 12:42 PM, Xiangrui Meng <men...@gmail.com> wrote:

> The gap sampling is triggered when the sampling probability is small
> and the underlying storage supports constant-time lookups, in
> particular ArrayBuffer. This is a very strict requirement. If the RDD
> is cached in memory, we use an ArrayBuffer to store its elements and
> rdd.sample will trigger gap sampling. However, if we call rdd2 =
> rdd.map(x => x), we can no longer tell whether the storage is backed
> by an ArrayBuffer, and hence gap sampling is not enabled. Ideally we
> would use Scala's drop(k) and let Scala decide whether this is an O(1)
> or an O(k) operation. Unfortunately, due to a Scala bug, that could
> become an O(k^2) operation, so we didn't use this approach. Please
> check the comments in the gap sampling PR.
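>
> For anyone who wants to see the effect from user code, here is a minimal
> timing sketch (assumptions: a local SparkContext `sc`; whether the gap
> sampling fast path is actually visible depends on the Spark version and
> on how the cached blocks are iterated):
>
>   import org.apache.spark.storage.StorageLevel
>
>   val data = sc.parallelize(1 to 1000000, 4).persist(StorageLevel.MEMORY_ONLY)
>   data.count()  // materialize the cache
>
>   def timeMs[T](body: => T): Long = {
>     val t0 = System.nanoTime(); body; (System.nanoTime() - t0) / 1000000
>   }
>
>   // Small fraction over cached storage: the gap sampler can skip ahead.
>   println("cached: " + timeMs(data.sample(false, 0.001).count()) + " ms")
>
>   // After an identity map the sampler only sees an opaque iterator and
>   // has to touch every element.
>   println("mapped: " + timeMs(data.map(x => x).sample(false, 0.001).count()) + " ms")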
>
> For SGD, I think we should either assume the input data is randomized
> or randomize the input data once (and eat that one-time cost), then do
> mini-batches sequentially. The key is to balance the batch size
> against the communication cost of the model updates.
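>
> A rough sketch of that approach (not MLlib code; `data` is an assumed
> RDD[LabeledPoint], and the modulo-based batch split is just one possible
> choice):
>
>   import scala.util.Random
>   import org.apache.spark.mllib.regression.LabeledPoint
>   import org.apache.spark.rdd.RDD
>
>   def sequentialMiniBatches(data: RDD[LabeledPoint],
>                             numBatches: Int,
>                             seed: Long): Seq[RDD[LabeledPoint]] = {
>     // One-time cost: shuffle the examples globally with per-partition RNGs.
>     val shuffled = data.mapPartitionsWithIndex { case (pid, it) =>
>       val rng = new Random(seed + pid)
>       it.map(x => (rng.nextDouble(), x))
>     }.sortByKey().values.zipWithIndex().cache()
>
>     // Deterministic, non-overlapping mini-batches by index modulo numBatches.
>     (0 until numBatches).map { b =>
>       shuffled.filter { case (_, i) => i % numBatches == b }.map(_._1)
>     }
>   }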
>
> Best,
> Xiangrui
>
> On Mon, Apr 6, 2015 at 10:38 AM, Ulanov, Alexander
> <alexander.ula...@hp.com> wrote:
> > Batch size impacts convergence, so a bigger batch means more iterations.
> > There are some approaches to deal with this (such as
> > http://www.cs.cmu.edu/~muli/file/minibatch_sgd.pdf), but they need to be
> > implemented and tested.
> >
> > Nonetheless, could you share your thoughts on reducing this overhead in
> > Spark (or a possible workaround)? Sorry for repeating it, but I think
> > this is crucial for MLlib in Spark, because Spark is intended for larger
> > amounts of data. Machine learning with bigger data usually requires SGD
> > (rather than batch GD), SGD requires a lot of updates, and “Spark
> > overhead” times “many updates” adds up to an impractical amount of time
> > for learning.
> >
> >
> > From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
> > Sent: Sunday, April 05, 2015 7:13 PM
> > To: Ulanov, Alexander
> > Cc: shiva...@eecs.berkeley.edu; Joseph Bradley; dev@spark.apache.org
> > Subject: Re: Stochastic gradient descent performance
> >
> > Yeah, a simple way to estimate the time for an iterative algorithm is
> > the number of iterations required * time per iteration. The time per
> > iteration will depend on the batch size, the computation required, and
> > the fixed overheads I mentioned before. The number of iterations of
> > course depends on the convergence rate for the problem being solved.
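> >
> > As a back-of-the-envelope sketch (the numbers are placeholders taken
> > from the figures quoted further down in this thread, not a benchmark):
> >
> >   // total time ~= iterations * (fixed stage overhead + compute per iteration)
> >   def estimateSeconds(iterations: Long,
> >                       stageOverheadSec: Double,
> >                       computePerIterSec: Double): Double =
> >     iterations * (stageOverheadSec + computePerIterSec)
> >
> >   // e.g. 450,000 single-example iterations with ~0.5 s of overhead each:
> >   val hours = estimateSeconds(450000L, 0.5, 0.002) / 3600.0  // ~62.75 hours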
> >
> > Thanks
> > Shivaram
> >
> > On Thu, Apr 2, 2015 at 2:19 PM, Ulanov, Alexander
> > <alexander.ula...@hp.com> wrote:
> > Hi Shivaram,
> >
> > It sounds really interesting! With this estimate we can decide whether
> > it is worth running an iterative algorithm on Spark at all. For example,
> > for SGD on ImageNet (450K samples) we would spend 450K * 500ms = 62.5
> > hours just to traverse all the data one example at a time, not counting
> > the data loading, computation, and update times. One may need to
> > traverse all the data a number of times to converge; let's say this
> > number of passes is equal to the batch size, so the total number of
> > updates, and hence the 62.5 hours of overhead, stays the same. Is that
> > reasonable?
> >
> > Best regards, Alexander
> >
> > From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
> > Sent: Thursday, April 02, 2015 1:26 PM
> > To: Joseph Bradley
> > Cc: Ulanov, Alexander; dev@spark.apache.org
> > Subject: Re: Stochastic gradient descent performance
> >
> > I haven't looked closely at the sampling issues, but regarding the
> > aggregation latency, there are fixed overheads (in local and distributed
> > mode) with the way aggregation is done in Spark. Launching a stage of
> > tasks, fetching outputs from the previous stage, etc. all have overhead,
> > so I would say it's not efficient / recommended to run stages where the
> > computation is less than 500ms or so. You could increase your batch size
> > based on this, and hopefully that will help.
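> >
> > For concreteness, a rough sketch of that calculation using the numbers
> > from the original post (0.002 s of gradient compute for a 60-example
> > mini-batch; the 500 ms figure is the rule of thumb above):
> >
> >   val perExampleSec = 0.002 / 60                 // ~33 microseconds per example
> >   val minBatchSize  = (0.5 / perExampleSec).ceil // ~15,000 examples for ~500 ms of compute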
> >
> > Regarding reducing these overheads by an order of magnitude: that is a
> > challenging problem given Spark's architecture -- I have some ideas for
> > this, but they are very much at a research stage.
> >
> > Thanks
> > Shivaram
> >
> > On Thu, Apr 2, 2015 at 12:00 PM, Joseph Bradley
> > <jos...@databricks.com> wrote:
> > When you say "It seems that instead of sample it is better to shuffle
> > data and then access it sequentially by mini-batches," are you sure that
> > holds true for a big dataset in a cluster?  As for implementing it, I
> > haven't looked carefully at GapSamplingIterator (in RandomSampler.scala)
> > myself, but it looks like it could be modified to be deterministic.
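> >
> > As a toy illustration of the idea only (this is not the MLlib class; the
> > real GapSamplingIterator draws random gaps and relies on a cheap drop on
> > the underlying storage, which this sketch does not attempt):
> >
> >   // Emit every `gap`-th element of an iterator, deterministically.
> >   class FixedGapIterator[T](underlying: Iterator[T], gap: Int) extends Iterator[T] {
> >     require(gap >= 1, "gap must be at least 1")
> >     def hasNext: Boolean = underlying.hasNext
> >     def next(): T = {
> >       val elem = underlying.next()                   // keep the current element
> >       var skipped = 1
> >       while (skipped < gap && underlying.hasNext) {  // discard the next gap - 1
> >         underlying.next(); skipped += 1
> >       }
> >       elem
> >     }
> >   }
> >
> >   // e.g. keep roughly every 1000th element of each partition:
> >   // rdd.mapPartitions(it => new FixedGapIterator(it, 1000))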
> >
> > Hopefully someone else can comment on aggregation in local mode.  I'm not
> > sure how much effort has gone into optimizing for local mode.
> >
> > Joseph
> >
> > On Thu, Apr 2, 2015 at 11:33 AM, Ulanov, Alexander <
> alexander.ula...@hp.com<mailto:alexander.ula...@hp.com>>
> > wrote:
> >
> >>  Hi Joseph,
> >>
> >>
> >>
> >> Thank you for the suggestion!
> >>
> >> It seems that instead of sample it is better to shuffle the data and
> >> then access it sequentially by mini-batches. Could you suggest how to
> >> implement it?
> >>
> >>
> >>
> >> With regard to aggregate (reduce), I am wondering why it is so slow in
> >> local mode. Could you elaborate on this? I do understand that in
> >> cluster mode the network speed kicks in and then one could blame that.
> >>
> >>
> >>
> >> Best regards, Alexander
> >>
> >>
> >>
> >> From: Joseph Bradley [mailto:jos...@databricks.com]
> >> Sent: Thursday, April 02, 2015 10:51 AM
> >> To: Ulanov, Alexander
> >> Cc: dev@spark.apache.org
> >> Subject: Re: Stochastic gradient descent performance
> >>
> >>
> >>
> >> It looks like SPARK-3250 was applied to the sample() which
> >> GradientDescent uses, and that should kick in for your
> >> miniBatchFraction <= 0.4.  Based on your numbers, aggregation seems
> >> like the main issue, though I hesitate to optimize aggregation based
> >> on local tests for data sizes that small.
> >>
> >>
> >>
> >> The first things I'd do are check for unnecessary object creation and
> >> profile in a cluster or larger-data setting.
> >>
> >>
> >>
> >> On Wed, Apr 1, 2015 at 10:09 AM, Ulanov, Alexander
> >> <alexander.ula...@hp.com> wrote:
> >>
> >> Sorry for bothering you again, but I think this is an important issue
> >> for the applicability of SGD in Spark MLlib. Could the Spark developers
> >> please comment on it?
> >>
> >>
> >> -----Original Message-----
> >> From: Ulanov, Alexander
> >> Sent: Monday, March 30, 2015 5:00 PM
> >> To: dev@spark.apache.org
> >> Subject: Stochastic gradient descent performance
> >>
> >> Hi,
> >>
> >> It seems to me that there is an overhead in the "runMiniBatchSGD"
> >> function of MLlib's "GradientDescent": "sample" and "treeAggregate"
> >> might take an order of magnitude more time than the actual gradient
> >> computation. For example, for the MNIST dataset of 60K instances with
> >> minibatch size = 0.001 (i.e. 60 samples), it takes 0.15 s to sample and
> >> 0.3 s to aggregate in local mode with 1 data partition on a Core i5
> >> processor, while the actual gradient computation takes 0.002 s. I
> >> searched through the Spark Jira and found that there was recently an
> >> update for more efficient sampling (SPARK-3250) that is already
> >> included in the Spark codebase. Is there a way to reduce the sampling
> >> time and the local treeAggregate time by an order of magnitude?
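> >>
> >> To make the comparison concrete, here is a rough way to time the two
> >> pieces separately (a sketch only, not the MLlib internals; `data` is an
> >> assumed RDD[LabeledPoint] and the gradient is stubbed out so that only
> >> the sampling/aggregation overheads are measured):
> >>
> >>   import org.apache.spark.mllib.regression.LabeledPoint
> >>   import org.apache.spark.rdd.RDD
> >>
> >>   def timeParts(data: RDD[LabeledPoint]): Unit = {
> >>     def stubGradient(p: LabeledPoint): Double = 0.0  // stand-in for the real gradient
> >>     val t0 = System.nanoTime()
> >>     val batch = data.sample(false, 0.001, 42).cache()
> >>     batch.count()                                    // force the sampling
> >>     val t1 = System.nanoTime()
> >>     batch.treeAggregate(0.0)((l, p) => l + stubGradient(p), _ + _)
> >>     val t2 = System.nanoTime()
> >>     println(s"sample: ${(t1 - t0) / 1e6} ms, treeAggregate: ${(t2 - t1) / 1e6} ms")
> >>   }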
> >>
> >> Best regards, Alexander
> >>
> >>
> >>
> >>
> >
> >
>
>
>
