Josh - thanks for the detailed write-up - this seems a little funny to me.
I agree that with the current code path there is more work being done than
needs to be (e.g. the features are re-scaled at every iteration), but the
relatively costly process of fitting the StandardScaler should not be
re-done at each iteration. Instead, at each iteration, all points are
re-scaled according to the pre-computed standard deviations in the
StandardScalerModel, and then an intercept is appended.
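
Roughly, that code path looks like this (paraphrasing the branch-1.1 GLA
code from memory, not quoting it exactly; StandardScaler comes from
org.apache.spark.mllib.feature and appendBias from
org.apache.spark.mllib.util.MLUtils):

  // Fit the scaler once over the raw features (an aggregate over `input`).
  val scaler = new StandardScaler().fit(input.map(_.features))
  // Lazily transform each point with the fitted model and append the intercept term.
  val data = input.map(lp => (lp.label, appendBias(scaler.transform(lp.features))))

Only the fit() aggregates over the whole dataset, and that should happen
exactly once; the per-iteration cost should just be the per-point transform.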

Just to be clear - you're currently calling .persist() before you pass data
to LogisticRegressionWithLBFGS?
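
i.e. something like this (a sketch with illustrative names; `rawPoints` is
whatever RDD[LabeledPoint] you're training on):

  import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
  import org.apache.spark.storage.StorageLevel

  val training = rawPoints.persist(StorageLevel.MEMORY_AND_DISK)  // cache before training
  val model = new LogisticRegressionWithLBFGS().run(training)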

Also - can you give some details about the problem/cluster size you're
solving this on? How much memory per node? How big are n and d, how sparse
is the data (if at all), and how many iterations are you running? Is 0:45
the per-iteration time or the total time for some number of iterations?

A useful test might be to run GeneralizedLinearAlgorithm with useFeatureScaling
set to false (and maybe also addIntercept set to false) on persisted data, and
see whether you get the same performance win. If that's the case, we've
isolated the issue and can start profiling to see where all the time is going.
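
For example, something along these lines (a sketch; setIntercept is public
on GeneralizedLinearAlgorithm, but setFeatureScaling is private[mllib] in
branch-1.1, so this may need to live under an org.apache.spark.mllib
package or behind a small patch):

  val lr = new LogisticRegressionWithLBFGS()
    .setIntercept(false)       // skip appending the intercept column
    .setFeatureScaling(false)  // skip the StandardScaler path entirely
  val model = lr.run(persistedData)  // `persistedData` already .persist()ed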

It would be great if you could open a JIRA.

Thanks!



On Tue, Feb 17, 2015 at 6:36 AM, Josh Devins <j...@soundcloud.com> wrote:

> Cross-posting as I got no response on the users mailing list last
> week. Any response would be appreciated :)
>
> Josh
>
>
> ---------- Forwarded message ----------
> From: Josh Devins <j...@soundcloud.com>
> Date: 9 February 2015 at 15:59
> Subject: [MLlib] Performance problem in GeneralizedLinearAlgorithm
> To: "u...@spark.apache.org" <u...@spark.apache.org>
>
>
> I've been looking into a performance problem when using
> LogisticRegressionWithLBFGS (and in turn GeneralizedLinearAlgorithm).
> Here's an outline of what I've figured out so far. It would be great to
> get some confirmation of the problem, some input on how widespread it
> might be, and any ideas on a nice way to fix it.
>
> Context:
> - I will reference `branch-1.1` as we are currently on v1.1.1 however
> this appears to still be a problem on `master`
> - The cluster is run on YARN, on bare-metal hardware (no VMs)
> - I've not filed a Jira issue yet but can do so
> - This problem affects all algorithms based on GeneralizedLinearAlgorithm
> (GLA), e.g. LogisticRegressionWithLBFGS; it is worst when feature scaling
> is used, but still present without it
>
> Problem Outline:
> - Starting at GLA line 177
> (
> https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L177
> ),
> a feature scaler is created using the `input` RDD
> - Refer next to line 186 which then maps over the `input` RDD and
> produces a new `data` RDD
> (
> https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L186
> )
> - If you are using feature scaling or adding intercepts, the user
> `input` RDD has been mapped over *after* the user has persisted it
> (hopefully) and *before* going into the (iterative) optimizer on line
> 204 (
> https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L204
> )
> - Since the RDD `data` that is iterated over in the optimizer is
> unpersisted, when we are running the cost function in the optimizer
> (e.g. LBFGS --
> https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala#L198
> ),
> the map phase will actually first go back and rerun the feature
> scaling (map tasks on `input`) and then map with the cost function
> (two maps pipelined into one stage)
> - As a result, parts of the StandardScaler will actually be run again
> (perhaps only because the variable is `lazy`?) and this can be costly,
> see line 84 (
> https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala#L84
> )
> - For small datasets and/or few iterations this is not really a problem;
> however, we found that by adding a `data.persist()` right before running
> the optimizer, the map iterations in the optimizer went from 5:30 down
> to 0:45
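>
> To be concrete, the workaround we tested is essentially just the following
> (a sketch of where the persist goes inside GeneralizedLinearAlgorithm.run,
> not an exact diff; variable names are from memory):
>
>   // `data` is the scaled / intercept-appended RDD built from `input` above
>   data.persist()  // avoid re-running the scaling map on every LBFGS iteration
>   val weightsWithIntercept = optimizer.optimize(data, initialWeightsWithIntercept)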
>
> I had a very tough time coming up with a nice way to describe my
> debugging sessions in an email so I hope this gets the main points
> across. Happy to clarify anything if necessary (also by live
> debugging/Skype/phone if that's helpful).
>
> Thanks,
>
> Josh
>
>
