GitHub user daniel-siegmann-aol opened a pull request:
https://github.com/apache/spark/pull/12761
[SPARK-14464] [MLLIB] Better support for logistic regression when features
are sparse
## What changes were proposed in this pull request?
Where aggregations were being done against feature vectors, replaced the
use of arrays with a new `VectorBuilder` class. This class has both dense and
sparse implementations, similar to the existing `Vector` class. The dense
implementation is essentially a wrapper around an array, while the sparse
implementation uses a hashmap. This allows values to be aggregated into the
structure, which cannot be done with the existing `SparseVector`.
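To illustrate the idea, here is a minimal sketch of what such a builder could
look like. The method names (`add`, `apply`, `activeEntries`, `copy`) and the
class names `DenseVectorBuilder` and `SparseVectorBuilder` are assumptions made
for illustration; the actual API in the patch may differ.

```scala
import scala.collection.mutable

// Hypothetical interface: not the actual VectorBuilder API from the patch.
trait VectorBuilder {
  def size: Int
  /** Adds `value` to whatever is currently stored at `index`. */
  def add(index: Int, value: Double): Unit
  /** Returns the accumulated value at `index` (0.0 if never touched). */
  def apply(index: Int): Double
  /** Iterates over the non-zero (index, value) pairs. */
  def activeEntries: Iterator[(Int, Double)]
  def copy: VectorBuilder
}

// Dense variant: a thin wrapper around an array of length `size`.
final class DenseVectorBuilder(val size: Int) extends VectorBuilder {
  private val values = new Array[Double](size)

  def add(index: Int, value: Double): Unit = values(index) += value
  def apply(index: Int): Double = values(index)
  def activeEntries: Iterator[(Int, Double)] =
    values.iterator.zipWithIndex.collect { case (v, i) if v != 0.0 => (i, v) }
  def copy: VectorBuilder = {
    val c = new DenseVectorBuilder(size)
    Array.copy(values, 0, c.values, 0, size)
    c
  }
}

// Sparse variant: only indices that have been written consume memory.
final class SparseVectorBuilder(val size: Int) extends VectorBuilder {
  private val values = mutable.HashMap.empty[Int, Double]

  def add(index: Int, value: Double): Unit = {
    require(index >= 0 && index < size, s"index $index out of range [0, $size)")
    values(index) = values.getOrElse(index, 0.0) + value
  }
  def apply(index: Int): Double = values.getOrElse(index, 0.0)
  def activeEntries: Iterator[(Int, Double)] = values.iterator
  def copy: VectorBuilder = {
    val c = new SparseVectorBuilder(size)
    c.values ++= values
    c
  }
}
```

With a very large feature space, the sparse variant allocates entries only for
the indices that are actually updated, whereas the dense variant allocates the
full array up front.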
Aggregations are dense by default. There is a new parameter on
`LogisticRegression` called `useSparseAgg` to specify that sparse
implementations should be used for aggregation; otherwise behavior should be
identical to before this PR. Note that this value had to be added to the
constructor of `MultivariateOnlineSummarizer` as well. A default was set, so
it will not break existing source code. WARNING: this may break binary
compatibility, although this is a developer API, so I don't expect that to be
a problem.
Aside from the new `VectorBuilder` class and the new `useSparseAgg`
parameter, the changes are largely line-for-line mapping from using arrays to
the new vector builders.
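As a rough sketch of what that mapping looks like, the example below sums
feature values per index first with a dense array and then with a hashmap. It
is a simplified stand-in rather than code from the patch: the
`AggregationSketch` object, its method names, and the `SparseRow` type are all
hypothetical.

```scala
import scala.collection.mutable

// Hypothetical illustration; none of these names come from the patch.
object AggregationSketch {
  // One training instance, stored as (featureIndex, value) pairs.
  type SparseRow = Seq[(Int, Double)]

  // Before: the accumulator is an array sized to the full feature space.
  def sumDense(numFeatures: Int, rows: Seq[SparseRow]): Array[Double] = {
    val sums = new Array[Double](numFeatures)
    for (row <- rows; (i, v) <- row) sums(i) += v
    sums
  }

  // After: the same loop, but the accumulator only holds touched indices.
  def sumSparse(rows: Seq[SparseRow]): mutable.HashMap[Int, Double] = {
    val sums = mutable.HashMap.empty[Int, Double]
    for (row <- rows; (i, v) <- row) sums(i) = sums.getOrElse(i, 0.0) + v
    sums
  }

  // Combining two partial results, as a distributed aggregation would
  // when merging per-partition accumulators. Mutates and returns `a`.
  def merge(
      a: mutable.HashMap[Int, Double],
      b: mutable.HashMap[Int, Double]): mutable.HashMap[Int, Double] = {
    for ((i, v) <- b) a(i) = a.getOrElse(i, 0.0) + v
    a
  }
}
```

The loop body is nearly identical in both versions; only the accumulator type
changes, and merging sparse partial results costs time proportional to the
number of touched indices rather than the total number of features.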
This change was made with the intention of having zero impact for anyone
who doesn't explicitly opt in by setting the `useSparseAgg` parameter to true.
While this change has been made only for logistic regression, it is likely
the same technique could be applied to other estimators.
## How was this patch tested?
Unit tests added, existing unit tests pass.
I made the original version of these changes by copying some code into my
own project and making the changes I needed. These were tested on my production
training flow with a few different settings to verify the performance
improvements. Additionally, I attempted to verify that this does not degrade
performance with dense aggregation; it didn't seem to, but I did not test
this thoroughly.
Spark doesn't build on my machine (even on master) so I cannot verify that
everything works. I'm not sure what tests exist to defend against performance
regressions, but I assume there are some that will be run before the pull
request is merged. If there are additional tests that need to be created for
the sparse aggregation case, please let me know.
Also, please note that this is the first patch I have attempted to submit
to Spark. Please do give my code a careful review.
## Reviewers
PR submission guidelines suggest mentioning some of the committers to the
relevant code. Some of those people include @mengxr @dbtsai @jkbradley .
Apologies if I missed anyone important.
## Copyright
These changes are copyright AOL, Inc. (my employer). I have received
permission to distribute them under the Apache license, as required by the
Apache Spark project.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/daniel-siegmann-aol/spark SPARK-14464_sparse_logistic_regression
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12761.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12761
----
commit c4f6c15a2da02334eeeff87e72ab8fc5e46cd052
Author: Daniel Siegmann <[email protected]>
Date: 2016-04-07T20:28:23Z
[SPARK-14464] [ML] Added utility classes for storing aggregations of
features in either dense or sparse form.
commit bf086abea2b7f0f332338c3d644aa0fa074005e3
Author: Daniel Siegmann <[email protected]>
Date: 2016-04-08T19:40:01Z
[SPARK-14464] Fixed bug in VectorBuilder clone implementations (dense and
sparse).
commit 07d3b94584d22e248e16747a02445715fddf88f2
Author: Daniel Siegmann <[email protected]>
Date: 2016-04-08T20:11:38Z
[SPARK-14464] Modified MultivariateOnlineSummarizer to support aggregation
in sparse structures by replacing arrays with VectorBuilders. Instances use
dense structures by default. Replicated test cases so the class is tested with
both dense and sparse structures.
commit fc44af51ba69f466949d4d6c18a3d9daa9f72712
Author: Daniel Siegmann <[email protected]>
Date: 2016-04-08T20:35:33Z
[SPARK-14464] Refactored tests in MultivariateOnlineSummarizerSuite so code
isn't duplicate testing both dense and sparse aggregation.
commit 10272c7ed834afd6bf5b8f2bc6443f227560c66c
Author: Daniel Siegmann <[email protected]>
Date: 2016-04-08T21:52:18Z
[SPARK-14464] Implemented option to aggregate in sparse structures when
training using LogisticRegression.
commit fb2e9748af4bff9edc80769c46464e715299ec99
Author: Daniel Siegmann <[email protected]>
Date: 2016-04-11T17:12:08Z
[SPARK-14464] Fixed style check failures.
commit 4eec865350ae8aa786959e6b50ca7c29ebcc9de3
Author: Daniel Siegmann <[email protected]>
Date: 2016-04-11T17:38:09Z
[SPARK-14464] Set the @Since annotations to 2.0.0, as this is presumably
the version which would include these changes.
commit f087f8be726d82a1f8a3ea1bc5b1c2307e2ccd0d
Author: Daniel Siegmann <[email protected]>
Date: 2016-04-28T17:42:56Z
Merge branch 'master' into SPARK-14464_sparse_logistic_regression
# Conflicts:
# mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
----