GitHub user daniel-siegmann-aol opened a pull request:

    https://github.com/apache/spark/pull/12761

    [SPARK-14464] [MLLIB] Better support for logistic regression when features 
are sparse

    ## What changes were proposed in this pull request?
    
    Where aggregations were previously done against feature vectors using 
arrays, this PR replaces those arrays with a new `VectorBuilder` class. This 
class has both dense and sparse implementations, similar to the existing 
`Vector` class. The dense implementation is essentially a wrapper around an 
array, while the sparse implementation uses a hashmap. This allows values to be 
accumulated into the structure, which cannot be done with the existing 
`SparseVector`.
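    The sketch below is a hypothetical illustration of the `VectorBuilder` 
idea described above (a mutable accumulator with dense and sparse backing 
structures); names and signatures are illustrative, not the patch's actual code.

```scala
import scala.collection.mutable

// Hypothetical sketch of the VectorBuilder idea: a mutable accumulator
// for per-feature values. Unlike SparseVector, values can be added
// incrementally after construction.
sealed trait VectorBuilder {
  def add(i: Int, v: Double): Unit // accumulate v into slot i
  def apply(i: Int): Double        // read the current value of slot i
  def size: Int
}

// Dense: a thin wrapper around a preallocated array.
class DenseVectorBuilder(val size: Int) extends VectorBuilder {
  private val values = new Array[Double](size)
  def add(i: Int, v: Double): Unit = values(i) += v
  def apply(i: Int): Double = values(i)
}

// Sparse: a hashmap keyed by feature index, so untouched slots cost nothing.
class SparseVectorBuilder(val size: Int) extends VectorBuilder {
  private val values = mutable.HashMap.empty[Int, Double]
  def add(i: Int, v: Double): Unit = values(i) = values.getOrElse(i, 0.0) + v
  def apply(i: Int): Double = values.getOrElse(i, 0.0)
}
```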
    
    Aggregations are dense by default. A new parameter on 
`LogisticRegression`, `useSparseAgg`, specifies that sparse implementations 
should be used for aggregation; otherwise behavior should be identical to 
before this PR. Note that this parameter also had to be added to the 
constructor of `MultivariateOnlineSummarizer`, though a default was set so 
existing source code will not break. WARNING: this may break binary 
compatibility, but since this is a developer API I don't expect that to be a 
problem.
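    To illustrate the source-versus-binary compatibility point: adding a 
defaulted constructor parameter keeps existing Scala call sites compiling 
unchanged, but alters the constructor's bytecode signature, so previously 
compiled callers can break. A minimal plain-Scala sketch of the pattern (a 
stand-in class, not the real `MultivariateOnlineSummarizer`):

```scala
// Stand-in for the real MultivariateOnlineSummarizer, showing the pattern:
// the new parameter gets a default, so existing source code compiles
// unchanged, even though the constructor's binary signature changes.
class Summarizer(val useSparseAgg: Boolean = false) {
  // aggregation state (VectorBuilders in the actual patch) would live here
}

val dense  = new Summarizer()                    // existing call sites: unchanged
val sparse = new Summarizer(useSparseAgg = true) // new opt-in call site
```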
    
    Aside from the new `VectorBuilder` class and the new `useSparseAgg` 
parameter, the changes are largely a line-for-line mapping from arrays to the 
new vector builders.
    
    This change was made with the intention of having zero impact for anyone 
who doesn't explicitly opt in by setting the `useSparseAgg` parameter to true.
    
    While this change has been made only for logistic regression, it is likely 
the same technique could be applied to other estimators.
    
    ## How was this patch tested?
    
    Unit tests added, existing unit tests pass.
    
    I made the original version of these changes by copying some code into my 
own project and modifying it as needed. That version was tested on my 
production training flow with a few different settings to verify the 
performance improvements. Additionally, I attempted to verify that this does 
not degrade performance with dense aggregation; while it did not seem to, I 
have not tested this thoroughly.
    
    Spark doesn't build on my machine (even on master), so I cannot verify 
that everything works. I'm not sure what tests exist to defend against 
performance regressions, but I assume there are some that will be run before 
the pull request is merged. If additional tests need to be created for the 
sparse aggregation case, please let me know.
    
    Also, please note that this is the first patch I have attempted to submit 
to Spark. Please do give my code a careful review.
    
    ## Reviewers
    
    The PR submission guidelines suggest mentioning some of the committers for 
the relevant code. Some of those people include @mengxr @dbtsai @jkbradley . 
Apologies if I missed anyone important.
    
    ## Copyright
    
    These changes are copyright AOL, Inc. (my employer). I have received 
permission to distribute them under the Apache license, as required by the 
Apache Spark project.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/daniel-siegmann-aol/spark 
SPARK-14464_sparse_logistic_regression

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12761.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12761
    
----
commit c4f6c15a2da02334eeeff87e72ab8fc5e46cd052
Author: Daniel Siegmann <[email protected]>
Date:   2016-04-07T20:28:23Z

    [SPARK-14464] [ML] Added utility classes for storing aggregations of 
features in either dense or sparse form.

commit bf086abea2b7f0f332338c3d644aa0fa074005e3
Author: Daniel Siegmann <[email protected]>
Date:   2016-04-08T19:40:01Z

    [SPARK-14464] Fixed bug in VectorBuilder clone implementations (dense and 
sparse).

commit 07d3b94584d22e248e16747a02445715fddf88f2
Author: Daniel Siegmann <[email protected]>
Date:   2016-04-08T20:11:38Z

    [SPARK-14464] Modified MultivariateOnlineSummarizer to support aggregation 
in sparse structures by replacing arrays with VectorBuilders. Instances use 
dense structures by default. Replicated test cases so the class is tested with 
both dense and sparse structures.

commit fc44af51ba69f466949d4d6c18a3d9daa9f72712
Author: Daniel Siegmann <[email protected]>
Date:   2016-04-08T20:35:33Z

    [SPARK-14464] Refactored tests in MultivariateOnlineSummarizerSuite so code 
isn't duplicated when testing both dense and sparse aggregation.

commit 10272c7ed834afd6bf5b8f2bc6443f227560c66c
Author: Daniel Siegmann <[email protected]>
Date:   2016-04-08T21:52:18Z

    [SPARK-14464] Implemented option to aggregate in sparse structures when 
training using LogisticRegression.

commit fb2e9748af4bff9edc80769c46464e715299ec99
Author: Daniel Siegmann <[email protected]>
Date:   2016-04-11T17:12:08Z

    [SPARK-14464] Fixed style check failures.

commit 4eec865350ae8aa786959e6b50ca7c29ebcc9de3
Author: Daniel Siegmann <[email protected]>
Date:   2016-04-11T17:38:09Z

    [SPARK-14464] Set the @Since annotations to 2.0.0, as this is presumably 
the version which would include these changes.

commit f087f8be726d82a1f8a3ea1bc5b1c2307e2ccd0d
Author: Daniel Siegmann <[email protected]>
Date:   2016-04-28T17:42:56Z

    Merge branch 'master' into SPARK-14464_sparse_logistic_regression
    
    # Conflicts:
    #   
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

----

