[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

sethah Fri, 26 Aug 2016 11:03:23 -0700

GitHub user sethah opened a pull request:

    https://github.com/apache/spark/pull/14834


    [SPARK-17163][ML][WIP] Unified LogisticRegression interface

    ## What changes were proposed in this pull request?
    
    Merge `MultinomialLogisticRegression` into `LogisticRegression` and remove 
`MultinomialLogisticRegression`.
    
    Marked as WIP because we should discuss the coefficients API in the model. 
See discussion below.
    
    JIRA: [SPARK-17163](https://issues.apache.org/jira/browse/SPARK-17163)
    
    
    ## How was this patch tested?
    
    Merged test suites and added some new unit tests.
    
    ## Design
    
    ### Switching between binomial and multinomial
    
    We default to automatically detecting whether to run binomial or 
multinomial LOR. We expose a new parameter called `family`, which defaults to 
"auto". When "auto" is used, we run the normal binomial LOR with pivoting if 
there are 1 or 2 label classes; otherwise, we run multinomial. If the user 
explicitly sets the family, we honor that setting. If "binomial" is set but 
more than two label classes are detected, we throw an error.
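
    A minimal sketch of the selection rule described above. The object and 
method names (`FamilySketch`, `resolveFamily`) are hypothetical and not taken 
from the patch; only the `family` values and the class-count rule come from the 
description.

```scala
object FamilySketch {
  // Hypothetical helper, not the PR's code: picks the training path from the
  // `family` setting and the number of label classes found in the data.
  def resolveFamily(family: String, numClasses: Int): String = family match {
    // "auto": binomial for 1 or 2 label classes, multinomial otherwise.
    case "auto" => if (numClasses <= 2) "binomial" else "multinomial"
    // An explicit "binomial" with more than two classes is an error.
    case "binomial" =>
      require(numClasses <= 2,
        s"binomial family supports at most 2 classes, found $numClasses")
      "binomial"
    // An explicit "multinomial" is honored even for two classes.
    case "multinomial" => "multinomial"
    case other =>
      throw new IllegalArgumentException(s"Unknown family: $other")
  }
}
```

    For example, `FamilySketch.resolveFamily("auto", 3)` selects the 
multinomial path, while `FamilySketch.resolveFamily("binomial", 3)` throws.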
    
    ### coefficients/intercept model API (TODO)
    
    This is the biggest design point remaining, IMO. We need to decide how to 
store the coefficients and intercepts in the model, and in turn how to expose 
them via the API. Two important points:
    
    * We must maintain compatibility with the old API, i.e. we must expose `def 
coefficients: Vector` and `def intercept: Double`
    * There are two separate cases: binomial LOR, where we have a single set of 
coefficients and a single intercept, and multinomial LOR, where we have 
`numClasses` sets of coefficients and `numClasses` intercepts.
    
    Some options:
    
    1. **Store the binomial coefficients as a `2 x numFeatures` matrix.** This 
means that we would center the model coefficients before storing them in the 
model. The BLOR algorithm gives `1 x numFeatures` coefficients, but we would 
convert them to `2 x numFeatures` coefficients before storing them, effectively 
doubling the storage in the model. The advantage is cleaner code (i.e. fewer 
`if (isMultinomial) ... else ...` branches) and less reasoning about the 
different cases. The disadvantage is that we double the storage space and could 
see small regressions at prediction time, since the prediction algorithms 
perform 2x the number of operations. Additionally, we still have to expose the 
uncentered coefficients/intercept via the API, so we would have to either ALSO 
store the uncentered version or compute it in `def coefficients: Vector` every 
time.
    
    2. **Store the binomial coefficients as a `1 x numFeatures` matrix.** We 
still store the coefficients as a matrix and the intercepts as a vector. When 
users call `coefficients`, we return a `Vector` that is backed by the same 
underlying array as the `coefficientMatrix`, so we don't duplicate any data. At 
prediction time, we use the old prediction methods that are specialized for 
binary LOR. The benefits are that we don't store extra data and we won't see 
any regressions in performance. The cost is that we have separate 
implementations of the predict methods for the binary and multiclass cases. The 
amount of duplicated code is small, but it's still a bit messy. 
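
    To illustrate option 2, here is a rough sketch of a model that exposes the 
old `coefficients`/`intercept` accessors on top of matrix/vector storage without 
copying. `Mat`, `Vec`, `Model`, and `interceptVector` are hypothetical stand-ins 
so the example runs without Spark on the classpath; throwing in the multinomial 
case is also an assumption, not a decision made in this PR.

```scala
object SharedStorageSketch {
  // Minimal stand-ins for dense matrix and vector types (hypothetical).
  final class Mat(val numRows: Int, val numCols: Int, val values: Array[Double])
  final class Vec(val values: Array[Double]) {
    def apply(i: Int): Double = values(i)
    def size: Int = values.length
  }

  final class Model(val coefficientMatrix: Mat, val interceptVector: Vec) {
    // Binomial models store a single row of coefficients.
    private val isBinomial: Boolean = coefficientMatrix.numRows == 1

    // Old API: wraps the matrix's backing array instead of copying it.
    def coefficients: Vec = {
      require(isBinomial, "coefficients is only defined for the binomial case")
      new Vec(coefficientMatrix.values)
    }

    // Old API: the single intercept is the only entry of the vector.
    def intercept: Double = {
      require(isBinomial, "intercept is only defined for the binomial case")
      interceptVector(0)
    }
  }
}
```

    Because `coefficients` wraps `coefficientMatrix.values` directly, both 
views refer to the same array and no data is duplicated.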
    
    If we do decide to store the 2x coefficients, we would likely want to see 
some performance tests to understand the potential regressions.
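
    For reference, a small sketch of what option 1 implies. It assumes that 
"centering" means making the two rows sum to zero per feature, i.e. the pivoted 
coefficients `beta` become rows `-beta/2` and `beta/2`; that reading and all 
names below are assumptions for illustration. It shows that the uncentered form 
can be recovered as `row1 - row0`, and that the softmax path needs two dot 
products where the binary path needs one.

```scala
object CenteringSketch {
  // Expand pivoted binomial coefficients into two centered rows,
  // stored row-major: Array(row0..., row1...).
  def center(beta: Array[Double]): Array[Double] =
    beta.map(v => -v / 2.0) ++ beta.map(v => v / 2.0)

  // Recover the pivoted (uncentered) coefficients as row1 - row0.
  def uncenter(centered: Array[Double]): Array[Double] = {
    val n = centered.length / 2
    Array.tabulate(n)(j => centered(n + j) - centered(j))
  }

  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Binary prediction: one dot product followed by a sigmoid.
  def probBinary(beta: Array[Double], intercept: Double,
                 x: Array[Double]): Double =
    1.0 / (1.0 + math.exp(-(dot(beta, x) + intercept)))

  // Two-class softmax over the centered rows: two dot products.
  def probSoftmax(centered: Array[Double], b0: Double, b1: Double,
                  x: Array[Double]): Double = {
    val n = x.length
    val m0 = dot(centered.slice(0, n), x) + b0
    val m1 = dot(centered.slice(n, 2 * n), x) + b1
    math.exp(m1) / (math.exp(m0) + math.exp(m1))
  }
}
```

    With the intercept centered the same way (`b0 = -b/2`, `b1 = b/2`), both 
functions give the same class-1 probability, so the extra work in the softmax 
path is what a performance test would need to measure.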
    
    ### Threshold/thresholds (TODO)
    
    Currently, when `threshold` is set we clear whatever value is in 
`thresholds`, and when `thresholds` is set we clear whatever value is in 
`threshold`. [SPARK-11543](https://issues.apache.org/jira/browse/SPARK-11543) 
was created to prefer thresholds over threshold. We should decide whether to 
implement this behavior now or in a separate JIRA.
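
    The current mutual-clearing behavior can be sketched as follows. The 
`Params` class here is a hypothetical stand-in using plain `Option` fields, not 
Spark's actual params machinery.

```scala
object ThresholdSketch {
  // Hypothetical stand-in for the estimator's params.
  final class Params {
    private var threshold: Option[Double] = None
    private var thresholds: Option[Array[Double]] = None

    // Setting `threshold` clears any value held in `thresholds`.
    def setThreshold(value: Double): this.type = {
      thresholds = None
      threshold = Some(value)
      this
    }

    // Setting `thresholds` clears any value held in `threshold`.
    def setThresholds(value: Array[Double]): this.type = {
      threshold = None
      thresholds = Some(value)
      this
    }

    def getThreshold: Option[Double] = threshold
    def getThresholds: Option[Array[Double]] = thresholds
  }
}
```

    Under this scheme, whichever setter is called last wins, and the other 
param is left unset.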
    
    ## Follow up
    
    * Summary model for multiclass logistic regression 
[SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139)
    * Thresholds vs threshold 
[SPARK-11543](https://issues.apache.org/jira/browse/SPARK-11543)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sethah/spark SPARK-17163

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14834.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14834
    
----
commit b20d2e769b837d3566ca5f1b1a75d17d755d3b9a
Author: sethah <[email protected]>
Date:   2016-08-25T00:13:33Z

    first pass at merging MLOR with LOR

commit d85381abb44ec0c17117aafdea8c5f017f2f6851
Author: sethah <[email protected]>
Date:   2016-08-25T05:05:46Z

    add initial model

commit 6b1b984f5fb892bfedfc58c39ef9a7ea9863ed29
Author: sethah <[email protected]>
Date:   2016-08-25T16:16:33Z

    fixing some todos, added dual support for weighted tests

commit 0d8693727b78afd69e60495d12ac6a4d382356bb
Author: sethah <[email protected]>
Date:   2016-08-25T20:11:44Z

    all auxiliary tests are merged to LOR, and added initial model test

commit 05f2ce02b8a97af2235c984ab4799b6ec99a67f0
Author: sethah <[email protected]>
Date:   2016-08-25T21:33:34Z

    model loading backward compat

commit 856593c67f86a343be5db0abf34909ebe705c7b5
Author: sethah <[email protected]>
Date:   2016-08-26T01:27:57Z

    correcting initial model test and deleting multinomial

commit 6d3874f6213755aa28726d5afd9b33dafb94c39e
Author: sethah <[email protected]>
Date:   2016-08-26T04:20:22Z

    small fixes, remove temp constructor

commit 00788bbe475f6eb05a7d71dfc5d79111c449e1c9
Author: sethah <[email protected]>
Date:   2016-08-26T04:24:46Z

    rebase

commit 4f8b39b04077e066077f493c1b30bebe622d381e
Author: sethah <[email protected]>
Date:   2016-08-26T15:21:56Z

    removing old test suite

commit a16a4a9f1e98199ebdee706960afbf441597c104
Author: sethah <[email protected]>
Date:   2016-08-26T16:36:05Z

    some small fixes

commit 7cfbcd3856992a0be0cb3a91ad91e608b1db3fc0
Author: sethah <[email protected]>
Date:   2016-08-26T17:52:29Z

    use _coefficients

----


