GitHub user sethah opened a pull request:
https://github.com/apache/spark/pull/14834
[SPARK-17163][ML][WIP] Unified LogisticRegression interface
## What changes were proposed in this pull request?
Merge `MultinomialLogisticRegression` into `LogisticRegression` and remove
`MultinomialLogisticRegression`.
Marked as WIP because we should discuss the coefficients API in the model.
See discussion below.
JIRA: [SPARK-17163](https://issues.apache.org/jira/browse/SPARK-17163)
## How was this patch tested?
Merged test suites and added some new unit tests.
## Design
### Switching between binomial and multinomial
By default, we automatically detect whether to run binomial or multinomial
logistic regression. We expose a new parameter called `family`, which defaults
to "auto". When "auto" is used, we run standard binomial logistic regression
with pivoting if there are 1 or 2 label classes; otherwise, we run multinomial.
If the user explicitly sets the family, we abide by that setting. If "binomial"
is set but more than two label classes are detected, we throw an error.
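The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the object name `FamilySelection`, the method `isMultinomial`, and the error raised for an unrecognized family string are all assumptions made for the example.

```scala
object FamilySelection {
  /**
   * Resolve the `family` setting against the number of label classes.
   * Returns true when the multinomial algorithm should run.
   * (Hypothetical helper; names are illustrative only.)
   */
  def isMultinomial(family: String, numClasses: Int): Boolean =
    family.toLowerCase match {
      // "auto": binomial with pivoting for 1 or 2 classes, multinomial otherwise
      case "auto" => numClasses > 2
      // explicit "binomial": reject multiclass labels
      case "binomial" =>
        require(numClasses <= 2,
          s"Binomial family only supports 1 or 2 label classes but found $numClasses.")
        false
      // explicit "multinomial": always honored, even for 2 classes
      case "multinomial" => true
      // assumption: unknown values are rejected
      case other =>
        throw new IllegalArgumentException(s"Unsupported family: $other")
    }
}
```

Note that an explicit "multinomial" setting is honored even when only two classes are present, which follows from "we abide by that setting".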
### coefficients/intercept model API (TODO)
This is the biggest design point remaining, IMO. We need to decide how to
store the coefficients and intercepts in the model, and in turn how to expose
them via the API. Two important points:
* We must maintain compatibility with the old API, i.e. we must expose
`def coefficients: Vector` and `def intercept: Double`.
* There are two separate cases: binomial logistic regression, where we have a
single set of coefficients and a single intercept, and multinomial logistic
regression, where we have `numClasses` sets of coefficients and `numClasses`
intercepts.
Some options:
1. **Store the binomial coefficients as a `2 x numFeatures` matrix.** This
means that we would center the model coefficients before storing them in the
model. The binomial algorithm gives `1 x numFeatures` coefficients, but we
would convert them to `2 x numFeatures` coefficients before storing them,
effectively doubling the storage in the model. The advantage is that we can
make the code cleaner (i.e. fewer `if (isMultinomial) ... else ...` branches)
and we don't have to reason about the different cases as much. The
disadvantage is that we double the storage space and could see small
regressions at prediction time, since the prediction algorithms perform 2x the
number of operations. Additionally, we still have to expose the uncentered
coefficients/intercept via the API, so we would have to either also store the
uncentered version or compute it in `def coefficients: Vector` on every call.
2. **Store the binomial coefficients as a `1 x numFeatures` matrix.** We
still store the coefficients as a matrix and the intercepts as a vector. When
users call `coefficients`, we return a `Vector` that is backed by the same
underlying array as the `coefficientMatrix`, so we don't duplicate any data. At
prediction time, we use the old prediction methods that are specialized for
binary logistic regression. The benefits are that we don't store extra data and
we won't see any performance regressions. The cost is that we have separate
implementations of the predict methods for the binary and multiclass cases. The
amount of duplicated code is small, but it's still a bit messy.
If we do decide to store the `2 x numFeatures` coefficients, we would likely
want to see some performance tests to understand the potential regressions.
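To make option 1 concrete, here is a small plain-Scala sketch of what centering could look like. It assumes centering means splitting the pivoted coefficients `beta` into rows `-beta/2` and `+beta/2` (so the rows sum to zero); that choice, the object name `CenteredBinomial`, and all helper names are assumptions for illustration and use plain arrays rather than Spark's linalg types.

```scala
object CenteredBinomial {
  /** Expand pivoted coefficients (1 x numFeatures) to a centered 2 x numFeatures layout. */
  def center(beta: Array[Double]): Array[Array[Double]] =
    Array(beta.map(b => -b / 2.0), beta.map(b => b / 2.0))

  /** Same expansion for the single intercept. */
  def centerIntercept(intercept: Double): Array[Double] =
    Array(-intercept / 2.0, intercept / 2.0)

  /** Recover the uncentered (pivoted) coefficients: row 1 minus row 0. */
  def uncenter(centered: Array[Array[Double]]): Array[Double] =
    centered(1).zip(centered(0)).map { case (c1, c0) => c1 - c0 }

  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  /** Specialized binary prediction: one dot product and a sigmoid. */
  def binaryProb(beta: Array[Double], intercept: Double, x: Array[Double]): Double =
    1.0 / (1.0 + math.exp(-(dot(beta, x) + intercept)))

  /** Generic softmax prediction for class 1: one dot product per class. */
  def softmaxProb1(coef: Array[Array[Double]],
                   intercepts: Array[Double],
                   x: Array[Double]): Double = {
    val margins = coef.zip(intercepts).map { case (row, b) => dot(row, x) + b }
    val maxMargin = margins.max // subtract the max for numerical stability
    val exps = margins.map(m => math.exp(m - maxMargin))
    exps(1) / exps.sum
  }
}
```

Under this assumption both paths give the same class-1 probability, but the softmax path computes two dot products per prediction instead of one, and `uncenter` is the extra work that `def coefficients: Vector` would need on each call unless the uncentered values are also stored.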
### Threshold/thresholds (TODO)
Currently, when `threshold` is set we clear whatever value is in
`thresholds`, and when `thresholds` is set we clear whatever value is in
`threshold`. [SPARK-11543](https://issues.apache.org/jira/browse/SPARK-11543)
was created to prefer `thresholds` over `threshold`. We should decide whether
to implement that behavior now or in a separate JIRA.
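The current mutual-clearing behavior can be illustrated with a toy settings holder. This is only a sketch: `ThresholdParams.Settings` is a hypothetical class that does not use Spark's `Params` machinery, and it ignores any defaults the real estimator may have.

```scala
object ThresholdParams {
  /** Hypothetical holder for the two settings (illustration only). */
  final class Settings {
    private var threshold: Option[Double] = None
    private var thresholds: Option[Array[Double]] = None

    /** Setting `threshold` clears any value stored in `thresholds`. */
    def setThreshold(value: Double): this.type = {
      thresholds = None
      threshold = Some(value)
      this
    }

    /** Setting `thresholds` clears any value stored in `threshold`. */
    def setThresholds(value: Array[Double]): this.type = {
      threshold = None
      thresholds = Some(value)
      this
    }

    def getThreshold: Option[Double] = threshold
    def getThresholds: Option[Array[Double]] = thresholds
  }
}
```

With this scheme the last setter called wins, so at most one of the two values is ever defined.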
## Follow up
* Summary model for multiclass logistic regression
[SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139)
* Thresholds vs threshold
[SPARK-11543](https://issues.apache.org/jira/browse/SPARK-11543)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sethah/spark SPARK-17163
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14834.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14834
----
commit b20d2e769b837d3566ca5f1b1a75d17d755d3b9a
Author: sethah
Date: 2016-08-25T00:13:33Z
first pass at merging MLOR with LOR
commit d85381abb44ec0c17117aafdea8c5f017f2f6851
Author: sethah
Date: 2016-08-25T05:05:46Z
add initial model
commit 6b1b984f5fb892bfedfc58c39ef9a7ea9863ed29
Author: sethah
Date: 2016-08-25T16:16:33Z
fixing some todos, added dual support for weighted tests
commit 0d8693727b78afd69e60495d12ac6a4d382356bb
Author: sethah
Date: 2016-08-25T20:11:44Z
all auxiliary tests are merged to LOR, and added initial model test
commit 05f2ce02b8a97af2235c984ab4799b6ec99a67f0
Author: sethah
Date: 2016-08-25T21:33:34Z
model loading backward compat
commit 856593c67f86a343be5db0abf34909ebe705c7b5
Author: sethah
Date: 2016-08-26T01:27:57Z
correcting initial model test and deleting multinomial
commit 6d3874f6213755aa28726d5afd9b33dafb94c39e
Author: sethah
Date: 2016-08-26T04:20:22Z
small fixes, remove temp constructor
commit 00788bbe475f6eb05a7d71dfc5d79111c449e1c9
Author: sethah
Date: 2016-08-26T04:24:46Z
rebase
commit 4f8b39b04077e066077f493c1b30bebe622d381e
Author: sethah
Date: 2016-08-26T15:21:56Z
removing old test suite
commit a16a4a9f1e98199ebdee706960afbf441597c104
Author: sethah
Date: 2016-08-26T16:36:05Z
some small fixes
commit 7cfbcd3856992a0be0cb3a91ad91e608b1db3fc0
Author: sethah
Date: 2016-08-26T17:52:29Z
use _coefficients
----