Repository: spark Updated Branches: refs/heads/master 6a05eb24d -> 9df54f532
[SPARK-17239][ML][DOC] Update user guide for multiclass logistic regression ## What changes were proposed in this pull request? Updates user guide to reflect that LogisticRegression now supports multiclass. Also adds new examples to show multiclass training. ## How was this patch tested? Ran locally using spark-submit, run-example, and copy/paste from user guide into shells. Generated docs and verified correct output. Author: sethah <seth.hendrickso...@gmail.com> Closes #15349 from sethah/SPARK-17239. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9df54f53 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9df54f53 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9df54f53 Branch: refs/heads/master Commit: 9df54f5325c2942bb77008ff1810e2fb5f6d848b Parents: 6a05eb2 Author: sethah <seth.hendrickso...@gmail.com> Authored: Wed Oct 5 18:28:21 2016 +0000 Committer: DB Tsai <dbt...@dbtsai.com> Committed: Wed Oct 5 18:28:21 2016 +0000 ---------------------------------------------------------------------- docs/ml-classification-regression.md | 65 +++++++++++++++++--- ...LogisticRegressionWithElasticNetExample.java | 14 +++++ ...LogisticRegressionWithElasticNetExample.java | 55 +++++++++++++++++ .../ml/logistic_regression_with_elastic_net.py | 10 +++ ...lass_logistic_regression_with_elastic_net.py | 48 +++++++++++++++ ...ogisticRegressionWithElasticNetExample.scala | 13 ++++ ...ogisticRegressionWithElasticNetExample.scala | 57 +++++++++++++++++ 7 files changed, 255 insertions(+), 7 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/spark/blob/9df54f53/docs/ml-classification-regression.md ---------------------------------------------------------------------- diff --git a/docs/ml-classification-regression.md b/docs/ml-classification-regression.md index 7c2437e..bb2e404 100644 --- a/docs/ml-classification-regression.md +++ b/docs/ml-classification-regression.md @@ -34,17 +34,22 @@ discussing specific classes of algorithms, such as linear methods, trees, and en ## Logistic regression -Logistic regression is a popular method to predict a binary response. It is a special case of [Generalized Linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) that predicts the probability of the outcome. -For more background and more details about the implementation, refer to the documentation of the [logistic regression in `spark.mllib`](mllib-linear-methods.html#logistic-regression). +Logistic regression is a popular method to predict a categorical response. It is a special case of [Generalized Linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) that predicts the probability of the outcomes. +In `spark.ml` logistic regression can be used to predict a binary outcome by using binomial logistic regression, or it can be used to predict a multiclass outcome by using multinomial logistic regression. Use the `family` +parameter to select between these two algorithms, or leave it unset and Spark will infer the correct variant. - > The current implementation of logistic regression in `spark.ml` only supports binary classes. Support for multiclass regression will be added in the future. + > Multinomial logistic regression can be used for binary classification by setting the `family` param to "multinomial". It will produce two sets of coefficients and two intercepts. > When fitting LogisticRegressionModel without intercept on dataset with constant nonzero column, Spark MLlib outputs zero coefficients for constant nonzero columns. This behavior is the same as R glmnet but different from LIBSVM. +### Binomial logistic regression + +For more background and more details about the implementation of binomial logistic regression, refer to the documentation of [logistic regression in `spark.mllib`](mllib-linear-methods.html#logistic-regression). + **Example** -The following example shows how to train a logistic regression model -with elastic net regularization. `elasticNetParam` corresponds to +The following example shows how to train binomial and multinomial logistic regression +models for binary classification with elastic net regularization. `elasticNetParam` corresponds to $\alpha$ and `regParam` corresponds to $\lambda$. <div class="codetabs"> @@ -92,8 +97,8 @@ provides a summary for a [`LogisticRegressionModel`](api/java/org/apache/spark/ml/classification/LogisticRegressionModel.html). Currently, only binary classification is supported and the summary must be explicitly cast to -[`BinaryLogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/BinaryLogisticRegressionTrainingSummary.html). -This will likely change when multiclass classification is supported. +[`BinaryLogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/BinaryLogisticRegressionTrainingSummary.html). +Support for multiclass model summaries will be added in the future. Continuing the earlier example: @@ -107,6 +112,52 @@ Logistic regression model summary is not yet supported in Python. </div> +### Multinomial logistic regression + +Multiclass classification is supported via multinomial logistic (softmax) regression. In multinomial logistic regression, +the algorithm produces $K$ sets of coefficients, or a matrix of dimension $K \times J$ where $K$ is the number of outcome +classes and $J$ is the number of features. If the algorithm is fit with an intercept term then a length $K$ vector of +intercepts is available. + + > Multinomial coefficients are available as `coefficientMatrix` and intercepts are available as `interceptVector`. + + > `coefficients` and `intercept` methods on a logistic regression model trained with multinomial family are not supported. Use `coefficientMatrix` and `interceptVector` instead. + +The conditional probabilities of the outcome classes $k \in \{1, 2, ..., K\}$ are modeled using the softmax function. + +`\[ + P(Y=k|\mathbf{X}, \boldsymbol{\beta}_k, \beta_{0k}) = \frac{e^{\boldsymbol{\beta}_k \cdot \mathbf{X} + \beta_{0k}}}{\sum_{k'=0}^{K-1} e^{\boldsymbol{\beta}_{k'} \cdot \mathbf{X} + \beta_{0k'}}} +\]` + +We minimize the weighted negative log-likelihood, using a multinomial response model, with elastic-net penalty to control for overfitting. + +`\[ +\min_{\beta, \beta_0} -\left[\sum_{i=1}^L w_i \cdot \log P(Y = y_i|\mathbf{x}_i)\right] + \lambda \left[\frac{1}{2}\left(1 - \alpha\right)||\boldsymbol{\beta}||_2^2 + \alpha ||\boldsymbol{\beta}||_1\right] +\]` + +For a detailed derivation please see [here](https://en.wikipedia.org/wiki/Multinomial_logistic_regression#As_a_log-linear_model). + +**Example** + +The following example shows how to train a multiclass logistic regression +model with elastic net regularization. + +<div class="codetabs"> + +<div data-lang="scala" markdown="1"> +{% include_example scala/org/apache/spark/examples/ml/MulticlassLogisticRegressionWithElasticNetExample.scala %} +</div> + +<div data-lang="java" markdown="1"> +{% include_example java/org/apache/spark/examples/ml/JavaMulticlassLogisticRegressionWithElasticNetExample.java %} +</div> + +<div data-lang="python" markdown="1"> +{% include_example python/ml/multiclass_logistic_regression_with_elastic_net.py %} +</div> + +</div> + ## Decision tree classifier http://git-wip-us.apache.org/repos/asf/spark/blob/9df54f53/examples/src/main/java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java ---------------------------------------------------------------------- diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java index 6101c79..b8fb597 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java @@ -48,6 +48,20 @@ public class JavaLogisticRegressionWithElasticNetExample { // Print the coefficients and intercept for logistic regression System.out.println("Coefficients: " + lrModel.coefficients() + " Intercept: " + lrModel.intercept()); + + // We can also use the multinomial family for binary classification + LogisticRegression mlr = new LogisticRegression() + .setMaxIter(10) + .setRegParam(0.3) + .setElasticNetParam(0.8) + .setFamily("multinomial"); + + // Fit the model + LogisticRegressionModel mlrModel = mlr.fit(training); + + // Print the coefficients and intercepts for logistic regression with multinomial family + System.out.println("Multinomial coefficients: " + + lrModel.coefficientMatrix() + "\nMultinomial intercepts: " + mlrModel.interceptVector()); // $example off$ spark.stop(); http://git-wip-us.apache.org/repos/asf/spark/blob/9df54f53/examples/src/main/java/org/apache/spark/examples/ml/JavaMulticlassLogisticRegressionWithElasticNetExample.java ---------------------------------------------------------------------- diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaMulticlassLogisticRegressionWithElasticNetExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaMulticlassLogisticRegressionWithElasticNetExample.java new file mode 100644 index 0000000..da410cb --- /dev/null +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaMulticlassLogisticRegressionWithElasticNetExample.java @@ -0,0 +1,55 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.examples.ml; + +// $example on$ +import org.apache.spark.ml.classification.LogisticRegression; +import org.apache.spark.ml.classification.LogisticRegressionModel; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.SparkSession; +// $example off$ + +public class JavaMulticlassLogisticRegressionWithElasticNetExample { + public static void main(String[] args) { + SparkSession spark = SparkSession + .builder() + .appName("JavaMulticlassLogisticRegressionWithElasticNetExample") + .getOrCreate(); + + // $example on$ + // Load training data + Dataset<Row> training = spark.read().format("libsvm") + .load("data/mllib/sample_multiclass_classification_data.txt"); + + LogisticRegression lr = new LogisticRegression() + .setMaxIter(10) + .setRegParam(0.3) + .setElasticNetParam(0.8); + + // Fit the model + LogisticRegressionModel lrModel = lr.fit(training); + + // Print the coefficients and intercept for multinomial logistic regression + System.out.println("Coefficients: \n" + + lrModel.coefficientMatrix() + " \nIntercept: " + lrModel.interceptVector()); + // $example off$ + + spark.stop(); + } +} http://git-wip-us.apache.org/repos/asf/spark/blob/9df54f53/examples/src/main/python/ml/logistic_regression_with_elastic_net.py ---------------------------------------------------------------------- diff --git a/examples/src/main/python/ml/logistic_regression_with_elastic_net.py b/examples/src/main/python/ml/logistic_regression_with_elastic_net.py index 33d0689..d095fbd 100644 --- a/examples/src/main/python/ml/logistic_regression_with_elastic_net.py +++ b/examples/src/main/python/ml/logistic_regression_with_elastic_net.py @@ -40,6 +40,16 @@ if __name__ == "__main__": # Print the coefficients and intercept for logistic regression print("Coefficients: " + str(lrModel.coefficients)) print("Intercept: " + str(lrModel.intercept)) + + # We can also use the multinomial family for binary classification + mlr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8, family="multinomial") + + # Fit the model + mlrModel = mlr.fit(training) + + # Print the coefficients and intercepts for logistic regression with multinomial family + print("Multinomial coefficients: " + str(mlrModel.coefficientMatrix)) + print("Multinomial intercepts: " + str(mlrModel.interceptVector)) # $example off$ spark.stop() http://git-wip-us.apache.org/repos/asf/spark/blob/9df54f53/examples/src/main/python/ml/multiclass_logistic_regression_with_elastic_net.py ---------------------------------------------------------------------- diff --git a/examples/src/main/python/ml/multiclass_logistic_regression_with_elastic_net.py b/examples/src/main/python/ml/multiclass_logistic_regression_with_elastic_net.py new file mode 100644 index 0000000..bb9cd82 --- /dev/null +++ b/examples/src/main/python/ml/multiclass_logistic_regression_with_elastic_net.py @@ -0,0 +1,48 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +from __future__ import print_function + +# $example on$ +from pyspark.ml.classification import LogisticRegression +# $example off$ +from pyspark.sql import SparkSession + +if __name__ == "__main__": + spark = SparkSession \ + .builder \ + .appName("MulticlassLogisticRegressionWithElasticNet") \ + .getOrCreate() + + # $example on$ + # Load training data + training = spark \ + .read \ + .format("libsvm") \ + .load("data/mllib/sample_multiclass_classification_data.txt") + + lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) + + # Fit the model + lrModel = lr.fit(training) + + # Print the coefficients and intercept for multinomial logistic regression + print("Coefficients: \n" + str(lrModel.coefficientMatrix)) + print("Intercept: " + str(lrModel.interceptVector)) + # $example off$ + + spark.stop() http://git-wip-us.apache.org/repos/asf/spark/blob/9df54f53/examples/src/main/scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala ---------------------------------------------------------------------- diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala index 616263b..1847104 100644 --- a/examples/src/main/scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala +++ b/examples/src/main/scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala @@ -45,6 +45,19 @@ object LogisticRegressionWithElasticNetExample { // Print the coefficients and intercept for logistic regression println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}") + + // We can also use the multinomial family for binary classification + val mlr = new LogisticRegression() + .setMaxIter(10) + .setRegParam(0.3) + .setElasticNetParam(0.8) + .setFamily("multinomial") + + val mlrModel = mlr.fit(training) + + // Print the coefficients and intercepts for logistic regression with multinomial family + println(s"Multinomial coefficients: ${mlrModel.coefficientMatrix}") + println(s"Multinomial intercepts: ${mlrModel.interceptVector}") // $example off$ spark.stop() http://git-wip-us.apache.org/repos/asf/spark/blob/9df54f53/examples/src/main/scala/org/apache/spark/examples/ml/MulticlassLogisticRegressionWithElasticNetExample.scala ---------------------------------------------------------------------- diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/MulticlassLogisticRegressionWithElasticNetExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/MulticlassLogisticRegressionWithElasticNetExample.scala new file mode 100644 index 0000000..42f0ace --- /dev/null +++ b/examples/src/main/scala/org/apache/spark/examples/ml/MulticlassLogisticRegressionWithElasticNetExample.scala @@ -0,0 +1,57 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +// scalastyle:off println +package org.apache.spark.examples.ml + +// $example on$ +import org.apache.spark.ml.classification.LogisticRegression +// $example off$ +import org.apache.spark.sql.SparkSession + +object MulticlassLogisticRegressionWithElasticNetExample { + + def main(args: Array[String]): Unit = { + val spark = SparkSession + .builder + .appName("MulticlassLogisticRegressionWithElasticNetExample") + .getOrCreate() + + // $example on$ + // Load training data + val training = spark + .read + .format("libsvm") + .load("data/mllib/sample_multiclass_classification_data.txt") + + val lr = new LogisticRegression() + .setMaxIter(10) + .setRegParam(0.3) + .setElasticNetParam(0.8) + + // Fit the model + val lrModel = lr.fit(training) + + // Print the coefficients and intercept for multinomial logistic regression + println(s"Coefficients: \n${lrModel.coefficientMatrix}") + println(s"Intercepts: ${lrModel.interceptVector}") + // $example off$ + + spark.stop() + } +} +// scalastyle:on println --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org