Repository: spark
Updated Branches:
  refs/heads/master e391abdf2 -> e222d7584


[SPARK-11684][R][ML][DOC] Update SparkR glm API doc, user guide and example code

This PR:
* Updates the SparkR:::glm and SparkR:::summary API docs.
* Updates the SparkR machine learning user guide and example code to show:
  * feature interaction support in R formulas.
  * the summary of a gaussian GLM model.
  * the coefficients of a binomial GLM model.

mengxr

Author: Yanbo Liang <yblia...@gmail.com>

Closes #9727 from yanboliang/spark-11684.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e222d758
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e222d758
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e222d758

Branch: refs/heads/master
Commit: e222d758499ad2609046cc1a2cc8afb45c5bccbb
Parents: e391abd
Author: Yanbo Liang <yblia...@gmail.com>
Authored: Wed Nov 18 13:30:29 2015 -0800
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Nov 18 13:30:29 2015 -0800

----------------------------------------------------------------------
 R/pkg/R/mllib.R                                 | 18 +++++--
 docs/sparkr.md                                  | 50 ++++++++++++++++----
 .../spark/ml/regression/LinearRegression.scala  |  3 ++
 3 files changed, 60 insertions(+), 11 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/e222d758/R/pkg/R/mllib.R
----------------------------------------------------------------------
diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R
index f23e1c7..8d3b438 100644
--- a/R/pkg/R/mllib.R
+++ b/R/pkg/R/mllib.R
@@ -32,6 +32,12 @@ setClass("PipelineModel", representation(model = "jobj"))
 #' @param family Error distribution. "gaussian" -> linear regression, "binomial" -> logistic reg.
 #' @param lambda Regularization parameter
 #' @param alpha Elastic-net mixing parameter (see glmnet's documentation for details)
+#' @param standardize Whether to standardize features before training
+#' @param solver The solver algorithm used for optimization: "l-bfgs", "normal", or
+#'               "auto". "l-bfgs" denotes Limited-memory BFGS, a limited-memory
+#'               quasi-Newton optimization method. "normal" denotes using the normal equation as
+#'               an analytical solution to the linear regression problem. The default is "auto",
+#'               which means that the solver algorithm is selected automatically.
 #' @return a fitted MLlib model
 #' @rdname glm
 #' @export
@@ -79,9 +85,15 @@ setMethod("predict", signature(object = "PipelineModel"),
 #'
 #' Returns the summary of a model produced by glm(), similarly to R's summary().
 #'
-#' @param x A fitted MLlib model
-#' @return a list with a 'coefficient' component, which is the matrix of coefficients. See
-#'         summary.glm for more information.
+#' @param object A fitted MLlib model
+#' @return a list with 'devianceResiduals' and 'coefficients' components for the gaussian family,
+#'         or a list with a 'coefficients' component for the binomial family. \cr
+#'         For the gaussian family: 'devianceResiduals' gives the min/max deviance residuals
+#'         of the estimation; 'coefficients' gives the estimated coefficients and their
+#'         estimated standard errors, t values and p-values. (It is only available when the
+#'         model is fitted by the normal solver.) \cr
+#'         For the binomial family: 'coefficients' gives the estimated coefficients.
+#'         See summary.glm for more information. \cr
 #' @rdname summary
 #' @export
 #' @examples

http://git-wip-us.apache.org/repos/asf/spark/blob/e222d758/docs/sparkr.md
----------------------------------------------------------------------
diff --git a/docs/sparkr.md b/docs/sparkr.md
index 437bd47..a744b76 100644
--- a/docs/sparkr.md
+++ b/docs/sparkr.md
@@ -286,24 +286,37 @@ head(teenagers)
 
 # Machine Learning
 
-SparkR allows the fitting of generalized linear models over DataFrames using the [glm()](api/R/glm.html) function. Under the hood, SparkR uses MLlib to train a model of the specified family. Currently the gaussian and binomial families are supported. We support a subset of the available R formula operators for model fitting, including '~', '.', '+', and '-'. The example below shows the use of building a gaussian GLM model using SparkR.
+SparkR allows the fitting of generalized linear models over DataFrames using the [glm()](api/R/glm.html) function. Under the hood, SparkR uses MLlib to train a model of the specified family. Currently the gaussian and binomial families are supported. We support a subset of the available R formula operators for model fitting, including '~', '.', ':', '+', and '-'.
+
+The [summary()](api/R/summary.html) function gives the summary of a model produced by [glm()](api/R/glm.html).
+
+* For a gaussian GLM model, it returns a list with 'devianceResiduals' and 'coefficients' components. 'devianceResiduals' gives the min/max deviance residuals of the estimation; 'coefficients' gives the estimated coefficients and their estimated standard errors, t values and p-values. (It is only available when the model is fitted by the normal solver.)
+* For a binomial GLM model, it returns a list with a 'coefficients' component, which gives the estimated coefficients.
+
+The examples below show how to build a gaussian GLM model and a binomial GLM model using SparkR.
+
+## Gaussian GLM model
 
 <div data-lang="r"  markdown="1">
 {% highlight r %}
 # Create the DataFrame
 df <- createDataFrame(sqlContext, iris)
 
-# Fit a linear model over the dataset.
+# Fit a gaussian GLM model over the dataset.
 model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
 
-# Model coefficients are returned in a similar format to R's native glm().
+# The model summary is returned in a similar format to R's native glm().
 summary(model)
+##$devianceResiduals
+## Min       Max     
+## -1.307112 1.412532
+##
 ##$coefficients
-##                    Estimate
-##(Intercept)        2.2513930
-##Sepal_Width        0.8035609
-##Species_versicolor 1.4587432
-##Species_virginica  1.9468169
+##                   Estimate  Std. Error t value  Pr(>|t|)    
+##(Intercept)        2.251393  0.3697543  6.08889  9.568102e-09
+##Sepal_Width        0.8035609 0.106339   7.556598 4.187317e-12
+##Species_versicolor 1.458743  0.1121079  13.01195 0           
+##Species_virginica  1.946817  0.100015   19.46525 0           
 
 # Make predictions based on the model.
 predictions <- predict(model, newData = df)
@@ -317,3 +330,24 @@ head(select(predictions, "Sepal_Length", "prediction"))
 ##6          5.4   5.385281
 {% endhighlight %}
 </div>
+
+## Binomial GLM model
+
+<div data-lang="r"  markdown="1">
+{% highlight r %}
+# Create the DataFrame
+df <- createDataFrame(sqlContext, iris)
+training <- filter(df, df$Species != "setosa")
+
+# Fit a binomial GLM model over the dataset.
+model <- glm(Species ~ Sepal_Length + Sepal_Width, data = training, family = "binomial")
+
+# Model coefficients are returned in a similar format to R's native glm().
+summary(model)
+##$coefficients
+##               Estimate
+##(Intercept)  -13.046005
+##Sepal_Length   1.902373
+##Sepal_Width    0.404655
+{% endhighlight %}
+</div>
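
The user guide change above adds ':' to the supported formula operators, but neither example uses it. As a rough illustration of what an interaction term means, the sketch below uses base R only (no SparkR, no Spark context) and the built-in `iris` data with its original dotted column names. It assumes that SparkR's ':' is meant to follow the same idea as R's; the exact columns SparkR generates are not confirmed here.

```r
# Base R illustration only; this does not call SparkR or MLlib.
# Expand a formula with an interaction term into its design matrix.
X <- model.matrix(Sepal.Length ~ Sepal.Width + Species + Sepal.Width:Species,
                  data = iris)

# Each interaction column is the element-wise product of the numeric
# column and the corresponding species dummy column.
interaction_col <- X[, "Sepal.Width:Speciesversicolor"]
product_col <- X[, "Sepal.Width"] * X[, "Speciesversicolor"]

print(colnames(X))
print(all.equal(unname(interaction_col), unname(product_col)))
```

In other words, `a:b` adds the product of the two terms as extra features, which lets the slope for `Sepal.Width` differ by species.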

http://git-wip-us.apache.org/repos/asf/spark/blob/e222d758/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
index ca55d59..f7c44f0 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
@@ -145,6 +145,9 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
   /**
    * Set the solver algorithm used for optimization.
    * In case of linear regression, this can be "l-bfgs", "normal" and "auto".
+   * "l-bfgs" denotes Limited-memory BFGS which is a limited-memory quasi-Newton
+   * optimization method. "normal" denotes using Normal Equation as an analytical
+   * solution to the linear regression problem.
    * The default value is "auto" which means that the solver algorithm is
    * selected automatically.
    * @group setParam
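
The solver documentation above describes "normal" as using the normal equation as an analytical solution to the linear regression problem. The sketch below shows that idea in base R on the built-in `iris` data. It is not MLlib's implementation and does not invoke SparkR; the variable names are illustrative only.

```r
# Base R illustration only; MLlib's "normal" solver is not invoked here.
# Build the design matrix (intercept, Sepal.Width, dummy-coded Species).
X <- model.matrix(Sepal.Length ~ Sepal.Width + Species, data = iris)
y <- iris$Sepal.Length

# Solve the normal equation (X'X) beta = X'y directly.
beta <- solve(crossprod(X), crossprod(X, y))

# Compare against R's built-in least-squares fit, which uses a QR
# decomposition rather than the normal equation.
fit <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
print(cbind(normal_equation = as.numeric(beta), lm = as.numeric(coef(fit))))
```

Because the solution is closed-form, quantities such as standard errors follow directly from the same matrices, which is consistent with the note above that the full coefficient table is only available when the model is fitted by the normal solver.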

