spark git commit: [SPARK-18686][SPARKR][ML] Several cleanup and improvements for spark.logit.

2016-12-07 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 3750c6e9b -> 340e9aea4


[SPARK-18686][SPARKR][ML] Several cleanup and improvements for spark.logit.

## What changes were proposed in this pull request?
Several cleanups and improvements for ```spark.logit```:
* ```summary``` should return the coefficients matrix, and should output labels for each class if the model is a multinomial logistic regression model.
* ```summary``` should not return ```areaUnderROC, roc, pr, ...```, since most of them are DataFrames, which are less important for R users. Moreover, these metrics ignore instance weights (setting all to 1.0), which will change in a later Spark version. To avoid introducing breaking changes at that point, we do not expose them for now.
* SparkR test improvement: compare the training results against native R glmnet (see the sketch after this list).
* Remove argument ```aggregationDepth``` from ```spark.logit```, since it is an expert Param (related to Spark architecture and job execution) that R users would rarely use.
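
A minimal sketch of what such a glmnet comparison can look like. This is not the test code added by this patch, and the exact correspondence between glmnet's ```lambda```/```alpha``` and Spark's ```regParam```/```elasticNetParam``` is an assumption:
```
# Hypothetical comparison sketch, not the actual test added by this patch.
library(glmnet)

# Native R fit on the two-class subset of iris (alpha = 0 is ridge,
# matching spark.logit's default elasticNetParam = 0).
local_data <- iris[iris$Species %in% c("versicolor", "virginica"), ]
x <- as.matrix(local_data[, 1:4])
y <- as.numeric(local_data$Species == "virginica")
r_model <- glmnet(x, y, family = "binomial", alpha = 0, lambda = 0.5)

# SparkR fit on the same data.
df <- suppressWarnings(createDataFrame(local_data))
model <- spark.logit(df, Species ~ ., regParam = 0.5)

# Compare coefficients with a tolerance; the two solvers and their
# regularization scalings are not guaranteed to agree exactly.
print(coef(r_model))
print(summary(model)$coefficients)
```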

## How was this patch tested?
Unit tests.

The ```summary``` output after this change:
multinomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> model <- spark.logit(df, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
             versicolor   virginica    setosa
(Intercept)  1.514031     -2.609108    1.095077
Sepal_Length 0.02511006   0.2649821    -0.2900921
Sepal_Width  -0.5291215   -0.02016446  0.549286
Petal_Length 0.03647411   0.1544119    -0.190886
Petal_Width  0.000236092  0.4195804    -0.4198165
```
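
Given that output, the coefficients matrix can be consumed in R as below; a hedged sketch assuming the column names match the class labels shown above:
```
# Index the per-class coefficients by column (class) or row (feature).
coefs <- summary(model)$coefficients
coefs[, "virginica"]    # coefficient vector for the virginica class
coefs["Sepal_Width", ]  # Sepal_Width's coefficient in each class
```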
binomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> training <- df[df$Species %in% c("versicolor", "virginica"), ]
> model <- spark.logit(training, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
             Estimate
(Intercept)  -6.053815
Sepal_Length 0.2449379
Sepal_Width  0.1648321
Petal_Length 0.4730718
Petal_Width  1.031947
```
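
A small usage sketch for the binomial model above (the "prediction" column follows the existing SparkR convention shown in the old example):
```
# Score the training data and pull a few predictions back to local R.
fitted <- predict(model, training)
head(collect(select(fitted, "Species", "prediction")))
```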

Author: Yanbo Liang 

Closes #16117 from yanboliang/spark-18686.

(cherry picked from commit 90b59d1bf262b41c3a5f780697f504030f9d079c)
Signed-off-by: Yanbo Liang 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/340e9aea
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/340e9aea
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/340e9aea

Branch: refs/heads/branch-2.1
Commit: 340e9aea4853805c42b8739004d93efe8fe16ba4
Parents: 3750c6e
Author: Yanbo Liang 
Authored: Wed Dec 7 00:31:11 2016 -0800
Committer: Yanbo Liang 
Committed: Wed Dec 7 00:32:32 2016 -0800

--
 R/pkg/R/mllib.R |  86 +++--
 R/pkg/inst/tests/testthat/test_mllib.R  | 183 +--
 .../spark/ml/r/LogisticRegressionWrapper.scala  |  81 
 3 files changed, 203 insertions(+), 147 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/340e9aea/R/pkg/R/mllib.R
--
diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R
index eed8293..074e9cb 100644
--- a/R/pkg/R/mllib.R
+++ b/R/pkg/R/mllib.R
@@ -733,8 +733,6 @@ setMethod("predict", signature(object = "KMeansModel"),
 #'  excepting that at most one value may be 0. The class with largest value p/t is predicted, where p
 #'  is the original probability of that class and t is the class's threshold.
 #' @param weightCol The weight column name.
-#' @param aggregationDepth depth for treeAggregate (>= 2). If the dimensions of features or the number of partitions
-#' are large, this param could be adjusted to a larger size.
 #' @param probabilityCol column name for predicted class conditional probabilities.
 #' @param ... additional arguments passed to the method.
 #' @return \code{spark.logit} returns a fitted logistic regression model
@@ -746,45 +744,35 @@ setMethod("predict", signature(object = "KMeansModel"),
 #' \dontrun{
 #' sparkR.session()
 #' # binary logistic regression
-#' label <- c(0.0, 0.0, 0.0, 1.0, 1.0)
-#' features <- c(1.1419053, 0.9194079, -0.9498666, -1.1069903, 0.2809776)
-#' binary_data <- as.data.frame(cbind(label, features))
-#' binary_df <- createDataFrame(binary_data)
-#' blr_model <- spark.logit(binary_df, label ~ features, thresholds = 1.0)
-#' blr_predict <- collect(select(predict(blr_model, binary_df), "prediction"))
-#'
-#' # summary of binary logistic regression
-#' blr_summary <- summary(blr_model)
-#' blr_fmeasure <- collect(select(blr_summary$fMeasureByThreshold, "threshold", "F-Measure"))
+#' df <- createDataFrame(iris)
+#' training <- df[df$Species %in% c("versicolor", "virginica"), ]
+#' model <- spark.logit(training, Species ~ ., regParam = 0.5)
+#' summary <- summary(model)
+#'
+#' # fitted values on training data
+#' fitted <- predict(model, training)
+#'
 #' # save fitted model to input path
 #' path <- "path/to/model"
-#' write.ml(blr_model, path)
+#' write.ml(model, path)