incubator-hivemall git commit: Close #49: [HIVEMALL-26][SPARK][DOC] Make docs for regression and binary classification

myui Thu, 23 Feb 2017 05:09:00 -0800

Repository: incubator-hivemall
Updated Branches:
  refs/heads/master 054a697eb -> a89b9b8c1



Close #49: [HIVEMALL-26][SPARK][DOC] Make docs for regression and binary classification


Project: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/commit/a89b9b8c
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/tree/a89b9b8c
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/diff/a89b9b8c

Branch: refs/heads/master
Commit: a89b9b8c19dd9fd645792d5e54a03fe697743ae1
Parents: 054a697
Author: Takeshi Yamamuro <yamam...@apache.org>
Authored: Thu Feb 23 22:07:25 2017 +0900
Committer: myui <yuin...@gmail.com>
Committed: Thu Feb 23 22:07:25 2017 +0900

----------------------------------------------------------------------
 docs/gitbook/SUMMARY.md                   |  20 +++--
 docs/gitbook/spark/binaryclass/a9a_df.md  | 100 ++++++++++++++++++++++++
 docs/gitbook/spark/binaryclass/index.md   |  18 +++++
 docs/gitbook/spark/regression/e2006_df.md | 104 +++++++++++++++++++++++++
 docs/gitbook/spark/regression/index.md    |  18 +++++
 resources/ddl/import-packages.spark       |   8 +-
 6 files changed, 255 insertions(+), 13 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a89b9b8c/docs/gitbook/SUMMARY.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/SUMMARY.md b/docs/gitbook/SUMMARY.md
index 5b080d2..40f20a8 100644
--- a/docs/gitbook/SUMMARY.md
+++ b/docs/gitbook/SUMMARY.md
@@ -16,14 +16,14 @@
   specific language governing permissions and limitations
   under the License.
 -->
-        
+
 # Summary
 
 ## TABLE OF CONTENTS
 
 * [Getting Started](getting_started/README.md)
     * [Installation](getting_started/installation.md)
-    * [Install as permanent functions](getting_started/permanent-functions.md) 
+    * [Install as permanent functions](getting_started/permanent-functions.md)
     * [Input Format](getting_started/input-format.md)
 
 * [Tips for Effective Hivemall](tips/README.md)
@@ -86,7 +86,7 @@
 * [KDD2010a Tutorial](binaryclass/kdd2010a.md)
     * [Data preparation](binaryclass/kdd2010a_dataset.md)
     * [PA, CW, AROW, SCW](binaryclass/kdd2010a_scw.md)
-    
+
 * [KDD2010b Tutorial](binaryclass/kdd2010b.md)
     * [Data preparation](binaryclass/kdd2010b_dataset.md)
     * [AROW](binaryclass/kdd2010b_arow.md)
@@ -96,7 +96,7 @@
     * [PA1, AROW, SCW](binaryclass/webspam_scw.md)
 
 * [Kaggle Titanic Tutorial](binaryclass/titanic_rf.md)
-    
+
 ## Part VI - Multiclass classification
 
 * [News20 Multiclass Tutorial](multiclass/news20.md)
@@ -106,12 +106,12 @@
     * [CW, AROW, SCW](multiclass/news20_scw.md)
     * [Ensemble learning](multiclass/news20_ensemble.md)
     * [one-vs-the-rest classifier](multiclass/news20_one-vs-the-rest.md)
-    
+
 * [Iris Tutorial](multiclass/iris.md)
     * [Data preparation](multiclass/iris_dataset.md)
     * [SCW](multiclass/iris_scw.md)
     * [RandomForest](multiclass/iris_randomforest.md)
-    
+
 ## Part VII - Regression
 
 * [E2006-tfidf regression Tutorial](regression/e2006.md)
@@ -139,7 +139,7 @@
     * [Data preparation](recommend/movielens_dataset.md)
     * [Matrix Factorization](recommend/movielens_mf.md)
     * [Factorization Machine](recommend/movielens_fm.md)
-    * [10-fold Cross Validation (Matrix Factorization)](recommend/movielens_cv.md)    
+    * [10-fold Cross Validation (Matrix Factorization)](recommend/movielens_cv.md)
 
 ## Part IX - Anomaly Detection
 
@@ -149,6 +149,12 @@
 
 ## Part X - Hivemall on Spark
 
+* [Binary Classification](spark/binaryclass/index.md)
+    * [a9a Tutorial for DataFrame](spark/binaryclass/a9a_df.md)
+
+* [Regression](spark/regression/index.md)
+    * [E2006-tfidf regression Tutorial for DataFrame](spark/regression/e2006_df.md)
+
 * [Generic features](spark/misc/misc.md)
     * [Top-k Join processing](spark/misc/topk_join.md)
 

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a89b9b8c/docs/gitbook/spark/binaryclass/a9a_df.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/spark/binaryclass/a9a_df.md b/docs/gitbook/spark/binaryclass/a9a_df.md
new file mode 100644
index 0000000..7c3de67
--- /dev/null
+++ b/docs/gitbook/spark/binaryclass/a9a_df.md
@@ -0,0 +1,100 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+a9a
+===
+http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#a9a
+
+Data preparation
+================
+
+```sh
+$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a9a
+$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a9a.t
+```
+
+```scala
+scala> :paste
+val trainDf = spark.read.format("libsvm").load("a9a")
+  .select(
+    // `label` must be in [0.0, 1.0], so rescale it from [-1.0, 1.0]
+    rescale($"label", lit(-1.0f), lit(1.0f)).as("label"),
+    $"features"
+  )
+
+scala> trainDf.printSchema
+root
+ |-- label: float (nullable = true)
+ |-- features: vector (nullable = true)
+
+scala> :paste
+val testDf = spark.read.format("libsvm").load("a9a.t")
+  .select(rowid(), rescale($"label", lit(-1.0f), lit(1.0f)).as("label"), $"features")
+  .explode_vector($"features")
+  .select($"rowid", $"label".as("target"), $"feature", $"weight".as("value"))
+  .cache
+
+scala> testDf.printSchema
+root
+ |-- rowid: string (nullable = true)
+ |-- target: float (nullable = true)
+ |-- feature: string (nullable = true)
+ |-- value: double (nullable = true)
+```
+
+Tutorials
+================
+
+[Logistic Regression]
+---
+
+### Training
+
+```scala
+scala> :paste
+val modelDf = trainDf
+  .train_logregr(append_bias($"features"), $"label")
+  .groupBy("feature").avg("weight")
+  .toDF("feature", "weight")
+  .cache
+```
+
+### Test
+
+```scala
+scala> :paste
+val predictDf = testDf
+  .join(modelDf, testDf("feature") === modelDf("feature"), "LEFT_OUTER")
+  .select($"rowid", ($"weight" * $"value").as("value"))
+  .groupBy("rowid").sum("value")
+  .select(
+    $"rowid",
+    when(sigmoid($"sum(value)") > 0.5, 1.0).otherwise(0.0).as("predicted")
+  )
+```
+
+### Evaluation
+
+```scala
+scala> val df = predictDf.join(testDf, predictDf("rowid").as("id") === testDf("rowid"), "INNER")
+
+scala> (df.where($"target" === $"predicted").count + 0.0) / df.count
+Double = 0.8327921286841418
+```
+
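The a9a tutorial above rescales labels into [0.0, 1.0], scores each test row as the sum of weight times value over its features, and thresholds the sigmoid at 0.5. The following is a minimal plain-Scala sketch of that arithmetic, not the Hivemall implementation: it assumes `rescale` is min-max normalization, and the object `LogregrSketch` and its methods are hypothetical names introduced only for illustration.

```scala
// Plain-Scala sketch of the prediction arithmetic (no Spark or Hivemall required).
// Assumption: rescale(value, min, max) is min-max normalization.
object LogregrSketch {
  // Maps a value in [min, max] onto [0.0, 1.0].
  def rescale(value: Double, min: Double, max: Double): Double =
    (value - min) / (max - min)

  def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

  // A row is a sparse map from feature name to value.
  // Features absent from the model contribute nothing to the score,
  // mirroring the LEFT_OUTER join followed by sum().
  def predict(row: Map[String, Double], model: Map[String, Double]): Double = {
    val score = row.foldLeft(0.0) { case (acc, (feature, value)) =>
      acc + model.getOrElse(feature, 0.0) * value
    }
    if (sigmoid(score) > 0.5) 1.0 else 0.0
  }

  // Fraction of rows whose predicted label equals the target label.
  def accuracy(predicted: Seq[Double], target: Seq[Double]): Double =
    predicted.zip(target).count { case (p, t) => p == t }.toDouble / target.size
}
```

With this definition, the a9a labels -1.0 and 1.0 map to 0.0 and 1.0, which is why the tutorial compares `target` against a 0.0/1.0 `predicted` column.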

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a89b9b8c/docs/gitbook/spark/binaryclass/index.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/spark/binaryclass/index.md b/docs/gitbook/spark/binaryclass/index.md
new file mode 100644
index 0000000..0475c9c
--- /dev/null
+++ b/docs/gitbook/spark/binaryclass/index.md
@@ -0,0 +1,18 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a89b9b8c/docs/gitbook/spark/regression/e2006_df.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/spark/regression/e2006_df.md b/docs/gitbook/spark/regression/e2006_df.md
new file mode 100644
index 0000000..5980e3e
--- /dev/null
+++ b/docs/gitbook/spark/regression/e2006_df.md
@@ -0,0 +1,104 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+E2006
+===
+http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf
+
+Data preparation
+================
+
+```sh
+$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
+$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2
+```
+
+```scala
+scala> :paste
+val trainDf = spark.read.format("libsvm").load("E2006.train.bz2")
+  .select(
+    // `label` must be in [0.0, 1.0], so rescale it from its observed [min, max] range
+    rescale($"label", lit(-7.899578f), lit(-0.51940954f)).as("label"),
+    $"features"
+  )
+
+scala> trainDf.printSchema
+root
+ |-- label: float (nullable = true)
+ |-- features: vector (nullable = true)
+
+scala> :paste
+val testDf = spark.read.format("libsvm").load("E2006.test.bz2")
+  .select(rowid(), rescale($"label", lit(-7.899578f), lit(-0.51940954f)).as("label"), $"features")
+  .explode_vector($"features")
+  .select($"rowid", $"label".as("target"), $"feature", $"weight".as("value"))
+  .cache
+
+scala> testDf.printSchema
+root
+ |-- rowid: string (nullable = true)
+ |-- target: float (nullable = true)
+ |-- feature: string (nullable = true)
+ |-- value: double (nullable = true)
+```
+
+Tutorials
+================
+
+[AROWe2]
+---
+
+### Training
+
+```scala
+scala> :paste
+val modelDf = trainDf
+  .train_arowe2_regr(append_bias($"features"), $"label")
+  .groupBy("feature").avg("weight")
+  .toDF("feature", "weight")
+  .cache
+```
+
+### Test
+
+```scala
+scala> :paste
+val predictDf = testDf
+  .join(modelDf, testDf("feature") === modelDf("feature"), "LEFT_OUTER")
+  .select($"rowid", ($"weight" * $"value").as("value"))
+  .groupBy("rowid").sum("value")
+  .select($"rowid", sigmoid($"sum(value)").as("predicted"))
+```
+
+### Evaluation
+
+```scala
+scala> :paste
+predictDf
+  .join(testDf, predictDf("rowid").as("id") === testDf("rowid"), "INNER")
+  .groupBy().avg("target", "predicted")
+  .show()
+
++------------------+------------------+
+|       avg(target)|    avg(predicted)|
++------------------+------------------+
+|0.5489154884487879|0.6030108853227014|
++------------------+------------------+
+```
+
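The E2006 evaluation above compares only the averages of `target` and `predicted`, both on the rescaled [0.0, 1.0] scale. A per-row error such as MAE or RMSE could complement it. The following plain-Scala sketch shows those metrics and how a rescaled value maps back to the original label range, assuming `rescale` is min-max normalization with the bounds used in the tutorial. `RegressionEvalSketch` and its members are hypothetical names and are not part of Hivemall.

```scala
// Plain-Scala sketch of per-row regression metrics (no Spark or Hivemall required).
// Assumption: labels were min-max normalized with the bounds below, taken from the tutorial.
object RegressionEvalSketch {
  val LabelMin = -7.899578
  val LabelMax = -0.51940954

  // Min-max normalization onto [0.0, 1.0] and its inverse.
  def rescale(v: Double): Double = (v - LabelMin) / (LabelMax - LabelMin)
  def unscale(v: Double): Double = v * (LabelMax - LabelMin) + LabelMin

  // Each pair is (target, predicted) for one row.
  def mae(pairs: Seq[(Double, Double)]): Double =
    pairs.map { case (t, p) => math.abs(t - p) }.sum / pairs.size

  def rmse(pairs: Seq[(Double, Double)]): Double =
    math.sqrt(pairs.map { case (t, p) => (t - p) * (t - p) }.sum / pairs.size)
}
```

Errors computed on rescaled values can be expressed in the original label units by multiplying them by `LabelMax - LabelMin`.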

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a89b9b8c/docs/gitbook/spark/regression/index.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/spark/regression/index.md b/docs/gitbook/spark/regression/index.md
new file mode 100644
index 0000000..0475c9c
--- /dev/null
+++ b/docs/gitbook/spark/regression/index.md
@@ -0,0 +1,18 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a89b9b8c/resources/ddl/import-packages.spark
----------------------------------------------------------------------
diff --git a/resources/ddl/import-packages.spark b/resources/ddl/import-packages.spark
index 2015cd8..7476ae3 100644
--- a/resources/ddl/import-packages.spark
+++ b/resources/ddl/import-packages.spark
@@ -2,12 +2,8 @@
  * An initialization script for DataFrame use
  */
 
-import org.apache.spark.sql._
-import org.apache.spark.sql.functions._
-import org.apache.spark.sql.types._
 import org.apache.spark.sql.hive.HivemallOps._
-import org.apache.spark.sql.hive.HivemallUtils
-import hivemall.xgboost.XGBoostOptions
-// Needed for implicit conversions
+import org.apache.spark.sql.hive.HivemallGroupedDataset._
 import org.apache.spark.sql.hive.HivemallUtils._
+import hivemall.xgboost.XGBoostOptions
