Repository: incubator-hivemall
Updated Branches:
  refs/heads/master 054a697eb -> a89b9b8c1

Close #49: [HIVEMALL-26][SPARK][DOC] Make docs for regression and binary classification

Project: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/commit/a89b9b8c
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/tree/a89b9b8c
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/diff/a89b9b8c

Branch: refs/heads/master
Commit: a89b9b8c19dd9fd645792d5e54a03fe697743ae1
Parents: 054a697
Author: Takeshi Yamamuro <yamam...@apache.org>
Authored: Thu Feb 23 22:07:25 2017 +0900
Committer: myui <yuin...@gmail.com>
Committed: Thu Feb 23 22:07:25 2017 +0900

----------------------------------------------------------------------
 docs/gitbook/SUMMARY.md                   |  20 +++--
 docs/gitbook/spark/binaryclass/a9a_df.md  | 100 ++++++++++++++++++++++++
 docs/gitbook/spark/binaryclass/index.md   |  18 +++++
 docs/gitbook/spark/regression/e2006_df.md | 104 +++++++++++++++++++++++++
 docs/gitbook/spark/regression/index.md    |  18 +++++
 resources/ddl/import-packages.spark       |   8 +-
 6 files changed, 255 insertions(+), 13 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a89b9b8c/docs/gitbook/SUMMARY.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/SUMMARY.md b/docs/gitbook/SUMMARY.md
index 5b080d2..40f20a8 100644
--- a/docs/gitbook/SUMMARY.md
+++ b/docs/gitbook/SUMMARY.md
@@ -16,14 +16,14 @@
   specific language governing permissions and limitations
   under the License.
 -->
-
+
 # Summary

 ## TABLE OF CONTENTS

 * [Getting Started](getting_started/README.md)
     * [Installation](getting_started/installation.md)
-    * [Install as permanent functions](getting_started/permanent-functions.md)
+    * [Install as permanent functions](getting_started/permanent-functions.md)
     * [Input Format](getting_started/input-format.md)

 * [Tips for Effective Hivemall](tips/README.md)
@@ -86,7 +86,7 @@
 * [KDD2010a Tutorial](binaryclass/kdd2010a.md)
     * [Data preparation](binaryclass/kdd2010a_dataset.md)
     * [PA, CW, AROW, SCW](binaryclass/kdd2010a_scw.md)
-
+
 * [KDD2010b Tutorial](binaryclass/kdd2010b.md)
     * [Data preparation](binaryclass/kdd2010b_dataset.md)
     * [AROW](binaryclass/kdd2010b_arow.md)
@@ -96,7 +96,7 @@
     * [PA1, AROW, SCW](binaryclass/webspam_scw.md)

 * [Kaggle Titanic Tutorial](binaryclass/titanic_rf.md)
-
+
 ## Part VI - Multiclass classification

 * [News20 Multiclass Tutorial](multiclass/news20.md)
@@ -106,12 +106,12 @@
     * [CW, AROW, SCW](multiclass/news20_scw.md)
     * [Ensemble learning](multiclass/news20_ensemble.md)
     * [one-vs-the-rest classifier](multiclass/news20_one-vs-the-rest.md)
-
+
 * [Iris Tutorial](multiclass/iris.md)
     * [Data preparation](multiclass/iris_dataset.md)
     * [SCW](multiclass/iris_scw.md)
     * [RandomForest](multiclass/iris_randomforest.md)
-
+
 ## Part VII - Regression

 * [E2006-tfidf regression Tutorial](regression/e2006.md)
@@ -139,7 +139,7 @@
     * [Data preparation](recommend/movielens_dataset.md)
     * [Matrix Factorization](recommend/movielens_mf.md)
     * [Factorization Machine](recommend/movielens_fm.md)
-    * [10-fold Cross Validation (Matrix Factorization)](recommend/movielens_cv.md)
+    * [10-fold Cross Validation (Matrix Factorization)](recommend/movielens_cv.md)

 ## Part IX - Anomaly Detection

@@ -149,6 +149,12 @@

 ## Part X - Hivemall on Spark

+* [Binary Classification](spark/binaryclass/index.md)
+    * [a9a Tutorial for DataFrame](spark/binaryclass/a9a_df.md)
+
+* [Regression](spark/regression/index.md)
+    * [E2006-tfidf regression Tutorial for DataFrame](spark/regression/e2006_df.md)
+
 * [Generic features](spark/misc/misc.md)
     * [Top-k Join processing](spark/misc/topk_join.md)

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a89b9b8c/docs/gitbook/spark/binaryclass/a9a_df.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/spark/binaryclass/a9a_df.md b/docs/gitbook/spark/binaryclass/a9a_df.md
new file mode 100644
index 0000000..7c3de67
--- /dev/null
+++ b/docs/gitbook/spark/binaryclass/a9a_df.md
@@ -0,0 +1,100 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+a9a
+===
+http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#a9a
+
+Data preparation
+================
+
+```sh
+$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a9a
+$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a9a.t
+```
+
+```scala
+scala> :paste
+val trainDf = spark.read.format("libsvm").load("a9a")
+  .select(
+    // `label` must be [0.0, 1.0]
+    rescale($"label", lit(-1.0f), lit(1.0f)).as("label"),
+    $"features"
+  )
+
+scala> trainDf.printSchema
+root
+ |-- label: float (nullable = true)
+ |-- features: vector (nullable = true)
+
+scala> :paste
+val testDf = spark.read.format("libsvm").load("a9a.t")
+  .select(rowid(), rescale($"label", lit(-1.0f), lit(1.0f)).as("label"), $"features")
+  .explode_vector($"features")
+  .select($"rowid", $"label".as("target"), $"feature", $"weight".as("value"))
+  .cache
+
+scala> testDf.printSchema
+root
+ |-- rowid: string (nullable = true)
+ |-- target: float (nullable = true)
+ |-- feature: string (nullable = true)
+ |-- value: double (nullable = true)
+```
+
+Tutorials
+================
+
+[Logistic Regression]
+---
+
+#Training
+
+```scala
+scala> :paste
+val modelDf = trainDf
+  .train_logregr(append_bias($"features"), $"label")
+  .groupBy("feature").avg("weight")
+  .toDF("feature", "weight")
+  .cache
+```
+
+#Test
+
+```scala
+scala> :paste
+val predictDf = testDf
+  .join(modelDf, testDf("feature") === modelDf("feature"), "LEFT_OUTER")
+  .select($"rowid", ($"weight" * $"value").as("value"))
+  .groupBy("rowid").sum("value")
+  .select(
+    $"rowid",
+    when(sigmoid($"sum(value)") > 0.5, 1.0).otherwise(0.0).as("predicted")
+  )
+```
+
+#Evaluation
+
+```scala
+scala> val df = predictDf.join(testDf, predictDf("rowid").as("id") === testDf("rowid"), "INNER")
+
+scala> (df.where($"target" === $"predicted").count + 0.0) / df.count
+Double = 0.8327921286841418
+```
+

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a89b9b8c/docs/gitbook/spark/binaryclass/index.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/spark/binaryclass/index.md b/docs/gitbook/spark/binaryclass/index.md
new file mode 100644
index 0000000..0475c9c
--- /dev/null
+++ b/docs/gitbook/spark/binaryclass/index.md
@@ -0,0 +1,18 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a89b9b8c/docs/gitbook/spark/regression/e2006_df.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/spark/regression/e2006_df.md b/docs/gitbook/spark/regression/e2006_df.md
new file mode 100644
index 0000000..5980e3e
--- /dev/null
+++ b/docs/gitbook/spark/regression/e2006_df.md
@@ -0,0 +1,104 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+E2006
+===
+http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf
+
+Data preparation
+================
+
+```sh
+$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
+$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2
+```
+
+```scala
+scala> :paste
+val trainDf = spark.read.format("libsvm").load("E2006.train.bz2")
+  .select(
+    // `label` must be [0.0, 1.0]
+    rescale($"label", lit(-7.899578f), lit(-0.51940954f)).as("label"),
+    $"features"
+  )
+
+scala> trainDf.printSchema
+root
+ |-- label: float (nullable = true)
+ |-- features: vector (nullable = true)
+
+scala> :paste
+val testDf = spark.read.format("libsvm").load("E2006.test.bz2")
+  .select(rowid(), rescale($"label", lit(-7.899578f), lit(-0.51940954f)).as("label"), $"features")
+  .explode_vector($"features")
+  .select($"rowid", $"label".as("target"), $"feature", $"weight".as("value"))
+  .cache
+
+scala> testDf.printSchema
+root
+ |-- rowid: string (nullable = true)
+ |-- target: float (nullable = true)
+ |-- feature: string (nullable = true)
+ |-- value: double (nullable = true)
+```
+
+Tutorials
+================
+
+[AROWe2]
+---
+
+#Training
+
+```scala
+scala> :paste
+val modelDf = trainDf
+  .train_arowe2_regr(append_bias($"features"), $"label")
+  .groupBy("feature").avg("weight")
+  .toDF("feature", "weight")
+  .cache
+```
+
+#Test
+
+```scala
+scala> :paste
+val predictDf = testDf
+  .join(modelDf, testDf("feature") === modelDf("feature"), "LEFT_OUTER")
+  .select($"rowid", ($"weight" * $"value").as("value"))
+  .groupBy("rowid").sum("value")
+  .select($"rowid", sigmoid($"sum(value)").as("predicted"))
+```
+
+#Evaluation
+
+```scala
+scala> :paste
+predictDf
+  .join(testDf, predictDf("rowid").as("id") === testDf("rowid"), "INNER")
+  .groupBy().avg("target", "predicted")
+  .show()
+
++------------------+------------------+
+|       avg(target)|    avg(predicted)|
++------------------+------------------+
+|0.5489154884487879|0.6030108853227014|
++------------------+------------------+
+```
+

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a89b9b8c/docs/gitbook/spark/regression/index.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/spark/regression/index.md b/docs/gitbook/spark/regression/index.md
new file mode 100644
index 0000000..0475c9c
--- /dev/null
+++ b/docs/gitbook/spark/regression/index.md
@@ -0,0 +1,18 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a89b9b8c/resources/ddl/import-packages.spark
----------------------------------------------------------------------
diff --git a/resources/ddl/import-packages.spark b/resources/ddl/import-packages.spark
index 2015cd8..7476ae3 100644
--- a/resources/ddl/import-packages.spark
+++ b/resources/ddl/import-packages.spark
@@ -2,12 +2,8 @@
  * An initialization script for DataFrame use
  */

-import org.apache.spark.sql._
-import org.apache.spark.sql.functions._
-import org.apache.spark.sql.types._
 import org.apache.spark.sql.hive.HivemallOps._
-import org.apache.spark.sql.hive.HivemallUtils
-import hivemall.xgboost.XGBoostOptions
-// Needed for implicit conversions
+import org.apache.spark.sql.hive.HivemallGroupedDataset._
 import org.apache.spark.sql.hive.HivemallUtils._
+import hivemall.xgboost.XGBoostOptions
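
Reviewer note on the a9a tutorial added in this commit: the test step joins exploded test rows with the model, sums `weight * value` per `rowid`, and thresholds the sigmoid at 0.5. A minimal plain-Scala sketch of that arithmetic follows, which may help when sanity-checking results without a Spark session. It is not Hivemall code: `PredictSketch` and its methods are hypothetical names, and `rescale` is assumed here to be ordinary min-max normalization.

```scala
// Plain-Scala sketch (no Spark or Hivemall required); all names are illustrative.
object PredictSketch {
  // Assumption: rescale(value, min, max) is min-max normalization into [0, 1].
  def rescale(value: Double, min: Double, max: Double): Double =
    (value - min) / (max - min)

  def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

  // model: feature -> averaged weight; row: sparse (feature, value) pairs.
  // Features absent from the model contribute nothing to the score, similar to
  // how null products from the LEFT_OUTER join are ignored by sum().
  def score(model: Map[String, Double], row: Seq[(String, Double)]): Double =
    row.flatMap { case (feature, value) => model.get(feature).map(_ * value) }.sum

  // Binary prediction: 1.0 if sigmoid(score) exceeds 0.5, otherwise 0.0.
  def predictLabel(model: Map[String, Double], row: Seq[(String, Double)]): Double =
    if (sigmoid(score(model, row)) > 0.5) 1.0 else 0.0
}
```

For example, `PredictSketch.rescale(-1.0, -1.0, 1.0)` maps the negative a9a label to `0.0`, which is why the tutorial compares the thresholded prediction against a `target` in {0.0, 1.0}.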