Repository: incubator-hivemall Updated Branches: refs/heads/master 2915a78c7 -> c06378a81
Close #88: [HIVEMALL-50] Add a description about Feature Pairing Project: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/commit/c06378a8 Tree: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/tree/c06378a8 Diff: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/diff/c06378a8 Branch: refs/heads/master Commit: c06378a81723e3998f90c08ec7444ead5b6f2263 Parents: 2915a78 Author: Makoto Yui <[email protected]> Authored: Fri Jun 23 18:56:57 2017 +0900 Committer: Makoto Yui <[email protected]> Committed: Fri Jun 23 18:56:57 2017 +0900 ---------------------------------------------------------------------- docs/gitbook/SUMMARY.md | 38 ++++++++------ docs/gitbook/binaryclass/general.md | 6 ++- docs/gitbook/clustering/lda.md | 48 +++++++++-------- docs/gitbook/clustering/plsa.md | 48 ++++++++++------- docs/gitbook/eval/auc.md | 9 +++- docs/gitbook/ft_engineering/pairing.md | 19 +++++++ docs/gitbook/ft_engineering/polynomial.md | 73 ++++++++++++++++++++++++++ docs/gitbook/misc/prediction.md | 36 +++++++------ 8 files changed, 202 insertions(+), 75 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/c06378a8/docs/gitbook/SUMMARY.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/SUMMARY.md b/docs/gitbook/SUMMARY.md index 638b77b..32b0150 100644 --- a/docs/gitbook/SUMMARY.md +++ b/docs/gitbook/SUMMARY.md @@ -57,10 +57,12 @@ * [Feature Hashing](ft_engineering/hashing.md) * [Feature Selection](ft_engineering/selection.md) * [Feature Binning](ft_engineering/binning.md) -* [TF-IDF Calculation](ft_engineering/tfidf.md) +* [FEATURE PAIRING](ft_engineering/pairing.md) + * [Polynomial Features](ft_engineering/polynomial.md) * [FEATURE TRANSFORMATION](ft_engineering/ft_trans.md) * [Feature Vectorization](ft_engineering/vectorization.md) * [Quantify 
non-number features](ft_engineering/quantify.md) +* [TF-IDF Calculation](ft_engineering/tfidf.md) ## Part IV - Evaluation @@ -72,43 +74,43 @@ * [Data Generation](eval/datagen.md) * [Logistic Regression data generation](eval/lr_datagen.md) -## Part V - Prediction +## Part V - Supervised Learning * [How Prediction Works](misc/prediction.md) -* [Regression](regression/general.md) -* [Binary Classification](binaryclass/general.md) -## Part VI - Binary classification tutorials +## Part VI - Binary classification -* [a9a](binaryclass/a9a.md) +* [Binary Classification](binaryclass/general.md) + +* [a9a tutorial](binaryclass/a9a.md) * [Data preparation](binaryclass/a9a_dataset.md) * [Logistic Regression](binaryclass/a9a_lr.md) * [Mini-batch Gradient Descent](binaryclass/a9a_minibatch.md) -* [News20](binaryclass/news20.md) +* [News20 tutorial](binaryclass/news20.md) * [Data preparation](binaryclass/news20_dataset.md) * [Perceptron, Passive Aggressive](binaryclass/news20_pa.md) * [CW, AROW, SCW](binaryclass/news20_scw.md) * [AdaGradRDA, AdaGrad, AdaDelta](binaryclass/news20_adagrad.md) -* [KDD2010a](binaryclass/kdd2010a.md) +* [KDD2010a tutorial](binaryclass/kdd2010a.md) * [Data preparation](binaryclass/kdd2010a_dataset.md) * [PA, CW, AROW, SCW](binaryclass/kdd2010a_scw.md) -* [KDD2010b](binaryclass/kdd2010b.md) +* [KDD2010b tutorial](binaryclass/kdd2010b.md) * [Data preparation](binaryclass/kdd2010b_dataset.md) * [AROW](binaryclass/kdd2010b_arow.md) -* [Webspam](binaryclass/webspam.md) +* [Webspam tutorial](binaryclass/webspam.md) * [Data preparation](binaryclass/webspam_dataset.md) * [PA1, AROW, SCW](binaryclass/webspam_scw.md) -* [Kaggle Titanic](binaryclass/titanic_rf.md) +* [Kaggle Titanic tutorial](binaryclass/titanic_rf.md) -## Part VII - Multiclass classification tutorials +## Part VII - Multiclass classification -* [News20 Multiclass](multiclass/news20.md) +* [News20 Multiclass tutorial](multiclass/news20.md) * [Data preparation](multiclass/news20_dataset.md) * 
[Data preparation for one-vs-the-rest classifiers](multiclass/news20_one-vs-the-rest_dataset.md) * [PA](multiclass/news20_pa.md) @@ -116,18 +118,20 @@ * [Ensemble learning](multiclass/news20_ensemble.md) * [one-vs-the-rest classifier](multiclass/news20_one-vs-the-rest.md) -* [Iris](multiclass/iris.md) +* [Iris tutorial](multiclass/iris.md) * [Data preparation](multiclass/iris_dataset.md) * [SCW](multiclass/iris_scw.md) * [RandomForest](multiclass/iris_randomforest.md) -## Part VIII - Regression tutorials +## Part VIII - Regression + +* [Regression](regression/general.md) -* [E2006-tfidf regression](regression/e2006.md) +* [E2006-tfidf regression tutorial](regression/e2006.md) * [Data preparation](regression/e2006_dataset.md) * [Passive Aggressive, AROW](regression/e2006_arow.md) -* [KDDCup 2012 track 2 CTR prediction](regression/kddcup12tr2.md) +* [KDDCup 2012 track 2 CTR prediction tutorial](regression/kddcup12tr2.md) * [Data preparation](regression/kddcup12tr2_dataset.md) * [Logistic Regression, Passive Aggressive](regression/kddcup12tr2_lr.md) * [Logistic Regression with Amplifier](regression/kddcup12tr2_lr_amplify.md) http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/c06378a8/docs/gitbook/binaryclass/general.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/binaryclass/general.md b/docs/gitbook/binaryclass/general.md index 50ea688..931cc58 100644 --- a/docs/gitbook/binaryclass/general.md +++ b/docs/gitbook/binaryclass/general.md @@ -56,6 +56,10 @@ from group by feature; ``` +> #### Note +> +> `-total_steps` option is an optional parameter and training works without it. 
+ # Prediction & evaluation ```sql @@ -72,7 +76,7 @@ predict as ( select t.rowid, sigmoid(sum(m.weight * t.value)) as prob, - CAST((case when sigmoid(sum(m.weight * t.value)) >= 0.5 then 1.0 else 0.0 end) as FLOAT) as label + (case when sigmoid(sum(m.weight * t.value)) >= 0.5 then 1.0 else 0.0 end)as label from test_exploded t LEFT OUTER JOIN classification_model m ON (t.feature = m.feature) http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/c06378a8/docs/gitbook/clustering/lda.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/clustering/lda.md b/docs/gitbook/clustering/lda.md index 8b8e5f5..b433472 100644 --- a/docs/gitbook/clustering/lda.md +++ b/docs/gitbook/clustering/lda.md @@ -46,19 +46,21 @@ with word_counts as ( select docid, feature(word, count(word)) as word_count - from docs t1 LATERAL VIEW explode(tokenize(doc, true)) t2 as word + from + docs t1 + LATERAL VIEW explode(tokenize(doc, true)) t2 as word where not is_stopword(word) group by docid, word ) -select docid, collect_set(word_count) as feature +select docid, collect_list(word_count) as features from word_counts group by docid ; ``` -| docid | feature | +| docid | features | |:---:|:---| |1 | ["fruits:1","healthy:1","vegetables:1"] | |2 | ["apples:1","avocados:1","colds:1","flu:1","like:2","oranges:1"] | @@ -80,15 +82,16 @@ with word_counts as ( not is_stopword(word) group by docid, word -) -select - train_lda(feature, "-topics 2 -iter 20") as (label, word, lambda) -from ( - select docid, collect_set(word_count) as feature +), +input as ( + select docid, collect_list(word_count) as features from word_counts group by docid - order by docid -) t +) +select + train_lda(features, '-topics 2 -iter 20') as (label, word, lambda) +from + input ; ``` @@ -99,20 +102,22 @@ Notice that `order by docid` ensures building a LDA model precisely in a single ```sql with word_counts as ( -- same as above +), +input as ( + select docid, collect_list(f) as 
features + from word_counts + group by docid ) select label, word, avg(lambda) as lambda from ( select - train_lda(feature, "-topics 2 -iter 20") as (label, word, lambda) - from ( - select docid, collect_set(f) as feature - from word_counts - group by docid - ) t1 + train_lda(features, '-topics 2 -iter 20') as (label, word, lambda) + from + input ) t2 group by label, word -order by lambda desc +-- order by lambda desc -- ordering is optional ; ``` @@ -155,7 +160,9 @@ with test as ( docid, word, count(word) as value - from docs t1 LATERAL VIEW explode(tokenize(doc, true)) t2 as word + from + docs t1 + LATERAL VIEW explode(tokenize(doc, true)) t2 as word where not is_stopword(word) group by @@ -163,7 +170,7 @@ with test as ( ) select t.docid, - lda_predict(t.word, t.value, m.label, m.lambda, "-topics 2") as probabilities + lda_predict(t.word, t.value, m.label, m.lambda, '-topics 2') as probabilities from test t JOIN lda_model m ON (t.word = m.word) @@ -183,8 +190,7 @@ Since the probabilities are sorted in descending order, a label of the most prom ```sql select docid, probabilities[0].label -from topic -; +from topic; ``` | docid | label | http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/c06378a8/docs/gitbook/clustering/plsa.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/clustering/plsa.md b/docs/gitbook/clustering/plsa.md index 7cd3a9d..31cc08d 100644 --- a/docs/gitbook/clustering/plsa.md +++ b/docs/gitbook/clustering/plsa.md @@ -52,19 +52,23 @@ with word_counts as ( select docid, feature(word, count(word)) as f - from docs t1 lateral view explode(tokenize(doc, true)) t2 as word + from + docs t1 + lateral view explode(tokenize(doc, true)) t2 as word where not is_stopword(word) group by docid, word -) -select - train_plsa(feature, "-topics 2 -eps 0.00001 -iter 2048 -alpha 0.01") as (label, word, prob) -from ( - select docid, collect_set(f) as feature +), +input as ( + select docid, collect_list(f) as 
features from word_counts group by docid -) t +) +select + train_plsa(features, '-topics 2 -eps 0.00001 -iter 2048 -alpha 0.01') as (label, word, prob) +from + input ; ``` @@ -90,7 +94,6 @@ from ( |1| colds | 0.001978546| - And prediction can be done as: ```sql @@ -99,7 +102,9 @@ test as ( docid, word, count(word) as value - from docs t1 LATERAL VIEW explode(tokenize(doc, true)) t2 as word + from + docs t1 + LATERAL VIEW explode(tokenize(doc, true)) t2 as word where not is_stopword(word) group by @@ -108,20 +113,25 @@ test as ( topic as ( select t.docid, - plsa_predict(t.word, t.value, m.label, m.prob, "-topics 2") as probabilities + plsa_predict(t.word, t.value, m.label, m.prob, '-topics 2') as probabilities from test t JOIN plsa_model m ON (t.word = m.word) group by t.docid ) -select docid, probabilities, probabilities[0].label, m.words -- topic each document should be assigned -from topic t -join ( - select label, collect_set(feature(word, prob)) as words - from plsa_model - group by label -) m on t.probabilities[0].label = m.label +select + docid, + probabilities, + probabilities[0].label, + m.words -- topic each document should be assigned +from + topic t + JOIN ( + select label, collect_list(feature(word, prob)) as words + from plsa_model + group by label + ) m on t.probabilities[0].label = m.label ; ``` @@ -144,7 +154,7 @@ For the reasons that we mentioned above, we recommend you to first use LDA. Afte For training pLSA, we set a hyper-parameter `alpha` in the above example: ```sql -SELECT train_plsa(feature, "-topics 2 -eps 0.00001 -iter 2048 -alpha 0.01") +SELECT train_plsa(feature, '-topics 2 -eps 0.00001 -iter 2048 -alpha 0.01') ``` This value controls **how much iterative model update is affected by the old results**. 
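For intuition, the effect of such a smoothing weight can be illustrated in a few lines of Python. This is a generic pseudo-count sketch, not Hivemall's actual pLSA update rule; the function name and data shapes are made up for illustration:

```python
def smoothed_topic_word(counts, p_old, alpha):
    # Treat the previous topic-word probabilities as `alpha` pseudo-counts and
    # renormalize: a larger `alpha` keeps the updated distribution closer to
    # the old results, while a smaller one lets the new counts dominate.
    total = sum(counts[w] + alpha * p_old[w] for w in counts)
    return {w: (counts[w] + alpha * p_old[w]) / total for w in counts}
```

With `alpha = 0` the update follows the new counts alone; as `alpha` grows, the result stays near the previous estimate, which is consistent with larger corpora and mini-batches calling for larger values.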
@@ -162,7 +172,7 @@ In that case, you need to try different hyper-parameters to avoid overfitting as For instance, the [20 newsgroups dataset](http://qwone.com/~jason/20Newsgroups/), which consists of 10,906 real-world documents, empirically requires the following options: ```sql -SELECT train_plsa(features, "-topics 20 -iter 10 -s 128 -delta 0.01 -alpha 512 -eps 0.1") +SELECT train_plsa(features, '-topics 20 -iter 10 -s 128 -delta 0.01 -alpha 512 -eps 0.1') ``` Clearly, `alpha` is much larger than `0.01` which was used for the dummy data above. Keep in mind that an appropriate value of `alpha` highly depends on the number of documents and mini-batch size. \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/c06378a8/docs/gitbook/eval/auc.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/eval/auc.md b/docs/gitbook/eval/auc.md index 3fba0bb..8cad8f6 100644 --- a/docs/gitbook/eval/auc.md +++ b/docs/gitbook/eval/auc.md @@ -41,7 +41,9 @@ Once the rows are sorted by the probabilities in a descending order, AUC gives a In Hivemall, a function `auc(double score, int label)` provides a way to compute AUC for pairs of probability and truth label. -For instance, following query computes AUC of the table which was shown above: +## Sequential AUC computation on a single node + +For instance, the following query computes AUC of the table which was shown above: ```sql with data as ( @@ -68,6 +70,8 @@ This query returns `0.83333` as AUC. Since AUC is a metric based on ranked probability-label pairs as mentioned above, input data (rows) needs to be ordered by scores in a descending order. 
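As a cross-check of the metric itself (a sketch of the definition, not Hivemall's UDAF implementation), AUC equals the fraction of positive/negative row pairs ranked correctly, counting score ties as one half:

```python
def auc(rows):
    # rows: (probability, truth_label) pairs, as in the example table
    pos = [p for p, y in rows if y == 1]
    neg = [p for p, y in rows if y == 0]
    # probability that a random positive row outranks a random negative row
    hits = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return hits / (len(pos) * len(neg))

# the five probability/label rows from the example
rows = [(0.5, 0), (0.3, 1), (0.2, 0), (0.8, 1), (0.7, 1)]
print(round(auc(rows), 5))  # 0.83333, matching the query result
```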
+## Parallel approximate AUC computation + Meanwhile, Hive's `distribute by` clause allows you to compute AUC in parallel: ```sql @@ -82,7 +86,8 @@ with data as ( union all select 0.7 as prob, 1 as label ) -select auc(prob, label) as auc +select + auc(prob, label) as auc from ( select prob, label from data http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/c06378a8/docs/gitbook/ft_engineering/pairing.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/ft_engineering/pairing.md b/docs/gitbook/ft_engineering/pairing.md new file mode 100644 index 0000000..2959148 --- /dev/null +++ b/docs/gitbook/ft_engineering/pairing.md @@ -0,0 +1,19 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. 
+--> + http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/c06378a8/docs/gitbook/ft_engineering/polynomial.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/ft_engineering/polynomial.md b/docs/gitbook/ft_engineering/polynomial.md new file mode 100644 index 0000000..8f3d8cf --- /dev/null +++ b/docs/gitbook/ft_engineering/polynomial.md @@ -0,0 +1,73 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements.  See the NOTICE file + distributed with this work for additional information + regarding copyright ownership.  The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License.  You may obtain a copy of the License at + +   http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied.  See the License for the + specific language governing permissions and limitations + under the License. +--> + +<!-- toc --> + +[Polynomial features](http://en.wikipedia.org/wiki/Polynomial_kernel) allow you to do [non-linear regression](https://class.coursera.org/ml-005/lecture/23)/classification with a linear model. + +> #### Caution +> +> Polynomial features assume normalized inputs because `x**n` easily becomes INF/-INF where `n` is large. + +# Polynomial Features + +Similar to [scikit-learn's `PolynomialFeatures`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html), `polynomial_features(array<String> features, int degree [, boolean interactionOnly=false, boolean truncate=true])` is a function to generate polynomial and interaction features. 
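The documented behavior can be sketched in plain Python as a cross-check (an illustration of the semantics, not the UDF's implementation; the last decimals differ slightly because the UDF emits single-precision floats). The SQL examples below show the actual UDF output:

```python
from itertools import combinations, combinations_with_replacement

def polynomial_features(features, degree, interaction_only=False, truncate=True):
    # Parse "name:value" strings into (name, value) pairs.
    parsed = []
    for f in features:
        name, value = f.rsplit(":", 1)
        parsed.append((name, float(value)))
    out = dict(parsed)  # degree-1 terms are always kept
    # truncate=True skips products involving features valued exactly 1.0,
    # since multiplying by 1.0 only duplicates an existing feature.
    base = [(n, v) for n, v in parsed if not (truncate and v == 1.0)]
    combine = combinations if interaction_only else combinations_with_replacement
    for d in range(2, degree + 1):
        for combo in combine(sorted(base), d):
            name = "^".join(n for n, _ in combo)
            value = 1.0
            for _, v in combo:
                value *= v
            out[name] = value
    return ["%s:%g" % (n, v) for n, v in sorted(out.items())]

print(polynomial_features(["a:0.5", "b:0.2"], 2))
# ['a:0.5', 'a^a:0.25', 'a^b:0.1', 'b:0.2', 'b^b:0.04']
```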
+ +```sql +select polynomial_features(array("a:0.5","b:0.2"), 2); +> ["a:0.5","a^a:0.25","a^b:0.1","b:0.2","b^b:0.040000003"] + +select polynomial_features(array("a:0.5","b:0.2"), 3); +> ["a:0.5","a^a:0.25","a^a^a:0.125","a^a^b:0.05","a^b:0.1","a^b^b:0.020000001","b:0.2","b^b:0.040000003","b^b^b:0.008"] + +-- interaction only +select polynomial_features(array("a:0.5","b:0.2"), 3, true); +> ["a:0.5","a^b:0.1","b:0.2"] + +select polynomial_features(array("a:0.5","b:0.2","c:0.3"), 3, true); +> ["a:0.5","a^b:0.1","a^b^c:0.030000001","a^c:0.15","b:0.2","b^c:0.060000002","c:0.3"] + +-- interaction only + no truncate +select polynomial_features(array("a:0.5","b:1.0", "c:0.3"), 3, true, false); +> ["a:0.5","a^b:0.5","a^b^c:0.15","a^c:0.15","b:1.0","b^c:0.3","c:0.3"] + +-- interaction only + truncate +select polynomial_features(array("a:0.5","b:1.0","c:0.3"), 3, true, true); +> ["a:0.5","a^c:0.15","b:1.0","c:0.3"] + +-- truncate +select polynomial_features(array("a:0.5","b:1.0", "c:0.3"), 3, false, true); +> ["a:0.5","a^a:0.25","a^a^a:0.125","a^a^c:0.075","a^c:0.15","a^c^c:0.045","b:1.0","c:0.3","c^c:0.09","c^c^c:0.027000003"] + +-- do not truncate +select polynomial_features(array("a:0.5","b:1.0", "c:0.3"), 3, false, false); +> ["a:0.5","a^a:0.25","a^a^a:0.125","a^a^b:0.25","a^a^c:0.075","a^b:0.5","a^b^b:0.5","a^b^c:0.15","a^c:0.15","a^c^c:0.045","b:1.0","b^b:1.0","b^b^b:1.0","b^b^c:0.3","b^c:0.3","b^c^c:0.09","c:0.3","c^c:0.09","c^c^c:0.027000003"] +> +``` + +_Note: `truncate` is used to eliminate unnecessary combinations._ + +# Powered Features + +The `powered_features(array<String> features, int degree [, boolean truncate=true] )` is a function to generate polynomial features. 
+ +```sql +select powered_features(array("a:0.5","b:0.2"), 3); +> ["a:0.5","a^2:0.25","a^3:0.125","b:0.2","b^2:0.040000003","b^3:0.008"] +``` \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/c06378a8/docs/gitbook/misc/prediction.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/misc/prediction.md b/docs/gitbook/misc/prediction.md index 8c17ec6..317d688 100644 --- a/docs/gitbook/misc/prediction.md +++ b/docs/gitbook/misc/prediction.md @@ -107,21 +107,23 @@ Below we list possible options for `train_regression` and `train_classifier`, an - Loss function: `-loss`, `-loss_function` - For `train_regression` -     - SquaredLoss -     - QuantileLoss -     - EpsilonInsensitiveLoss -     - SquaredEpsilonInsensitiveLoss -     - HuberLoss +     - SquaredLoss (synonym: squared) +     - QuantileLoss (synonym: quantile) +     - EpsilonInsensitiveLoss (synonym: epsilon_insensitive) +     - SquaredEpsilonInsensitiveLoss (synonym: squared_epsilon_insensitive) +     - HuberLoss (synonym: huber) - For `train_classifier` -     - HingeLoss -     - LogLoss -     - SquaredHingeLoss -     - ModifiedHuberLoss -     - SquaredLoss -     - QuantileLoss -     - EpsilonInsensitiveLoss -     - SquaredEpsilonInsensitiveLoss -     - HuberLoss +     - HingeLoss (synonym: hinge) +     - LogLoss (synonym: log, logistic) +     - SquaredHingeLoss (synonym: squared_hinge) +     - ModifiedHuberLoss (synonym: modified_huber) +     - The following losses are mainly designed for regression but can sometimes be useful in classification as well: +       - SquaredLoss (synonym: squared) +       - QuantileLoss (synonym: quantile) +       - EpsilonInsensitiveLoss (synonym: epsilon_insensitive) +       - SquaredEpsilonInsensitiveLoss (synonym: squared_epsilon_insensitive) +       - HuberLoss (synonym: huber) + - Regularization function: `-reg`, `-regularization` - L1 - L2 @@ -135,5 +137,9 @@ Additionally, there are several variants of the SGD technique, and it is also co - AdaGrad - AdaDelta - Adam - + +> #### Note +> +> Option values are case 
insensitive; for example, `sgd`, `rda`, and `huberloss` are all accepted. + In practice, you can try different combinations of the options to achieve higher prediction accuracy. \ No newline at end of file
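For intuition, the prediction side of these linear models (the sigmoid-scoring query shown for binary classification above) boils down to a dot product with the learned weight map followed by a link function. A minimal sketch with a made-up two-feature model, not Hivemall's implementation:

```python
import math

def predict(feature_values, weights):
    # Mirror of the SQL: LEFT OUTER JOIN the exploded (feature, value) pairs
    # against the model's weights (missing features contribute 0), sum the
    # products, squash with a sigmoid, and threshold at 0.5 for the label.
    margin = sum(weights.get(f, 0.0) * v for f, v in feature_values)
    prob = 1.0 / (1.0 + math.exp(-margin))
    label = 1.0 if prob >= 0.5 else 0.0
    return prob, label

# hypothetical weights as if produced by train_classifier
weights = {"a": 2.0, "b": -1.0}
print(predict([("a", 1.0), ("b", 1.0)], weights))  # (0.7310585786300049, 1.0)
```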
