Repository: incubator-hivemall
Updated Branches:
  refs/heads/master 0737e23eb -> 7205de1e9
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/misc/prediction.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/misc/prediction.md b/docs/gitbook/misc/prediction.md index ee85e40..53d0cea 100644 --- a/docs/gitbook/misc/prediction.md +++ b/docs/gitbook/misc/prediction.md @@ -56,7 +56,7 @@ The goal of regression is to predict **real values** as shown below: In practice, target values could be any of small/large float/int negative/positive values. [Our CTR prediction tutorial](../regression/kddcup12tr2.md) solves regression problem with small floating point target values in a 0-1 range, for example. -While there are several ways to realize regression by using Hivemall, `train_regression()` is one of the most flexible functions. This feature is explained in: [Regression](../regression/general.md). +While there are several ways to realize regression by using Hivemall, `train_regressor()` is one of the most flexible functions. This feature is explained in [this page](../regression/general.md). # Classification @@ -103,10 +103,10 @@ Eventually, minimizing the function $$E(\mathbf{w})$$ can be implemented by the Interestingly, depending on a choice of loss and regularization function, prediction model you obtained will behave differently; even if one combination could work as a classifier, another choice might be appropriate for regression. 
-Below we list possible options for `train_regression` and `train_classifier`, and this is the reason why these two functions are the most flexible in Hivemall: +Below we list possible options for `train_regressor` and `train_classifier`, and this is the reason why these two functions are the most flexible in Hivemall: - Loss function: `-loss`, `-loss_function` - - For `train_regression` + - For `train_regressor` - SquaredLoss (synonym: squared) - QuantileLoss (synonym: quantile) - EpsilonInsensitiveLoss (synonym: epsilon_insensitive) @@ -156,8 +156,8 @@ Furthermore, optimizer offers to set auxiliary options such as: For details of available options, following queries might be helpful to list all of them: ```sql -select train_regression(array(), 0, '-help'); +select train_regressor(array(), 0, '-help'); select train_classifier(array(), 0, '-help'); ``` -In practice, you can try different combinations of the options in order to achieve higher prediction accuracy. \ No newline at end of file +In practice, you can try different combinations of the options in order to achieve higher prediction accuracy. http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/misc/tokenizer.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/misc/tokenizer.md b/docs/gitbook/misc/tokenizer.md index 07c8cd1..b056874 100644 --- a/docs/gitbook/misc/tokenizer.md +++ b/docs/gitbook/misc/tokenizer.md @@ -101,4 +101,4 @@ select tokenize_cn("Smartcn为Apache2.0协议的开源中文分词系统，Java ``` > [smartcn, 为, apach, 2, 0, 协议, 的, 开源, 中文, 分词, 系统, > java, 语言, 编写, 修改, 的, 中科院, 计算, 所, ictcla, 分词, > 系统] -For detailed APIs, please refer Javadoc of [SmartChineseAnalyzer](http://lucene.apache.org/core/5_3_1/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.html) as well.
\ No newline at end of file +For detailed APIs, please refer Javadoc of [SmartChineseAnalyzer](http://lucene.apache.org/core/5_3_1/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.html) as well. http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/multiclass/iris_randomforest.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/multiclass/iris_randomforest.md b/docs/gitbook/multiclass/iris_randomforest.md index 771c733..b421297 100644 --- a/docs/gitbook/multiclass/iris_randomforest.md +++ b/docs/gitbook/multiclass/iris_randomforest.md @@ -381,4 +381,4 @@ digraph Tree { <img src="../resources/images/iris.png" alt="Iris Graphvis output"/> -You can draw a graph by `dot -Tpng iris.dot -o iris.png` or using [Viz.js](http://viz-js.com/). \ No newline at end of file +You can draw a graph by `dot -Tpng iris.dot -o iris.png` or using [Viz.js](http://viz-js.com/). http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/multiclass/news20_dataset.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/multiclass/news20_dataset.md b/docs/gitbook/multiclass/news20_dataset.md index 96decec..4cc9b83 100644 --- a/docs/gitbook/multiclass/news20_dataset.md +++ b/docs/gitbook/multiclass/news20_dataset.md @@ -92,5 +92,5 @@ select -- cast(extract_feature(feature) as int) as feature, -- extract_weight(feature) as value from - news20mc_test LATERAL VIEW explode(addBias(features)) t AS feature; -``` \ No newline at end of file + news20mc_test LATERAL VIEW explode(add_bias(features)) t AS feature; +``` http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/multiclass/news20_ensemble.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/multiclass/news20_ensemble.md b/docs/gitbook/multiclass/news20_ensemble.md index 6bf1c93..7389a47 100644 
--- a/docs/gitbook/multiclass/news20_ensemble.md +++ b/docs/gitbook/multiclass/news20_ensemble.md @@ -48,20 +48,20 @@ select voted_avg(weight) as weight from (select - -- train_multiclass_cw(addBias(features),label) as (label,feature,weight) -- hivemall v0.1 - train_multiclass_cw(addBias(features),label) as (label,feature,weight,covar) -- hivemall v0.2 or later + -- train_multiclass_cw(add_bias(features),label) as (label,feature,weight) -- hivemall v0.1 + train_multiclass_cw(add_bias(features),label) as (label,feature,weight,covar) -- hivemall v0.2 or later from news20mc_train_x3 union all select - -- train_multiclass_arow(addBias(features),label) as (label,feature,weight) -- hivemall v0.1 - train_multiclass_arow(addBias(features),label) as (label,feature,weight,covar) -- hivemall v0.2 or later + -- train_multiclass_arow(add_bias(features),label) as (label,feature,weight) -- hivemall v0.1 + train_multiclass_arow(add_bias(features),label) as (label,feature,weight,covar) -- hivemall v0.2 or later from news20mc_train_x3 union all select - -- train_multiclass_scw(addBias(features),label) as (label,feature,weight) -- hivemall v0.1 - train_multiclass_scw(addBias(features),label) as (label,feature,weight,covar) -- hivemall v0.2 or later + -- train_multiclass_scw(add_bias(features),label) as (label,feature,weight) -- hivemall v0.1 + train_multiclass_scw(add_bias(features),label) as (label,feature,weight,covar) -- hivemall v0.2 or later from news20mc_train_x3 ) t @@ -196,4 +196,4 @@ Unfortunately, too many cooks spoil the broth in this case too :-( | SCW2 | 0.8482344102178813 | | Ensemble(model) | 0.8494866015527173 | | Ensemble(prediction) | 0.8499874780866516 | -| CW | 0.850488354620586 | \ No newline at end of file +| CW | 0.850488354620586 | http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/multiclass/news20_one-vs-the-rest_dataset.md ---------------------------------------------------------------------- diff --git 
a/docs/gitbook/multiclass/news20_one-vs-the-rest_dataset.md b/docs/gitbook/multiclass/news20_one-vs-the-rest_dataset.md index f437399..6f76d28 100644 --- a/docs/gitbook/multiclass/news20_one-vs-the-rest_dataset.md +++ b/docs/gitbook/multiclass/news20_one-vs-the-rest_dataset.md @@ -44,7 +44,7 @@ SET hivevar:possible_labels="1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,17,16,19,18,20" ``` create or replace view news20_onevsrest_train as -select transform(${possible_labels}, rowid, label, addBias(features)) +select transform(${possible_labels}, rowid, label, add_bias(features)) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" COLLECTION ITEMS TERMINATED BY "," http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/multiclass/news20_pa.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/multiclass/news20_pa.md b/docs/gitbook/multiclass/news20_pa.md index 26083f9..c57d08d 100644 --- a/docs/gitbook/multiclass/news20_pa.md +++ b/docs/gitbook/multiclass/news20_pa.md @@ -44,7 +44,7 @@ select voted_avg(weight) as weight from (select - train_multiclass_pa2(addBias(features),label) as (label,feature,weight) + train_multiclass_pa2(add_bias(features),label) as (label,feature,weight) from news20mc_train_x3 ) t @@ -106,4 +106,4 @@ where actual == predicted; drop table news20mc_pa2_model1; drop table news20mc_pa2_predict1; drop view news20mc_pa2_submit1; -``` \ No newline at end of file +``` http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/multiclass/news20_scw.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/multiclass/news20_scw.md b/docs/gitbook/multiclass/news20_scw.md index 24e0fad..fbe5153 100644 --- a/docs/gitbook/multiclass/news20_scw.md +++ b/docs/gitbook/multiclass/news20_scw.md @@ -51,8 +51,8 @@ select argmin_kld(weight, covar) as weight -- [hivemall v0.2 or later] from (select - -- 
train_multiclass_cw(addBias(features),label) as (label,feature,weight) -- [hivemall v0.1] - train_multiclass_cw(addBias(features),label) as (label,feature,weight,covar) -- [hivemall v0.2 or later] + -- train_multiclass_cw(add_bias(features),label) as (label,feature,weight) -- [hivemall v0.1] + train_multiclass_cw(add_bias(features),label) as (label,feature,weight,covar) -- [hivemall v0.2 or later] from news20mc_train_x3 ) t @@ -126,8 +126,8 @@ select argmin_kld(weight, covar) as weight -- [hivemall v0.2 or later] from (select - -- train_multiclass_arow(addBias(features),label) as (label,feature,weight) -- [hivemall v0.1] - train_multiclass_arow(addBias(features),label) as (label,feature,weight,covar) -- [hivemall v0.2 or later] + -- train_multiclass_arow(add_bias(features),label) as (label,feature,weight) -- [hivemall v0.1] + train_multiclass_arow(add_bias(features),label) as (label,feature,weight,covar) -- [hivemall v0.2 or later] from news20mc_train_x3 ) t @@ -201,8 +201,8 @@ select argmin_kld(weight, covar) as weight -- [hivemall v0.2 or later] from (select - -- train_multiclass_scw(addBias(features),label) as (label,feature,weight) -- [hivemall v0.1] - train_multiclass_scw(addBias(features),label) as (label,feature,weight,covar) -- [hivemall v0.2 or later] + -- train_multiclass_scw(add_bias(features),label) as (label,feature,weight) -- [hivemall v0.1] + train_multiclass_scw(add_bias(features),label) as (label,feature,weight,covar) -- [hivemall v0.2 or later] from news20mc_train_x3 ) t @@ -276,8 +276,8 @@ select argmin_kld(weight, covar) as weight -- [hivemall v0.2 or later] from (select - -- train_multiclass_scw2(addBias(features),label) as (label,feature,weight) -- [hivemall v0.1] - train_multiclass_scw2(addBias(features),label) as (label,feature,weight,covar) -- [hivemall v0.2 or later] + -- train_multiclass_scw2(add_bias(features),label) as (label,feature,weight) -- [hivemall v0.1] + train_multiclass_scw2(add_bias(features),label) as 
(label,feature,weight,covar) -- [hivemall v0.2 or later] from news20mc_train_x3 ) t http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/recommend/item_based_cf.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/recommend/item_based_cf.md b/docs/gitbook/recommend/item_based_cf.md index 9515184..9e4f7e4 100644 --- a/docs/gitbook/recommend/item_based_cf.md +++ b/docs/gitbook/recommend/item_based_cf.md @@ -714,4 +714,4 @@ similarity as ( -- copy (i1, i2)'s similarity as (i2, i1)'s one ), topk as ( ... -``` \ No newline at end of file +``` http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/recommend/movielens_cf.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/recommend/movielens_cf.md b/docs/gitbook/recommend/movielens_cf.md index e0ed545..faa555c 100644 --- a/docs/gitbook/recommend/movielens_cf.md +++ b/docs/gitbook/recommend/movielens_cf.md @@ -253,4 +253,4 @@ where -- at least 10 recommended items are necessary to compute recall@10 and pr |**MRR**| 0.03507380742291146 | |**NDCG**| 0.15787655209987522 | -If you set larger value to the DIMSUM's `-threshold` option, similarity will be more aggressively approximated. Consequently, while efficiency is improved, the accuracy is likely to be decreased. \ No newline at end of file +If you set larger value to the DIMSUM's `-threshold` option, similarity will be more aggressively approximated. Consequently, while efficiency is improved, the accuracy is likely to be decreased. 
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/recommend/movielens_cv.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/recommend/movielens_cv.md b/docs/gitbook/recommend/movielens_cv.md index a1f7b2f..6ac54c7 100644 --- a/docs/gitbook/recommend/movielens_cv.md +++ b/docs/gitbook/recommend/movielens_cv.md @@ -79,4 +79,4 @@ Then, issue SQL queies in [generate_cv.sql](https://gist.github.com/myui/2e20182 > 0.8502739040257945 (RMSE) -_We recommend to use [Tez](http://tez.apache.org/) for running queries having many stages._ \ No newline at end of file +_We recommend to use [Tez](http://tez.apache.org/) for running queries having many stages._ http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/recommend/movielens_fm.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/recommend/movielens_fm.md b/docs/gitbook/recommend/movielens_fm.md index ad59324..64039fe 100644 --- a/docs/gitbook/recommend/movielens_fm.md +++ b/docs/gitbook/recommend/movielens_fm.md @@ -264,4 +264,4 @@ select from testing_fm as t JOIN predicted as p on (t.rowid = p.rowid); -``` \ No newline at end of file +``` http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/recommend/movielens_mf.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/recommend/movielens_mf.md b/docs/gitbook/recommend/movielens_mf.md index ca38fec..003082a 100644 --- a/docs/gitbook/recommend/movielens_mf.md +++ b/docs/gitbook/recommend/movielens_mf.md @@ -157,4 +157,4 @@ limit ${topk}; | 2503 | 4.788541 | | 53 | 4.7518783 | | 904 | 4.7463417 | -| 953 | 4.732769 | \ No newline at end of file +| 953 | 4.732769 | http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/recommend/news20_bbit_minhash.md 
---------------------------------------------------------------------- diff --git a/docs/gitbook/recommend/news20_bbit_minhash.md b/docs/gitbook/recommend/news20_bbit_minhash.md index 474a40d..93cb47b 100644 --- a/docs/gitbook/recommend/news20_bbit_minhash.md +++ b/docs/gitbook/recommend/news20_bbit_minhash.md @@ -66,4 +66,4 @@ limit ${topn}; | 3839 | 0.328125 | 41 | | 12669 | 0.328125 | 37 | | 13604 | 0.3125 | 41 | -| 6333 | 0.3125 | 39 | \ No newline at end of file +| 6333 | 0.3125 | 39 | http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/recommend/news20_jaccard.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/recommend/news20_jaccard.md b/docs/gitbook/recommend/news20_jaccard.md index 6a30fb8..0166ed5 100644 --- a/docs/gitbook/recommend/news20_jaccard.md +++ b/docs/gitbook/recommend/news20_jaccard.md @@ -139,4 +139,4 @@ from where similarity >= 0.1 ; -``` \ No newline at end of file +``` http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/regression/e2006_arow.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/regression/e2006_arow.md b/docs/gitbook/regression/e2006_arow.md index abdb725..ddf6398 100644 --- a/docs/gitbook/regression/e2006_arow.md +++ b/docs/gitbook/regression/e2006_arow.md @@ -32,7 +32,7 @@ select avg(weight) as weight from (select - train_pa1a_regr(addBias(features),target) as (feature,weight) + train_pa1a_regr(add_bias(features),target) as (feature,weight) from e2006tfidf_train_x3 ) t @@ -96,7 +96,7 @@ select avg(weight) as weight from (select - train_pa2a_regr(addBias(features),target) as (feature,weight) + train_pa2a_regr(add_bias(features),target) as (feature,weight) from e2006tfidf_train_x3 ) t @@ -160,8 +160,8 @@ select argmin_kld(weight, covar) as weight -- [hivemall v0.2 or later] from (select - -- train_arow_regr(addBias(features),target) as (feature,weight) -- [hivemall 
v0.1] - train_arow_regr(addBias(features),target) as (feature,weight,covar) -- [hivemall v0.2 or later] + -- train_arow_regr(add_bias(features),target) as (feature,weight) -- [hivemall v0.1] + train_arow_regr(add_bias(features),target) as (feature,weight,covar) -- [hivemall v0.2 or later] from e2006tfidf_train_x3 ) t @@ -226,8 +226,8 @@ select argmin_kld(weight, covar) as weight -- [hivemall v0.2 or later] from (select - -- train_arowe_regr(addBias(features),target) as (feature,weight) -- [hivemall v0.1] - train_arowe_regr(addBias(features),target) as (feature,weight,covar) -- [hivemall v0.2 or later] + -- train_arowe_regr(add_bias(features),target) as (feature,weight) -- [hivemall v0.1] + train_arowe_regr(add_bias(features),target) as (feature,weight,covar) -- [hivemall v0.2 or later] from e2006tfidf_train_x3 ) t http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/regression/e2006_dataset.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/regression/e2006_dataset.md b/docs/gitbook/regression/e2006_dataset.md index 001eda2..804fa40 100644 --- a/docs/gitbook/regression/e2006_dataset.md +++ b/docs/gitbook/regression/e2006_dataset.md @@ -17,13 +17,11 @@ under the License. 
--> -http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf - Prerequisite ============ -* [hivemall.jar](https://github.com/myui/hivemall/tree/master/target/hivemall.jar) -* [conv.awk](https://github.com/myui/hivemall/tree/master/scripts/misc/conv.awk) -* [define-all.hive](https://github.com/myui/hivemall/tree/master/scripts/ddl/define-all.hive) + +* [E2006-tfidf Dataset](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf) +* [conv.awk](https://github.com/apache/incubator-hivemall/blob/master/resources/misc/conv.awk) Data preparation ================ @@ -43,12 +41,7 @@ hadoop fs -put E2006.test.tsv /dataset/E2006-tfidf/test create database E2006; use E2006; -delete jar /home/myui/tmp/hivemall.jar; -add jar /home/myui/tmp/hivemall.jar; - -source /home/myui/tmp/define-all.hive; - -Create external table e2006tfidf_train ( +create external table e2006tfidf_train ( rowid int, target float, features ARRAY<STRING> @@ -56,7 +49,7 @@ Create external table e2006tfidf_train ( ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY "," STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train'; -Create external table e2006tfidf_test ( +create external table e2006tfidf_test ( rowid int, target float, features ARRAY<STRING> @@ -68,24 +61,28 @@ create table e2006tfidf_test_exploded as select rowid, target, - split(feature,":")[0] as feature, - cast(split(feature,":")[1] as float) as value + -- split(feature,":")[0] as feature, + -- cast(split(feature,":")[1] as float) as value -- hivemall v0.3.1 or later - -- extract_feature(feature) as feature, - -- extract_weight(feature) as value + extract_feature(feature) as feature, + extract_weight(feature) as value from - e2006tfidf_test LATERAL VIEW explode(addBias(features)) t AS feature; + e2006tfidf_test LATERAL VIEW explode(add_bias(features)) t AS feature; ``` ## Amplify training examples (global shuffle) + ```sql -- set mapred.reduce.tasks=32; set 
hivevar:seed=31; set hivevar:xtimes=3; + create or replace view e2006tfidf_train_x3 as select * from ( -select amplify(${xtimes}, *) as (rowid, target, features) from e2006tfidf_train + select amplify(${xtimes}, *) as (rowid, target, features) + from e2006tfidf_train ) t CLUSTER BY rand(${seed}); + -- set mapred.reduce.tasks=-1; -``` \ No newline at end of file +``` http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/regression/general.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/regression/general.md b/docs/gitbook/regression/general.md index dee0719..4750ea4 100644 --- a/docs/gitbook/regression/general.md +++ b/docs/gitbook/regression/general.md @@ -24,7 +24,7 @@ In our regression tutorials, you can tackle realistic prediction problems by usi - [AROW](e2006_arow.html#arow) - [AROWe](e2006_arow.html#arowe) -Our `train_regression` function enables you to solve the regression problems with flexible configureable options. Let us try the function below. +Our `train_regressor` function enables you to solve the regression problems with flexible configurable options. Let us try the function below. It should be noted that the sample queries require you to prepare [E2006-tfidf data](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf). See [our E2006-tfidf tutorial page](../regression/e2006_dataset.md) for further instructions. 
@@ -42,7 +42,7 @@ select avg(weight) as weight from ( select - train_regression(features,target,'-loss squaredloss -opt AdaGrad -reg no') as (feature,weight) + train_regressor(features,target,'-loss squaredloss -opt AdaGrad -reg no') as (feature,weight) from e2006tfidf_train_x3 ) t http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/regression/kddcup12tr2_dataset.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/regression/kddcup12tr2_dataset.md b/docs/gitbook/regression/kddcup12tr2_dataset.md index c32958f..e4a541b 100644 --- a/docs/gitbook/regression/kddcup12tr2_dataset.md +++ b/docs/gitbook/regression/kddcup12tr2_dataset.md @@ -243,4 +243,4 @@ from testing2 LATERAL VIEW explode(features) t AS feature; ``` -_Caution: We recommend you to set "mapred.reduce.tasks" in the above example to partition the training_orcfile table into pieces._ \ No newline at end of file +_Caution: We recommend you to set "mapred.reduce.tasks" in the above example to partition the training_orcfile table into pieces._ http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/regression/kddcup12tr2_lr.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/regression/kddcup12tr2_lr.md b/docs/gitbook/regression/kddcup12tr2_lr.md index 6db07ab..b9f8bdf 100644 --- a/docs/gitbook/regression/kddcup12tr2_lr.md +++ b/docs/gitbook/regression/kddcup12tr2_lr.md @@ -157,4 +157,4 @@ pypy scoreKDD.py KDD_Track2_solution.csv pa_predict.submit |:-----------|------------:| | AUC | 0.739722 | | NWMAE | 0.049582 | -| WRMSE | 0.143698 | \ No newline at end of file +| WRMSE | 0.143698 | http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/regression/kddcup12tr2_lr_amplify.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/regression/kddcup12tr2_lr_amplify.md 
b/docs/gitbook/regression/kddcup12tr2_lr_amplify.md index 5ede953..b363051 100644 --- a/docs/gitbook/regression/kddcup12tr2_lr_amplify.md +++ b/docs/gitbook/regression/kddcup12tr2_lr_amplify.md @@ -119,4 +119,4 @@ We recommend users to use *amplify()* for small training inputs and to use *rand |:-----------|--------------------|----:| | Plain | 89.718 | 0.734805 | | amplifier+clustered by | 479.855 | 0.746214 | -| rand_amplifier | 116.424 | 0.743392 | \ No newline at end of file +| rand_amplifier | 116.424 | 0.743392 | http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/tips/addbias.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/tips/addbias.md b/docs/gitbook/tips/addbias.md index 021ca64..75ef451 100644 --- a/docs/gitbook/tips/addbias.md +++ b/docs/gitbook/tips/addbias.md @@ -26,8 +26,8 @@ With bias clause b, a trainer learns the following f(x). _f(x)=Wx+b_ Then, the predicted model considers bias existing in the dataset and the predicted hyperplane does not always cross the origin. -**addBias()** of Hivemall, adds a bias to a feature vector. -To enable a bias clause, use addBias() for **both**_(important!)_ training and test data as follows. +**add_bias()** of Hivemall, adds a bias to a feature vector. +To enable a bias clause, use add_bias() for **both**_(important!)_ training and test data as follows. The bias _b_ is a feature of "0" ("-1" in before v0.3) by the default. See [AddBiasUDF](../tips/addbias.html) for the detail. Note that Bias is expressed as a feature that found in all training/testing examples. 
@@ -43,7 +43,7 @@ select -- extract_feature(feature) as feature, -- hivemall v0.3.1 or later -- extract_weight(feature) as value -- hivemall v0.3.1 or later from - e2006tfidf_test LATERAL VIEW explode(addBias(features)) t AS feature; + e2006tfidf_test LATERAL VIEW explode(add_bias(features)) t AS feature; ``` # Adding a bias clause to training data @@ -54,9 +54,9 @@ select avg(weight) as weight from (select - pa1a_regress(addBias(features),target) as (feature,weight) + pa1a_regress(add_bias(features),target) as (feature,weight) from e2006tfidf_train_x3 ) t group by feature; -``` \ No newline at end of file +``` http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/tips/emr.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/tips/emr.md b/docs/gitbook/tips/emr.md index 049e6da..44e0855 100644 --- a/docs/gitbook/tips/emr.md +++ b/docs/gitbook/tips/emr.md @@ -107,7 +107,7 @@ select cast(split(feature,":")[0] as int) as feature, cast(split(feature,":")[1] as float) as value from - news20b_test LATERAL VIEW explode(addBias(features)) t AS feature; + news20b_test LATERAL VIEW explode(add_bias(features)) t AS feature; ``` --- @@ -132,7 +132,7 @@ select cast(voted_avg(weight) as float) as weight from (select - train_arow(addBias(features),label) as (feature,weight) + train_arow(add_bias(features),label) as (feature,weight) from news20b_train_x3 ) t @@ -202,4 +202,4 @@ We recommended users to use m1.xlarge running Hivemall on EMR as follows. --bootstrap-name "install ganglia" \ --availability-zone ap-northeast-1a ``` -Using spot instance for core/task instance groups is the best way to save your money. \ No newline at end of file +Using spot instance for core/task instance groups is the best way to save your money. 
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/tips/ensemble_learning.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/tips/ensemble_learning.md b/docs/gitbook/tips/ensemble_learning.md index 9288f84..2157a5b 100644 --- a/docs/gitbook/tips/ensemble_learning.md +++ b/docs/gitbook/tips/ensemble_learning.md @@ -49,20 +49,20 @@ select voted_avg(weight) as weight from (select - -- train_multiclass_cw(addBias(features),label) as (label,feature,weight) -- hivemall v0.1 - train_multiclass_cw(addBias(features),label) as (label,feature,weight,covar) -- hivemall v0.2 or later + -- train_multiclass_cw(add_bias(features),label) as (label,feature,weight) -- hivemall v0.1 + train_multiclass_cw(add_bias(features),label) as (label,feature,weight,covar) -- hivemall v0.2 or later from news20mc_train_x3 union all select - -- train_multiclass_arow(addBias(features),label) as (label,feature,weight) -- hivemall v0.1 - train_multiclass_arow(addBias(features),label) as (label,feature,weight,covar) -- hivemall v0.2 or later + -- train_multiclass_arow(add_bias(features),label) as (label,feature,weight) -- hivemall v0.1 + train_multiclass_arow(add_bias(features),label) as (label,feature,weight,covar) -- hivemall v0.2 or later from news20mc_train_x3 union all select - -- train_multiclass_scw(addBias(features),label) as (label,feature,weight) -- hivemall v0.1 - train_multiclass_scw(addBias(features),label) as (label,feature,weight,covar) -- hivemall v0.2 or later + -- train_multiclass_scw(add_bias(features),label) as (label,feature,weight) -- hivemall v0.1 + train_multiclass_scw(add_bias(features),label) as (label,feature,weight,covar) -- hivemall v0.2 or later from news20mc_train_x3 ) t @@ -196,4 +196,4 @@ Unfortunately, too many cooks spoil the broth in this case too :-( | SCW2 | 0.8482344102178813 | | Ensemble(model) | 0.8494866015527173 | | Ensemble(prediction) | 0.8499874780866516 | -| CW | 
0.850488354620586 | \ No newline at end of file +| CW | 0.850488354620586 | http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/tips/hadoop_tuning.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/tips/hadoop_tuning.md b/docs/gitbook/tips/hadoop_tuning.md index 507e19d..c516820 100644 --- a/docs/gitbook/tips/hadoop_tuning.md +++ b/docs/gitbook/tips/hadoop_tuning.md @@ -97,4 +97,4 @@ You can use the plain old MapReduce by setting following setting: ```sql set mapreduce.framework.name=yarn; set hive.execution.engine=mr; -``` \ No newline at end of file +``` http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/tips/mixserver.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/tips/mixserver.md b/docs/gitbook/tips/mixserver.md index f9878e6..91aff87 100644 --- a/docs/gitbook/tips/mixserver.md +++ b/docs/gitbook/tips/mixserver.md @@ -69,7 +69,7 @@ select cast(voted_avg(weight) as float) as weight from (select - train_pa1(addBias(features),label,"-mix host01,host02,host03") as (feature,weight) + train_pa1(add_bias(features),label,"-mix host01,host02,host03") as (feature,weight) from kdd10a_train_x3 ) t @@ -83,4 +83,4 @@ The effect of model mixing In my experience, the MIX improved the prediction accuracy of the above KDD2010a PA1 training on a 32 nodes cluster from 0.844835019263103 (w/o mix) to 0.8678096499719774 (w/ mix). -The overhead of using the MIX protocol is *almost negligible* because the MIX communication is efficiently handled using asynchronous non-blocking I/O. Furthermore, the training time could be improved on certain settings because of the faster convergence due to mixing. \ No newline at end of file +The overhead of using the MIX protocol is *almost negligible* because the MIX communication is efficiently handled using asynchronous non-blocking I/O. 
Furthermore, the training time could be improved on certain settings because of the faster convergence due to mixing. http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/tips/rand_amplify.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/tips/rand_amplify.md b/docs/gitbook/tips/rand_amplify.md index 6d68dea..73b1c3a 100644 --- a/docs/gitbook/tips/rand_amplify.md +++ b/docs/gitbook/tips/rand_amplify.md @@ -118,4 +118,4 @@ We recommend users to use *amplify()* for small training inputs and to use *rand |:-----------|--------------------|----:| | Plain | 89.718 | 0.734805 | | amplifier+clustered by | 479.855 | 0.746214 | -| rand_amplifier | 116.424 | 0.743392 | \ No newline at end of file +| rand_amplifier | 116.424 | 0.743392 | http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/tips/rt_prediction.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/tips/rt_prediction.md b/docs/gitbook/tips/rt_prediction.md index 96641a3..e1a1fff 100644 --- a/docs/gitbook/tips/rt_prediction.md +++ b/docs/gitbook/tips/rt_prediction.md @@ -135,7 +135,7 @@ select extract_feature(feature) as feature, extract_weight(feature) as value from - a9atest LATERAL VIEW explode(addBias(features)) t AS feature; + a9atest LATERAL VIEW explode(add_bias(features)) t AS feature; desc extended a9atest_exploded_tsv; > location:hdfs://dm01:9000/user/hive/warehouse/a9a.db/a9atest_exploded_tsv, @@ -252,4 +252,4 @@ Alternatively, you can use SQL views for testing target 't' in the above query. 
 | 0.05595205126313402 | 0.0 |
 +---------------------+-----------+
 1 row in set (0.00 sec)
-```
\ No newline at end of file
+```
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/troubleshooting/asterisk.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/troubleshooting/asterisk.md b/docs/gitbook/troubleshooting/asterisk.md
index 621ab3f..3c8c08b 100644
--- a/docs/gitbook/troubleshooting/asterisk.md
+++ b/docs/gitbook/troubleshooting/asterisk.md
@@ -19,4 +19,4 @@
 See [HIVE-4181](https://issues.apache.org/jira/browse/HIVE-4181) that asterisk argument without table alias for UDTF is not working. It has been fixed as part of Hive v0.12 release.
-A possible workaround is to use asterisk with a table alias, or to specify names of arguments explicitly.
\ No newline at end of file
+A possible workaround is to use asterisk with a table alias, or to specify names of arguments explicitly.
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/troubleshooting/mapjoin_classcastex.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/troubleshooting/mapjoin_classcastex.md b/docs/gitbook/troubleshooting/mapjoin_classcastex.md
index 28e7709..ade4f52 100644
--- a/docs/gitbook/troubleshooting/mapjoin_classcastex.md
+++ b/docs/gitbook/troubleshooting/mapjoin_classcastex.md
@@ -24,4 +24,4 @@ Map-side join on Tez causes [ClassCastException](http://markmail.org/message/7cw
 set hive.mapjoin.optimized.hashtable=false;
 ```
-Caution: Fixed in Hive 1.3.0. Refer [HIVE_11051](https://issues.apache.org/jira/browse/HIVE-11051) for the detail.
\ No newline at end of file
+Caution: Fixed in Hive 1.3.0. Refer [HIVE_11051](https://issues.apache.org/jira/browse/HIVE-11051) for the detail.
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/troubleshooting/mapjoin_task_error.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/troubleshooting/mapjoin_task_error.md b/docs/gitbook/troubleshooting/mapjoin_task_error.md
index 78b4e32..185378b 100644
--- a/docs/gitbook/troubleshooting/mapjoin_task_error.md
+++ b/docs/gitbook/troubleshooting/mapjoin_task_error.md
@@ -24,4 +24,4 @@ When using complex queries using views, the auto conversion sometimes throws Sem
 Workaround for the exception is to disable **hive.auto.convert.join** before the execution as follows.
 ```
 set hive.auto.convert.join=false;
-```
\ No newline at end of file
+```
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/troubleshooting/num_mappers.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/troubleshooting/num_mappers.md b/docs/gitbook/troubleshooting/num_mappers.md
index c1820db..67ce7b5 100644
--- a/docs/gitbook/troubleshooting/num_mappers.md
+++ b/docs/gitbook/troubleshooting/num_mappers.md
@@ -36,4 +36,4 @@ set hive.tez.input.format;
 You can then control the maximum number of mappers via setting:
 ```
 set mapreduce.job.maps=128;
-```
\ No newline at end of file
+```
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/docs/gitbook/troubleshooting/oom.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/troubleshooting/oom.md b/docs/gitbook/troubleshooting/oom.md
index 50bee25..dc375bf 100644
--- a/docs/gitbook/troubleshooting/oom.md
+++ b/docs/gitbook/troubleshooting/oom.md
@@ -36,4 +36,4 @@ If OOM caused during the merge step, try setting a larger **mapred.reduce.tasks*
 SET mapred.reduce.tasks=64;
 ```
-If your OOM happened by using amplify(), try using rand_amplify() instead.
\ No newline at end of file
+If your OOM happened by using amplify(), try using rand_amplify() instead.
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/resources/ddl/define-all-as-permanent.hive
----------------------------------------------------------------------
diff --git a/resources/ddl/define-all-as-permanent.hive b/resources/ddl/define-all-as-permanent.hive
index c59678a..feb1a08 100644
--- a/resources/ddl/define-all-as-permanent.hive
+++ b/resources/ddl/define-all-as-permanent.hive
@@ -337,8 +337,8 @@ CREATE FUNCTION tf as 'hivemall.ftvec.text.TermFrequencyUDAF' USING JAR '${hivem
 -- Regression functions --
 --------------------------
-DROP FUNCTION IF EXISTS train_regression;
-CREATE FUNCTION train_regression as 'hivemall.regression.GeneralRegressionUDTF' USING JAR '${hivemall_jar}';
+DROP FUNCTION IF EXISTS train_regressor;
+CREATE FUNCTION train_regressor as 'hivemall.regression.GeneralRegressorUDTF' USING JAR '${hivemall_jar}';
 DROP FUNCTION IF EXISTS train_logregr;
 CREATE FUNCTION train_logregr as 'hivemall.regression.LogressUDTF' USING JAR '${hivemall_jar}';
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/resources/ddl/define-all.hive
----------------------------------------------------------------------
diff --git a/resources/ddl/define-all.hive b/resources/ddl/define-all.hive
index 4514535..310f9f4 100644
--- a/resources/ddl/define-all.hive
+++ b/resources/ddl/define-all.hive
@@ -333,8 +333,8 @@ create temporary function tf as 'hivemall.ftvec.text.TermFrequencyUDAF';
 -- Regression functions --
 --------------------------
-drop temporary function if exists train_regression;
-create temporary function train_regression as 'hivemall.regression.GeneralRegressionUDTF';
+drop temporary function if exists train_regressor;
+create temporary function train_regressor as 'hivemall.regression.GeneralRegressorUDTF';
 drop temporary function if exists logress;
 create temporary function logress as 'hivemall.regression.LogressUDTF';
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/resources/ddl/define-all.spark
----------------------------------------------------------------------
diff --git a/resources/ddl/define-all.spark b/resources/ddl/define-all.spark
index 2cf4d60..42b235b 100644
--- a/resources/ddl/define-all.spark
+++ b/resources/ddl/define-all.spark
@@ -336,8 +336,8 @@ sqlContext.sql("CREATE TEMPORARY FUNCTION tf AS 'hivemall.ftvec.text.TermFrequen
 * Regression functions
 */
-sqlContext.sql("DROP TEMPORARY FUNCTION IF EXISTS train_regression")
-sqlContext.sql("CREATE TEMPORARY FUNCTION train_regression AS 'hivemall.regression.GeneralRegressionUDTF'")
+sqlContext.sql("DROP TEMPORARY FUNCTION IF EXISTS train_regressor")
+sqlContext.sql("CREATE TEMPORARY FUNCTION train_regressor AS 'hivemall.regression.GeneralRegressorUDTF'")
 sqlContext.sql("DROP TEMPORARY FUNCTION IF EXISTS logress")
 sqlContext.sql("CREATE TEMPORARY FUNCTION logress AS 'hivemall.regression.LogressUDTF'")
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/7205de1e/resources/ddl/define-udfs.td.hql
----------------------------------------------------------------------
diff --git a/resources/ddl/define-udfs.td.hql b/resources/ddl/define-udfs.td.hql
index d1bdfa4..dd694e3 100644
--- a/resources/ddl/define-udfs.td.hql
+++ b/resources/ddl/define-udfs.td.hql
@@ -172,7 +172,7 @@ create temporary function haversine_distance as 'hivemall.geospatial.HaversineDi
 create temporary function l2_norm as 'hivemall.tools.math.L2NormUDAF';
 create temporary function dimsum_mapper as 'hivemall.knn.similarity.DIMSUMMapperUDTF';
 create temporary function train_classifier as 'hivemall.classifier.GeneralClassifierUDTF';
-create temporary function train_regression as 'hivemall.regression.GeneralRegressionUDTF';
+create temporary function train_regressor as 'hivemall.regression.GeneralRegressorUDTF';
 create temporary function tree_export as 'hivemall.smile.tools.TreeExportUDF';
 -- NLP features
