[44/50] [abbrv] incubator-hivemall git commit: Updated the userguide

myui Wed, 30 Nov 2016 21:26:22 -0800

Updated the userguide

Project: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/repo
Commit: 
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/commit/a71bbb75
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/tree/a71bbb75
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/diff/a71bbb75


Branch: refs/heads/master
Commit: a71bbb75b3d6dc9d820ccf33939eb17d1de51d43
Parents: ae2307f
Author: myui <[email protected]>
Authored: Thu Nov 17 21:16:14 2016 +0900
Committer: myui <[email protected]>
Committed: Thu Nov 17 23:40:56 2016 +0900

----------------------------------------------------------------------
 docs/gitbook/SUMMARY.md                         |   2 +
 docs/gitbook/anomaly/lof.md                     |  16 +-
 docs/gitbook/binaryclass/a9a_lr.md              | 187 ++++++-----
 docs/gitbook/binaryclass/a9a_minibatch.md       |   7 +-
 docs/gitbook/binaryclass/kdd2010a_dataset.md    |   6 +-
 docs/gitbook/binaryclass/kdd2010b_dataset.md    |   6 +-
 docs/gitbook/binaryclass/news20_scw.md          |   2 +-
 docs/gitbook/binaryclass/titanic_rf.md          | 318 +++++++++++++++++++
 docs/gitbook/binaryclass/webspam_scw.md         |   2 +-
 docs/gitbook/eval/lr_datagen.md                 |   6 +-
 docs/gitbook/eval/stat_eval.md                  |  10 +-
 docs/gitbook/ft_engineering/hashing.md          |   4 +-
 docs/gitbook/getting_started/input-format.md    |  14 +-
 .../getting_started/permanent-functions.md      |   5 +-
 docs/gitbook/misc/generic_funcs.md              | 203 ++++++------
 docs/gitbook/misc/topk.md                       |  17 +-
 docs/gitbook/multiclass/iris_dataset.md         |   2 +-
 docs/gitbook/multiclass/iris_randomforest.md    |   4 +-
 docs/gitbook/multiclass/iris_scw.md             |   2 +-
 docs/gitbook/multiclass/news20_scw.md           |   2 +-
 docs/gitbook/recommend/item_based_cf.md         |   4 +-
 docs/gitbook/recommend/movielens_fm.md          |   7 +-
 docs/gitbook/recommend/movielens_mf.md          |  20 +-
 docs/gitbook/recommend/news20_knn.md            |   2 +-
 docs/gitbook/regression/e2006_arow.md           |   2 +-
 docs/gitbook/regression/kddcup12tr2_adagrad.md  | 254 +++++++--------
 docs/gitbook/regression/kddcup12tr2_dataset.md  |   2 +-
 .../regression/kddcup12tr2_lr_amplify.md        |   6 +-
 .../resources/images/kddtrack2tables.png        | Bin 0 -> 30323 bytes
 docs/gitbook/tips/addbias.md                    |   2 +-
 docs/gitbook/tips/emr.md                        |   2 +
 docs/gitbook/tips/hadoop_tuning.md              |   2 +
 docs/gitbook/tips/mixserver.md                  | 169 +++++-----
 docs/gitbook/tips/rand_amplify.md               |  12 +-
 docs/gitbook/tips/rowid.md                      |  27 +-
 docs/gitbook/tips/rt_prediction.md              |  16 +-
 36 files changed, 834 insertions(+), 508 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/SUMMARY.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/SUMMARY.md b/docs/gitbook/SUMMARY.md
index 7ef1b9b..c333c98 100644
--- a/docs/gitbook/SUMMARY.md
+++ b/docs/gitbook/SUMMARY.md
@@ -92,6 +92,8 @@
 * [Webspam Tutorial](binaryclass/webspam.md)
     * [Data preparation](binaryclass/webspam_dataset.md)
     * [PA1, AROW, SCW](binaryclass/webspam_scw.md)
+
+* [Kaggle Titanic Tutorial](binaryclass/titanic_rf.md)
     
 ## Part VI - Multiclass classification
 

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/anomaly/lof.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/anomaly/lof.md b/docs/gitbook/anomaly/lof.md
index 48990f8..39a6e9f 100644
--- a/docs/gitbook/anomaly/lof.md
+++ b/docs/gitbook/anomaly/lof.md
@@ -19,6 +19,8 @@
         
 This article introduces how to find outliers using [Local Outlier Factor 
(LOF)](http://en.wikipedia.org/wiki/Local_outlier_factor) on Hivemall.
 
+<!-- toc -->
+
 # Data Preparation
 
 ```sql
@@ -36,9 +38,9 @@ ROW FORMAT DELIMITED
 STORED AS TEXTFILE LOCATION '/dataset/lof/hundred_balls';
 ```
 
-Download 
[hundred_balls.txt](https://github.com/myui/hivemall/blob/master/resources/examples/lof/hundred_balls.txt)
 that is originally provides in [this 
article](http://next.rikunabi.com/tech/docs/ct_s03600.jsp?p=002259).
+Download 
[hundred_balls.txt](https://gist.githubusercontent.com/myui/f8b44ab925bc198e6d11b18fdd21269d/raw/bed05f811e4c351ed959e0159405690f2f11e577/hundred_balls.txt)
 that was originally provided in [this 
article](http://next.rikunabi.com/tech/docs/ct_s03600.jsp?p=002259).
 
-You can find outliers in [this 
picture](http://next.rikunabi.com/tech/contents/ts_report/img/201303/002259/part1_img1.jpg).
 As you can see, Rowid `87` is apparently an outlier.
+In this example, Rowid `87` is apparently an outlier.
 
 ```sh
 awk '{FS=" "; OFS=" "; print NR,$0}' hundred_balls.txt | \
@@ -144,11 +146,15 @@ where
 ;
 ```
 
-_Note: `list_neighbours` table SHOULD be created because `list_neighbours` is 
used multiple times._
+> #### Caution
+>
+> `list_neighbours` table SHOULD be created because `list_neighbours` is used 
multiple times.
 
-_Note: [`each_top_k`](https://github.com/myui/hivemall/pull/196) is supported 
from Hivemall v0.3.2-3 or later._
+# Parallelize Top-k computation
 
-_Note: To parallelize a top-k computation, break LEFT-hand table into piece as 
describe in [this 
page](https://github.com/myui/hivemall/wiki/Efficient-Top-k-computation-on-Apache-Hive-using-Hivemall-UDTF#parallelization-of-similarity-computation-using-with-clause)._
+> #### Info
+>
+> To parallelize a top-k computation, break the LEFT-hand table into pieces as 
described in [this page](../misc/topk.html).
 
 ```sql
 WITH k_distance as (

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/binaryclass/a9a_lr.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/binaryclass/a9a_lr.md 
b/docs/gitbook/binaryclass/a9a_lr.md
index 17d91c0..9bac63e 100644
--- a/docs/gitbook/binaryclass/a9a_lr.md
+++ b/docs/gitbook/binaryclass/a9a_lr.md
@@ -1,98 +1,91 @@
-<!--
-  Licensed to the Apache Software Foundation (ASF) under one
-  or more contributor license agreements.  See the NOTICE file
-  distributed with this work for additional information
-  regarding copyright ownership.  The ASF licenses this file
-  to you under the Apache License, Version 2.0 (the
-  "License"); you may not use this file except in compliance
-  with the License.  You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-  Unless required by applicable law or agreed to in writing,
-  software distributed under the License is distributed on an
-  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-  KIND, either express or implied.  See the License for the
-  specific language governing permissions and limitations
-  under the License.
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
 -->
-        
-a9a
-===
-http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#a9a
-
-_Training with iterations is OBSOLUTE in Hivemall._  
-_Using amplifier and shuffling inputs is RECOMMENDED in Hivemall._
-
----
-
-## UDF preparation
-
-```sql
-select count(1) from a9atrain;
--- set total_steps ideally be "count(1) / #map tasks"
-set hivevar:total_steps=32561;
-
-select count(1) from a9atest;
-set hivevar:num_test_instances=16281;
-```
-
-## training
-```sql
-create table a9a_model1 
-as
-select 
- cast(feature as int) as feature,
- avg(weight) as weight
-from 
- (select 
-     logress(addBias(features),label,"-total_steps ${total_steps}") as 
(feature,weight)
-  from 
-     a9atrain
- ) t 
-group by feature;
-```
-_"-total_steps" option is optional for logress() function._  
-_I recommend you NOT to use options (e.g., total_steps and eta0) if you are 
not familiar with those options. Hivemall then uses an autonomic ETA (learning 
rate) estimator._
-
-## prediction
-```sql
-create or replace view a9a_predict1 
-as
-WITH a9atest_exploded as (
-select 
-  rowid,
-  label,
-  extract_feature(feature) as feature,
-  extract_weight(feature) as value
-from 
-  a9atest LATERAL VIEW explode(addBias(features)) t AS feature
-)
-select
-  t.rowid, 
-  sigmoid(sum(m.weight * t.value)) as prob,
-  CAST((case when sigmoid(sum(m.weight * t.value)) >= 0.5 then 1.0 else 0.0 
end) as FLOAT) as label
-from 
-  a9atest_exploded t LEFT OUTER JOIN
-  a9a_model1 m ON (t.feature = m.feature)
-group by
-  t.rowid;
-```
-
-## evaluation
-```sql
-create or replace view a9a_submit1 as
-select 
-  t.label as actual, 
-  pd.label as predicted, 
-  pd.prob as probability
-from 
-  a9atest t JOIN a9a_predict1 pd 
-    on (t.rowid = pd.rowid);
-```
-
-```sql
-select count(1) / ${num_test_instances} from a9a_submit1 
-where actual == predicted;
-```
-> 0.8430071862907684
\ No newline at end of file
+
+<!-- toc -->
+
+# UDF preparation
+
+```sql
+select count(1) from a9atrain;
+-- set total_steps ideally be "count(1) / #map tasks"
+set hivevar:total_steps=32561;
+
+select count(1) from a9atest;
+set hivevar:num_test_instances=16281;
+```
+
+# training
+```sql
+create table a9a_model1 
+as
+select 
+ cast(feature as int) as feature,
+ avg(weight) as weight
+from 
+ (select 
+     logress(addBias(features),label,"-total_steps ${total_steps}") as 
(feature,weight)
+  from 
+     a9atrain
+ ) t 
+group by feature;
+```
+_"-total_steps" option is optional for logress() function._  
+_I recommend you NOT to use options (e.g., total_steps and eta0) if you are 
not familiar with those options. Hivemall then uses an autonomic ETA (learning 
rate) estimator._
+
+# prediction
+```sql
+create or replace view a9a_predict1 
+as
+WITH a9atest_exploded as (
+select 
+  rowid,
+  label,
+  extract_feature(feature) as feature,
+  extract_weight(feature) as value
+from 
+  a9atest LATERAL VIEW explode(addBias(features)) t AS feature
+)
+select
+  t.rowid, 
+  sigmoid(sum(m.weight * t.value)) as prob,
+  CAST((case when sigmoid(sum(m.weight * t.value)) >= 0.5 then 1.0 else 0.0 
end) as FLOAT) as label
+from 
+  a9atest_exploded t LEFT OUTER JOIN
+  a9a_model1 m ON (t.feature = m.feature)
+group by
+  t.rowid;
+```
+
+# evaluation
+```sql
+create or replace view a9a_submit1 as
+select 
+  t.label as actual, 
+  pd.label as predicted, 
+  pd.prob as probability
+from 
+  a9atest t JOIN a9a_predict1 pd 
+    on (t.rowid = pd.rowid);
+```
+
+```sql
+select count(1) / ${num_test_instances} from a9a_submit1 
+where actual == predicted;
+```
+> 0.8430071862907684
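
The prediction view in a9a_lr.md above can be hard to follow in SQL form. Below is a minimal Python sketch of the same idea, assuming features are "index:value" strings and the model is a plain mapping from feature index to averaged weight. The helper names (`add_bias`, `predict`) are illustrative only and are not Hivemall APIs.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def add_bias(features):
    # Mirrors the idea of addBias(): append the bias feature "0:1.0".
    return features + ["0:1.0"]


def predict(features, model):
    """Return (probability, label) for one row.

    model: dict mapping feature index (int) -> weight (float).
    Features missing from the model contribute nothing, which is how
    the LEFT OUTER JOIN plus sum() behaves for NULL weights.
    """
    total = 0.0
    for f in add_bias(features):
        index, value = f.split(":")
        weight = model.get(int(index))
        if weight is not None:
            total += weight * float(value)
    prob = sigmoid(total)
    return prob, (1.0 if prob >= 0.5 else 0.0)


# Hypothetical two-weight model: bias (index 0) and feature 3.
model = {0: -1.0, 3: 2.0}
prob, label = predict(["3:1.0", "7:1.0"], model)
```

One difference from the SQL: if no feature of a row matches the model at all, `sum()` yields NULL in Hive, whereas this sketch falls back to a probability of 0.5.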

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/binaryclass/a9a_minibatch.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/binaryclass/a9a_minibatch.md 
b/docs/gitbook/binaryclass/a9a_minibatch.md
index eaa7a06..a79ed86 100644
--- a/docs/gitbook/binaryclass/a9a_minibatch.md
+++ b/docs/gitbook/binaryclass/a9a_minibatch.md
@@ -17,13 +17,12 @@
   under the License.
 -->
         
-This page explains how to apply [Mini-Batch Gradient 
Descent](https://class.coursera.org/ml-003/lecture/106) for the training of 
logistic regression explained in [this 
example](https://github.com/myui/hivemall/wiki/a9a-binary-classification-(logistic-regression)).
 
-
-See [this 
page](https://github.com/myui/hivemall/wiki/a9a-binary-classification-(logistic-regression))
 first. This content depends on it.
+This page explains how to apply [Mini-Batch Gradient 
Descent](https://class.coursera.org/ml-003/lecture/106) for the training of 
logistic regression explained in [this example](./a9a_lr.html). 
+So, refer to [this page](./a9a_lr.html) first. This content depends on it.
 
 # Training
 
-Replace `a9a_model1` of [this 
example](https://github.com/myui/hivemall/wiki/a9a-binary-classification-(logistic-regression)).
+Replace `a9a_model1` of [this example](./a9a_lr.html).
 
 ```sql
 set hivevar:total_steps=32561;

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/binaryclass/kdd2010a_dataset.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/binaryclass/kdd2010a_dataset.md 
b/docs/gitbook/binaryclass/kdd2010a_dataset.md
index ca221c3..7634f66 100644
--- a/docs/gitbook/binaryclass/kdd2010a_dataset.md
+++ b/docs/gitbook/binaryclass/kdd2010a_dataset.md
@@ -19,9 +19,9 @@
         
 [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010 
(algebra)](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010
 (algebra))
 
-* # of classes: 2
-* # of data: 8,407,752 (training) / 510,302 (testing)
-* # of features: 20,216,830 in about 2.73 GB (training) / 20,216,830 (testing) 
+* the number of classes: 2
+* the number of examples: 8,407,752 (training) / 510,302 (testing)
+* the number of features: 20,216,830 in about 2.73 GB (training) / 20,216,830 
(testing) 
 
 ---
 # Define training/testing tables

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/binaryclass/kdd2010b_dataset.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/binaryclass/kdd2010b_dataset.md 
b/docs/gitbook/binaryclass/kdd2010b_dataset.md
index 41f0513..291a783 100644
--- a/docs/gitbook/binaryclass/kdd2010b_dataset.md
+++ b/docs/gitbook/binaryclass/kdd2010b_dataset.md
@@ -19,9 +19,9 @@
         
 [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010 
(bridge to 
algebra)](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010
 (bridge to algebra))
 
-* # of classes: 2
-* # of data: 19,264,097 / 748,401 (testing)
-* # of features: 29,890,095 / 29,890,095 (testing)
+* the number of classes: 2
+* the number of examples: 19,264,097 (training) / 748,401 (testing)
+* the number of features: 29,890,095 (training) / 29,890,095 (testing)
 
 ---
 # Define training/testing tables

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/binaryclass/news20_scw.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/binaryclass/news20_scw.md 
b/docs/gitbook/binaryclass/news20_scw.md
index fa1da7f..c3f51f4 100644
--- a/docs/gitbook/binaryclass/news20_scw.md
+++ b/docs/gitbook/binaryclass/news20_scw.md
@@ -16,7 +16,7 @@
   specific language governing permissions and limitations
   under the License.
 -->
-        
+
 ## UDF preparation
 ```
 use news20;

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/binaryclass/titanic_rf.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/binaryclass/titanic_rf.md 
b/docs/gitbook/binaryclass/titanic_rf.md
new file mode 100644
index 0000000..1a9786e
--- /dev/null
+++ b/docs/gitbook/binaryclass/titanic_rf.md
@@ -0,0 +1,318 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+This example shows basic usage of RandomForest on Hivemall using the [Kaggle 
Titanic](https://www.kaggle.com/c/titanic) dataset.
+The example gives a baseline score without any feature engineering.
+
+<!-- toc -->
+
+# Data preparation
+
+```sql
+create database titanic;
+use titanic;
+
+drop table train;
+create external table train (
+  passengerid int, -- unique id
+  survived int, -- target label
+  pclass int,
+  name string,
+  sex string,
+  age int,
+  sibsp int, -- Number of Siblings/Spouses Aboard
+  parch int, -- Number of Parents/Children Aboard
+  ticket string,
+  fare double,
+  cabin string,
+  embarked string
+) 
+ROW FORMAT DELIMITED
+   FIELDS TERMINATED BY '|'
+   LINES TERMINATED BY '\n'
+STORED AS TEXTFILE LOCATION '/dataset/titanic/train';
+
+hadoop fs -rm /dataset/titanic/train/train.csv
+awk '{ FPAT="([^,]*)|(\"[^\"]+\")";OFS="|"; } NR >1 
{$1=$1;$4=substr($4,2,length($4)-2);print $0}' train.csv | hadoop fs -put - 
/dataset/titanic/train/train.csv
+
+drop table test_raw;
+create external table test_raw (
+  passengerid int,
+  pclass int,
+  name string,
+  sex string,
+  age int,
+  sibsp int, -- Number of Siblings/Spouses Aboard
+  parch int, -- Number of Parents/Children Aboard
+  ticket string,
+  fare double,
+  cabin string,
+  embarked string
+)
+ROW FORMAT DELIMITED
+   FIELDS TERMINATED BY '|'
+   LINES TERMINATED BY '\n'
+STORED AS TEXTFILE LOCATION '/dataset/titanic/test_raw';
+
+hadoop fs -rm /dataset/titanic/test_raw/test.csv
+awk '{ FPAT="([^,]*)|(\"[^\"]+\")";OFS="|"; } NR >1 
{$1=$1;$3=substr($3,2,length($3)-2);print $0}' test.csv | hadoop fs -put - 
/dataset/titanic/test_raw/test.csv
+```
+
+## Data preparation for RandomForest
+
+```sql
+set hivevar:output_row=true;
+
+drop table train_rf;
+create table train_rf
+as
+WITH train_quantified as (
+  select    
+    quantify(
+      ${output_row}, passengerid, survived, pclass, name, sex, age, sibsp, 
parch, ticket, fare, cabin, embarked
+    ) as (passengerid, survived, pclass, name, sex, age, sibsp, parch, ticket, 
fare, cabin, embarked)
+  from (
+    select * from train
+    order by passengerid asc
+  ) t
+)
+select
+  rand(31) as rnd,
+  passengerid, 
+  array(pclass, name, sex, age, sibsp, parch, ticket, fare, cabin, embarked) 
as features,
+  survived
+from
+  train_quantified
+;
+
+drop table test_rf;
+create table test_rf
+as
+WITH test_quantified as (
+  select 
+    quantify(
+      output_row, passengerid, pclass, name, sex, age, sibsp, parch, ticket, 
fare, cabin, embarked
+    ) as (passengerid, pclass, name, sex, age, sibsp, parch, ticket, fare, 
cabin, embarked)
+  from (
+    -- need training data to assign consistent ids to categorical variables
+    select * from (
+      select
+        1 as train_first, false as output_row, passengerid, pclass, name, sex, 
age, sibsp, parch, ticket, fare, cabin, embarked
+      from
+        train
+      union all
+      select
+        2 as train_first, true as output_row, passengerid, pclass, name, sex, 
age, sibsp, parch, ticket, fare, cabin, embarked
+      from
+        test_raw
+    ) t0
+    order by train_first asc, passengerid asc
+  ) t1
+)
+select
+  passengerid, 
+  array(pclass, name, sex, age, sibsp, parch, ticket, fare, cabin, embarked) 
as features
+from
+  test_quantified
+;
+```
+
+---
+
+# Training
+
+`select guess_attribute_types(pclass, name, sex, age, sibsp, parch, ticket, 
fare, cabin, embarked) from train limit 1;`
+> Q,C,C,Q,Q,Q,C,Q,C,C
+
+`Q` and `C` represent quantitative and categorical variables, 
respectively.
+
+*Caution:* The output of `guess_attribute_types` is not perfect. 
Revise it yourself.
+For example, `pclass` is a categorical variable.
+
+```sql
+set hivevar:attrs=C,C,C,Q,Q,Q,C,Q,C,C;
+
+drop table model_rf;
+create table model_rf
+AS
+select
+  train_randomforest_classifier(features, survived, "-trees 500 -attrs 
${attrs}") 
+    -- as (model_id, model_type, pred_model, var_importance, oob_errors, 
oob_tests)
+from
+  train_rf
+;
+
+select
+  array_sum(var_importance) as var_importance,
+  sum(oob_errors) / sum(oob_tests) as oob_err_rate
+from
+  model_rf;
+
+> 
[137.00242639169272,1194.2140119834373,328.78017188176966,628.2568660509628,200.31275032394072,160.12876797647078,1083.5987543408116,664.1234312561456,422.89449844090393,130.72019667694784]
     0.18742985409652077
+```
+
+# Prediction
+
+```sql
+SET hivevar:classification=true;
+set hive.auto.convert.join=true;
+SET hive.mapjoin.optimized.hashtable=false;
+SET mapred.reduce.tasks=16;
+
+drop table predicted_rf;
+create table predicted_rf
+as
+SELECT 
+  passengerid,
+  predicted.label,
+  predicted.probability,
+  predicted.probabilities
+FROM (
+  SELECT
+    passengerid,
+    rf_ensemble(predicted) as predicted
+  FROM (
+    SELECT
+      t.passengerid, 
+      -- hivemall v0.4.1-alpha.2 or before
+      -- tree_predict(p.model, t.features, ${classification}) as predicted
+      -- hivemall v0.4.1-alpha.3 or later
+      tree_predict(p.model_id, p.model_type, p.pred_model, t.features, 
${classification}) as predicted
+    FROM (
+      SELECT model_id, model_type, pred_model FROM model_rf 
+      DISTRIBUTE BY rand(1)
+    ) p
+    LEFT OUTER JOIN test_rf t
+  ) t1
+  group by
+    passengerid
+) t2
+;
+```
+
+# Kaggle submission
+
+```sql
+drop table predicted_rf_submit;
+create table predicted_rf_submit
+  ROW FORMAT DELIMITED 
+    FIELDS TERMINATED BY ","
+    LINES TERMINATED BY "\n"
+  STORED AS TEXTFILE
+as
+SELECT passengerid, label as survived
+FROM predicted_rf
+ORDER BY passengerid ASC;
+```
+
+```sh
+hadoop fs -getmerge /user/hive/warehouse/titanic.db/predicted_rf_submit 
predicted_rf_submit.csv
+
+sed -i -e "1i PassengerId,Survived" predicted_rf_submit.csv
+```
+
+The accuracy would be `0.76555` for a Kaggle submission.
+
+---
+
+# Test by dividing training dataset
+
+```sql
+drop table train_rf_07;
+create table train_rf_07 
+as
+select * from train_rf 
+where rnd < 0.7;
+
+drop table test_rf_03;
+create table test_rf_03
+as
+select * from train_rf
+where rnd >= 0.7;
+
+drop table model_rf_07;
+create table model_rf_07
+AS
+select
+  train_randomforest_classifier(features, survived, "-trees 500 -attrs 
${attrs}") 
+from
+  train_rf_07;
+
+select
+  array_sum(var_importance) as var_importance,
+  sum(oob_errors) / sum(oob_tests) as oob_err_rate
+from
+  model_rf_07;
+> 
[116.12055542977338,960.8569891444097,291.08765260103837,469.74671636586226,163.721292772701,120.784769882858,847.9769298113661,554.4617571355476,346.3500941757221,97.42593940113392]
    0.1838351822503962
+
+SET hivevar:classification=true;
+SET hive.mapjoin.optimized.hashtable=false;
+SET mapred.reduce.tasks=16;
+
+drop table predicted_rf_03;
+create table predicted_rf_03
+as
+SELECT 
+  passengerid,
+  predicted.label,
+  predicted.probability,
+  predicted.probabilities
+FROM (
+  SELECT
+    passengerid,
+    rf_ensemble(predicted) as predicted
+  FROM (
+    SELECT
+      t.passengerid, 
+      -- hivemall v0.4.1-alpha.2 or before
+      -- tree_predict(p.model, t.features, ${classification}) as predicted
+      -- hivemall v0.4.1-alpha.3 or later
+      tree_predict(p.model_id, p.model_type, p.pred_model, t.features, 
${classification}) as predicted
+    FROM (
+      SELECT model_id, model_type, pred_model FROM model_rf_07
+      DISTRIBUTE BY rand(1)
+    ) p
+    LEFT OUTER JOIN test_rf_03 t
+  ) t1
+  group by
+    passengerid
+) t2
+;
+
+create or replace view rf_submit_03 as
+select 
+  t.survived as actual, 
+  p.label as predicted,
+  p.probabilities
+from 
+  test_rf_03 t 
+  JOIN predicted_rf_03 p on (t.passengerid = p.passengerid)
+;
+
+select count(1) from test_rf_03;
+> 260
+
+set hivevar:testcnt=260;
+
+select count(1)/${testcnt} as accuracy 
+from rf_submit_03 
+where actual = predicted;
+
+> 0.8
+```
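
The last two steps of titanic_rf.md above (combining per-tree predictions, then measuring accuracy on the held-out 30%) can be sketched in Python as follows. This assumes the ensemble step behaves like a plain majority vote over integer class labels; the actual `rf_ensemble` implementation may differ, and the function names here are illustrative only.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Combine per-tree class labels into (label, probability, probabilities).

    Assumes labels are non-negative integers. Ties go to the lowest label.
    """
    counts = Counter(tree_predictions)
    total = len(tree_predictions)
    n_classes = max(counts) + 1
    probabilities = [counts.get(c, 0) / total for c in range(n_classes)]
    label = max(range(n_classes), key=lambda c: probabilities[c])
    return label, probabilities[label], probabilities


def accuracy(actual, predicted):
    # Same idea as: select count(1)/${testcnt} ... where actual = predicted
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / len(actual)


vote = majority_vote([1, 0, 1, 1])            # four hypothetical trees
acc = accuracy([1, 0, 1, 0, 1], [1, 0, 0, 0, 1])
```

The input lists are made-up toy values, not rows from the Titanic dataset.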

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/binaryclass/webspam_scw.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/binaryclass/webspam_scw.md 
b/docs/gitbook/binaryclass/webspam_scw.md
index cadd0ab..067e8f2 100644
--- a/docs/gitbook/binaryclass/webspam_scw.md
+++ b/docs/gitbook/binaryclass/webspam_scw.md
@@ -152,4 +152,4 @@ from
 select count(1)/70000 from webspam_scw_submit1 
 where actual = predicted;
 ```
-> Prediction accuracy: 0.9778714285714286
\ No newline at end of file
+> Prediction accuracy: 0.9778714285714286

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/eval/lr_datagen.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/eval/lr_datagen.md b/docs/gitbook/eval/lr_datagen.md
index 8fa5239..c0cbce0 100644
--- a/docs/gitbook/eval/lr_datagen.md
+++ b/docs/gitbook/eval/lr_datagen.md
@@ -17,7 +17,7 @@
   under the License.
 -->
         
-_Note this feature is supported on hivemall v0.2-alpha3 or later._
+<!-- toc -->
 
 # create a dual table
 
@@ -33,10 +33,10 @@ INSERT INTO TABLE dual SELECT count(*)+1 FROM dual;
 ```sql
 create table regression_data1
 as
-select lr_datagen("-n_examples 10k -n_features 10 -seed 100") as 
(label,features)
+select lr_datagen('-n_examples 10k -n_features 10 -seed 100') as 
(label,features)
 from dual;
 ```
-Find the details of the option in 
[LogisticRegressionDataGeneratorUDTF.java](https://github.com/myui/hivemall/blob/master/core/src/main/java/hivemall/dataset/LogisticRegressionDataGeneratorUDTF.java#L69).
+To find the details of the options, run `lr_datagen('-help')`.
 
 You can generate a sparse dataset as well as a dense dataset. By default, 
a sparse dataset is generated.
 ```sql

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/eval/stat_eval.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/eval/stat_eval.md b/docs/gitbook/eval/stat_eval.md
index 6b0af8e..149adf8 100644
--- a/docs/gitbook/eval/stat_eval.md
+++ b/docs/gitbook/eval/stat_eval.md
@@ -17,7 +17,9 @@
   under the License.
 -->
         
-Using the [E2006 tfidf regression 
example](https://github.com/myui/hivemall/wiki/E2006-tfidf-regression-evaluation-(PA,-AROW)),
 we explain how to evaluate the prediction model on Hive.
+Using the [E2006 tfidf regression example](../regression/e2006_arow.html), we 
explain how to evaluate the prediction model on Hive.
+
+<!-- toc -->
 
 # Scoring by evaluation metrics
 
@@ -69,7 +71,7 @@ from t;
 ```
 > 1.9610366706408238   1.9610366706408238
 
---
-**References**
+# References
+
 * R2 http://en.wikipedia.org/wiki/Coefficient_of_determination
-* Evaluation Metrics https://www.kaggle.com/wiki/Metrics
\ No newline at end of file
+* Evaluation Metrics https://www.kaggle.com/wiki/Metrics

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/ft_engineering/hashing.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/ft_engineering/hashing.md 
b/docs/gitbook/ft_engineering/hashing.md
index daf4a23..f467002 100644
--- a/docs/gitbook/ft_engineering/hashing.md
+++ b/docs/gitbook/ft_engineering/hashing.md
@@ -17,10 +17,10 @@
   under the License.
 -->
         
-Hivemall supports [Feature 
Hashing](https://github.com/myui/hivemall/wiki/Feature-hashing) (a.k.a. hashing 
trick) through `feature_hashing` and `mhash` functions. 
+Hivemall supports [Feature 
Hashing](https://en.wikipedia.org/wiki/Feature_hashing) (a.k.a. hashing trick) 
through `feature_hashing` and `mhash` functions. 
 Find the differences in the following examples.
 
-_Note: `feature_hashing` UDF is supported since Hivemall `v0.4.2-rc.1`._
+<!-- toc -->
 
 ## `feature_hashing` function
 

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/getting_started/input-format.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/getting_started/input-format.md 
b/docs/gitbook/getting_started/input-format.md
index 698c095..59e6a5f 100644
--- a/docs/gitbook/getting_started/input-format.md
+++ b/docs/gitbook/getting_started/input-format.md
@@ -24,14 +24,14 @@ Here, we use 
[EBNF](http://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Fo
 
 # Input Format for Classification 
 
-The classifiers of Hivemall takes 2 (or 3) arguments: *features*, *label*, and 
*options* (a.k.a. 
[hyperparameters](http://en.wikipedia.org/wiki/Hyperparameter)). The first two 
arguments of training functions (e.g., 
[logress](https://github.com/myui/hivemall/wiki/a9a-binary-classification-(logistic-regression))
 and 
[train_scw](https://github.com/myui/hivemall/wiki/news20-binary-classification-%232-(CW,-AROW,-SCW)))
 represents training examples. 
+The classifiers of Hivemall take 2 (or 3) arguments: *features*, *label*, and 
*options* (a.k.a. 
[hyperparameters](http://en.wikipedia.org/wiki/Hyperparameter)). The first two 
arguments of training functions represent training examples. 
 
 In Statistics, *features* and *label* are called [Explanatory variable and 
Response Variable](http://www.oswego.edu/~srp/stats/variable_types.htm), 
respectively.
 
 # Features format (for classification and regression)
 
 The format of *features* is common between (binary and multi-class) 
classification and regression.
-Hivemall accepts ARRAY&lt;INT|BIGINT|TEXT> for the type of *features* column.
+Hivemall accepts `ARRAY<INT|BIGINT|TEXT>` for the type of *features* column.
 
 Hivemall uses a *sparse* data format (cf. [Compressed Row 
Storage](http://netlib.org/linalg/html_templates/node91.html)) which is similar 
to 
[LIBSVM](http://stackoverflow.com/questions/12112558/read-write-data-in-libsvm-format)
 and [Vowpal 
Wabbit](https://github.com/JohnLangford/vowpal_wabbit/wiki/Input-format).
 
@@ -52,7 +52,7 @@ Here is an instance of a features.
 10:3.4  123:0.5  34567:0.231
 ```
 
-*Note:* As mentioned later, *index* "0" is reserved for a [Bias/Dummy 
variable](https://github.com/myui/hivemall/wiki/Using-explicit-addBias()-for-a-better-prediction).
+*Note:* As mentioned later, *index* "0" is reserved for a [Bias/Dummy 
variable](../tips/addbias.html).
 
 In addition to numbers, you can use a TEXT value for an index. For example, 
you can use array("height:1.5", "length:2.0") for the features.
 ```
@@ -80,15 +80,15 @@ Note 1.0 is used for the weight when omitting *weight*.
 
 Note that "0" is reserved for a Bias variable (called dummy variable in 
Statistics). 
 
-The 
[addBias](https://github.com/myui/hivemall/wiki/Using-explicit-addBias()-for-a-better-prediction)
 function is Hivemall appends "0:1.0" as an element of array in *features*.
+The [addBias](../tips/addbias.html) function of Hivemall appends "0:1.0" as an 
element of the array in *features*.
 
 ## Feature hashing
 
-Hivemall supports [feature hashing/hashing 
trick](http://en.wikipedia.org/wiki/Feature_hashing) through [mhash 
function](https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-dataset#converting-feature-representation-by-feature-hashing).
+Hivemall supports [feature hashing/hashing 
trick](http://en.wikipedia.org/wiki/Feature_hashing) through [mhash 
function](../ft_engineering/hashing.html#mhash-function).
 
 The mhash function takes a feature (i.e., *index*) in TEXT format and 
generates a hash number in the range from 1 to 2^24 (=16777216) by the default 
setting.
 
-Feature hashing is useful where the dimension of feature vector (i.e., the 
number of elements in *features*) is so large. Consider applying [mhash 
function]((https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-dataset#converting-feature-representation-by-feature-hashing))
 when a prediction model does not fit in memory and OutOfMemory exception 
happens.
+Feature hashing is useful when the dimension of the feature vector (i.e., the 
number of elements in *features*) is very large. Consider applying the [mhash 
function](../ft_engineering/hashing.html#mhash-function) when a prediction 
model does not fit in memory and an OutOfMemory exception occurs.
 
 In general, you don't need to use mhash when the dimension of feature vector 
is less than 16777216.
 If feature *index* is very long TEXT (e.g., "xxxxxxx-yyyyyy-weight:55.3") and 
uses huge memory spaces, consider using mhash as follows:
@@ -103,7 +103,7 @@ 
feature(mhash(extract_feature("xxxxxxx-yyyyyy-weight:55.3")), extract_weight("xx
 
 ## Feature Normalization
 
-Feature (weight) normalization is important in machine learning. Please refer 
[https://github.com/myui/hivemall/wiki/Feature-scaling](https://github.com/myui/hivemall/wiki/Feature-scaling)
 for detail.
+Feature (weight) normalization is important in machine learning. Please refer 
to [this article](../ft_engineering/scaling.html) for details.
 
 ***
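
As a minimal sketch of the feature hashing described for `input-format.md` above, the query below hashes a TEXT index into an integer index while keeping its weight. It reuses the `feature`, `mhash`, `extract_feature`, and `extract_weight` calls that appear in the hunk header above; the literal `"height:1.5"` is only an illustrative value, and the query assumes the Hivemall functions are already defined in the session.

```sql
-- Assumes Hivemall UDFs are already registered in the current session.
-- "height:1.5" is an illustrative feature: index "height", weight 1.5.
select
  feature(
    mhash(extract_feature("height:1.5")),  -- hash the TEXT index into an integer index
    extract_weight("height:1.5")           -- keep the original weight
  ) as hashed_feature;
```

The result should again be an `index:weight` string, with the TEXT index replaced by its hash value.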
 

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/getting_started/permanent-functions.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/getting_started/permanent-functions.md 
b/docs/gitbook/getting_started/permanent-functions.md
index 75156fe..7afc780 100644
--- a/docs/gitbook/getting_started/permanent-functions.md
+++ b/docs/gitbook/getting_started/permanent-functions.md
@@ -21,8 +21,6 @@ Hive v0.13 or later supports [permanent 
functions](https://cwiki.apache.org/conf
 
 Permanent functions are useful when you are using Hive through Hiveserver or 
to avoid hivemall installation for each session.
 
-_Note: This feature is supported since hivemall-0.3 beta 3 or later._
-
 <!-- toc -->
 
 # Put hivemall jar to HDFS
@@ -58,4 +56,5 @@ show functions "hivemall.*";
 ```
 
 > #### Caution
-You need to specify "hivemall." prefix to call hivemall UDFs in your queries 
if UDFs are loaded into non-default scheme, in this case _hivemall_.
+>
+> You need to specify the "hivemall." prefix to call Hivemall UDFs in your 
queries if the UDFs are loaded into a non-default schema, in this case _hivemall_.
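
For illustration, assuming the UDFs were registered as permanent functions in a database named `hivemall` as described above, a query issued from another database would qualify the function name. `array_concat` is used here only because it is documented in `generic_funcs.md`; any Hivemall UDF could be substituted.

```sql
-- Assumes permanent functions were created in the `hivemall` database.
-- From any other database, qualify the UDF with the "hivemall." prefix:
select hivemall.array_concat(array(1), array(2,3));

-- Alternatively, switch to that database and call the UDF unqualified:
use hivemall;
select array_concat(array(1), array(2,3));
```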

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/misc/generic_funcs.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/misc/generic_funcs.md 
b/docs/gitbook/misc/generic_funcs.md
index 9749dae..b3a0421 100644
--- a/docs/gitbook/misc/generic_funcs.md
+++ b/docs/gitbook/misc/generic_funcs.md
@@ -19,61 +19,63 @@
         
 This page describes a list of useful Hivemall generic functions.
 
+<!-- toc -->
+
 # Array functions
 
 ## Array UDFs
 
 - `array_concat(array<ANY> x1, array<ANY> x2, ..)` - Returns a concatenated 
array
 
-```sql
-select array_concat(array(1),array(2,3));
-> [1,2,3]
-```
+    ```sql
+    select array_concat(array(1),array(2,3));
+    > [1,2,3]
+    ```
 
 - `array_intersect(array<ANY> x1, array<ANY> x2, ..)` - Returns an intersect 
of given arrays
 
-```sql
-select array_intersect(array(1,3,4),array(2,3,4),array(3,5));
-> [3]
-```
+    ```sql
+    select array_intersect(array(1,3,4),array(2,3,4),array(3,5));
+    > [3]
+    ```
 
 - `array_remove(array<int|text> original, int|text|array<int> target)` - 
Returns an array that the target is removed from the original array
 
-```sql
-select array_remove(array(1,null,3),array(null));
-> [3]
-
-select array_remove(array("aaa","bbb"),"bbb");
-> ["aaa"]
-```
+    ```sql
+    select array_remove(array(1,null,3),array(null));
+    > [3]
+    
+    select array_remove(array("aaa","bbb"),"bbb");
+    > ["aaa"]
+    ```
 
-- `sort_and_uniq_array(array<int>)` - Takes an array of type int and returns a 
sorted array in a natural order with duplicate elements eliminated
+- `sort_and_uniq_array(array<int>)` - Takes an array of type INT and returns a 
sorted array in a natural order with duplicate elements eliminated
 
-```sql
-select sort_and_uniq_array(array(3,1,1,-2,10));
-> [-2,1,3,10]
-```
+    ```sql
+    select sort_and_uniq_array(array(3,1,1,-2,10));
+    > [-2,1,3,10]
+    ```
 
 - `subarray_endwith(array<int|text> original, int|text key)` - Returns an 
array that ends with the specified key
-
-```sql
-select subarray_endwith(array(1,2,3,4), 3);
-> [1,2,3]
-```
+    
+    ```sql
+    select subarray_endwith(array(1,2,3,4), 3);
+    > [1,2,3]
+    ```
 
 - `subarray_startwith(array<int|text> original, int|text key)` - Returns an 
array that starts with the specified key
 
-```sql
-select subarray_startwith(array(1,2,3,4), 2);
-> [2,3,4]
-```
+    ```sql
+    select subarray_startwith(array(1,2,3,4), 2);
+    > [2,3,4]
+    ```
 
-- `subarray(array<int> orignal, int fromIndex, int toIndex)` - Returns a slice 
of the original array between the inclusive fromIndex and the exclusive toIndex
+- `subarray(array<int> original, int fromIndex, int toIndex)` - Returns a slice 
of the original array between the inclusive `fromIndex` and the exclusive 
`toIndex`
 
-```sql
-select subarray(array(1,2,3,4,5,6), 2,4);
-> [3,4]
-```
+    ```sql
+    select subarray(array(1,2,3,4,5,6), 2,4);
+    > [3,4]
+    ```
 
 ## Array UDAFs
 
@@ -87,47 +89,45 @@ select subarray(array(1,2,3,4,5,6), 2,4);
 
 - `to_bits(int[] indexes)` - Returns a bitset representation of the given 
indexes in long[]
 
-```sql
-select to_bits(array(1,2,3,128));
->[14,-9223372036854775808]
-```
+    ```sql
+    select to_bits(array(1,2,3,128));
+    >[14,-9223372036854775808]
+    ```
 
 - `unbits(long[] bitset)` - Returns a long array of the given bitset 
representation
 
-```sql
-select unbits(to_bits(array(1,4,2,3)));
-> [1,2,3,4]
-```
+    ```sql
+    select unbits(to_bits(array(1,4,2,3)));
+    > [1,2,3,4]
+    ```
 
 - `bits_or(array<long> b1, array<long> b2, ..)` - Returns a logical OR of the given 
bitsets
 
-```sql
-select unbits(bits_or(to_bits(array(1,4)),to_bits(array(2,3))));
-> [1,2,3,4]
-```
+    ```sql
+    select unbits(bits_or(to_bits(array(1,4)),to_bits(array(2,3))));
+    > [1,2,3,4]
+    ```
 
 ## Bitset UDAF
 
 - `bits_collect(int|long x)` - Returns a bitset in array<long>
 
-
 # Compression functions
 
-- `deflate(TEXT data [, const int compressionLevel])` - Returns a compressed 
BINARY obeject by using Deflater.
+- `deflate(TEXT data [, const int compressionLevel])` - Returns a compressed 
BINARY object by using Deflater.
 The compression level must be in range [-1,9]
 
-```sql
-select base91(deflate('aaaaaaaaaaaaaaaabbbbccc'));
-> AA+=kaIM|WTt!+wbGAA
-```
+    ```sql
+    select base91(deflate('aaaaaaaaaaaaaaaabbbbccc'));
+    > AA+=kaIM|WTt!+wbGAA
+    ```
 
 - `inflate(BINARY compressedData)` - Returns a decompressed STRING by using 
Inflater
 
-
-```sql
-select inflate(unbase91(base91(deflate('aaaaaaaaaaaaaaaabbbbccc'))));
-> aaaaaaaaaaaaaaaabbbbccc
-```
+    ```sql
+    select inflate(unbase91(base91(deflate('aaaaaaaaaaaaaaaabbbbccc'))));
+    > aaaaaaaaaaaaaaaabbbbccc
+    ```
 
 # Map functions
 
@@ -152,33 +152,33 @@ select 
inflate(unbase91(base91(deflate('aaaaaaaaaaaaaaaabbbbccc'))));
 
 # Math functions
 
-- `sigmoid(x)` - Returns 1.0 / (1.0 + exp(-x))
+- `sigmoid(x)` - Returns `1.0 / (1.0 + exp(-x))`
 
 # Text processing functions
 
 - `base91(binary)` - Convert the argument from binary to a BASE91 string
 
-```sql
-select base91(deflate('aaaaaaaaaaaaaaaabbbbccc'));
-> AA+=kaIM|WTt!+wbGAA
-```
+    ```sql
+    select base91(deflate('aaaaaaaaaaaaaaaabbbbccc'));
+    > AA+=kaIM|WTt!+wbGAA
+    ```
 
 - `unbase91(string)` - Convert a BASE91 string to a binary
 
-```sql
-select inflate(unbase91(base91(deflate('aaaaaaaaaaaaaaaabbbbccc'))));
-> aaaaaaaaaaaaaaaabbbbccc
-```
+    ```sql
+    select inflate(unbase91(base91(deflate('aaaaaaaaaaaaaaaabbbbccc'))));
+    > aaaaaaaaaaaaaaaabbbbccc
+    ```
 
 - `normalize_unicode(string str [, string form])` - Transforms `str` with the 
specified normalization form. The `form` takes one of NFC (default), NFD, NFKC, 
or NFKD
 
-```sql
-select normalize_unicode('ï¾ï¾ï½¶ï½¸ï½¶ï¾','NFKC');
-> ãã³ã«ã¯ã«ã
-
-select normalize_unicode('ã±ã§ã¦â¢','NFKC');
-> (æ ª)ãã³ãã«III
-```
+    ```sql
+    select normalize_unicode('ï¾ï¾ï½¶ï½¸ï½¶ï¾','NFKC');
+    > ãã³ã«ã¯ã«ã
+    
+    select normalize_unicode('ã±ã§ã¦â¢','NFKC');
+    > (æ ª)ãã³ãã«III
+    ```
 
 - `split_words(string query [, string regex])` - Returns an array<text> 
containing the split strings
 
@@ -186,44 +186,37 @@ select normalize_unicode('㈱㌧㌦Ⅲ','NFKC');
 
 - `tokenize(string englishText [, boolean toLowerCase])` - Returns words in 
array<string>
 
-- `tokenize_ja(String line [, const string mode = "normal", const list<string> 
stopWords, const list<string> stopTags])` - returns tokenized strings in 
array<string>
-
-```sql
-select 
tokenize_ja("kuromojiãä½¿ã£ãåãã¡æ¸ãã®ãã¹ãã§ããç¬¬äºå¼æ°ã«ã¯normal/search/extendedãæå®ã§ãã¾ããããã©ã«ãã§ã¯normalã¢ã¼ãã§ãã");
+- `tokenize_ja(String line [, const string mode = "normal", const list<string> 
stopWords, const list<string> stopTags])` - Returns tokenized strings in 
array<string>. Refer to [this article](../misc/tokenizer.html) for details.
 
-> 
["kuromoji","使う","分かち書き","テスト","第","二","引数","normal","search","extended","指定","デフォルト","normal","
 モード"]
-```
-
-https://github.com/myui/hivemall/wiki/Tokenizer
+    ```sql
+    select 
tokenize_ja("kuromojiãä½¿ã£ãåãã¡æ¸ãã®ãã¹ãã§ããç¬¬äºå¼æ°ã«ã¯normal/search/extendedãæå®ã§ãã¾ããããã©ã«ãã§ã¯normalã¢ã¼ãã§ãã");
+    
+    > 
["kuromoji","使う","分かち書き","テスト","第","二","引数","normal","search","extended","指定","デフォルト","normal","
 モード"]
+    ```
 
 # Other functions
 
 - `convert_label(const int|const float)` - Convert from -1|1 to 0.0f|1.0f, or 
from 0.0f|1.0f to -1|1
 
-- `each_top_k(int K, Object group, double cmpKey, *)` - Returns top-K values 
(or tail-K values when k is less than 0)
-
-https://github.com/myui/hivemall/wiki/Efficient-Top-k-computation-on-Apache-Hive-using-Hivemall-UDTF
+- `each_top_k(int K, Object group, double cmpKey, *)` - Returns top-K values 
(or tail-K values when K is less than 0). Refer to [this 
article](../misc/topk.html) for details.
 
 - `generate_series(const int|bigint start, const int|bigint end)` - Generate a 
series of values, from start to end
 
-```sql
-WITH dual as (
-  select 1
-)
-select generate_series(1,9)
-from dual;
-
-1
-2
-3
-4
-5
-6
-7
-8
-9
-```
-
-A similar function to PostgreSQL's `generate_serics`.
-http://www.postgresql.org/docs/current/static/functions-srf.html
-- `x_rank(KEY)` - Generates a pseudo sequence number starting from 1 for each 
key
\ No newline at end of file
+    ```sql
+    select generate_series(1,9);
+    
+    1
+    2
+    3
+    4
+    5
+    6
+    7
+    8
+    9
+    ```
+
+    A similar function to PostgreSQL's `generate_series`.
+    http://www.postgresql.org/docs/current/static/functions-srf.html
+
+- `x_rank(KEY)` - Generates a pseudo sequence number starting from 1 for each 
key

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/misc/topk.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/misc/topk.md b/docs/gitbook/misc/topk.md
index d6e7b93..6a80514 100644
--- a/docs/gitbook/misc/topk.md
+++ b/docs/gitbook/misc/topk.md
@@ -23,7 +23,10 @@ This function is particularly useful for applying a 
similarity/distance function
 
 `each_top_k` is very fast when compared to other methods running top-k queries 
(e.g., [`rank/distribute 
by`](https://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/))
 in Hive.
 
-## Caution
+<!-- toc -->
+
+# Caution
+
 * `each_top_k` is supported from Hivemall v0.3.2-3 or later.
 * This UDTF assumes that input records are sorted by `group`. Use `DISTRIBUTE 
BY group SORT BY group` to ensure that. Or, you can use `LEFT OUTER JOIN` for 
certain cases.
 * It takes a variable number of arguments in `argN`. 
@@ -32,7 +35,9 @@ This function is particularly useful for applying a 
similarity/distance function
 * If k is less than 0, reverse order is used and `tail-K` records are returned 
for each `group`.
 * Note that this function returns [a pseudo 
ranking](http://www.michaelpollmeier.com/selecting-top-k-items-from-a-list-efficiently-in-java-groovy/)
 for top-k. It always returns `at-most K` records for each group. The ranking 
scheme is similar to `dense_rank` but slightly different in certain cases.
 
-# Efficient Top-k Query Processing using `each_top_k`
+# Usage
+
+## Efficient Top-k Query Processing using `each_top_k`
 
 Efficient processing of Top-k queries is a crucial requirement in many 
interactive environments that involve massive amounts of data. 
 Our Hive extension `each_top_k` helps running Top-k processing efficiently.
@@ -87,7 +92,8 @@ FROM (
 ```
 
 > #### Note
-`CLUSTER BY x` is a synonym of `DISTRIBUTE BY x CLASS SORT BY x` and required 
when using `each_top_k`.
+>
+> `CLUSTER BY x` is a synonym of `DISTRIBUTE BY x SORT BY x` and is 
required when using `each_top_k`.
 
 The function signature of `each_top_k` is `each_top_k(int k, ANY group, double 
value, arg1, arg2, ..., argN)` and it returns a relation `(int rank, double 
value, arg1, arg2, .., argN)`.
 
@@ -99,9 +105,8 @@ If `k` is less than 0, reverse order is used and tail-K 
records are returned for
 The ranking semantics of `each_top_k` follows SQL's `dense_rank` and then 
limits results by `k`. 
 
 > #### Caution
-`each_top_k` is benefical where the number of grouping keys are large. If the 
number of grouping keys are not so large (e.g., less than 100), consider using 
`rank() over` instead.
-
-# Usage
+>
+> `each_top_k` is beneficial when the number of grouping keys is large. If 
the number of grouping keys is not so large (e.g., less than 100), consider 
using `rank() over` instead.
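
A minimal sketch of the signature above, assuming a hypothetical table `scores(category, item, score)` that is not part of this guide: the query returns at most two of the highest-scoring items for each `category`. The inner `CLUSTER BY` supplies the per-group ordering that `each_top_k` assumes.

```sql
-- Hypothetical input table: scores(category string, item string, score double)
select
  each_top_k(
    2, category, score,   -- k, group, and comparison value
    category, item        -- extra columns passed through as arg1..argN
  ) as (rank, score, category, item)
from (
  select category, item, score
  from scores
  CLUSTER BY category     -- each_top_k expects rows sorted by group
) t;
```

Passing a negative `k` (for example `-2`) would return the tail records for each group instead.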
 
 ## top-k clicks 
 

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/multiclass/iris_dataset.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/multiclass/iris_dataset.md 
b/docs/gitbook/multiclass/iris_dataset.md
index 38a6831..e67737e 100644
--- a/docs/gitbook/multiclass/iris_dataset.md
+++ b/docs/gitbook/multiclass/iris_dataset.md
@@ -113,7 +113,7 @@ select * from iris_scaled limit 3;
 > 3       Iris-setosa     
 > ["1:0.11111101","2:0.5","3:0.05084745","4:0.041666664","0:1.0"]
 ```
 
-_[LibSVM web 
page](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#iris)
 provides a normalized (using 
[ZScore](https://github.com/myui/hivemall/wiki/Feature-scaling)) version of 
Iris dataset._
+_[LibSVM web 
page](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#iris)
 provides a normalized (using 
[ZScore](../ft_engineering/scaling.html#feature-scaling-by-zscore)) version of 
Iris dataset._
 
 # Create training/test data
 

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/multiclass/iris_randomforest.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/multiclass/iris_randomforest.md 
b/docs/gitbook/multiclass/iris_randomforest.md
index fd85471..4b0750c 100644
--- a/docs/gitbook/multiclass/iris_randomforest.md
+++ b/docs/gitbook/multiclass/iris_randomforest.md
@@ -16,8 +16,6 @@
   specific language governing permissions and limitations
   under the License.
 -->
-        
-*NOTE: RandomForest is being supported from Hivemall v0.4 or later.*
 
 # Dataset
 
@@ -323,4 +321,4 @@ WHERE
   actual = predicted
 ;
 ```
-> 0.9533333333333334
\ No newline at end of file
+> 0.9533333333333334

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/multiclass/iris_scw.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/multiclass/iris_scw.md 
b/docs/gitbook/multiclass/iris_scw.md
index fd85471..79cdaf4 100644
--- a/docs/gitbook/multiclass/iris_scw.md
+++ b/docs/gitbook/multiclass/iris_scw.md
@@ -323,4 +323,4 @@ WHERE
   actual = predicted
 ;
 ```
-> 0.9533333333333334
\ No newline at end of file
+> 0.9533333333333334

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/multiclass/news20_scw.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/multiclass/news20_scw.md 
b/docs/gitbook/multiclass/news20_scw.md
index f6f21af..24e0fad 100644
--- a/docs/gitbook/multiclass/news20_scw.md
+++ b/docs/gitbook/multiclass/news20_scw.md
@@ -335,4 +335,4 @@ where actual == predicted;
 drop table news20mc_scw2_model1;
 drop table news20mc_scw2_predict1;
 drop view news20mc_scw2_submit1;
-```
\ No newline at end of file
+```

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/recommend/item_based_cf.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/recommend/item_based_cf.md 
b/docs/gitbook/recommend/item_based_cf.md
index 2eb7890..a674f70 100644
--- a/docs/gitbook/recommend/item_based_cf.md
+++ b/docs/gitbook/recommend/item_based_cf.md
@@ -90,7 +90,7 @@ group by
 
 **Caution:** _Item-Item cooccurrence matrix is a symmetric matrix that has the 
number of total occurrences for each diagonal element. If the number of items is 
`k`, then the size of the expected matrix is `k * (k - 1) / 2`, usually a very 
large one._
 
-_Better to use 
[2.2.2.](https://github.com/myui/hivemall/wiki/Item-based-Collaborative-Filtering#limiting-size-of-elements-in-cooccurrence_upper_triangular)
 instead of 
[2.2.1.](https://github.com/myui/hivemall/wiki/Item-based-Collaborative-Filtering#221-create-cooccurrence-table-directly)
 for creating a `cooccurrence` table where dataset is large._
+_Better to use 
[2.2.2.](#222-create-cooccurrence-table-from-upper-triangular-matrix-of-cooccurrence)
 instead of [2.2.1.](#221-create-cooccurrence-table-directly) for creating a 
`cooccurrence` table where dataset is large._
 
 ### 2.2.1. Create cooccurrence table directly
 
@@ -257,7 +257,7 @@ GROUP BY
 Item-Item similarity computation is known to have a computational complexity of 
`O(n^2)` where `n` is the number of items.
 Depending on your cluster size and your dataset, the optimal solution differs.
 
-**Note:** _Better to use 
[3.1.1.](https://github.com/myui/hivemall/wiki/Item-based-Collaborative-Filtering#311-similarity-computation-using-the-symmetric-property-of-item-similarity-matrix)
 scheme where dataset is large._
+**Note:** _Better to use 
[3.1.1.](#311-similarity-computation-using-the-symmetric-property-of-item-similarity-matrix)
 scheme where dataset is large._
 
 ### 3.1. Shuffle heavy similarity computation
 

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/recommend/movielens_fm.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/recommend/movielens_fm.md 
b/docs/gitbook/recommend/movielens_fm.md
index eac8013..ad59324 100644
--- a/docs/gitbook/recommend/movielens_fm.md
+++ b/docs/gitbook/recommend/movielens_fm.md
@@ -21,8 +21,7 @@ _Caution: Factorization Machine is supported from Hivemall 
v0.4 or later._
 
 # Data preparation
 
-First of all, please create `ratings` table described in the following page: 
-https://github.com/myui/hivemall/wiki/MovieLens-Dataset
+First of all, please create the `ratings` table described in [this 
article](../recommend/movielens_dataset.html).
 
 ```sql
 use movielens;
@@ -190,7 +189,7 @@ usage: train_fm(array<string> x, double y [, const string 
options]) -
 
 ```sql
 -- workaround for a bug 
--- 
https://github.com/myui/hivemall/wiki/Map-side-Join-causes-ClassCastException-on-Tez:-LazyBinaryArray-cannot-be-cast-to-%5BLjava.lang.Object;
+-- https://issues.apache.org/jira/browse/HIVE-11051
 set hive.mapjoin.optimized.hashtable=false;
 
 drop table fm_predict;
@@ -222,7 +221,7 @@ from
 # Fast Factorization Machines Training using Int Features
 
 Training of Factorization Machines (FM) can be done more efficiently, in terms 
of speed, by using INT features.
-In this section, we show how to run FM training by using int features, more 
specifically by using [feature 
hashing](https://github.com/myui/hivemall/wiki/Feature-hashing).
+In this section, we show how to run FM training by using int features, more 
specifically by using [feature hashing](../ft_engineering/hashing.html).
 
 ```sql
 set hivevar:factor=10;

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/recommend/movielens_mf.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/recommend/movielens_mf.md 
b/docs/gitbook/recommend/movielens_mf.md
index f275df8..ca38fec 100644
--- a/docs/gitbook/recommend/movielens_mf.md
+++ b/docs/gitbook/recommend/movielens_mf.md
@@ -17,9 +17,9 @@
   under the License.
 -->
         
-This page explains how to run matrix factorization on [MovieLens 1M 
dataset](https://github.com/myui/hivemall/wiki/MovieLens-Dataset).
+This page explains how to run matrix factorization on [MovieLens 1M 
dataset](../recommend/movielens_dataset.html).
 
-*Caution:* Matrix factorization is supported in Hivemall v0.3 or later.
+<!-- toc -->
 
 ## Calculate the mean rating in the training dataset
 ```sql
@@ -38,9 +38,8 @@ set hivevar:factor=10;
 -- maximum number of training iterations
 set hivevar:iters=50;
 ```
-See [this 
article](https://github.com/myui/hivemall/wiki/List-of-parameters-of-Matrix-Factorization)
 or 
[OnlineMatrixFactorizationUDTF#getOption()](https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/mf/OnlineMatrixFactorizationUDTF.java#L123)
 to get the details of options.
 
-Note that there are no need to set an exact value for $mu. It actually works 
without setting $mu but recommended to set one for getting a better prediction.
+Note that there is no need to set an exact value for `$mu`. Training works 
without setting `$mu`, but setting one is recommended for getting a better 
prediction.
 
 _Due to [a bug](https://issues.apache.org/jira/browse/HIVE-8396) in Hive, do 
not issue comments in CLI._
 
@@ -56,13 +55,17 @@ select
   avg(m_bias) as Bi
 from (
   select 
-    train_mf_sgd(userid, movieid, rating, "-factor ${factor} -mu ${mu} -iter 
${iters}") as (idx, u_rank, m_rank, u_bias, m_bias)
+    train_mf_sgd(userid, movieid, rating, '-factor ${factor} -mu ${mu} -iter 
${iters}') as (idx, u_rank, m_rank, u_bias, m_bias)
   from 
     training
 ) t
 group by idx;
 ```
-Note: Hivemall also provides *train_mf_adagrad* for training using AdaGrad.
+
+> #### Note
+>
+> Hivemall also provides *train_mf_adagrad* for training using AdaGrad. 
+> `-help` option shows a complete list of hyperparameters.
 
 # Predict
 
@@ -109,9 +112,10 @@ from (
   ON (t2.movieid = p2.idx)
 ) t;
 ```
-> 0.6728969407733578 (MAE) 
 
-> 0.8584162122694449 (RMSE)
+| MAE | RMSE |
+|:---:|:----:|
+| 0.6728969407733578 | 0.8584162122694449 |
 
 # Item Recommendation
 

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/recommend/news20_knn.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/recommend/news20_knn.md 
b/docs/gitbook/recommend/news20_knn.md
index 1e0ae97..fca9db5 100644
--- a/docs/gitbook/recommend/news20_knn.md
+++ b/docs/gitbook/recommend/news20_knn.md
@@ -119,4 +119,4 @@ limit ${topn};
 | 8482  | 0.15229382 |
 
 
-Refer [this 
page](https://github.com/myui/hivemall/wiki/Efficient-Top-k-computation-on-Apache-Hive-using-Hivemall-UDTF#top-k-similarity-computation)
 for efficient top-k kNN computation.
\ No newline at end of file
+Refer to [this page](../misc/topk.html#top-k-similarity-computation) for 
efficient top-k kNN computation.

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/regression/e2006_arow.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/regression/e2006_arow.md 
b/docs/gitbook/regression/e2006_arow.md
index a02dfa8..abdb725 100644
--- a/docs/gitbook/regression/e2006_arow.md
+++ b/docs/gitbook/regression/e2006_arow.md
@@ -275,4 +275,4 @@ select
 from 
    e2006tfidf_arowe_submit;
 ```
-> 0.37789148212861856     0.14280197226536404     0.2357339155291536      
0.5060283955470721
\ No newline at end of file
+> 0.37789148212861856     0.14280197226536404     0.2357339155291536      
0.5060283955470721

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/regression/kddcup12tr2_adagrad.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/regression/kddcup12tr2_adagrad.md 
b/docs/gitbook/regression/kddcup12tr2_adagrad.md
index f6c7675..1b82bd9 100644
--- a/docs/gitbook/regression/kddcup12tr2_adagrad.md
+++ b/docs/gitbook/regression/kddcup12tr2_adagrad.md
@@ -1,128 +1,128 @@
-<!--
-  Licensed to the Apache Software Foundation (ASF) under one
-  or more contributor license agreements.  See the NOTICE file
-  distributed with this work for additional information
-  regarding copyright ownership.  The ASF licenses this file
-  to you under the Apache License, Version 2.0 (the
-  "License"); you may not use this file except in compliance
-  with the License.  You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-  Unless required by applicable law or agreed to in writing,
-  software distributed under the License is distributed on an
-  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-  KIND, either express or implied.  See the License for the
-  specific language governing permissions and limitations
-  under the License.
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
 -->
-        
-_Note adagrad/adadelta is supported from hivemall v0.3b2 or later (or in the 
master branch)._
-
-# Preparation 
-```sql
-add jar ./tmp/hivemall-with-dependencies.jar;
-source ./tmp/define-all.hive;
-
-use kdd12track2;
-
--- SET mapreduce.framework.name=yarn;
--- SET hive.execution.engine=mr;
--- SET mapreduce.framework.name=yarn-tez;
--- SET hive.execution.engine=tez;
-SET mapred.reduce.tasks=32; -- [optional] set the explicit number of reducers 
to make group-by aggregation faster
-```
-
-# AdaGrad
-```sql
-drop table adagrad_model;
-create table adagrad_model 
-as
-select 
- feature,
- avg(weight) as weight
-from 
- (select 
-     adagrad(features,label) as (feature,weight)
-  from 
-     training_orcfile
- ) t 
-group by feature;
-
-drop table adagrad_predict;
-create table adagrad_predict
-  ROW FORMAT DELIMITED 
-    FIELDS TERMINATED BY "\t"
-    LINES TERMINATED BY "\n"
-  STORED AS TEXTFILE
-as
-select
-  t.rowid, 
-  sigmoid(sum(m.weight)) as prob
-from 
-  testing_exploded  t LEFT OUTER JOIN
-  adagrad_model m ON (t.feature = m.feature)
-group by 
-  t.rowid
-order by 
-  rowid ASC;
-```
-
-```sh
-hadoop fs -getmerge /user/hive/warehouse/kdd12track2.db/adagrad_predict 
adagrad_predict.tbl
-
-gawk -F "\t" '{print $2;}' adagrad_predict.tbl > adagrad_predict.submit
-
-pypy scoreKDD.py KDD_Track2_solution.csv adagrad_predict.submit
-```
->AUC(SGD) : 0.739351
-
->AUC(ADAGRAD) : 0.743279
-
-# AdaDelta
-```sql
-drop table adadelta_model;
-create table adadelta_model 
-as
-select 
- feature,
- cast(avg(weight) as float) as weight
-from 
- (select 
-     adadelta(features,label) as (feature,weight)
-  from 
-     training_orcfile
- ) t 
-group by feature;
-
-drop table adadelta_predict;
-create table adadelta_predict
-  ROW FORMAT DELIMITED 
-    FIELDS TERMINATED BY "\t"
-    LINES TERMINATED BY "\n"
-  STORED AS TEXTFILE
-as
-select
-  t.rowid, 
-  sigmoid(sum(m.weight)) as prob
-from 
-  testing_exploded  t LEFT OUTER JOIN
-  adadelta_model m ON (t.feature = m.feature)
-group by 
-  t.rowid
-order by 
-  rowid ASC;
-```
-
-```sh
-hadoop fs -getmerge /user/hive/warehouse/kdd12track2.db/adadelta_predict 
adadelta_predict.tbl
-
-gawk -F "\t" '{print $2;}' adadelta_predict.tbl > adadelta_predict.submit
-
-pypy scoreKDD.py KDD_Track2_solution.csv adadelta_predict.submit
-```
->AUC(SGD) : 0.739351
-
->AUC(ADAGRAD) : 0.743279
-
-> AUC(AdaDelta) : 0.746878
\ No newline at end of file
+        
+_Note adagrad/adadelta is supported from hivemall v0.3b2 or later (or in the 
master branch)._
+
+# Preparation 
+```sql
+add jar ./tmp/hivemall-with-dependencies.jar;
+source ./tmp/define-all.hive;
+
+use kdd12track2;
+
+-- SET mapreduce.framework.name=yarn;
+-- SET hive.execution.engine=mr;
+-- SET mapreduce.framework.name=yarn-tez;
+-- SET hive.execution.engine=tez;
+SET mapred.reduce.tasks=32; -- [optional] set the explicit number of reducers 
to make group-by aggregation faster
+```
+
+# AdaGrad
+```sql
+drop table adagrad_model;
+create table adagrad_model 
+as
+select 
+ feature,
+ avg(weight) as weight
+from 
+ (select 
+     adagrad(features,label) as (feature,weight)
+  from 
+     training_orcfile
+ ) t 
+group by feature;
+
+drop table adagrad_predict;
+create table adagrad_predict
+  ROW FORMAT DELIMITED 
+    FIELDS TERMINATED BY "\t"
+    LINES TERMINATED BY "\n"
+  STORED AS TEXTFILE
+as
+select
+  t.rowid, 
+  sigmoid(sum(m.weight)) as prob
+from 
+  testing_exploded  t LEFT OUTER JOIN
+  adagrad_model m ON (t.feature = m.feature)
+group by 
+  t.rowid
+order by 
+  rowid ASC;
+```
+
+```sh
+hadoop fs -getmerge /user/hive/warehouse/kdd12track2.db/adagrad_predict 
adagrad_predict.tbl
+
+gawk -F "\t" '{print $2;}' adagrad_predict.tbl > adagrad_predict.submit
+
+pypy scoreKDD.py KDD_Track2_solution.csv adagrad_predict.submit
+```
+>AUC(SGD) : 0.739351
+
+>AUC(ADAGRAD) : 0.743279
+
+# AdaDelta
+```sql
+drop table adadelta_model;
+create table adadelta_model 
+as
+select 
+ feature,
+ cast(avg(weight) as float) as weight
+from 
+ (select 
+     adadelta(features,label) as (feature,weight)
+  from 
+     training_orcfile
+ ) t 
+group by feature;
+
+drop table adadelta_predict;
+create table adadelta_predict
+  ROW FORMAT DELIMITED 
+    FIELDS TERMINATED BY "\t"
+    LINES TERMINATED BY "\n"
+  STORED AS TEXTFILE
+as
+select
+  t.rowid, 
+  sigmoid(sum(m.weight)) as prob
+from 
+  testing_exploded  t LEFT OUTER JOIN
+  adadelta_model m ON (t.feature = m.feature)
+group by 
+  t.rowid
+order by 
+  rowid ASC;
+```
+
+```sh
+hadoop fs -getmerge /user/hive/warehouse/kdd12track2.db/adadelta_predict 
adadelta_predict.tbl
+
+gawk -F "\t" '{print $2;}' adadelta_predict.tbl > adadelta_predict.submit
+
+pypy scoreKDD.py KDD_Track2_solution.csv adadelta_predict.submit
+```
+>AUC(SGD) : 0.739351
+
+>AUC(ADAGRAD) : 0.743279
+
+> AUC(AdaDelta) : 0.746878

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/regression/kddcup12tr2_dataset.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/regression/kddcup12tr2_dataset.md b/docs/gitbook/regression/kddcup12tr2_dataset.md
index 15bfbfd..c32958f 100644
--- a/docs/gitbook/regression/kddcup12tr2_dataset.md
+++ b/docs/gitbook/regression/kddcup12tr2_dataset.md
@@ -35,7 +35,7 @@ http://www.kddcup2012.org/c/kddcup2012-track2
 | training.txt | 9.9GB | 149,639,105 |
 | userid_profile.txt | 283MB | 23,669,283 |
 
-![tables](https://raw.github.com/myui/hivemall/master/resources/examples/kddtrack2/tables.png)
+![tables](../resources/images/kddtrack2tables.png)
 
 _Tokens are actually not used in this example. Try using them on your own._
 

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/regression/kddcup12tr2_lr_amplify.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/regression/kddcup12tr2_lr_amplify.md b/docs/gitbook/regression/kddcup12tr2_lr_amplify.md
index e402ce4..5ede953 100644
--- a/docs/gitbook/regression/kddcup12tr2_lr_amplify.md
+++ b/docs/gitbook/regression/kddcup12tr2_lr_amplify.md
@@ -21,7 +21,7 @@ This article explains *amplify* technique that is useful for improving predictio
 
 Iterations are mandatory in machine learning (e.g., in [stochastic gradient descent](http://en.wikipedia.org/wiki/Stochastic_gradient_descent)) to get good prediction models. However, MapReduce is known to be not suited for iterative algorithms because IN/OUT of each MapReduce job is through HDFS.
 
-In this example, we show how Hivemall deals with this problem. We use [KDD Cup 2012, Track 2 Task](https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-dataset) as an example.
+In this example, we show how Hivemall deals with this problem. We use [KDD Cup 2012, Track 2 Task](../regression/kddcup12tr2_dataset.html) as an example.
 
 **WARNING**: rand_amplify() is supported in v0.2-beta1 and later.
 
@@ -73,7 +73,7 @@ The above query is executed by 2 MapReduce jobs as shown below:
 
 <img src="../resources/images/amplify.png" alt="amplifier"/>
 
-Using *trainning_x3*  instead of the plain training table results in higher and better AUC (0.746214) in [this](https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-(regression\)) example.
+Using *training_x3* instead of the plain training table results in a higher AUC (0.746214) in [this example](../regression/kddcup12tr2_lr.html#evaluation).
 
 A problem in amplify() is that the shuffle (copy) and merge phase of the stage 1 could become a bottleneck.
 When the training table is so large that involves 100 Map tasks, the merge operator needs to merge at least 100 files by (external) merge sort!
@@ -108,7 +108,7 @@ The map-local multiplication and shuffling has no bottleneck in the merge phase
 
 <img src="../resources/images/randamplify_elapsed.png" alt="rand_amplify elapsed"/>
 
-Using *rand_amplify* results in a better AUC (0.743392) in [this](https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-(regression\)) example.
+Using *rand_amplify* results in a better AUC (0.743392) in [this example](../regression/kddcup12tr2_lr.html#evaluation).
 
 ---
 # Conclusion

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/resources/images/kddtrack2tables.png
----------------------------------------------------------------------
diff --git a/docs/gitbook/resources/images/kddtrack2tables.png b/docs/gitbook/resources/images/kddtrack2tables.png
new file mode 100644
index 0000000..90012db
Binary files /dev/null and b/docs/gitbook/resources/images/kddtrack2tables.png differ

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/tips/addbias.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/tips/addbias.md b/docs/gitbook/tips/addbias.md
index dfa4bfc..021ca64 100644
--- a/docs/gitbook/tips/addbias.md
+++ b/docs/gitbook/tips/addbias.md
@@ -28,7 +28,7 @@ Then, the predicted model considers bias existing in the dataset and the predict
 
 **addBias()** of Hivemall, adds a bias to a feature vector. 
 To enable a bias clause, use addBias() for **both**_(important!)_ training and test data as follows.
-The bias _b_ is a feature of "0" ("-1" in before v0.3) by the default. See [AddBiasUDF](https://github.com/myui/hivemall/blob/master/src/main/hivemall/ftvec/AddBiasUDF.java) for the detail.
+The bias _b_ is a feature of "0" ("-1" before v0.3) by default. See [AddBiasUDF](../tips/addbias.html) for details.
 
 Note that Bias is expressed as a feature that found in all training/testing examples.
 

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/tips/emr.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/tips/emr.md b/docs/gitbook/tips/emr.md
index 61cb25b..049e6da 100644
--- a/docs/gitbook/tips/emr.md
+++ b/docs/gitbook/tips/emr.md
@@ -16,6 +16,8 @@
   specific language governing permissions and limitations
   under the License.
 -->
+
+<!-- toc -->
         
 ## Prerequisite
 Learn how to use Hive with Elastic MapReduce (EMR).  

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/tips/hadoop_tuning.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/tips/hadoop_tuning.md b/docs/gitbook/tips/hadoop_tuning.md
index 7125508..507e19d 100644
--- a/docs/gitbook/tips/hadoop_tuning.md
+++ b/docs/gitbook/tips/hadoop_tuning.md
@@ -16,6 +16,8 @@
   specific language governing permissions and limitations
   under the License.
 -->
+
+<!-- toc -->
         
 # Prerequisites 
 

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/tips/mixserver.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/tips/mixserver.md b/docs/gitbook/tips/mixserver.md
index bd58279..f9878e6 100644
--- a/docs/gitbook/tips/mixserver.md
+++ b/docs/gitbook/tips/mixserver.md
@@ -1,87 +1,86 @@
-<!--
-  Licensed to the Apache Software Foundation (ASF) under one
-  or more contributor license agreements.  See the NOTICE file
-  distributed with this work for additional information
-  regarding copyright ownership.  The ASF licenses this file
-  to you under the Apache License, Version 2.0 (the
-  "License"); you may not use this file except in compliance
-  with the License.  You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-  Unless required by applicable law or agreed to in writing,
-  software distributed under the License is distributed on an
-  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-  KIND, either express or implied.  See the License for the
-  specific language governing permissions and limitations
-  under the License.
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
 -->
-        
-In this page, we will explain how to use model mixing on Hivemall. The model mixing is useful for a better prediction performance and faster convergence in training classifiers. 
-
-<!--
-You can find a brief explanation of the internal design of MIX protocol in [this slide](http://www.slideshare.net/myui/hivemall-mix-internal).
--->
-
-Prerequisite
-============
-
-* Hivemall v0.3 or later
-
-We recommend to use Mixing in a cluster with fast networking. The current standard GbE is enough though.
-
-Running Mix Server
-===================
-
-First, put the following files on server(s) that are accessible from Hadoop worker nodes:
-* [target/hivemall-mixserv.jar](https://github.com/myui/hivemall/releases)
-* [bin/run_mixserv.sh](https://github.com/myui/hivemall/raw/master/bin/run_mixserv.sh)
-
-_Caution: hivemall-mixserv.jar is large in size and thus only used for Mix servers._
-
-```sh
-# run a Mix Server
-./run_mixserv.sh
-```
-
-We assume in this example that Mix servers are running on host01, host03 and host03.
-The default port used by Mix server is 11212 and the port is configurable through "-port" option of run_mixserv.sh. 
-
-See [MixServer.java](https://github.com/myui/hivemall/blob/master/mixserv/src/main/java/hivemall/mix/server/MixServer.java#L90) to get detail of the Mix server options.
-
-We recommended to use multiple MIX servers to get better MIX throughput (3-5 or so would be enough for normal cluster size). The MIX protocol of Hivemall is *horizontally scalable* by adding MIX server nodes.
-
-Using Mix Protocol through Hivemall
-===================================
-
-[Install Hivemall](https://github.com/myui/hivemall/wiki/Installation) on Hive.
-
-_Make sure that [hivemall-with-dependencies.jar](https://github.com/myui/hivemall/raw/master/target/hivemall-with-dependencies.jar) is used for installation. The jar contains minimum requirement jars (netty,jsr305) for running Hivemall on Hive._
-
-Now, we explain that how to use mixing in [an example using KDD2010a dataset](https://github.com/myui/hivemall/wiki/KDD2010a-classification).
-
-Enabling the mixing on Hivemall is simple as follows:
-```sql
-use kdd2010;
-
-create table kdd10a_pa1_model1 as
-select 
- feature,
- cast(voted_avg(weight) as float) as weight
-from 
- (select 
-     train_pa1(addBias(features),label,"-mix host01,host02,host03") as (feature,weight)
-  from 
-     kdd10a_train_x3
- ) t 
-group by feature;
-```
-
-All you have to do is just adding "*-mix*" training option as seen in the above query.
-
-The effect of model mixing
-===========================
-
-In my experience, the MIX improved the prediction accuracy of the above KDD2010a PA1 training on a 32 nodes cluster from 0.844835019263103 (w/o mix) to 0.8678096499719774 (w/ mix).
-
+        
+This page explains how to use model mixing in Hivemall. Model mixing is useful for better prediction performance and faster convergence when training classifiers.
+You can find a brief explanation of the internal design of the MIX protocol in [this slide](http://www.slideshare.net/myui/hivemall-mix-internal).
+
+<!-- toc -->
+
+Prerequisite
+============
+
+* Hivemall v0.3 or later
+
+    We recommend using mixing in a cluster with fast networking. Standard GbE is sufficient, though.
+
+Running Mix Server
+===================
+
+First, put the following files on server(s) that are accessible from Hadoop worker nodes:
+* [target/hivemall-mixserv.jar](https://github.com/myui/hivemall/releases)
+* [bin/run_mixserv.sh](https://github.com/myui/hivemall/raw/master/bin/run_mixserv.sh)
+
+_Caution: hivemall-mixserv.jar is large and is therefore used only for Mix servers._
+
+```sh
+# run a Mix Server
+./run_mixserv.sh
+```
+
+We assume in this example that Mix servers are running on host01, host02 and host03.
+The default port used by the Mix server is 11212; it is configurable through the "-port" option of run_mixserv.sh.
+
+See [MixServer.java](https://github.com/myui/hivemall/blob/master/mixserv/src/main/java/hivemall/mix/server/MixServer.java#L90) for details of the Mix server options.
+
+We recommend using multiple MIX servers to get better MIX throughput (3-5 or so would be enough for a normal cluster size). The MIX protocol of Hivemall is *horizontally scalable* by adding MIX server nodes.
+
+Using Mix Protocol through Hivemall
+===================================
+
+[Install Hivemall](../getting_started/installation.html) on Hive.
+
+_Make sure that [hivemall-with-dependencies.jar](https://github.com/myui/hivemall/raw/master/target/hivemall-with-dependencies.jar) is used for installation. The jar contains the minimum required jars (netty, jsr305) for running Hivemall on Hive._
+
+Now, we explain how to use mixing in [an example using the KDD2010a dataset](../binaryclass/kdd2010a_dataset.html).
+
+Enabling mixing in Hivemall is as simple as follows:
+```sql
+use kdd2010;
+
+create table kdd10a_pa1_model1 as
+select 
+ feature,
+ cast(voted_avg(weight) as float) as weight
+from 
+ (select 
+     train_pa1(addBias(features),label,"-mix host01,host02,host03") as (feature,weight)
+  from 
+     kdd10a_train_x3
+ ) t 
+group by feature;
+```
+
+All you have to do is add the "*-mix*" training option, as seen in the above query.
+
+The effect of model mixing
+===========================
+
+In my experience, MIX improved the prediction accuracy of the above KDD2010a PA1 training on a 32-node cluster from 0.844835019263103 (w/o mix) to 0.8678096499719774 (w/ mix).
+
 The overhead of using the MIX protocol is *almost negligible* because the MIX communication is efficiently handled using asynchronous non-blocking I/O. Furthermore, the training time could be improved on certain settings because of the faster convergence due to mixing.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/tips/rand_amplify.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/tips/rand_amplify.md b/docs/gitbook/tips/rand_amplify.md
index cd546ec..6d68dea 100644
--- a/docs/gitbook/tips/rand_amplify.md
+++ b/docs/gitbook/tips/rand_amplify.md
@@ -21,16 +21,16 @@ This article explains *amplify* technique that is useful for improving predictio
 
 Iterations are mandatory in machine learning (e.g., in [stochastic gradient descent](http://en.wikipedia.org/wiki/Stochastic_gradient_descent)) to get good prediction models. However, MapReduce is known to be not suited for iterative algorithms because IN/OUT of each MapReduce job is through HDFS.
 
-In this example, we show how Hivemall deals with this problem. We use [KDD Cup 2012, Track 2 Task](https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-dataset) as an example.
+In this example, we show how Hivemall deals with this problem. We use [KDD Cup 2012, Track 2 Task](../regression/kddcup12tr2_dataset.html) as an example.
 
-**WARNING**: rand_amplify() is supported in v0.2-beta1 and later.
+<!-- toc -->
 
 ---
 # Amplify training examples in Map phase and shuffle them in Reduce phase
 Hivemall provides the **amplify** UDTF to enumerate iteration effects in machine learning without several MapReduce steps.
 
 The amplify function returns multiple rows for each row.
-The first argument ${xtimes} is the multiplication factor.  
+The first argument `${xtimes}` is the multiplication factor.  
 In the following examples, the multiplication factor is set to 3.
 
 ```sql
@@ -72,9 +72,9 @@ group by feature;
 The above query is executed by 2 MapReduce jobs as shown below:
 <img src="../resources/images/amplify.png" alt="amplifier"/>
 
-Using *trainning_x3*  instead of the plain training table results in higher and better AUC (0.746214) in [this](https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-(regression\)) example.
+Using *training_x3* instead of the plain training table results in a higher AUC (0.746214) in [this example](../regression/kddcup12tr2_lr_amplify.html#conclusion).
 
-A problem in amplify() is that the shuffle (copy) and merge phase of the stage 1 could become a bottleneck.
+A problem with `amplify()` is that the shuffle (copy) and merge phase of stage 1 could become a bottleneck.
 When the training table is so large that involves 100 Map tasks, the merge operator needs to merge at least 100 files by (external) merge sort!
 
 Note that the actual bottleneck is not M/R iterations but shuffling training instance. Iteration without shuffling (as in [the Spark example](http://spark.incubator.apache.org/examples.html)) causes very slow convergence and results in requiring more iterations. Shuffling cannot be avoided even in iterative MapReduce variants.
@@ -107,7 +107,7 @@ The map-local multiplication and shuffling has no bottleneck in the merge phase
 
 <img src="../resources/images/randamplify_elapsed.png" alt="randamplify_elapsed"/>
 
-Using *rand_amplify* results in a better AUC (0.743392) in [this](https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-(regression\)) example.
+Using *rand_amplify* results in a better AUC (0.743392) in [this example](../regression/kddcup12tr2_lr_amplify.html#conclusion).
 
 ---
 # Conclusion

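The `amplify` UDTF described above returns `${xtimes}` copies of every input row. A rough shell imitation of that row multiplication only (the shuffle that follows in the Reduce phase is left out); `amplify_rows` and the toy file are hypothetical names used for illustration, not Hivemall APIs:

```sh
# amplify_rows XTIMES FILE: print every line of FILE XTIMES times.
amplify_rows() {
  awk -v xtimes="$1" '{ for (i = 0; i < xtimes; i++) print }' "$2"
}

# Hypothetical two-row training set.
tmpdir_amp=$(mktemp -d)
printf 'r1\nr2\n' > "$tmpdir_amp/training.txt"

# Multiplication factor 3, as in the examples above.
amplify_rows 3 "$tmpdir_amp/training.txt" > "$tmpdir_amp/training_x3.txt"
wc -l < "$tmpdir_amp/training_x3.txt"
```

The copies come out adjacent to each other here, which is why the real pipeline still needs a shuffle before training.
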
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/tips/rowid.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/tips/rowid.md b/docs/gitbook/tips/rowid.md
index 2b24401..ed6431e 100644
--- a/docs/gitbook/tips/rowid.md
+++ b/docs/gitbook/tips/rowid.md
@@ -16,7 +16,21 @@
   specific language governing permissions and limitations
   under the License.
 -->
-        
+
+<!-- toc -->
+
+# Rowid generator provided in Hivemall
+You can use the [rowid() function](https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/tools/mapred/RowIdUDF.java) to generate a unique rowid in Hivemall v0.2 or later.
+```sql
+select
+  rowid() as rowid, -- returns ${task_id}-${sequence_number}
+  *
+from 
+  xxx
+```
+
+# Other Rowid generation schemes using SQL
+
 ```sql
 CREATE TABLE xxx
 AS
@@ -37,14 +51,3 @@ select
   * 
 from a9atest;
 ```
-
-***
-# Rowid generator provided in Hivemall v0.2 or later
-You can use [rowid() function](https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/tools/mapred/RowIdUDF.java) to generate an unique rowid in Hivemall v0.2 or later.
-```sql
-select
-  rowid() as rowid, -- returns ${task_id}-${sequence_number}
-  *
-from 
-  xxx
-```
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/a71bbb75/docs/gitbook/tips/rt_prediction.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/tips/rt_prediction.md b/docs/gitbook/tips/rt_prediction.md
index c342378..25d9ff7 100644
--- a/docs/gitbook/tips/rt_prediction.md
+++ b/docs/gitbook/tips/rt_prediction.md
@@ -16,23 +16,25 @@
   specific language governing permissions and limitations
   under the License.
 -->
-        
-Hivemall provides a batch learning scheme that builds prediction models on Apache Hive.
+
+Apache Hivemall provides a batch learning scheme that builds prediction models on Apache Hive.
 The learning process itself is a batch process; however, an online/real-time prediction can be achieved by carrying a prediction on a transactional relational DBMS.
 
 In this article, we explain how to run a real-time prediction using a relational DBMS.
-We assume that you have already run the [a9a binary classification task](https://github.com/myui/hivemall/wiki#a9a-binary-classification).
+We assume that you have already run the [a9a binary classification task](../binaryclass/a9a.html).
+
+<!-- toc -->
 
 # Prerequisites
 
 - MySQL
 
-Put mysql-connector-java.jar (JDBC driver) on $SQOOP_HOME/lib.
+    Put mysql-connector-java.jar (JDBC driver) on $SQOOP_HOME/lib.
 
 - [Sqoop](http://sqoop.apache.org/)
 
-Sqoop 1.4.5 does not support Hadoop v2.6.0. So, you need to build packages for Hadoop 2.6.
-To do that you need to edit build.xml and ivy.xml as shown in [this patch](https://gist.github.com/myui/e8db4a31b574103133c6).
+    Sqoop 1.4.5 does not support Hadoop v2.6.0. So, you need to build packages for Hadoop 2.6.
+    To do that you need to edit build.xml and ivy.xml as shown in [this patch](https://gist.github.com/myui/e8db4a31b574103133c6).
 
 # Preparing Model Tables on MySQL
 
@@ -228,7 +230,7 @@ where
 1 row in set (0.00 sec)
 ```
 
-Similar to [the way in Hive](https://github.com/myui/hivemall/wiki/a9a-binary-classification-(logistic-regression)#prediction), you can run prediction as follows:
+Similar to [the way in Hive](../binaryclass/a9a_lr.html#prediction), you can run prediction as follows:
 
 ```sql
 select
