Repository: incubator-hivemall
Updated Branches:
  refs/heads/master [created] 72d6a629f


http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/tips/hadoop_tuning.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/tips/hadoop_tuning.md 
b/docs/gitbook/tips/hadoop_tuning.md
new file mode 100644
index 0000000..a6c1854
--- /dev/null
+++ b/docs/gitbook/tips/hadoop_tuning.md
@@ -0,0 +1,79 @@
+# Prerequisites 
+
+Please refer to the following guides for Hadoop tuning:
+
+* http://hadoopbook.com/
+* http://www.slideshare.net/cloudera/mr-perf
+
+---
+# Mapper-side configuration
+_Mapper configuration is important for Hivemall when training runs on mappers (e.g., when using rand_amplify())._
+
+```
+mapreduce.map.java.opts="-Xmx2048m -XX:+PrintGCDetails" (YARN)
+mapred.map.child.java.opts="-Xmx2048m -XX:+PrintGCDetails" (MR v1)
+
+mapreduce.task.io.sort.mb=1024 (YARN)
+io.sort.mb=1024 (MR v1)
+```
+
+Hivemall can use at most 1024MB in the above case:
+> mapreduce.map.java.opts - mapreduce.task.io.sort.mb = 2048MB - 1024MB = 1024MB
+
+Moreover, other Hadoop components also consume memory, so only about 1024MB * 0.5 or so is actually available to Hivemall. We recommend setting at least -Xmx2048m for a mapper.
+ 
+So, make `mapreduce.map.java.opts - mapreduce.task.io.sort.mb` as large as possible.
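As a back-of-the-envelope check, the available memory can be computed as follows (a sketch in Python; the 0.5 factor is just the rough heuristic mentioned above):

```python
# Rough estimate of the heap (MB) available to Hivemall in a mapper.
def mapper_memory_for_hivemall(xmx_mb, io_sort_mb, heuristic_factor=0.5):
    # Heap left after the map-side sort buffer is reserved,
    # discounted because other Hadoop components consume memory too.
    return (xmx_mb - io_sort_mb) * heuristic_factor

print(mapper_memory_for_hivemall(2048, 1024))  # -> 512.0
```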
+
+# Reducer-side configuration
+_Reducer configuration is important for Hivemall when training runs on reducers (e.g., when using amplify())._
+
+```
+mapreduce.reduce.java.opts="-Xmx2048m -XX:+PrintGCDetails" (YARN)
+mapred.reduce.child.java.opts="-Xmx2048m -XX:+PrintGCDetails" (MR v1)
+
+mapreduce.reduce.shuffle.input.buffer.percent=0.6 (YARN)
+mapred.reduce.shuffle.input.buffer.percent=0.6 (MR v1)
+
+-- mapreduce.reduce.input.buffer.percent=0.2 (YARN)
+-- mapred.job.reduce.input.buffer.percent=0.2 (MR v1)
+```
+
+Hivemall can use at most 820MB in the above case:
+> mapreduce.reduce.java.opts * (1 - mapreduce.reduce.shuffle.input.buffer.percent) = 2048MB * (1 - 0.6) ≈ 820MB
+
+Moreover, other Hadoop components also consume memory, so only about 820MB * 0.5 or so is actually available to Hivemall. We recommend setting at least -Xmx2048m for a reducer.
+
+So, make `mapreduce.reduce.java.opts * (1 - mapreduce.reduce.shuffle.input.buffer.percent)` as large as possible.
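Similarly for the reducer side (a sketch; parameter values taken from the configuration above):

```python
# Rough estimate of the heap (MB) available to Hivemall in a reducer.
def reducer_memory_for_hivemall(xmx_mb, shuffle_input_buffer_percent):
    # Heap left after the shuffle input buffer fraction is reserved.
    return xmx_mb * (1.0 - shuffle_input_buffer_percent)

print(reducer_memory_for_hivemall(2048, 0.6))  # about 820MB
```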
+
+---
+# Formula to estimate consumed memory in Hivemall
+
+For a dense model, the consumed memory in Hivemall is as follows:
+```
+feature_dimensions (2^24 by default) * 4 bytes (float) * 2 (if covariance is calculated) * 1.2 (heuristic factor)
+```
+> 2^24 * 4 bytes * 2 * 1.2 ≈ 161MB
+
+When [SpaceEfficientDenseModel](https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/io/SpaceEfficientDenseModel.java) is used, the formula changes as follows:
+```
+feature_dimensions (assume here 2^25) * 2 bytes (short) * 2 (if covariance is calculated) * 1.2 (heuristic factor)
+```
+> 2^25 * 2 bytes * 2 * 1.2 ≈ 161MB
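Both estimates can be verified with a quick calculation (a sketch in Python following the formulas above):

```python
# Estimate model memory (bytes): weight array (+ covariance array if computed),
# multiplied by the 1.2 heuristic factor.
def model_memory_bytes(feature_dimensions, bytes_per_weight, covariance=True):
    factor = 2 if covariance else 1
    return feature_dimensions * bytes_per_weight * factor * 1.2

print(model_memory_bytes(2**24, 4) / 1e6)  # dense float model: ~161 MB
print(model_memory_bytes(2**25, 2) / 1e6)  # SpaceEfficientDenseModel: ~161 MB
```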
+
+Note: Hivemall uses a [sparse representation](https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/io/SparseModel.java) of the prediction model (using a hash table) by default. Use the "[-densemodel](https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/LearnerBaseUDTF.java#L87)" option to use a dense model.
+
+# Execution Engine of Hive
+
+We recommend using Apache Tez as the execution engine of Hive for Hivemall queries.
+
+```sql
+set mapreduce.framework.name=yarn-tez;
+set hive.execution.engine=tez;
+```
+
+You can use plain old MapReduce with the following settings:
+
+```sql
+set mapreduce.framework.name=yarn;
+set hive.execution.engine=mr;
+```
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/tips/mixserver.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/tips/mixserver.md b/docs/gitbook/tips/mixserver.md
new file mode 100644
index 0000000..631557c
--- /dev/null
+++ b/docs/gitbook/tips/mixserver.md
@@ -0,0 +1,68 @@
+This page explains how to use model mixing in Hivemall. Model mixing is useful for better prediction performance and faster convergence when training classifiers.
+
+<!--
+You can find a brief explanation of the internal design of MIX protocol in 
[this slide](http://www.slideshare.net/myui/hivemall-mix-internal).
+-->
+
+Prerequisite
+============
+
+* Hivemall v0.3 or later
+
+We recommend using Mixing in a cluster with fast networking, though the current standard GbE is sufficient.
+
+Running Mix Server
+===================
+
+First, put the following files on server(s) that are accessible from Hadoop worker nodes:
+* [target/hivemall-mixserv.jar](https://github.com/myui/hivemall/releases)
+* [bin/run_mixserv.sh](https://github.com/myui/hivemall/raw/master/bin/run_mixserv.sh)
+
+_Caution: hivemall-mixserv.jar is large in size and is thus used only for Mix servers._
+
+```sh
+# run a Mix Server
+./run_mixserv.sh
+```
+
+We assume in this example that Mix servers are running on host01, host02 and host03.
+The default port used by the Mix server is 11212; the port is configurable through the "-port" option of run_mixserv.sh.
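For example (a hypothetical invocation; only the "-port" option itself is documented above):

```sh
# Launch a Mix server listening on a non-default port
./run_mixserv.sh -port 11213
```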
+
+See [MixServer.java](https://github.com/myui/hivemall/blob/master/mixserv/src/main/java/hivemall/mix/server/MixServer.java#L90) for details of the Mix server options.
+
+We recommend using multiple MIX servers for better MIX throughput (3-5 or so is enough for a normal cluster size). The MIX protocol of Hivemall is *horizontally scalable* by adding MIX server nodes.
+
+Using Mix Protocol through Hivemall
+===================================
+
+[Install Hivemall](https://github.com/myui/hivemall/wiki/Installation) on Hive.
+
+_Make sure that [hivemall-with-dependencies.jar](https://github.com/myui/hivemall/raw/master/target/hivemall-with-dependencies.jar) is used for installation. The jar contains the minimum required jars (netty, jsr305) for running Hivemall on Hive._
+
+Now, we explain how to use mixing in [an example using the KDD2010a dataset](https://github.com/myui/hivemall/wiki/KDD2010a-classification).
+
+Enabling mixing in Hivemall is as simple as follows:
+```sql
+use kdd2010;
+
+create table kdd10a_pa1_model1 as
+select 
+ feature,
+ cast(voted_avg(weight) as float) as weight
+from 
+ (select 
+     train_pa1(addBias(features),label,"-mix host01,host02,host03") as (feature,weight)
+  from 
+     kdd10a_train_x3
+ ) t 
+group by feature;
+```
+
+All you have to do is add the "*-mix*" training option, as seen in the above query.
+
+The effect of model mixing
+===========================
+
+In my experience, the MIX improved the prediction accuracy of the above KDD2010a PA1 training on a 32-node cluster from 0.844835019263103 (w/o mix) to 0.8678096499719774 (w/ mix).
+
+The overhead of using the MIX protocol is *almost negligible* because the MIX communication is efficiently handled using asynchronous non-blocking I/O. Furthermore, the training time can be improved in certain settings because of the faster convergence due to mixing.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/tips/rand_amplify.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/tips/rand_amplify.md 
b/docs/gitbook/tips/rand_amplify.md
new file mode 100644
index 0000000..4df124e
--- /dev/null
+++ b/docs/gitbook/tips/rand_amplify.md
@@ -0,0 +1,103 @@
+This article explains the *amplify* technique, which is useful for improving the prediction score.
+
+Iterations are mandatory in machine learning (e.g., in [stochastic gradient descent](http://en.wikipedia.org/wiki/Stochastic_gradient_descent)) to get good prediction models. However, MapReduce is known to be ill-suited for iterative algorithms because the input/output of each MapReduce job goes through HDFS.
+
+In this example, we show how Hivemall deals with this problem. We use the [KDD Cup 2012, Track 2 Task](https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-dataset) as an example.
+
+**WARNING**: rand_amplify() is supported in v0.2-beta1 and later.
+
+---
+# Amplify training examples in Map phase and shuffle them in Reduce phase
+Hivemall provides the **amplify** UDTF to emulate the effect of iterations in machine learning without multiple MapReduce steps.
+
+The amplify function returns multiple rows for each row.
+The first argument ${xtimes} is the multiplication factor.  
+In the following examples, the multiplication factor is set to 3.
+
+```sql
+set hivevar:xtimes=3;
+
+create or replace view training_x3
+as
+select 
+  * 
+from (
+select
+   amplify(${xtimes}, *) as (rowid, label, features)
+from  
+   training_orcfile
+) t
+CLUSTER BY rand();
+```
+
+In the above example, the [CLUSTER BY](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy#LanguageManualSortBy-SyntaxofClusterByandDistributeBy) clause distributes Map outputs to reducers using a random key as the distribution key. The input records of each reducer are then randomly shuffled.
+
+The multiplication of records and the random shuffling have an effect similar to iterations.
+So, we recommend using an amplified view for training as follows:
+
+```sql
+create table lr_model_x3 
+as
+select 
+ feature,
+ cast(avg(weight) as float) as weight
+from 
+ (select 
+     logress(features,label) as (feature,weight)
+  from 
+     training_x3
+ ) t 
+group by feature;
+```
+
+The above query is executed by 2 MapReduce jobs as shown below:
+![amplifier](https://dl.dropboxusercontent.com/u/13123103/hivemall/amplify.png)
+[Here](https://dl.dropboxusercontent.com/u/13123103/hivemall/amplify_plan.txt) is the actual plan generated by Hive.
+
+Using *training_x3* instead of the plain training table results in a better AUC (0.746214) in [this](https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-(regression\)) example.
+
+A problem with amplify() is that the shuffle (copy) and merge phase of stage 1 can become a bottleneck.
+When the training table is so large that it involves 100 Map tasks, the merge operator needs to merge at least 100 files by (external) merge sort!
+
+Note that the actual bottleneck is not the M/R iterations but the shuffling of training instances. Iteration without shuffling (as in [the Spark example](http://spark.incubator.apache.org/examples.html)) causes very slow convergence and thus requires more iterations. Shuffling cannot be avoided even in iterative MapReduce variants.
+
+![amplify elapsed](https://dl.dropboxusercontent.com/u/13123103/hivemall/amplify_elapsed.png)
+
+---
+# Amplify and shuffle training examples in each Map task
+
+To deal with large training data, Hivemall provides the **rand_amplify** UDTF, which randomly shuffles input rows within a Map task.
+The rand_amplify UDTF outputs rows in a random order whenever the local buffer, whose size is specified by ${shufflebuffersize}, is filled.
+
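The behavior of rand_amplify can be sketched as follows (an illustrative Python sketch of the buffering scheme, not Hivemall's actual implementation):

```python
import random

def rand_amplify(rows, xtimes, shuffle_buffer_size):
    """Emit each input row `xtimes` times, shuffling rows within a
    bounded local buffer instead of a global reduce-side shuffle."""
    buf = []
    for row in rows:
        buf.extend([row] * xtimes)
        if len(buf) >= shuffle_buffer_size:
            random.shuffle(buf)
            yield from buf
            buf = []
    random.shuffle(buf)  # flush and shuffle the remaining rows
    yield from buf
```

Because shuffling happens only within each bounded buffer, no reduce-side merge is needed, which is why the query can run in a single MapReduce job.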
+With rand_amplify(), the view definition of training_x3 becomes as follows:
+```sql
+set hivevar:shufflebuffersize=1000;
+
+create or replace view training_x3
+as
+select
+   rand_amplify(${xtimes}, ${shufflebuffersize}, *) as (rowid, label, features)
+from  
+   training_orcfile;
+```
+
+The training query is executed as follows:
+![Random amplify](https://dl.dropboxusercontent.com/u/13123103/hivemall/randamplify.png)
+[Here](https://dl.dropboxusercontent.com/u/13123103/hivemall/randamplify_plan.txt) is the actual query plan.
+
+The map-local multiplication and shuffling have no bottleneck in the merge phase, and the query is efficiently executed within a single MapReduce job.
+
+![rand_amplify elapsed](https://dl.dropboxusercontent.com/u/13123103/hivemall/randamplify_elapsed.png)
+
+Using *rand_amplify* results in a better AUC (0.743392) in [this](https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-(regression\)) example.
+
+---
+# Conclusion
+
+We recommend using *amplify()* for small training inputs and *rand_amplify()* for large training inputs to get better accuracy in a reasonable training time.
+
+| Method     | ELAPSED TIME (sec) | AUC |
+|:-----------|--------------------|----:|
+| Plain | 89.718 | 0.734805 |
+| amplifier+clustered by | 479.855  | 0.746214 |
+| rand_amplifier | 116.424 | 0.743392 |
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/tips/rowid.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/tips/rowid.md b/docs/gitbook/tips/rowid.md
new file mode 100644
index 0000000..c43aa74
--- /dev/null
+++ b/docs/gitbook/tips/rowid.md
@@ -0,0 +1,31 @@
+A unique rowid can be generated using a UUID as follows:
+```sql
+CREATE TABLE xxx
+AS
+SELECT 
+  regexp_replace(reflect('java.util.UUID','randomUUID'), '-', '') as rowid,
+  *
+FROM
+  ..;
+```
+
+Another option for generating a rowid is to use row_number().
+However, the query execution becomes too slow for large datasets because the rowid generation is executed on a single reducer.
+```sql
+CREATE TABLE xxx
+AS
+select 
+  row_number() over () as rowid, 
+  * 
+from a9atest;
+```
+
+***
+# Rowid generator provided in Hivemall v0.2 or later
+You can use the [rowid() function](https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/tools/mapred/RowIdUDF.java) to generate a unique rowid in Hivemall v0.2 or later.
+```sql
+select
+  rowid() as rowid, -- returns ${task_id}-${sequence_number}
+  *
+from 
+  xxx;
+```
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/tips/rt_prediction.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/tips/rt_prediction.md 
b/docs/gitbook/tips/rt_prediction.md
new file mode 100644
index 0000000..3ac4fb6
--- /dev/null
+++ b/docs/gitbook/tips/rt_prediction.md
@@ -0,0 +1,234 @@
+Hivemall provides a batch learning scheme that builds prediction models on Apache Hive.
+The learning process itself is a batch process; however, online/real-time prediction can be achieved by carrying out the prediction on a transactional relational DBMS.
+
+In this article, we explain how to run real-time prediction using a relational DBMS.
+We assume that you have already run the [a9a binary classification task](https://github.com/myui/hivemall/wiki#a9a-binary-classification).
+
+# Prerequisites
+
+- MySQL
+- [Sqoop](http://sqoop.apache.org/)
+
+Put mysql-connector-java.jar (the JDBC driver) in $SQOOP_HOME/lib.
+
+Sqoop 1.4.5 does not support Hadoop v2.6.0, so you need to build packages for Hadoop 2.6.
+To do that, edit build.xml and ivy.xml as shown in [this patch](https://gist.github.com/myui/e8db4a31b574103133c6).
+
+# Preparing Model Tables on MySQL
+
+```sql
+create database a9a;
+use a9a;
+
+create user sqoop identified by 'sqoop';
+grant all privileges on a9a.* to 'sqoop'@'%' identified by 'sqoop';
+flush privileges;
+
+create table a9a_model1 (
+  feature int, 
+  weight double
+);
+```
+
+Do not forget to edit bind_address in the MySQL configuration file (/etc/mysql/my.conf) so that MySQL is accessible from the master and slave nodes of Hadoop.
+
+# Exporting Hive table to MySQL
+
+Check the connectivity to the MySQL server using Sqoop.
+
+```sh
+export MYSQL_HOST=dm01
+
+export HADOOP_HOME=/opt/hadoop
+export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop/
+export HADOOP_COMMON_HOME=${HADOOP_HOME}
+
+bin/sqoop list-tables --connect jdbc:mysql://${MYSQL_HOST}/a9a --username sqoop --password sqoop
+```
+
+Create a TSV table because Sqoop cannot directly read Hive tables.
+
+```sql
+create table a9a_model1_tsv 
+  ROW FORMAT DELIMITED 
+    FIELDS TERMINATED BY "\t"
+    LINES TERMINATED BY "\n"
+  STORED AS TEXTFILE
+AS
+select * from a9a_model1;
+```
+
+Check the location of 'a9a_model1_tsv' as follows:
+
+```sql
+desc extended a9a_model1_tsv;
+> location:hdfs://dm01:9000/user/hive/warehouse/a9a.db/a9a_model1_tsv
+```
+
+```sh
+bin/sqoop export \
+--connect jdbc:mysql://${MYSQL_HOST}/a9a \
+--username sqoop --password sqoop \
+--table a9a_model1 \
+--export-dir /user/hive/warehouse/a9a.db/a9a_model1_tsv \
+--input-fields-terminated-by '\t' --input-lines-terminated-by '\n' \
+--batch
+```
+
+When the export finishes successfully, you can find entries in the model table in MySQL.
+
+```sql
+mysql> select * from a9a_model1 limit 3;
++---------+---------------------+
+| feature | weight              |
++---------+---------------------+
+|       0 | -0.5761121511459351 |
+|       1 | -1.5259535312652588 |
+|      10 | 0.21053194999694824 |
++---------+---------------------+
+3 rows in set (0.00 sec)
+```
+
+We recommend creating an index on model tables to boost lookups in online prediction.
+
+```sql
+CREATE UNIQUE INDEX a9a_model1_feature_index on a9a_model1 (feature);
+-- USING BTREE;
+```
+
+# Exporting test data from Hadoop to MySQL (optional step)
+
+Prepare a test data table in Hive to be exported.
+
+```sql
+create table a9atest_exploded_tsv
+  ROW FORMAT DELIMITED 
+    FIELDS TERMINATED BY "\t"
+    LINES TERMINATED BY "\n"
+  STORED AS TEXTFILE
+AS
+select
+  rowid, 
+  -- label, 
+  extract_feature(feature) as feature,
+  extract_weight(feature) as value
+from
+  a9atest LATERAL VIEW explode(addBias(features)) t AS feature;
+
+desc extended a9atest_exploded_tsv;
+> location:hdfs://dm01:9000/user/hive/warehouse/a9a.db/a9atest_exploded_tsv,
+```
+
+Prepare a test table, importing data from Hadoop.
+
+```sql
+use a9a;
+
+create table a9atest_exploded (
+  rowid bigint,
+  feature int, 
+  value double
+);
+```
+
+Then, run Sqoop to export data from HDFS to MySQL.
+
+```sh
+export MYSQL_HOST=dm01
+
+bin/sqoop export \
+--connect jdbc:mysql://${MYSQL_HOST}/a9a \
+--username sqoop --password sqoop \
+--table a9atest_exploded \
+--export-dir /user/hive/warehouse/a9a.db/a9atest_exploded_tsv \
+--input-fields-terminated-by '\t' --input-lines-terminated-by '\n' \
+--batch
+```
+
+It is better to add an index on the rowid column to boost selection by rowid.
+```sql
+CREATE INDEX a9atest_exploded_rowid_index on a9atest_exploded (rowid) USING BTREE;
+```
+
+When the export finishes successfully, you can find entries in the test table in MySQL.
+
+```sql
+mysql> select * from a9atest_exploded limit 10;
++-------+---------+-------+
+| rowid | feature | value |
++-------+---------+-------+
+| 12427 |      67 |     1 |
+| 12427 |      73 |     1 |
+| 12427 |      74 |     1 |
+| 12427 |      76 |     1 |
+| 12427 |      82 |     1 |
+| 12427 |      83 |     1 |
+| 12427 |       0 |     1 |
+| 12428 |       5 |     1 |
+| 12428 |       7 |     1 |
+| 12428 |      16 |     1 |
++-------+---------+-------+
+10 rows in set (0.00 sec)
+```
+
+# Online/realtime prediction on MySQL
+
+Define the sigmoid function used for logistic regression prediction as follows:
+
+```sql
+DROP FUNCTION IF EXISTS sigmoid;
+DELIMITER $$
+CREATE FUNCTION sigmoid(x DOUBLE)
+  RETURNS DOUBLE
+  LANGUAGE SQL
+BEGIN
+  RETURN 1.0 / (1.0 + EXP(-x));
+END;
+$$
+DELIMITER ;
+```
+
+We assume here that we make a prediction for a 'features' vector containing (0,1,10), where each of them is a categorical feature (i.e., the weight is 1.0). Then, you can get the probability by logistic regression simply as follows:
+
+```sql
+select
+  sigmoid(sum(m.weight)) as prob
+from
+  a9a_model1 m
+where
+  m.feature in (0,1,10);
+```
+
+```
++--------------------+
+| prob               |
++--------------------+
+| 0.1310696931351625 |
++--------------------+
+1 row in set (0.00 sec)
+```
+
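As a sanity check, the probability above can be reproduced from the exported weights (a sketch in Python, using the three weights shown earlier):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Weights for features 0, 1 and 10 from the a9a_model1 table above
weights = [-0.5761121511459351, -1.5259535312652588, 0.21053194999694824]
prob = sigmoid(sum(weights))
print(prob)  # matches the MySQL result, ~0.13107
```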
+Similar to [the way in Hive](https://github.com/myui/hivemall/wiki/a9a-binary-classification-(logistic-regression)#prediction), you can run the prediction as follows:
+
+```sql
+select
+  sigmoid(sum(t.value * m.weight)) as prob, 
+  if(sigmoid(sum(t.value * m.weight)) > 0.5, 1.0, 0.0) as predicted
+from
+  a9atest_exploded t LEFT OUTER JOIN
+  a9a_model1 m ON (t.feature = m.feature)
+where
+  t.rowid = 12427; -- prediction on a particular id
+```
+
+Alternatively, you can use SQL views as the testing target 't' in the above query.
+
+```
++---------------------+-----------+
+| prob                | predicted |
++---------------------+-----------+
+| 0.05595205126313402 |       0.0 |
++---------------------+-----------+
+1 row in set (0.00 sec)
+```
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/troubleshooting/README.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/troubleshooting/README.md 
b/docs/gitbook/troubleshooting/README.md
new file mode 100644
index 0000000..e69de29

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/troubleshooting/asterisk.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/troubleshooting/asterisk.md 
b/docs/gitbook/troubleshooting/asterisk.md
new file mode 100644
index 0000000..49e2f71
--- /dev/null
+++ b/docs/gitbook/troubleshooting/asterisk.md
@@ -0,0 +1,3 @@
+See [HIVE-4181](https://issues.apache.org/jira/browse/HIVE-4181): an asterisk argument without a table alias does not work for UDTFs. This has been fixed as part of the Hive v0.12 release.
+
+A possible workaround is to use an asterisk with a table alias, or to specify the names of the arguments explicitly.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/troubleshooting/mapjoin_classcastex.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/troubleshooting/mapjoin_classcastex.md 
b/docs/gitbook/troubleshooting/mapjoin_classcastex.md
new file mode 100644
index 0000000..c48919a
--- /dev/null
+++ b/docs/gitbook/troubleshooting/mapjoin_classcastex.md
@@ -0,0 +1,8 @@
+Map-side join on Tez causes a [ClassCastException](http://markmail.org/message/7cwbgupnhah6ggkv) when a serialized table contains array column(s).
+
+[Workaround] Try setting _hive.mapjoin.optimized.hashtable_ off as follows:
+```sql
+set hive.mapjoin.optimized.hashtable=false;
+```
+
+Caution: this has been fixed in Hive 1.3.0. Refer to [HIVE-11051](https://issues.apache.org/jira/browse/HIVE-11051) for details.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/troubleshooting/mapjoin_task_error.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/troubleshooting/mapjoin_task_error.md 
b/docs/gitbook/troubleshooting/mapjoin_task_error.md
new file mode 100644
index 0000000..02aff2f
--- /dev/null
+++ b/docs/gitbook/troubleshooting/mapjoin_task_error.md
@@ -0,0 +1,8 @@
+Since Hive 0.11.0, **hive.auto.convert.join** has been [enabled by default](https://issues.apache.org/jira/browse/HIVE-3297).
+
+For complex queries using views, the auto conversion sometimes throws a SemanticException ("cannot serialize object").
+
+A workaround for the exception is to disable **hive.auto.convert.join** before the execution as follows.
+```
+set hive.auto.convert.join=false;
+```
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/troubleshooting/num_mappers.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/troubleshooting/num_mappers.md 
b/docs/gitbook/troubleshooting/num_mappers.md
new file mode 100644
index 0000000..be01f2a
--- /dev/null
+++ b/docs/gitbook/troubleshooting/num_mappers.md
@@ -0,0 +1,20 @@
+The default _hive.input.format_ is _org.apache.hadoop.hive.ql.io.CombineHiveInputFormat_.
+This configuration can yield fewer mappers than the number of splits (i.e., the number of blocks in HDFS) of the input table.
+
+Try setting _hive.input.format_ to _org.apache.hadoop.hive.ql.io.HiveInputFormat_:
+```
+set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
+```
+
+Note that Apache Tez uses _org.apache.hadoop.hive.ql.io.HiveInputFormat_ by default.
+```
+set hive.tez.input.format;
+``` 
+> hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat
+
+***
+
+You can then control the maximum number of mappers via the following setting:
+```
+set mapreduce.job.maps=128;
+```
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/troubleshooting/oom.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/troubleshooting/oom.md 
b/docs/gitbook/troubleshooting/oom.md
new file mode 100644
index 0000000..643d09a
--- /dev/null
+++ b/docs/gitbook/troubleshooting/oom.md
@@ -0,0 +1,20 @@
+# OOM in mappers
+
+In certain settings, the default input split size is too large for Hivemall. Due to that, an OutOfMemoryError could happen in mappers in the middle of training.
+
+First, revise your Hadoop settings (**mapred.child.java.opts**/**mapred.map.child.java.opts**) to use as large a value as possible.
+
+If an OOM error still occurs after that, set a smaller **mapred.max.split.size** value before training.
+```
+SET mapred.max.split.size=67108864;
+```
+Then, the number of training examples used by each trainer is reduced (as the number of mappers increases) and the trained model should fit in memory.
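The effect of shrinking the split size can be estimated as follows (a sketch; 67108864 bytes is 64MB, and the input size is a hypothetical example):

```python
import math

def num_map_tasks(input_size_bytes, max_split_size_bytes):
    # Roughly one map task per input split
    return math.ceil(input_size_bytes / max_split_size_bytes)

gib = 1024 ** 3
# Halving the split size from 128MB to 64MB doubles the mappers,
# so each trainer sees roughly half as many training examples.
print(num_map_tasks(8 * gib, 128 * 1024 * 1024))  # -> 64
print(num_map_tasks(8 * gib, 67108864))           # -> 128
```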
+
+# OOM in shuffle/merge
+
+If an OOM occurs during the merge step, try setting a larger **mapred.reduce.tasks** value before training and revise the [shuffle/reduce parameters](http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#Shuffle%2FReduce+Parameters).
+```
+SET mapred.reduce.tasks=64;
+```
+
+If the OOM happened when using amplify(), try using rand_amplify() instead.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/src/site/site.xml
----------------------------------------------------------------------
diff --git a/src/site/site.xml b/src/site/site.xml
index fdb8fac..8a20d84 100644
--- a/src/site/site.xml
+++ b/src/site/site.xml
@@ -48,8 +48,8 @@
                                <ribbonOrientation>right</ribbonOrientation>
                                <ribbonColor>red</ribbonColor>
                        </gitHub>                       
-            <facebookLike />
-                       <twitter>
+            <!-- <facebookLike /> -->
+                       <twitter>                           
                                <user>ApacheHivemall</user>
                                <showUser>true</showUser>
                                <showFollowers>false</showFollowers>            
                
@@ -83,7 +83,7 @@
                </menu>
                
                <menu name="Documentation">
-                 <item name="User Guide" href="/userguide.html" />
+                 <item name="User Guide" href="/userguide/index.html" />
                  <item name="Overview" href="/overview.html" />
                  <item name="Wiki" href="https://cwiki.apache.org/confluence/display/HIVEMALL" target="_blank" />
           <item name="FAQ" href="/faq.html" />

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/src/site/xdoc/index.xml.vm
----------------------------------------------------------------------
diff --git a/src/site/xdoc/index.xml.vm b/src/site/xdoc/index.xml.vm
index 2638458..2dedfca 100644
--- a/src/site/xdoc/index.xml.vm
+++ b/src/site/xdoc/index.xml.vm
@@ -26,6 +26,9 @@
         <script src="js/misc.js" type="text/javascript"/>
     </head>
     <body>
+        <div class="alert alert-info" role="alert">
+            <strong>Info:</strong> We are now in the process of migrating the project repository from <a href="https://github.com/myui/hivemall">GitHub</a> to the <a href="https://github.com/apache/incubator-hivemall">Apache Incubator</a>.
+        </div>
         <div id="carousel-main" class="row">
             <div id="screenshots-carousel" class="carousel slide span10">
                 <!--  Carousel items  -->
@@ -45,9 +48,7 @@
                     <div class="item">
                         <img alt="" src="/images/hivemall_overview_bg.png" 
height="120px"/>
                         <div class="carousel-caption">
-                            <a href="http://www.slideshare.net/myui/introduction-to-hivemall">
-                            <p>Introduction to Hivemall (slide)</p>
-                            </a>
+                            <p>Introduction to Hivemall <a href="http://www.slideshare.net/myui/introduction-to-hivemall"></a></p>
                         </div>
                     </div>
                 </div>
