Repository: incubator-hivemall
Updated Branches:
  refs/heads/master [created] 72d6a629f
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/tips/hadoop_tuning.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/tips/hadoop_tuning.md b/docs/gitbook/tips/hadoop_tuning.md
new file mode 100644
index 0000000..a6c1854
--- /dev/null
+++ b/docs/gitbook/tips/hadoop_tuning.md
@@ -0,0 +1,79 @@
+# Prerequisites
+
+Please refer to the following guides for Hadoop tuning:
+
+* http://hadoopbook.com/
+* http://www.slideshare.net/cloudera/mr-perf
+
+---
+# Mapper-side configuration
+_Mapper configuration is important for Hivemall when training runs on mappers (e.g., when using rand_amplify())._
+
+```
+mapreduce.map.java.opts="-Xmx2048m -XX:+PrintGCDetails" (YARN)
+mapred.map.child.java.opts="-Xmx2048m -XX:+PrintGCDetails" (MR v1)
+
+mapreduce.task.io.sort.mb=1024 (YARN)
+io.sort.mb=1024 (MR v1)
+```
+
+Hivemall can use at most 1024MB in the above case:
+> mapreduce.map.java.opts - mapreduce.task.io.sort.mb = 2048MB - 1024MB = 1024MB
+
+Moreover, other Hadoop components consume memory, so only about half of that (roughly 1024MB * 0.5) is actually available to Hivemall. We recommend setting at least -Xmx2048m for a mapper.
+
+So, make `mapreduce.map.java.opts - mapreduce.task.io.sort.mb` as large as possible.
+
+# Reducer-side configuration
+_Reducer configuration is important for Hivemall when training runs on reducers (e.g., when using amplify())._
+
+```
+mapreduce.reduce.java.opts="-Xmx2048m -XX:+PrintGCDetails" (YARN)
+mapred.reduce.child.java.opts="-Xmx2048m -XX:+PrintGCDetails" (MR v1)
+
+mapreduce.reduce.shuffle.input.buffer.percent=0.6 (YARN)
+mapred.reduce.shuffle.input.buffer.percent=0.6 (MR v1)
+
+-- mapreduce.reduce.input.buffer.percent=0.2 (YARN)
+-- mapred.job.reduce.input.buffer.percent=0.2 (MR v1)
+```
+
+Hivemall can use at most about 820MB in the above case:
+> mapreduce.reduce.java.opts * (1 - mapreduce.reduce.shuffle.input.buffer.percent) = 2048 * (1 - 0.6) ≈ 820MB
+
+Moreover, other Hadoop components consume memory, so only about half of that (roughly 820MB * 0.5) is actually available to Hivemall. We recommend setting at least -Xmx2048m for a reducer.
+
+So, make `mapreduce.reduce.java.opts * (1 - mapreduce.reduce.shuffle.input.buffer.percent)` as large as possible.
+
+---
+# Formula to estimate consumed memory in Hivemall
+
+For a dense model, the memory consumed by Hivemall is estimated as follows:
+```
+feature_dimensions (2^24 by default) * 4 bytes (float) * 2 (iff covariance is calculated) * 1.2 (heuristics)
+```
+> 2^24 * 4 bytes * 2 * 1.2 ≈ 161MB
+
+When [SpaceEfficientDenseModel](https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/io/SpaceEfficientDenseModel.java) is used, the formula changes as follows:
+```
+feature_dimensions (assume 2^25 here) * 2 bytes (short) * 2 (iff covariance is calculated) * 1.2 (heuristics)
+```
+> 2^25 * 2 bytes * 2 * 1.2 ≈ 161MB
+
+Note: Hivemall uses a [sparse representation](https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/io/SparseModel.java) of the prediction model (using a hash table) by default. Use the "[-densemodel](https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/LearnerBaseUDTF.java#L87)" option to use a dense model.
+
+# Execution Engine of Hive
+
+We recommend using Apache Tez as the execution engine of Hive for Hivemall queries.
+
+```sql
+set mapreduce.framework.name=yarn-tez;
+set hive.execution.engine=tez;
+```
+
+You can fall back to plain old MapReduce with the following settings:
+
+```sql
+set mapreduce.framework.name=yarn;
+set hive.execution.engine=mr;
+```
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/tips/mixserver.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/tips/mixserver.md b/docs/gitbook/tips/mixserver.md
new file mode 100644
index 0000000..631557c
--- /dev/null
+++ b/docs/gitbook/tips/mixserver.md
@@ -0,0 +1,68 @@
+This page explains how to use model mixing in Hivemall. Model mixing is useful for better prediction performance and faster convergence when training classifiers.
+
+<!--
+You can find a brief explanation of the internal design of the MIX protocol in [this slide](http://www.slideshare.net/myui/hivemall-mix-internal).
-->
+
+Prerequisite
+============
+
+* Hivemall v0.3 or later
+
+We recommend using Mixing in a cluster with fast networking, though the current standard GbE is sufficient.
+
+Running Mix Server
+===================
+
+First, put the following files on server(s) that are accessible from Hadoop worker nodes:
+* [target/hivemall-mixserv.jar](https://github.com/myui/hivemall/releases)
+* [bin/run_mixserv.sh](https://github.com/myui/hivemall/raw/master/bin/run_mixserv.sh)
+
+_Caution: hivemall-mixserv.jar is large, and thus it is only used for Mix servers._
+
+```sh
+# run a Mix Server
+./run_mixserv.sh
+```
+
+In this example, we assume that Mix servers are running on host01, host02 and host03.
+The default port used by the Mix server is 11212; the port is configurable through the "-port" option of run_mixserv.sh.
+
+See [MixServer.java](https://github.com/myui/hivemall/blob/master/mixserv/src/main/java/hivemall/mix/server/MixServer.java#L90) for the details of the Mix server options.
+
+We recommend using multiple MIX servers to get better MIX throughput (3-5 or so would be enough for a normal cluster size). The MIX protocol of Hivemall is *horizontally scalable* by adding MIX server nodes.
+
+Using Mix Protocol through Hivemall
+===================================
+
+[Install Hivemall](https://github.com/myui/hivemall/wiki/Installation) on Hive.
+
+_Make sure that [hivemall-with-dependencies.jar](https://github.com/myui/hivemall/raw/master/target/hivemall-with-dependencies.jar) is used for the installation. The jar contains the minimum required jars (netty, jsr305) for running Hivemall on Hive._
+
+Now, we explain how to use mixing in [an example using the KDD2010a dataset](https://github.com/myui/hivemall/wiki/KDD2010a-classification).
+
+Enabling mixing on Hivemall is as simple as follows:
+```sql
+use kdd2010;
+
+create table kdd10a_pa1_model1 as
+select
+  feature,
+  cast(voted_avg(weight) as float) as weight
+from
+  (select
+     train_pa1(addBias(features),label,"-mix host01,host02,host03") as (feature,weight)
+   from
+     kdd10a_train_x3
+  ) t
+group by feature;
+```
+
+All you have to do is add the "*-mix*" training option, as seen in the above query.
+
+The effect of model mixing
+===========================
+
+In my experience, the MIX improved the prediction accuracy of the above KDD2010a PA1 training on a 32-node cluster from 0.844835019263103 (w/o mix) to 0.8678096499719774 (w/ mix).
+
+The overhead of using the MIX protocol is *almost negligible* because the MIX communication is efficiently handled using asynchronous non-blocking I/O. Furthermore, the training time can even improve in certain settings because of the faster convergence due to mixing.
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/tips/rand_amplify.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/tips/rand_amplify.md b/docs/gitbook/tips/rand_amplify.md
new file mode 100644
index 0000000..4df124e
--- /dev/null
+++ b/docs/gitbook/tips/rand_amplify.md
@@ -0,0 +1,103 @@
+This article explains the *amplify* technique, which is useful for improving the prediction score.
+
+Iterations are mandatory in machine learning (e.g., in [stochastic gradient descent](http://en.wikipedia.org/wiki/Stochastic_gradient_descent)) to get good prediction models. However, MapReduce is known to be ill-suited for iterative algorithms because the input/output of each MapReduce job goes through HDFS.
+
+In this example, we show how Hivemall deals with this problem. We use the [KDD Cup 2012, Track 2 Task](https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-dataset) as an example.
+
+**WARNING**: rand_amplify() is supported in v0.2-beta1 and later.
+
+---
+# Amplify training examples in Map phase and shuffle them in Reduce phase
+Hivemall provides the **amplify** UDTF to emulate the effect of iterations in machine learning without running several MapReduce steps.
+
+The amplify function returns multiple rows for each input row.
+The first argument ${xtimes} is the multiplication factor.
+In the following examples, the multiplication factor is set to 3.
+
+```sql
+set hivevar:xtimes=3;
+
+create or replace view training_x3
+as
+select
+  *
+from (
+select
+  amplify(${xtimes}, *) as (rowid, label, features)
+from
+  training_orcfile
+) t
+CLUSTER BY rand();
+```
+
+In the above example, the [CLUSTER BY](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy#LanguageManualSortBy-SyntaxofClusterByandDistributeBy) clause distributes Map outputs to reducers using a random key as the distribution key, so the input records of each reducer are randomly shuffled.
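As an illustration (plain Python, not Hivemall code), the effect of amplify() followed by CLUSTER BY rand() on the row stream can be sketched as follows:

```python
import random
from collections import Counter

def amplify(xtimes, rows):
    """Mimic the amplify() UDTF: emit each input row xtimes times."""
    for row in rows:
        for _ in range(xtimes):
            yield row

def cluster_by_rand(rows):
    """Mimic CLUSTER BY rand(): distributing by a random key
    globally shuffles the records seen by the reducers."""
    shuffled = list(rows)
    random.shuffle(shuffled)
    return shuffled

# toy (rowid, label, features) rows
rows = [(i, i % 2, "f%d:1.0" % i) for i in range(4)]
out = cluster_by_rand(amplify(3, rows))

# Every row appears exactly xtimes (=3) times, but in a random order.
assert Counter(out) == Counter({r: 3 for r in rows})
print(len(out))  # 12
```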
+
+The multiplication of records and the random shuffling have an effect similar to iterations.
+So, we recommend that users train on an amplified view as follows:
+
+```sql
+create table lr_model_x3
+as
+select
+  feature,
+  cast(avg(weight) as float) as weight
+from
+  (select
+     logress(features,label) as (feature,weight)
+   from
+     training_x3
+  ) t
+group by feature;
+```
+
+The above query is executed by 2 MapReduce jobs as shown below:
+
+[Here](https://dl.dropboxusercontent.com/u/13123103/hivemall/amplify_plan.txt) is the actual plan generated by Hive.
+
+Using *training_x3* instead of the plain training table results in a better AUC (0.746214) in [this](https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-(regression\)) example.
+
+A problem with amplify() is that the shuffle (copy) and merge phases of stage 1 can become a bottleneck.
+When the training table is large enough to involve 100 Map tasks, the merge operator needs to merge at least 100 files by (external) merge sort!
+
+Note that the actual bottleneck is not the M/R iterations but the shuffling of training instances. Iteration without shuffling (as in [the Spark example](http://spark.incubator.apache.org/examples.html)) causes very slow convergence and requires more iterations. Shuffling cannot be avoided even in iterative MapReduce variants.
+
+---
+# Amplify and shuffle training examples in each Map task
+
+To deal with large training data, Hivemall provides the **rand_amplify** UDTF, which randomly shuffles input rows within a Map task.
+The rand_amplify UDTF outputs rows in a random order once the local buffer specified by ${shufflebuffersize} is filled.
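The buffering behavior can be sketched in Python as follows; this is an illustrative model of the semantics described above, not the actual UDTF implementation:

```python
import random

def rand_amplify(xtimes, shufflebuffersize, rows):
    """Sketch of rand_amplify(): duplicate each row xtimes and emit rows
    in a random order through a bounded, map-local shuffle buffer."""
    buf = []
    for row in rows:
        for _ in range(xtimes):
            buf.append(row)
            if len(buf) >= shufflebuffersize:
                # Buffer is full: emit one randomly chosen buffered row.
                yield buf.pop(random.randrange(len(buf)))
    random.shuffle(buf)  # flush the remainder at the end of the map task
    for row in buf:
        yield row

out = list(rand_amplify(3, 4, range(5)))
print(len(out))  # 15 = 5 input rows * xtimes
```

Because the shuffle is local and bounded, no global merge of amplified rows is needed, which is why the merge-phase bottleneck disappears.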
+
+With rand_amplify(), the view definition of training_x3 becomes as follows:
+```sql
+set hivevar:shufflebuffersize=1000;
+
+create or replace view training_x3
+as
+select
+  rand_amplify(${xtimes}, ${shufflebuffersize}, *) as (rowid, label, features)
+from
+  training_orcfile;
+```
+
+The training query is executed as follows:
+
+[Here](https://dl.dropboxusercontent.com/u/13123103/hivemall/randamplify_plan.txt) is the actual query plan.
+
+The map-local multiplication and shuffling have no bottleneck in the merge phase, and the query is efficiently executed within a single MapReduce job.
+
+Using *rand_amplify* results in a better AUC (0.743392) in [this](https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-(regression\)) example.
+
+---
+# Conclusion
+
+We recommend that users use *amplify()* for small training inputs and *rand_amplify()* for large training inputs to get better accuracy in a reasonable training time.
+
+| Method | ELAPSED TIME (sec) | AUC |
+|:-----------|--------------------|----:|
+| Plain | 89.718 | 0.734805 |
+| amplifier+clustered by | 479.855 | 0.746214 |
+| rand_amplifier | 116.424 | 0.743392 |
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/tips/rowid.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/tips/rowid.md b/docs/gitbook/tips/rowid.md
new file mode 100644
index 0000000..c43aa74
--- /dev/null
+++ b/docs/gitbook/tips/rowid.md
@@ -0,0 +1,31 @@
+```sql
+CREATE TABLE xxx
+AS
+SELECT
+  regexp_replace(reflect('java.util.UUID','randomUUID'), '-', '') as rowid,
+  *
+FROM
+  ..;
+```
+
+Another option for generating a rowid is to use row_number().
+However, query execution can become too slow for a large dataset because the rowid generation is executed on a single reducer.
+```sql
+CREATE TABLE xxx
+AS
+select
+  row_number() over () as rowid,
+  *
+from a9atest;
+```
+
+***
+# Rowid generator provided in Hivemall v0.2 or later
+You can use the [rowid() function](https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/tools/mapred/RowIdUDF.java) to generate a unique rowid in Hivemall v0.2 or later.
+```sql
+select
+  rowid() as rowid, -- returns ${task_id}-${sequence_number}
+  *
+from
+  xxx
+```
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/tips/rt_prediction.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/tips/rt_prediction.md b/docs/gitbook/tips/rt_prediction.md
new file mode 100644
index 0000000..3ac4fb6
--- /dev/null
+++ b/docs/gitbook/tips/rt_prediction.md
@@ -0,0 +1,234 @@
+Hivemall provides a batch learning scheme that builds prediction models on Apache Hive.
+The learning process itself is a batch process; however, online/real-time prediction can be achieved by carrying out the prediction on a transactional relational DBMS.
+
+In this article, we explain how to run real-time prediction using a relational DBMS.
+We assume that you have already run the [a9a binary classification task](https://github.com/myui/hivemall/wiki#a9a-binary-classification).
+
+# Prerequisites
+
+- MySQL
+
+Put mysql-connector-java.jar (the JDBC driver) on $SQOOP_HOME/lib.
+
+- [Sqoop](http://sqoop.apache.org/)
+
+Sqoop 1.4.5 does not support Hadoop v2.6.0, so you need to build the packages for Hadoop 2.6 yourself.
+To do that, edit build.xml and ivy.xml as shown in [this patch](https://gist.github.com/myui/e8db4a31b574103133c6).
+
+# Preparing Model Tables on MySQL
+
+```sql
+create database a9a;
+use a9a;
+
+create user sqoop identified by 'sqoop';
+grant all privileges on a9a.* to 'sqoop'@'%' identified by 'sqoop';
+flush privileges;
+
+create table a9a_model1 (
+  feature int,
+  weight double
+);
+```
+
+Do not forget to edit the bind_address in the MySQL configuration file (/etc/mysql/my.cnf) so that the server is accessible from the master and slave nodes of Hadoop.
+
+# Exporting Hive table to MySQL
+
+Check the connectivity to the MySQL server using Sqoop.
+
+```sh
+export MYSQL_HOST=dm01
+
+export HADOOP_HOME=/opt/hadoop
+export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop/
+export HADOOP_COMMON_HOME=${HADOOP_HOME}
+
+bin/sqoop list-tables --connect jdbc:mysql://${MYSQL_HOST}/a9a --username sqoop --password sqoop
+```
+
+Create a TSV table, because Sqoop cannot directly read Hive tables.
+
+```sql
+create table a9a_model1_tsv
+  ROW FORMAT DELIMITED
+    FIELDS TERMINATED BY "\t"
+    LINES TERMINATED BY "\n"
+  STORED AS TEXTFILE
+AS
+select * from a9a_model1;
+```
+
+Check the location of 'a9a_model1_tsv' as follows:
+
+```sql
+desc extended a9a_model1_tsv;
+> location:hdfs://dm01:9000/user/hive/warehouse/a9a.db/a9a_model1_tsv
+```
+
+```sh
+bin/sqoop export \
+--connect jdbc:mysql://${MYSQL_HOST}/a9a \
+--username sqoop --password sqoop \
+--table a9a_model1 \
+--export-dir /user/hive/warehouse/a9a.db/a9a_model1_tsv \
+--input-fields-terminated-by '\t' --input-lines-terminated-by '\n' \
+--batch
+```
+
+When the export successfully finishes, you can find the entries in the model table in MySQL.
+
+```sql
+mysql> select * from a9a_model1 limit 3;
++---------+---------------------+
+| feature | weight              |
++---------+---------------------+
+|       0 | -0.5761121511459351 |
+|       1 | -1.5259535312652588 |
+|      10 | 0.21053194999694824 |
++---------+---------------------+
+3 rows in set (0.00 sec)
+```
+
+We recommend creating an index on the model table to boost lookups in online prediction.
+
+```sql
+CREATE UNIQUE INDEX a9a_model1_feature_index on a9a_model1 (feature);
+-- USING BTREE;
+```
+
+# Exporting test data from Hadoop to MySQL (optional step)
+
+Prepare the testing data table in Hive to be exported.
+
+```sql
+create table a9atest_exploded_tsv
+  ROW FORMAT DELIMITED
+    FIELDS TERMINATED BY "\t"
+    LINES TERMINATED BY "\n"
+  STORED AS TEXTFILE
+AS
+select
+  rowid,
+  -- label,
+  extract_feature(feature) as feature,
+  extract_weight(feature) as value
+from
+  a9atest LATERAL VIEW explode(addBias(features)) t AS feature;
+
+desc extended a9atest_exploded_tsv;
+> location:hdfs://dm01:9000/user/hive/warehouse/a9a.db/a9atest_exploded_tsv,
+```
+
+Prepare the test table in MySQL, into which the data will be imported from Hadoop.
+
+```sql
+use a9a;
+
+create table a9atest_exploded (
+  rowid bigint,
+  feature int,
+  value double
+);
+```
+
+Then, run Sqoop to export the data from HDFS to MySQL.
+
+```sh
+export MYSQL_HOST=dm01
+
+bin/sqoop export \
+--connect jdbc:mysql://${MYSQL_HOST}/a9a \
+--username sqoop --password sqoop \
+--table a9atest_exploded \
+--export-dir /user/hive/warehouse/a9a.db/a9atest_exploded_tsv \
+--input-fields-terminated-by '\t' --input-lines-terminated-by '\n' \
+--batch
+```
+
+It is better to add an index on the rowid column to speed up selection by rowid.
+```sql
+CREATE INDEX a9atest_exploded_rowid_index on a9atest_exploded (rowid) USING BTREE;
+```
+
+When the export successfully finishes, you can find the entries in the test table in MySQL.
+
+```sql
+mysql> select * from a9atest_exploded limit 10;
++-------+---------+-------+
+| rowid | feature | value |
++-------+---------+-------+
+| 12427 |      67 |     1 |
+| 12427 |      73 |     1 |
+| 12427 |      74 |     1 |
+| 12427 |      76 |     1 |
+| 12427 |      82 |     1 |
+| 12427 |      83 |     1 |
+| 12427 |       0 |     1 |
+| 12428 |       5 |     1 |
+| 12428 |       7 |     1 |
+| 12428 |      16 |     1 |
++-------+---------+-------+
+10 rows in set (0.00 sec)
+```
+
+# Online/realtime prediction on MySQL
+
+Define the sigmoid function used for prediction with logistic regression as follows:
+
+```sql
+DROP FUNCTION IF EXISTS sigmoid;
+DELIMITER $$
+CREATE FUNCTION sigmoid(x DOUBLE)
+  RETURNS DOUBLE
+  LANGUAGE SQL
+BEGIN
+  RETURN 1.0 / (1.0 + EXP(-x));
+END;
+$$
+DELIMITER ;
+```
+
+Assume here that we predict for a 'features' vector (0,1,10), where each entry is a categorical feature (i.e., the weight is 1.0). Then, you can get the probability by logistic regression simply as follows:
+
+```sql
+select
+  sigmoid(sum(m.weight)) as prob
+from
+  a9a_model1 m
+where
+  m.feature in (0,1,10);
+```
+
+```
++--------------------+
+| prob               |
++--------------------+
+| 0.1310696931351625 |
++--------------------+
+1 row in set (0.00 sec)
+```
+
+Similar to [the way in Hive](https://github.com/myui/hivemall/wiki/a9a-binary-classification-(logistic-regression)#prediction), you can run the prediction as follows:
+
+```sql
+select
+  sigmoid(sum(t.value * m.weight)) as prob,
+  if(sigmoid(sum(t.value * m.weight)) > 0.5, 1.0, 0.0) as predicted
+from
+  a9atest_exploded t LEFT OUTER JOIN
+  a9a_model1 m ON (t.feature = m.feature)
+where
+  t.rowid = 12427; -- prediction on a particular id
+```
+
+Alternatively, you can use an SQL view as the test target 't' in the above query.
+
+```
++---------------------+-----------+
+| prob                | predicted |
++---------------------+-----------+
+| 0.05595205126313402 |       0.0 |
++---------------------+-----------+
+1 row in set (0.00 sec)
+```
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/troubleshooting/README.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/troubleshooting/README.md b/docs/gitbook/troubleshooting/README.md
new file mode 100644
index 0000000..e69de29
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/troubleshooting/asterisk.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/troubleshooting/asterisk.md b/docs/gitbook/troubleshooting/asterisk.md
new file mode 100644
index 0000000..49e2f71
--- /dev/null
+++ b/docs/gitbook/troubleshooting/asterisk.md
@@ -0,0 +1,3 @@
+See [HIVE-4181](https://issues.apache.org/jira/browse/HIVE-4181): an asterisk argument without a table alias does not work for UDTFs. This has been fixed as part of the Hive v0.12 release.
+
+A possible workaround is to use an asterisk with a table alias, or to specify the names of the arguments explicitly.
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/troubleshooting/mapjoin_classcastex.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/troubleshooting/mapjoin_classcastex.md b/docs/gitbook/troubleshooting/mapjoin_classcastex.md
new file mode 100644
index 0000000..c48919a
--- /dev/null
+++ b/docs/gitbook/troubleshooting/mapjoin_classcastex.md
@@ -0,0 +1,8 @@
+A map-side join on Tez causes a [ClassCastException](http://markmail.org/message/7cwbgupnhah6ggkv) when a serialized table contains array column(s).
+
+[Workaround] Try turning _hive.mapjoin.optimized.hashtable_ off as follows:
+```sql
+set hive.mapjoin.optimized.hashtable=false;
+```
+
+Caution: This is fixed in Hive 1.3.0. Refer to [HIVE-11051](https://issues.apache.org/jira/browse/HIVE-11051) for the details.
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/troubleshooting/mapjoin_task_error.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/troubleshooting/mapjoin_task_error.md b/docs/gitbook/troubleshooting/mapjoin_task_error.md
new file mode 100644
index 0000000..02aff2f
--- /dev/null
+++ b/docs/gitbook/troubleshooting/mapjoin_task_error.md
@@ -0,0 +1,8 @@
+Since Hive 0.11.0, **hive.auto.convert.join** is [enabled by default](https://issues.apache.org/jira/browse/HIVE-3297).
+
+For complex queries using views, the auto conversion sometimes throws a SemanticException ("cannot serialize object").
+
+The workaround for this exception is to disable **hive.auto.convert.join** before the execution, as follows.
+```
+set hive.auto.convert.join=false;
+```
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/troubleshooting/num_mappers.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/troubleshooting/num_mappers.md b/docs/gitbook/troubleshooting/num_mappers.md
new file mode 100644
index 0000000..be01f2a
--- /dev/null
+++ b/docs/gitbook/troubleshooting/num_mappers.md
@@ -0,0 +1,20 @@
+The default _hive.input.format_ is set to _org.apache.hadoop.hive.ql.io.CombineHiveInputFormat_.
+This configuration can yield fewer mappers than the number of splits (i.e., # of blocks in HDFS) of the input table.
+
+Try setting _org.apache.hadoop.hive.ql.io.HiveInputFormat_ as _hive.input.format_.
+```
+set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
+```
+
+Note that Apache Tez uses _org.apache.hadoop.hive.ql.io.HiveInputFormat_ by default.
+```
+set hive.tez.input.format;
+```
+> hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat
+
+***
+
+You can then control the maximum number of mappers with the following setting:
+```
+set mapreduce.job.maps=128;
+```
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/docs/gitbook/troubleshooting/oom.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/troubleshooting/oom.md b/docs/gitbook/troubleshooting/oom.md
new file mode 100644
index 0000000..643d09a
--- /dev/null
+++ b/docs/gitbook/troubleshooting/oom.md
@@ -0,0 +1,20 @@
+# OOM in mappers
+
+In certain settings, the default input split size is too large for Hivemall. Because of that, an OutOfMemoryError could happen in mappers in the middle of training.
+
+In that case, first revise your Hadoop setting (**mapred.child.java.opts**/**mapred.map.child.java.opts**) to use as large a value as possible.
+
+If the OOM error still occurs after that, set a smaller **mapred.max.split.size** value before training.
+```
+SET mapred.max.split.size=67108864;
+```
+Then, the number of training examples used by each trainer is reduced (as the number of mappers increases), and the trained model should fit in memory.
+
+# OOM in shuffle/merge
+
+If the OOM occurs during the merge step, try setting a larger **mapred.reduce.tasks** value before training, and revise the [shuffle/reduce parameters](http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#Shuffle%2FReduce+Parameters).
+```
+SET mapred.reduce.tasks=64;
+```
+
+If your OOM happened when using amplify(), try using rand_amplify() instead.
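When sizing the heap against these OOMs, the dense-model memory formula from the Hadoop tuning page gives a quick estimate of what the trainer itself needs. The sketch below evaluates that formula in Python (the 1.2 factor is the guide's heuristic; MB here means 10^6 bytes, matching the guide's figures):

```python
def dense_model_bytes(feature_dimensions=2 ** 24, bytes_per_weight=4,
                      with_covariance=False, heuristic_factor=1.2):
    """Estimate memory consumed by Hivemall's dense model:
    feature_dimensions * bytes per weight * 2 (iff covariance) * 1.2."""
    covariance = 2 if with_covariance else 1
    return feature_dimensions * bytes_per_weight * covariance * heuristic_factor

# Default dense model with covariance: 2^24 * 4 bytes * 2 * 1.2
print(round(dense_model_bytes(with_covariance=True) / 1e6))  # 161 (MB)

# SpaceEfficientDenseModel: 2^25 features, 2-byte (short) weights
print(round(dense_model_bytes(2 ** 25, 2, True) / 1e6))  # 161 (MB)
```

Comparing the result against the heap left over after the sort/shuffle buffers (see the mapper- and reducer-side formulas in the tuning page) tells you whether to raise the heap or shrink the split size.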
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/src/site/site.xml
----------------------------------------------------------------------
diff --git a/src/site/site.xml b/src/site/site.xml
index fdb8fac..8a20d84 100644
--- a/src/site/site.xml
+++ b/src/site/site.xml
@@ -48,8 +48,8 @@
         <ribbonOrientation>right</ribbonOrientation>
         <ribbonColor>red</ribbonColor>
       </gitHub>
-      <facebookLike />
-      <twitter>
+      <!-- <facebookLike /> -->
+      <twitter>
         <user>ApacheHivemall</user>
         <showUser>true</showUser>
         <showFollowers>false</showFollowers>
@@ -83,7 +83,7 @@
     </menu>
     <menu name="Documentation">
-      <item name="User Guide" href="/userguide.html" />
+      <item name="User Guide" href="/userguide/index.html" />
       <item name="Overview" href="/overview.html" />
       <item name="Wiki" href="https://cwiki.apache.org/confluence/display/HIVEMALL" target="_blank" />
       <item name="FAQ" href="/faq.html" />
http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/370e2aa3/src/site/xdoc/index.xml.vm
----------------------------------------------------------------------
diff --git a/src/site/xdoc/index.xml.vm b/src/site/xdoc/index.xml.vm
index 2638458..2dedfca 100644
--- a/src/site/xdoc/index.xml.vm
+++ b/src/site/xdoc/index.xml.vm
@@ -26,6 +26,9 @@
     <script src="js/misc.js" type="text/javascript"/>
   </head>
   <body>
+    <div class="alert alert-info" role="alert">
+      <strong>Info:</strong> We are now in the process of migrating the project repository from <a href="https://github.com/myui/hivemall">Github</a> to <a href="https://github.com/apache/incubator-hivemall">Apache Incubator</a>.
+    </div>
     <div id="carousel-main" class="row">
       <div id="screenshots-carousel" class="carousel slide span10">
         <!-- Carousel items -->
@@ -45,9 +48,7 @@
           <div class="item">
             <img alt="" src="/images/hivemall_overview_bg.png" height="120px"/>
             <div class="carousel-caption">
-              <a href="http://www.slideshare.net/myui/introduction-to-hivemall">
-                <p>Introduction to Hivemall (slide)</p>
-              </a>
+              <p>Introduction to Hivemall <a href="http://www.slideshare.net/myui/introduction-to-hivemall"></a></p>
             </div>
           </div>
         </div>
