Repository: incubator-madlib
Updated Branches:
  refs/heads/master 3eec0a82e -> 206e1269e


Doc: Update documentation

Minor corrections and changes in elastic net, decision tree, random
forest, pivot.

Closes #118


Project: http://git-wip-us.apache.org/repos/asf/incubator-madlib/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-madlib/commit/206e1269
Tree: http://git-wip-us.apache.org/repos/asf/incubator-madlib/tree/206e1269
Diff: http://git-wip-us.apache.org/repos/asf/incubator-madlib/diff/206e1269

Branch: refs/heads/master
Commit: 206e1269edfef4589639021d27fa5072b9297339
Parents: 3eec0a8
Author: Frank McQuillan <fmcquil...@pivotal.io>
Authored: Tue Apr 18 13:07:05 2017 -0700
Committer: Rahul Iyer <ri...@apache.org>
Committed: Tue Apr 18 17:28:05 2017 -0700

----------------------------------------------------------------------
 .../modules/elastic_net/elastic_net.sql_in      |  3 +-
 .../recursive_partitioning/decision_tree.sql_in | 17 ++++---
 .../recursive_partitioning/random_forest.sql_in | 49 +++++++++++++-------
 .../postgres/modules/utilities/pivot.sql_in     |  4 +-
 4 files changed, 48 insertions(+), 25 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-madlib/blob/206e1269/src/ports/postgres/modules/elastic_net/elastic_net.sql_in
----------------------------------------------------------------------
diff --git a/src/ports/postgres/modules/elastic_net/elastic_net.sql_in b/src/ports/postgres/modules/elastic_net/elastic_net.sql_in
index 2949fc5..f3a8980 100644
--- a/src/ports/postgres/modules/elastic_net/elastic_net.sql_in
+++ b/src/ports/postgres/modules/elastic_net/elastic_net.sql_in
@@ -735,7 +735,8 @@ The two queries above will result in same residuals:
 -# Reuse the houses table above.
 Here we use 3-fold cross validation with 3 automatically generated 
 lambda values and 3 specified alpha values. (This can take some time to 
-run since elastic net is effectively being called 27 times.)
+run since elastic net is effectively being called 27 times for 
+these combinations, then a 28th time for the whole dataset.)
 <pre class="example">
 DROP TABLE IF EXISTS houses_en3, houses_en3_summary, houses_en3_cv;
 SELECT madlib.elastic_net_train( 'houses',                  -- Source table
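As a quick check of the run count mentioned above (a plain-Python sketch, not part of the MADlib example):

```python
# 3-fold cross validation over 3 generated lambda values and 3 specified
# alpha values: every (fold, lambda, alpha) combination trains one
# elastic net model, and one extra fit on the whole dataset uses the
# best (lambda, alpha) pair found.
n_folds, n_lambdas, n_alphas = 3, 3, 3
cv_fits = n_folds * n_lambdas * n_alphas  # 27 cross-validation fits
total_fits = cv_fits + 1                  # plus the final full-data fit
print(cv_fits, total_fits)                # 27 28
```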

http://git-wip-us.apache.org/repos/asf/incubator-madlib/blob/206e1269/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in
----------------------------------------------------------------------
diff --git a/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in b/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in
index ef671fc..7251a9c 100644
--- a/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in
+++ b/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in
@@ -259,7 +259,9 @@ tree_train(
 
   <DT>max_depth (optional)</DT>
   <DD>INTEGER, default: 7. Maximum depth of any node of the final tree,
-      with the root node counted as depth 0.</DD>
+      with the root node counted as depth 0. A deeper tree can
+      lead to better prediction but will also result in
+      longer processing time and higher memory usage.</DD>
 
   <DT>min_split (optional)</DT>
   <DD>INTEGER, default: 20. Minimum number of observations that must exist
@@ -276,7 +278,7 @@ tree_train(
       discrete quantiles to compute split boundaries. This global parameter
       is used to compute the resolution of splits for continuous features.
       Higher number of bins will lead to better prediction,
-      but will also result in longer processing.</DD>
+      but will also result in longer processing time and higher memory usage.</DD>
 
   <DT>pruning_params (optional)</DT>
   <DD>TEXT. Comma-separated string of key-value pairs giving
@@ -351,9 +353,10 @@ provided <em>cp</em> and explore all possible sub-trees (up to a single-node tre
 to compute the optimal sub-tree. The optimal sub-tree and the 'cp' corresponding
 to this optimal sub-tree is placed in the <em>output_table</em>, with the
 columns named as <em>tree</em> and <em>pruning_cp</em> respectively.
-- The main parameters that affect memory usage are:  depth of tree, number
-of features, number of values per categorical feature, and number of bins for
-continuous features.  If you are hitting VMEM limits, consider reducing one or
+- The main parameters that affect memory usage are: depth of
+tree (‘max_depth’), number of features, number of values per
+categorical feature, and number of bins for continuous features (‘num_splits’).
+If you are hitting memory limits, consider reducing one or
 more of these parameters.
 
 @anchor predict
@@ -922,7 +925,9 @@ File decision_tree.sql_in documenting the training function
   *        each observation.
   * @param max_depth OPTIONAL (Default = 7). Set the maximum depth
   *        of any node of the final tree, with the root node counted
-  *        as depth 0.
+  *        as depth 0. A deeper tree can lead to better prediction
+  *        but will also result in longer processing time and higher
+  *        memory usage.
   * @param min_split OPTIONAL (Default = 20). Minimum number of
   *        observations that must exist in a node for a split to
   *        be attempted.

http://git-wip-us.apache.org/repos/asf/incubator-madlib/blob/206e1269/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in
----------------------------------------------------------------------
diff --git a/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in b/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in
index 3d4da87..f263cf9 100644
--- a/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in
+++ b/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in
@@ -34,6 +34,9 @@ constructed using bootstrapped samples from the input data. The results of these
 models are then combined to yield a single prediction, which, at the
 expense of some loss in interpretation, have been found to be highly accurate.
 
+Please also refer to the decision tree user documentation for 
+information relevant to the implementation of random forests in MADlib.
+
 @anchor train
 @par Training Function
 Random Forest training function has the following format:
@@ -276,8 +279,16 @@ forest_train(training_table_name,
   <DT>list_of_features</DT>
   <DD>text. Comma-separated string of column names to use as predictors. Can
   also be a '*' implying all columns are to be used as predictors (except the
-  ones included in the next argument). Boolean, integer and text columns are
-  considered categorical columns.</DD>
+  ones included in the next argument). The types of the features can be mixed
+  where boolean, integer, and text columns are considered categorical and
+  double precision columns are considered continuous. The categorical variables
+  are not encoded and are used as-is for training.
+
+  It is important to note that we don't test for every combination of
+  levels of a categorical variable when evaluating a split. We order the levels
+  of the non-integer categorical variable by the entropy of the variable in
+  predicting the response. The split at each node is evaluated between these
+  ordered levels. Integer categorical variables are ordered by their value.</DD>
 
   <DT>list_of_features_to_exclude</DT>
  <DD>text. Comma-separated string of column names to exclude from the predictors
@@ -317,9 +328,11 @@ forest_train(training_table_name,
       the default value of 1 is sufficient to compute the importance.
   </DD>
 
-  <DT>max_depth (optional)</DT>
-  <DD>integer, default: 10. Maximum depth of any node of a tree,
-      with the root node counted as depth 0.</DD>
+  <DT>max_tree_depth (optional)</DT>
+  <DD>integer, default: 7. Maximum depth of any node of a tree,
+      with the root node counted as depth 0. A deeper tree can
+      lead to better prediction but will also result in 
+      longer processing time and higher memory usage.</DD>
 
   <DT>min_split (optional)</DT>
   <DD>integer, default: 20. Minimum number of observations that must exist
@@ -331,11 +344,11 @@ forest_train(training_table_name,
       set to min_bucket*3 or min_bucket to min_split/3, as appropriate.</DD>
 
   <DT>num_splits (optional)</DT>
-  <DD>integer, default: 100. Continuous-valued features are binned into
+  <DD>integer, default: 20. Continuous-valued features are binned into
       discrete quantiles to compute split boundaries. This global parameter
       is used to compute the resolution of splits for continuous features.
       Higher number of bins will lead to better prediction,
-      but will also result in higher processing time.</DD>
+      but will also result in longer processing time and higher memory usage.</DD>
 
   <DT>surrogate_params (optional)</DT>
   <DD>text, Comma-separated string of key-value pairs controlling the behavior
@@ -358,10 +371,11 @@ forest_train(training_table_name,
     is close to 0 may result in trees with only the root node.
     This allows users to experiment with the function in a speedy fashion.</DD>
 </DL>
-    @note The main parameters that affect memory usage are:  depth of tree, number
-    of features, and number of values per feature (controlled by num_splits).  
-    If you are hitting VMEM limits,
-    consider reducing one or more of these parameters.
+    @note The main parameters that affect memory usage are: depth of 
+    tree (‘max_tree_depth’), number of features, number of values per 
+    categorical feature, and number of bins for continuous features (‘num_splits’).
+    If you are hitting memory limits, consider reducing one or 
+    more of these parameters.
 
 @anchor predict
 @par Prediction Function
@@ -858,7 +872,7 @@ File random_forest.sql_in documenting the training function
   * @param num_random_features OPTIONAL (Default = sqrt(n) for classification,
   *        n/3 for regression) Number of features to randomly select at
   *        each split.
-  * @param max_tree_depth OPTIONAL (Default = 10). Set the maximum depth
+  * @param max_tree_depth OPTIONAL (Default = 7). Set the maximum depth
   *        of any node of the final tree, with the root node counted
   *        as depth 0.
   * @param min_split OPTIONAL (Default = 20). Minimum number of
@@ -869,12 +883,13 @@ File random_forest.sql_in documenting the training function
   *        one of minbucket or minsplit is specified, minsplit
   *        is set to minbucket*3 or minbucket to minsplit/3, as
   *        appropriate.
-  * @param num_splits optional (default = 100) number of bins to use
-  *        during binning. continuous-valued features are binned
+  * @param num_splits optional (default = 20) number of bins to use
+  *        during binning. Continuous-valued features are binned
   *        into discrete bins (per the quartile values) to compute
-  *        split bound- aries. this global parameter is used to
-  *        compute the resolution of the bins. higher number of
-  *        bins will lead to higher processing time.
+  *        split boundaries. This global parameter is used to
+  *        compute the resolution of the bins. Higher number of
+  *        bins will lead to higher processing time and more
+  *        memory usage.
   * @param verbose optional (default = false) prints status
   *        information on the splits performed and any other
   *        information useful for debugging.
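The entropy-based ordering of categorical levels described under 'list_of_features' above can be sketched as follows. This toy Python version orders levels by the response entropy within each level, which is one plausible reading of the documented behavior; it is not MADlib's code:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    # Shannon entropy (in bits) of the response distribution.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

def order_levels_by_entropy(feature, response):
    # Group responses by categorical level, then order the levels by
    # the entropy of the response within each level, so a split can be
    # evaluated between adjacent ordered levels instead of testing
    # every subset of levels.
    by_level = defaultdict(list)
    for lvl, y in zip(feature, response):
        by_level[lvl].append(y)
    return sorted(by_level, key=lambda lvl: entropy(by_level[lvl]))

feature  = ['a', 'a', 'b', 'b', 'c', 'c']
response = [ 0,   0,   0,   1,   1,   1 ]
print(order_levels_by_entropy(feature, response))   # ['a', 'c', 'b']
```

Levels 'a' and 'c' are pure (entropy 0) while 'b' is mixed (entropy 1), so 'b' sorts last.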

http://git-wip-us.apache.org/repos/asf/incubator-madlib/blob/206e1269/src/ports/postgres/modules/utilities/pivot.sql_in
----------------------------------------------------------------------
diff --git a/src/ports/postgres/modules/utilities/pivot.sql_in b/src/ports/postgres/modules/utilities/pivot.sql_in
index 7cdfbe0..4d239de 100644
--- a/src/ports/postgres/modules/utilities/pivot.sql_in
+++ b/src/ports/postgres/modules/utilities/pivot.sql_in
@@ -142,6 +142,8 @@ pivot(
     If the total number of output columns exceeds this limit, then make this
     parameter either 'array' (to combine the output columns into an array) or
     'svec' (to cast the array output to <em>'madlib.svec'</em> type).
+    Note that an 'aggregate_func' with an array return type cannot be
+    combined with 'output_type'='array' or 'svec'.
 
     A dictionary will be created (<em>output_col_dictionary=TRUE</em>)
     when 'output_type' is 'array' or 'svec' to define each index into the array.
@@ -364,7 +366,7 @@ val_avg_piv_30_piv2_300 |
 
 -# Use multiple pivot columns (same as above) with an array output:
 <pre class="example">
-DROP TABLE IF EXISTS pivout;
+DROP TABLE IF EXISTS pivout, pivout_dictionary;
 SELECT madlib.pivot('pivset_ext', 'pivout', 'id', 'piv, piv2', 'val',
                     NULL, NULL, FALSE, FALSE, 'array');
 \\x off
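The 'array' output type and its accompanying dictionary can be illustrated with a small plain-Python toy (hypothetical data, not the pivot implementation):

```python
# Toy pivot with array output: one row per id, the pivoted values
# packed into a fixed-order array, plus a separate dictionary mapping
# array index -> (piv, piv2) combination, analogous to the
# output_col_dictionary table.
from collections import defaultdict

rows = [  # (id, piv, piv2, val)
    (0, 10, 100, 1.0),
    (0, 20, 100, 2.0),
    (1, 10, 200, 3.0),
]
combos = sorted({(p, q) for _, p, q, _ in rows})   # column order
index = {c: i for i, c in enumerate(combos)}       # the "dictionary"
out = defaultdict(lambda: [None] * len(combos))
for id_, p, q, v in rows:
    out[id_][index[(p, q)]] = v
print(dict(out))   # {0: [1.0, None, 2.0], 1: [None, 3.0, None]}
```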
