This is an automated email from the ASF dual-hosted git repository.

fmcquillan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/madlib.git


The following commit(s) were added to refs/heads/master by this push:
     new e015e0f  user doc updates to multiple modules
e015e0f is described below

commit e015e0f5257cdf3cd37e601df51af155c23ca5a5
Author: Frank McQuillan <[email protected]>
AuthorDate: Fri May 17 13:38:44 2019 -0700

    user doc updates to multiple modules
---
 doc/mainpage.dox.in                                |  2 +-
 src/ports/postgres/modules/bayes/bayes.sql_in      |  5 ++---
 .../conjugate_gradient/conjugate_gradient.sql_in   |  5 ++---
 src/ports/postgres/modules/convex/mlp.sql_in       | 20 +++++++++++++++++
 .../deep_learning/input_data_preprocessor.sql_in   |  3 +++
 .../deep_learning/keras_model_arch_table.sql_in    | 25 ++++++++++++----------
 .../recursive_partitioning/decision_tree.sql_in    |  3 ++-
 src/ports/postgres/modules/sample/sample.sql_in    |  5 ++---
 src/ports/postgres/modules/svm/svm.sql_in          | 14 ++++++++++++
 9 files changed, 60 insertions(+), 22 deletions(-)

diff --git a/doc/mainpage.dox.in b/doc/mainpage.dox.in
index d874e5f..b63ee5d 100644
--- a/doc/mainpage.dox.in
+++ b/doc/mainpage.dox.in
@@ -290,7 +290,7 @@ Interface and implementation are subject to change.
     @brief A collection of modules for deep learning.
     @details A collection of modules for deep learning.
     @{
-        @defgroup grp_keras_model_arch Load Model Architecture
+        @defgroup grp_keras_model_arch Load Model
         @defgroup grp_input_preprocessor_dl Preprocessor for Images
     @}
     @defgroup grp_bayes Naive Bayes Classification
diff --git a/src/ports/postgres/modules/bayes/bayes.sql_in b/src/ports/postgres/modules/bayes/bayes.sql_in
index 40b71d2..9121cc1 100644
--- a/src/ports/postgres/modules/bayes/bayes.sql_in
+++ b/src/ports/postgres/modules/bayes/bayes.sql_in
@@ -32,9 +32,8 @@ m4_include(`SQLCommon.m4')
 independently contributes to the probability that a data point belongs to a
 category.
 
-\warning <em> This MADlib method is still in early stage development. There may be some
-issues that will be addressed in a future version. Interface and implementation
-is subject to change. </em>
+\warning <em> This MADlib method is still in early stage development.
+Interface and implementation are subject to change. </em>
 
 Naive Bayes refers to a stochastic model where all independent variables
 \f$ a_1, \dots, a_n \f$ (often referred to as attributes in this context)
diff --git a/src/ports/postgres/modules/conjugate_gradient/conjugate_gradient.sql_in b/src/ports/postgres/modules/conjugate_gradient/conjugate_gradient.sql_in
index 2dfafc5..0636314 100644
--- a/src/ports/postgres/modules/conjugate_gradient/conjugate_gradient.sql_in
+++ b/src/ports/postgres/modules/conjugate_gradient/conjugate_gradient.sql_in
@@ -22,9 +22,8 @@
 @brief Finds the solution to the function \f$ \boldsymbol Ax = \boldsymbol b \f$, where \f$A\f$
 is a symmetric, positive-definite matrix and \f$x\f$ and \f$ \boldsymbol b \f$ are vectors.
 
-\warning <em> This MADlib method is still in early stage development. There may be some
-issues that will be addressed in a future version. Interface and implementation
-is subject to change. </em>
+\warning <em> This MADlib method is still in early stage development.
+Interface and implementation are subject to change. </em>
 
 This function uses the iterative conjugate gradient method [1] to find a solution to the function: \f[ \boldsymbol Ax = \boldsymbol b \f]
 where \f$ \boldsymbol A \f$ is a symmetric, positive definite matrix and \f$x\f$ and \f$ \boldsymbol b \f$ are vectors.
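The conjugate gradient iteration documented above can be sketched in a few lines of pure Python (a minimal single-node illustration with a hypothetical helper, not MADlib's in-database implementation):

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for a symmetric positive-definite matrix A
    (given as a list of rows), starting from x = 0."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                       # residual r = b - A x (x starts at 0)
    p = r[:]                       # initial search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new ** 0.5 < tol:    # stop once the residual is small enough
            break
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x

# Symmetric positive-definite system with exact solution x = [1, 2]
A = [[4.0, 1.0], [1.0, 3.0]]
b = [6.0, 7.0]
x = conjugate_gradient(A, b)
```

For an n-by-n system, exact arithmetic converges in at most n iterations; in floating point the residual check ends the loop once the tolerance is met.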
diff --git a/src/ports/postgres/modules/convex/mlp.sql_in b/src/ports/postgres/modules/convex/mlp.sql_in
index 0d06c54..d6ce7ce 100644
--- a/src/ports/postgres/modules/convex/mlp.sql_in
+++ b/src/ports/postgres/modules/convex/mlp.sql_in
@@ -182,6 +182,18 @@ mlp_classification(
   <DT>verbose (optional)</DT>
  <DD>BOOLEAN, default: FALSE. Provides verbose output of the results of training,
   including the value of loss at each iteration.</DD>
+  @note
+    There are some subtleties in the reported per-iteration loss
+    values because training runs on a distributed system.
+    When mini-batching is used (i.e., batch gradient descent),
+    the loss per iteration is an average of losses across all mini-batches
+    and epochs on a segment.  Losses across all segments are then
+    averaged to give the overall loss of the model for the iteration.
+    This tends to be a pessimistic estimate of the loss.
+    When mini-batching is not used (i.e., stochastic gradient descent),
+    we use the model state from the previous iteration to compute the loss
+    at the start of the current iteration on the whole data set.  This
+    is an accurate computation of the loss for the iteration.
 
   <DT>grouping_col (optional)</DT>
   <DD>TEXT, default: NULL.
@@ -1376,6 +1388,14 @@ For an overview of multilayer perceptrons, see [1].
 
 For details on backpropagation, see [2].
 
+On the effect of database cluster size: as the database cluster
+size increases, the per-iteration loss will be higher, since the
+model only sees 1/n of the data, where n is the number of segments.
+However, each iteration runs faster than on a single node because it
+traverses only 1/n of the data.  For large data sets, all else being equal,
+a bigger cluster will achieve a given accuracy faster than a single node,
+although it may take more iterations to achieve that accuracy.
+
 @anchor literature
 @literature
 
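The two-level averaging of mini-batch losses described in the note above can be illustrated with a toy sketch (the loss values and segment names are hypothetical, not MADlib's code):

```python
# Hypothetical per-mini-batch losses recorded on each of two segments
segment_losses = {
    "seg0": [0.90, 0.70, 0.55],
    "seg1": [0.80, 0.60, 0.50],
}

# Each segment first averages its own mini-batch (and epoch) losses...
per_segment = {seg: sum(v) / len(v) for seg, v in segment_losses.items()}

# ...then the segment averages are averaged to give the reported
# loss of the model for the iteration.
reported_loss = sum(per_segment.values()) / len(per_segment)
```

Because the per-segment averages include losses from early mini-batches, before the model had improved, the reported value tends to overstate the loss of the final model state for that iteration.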
diff --git a/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in b/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in
index f2d9591..01936a3 100644
--- a/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in
+++ b/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in
@@ -32,6 +32,9 @@ m4_include(`SQLCommon.m4')
 @brief Utilities that prepare input image data for use by deep learning
 modules.
 
+\warning <em> This MADlib method is still in early stage development.
+Interface and implementation are subject to change. </em>
+
 <div class="toc"><b>Contents</b><ul>
 <li class="level1"><a href="#training_preprocessor_dl">Preprocessor for Training Image Data</a></li>
 <li class="level1"><a href="#validation_preprocessor_dl">Preprocessor for Validation Image Data</a></li>
diff --git a/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in b/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in
index 16037c2..45dcaa7 100644
--- a/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in
+++ b/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in
@@ -33,9 +33,12 @@ m4_include(`SQLCommon.m4')
 @brief Utility function to load model architectures and weights into a table for
 use by deep learning algorithms.
 
+\warning <em> This MADlib method is still in early stage development.
+Interface and implementation are subject to change. </em>
+
 <div class="toc"><b>Contents</b><ul>
-<li class="level1"><a href="#load_keras_model">Load Model Architecture</a></li>
-<li class="level1"><a href="#delete_keras_model">Delete Model Architecture</a></li>
+<li class="level1"><a href="#load_keras_model">Load Model</a></li>
+<li class="level1"><a href="#delete_keras_model">Delete Model</a></li>
 <li class="level1"><a href="#example">Examples</a></li>
 </ul></div>
 
@@ -45,13 +48,13 @@ Model architecture is in JSON form
 and model weights are in the form of double precision arrays.
 If the output table already exists, a new row is inserted
 into the table so it can act as a repository for multiple model
-architectures.
+architectures and weights.
 
-There is also a utility function to delete a model architecture
-from the model architecture table.
+There is also a utility function to delete a model
+from the table.
 
 @anchor load_keras_model
-@par Load Model Architecture
+@par Load Model
 
 <pre class="syntax">
 load_keras_model(
@@ -62,7 +65,7 @@ load_keras_model(
 \b Arguments
 <dl class="arglist">
   <dt>keras_model_arch_table</dt>
-  <dd>VARCHAR. Output table to load keras model architecture.
+  <dd>VARCHAR. Output table to load keras model architecture and weights.
   </dd>
 
   <dt>model_arch</dt>
@@ -98,7 +101,7 @@ load_keras_model(
 </br>
 
 @anchor delete_keras_model
-@par Delete Model Architecture
+@par Delete Model
 
 <pre class="syntax">
 delete_keras_model(
@@ -109,11 +112,11 @@ delete_keras_model(
 \b Arguments
 <dl class="arglist">
   <dt>keras_model_arch_table</dt>
-  <dd>VARCHAR. Table containing model architectures.
+  <dd>VARCHAR. Table containing model architectures and weights.
   </dd>
 
   <dt>model_id</dt>
-  <dd>INTEGER. The id of the model architecture to be deleted.
+  <dd>INTEGER. The id of the model to be deleted.
   </dd>
 </dl>
 
@@ -148,7 +151,7 @@ null, "dtype": "float32", "activation": "linear", "trainable": true,
 "units": 10, "use_bias": true, "activity_regularizer": null}}],
 "backend": "tensorflow"}'
 </pre>
--#  Load the model into the model architecture table:
+-#  Load the model into the model table:
 <pre class="example">
 DROP TABLE IF EXISTS model_arch_library;
 SELECT madlib.load_keras_model('model_arch_library',   -- Output table
diff --git a/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in b/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in
index bf1c883..2408770 100644
--- a/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in
+++ b/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in
@@ -164,7 +164,8 @@ tree_train(
 
   <DT>num_splits (optional)</DT>
   <DD>INTEGER, default: 20. Continuous-valued features are binned into
-      discrete quantiles to compute split boundaries. This global parameter
+      discrete quantiles to compute split boundaries. Uniform binning
+      is used.  This global parameter
       is used to compute the resolution of splits for continuous features.
       Higher number of bins will lead to better prediction,
      but will also result in longer processing time and higher memory usage.</DD>
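The uniform binning that a num_splits-style parameter controls can be sketched as evenly spaced candidate boundaries over a feature's range (a hypothetical helper for illustration; MADlib's actual boundary computation may differ):

```python
def uniform_split_boundaries(values, num_splits):
    """Evenly spaced candidate split boundaries over the range of a
    continuous feature: num_splits boundaries define num_splits + 1 bins."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (num_splits + 1)
    return [lo + step * (i + 1) for i in range(num_splits)]

feature = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
boundaries = uniform_split_boundaries(feature, num_splits=4)
```

A larger num_splits gives finer-grained candidate boundaries, hence potentially better splits, at the cost of more computation and memory per feature.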
diff --git a/src/ports/postgres/modules/sample/sample.sql_in b/src/ports/postgres/modules/sample/sample.sql_in
index 227d7ac..8f8a56f 100644
--- a/src/ports/postgres/modules/sample/sample.sql_in
+++ b/src/ports/postgres/modules/sample/sample.sql_in
@@ -23,9 +23,8 @@ m4_include(`SQLCommon.m4')
 
 @brief Provides utility functions for sampling operations.
 
-\warning <em> This MADlib method is still in early stage development. There may be some
-issues that will be addressed in a future version. Interface and implementation
-is subject to change. </em>
+\warning <em> This MADlib method is still in early stage development.
+Interface and implementation are subject to change. </em>
 
 The random sampling module consists of useful utility functions for sampling
 operations. These functions can be used while implementing
diff --git a/src/ports/postgres/modules/svm/svm.sql_in b/src/ports/postgres/modules/svm/svm.sql_in
index a55fd5f..2320179 100644
--- a/src/ports/postgres/modules/svm/svm.sql_in
+++ b/src/ports/postgres/modules/svm/svm.sql_in
@@ -322,6 +322,20 @@ is the intercept.
 <DD>Default: 2*num_features. The dimensionality of the transformed feature space.
 A larger value lowers the variance of the estimate of the kernel but requires
 more memory and takes longer to train.</DD>
+@note
+Setting the \e n_components kernel parameter properly is important
+for generating an accurate decision boundary.  This parameter
+is the dimensionality of the transformed feature space that arises
+from using the primal formulation.  MADlib uses the primal
+formulation because it is implemented on a distributed system,
+unlike R and other single-node implementations
+that can use the dual formulation.  The primal approach
+approximates the kernel function using random
+feature maps, so in the case of a Gaussian kernel, the
+dimensionality of the transformed feature space is not
+infinite (as in the dual), but rather of size \e n_components.
+Try increasing \e n_components above the default if you are
+not getting an accurate decision boundary.
 <DT>random_state</DT>
 <DD>Default: 1. Seed used by a random number generator. </DD>
 </DL>
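The random-feature-map idea behind the n_components note above can be sketched as follows (a self-contained illustration of random Fourier features for a Gaussian kernel, not MADlib's implementation; the helper name and toy points are made up):

```python
import math
import random

def make_rff_map(dim, n_components, gamma=1.0, seed=1):
    """Random feature map z(x) whose dot products approximate the
    Gaussian kernel exp(-gamma * ||x - y||^2); accuracy improves
    (variance drops) as n_components grows."""
    rng = random.Random(seed)
    # Frequencies drawn from N(0, 2*gamma) per coordinate, plus random phases.
    ws = [[rng.gauss(0.0, math.sqrt(2.0 * gamma)) for _ in range(dim)]
          for _ in range(n_components)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_components)]
    scale = math.sqrt(2.0 / n_components)
    def z(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(ws, bs)]
    return z

x, y = [0.0, 0.0], [0.5, 0.5]
exact = math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)))  # true kernel value

z = make_rff_map(dim=2, n_components=1000)
approx = sum(zx * zy for zx, zy in zip(z(x), z(y)))  # finite-dim estimate
```

The dot product of the two 1000-dimensional feature vectors approximates the exact kernel value; a larger n_components lowers the variance of the estimate, which is why raising it can sharpen the decision boundary at the cost of memory and training time.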
