This is an automated email from the ASF dual-hosted git repository.

fmcquillan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/madlib.git
The following commit(s) were added to refs/heads/master by this push:
     new ec5614f  misc user doc updates for 1dot17
ec5614f is described below

commit ec5614fe34fc4e410ac226a60985051fc166ee8e
Author: Frank McQuillan <fmcquil...@pivotal.io>
AuthorDate: Tue Dec 17 12:38:01 2019 -0800

    misc user doc updates for 1dot17
---
 doc/mainpage.dox.in                                |  6 +--
 .../deep_learning/input_data_preprocessor.sql_in   |  4 +-
 .../deep_learning/keras_model_arch_table.sql_in    |  9 ++--
 .../modules/deep_learning/madlib_keras.sql_in      | 57 +++++++++++++++-------
 .../madlib_keras_fit_multiple_model.sql_in         | 28 ++++++-----
 src/ports/postgres/modules/knn/knn.sql_in          |  4 ++
 6 files changed, 69 insertions(+), 39 deletions(-)

diff --git a/doc/mainpage.dox.in b/doc/mainpage.dox.in
index 0e7b426..82be4a5 100644
--- a/doc/mainpage.dox.in
+++ b/doc/mainpage.dox.in
@@ -292,9 +292,9 @@ Interface and implementation are subject to change.
     @defgroup grp_gpu_configuration GPU Configuration
     @defgroup grp_keras Keras
     @defgroup grp_keras_model_arch Load Models
-    @defgroup grp_model_selection Model Selection
-    @brief Train multiple deep learning models at the same time.
-    @details Train multiple deep learning models at the same time.
+    @defgroup grp_model_selection Model Selection for DL
+    @brief Train multiple deep learning models at the same time for model architecture search and hyperparameter selection.
+    @details Train multiple deep learning models at the same time for model architecture search and hyperparameter selection.
    @{
        @defgroup grp_automl AutoML
        @defgroup grp_keras_run_model_selection Run Model Selection

diff --git a/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in b/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in
index ddc356f..f243417 100644
--- a/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in
+++ b/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in
@@ -853,7 +853,9 @@ Geoffrey Hinton with Nitish Srivastava and Kevin Swersky, http://www.cs.toronto.
 
 @anchor related
 @par Related Topics
 
-minibatch_preprocessing.sql_in
+training_preprocessor_dl()
+
+validation_preprocessor_dl()
 
 gpu_configuration()

diff --git a/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in b/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in
index b1bf150..cc915bb 100644
--- a/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in
+++ b/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in
@@ -275,11 +275,10 @@ SELECT COUNT(*) FROM model_arch_library WHERE model_weights IS NOT NULL;
 -------+
      1
 </pre>
-Load weights from Keras using psycopg2.
-(Psycopg is a PostgreSQL database adapter for the
-Python programming language.) As above we need to
-flatten then serialize the weights to store as a
-PostgreSQL binary data type.
+Load weights from Keras using psycopg2. (Psycopg is a PostgreSQL database adapter for the
+Python programming language.) As above we need to flatten then serialize the weights to store as a
+PostgreSQL binary data type. Note that the psycopg2.Binary function used below will increase the size of the
+Python object for the weights, so if your model is large it might be better to use a PL/Python function as above.
 <pre class="example">
 import psycopg2
 import psycopg2 as p2

diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in b/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in
index 6127031..0a395e8 100644
--- a/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in
+++ b/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in
@@ -737,7 +737,12 @@ madlib_keras_predict_byom(
 <DT>class_values (optional)</DT>
 <DD>TEXT[], default: NULL.
   List of class labels that were used while training the model. See the 'output_table'
-  column for more details.
+  column above for more details.
+
+  @note
+  If you specify the class values parameter,
+  it must reflect how the dependent variable was 1-hot encoded for training. If you accidentally
+  pick another order that does not match the 1-hot encoding, the predictions will be wrong.
 </DD>
 
 <DT>normalizing_const (optional)</DT>
@@ -1166,7 +1171,7 @@ WHERE iris_predict.estimated_class_text != iris_test.class_text;
      6
 (1 row)
 </pre>
-Percent missclassifications:
+Accuracy:
 <pre class="example">
 SELECT round(count(*)*100/(150*0.2),2) as test_accuracy_percent from
     (select iris_test.class_text as actual, iris_predict.estimated_class_text as estimated
@@ -1188,10 +1193,18 @@ syntax. See <a href="group__grp__keras__model__arch.html">load_keras_model</a>
 for details on how to load the model architecture and weights.
 In this example we will use weights we already have:
 <pre class="example">
-UPDATE model_arch_library SET model_weights = model_weights FROM iris_model WHERE model_id = 1;
+UPDATE model_arch_library
+SET model_weights = iris_model.model_weights
+FROM iris_model
+WHERE model_arch_library.model_id = 1;
 </pre>
 Now train using a model from the model architecture table directly
-without referencing the model table from the MADlib training:
+without referencing the model table from the MADlib training. Note that if you
+specify the class values parameter as we do below, it must reflect how the dependent
+variable was 1-hot encoded for training. In this example the 'training_preprocessor_dl()'
+in Step 2 above encoded in the order {'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'} so
+this is the order we pass in the parameter. If we accidentally pick another order that does
+not match the 1-hot encoding, the predictions will be wrong.
 <pre class="example">
 DROP TABLE IF EXISTS iris_predict_byom;
 SELECT madlib.madlib_keras_predict_byom('model_arch_library',  -- model arch table
@@ -1254,7 +1267,7 @@ WHERE iris_predict_byom.estimated_dependent_var != iris_test.class_text;
      6
 (1 row)
 </pre>
-Percent missclassifications:
+Accuracy:
 <pre class="example">
 SELECT round(count(*)*100/(150*0.2),2) as test_accuracy_percent from
     (select iris_test.class_text as actual, iris_predict_byom.estimated_dependent_var as estimated
@@ -1495,7 +1508,10 @@ Fetch the weights from a previous MADlib run. (Normally these would be
 downloaded from a source that trained the same model architecture
 on a related dataset.)
 <pre class="example">
-UPDATE model_arch_library SET model_weights = model_weights FROM iris_model WHERE model_id = 2;
+UPDATE model_arch_library
+SET model_weights = iris_model.model_weights
+FROM iris_model
+WHERE model_arch_library.model_id = 2;
 </pre>
 Now train the model using the transfer model and the pre-trained weights:
 <pre class="example">
@@ -1556,23 +1572,26 @@ and versions.
 
 2. Classification is currently supported, not regression.
 
-3. On the effect of database cluster size: as the database cluster
-size increases, the per iteration loss will be higher since the
-model only sees 1/n of the data, where n is the number of segments.
-However, each iteration runs faster than single node because it is only
-traversing 1/n of the data. For large data sets, all else being equal,
-a bigger cluster will achieve a given accuracy faster than a single node
-although it may take more iterations to achieve that accuracy.
-However, for highly non-convex solution spaces, convergence behavior
-may diminish as cluster size increases. Ensure that each segment has
-sufficient volume of data and examples of each class value.
-
 @anchor background
 @par Technical Background
 
 For an introduction to deep learning foundations, including MLP and CNN,
 refer to [6].
 
+This module trains a single large model across the database cluster
+using the bulk synchronous parallel (BSP) approach, with model averaging [7].
+
+On the effect of database cluster size: as the database cluster size increases, the per iteration
+loss will be higher since the model only sees 1/n of the data, where n is the number of segments.
+However, each iteration runs faster than single node because it is only traversing 1/n of the data.
+For highly non-convex solution spaces, convergence behavior may diminish as cluster size increases.
+Ensure that each segment has sufficient volume of data and examples of each class value.
+
+Alternatively, to train multiple models at the same time for model
+architecture search or hyperparameter tuning, you can
+use <a href="group__grp__keras__run__model__selection.html">Model Selection</a>,
+which does not do model averaging and hence may have better convergence efficiency.
+
 @anchor literature
 @literature
 
@@ -1591,6 +1610,10 @@ http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
 [6] Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville,
 MIT Press, 2016.
 
+[7] "Resource-Efficient and Reproducible Model Selection on Deep Learning Systems," Supun Nakandala,
+Yuhao Zhang, and Arun Kumar, Technical Report, Computer Science and Engineering, University of California,
+San Diego, https://adalabucsd.github.io/papers/TR_2019_Cerebro.pdf.
+
 @anchor related
 @par Related Topics

diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in
index 1ddbd18..c0a68b3 100644
--- a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in
+++ b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in
@@ -1314,24 +1314,26 @@ and versions.
 
 2. Classification is currently supported, not regression.
 
-3. On the effect of database cluster size: as the database cluster
-size increases, it will be proportionally faster to train a set of
-models, as long as you have at least as many model selection tuples
-as segments. This is because model state is "hopped" from
-segment to segment and training takes place in parallel. See [1,2]
-for details on how model hopping works. If you have fewer model selection
-tuples to train than segments, then some segments may not be busy 100%
-of the time so speedup will not necessarily be linear with database
-cluster size. Inference (predict) is an embarrassingly parallel
-operation so inference runtimes will be proportionally faster as the number
-of segments increases.
-
 @anchor background
 @par Technical Background
 
 For an introduction to deep learning foundations, including MLP and CNN,
 refer to [7].
 
+This module trains many models at a time across the database cluster in order
+to explore network architectures and hyperparameters. It uses model hopper
+parallelism (MOP) and has high convergence efficiency since it does not do
+model averaging [2].
+
+On the effect of database cluster size: as the database cluster size increases,
+it will be proportionally faster to train a set of models, as long as you have at
+least as many model selection tuples as segments. This is because model state is "hopped" from
+segment to segment and training takes place in parallel [1,2]. If you have fewer model
+selection tuples to train than segments, then some
+segments may not be busy 100% of the time so speedup will not necessarily be linear with
+database cluster size. Inference (predict) is an embarrassingly parallel operation so
+inference runtimes will be proportionally faster as the number of segments increases.
+
 @anchor literature
 @literature
 
@@ -1340,7 +1342,7 @@ refer to [7].
 Supun Nakandala, Yuhao Zhang, and Arun Kumar, ACM SIGMOD 2019 DEEM Workshop,
 https://adalabucsd.github.io/papers/2019_Cerebro_DEEM.pdf
 
-[2] Resource-Efficient and Reproducible Model Selection on Deep Learning Systems,"
+[2] "Resource-Efficient and Reproducible Model Selection on Deep Learning Systems,"
 Supun Nakandala, Yuhao Zhang, and Arun Kumar, Technical Report, Computer Science and
 Engineering, University of California, San Diego https://adalabucsd.github.io/papers/TR_2019_Cerebro.pdf

diff --git a/src/ports/postgres/modules/knn/knn.sql_in b/src/ports/postgres/modules/knn/knn.sql_in
index daeddc8..22822ed 100644
--- a/src/ports/postgres/modules/knn/knn.sql_in
+++ b/src/ports/postgres/modules/knn/knn.sql_in
@@ -121,6 +121,10 @@ in a column of type <tt>DOUBLE PRECISION[]</tt>.
 <dd>TEXT. Name of the column with testing data points
 or expression that evaluates to a numeric array</dd>
 
+@note
+For unsupervised nearest neighbors, make the test dataset the same as the source dataset,
+so the nearest neighbor of each point is the point itself, with a zero distance.
+
 <dt>test_id</dt>
 <dd>TEXT. Name of the column having ids of data points in test data table.</dd>
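The keras_model_arch_table change above mentions flattening then serializing Keras weights before storing them as a PostgreSQL binary value via psycopg2.Binary. As a rough illustration of that step only, here is a minimal sketch; the `weights` list is a hypothetical stand-in for `model.get_weights()`, the float32 layout is an assumption matching the doc's examples, and the commented-out UPDATE mirrors the table/column names used in the diff rather than a tested call:

```python
import numpy as np

# Hypothetical weights standing in for Keras model.get_weights():
# a list of per-layer float arrays (here, a 2x3 kernel and a length-3 bias).
weights = [np.ones((2, 3), dtype=np.float32), np.zeros((3,), dtype=np.float32)]

# Flatten each layer and concatenate into one 1-D float32 array, then
# serialize to raw bytes suitable for a PostgreSQL bytea column.
flattened = np.concatenate([w.flatten() for w in weights]).astype(np.float32)
serialized = flattened.tobytes()

# With psycopg2 (not executed here), the bytes could then be stored, e.g.:
# cur.execute("UPDATE model_arch_library SET model_weights = %s WHERE model_id = %s",
#             (psycopg2.Binary(serialized), 1))

print(len(serialized))  # 9 float32 values * 4 bytes each = 36
```

As the doc change notes, psycopg2.Binary holds an extra copy of the byte string in the client, so for very large models the PL/Python route shown earlier in that page may be preferable.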