This is an automated email from the ASF dual-hosted git repository.
fmcquillan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/madlib.git
The following commit(s) were added to refs/heads/master by this push:
new 63f40e7 updated DL preprocessor docs for bytea (#445)
63f40e7 is described below
commit 63f40e70f8dbb6c9ed2b1b91c847fd3819b1a627
Author: Frank McQuillan
AuthorDate: Tue Oct 1 13:52:40 2019 -0700
updated DL preprocessor docs for bytea (#445)
* updated DL preprocessor docs for bytea
* address review comments
---
.../deep_learning/input_data_preprocessor.sql_in | 210 ++---
1 file changed, 98 insertions(+), 112 deletions(-)
diff --git a/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in b/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in
index a3f4281..8d70431 100644
--- a/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in
+++ b/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in
@@ -18,7 +18,7 @@
* under the License.
*
* @file input_preprocessor_dl.sql_in
- * @brief TODO
+ * @brief Utilities to prepare input image data for use by deep learning modules.
* @date December 2018
*
*/
@@ -86,9 +86,10 @@ training_preprocessor_dl(source_table,
TEXT. Name of the output table from the training preprocessor which
will be used as input to algorithms that support mini-batching.
Note that the arrays packed into the output table are shuffled
- and normalized (by dividing each element in the independent variable array
- by the optional 'normalizing_const' parameter), so they will not match
- up in an obvious way with the rows in the source table.
+ and normalized, by dividing each element in the independent variable array
+ by the optional 'normalizing_const' parameter. For performance reasons,
+ packed arrays are converted to PostgreSQL bytea format, which is a
+ variable-length binary string.
In the case a validation data set is used (see
later on this page), this output table is also used
@@ -158,11 +159,15 @@ validation_preprocessor_dl(source_table,
output_table
TEXT. Name of the output table from the validation
- preprocessor which will be used as input to algorithms that support mini-batching. The arrays packed into the output table are
+ preprocessor which will be used as input to algorithms that support mini-batching.
+ The arrays packed into the output table are
normalized using the same normalizing constant from the
training preprocessor as specified in
the 'training_preprocessor_table' parameter described below.
Validation data is not shuffled.
+ For performance reasons,
+ packed arrays are converted to PostgreSQL bytea format, which is a
+ variable-length binary string.
dependent_varname
@@ -209,25 +214,43 @@ validation_preprocessor_dl(source_table,
validation_preprocessor_dl() contain the following columns:
-buffer_id
-INTEGER. Unique id for each row in the packed table.
+independent_var
+BYTEA. Packed array of independent variables in PostgreSQL bytea format.
+Arrays of independent variables packed into the output table are
+normalized by dividing each element in the independent variable array by the
+optional 'normalizing_const' parameter. Training data is shuffled, but
+validation data is not.
dependent_var
-ANYARRAY[]. Packed array of dependent variables.
+BYTEA. Packed array of dependent variables in PostgreSQL bytea format.
The dependent variable is always one-hot encoded as an
-INTEGER[] array. For now, we are assuming that
+integer array. For now, we are assuming that
input_preprocessor_dl() will be used
only for classification problems using deep learning. So
the dependent variable is one-hot encoded, unless it's already a
numeric array in which case we assume it's already one-hot
-encoded and just cast it to an INTEGER[] array.
+encoded and just cast it to an integer array.
-independent_var
-REAL[]. Packed array of independent variables.
+independent_var_shape
+INTEGER[]. Shape of the independent variable array after preprocessing.
+The first element is the number of images packed per row, and subsequent
+elements will depend on how the image is described (e.g., channels first or last).
+
+
+
+dependent_var_shape
+INTEGER[]. Shape of the dependent variable array after preprocessing.
+The first element is the number of images packed per row, and the second
+element is the number of class values.
+
+
+
+buffer_id
+INTEGER. Unique id for each row in the packed table.
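
The columns described above pair a flat binary string with a separate shape array, so a consumer has to combine the two to recover the images. The sketch below illustrates that idea in plain Python and is not MADlib's implementation. The helper names (pack_images, unpack_images, reshape), the little-endian float32 layout, and the channels-last image layout are all assumptions made for illustration; the commit does not state the byte layout MADlib uses.

```python
import struct


def pack_images(images, normalizing_const=255.0):
    """Normalize equally shaped images and flatten them into float32 bytes.

    images: list of images, each a nested list [height][width][channels]
    (channels-last is an assumption of this sketch).
    Returns (packed_bytes, shape); shape[0] is the number of images packed.
    """
    shape = [len(images), len(images[0]),
             len(images[0][0]), len(images[0][0][0])]
    flat = [value / normalizing_const
            for image in images for row in image
            for pixel in row for value in pixel]
    # Hypothetical layout: little-endian 4-byte floats, row-major order.
    return struct.pack("<%df" % len(flat), *flat), shape


def reshape(flat, shape):
    """Rebuild nested lists of the given shape from a flat sequence."""
    if len(shape) == 1:
        return list(flat)
    step = len(flat) // shape[0]
    return [reshape(flat[i * step:(i + 1) * step], shape[1:])
            for i in range(shape[0])]


def unpack_images(packed, shape):
    """Decode the binary string back into nested lists using the shape."""
    count = 1
    for dim in shape:
        count *= dim
    return reshape(struct.unpack("<%df" % count, packed), shape)


# Two images, each 1 x 1 pixels with 2 channels.
images = [[[[0, 255]]], [[[255, 0]]]]
packed, shape = pack_images(images)
restored = unpack_images(packed, shape)
```

The binary string alone carries no dimensions, which is why a shape column such as independent_var_shape has to accompany it: the same 16 bytes could equally be read as one image with four values or as two images with two values each.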