[ 
https://issues.apache.org/jira/browse/MADLIB-413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100554#comment-16100554
 ] 

Frank McQuillan edited comment on MADLIB-413 at 7/25/17 8:11 PM:
-----------------------------------------------------------------

Re-opening this issue based on playing around with it a bit.

Notes on phase 1 of the NN work:

(2)
If I select all defaults
{code}
SELECT madlib.mlp_classification(
    'iris_data',      -- Source table
    'mlp_model',      -- Destination table
    'attributes',     -- Input features
    'class_text'     -- Label
);
{code}
I get an error
{code}
(psycopg2.ProgrammingError) function madlib.mlp_classification(unknown, 
unknown, unknown, unknown) does not exist
LINE 1: SELECT madlib.mlp_classification(
               ^
HINT:  No function matches the given name and argument types. You may need to 
add explicit type casts.
 [SQL: "SELECT madlib.mlp_classification(\n    'iris_data',      -- Source 
table\n    'mlp_model',      -- Destination table\n    'attributes',     -- 
Input features\n    'class_text'     -- Label\n);"]
{code}
Same with 
{code}
SELECT madlib.mlp_classification(
    'iris_data',      -- Source table
    'mlp_model',      -- Destination table
    'attributes',     -- Input features
    'class_text',     -- Label
    ARRAY[5]          -- Number of units per layer
);
{code}
I get an error
{code}
(psycopg2.ProgrammingError) function madlib.mlp_classification(unknown, 
unknown, unknown, unknown, integer[]) does not exist
LINE 1: SELECT madlib.mlp_classification(
               ^
HINT:  No function matches the given name and argument types. You may need to 
add explicit type casts.
 [SQL: "SELECT madlib.mlp_classification(\n    'iris_data',      -- Source 
table\n    'mlp_model',      -- Destination table\n    'attributes',     -- 
Input features\n    'class_text',     -- Label\n    ARRAY[5]         -- Number 
of units per layer\n);"]
{code}
Same with 
{code}
SELECT madlib.mlp_classification(
    'iris_data',      -- Source table
    'mlp_model',      -- Destination table
    'attributes',     -- Input features
    'class_text',     -- Label
    NULL,          -- Number of units per layer
    NULL,
    NULL
);
{code}
I get an error
{code}
InternalError: (psycopg2.InternalError) plpy.Error: hidden_layer_sizes may not 
be null (plpython.c:4648)
CONTEXT:  Traceback (most recent call last):
  PL/Python function "mlp_classification", line 32, in <module>
    True
  PL/Python function "mlp_classification", line 66, in mlp
  PL/Python function "mlp_classification", line 277, in _validate_args
  PL/Python function "mlp_classification", line 63, in _assert
PL/Python function "mlp_classification"
 [SQL: "SELECT madlib.mlp_classification(\n    'iris_data',      -- Source 
table\n    'mlp_model',      -- Destination table\n    'attributes',     -- 
Input features\n    'class_text',     -- Label\n    NULL,          -- Number of 
units per layer\n    NULL,\n    NULL\n);"]
{code}
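
A possible workaround sketch (an assumption on my part, based on the errors above, which suggest only the full signature is registered): supply every argument explicitly instead of relying on defaults.
{code}
SELECT madlib.mlp_classification(
    'iris_data',      -- Source table
    'mlp_model',      -- Destination table
    'attributes',     -- Input features
    'class_text',     -- Label
    ARRAY[5],         -- Number of units per layer
    'step_size=0.001,
    n_iterations=100,
    tolerance=0.001', -- Optimizer params
    'sigmoid');       -- Activation function
{code}
Either way, the shorter forms above should work, so the default overloads need to be registered.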


(3)
If I use the defaults
{code}
    'step_size=0.001,
    n_iterations=100,
    tolerance=0.001',     -- Optimizer params
    'sigmoid');          -- Activation function
{code}
on the full Iris data set (150 rows) I get 100 misclassifications with 1 hidden 
layer/5 units, when I use the same data for prediction that I used for training. 
Are these the right defaults?  I suspect many people will just take the default 
values (at least initially), so we need to ensure the defaults are reasonable.


(4)
Info/debug statements being written to the console
{code}
INFO:  loss: 0.636514
INFO:  loss: 0.636516
INFO:  loss: 0.636521
INFO:  loss: 0.636528
INFO:  loss: 0.636535
INFO:  loss: 0.636543
INFO:  loss: 0.636551
INFO:  loss: 0.636558
INFO:  loss: 0.636565
{code}
Please add a `verbose` flag at the end, in the same way as we do for 
http://madlib.incubator.apache.org/docs/latest/group__grp__glm.html
to control console output of the loss function.  What other info could we output 
to help with tuning?
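
For reference, a sketch of what the call might look like with the proposed flag (the trailing `verbose` parameter is hypothetical here, following the GLM convention):
{code}
SELECT madlib.mlp_classification(
    'iris_data',      -- Source table
    'mlp_model',      -- Destination table
    'attributes',     -- Input features
    'class_text',     -- Label
    ARRAY[5],         -- Number of units per layer
    'step_size=0.001,
    n_iterations=100,
    tolerance=0.001', -- Optimizer params
    'sigmoid',        -- Activation function
    FALSE);           -- verbose: suppress per-iteration loss INFO output
{code}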


(5)
I can't get anything with more than 1 hidden layer to converge for 
classification.  E.g., with 1 hidden layer/5 units I get a loss of 0.027 and 
1 misclassification on the full Iris data set.
{code}   
SELECT madlib.mlp_classification(
    'iris_data',      -- Source table
    'mlp_model',      -- Destination table
    'attributes',     -- Input features
    'class_text',     -- Label
    ARRAY[5],         -- Number of units per layer
    'step_size=0.003,
    n_iterations=5000,
    tolerance=0',     -- Optimizer params
    'tanh');          -- Activation function
{code}
With 2 hidden layers/5 units each I get a loss of 0.637 and 100 
misclassifications on the Iris data set.  In fact, the loss starts at 0.6 and 
never decreases with each iteration.  That does not seem to be proper behavior.
{code}
SELECT madlib.mlp_classification(
    'iris_data',      -- Source table
    'mlp_model',      -- Destination table
    'attributes',     -- Input features
    'class_text',     -- Label
    ARRAY[5,5],         -- Number of units per layer
    'step_size=0.003,
    n_iterations=5000,
    tolerance=0',     -- Optimizer params
    'tanh');          -- Activation function
{code}
Tried reducing the step size, but that did not help.
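
One more thing worth trying (a sketch only; `n_tries` is listed in the interface spec for this module, so this assumes it is wired up): multiple random restarts, in case the flat loss is just a bad initialization that gradient descent cannot escape.
{code}
SELECT madlib.mlp_classification(
    'iris_data',      -- Source table
    'mlp_model',      -- Destination table
    'attributes',     -- Input features
    'class_text',     -- Label
    ARRAY[5,5],       -- Number of units per layer
    'step_size=0.003,
    n_iterations=5000,
    n_tries=5,
    tolerance=0',     -- Optimizer params
    'tanh');          -- Activation function
{code}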



(8)
There are some user doc formatting issues:
* The Contents box at the top right of the page says “Prediction Functions/a> 
Examples” (a broken </a> tag).
* Examples 8, 9, and those below are mis-formatted.

(10)
Please state clearly in the docs that one-hot encoding is required for 
categorical variables.  Use a `Note`, maybe after the `source_table` argument 
description.  Also point to the MADlib function for this:
http://madlib.incubator.apache.org/docs/latest/group__grp__encode__categorical.html
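
To make the requirement concrete, the docs could include a minimal plain-SQL illustration like this (the `iris_extra` table and its `region` column are hypothetical; the linked MADlib encoding function automates the same transformation):
{code}
SELECT *,
       (region = 'north')::INTEGER AS region_north,
       (region = 'south')::INTEGER AS region_south,
       (region = 'east')::INTEGER  AS region_east,
       (region = 'west')::INTEGER  AS region_west
FROM iris_extra;
{code}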

(11)
In the user docs let's call this module `Neural Network`, as in the attached 
picture.  The first line could then be:
`Multilayer Perceptron (MLP) is a type of neural network that can be used for 
regression and classification.`
That is the only change needed, I think.  We can continue to call the functions 
`mlp_classification()` etc. as they are now.






> Neural Networks - MLP - Phase 1
> -------------------------------
>
>                 Key: MADLIB-413
>                 URL: https://issues.apache.org/jira/browse/MADLIB-413
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Neural Networks
>            Reporter: Caleb Welton
>            Assignee: Cooper Sloan
>             Fix For: v1.12
>
>
> Multilayer perceptron with backpropagation
> Modules:
> * mlp_classification
> * mlp_regression
> Interface
> {code}
> source_table VARCHAR
> output_table VARCHAR
> independent_varname VARCHAR -- Column name for input features, should be a 
> Real Valued array
> dependent_varname VARCHAR, -- Column name for target values, should be Real 
> Valued array of size 1 or greater
> hidden_layer_sizes INTEGER[], -- Number of units per hidden layer (can be 
> empty or null, in which case, no hidden layers)
> optimizer_params VARCHAR, -- Specified below
> weights VARCHAR, -- Column name for weights. Weights the loss for each input 
> vector. Column should contain positive real value
> activation_function VARCHAR, -- One of 'sigmoid' (default), 'tanh', 'relu', 
> or any prefix (eg. 't', 's')
> grouping_cols
> )
> {code}
> where
> {code}
> optimizer_params: -- eg "step_size=0.5, n_tries=5"
> {
> step_size DOUBLE PRECISION, -- Learning rate
> n_iterations INTEGER, -- Number of iterations per try
> n_tries INTEGER, -- Total number of training cycles, with random 
> initializations to avoid local minima.
> tolerance DOUBLE PRECISION, -- Maximum distance between weights before 
> training stops (or until it reaches n_iterations)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
