[
https://issues.apache.org/jira/browse/MADLIB-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Frank McQuillan updated MADLIB-1226:
------------------------------------
Description:
I was testing the MNIST dataset with the minibatch preprocessor + MLP and could not get
it to converge. It turned out to be user error (mine), not a convergence problem at
all: I forgot to 1-hot encode the dependent variable.
But I am wondering whether other people might make the same mistake and get
confused.
Here's what I did. For this input data:
{code}
madlib=# \d+ public.mnist_train
                                      Table "public.mnist_train"
 Column |   Type    |                         Modifiers                         | Storage  | Stats target | Description
--------+-----------+------------------------------------------------------------+----------+--------------+-------------
 y      | integer   |                                                            | plain    |              |
 x      | integer[] |                                                            | extended |              |
 id     | integer   | not null default nextval('mnist_train_id_seq'::regclass)  | plain    |              |
{code}
I called the minibatch preprocessor:
{code}
SELECT madlib.minibatch_preprocessor('mnist_train',         -- Source table
                                     'mnist_train_packed',  -- Output table
                                     'y',                   -- Dependent variable
                                     'x'                    -- Independent variables
                                     );
{code}
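For context, the packed output table is expected to expose the dependent and independent
data under standard column names, which is why the MLP call below passes the literal
strings 'independent_varname' and 'dependent_varname'. That naming is my reading of the
preprocessor's output, not something verified here; a quick way to confirm what the
preprocessor actually produced:
{code}
-- Sketch: list the columns the preprocessor created in the packed table.
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'mnist_train_packed';
{code}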
then MLP:
{code}
SELECT madlib.mlp_classification(
    'mnist_train_packed',     -- Source table from preprocessor output
    'mnist_result',           -- Destination table
    'independent_varname',    -- Independent variable (packed column name)
    'dependent_varname',      -- Dependent variable (packed column name)
    ARRAY[5],                 -- Hidden layer sizes
    'learning_rate_init=0.01,
     n_iterations=20,
     learning_rate_policy=exp,
     n_epochs=20,
     lambda=0.0001,
     tolerance=0',            -- Optimizer params (lambda = regularization)
    'tanh',                   -- Activation function
    '',                       -- No weights
    FALSE,                    -- No warmstart
    TRUE);                    -- Verbose
{code}
with the result:
{code}
INFO: Iteration: 2, Loss: <-79.5295531257>
INFO: Iteration: 3, Loss: <-79.529408892>
INFO: Iteration: 4, Loss: <-79.5291940436>
INFO: Iteration: 5, Loss: <-79.5288964944>
INFO: Iteration: 6, Loss: <-79.5285051451>
INFO: Iteration: 7, Loss: <-79.5280094708>
INFO: Iteration: 8, Loss: <-79.5273995189>
INFO: Iteration: 9, Loss: <-79.5266665607>
{code}
So it did not error out, but it clearly is not operating on data in the right format.
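What was missing was 1-hot encoding of y before preprocessing. A minimal sketch of that
step (assuming 10 MNIST digit classes; the staging table name mnist_train_1hot is just
for illustration):
{code}
-- Sketch only: encode the integer label y (0-9) as a 10-element 0/1 array
-- so that MLP classification sees a 1-hot dependent variable.
CREATE TABLE mnist_train_1hot AS
SELECT id,
       x,
       ARRAY(SELECT CASE WHEN y = d THEN 1 ELSE 0 END
             FROM generate_series(0, 9) AS d
             ORDER BY d) AS y_1hot
FROM mnist_train;
{code}
The preprocessor would then be pointed at mnist_train_1hot with 'y_1hot' as the
dependent variable.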
I suggest 2 changes:
1) Add an explicit parameter to the mini-batch preprocessor for 1-hot encoding of the
dependent variable (this JIRA; a sketch of what the call might look like follows below).
2) Add a check to the MLP classification code that the dependent variable has been
1-hot encoded, and error out if that is not the case.
(https://issues.apache.org/jira/browse/MADLIB-1226)
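A rough sketch of suggestion 1, purely for illustration (the flag name, its position in
the argument list, and the intermediate optional arguments shown here are assumptions,
not a final API):
{code}
-- Illustration only: 'one_hot_encode_int_dep_var' is a hypothetical flag name.
SELECT madlib.minibatch_preprocessor('mnist_train',         -- Source table
                                     'mnist_train_packed',  -- Output table
                                     'y',                   -- Dependent variable
                                     'x',                   -- Independent variables
                                     NULL,                  -- Grouping columns
                                     NULL,                  -- Buffer size (default)
                                     TRUE                   -- 1-hot encode integer dependent variable
                                     );
{code}
With a flag like this, an integer dependent variable such as the MNIST label could be
encoded inside the preprocessor, so users would not have to remember to do it by hand.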
> Add option for 1-hot encoding to minibatch preprocessor
> -------------------------------------------------------
>
> Key: MADLIB-1226
> URL: https://issues.apache.org/jira/browse/MADLIB-1226
> Project: Apache MADlib
> Issue Type: Improvement
> Components: Module: Utilities
> Reporter: Frank McQuillan
> Priority: Minor
> Fix For: v1.14
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)