[ 
https://issues.apache.org/jira/browse/MADLIB-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Domino Valdano closed MADLIB-1443.
----------------------------------

> Crash in fit_multiple when any model reaches loss=nan
> -----------------------------------------------------
>
>                 Key: MADLIB-1443
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1443
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Deep Learning
>            Reporter: Domino Valdano
>            Assignee: Domino Valdano
>            Priority: Minor
>              Labels: deeplearning
>             Fix For: v1.18.0
>
>
> {{madlib_keras_fit_multiple}} can crash when the loss for any model becomes
> nan (this probably affects fit as well, but I haven't tested it).  For
> example, with these compile parameters:
> {{$$loss='categorical_crossentropy',optimizer='SGD(lr=0.05, 
> momentum=1.1)',metrics=['accuracy']$$}}
> Clearly, this was not a great choice for the momentum hyperparameter, but
> Keras accepts it and trains all the way through with no errors or
> exceptions.  The problem is that the loss becomes infinite (or
> undefined?) at some point.  All 8 models trained for 10 hours, printed out
> the results, and then {{madlib_keras_fit_multiple}} crashed while trying to
> write out the final info table:
> Training set after iteration 1:
>  mst_key=7: metric=0.446168005466, loss=2.39643478394
>  mst_key=12: metric=0.00999999977648, loss=nan
>  mst_key=11: metric=0.165068000555, loss=4.0407166481
> ...
> Validation set after iteration 1:
>  mst_key=7: metric=0.359100013971, loss=2.89618015289
>  mst_key=12: metric=0.00999999977648, loss=nan
>  mst_key=11: metric=0.151299998164, loss=4.0829615593
> ...
> CONTEXT: PL/Python function "madlib_keras_fit_multiple_model"
>  psql:run_fit_mult100.sql:14:
> ERROR: spiexceptions.UndefinedColumn: column "nan" does not exist
> LINE 4: training_loss_final = nan,
>                                  ^
> QUERY:
>  UPDATE places100_mult_model_444_july7_info SET
>  training_metrics_final = 0.00999999977648,
>  training_loss_final = nan,
>  metrics_elapsed_time = ARRAY[33260.02720808983],
>  training_metrics = ARRAY[0.009999999776482582],
>  training_loss = ARRAY[nan]
>  WHERE mst_key = 12
> CONTEXT: Traceback (most recent call last):
> PL/Python function "madlib_keras_fit_multiple_model", line 23, in <module>
>  fit_obj = madlib_keras_fit_multiple_model.FitMultipleModel(**globals())
>  PL/Python function "madlib_keras_fit_multiple_model", line 42, in wrapper
>  PL/Python function "madlib_keras_fit_multiple_model", line 195, in __init__
>  PL/Python function "madlib_keras_fit_multiple_model", line 543, in insert_info_table
>  PL/Python function "madlib_keras_fit_multiple_model", line 539, in update_info_table
>  PL/Python function "madlib_keras_fit_multiple_model"
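
The UPDATE above fails because the Python float nan was pasted into the
SQL text as a bare word, which PostgreSQL parses as a column name. A
minimal sketch of one way to render such values as valid literals follows.
The helper names sql_float and sql_float_array are hypothetical, and this
is not necessarily the fix that went into v1.18.0; it only illustrates
that PostgreSQL accepts 'NaN', 'Infinity' and '-Infinity' as quoted
floating-point input.

```python
import math


def sql_float(value):
    """Render a Python number as a PostgreSQL double precision literal.

    Hypothetical helper for illustration; not part of MADlib.
    """
    if value is None:
        return "NULL"
    value = float(value)
    if math.isnan(value):
        # A bare nan is parsed as an identifier, so quote and cast it.
        return "'NaN'::double precision"
    if math.isinf(value):
        sign = "" if value > 0 else "-"
        return "'{}Infinity'::double precision".format(sign)
    return repr(value)


def sql_float_array(values):
    """Render a list of numbers as a double precision[] literal."""
    items = ", ".join(sql_float(v) for v in values)
    return "ARRAY[{}]::double precision[]".format(items)


# Rebuild the two loss assignments from the failing query for mst_key = 12.
print("training_loss_final = {},".format(sql_float(float("nan"))))
print("training_loss = {}".format(sql_float_array([float("nan")])))
```

Passing the values as parameters to a prepared statement instead of
formatting them into the query string would avoid hand-built literals
altogether.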



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
