[ https://issues.apache.org/jira/browse/MADLIB-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Domino Valdano closed MADLIB-1443.
----------------------------------

> Crash in fit_multiple when any model reaches loss=nan
> -----------------------------------------------------
>
>                 Key: MADLIB-1443
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1443
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Deep Learning
>            Reporter: Domino Valdano
>            Assignee: Domino Valdano
>            Priority: Minor
>              Labels: deeplearning
>             Fix For: v1.18.0
>
>
> There's a crash that can happen in {{madlib_keras_fit_multiple}} (and
> probably fit as well but I haven't tested it), when the loss ends up becoming
> nan for a model.
>
> {{$$loss='categorical_crossentropy',optimizer='SGD(lr=0.05, momentum=1.1)',metrics=['accuracy']$$}}
>
> Clearly, this was not a great choice for the momentum hyperparameter, but
> keras does accept it and trains through all the way with no errors or
> exceptions. The problem is, the loss ends up becoming infinite (or
> undefined?) at some point. All 8 models trained for 10 hours, printed out
> the results, and then {{madlib_keras_fit_multiple}} crashed while trying to
> write out the final info table:
>
> Training set after iteration 1:
> mst_key=7: metric=0.446168005466, loss=2.39643478394
> mst_key=12: metric=0.00999999977648, loss=nan
> mst_key=11: metric=0.165068000555, loss=4.0407166481
> ...
> Validation set after iteration 1:
> mst_key=7: metric=0.359100013971, loss=2.89618015289
> mst_key=12: metric=0.00999999977648, loss=nan
> mst_key=11: metric=0.151299998164, loss=4.0829615593
> ...
> CONTEXT: PL/Python function "madlib_keras_fit_multiple_model"
> psql:run_fit_mult100.sql:14: ERROR: spiexceptions.UndefinedColumn: column "nan" does not exist
> LINE 4: training_loss_final = nan,
>                               ^
> QUERY:
> UPDATE places100_mult_model_444_july7_info SET
>     training_metrics_final = 0.00999999977648,
>     training_loss_final = nan,
>     metrics_elapsed_time = ARRAY[33260.02720808983],
>     training_metrics = ARRAY[0.009999999776482582],
>     training_loss = ARRAY[nan]
> WHERE mst_key = 12
> CONTEXT: Traceback (most recent call last):
> PL/Python function "madlib_keras_fit_multiple_model", line 23, in <module>
>     fit_obj = madlib_keras_fit_multiple_model.FitMultipleModel(**globals())
> PL/Python function "madlib_keras_fit_multiple_model", line 42, in wrapper
> PL/Python function "madlib_keras_fit_multiple_model", line 195, in __init__
> PL/Python function "madlib_keras_fit_multiple_model", line 543, in insert_info_table
> PL/Python function "madlib_keras_fit_multiple_model", line 539, in update_info_table
> PL/Python function "madlib_keras_fit_multiple_model"

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
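
The failing UPDATE interpolates the Python float directly into the SQL text, so a bare nan is parsed by PostgreSQL as a column name. PostgreSQL does accept NaN and infinities for double precision, but only as quoted literals such as 'NaN'. The following is a minimal sketch of one way to render such values safely; it is not necessarily the change that shipped in v1.18.0, and the helper names sql_float, sql_float_array, and build_loss_update are hypothetical, not part of MADlib.

```python
import math


def sql_float(value):
    """Render a Python float as a PostgreSQL double precision literal.

    Hypothetical helper: NaN and infinities must be quoted, otherwise
    the parser treats the bare word as a column reference.
    """
    if value is None:
        return "NULL"
    value = float(value)
    if math.isnan(value):
        return "'NaN'::double precision"
    if math.isinf(value):
        sign = "" if value > 0 else "-"
        return "'{0}Infinity'::double precision".format(sign)
    return repr(value)


def sql_float_array(values):
    """Render a list of floats as a double precision[] literal."""
    items = ", ".join(sql_float(v) for v in values)
    return "ARRAY[{0}]::double precision[]".format(items)


def build_loss_update(info_table, mst_key, final_loss, losses):
    """Build an UPDATE similar in shape to the one in the traceback.

    Illustrative only: info_table is assumed to be a trusted,
    already validated identifier.
    """
    return (
        "UPDATE {table} SET "
        "training_loss_final = {final}, "
        "training_loss = {history} "
        "WHERE mst_key = {key}"
    ).format(
        table=info_table,
        final=sql_float(final_loss),
        history=sql_float_array(losses),
        key=int(mst_key),
    )


if __name__ == "__main__":
    nan = float("nan")
    print(build_loss_update("places100_mult_model_444_july7_info", 12, nan, [nan]))
```

An alternative worth considering is to pass the values as query parameters (for example through a prepared plan in PL/Python) rather than building the SQL string, which sidesteps literal formatting entirely.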