[ https://issues.apache.org/jira/browse/MADLIB-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Frank McQuillan reassigned MADLIB-1406:
---------------------------------------
Assignee: Nikhil Kak
> DL: fit multiple takes up unnecessary disk space
> ------------------------------------------------
>
> Key: MADLIB-1406
> URL: https://issues.apache.org/jira/browse/MADLIB-1406
> Project: Apache MADlib
> Issue Type: Bug
> Components: Deep Learning
> Reporter: Nikhil Kak
> Assignee: Nikhil Kak
> Priority: Major
> Fix For: v1.17
>
>
> While testing places10 with fit multiple (GPDB 5, 10 iterations and 20 MSTs),
> we ran out of disk space even though we had at least 1.5 TB free at the start
> of the query. The query should not need this much space, so this probably
> indicates a bug in the code.
> Here are the query and the failure:
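> The 20 MSTs (model selection tuples) come from the load_model_selection_table call in the query below. That call uses one model id, five compile-parameter strings and four fit-parameter strings. The sketch below assumes the function builds the full Cartesian product of the three lists, and under that assumption it reproduces the count in plain PostgreSQL. The derived-table names are hypothetical placeholders, not MADlib objects.
> {code:sql}
> -- Hypothetical stand-ins for the three parameter lists in the report;
> -- one output row per (model id, compile params, fit params) combination.
> SELECT count(*) AS num_msts  -- 1 * 5 * 4 = 20
> FROM (VALUES (1)) AS model_ids(model_id)
> CROSS JOIN (VALUES (1), (2), (3), (4), (5)) AS compile_params(idx)
> CROSS JOIN (VALUES (1), (2), (3), (4)) AS fit_params(idx);
> {code}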
> {code:sql}
> DROP TABLE IF EXISTS mst_table, mst_table_summary;
> SELECT load_model_selection_table(
> 'model_arch_places10',
> 'mst_table',
> ARRAY[1],
> ARRAY[
> $$loss='categorical_crossentropy', optimizer='SGD(lr=0.1, decay=1e-6, nesterov=True)', metrics=['accuracy']$$,
> $$loss='categorical_crossentropy', optimizer='SGD(lr=0.01, decay=1e-6, nesterov=True)', metrics=['accuracy']$$,
> $$loss='categorical_crossentropy', optimizer='SGD(lr=0.001, decay=1e-6, nesterov=True)', metrics=['accuracy']$$,
> $$loss='categorical_crossentropy', optimizer='SGD(lr=0.0001, decay=1e-6, nesterov=True)', metrics=['accuracy']$$,
> $$loss='categorical_crossentropy', optimizer='SGD(lr=0.001, decay=1e-6, nesterov=False)', metrics=['accuracy']$$
> ],
> ARRAY[
> $$batch_size=16, epochs=1, verbose=0$$,
> $$batch_size=20, epochs=1, verbose=0$$,
> $$batch_size=32, epochs=1, verbose=0$$,
> $$batch_size=40, epochs=1, verbose=0$$
> ]
> );
> DROP TABLE IF EXISTS places10_train_mult_model, places10_train_mult_model_summary, places10_train_mult_model_info;
> SELECT madlib_keras_fit_multiple_model(
> 'places10_train_bytea_batched',
> 'places10_train_mult_model',
> 'mst_table',
> 10,
> TRUE
> );
> -- failed in the 7th iteration
> ....
> Time for training in iteration 6: 6403.70687222 sec
> ERROR: plpy.SPIError: could not extend relation 1663/3721274/1121877: No space left on device (seg1){code}
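>
> Two diagnostic queries may help narrow down where the space goes. This is a sketch that assumes a Greenplum 5 cluster. It reads the error path 1663/3721274/1121877 as tablespace OID (1663 is pg_default), database OID, and the relfilenode on seg1. If the relation was created inside the failed madlib_keras_fit_multiple_model call, the rollback removed it, and the lookup will return no rows. The free-space query can also be run from a separate session while a fresh run is in progress, to watch how usage grows per iteration.
> {code:sql}
> -- Which relation on segment 1 owns relfilenode 1121877 (if it still exists)?
> SELECT gp_segment_id, relname, relkind
> FROM gp_dist_random('pg_class')
> WHERE relfilenode = 1121877
>   AND gp_segment_id = 1;
>
> -- Free disk space per segment, as reported by df through gp_toolkit
> -- (dfspace is in kilobytes, so convert to bytes for pg_size_pretty).
> SELECT dfsegment, dfhostname, dfdevice,
>        pg_size_pretty(dfspace * 1024) AS free_space
> FROM gp_toolkit.gp_disk_free
> ORDER BY dfsegment;
> {code}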
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)