[ https://issues.apache.org/jira/browse/MADLIB-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Domino Valdano updated MADLIB-1426:
-----------------------------------
    Description: 
Whenever I run {{madlib_keras_fit_multiple_model()}} on a system without GPUs, 
it fails in evaluate, complaining that device {{/device:GPU:0}} is not 
available. This happens regardless of whether {{use_gpus=False}} or 
{{use_gpus=True}}.

My platform is OSX 10.14.1 with the latest version of MADlib (1.17.0) and 
gpdb5. I think I've also seen this happen on CentOS with gpdb6, so I believe 
this bug affects all platforms, but I'm not entirely sure of that; it may be 
specific to OSX or gpdb5.

The problem happens in {{internal_keras_eval_transition()}} in 
{{madlib_keras.py_in}}. With {{use_gpus=False}}, it runs:

{{with K.tf.device(device_name):}}
{{    res = segment_model.evaluate(x_val, y_val)}}

I added a {{plpy.info}} statement to print {{device_name}} at the beginning of 
this function. I also printed the value of {{use_gpus}} on master before 
training began. Although {{use_gpus}} is false on master, {{device_name}} on 
the segments is {{/gpu:0}}. This is the bug: it should be {{/cpu:0}}.
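
One way the device choice could be guarded is sketched below. This is only an 
illustration, not the MADlib implementation: {{choose_device_name}}, 
{{available_gpus}} and {{gpu_index}} are hypothetical names, and falling back 
to the CPU when no GPU is present is an assumption about the desired behavior.

```python
def choose_device_name(use_gpus, available_gpus, gpu_index=0):
    """Return a TensorFlow-style device string for one segment.

    Hypothetical helper for illustration only; not part of MADlib.
    use_gpus       -- the flag the caller passed to the fit function
    available_gpus -- number of GPUs visible on this host (assumed known)
    gpu_index      -- which GPU this segment would prefer
    """
    # Only pick a GPU when the caller asked for one AND one actually exists.
    if use_gpus and available_gpus > 0:
        # Wrap the index so it always refers to a GPU that is present.
        return '/gpu:{0}'.format(gpu_index % available_gpus)
    # In every other case run on the CPU.
    return '/cpu:0'
```

With this guard, {{choose_device_name(False, 0)}} yields {{/cpu:0}}, which is 
the value the evaluate step should have received in the run described here. 
Whether {{use_gpus=True}} on a host without GPUs should fall back silently or 
raise an error is a separate design choice.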

This is the resulting output, ending in the error:

{{CONTEXT: PL/Python function "madlib_keras_fit_multiple_model"}}
{{LOCATION: PLy_output, plpython.c:4773}}
{{psql:../run_fit_mult_iris.sql:1: INFO: 00000: device_name = /gpu:0 (seg0 
slice1 127.0.0.1:25432 pid=90299)}}
{{CONTEXT: PL/Python function "internal_keras_eval_transition"}}
{{LOCATION: PLy_output, plpython.c:4773}}
{{psql:../run_fit_mult_iris.sql:1: INFO: 00000: device_name = /gpu:0 (seg2 
slice1 127.0.0.1:25434 pid=90301)}}
{{CONTEXT: PL/Python function "internal_keras_eval_transition"}}
{{LOCATION: PLy_output, plpython.c:4773}}
{{psql:../run_fit_mult_iris.sql:1: INFO: 00000: device_name = /gpu:0 (seg1 
slice1 127.0.0.1:25433 pid=90300)}}
{{CONTEXT: PL/Python function "internal_keras_eval_transition"}}
{{LOCATION: PLy_output, plpython.c:4773}}
{{psql:../run_fit_mult_iris.sql:1: ERROR: XX000: plpy.SPIError: 
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a 
device for operation group_deps: Operation was explicitly assigned to 
/device:GPU:0 but available devices are [ 
/job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device 
specification refers to a valid device. (plpython.c:5038) (seg0 slice1 
127.0.0.1:25432 pid=90299) (plpython.c:5038)}}
{{DETAIL:}}
{{[[\{{node group_deps}} = NoOp[_device="/device:GPU:0"](^loss/mul, 
^metrics/acc/Mean)]]}}
{{Traceback (most recent call last):}}
{{PL/Python function "internal_keras_eval_transition", line 6, in <module>}}
{{return madlib_keras.internal_keras_eval_transition(**globals())}}
{{PL/Python function "internal_keras_eval_transition", line 782, in 
internal_keras_eval_transition}}
{{PL/Python function "internal_keras_eval_transition", line 1112, in evaluate}}
{{PL/Python function "internal_keras_eval_transition", line 391, in test_loop}}
{{PL/Python function "internal_keras_eval_transition", line 2714, in __call__}}
{{PL/Python function "internal_keras_eval_transition", line 2670, in _call}}
{{PL/Python function "internal_keras_eval_transition", line 2622, in 
_make_callable}}
{{PL/Python function "internal_keras_eval_transition", line 1469, in 
_make_callable_from_options}}
{{PL/Python function "internal_keras_eval_transition", line 1351, in 
_extend_graph}}
{{PL/Python function "internal_keras_eval_transition"}}
{{CONTEXT: Traceback (most recent call last):}}
{{PL/Python function "madlib_keras_fit_multiple_model", line 23, in <module>}}
{{fit_obj = madlib_keras_fit_multiple_model.FitMultipleModel(**globals())}}
{{PL/Python function "madlib_keras_fit_multiple_model", line 42, in wrapper}}
{{PL/Python function "madlib_keras_fit_multiple_model", line 216, in __init__}}
{{PL/Python function "madlib_keras_fit_multiple_model", line 230, in 
fit_multiple_model}}
{{PL/Python function "madlib_keras_fit_multiple_model", line 270, in 
train_multiple_model}}
{{PL/Python function "madlib_keras_fit_multiple_model", line 302, in 
evaluate_model}}
{{PL/Python function "madlib_keras_fit_multiple_model", line 417, in 
compute_loss_and_metrics}}
{{PL/Python function "madlib_keras_fit_multiple_model", line 739, in 
get_loss_metric_from_keras_eval}}
{{PL/Python function "madlib_keras_fit_multiple_model"}}
{{LOCATION: PLy_elog, plpython.c:5038}}


> Without GPU's, FitMultipleModel fails in evaluate()
> ---------------------------------------------------
>
>                 Key: MADLIB-1426
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1426
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Deep Learning
>            Reporter: Domino Valdano
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
