Domino Valdano created MADLIB-1426:
--------------------------------------

             Summary: Without GPU's, FitMultipleModel fails in evaluate()
                 Key: MADLIB-1426
                 URL: https://issues.apache.org/jira/browse/MADLIB-1426
             Project: Apache MADlib
          Issue Type: Bug
          Components: Deep Learning
            Reporter: Domino Valdano


Whenever I try to run `madlib_keras_fit_multiple_model()` on a system without 
GPUs, it always fails in `evaluate()`, complaining that device `/gpu:0` is not 
available.  This happens regardless of whether `use_gpus=False` or `use_gpus=True`.

My platform is OSX 10.14.1 with the latest version of MADlib (1.17.0) and gpdb5.  I 
think I've also seen this happen on CentOS with gpdb6, so I believe this bug 
affects all platforms, though I'm not entirely sure of that.  Possibly it's specific 
to OSX or gpdb5.

The problem happens in `internal_keras_eval_transition()` in 
`madlib_keras.py_in`.
With `use_gpus=False`, it calls:

```
with K.tf.device(device_name):
    res = segment_model.evaluate(x_val, y_val)
```
with `device_name='/gpu:0'`.
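From the logs below, `device_name` appears to be derived without consulting the `use_gpus` flag on the segments. A minimal sketch of the kind of guard I'd expect (the function name `get_device_name` and its `gpu_id` parameter are my own illustration, not MADlib's actual API):

```python
# Hypothetical sketch of device selection keyed off the use_gpus flag.
# '/cpu:0' always exists; '/gpu:N' is only valid when GPUs are present.
def get_device_name(use_gpus, gpu_id=0):
    if use_gpus:
        return '/gpu:{0}'.format(gpu_id)
    return '/cpu:0'

# With use_gpus=False, evaluate() should then be pinned to the CPU device,
# e.g. `with K.tf.device(get_device_name(False)): ...`
```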

I know this because I added a plpy.info statement to print `device_name` at the 
beginning of this function.  I also printed the value of `use_gpus` on master 
before training begins:
```
INFO:  00000: use_gpus = False
```
This is what the error looks like:
```
INFO:  00000: device_name = /gpu:0  (seg1 slice1 127.0.0.1:25433 pid=90300)
CONTEXT:  PL/Python function "internal_keras_eval_transition"
LOCATION:  PLy_output, plpython.c:4773
psql:../run_fit_mult_iris.sql:1: ERROR:  XX000: plpy.SPIError: tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation group_deps: Operation was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device. (plpython.c:5038)  (seg0 slice1 127.0.0.1:25432 pid=90299) (plpython.c:5038)
DETAIL:
[[{{node group_deps}} = NoOp[_device="/device:GPU:0"](^loss/mul, ^metrics/acc/Mean)]]
Traceback (most recent call last):
  PL/Python function "internal_keras_eval_transition", line 6, in <module>
    return madlib_keras.internal_keras_eval_transition(**globals())
  PL/Python function "internal_keras_eval_transition", line 782, in internal_keras_eval_transition
  PL/Python function "internal_keras_eval_transition", line 1112, in evaluate
  PL/Python function "internal_keras_eval_transition", line 391, in test_loop
  PL/Python function "internal_keras_eval_transition", line 2714, in __call__
  PL/Python function "internal_keras_eval_transition", line 2670, in _call
  PL/Python function "internal_keras_eval_transition", line 2622, in _make_callable
  PL/Python function "internal_keras_eval_transition", line 1469, in _make_callable_from_options
  PL/Python function "internal_keras_eval_transition", line 1351, in _extend_graph
PL/Python function "internal_keras_eval_transition"
CONTEXT:  Traceback (most recent call last):
  PL/Python function "madlib_keras_fit_multiple_model", line 23, in <module>
    fit_obj = madlib_keras_fit_multiple_model.FitMultipleModel(**globals())
  PL/Python function "madlib_keras_fit_multiple_model", line 42, in wrapper
  PL/Python function "madlib_keras_fit_multiple_model", line 216, in __init__
  PL/Python function "madlib_keras_fit_multiple_model", line 230, in fit_multiple_model
  PL/Python function "madlib_keras_fit_multiple_model", line 270, in train_multiple_model
  PL/Python function "madlib_keras_fit_multiple_model", line 302, in evaluate_model
  PL/Python function "madlib_keras_fit_multiple_model", line 417, in compute_loss_and_metrics
  PL/Python function "madlib_keras_fit_multiple_model", line 739, in get_loss_metric_from_keras_eval
PL/Python function "madlib_keras_fit_multiple_model"
LOCATION:  PLy_elog, plpython.c:5038
```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
