[
https://issues.apache.org/jira/browse/MADLIB-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846274#comment-16846274
]
Domino Valdano edited comment on MADLIB-1326 at 5/22/19 11:22 PM:
------------------------------------------------------------------
We were able to reproduce this error in a Docker image running Ubuntu 16.04.6 LTS
(Xenial Xerus), but we could not reproduce it on OSX (with keras 2.2.4 and
tensorflow 1.1.13).
On Ubuntu, the simplest repro we found was running dev-check on this
_debug.sql_in_ file:
{code:java}
drop table if exists small_unbatched;
create table small_unbatched AS select ARRAY[1] AS x;
drop table if exists small_batched;
create table small_batched as select array_scalar_mult(x,1) as x from
small_unbatched;
select dummy() FROM small_unbatched;{code}
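For context, array_scalar_mult() is MADlib's array_ops helper that multiplies every
element of an array by a scalar, so the call above should return its input unchanged.
A quick sanity check (expected output shown as a comment; assumes the usual dev-check
install where the function is on the search path):
{code:java}
-- Same call shape as in the repro file above; multiplying by 1 should leave the array as-is.
select array_scalar_mult(ARRAY[1], 1) as x;
-- expected result: {1}
{code}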
Viewing memory usage with top before and after the call to array_scalar_mult()
showed very little change, and the system was not close to being out of memory
before postgres hit the segmentation fault. By inspecting the logs, we found
that the actual segfault happens inside libpthread.so.
Interestingly, it does not crash if you add:
{code:java}
select dummy();
{code}
at the top of the test file, so that dummy() is called once before and once after
array_scalar_mult() (as sketched below). This seems suspiciously like another example
of tensorflow holding on to resources until the process dies, even after the plpy
function that imported it has already returned. (This is discussed in some tensorflow
forums, but unfortunately the tensorflow devs have said that it is expected behavior
and they don't intend to fix it.)
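To spell out the non-crashing ordering, here is a sketch of the same repro file with
the extra dummy() call added at the top (that is the only change):
{code:java}
-- With this first call added, the file below no longer segfaults:
select dummy();
drop table if exists small_unbatched;
create table small_unbatched AS select ARRAY[1] AS x;
drop table if exists small_batched;
create table small_batched as select array_scalar_mult(x,1) as x from small_unbatched;
select dummy() FROM small_unbatched;
{code}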
> DL: Dev-check fails when keras_fit is called after array_scalar_mult
> --------------------------------------------------------------------
>
> Key: MADLIB-1326
> URL: https://issues.apache.org/jira/browse/MADLIB-1326
> Project: Apache MADlib
> Issue Type: Bug
> Components: Deep Learning
> Reporter: Nandish Jayaram
> Priority: Major
> Fix For: v1.16
>
>
> In madlib_keras dev-check, we create the input data to fit using
> {{minibatch_preprocessor_dl()}}. This function internally calls
> {{array_scalar_mult()}}. If we call either of these functions followed by
> {{madlib_keras_fit()}}, then the following error pops up:
> {code:java}
> NOTICE: Releasing segworker groups to finish aborting the transaction.
> ERROR: could not connect to segment: initialization of segworker group
> failed (cdbgang.c:237)
> {code}
> Digging further into Postgres logs suggests that there was a segmentation
> fault, and it seems like it's happening the moment {{import keras}} is called
> in {{madlib_keras_fit()}}.
> This issue was first noticed while working on MADLIB-1304 (which was closed
> with [this
> commit|https://github.com/apache/madlib/commit/241074ae68cb8e15437f98abf1c2e3c7bb3471ae],
> as the comment [in this
> line|https://github.com/apache/madlib/commit/241074ae68cb8e15437f98abf1c2e3c7bb3471ae#diff-f89c193e163bfe0e7e3821445e38fa97R29]
> suggests). At that time the failure happened on Greenplum, and Postgres did not yet
> support deep learning. It was noticed again while working on MADLIB-1311, which
> added Postgres support; at that point the failure happened on Postgres and
> there were no failures on Greenplum.
> While working on MADLIB-1311, we tried a couple of things and observed an odd
> behavior. We created a dummy function:
> {code:java}
> create function dummy()
> returns void as
> $$
> import keras
> $$
> language plpythonu;
> {code}
> If we run {{select dummy()}} *before* running {{minibatch_preprocessor_dl()}}
> or {{array_scalar_mult()}}, then the whole dev-check passes. But running the
> same function right after calling either of those functions causes a failure.
> So, it looks like any UDF that calls {{import keras}} *must* be run *before*
> calling {{minibatch_preprocessor_dl()}} or {{array_scalar_mult()}}.