[ 
https://issues.apache.org/jira/browse/MADLIB-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16284304#comment-16284304
 ] 

Nikhil edited comment on MADLIB-1185 at 12/11/17 5:19 PM:
----------------------------------------------------------

The exception comes from this code in PGException_proto.hpp:
{code}
class PGException : public std::runtime_error {
public:
    explicit 
    PGException()
      : std::runtime_error("The backend raised an exception.") { }
    
    // FIXME: Do something useful with inErrorData
    PGException(ErrorData* /* inErrorData */)
      : std::runtime_error("The backend raised an exception.") {  }
};
{code}

The root cause lies in the type_info constructor defined in the 
following files: viterbi.cpp, lda.cpp, svd.cpp, matrix_ops.cpp and arima.cpp.

Each of these files defines a type_info struct like this:
{code}
typedef struct __type_info{
    Oid oid;
    int16_t len;
    bool    byval;
    char    align;

    __type_info(Oid oid):oid(oid)
    {
        madlib_get_typlenbyvalalign(oid, &len, &byval, &align);
    }
} type_info;

static type_info FLOAT8TI(FLOAT8OID);
{code}

madlib_get_typlenbyvalalign is a MADlib wrapper around the Postgres function 
get_typlenbyvalalign. The wrapper catches the exception and does not print 
the actual error coming from Postgres, so we had to replace all calls to 
madlib_get_typlenbyvalalign with get_typlenbyvalalign to see it. After that, 
we saw the following error:
{code}
  ERROR:  invalid cache ID: 74
  CONTEXT:  parallel worker
{code}
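The FIXME in PGException is the reason the real message was hidden. As a rough sketch only (not the MADlib implementation), an exception class could copy the backend message into what(). ErrorDataStub and PGExceptionSketch are hypothetical names used for illustration; the stub stands in for PostgreSQL's ErrorData, which has a message field among many others.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for PostgreSQL's ErrorData (utils/elog.h).
// Only the primary message field is modelled here.
struct ErrorDataStub {
    const char* message;
};

// Sketch of an exception that keeps the backend's message instead of
// replacing it with a fixed string.
class PGExceptionSketch : public std::runtime_error {
public:
    PGExceptionSketch()
      : std::runtime_error("The backend raised an exception.") { }

    explicit PGExceptionSketch(const ErrorDataStub* inErrorData)
      : std::runtime_error(describe(inErrorData)) { }

private:
    // Builds the text returned by what(); falls back to the generic
    // message when no backend message is available.
    static std::string describe(const ErrorDataStub* inErrorData) {
        if (inErrorData == nullptr || inErrorData->message == nullptr) {
            return "The backend raised an exception.";
        }
        std::string text = "The backend raised an exception: ";
        text += inErrorData->message;
        return text;
    }
};
```

With something along these lines, the "invalid cache ID: 74" text would have been visible without swapping out the wrapper calls.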

get_typlenbyvalalign calls SearchSysCache1; the constructor uses it to assign 
values to the struct members len, byval and align.

The problem is that when you open a psql session and call any C MADlib UDF 
for the first time, Postgres calls dlopen on libmadlib.so. Because each file 
declares static type_info objects such as FLOAT8TI, dlopen runs all of their 
constructors while the library is being loaded, and each constructor in turn 
calls SearchSysCache1. Calling SearchSysCache1 during library initialization 
is not recommended. Here is a relevant Postgres thread about it: 

https://www.postgresql.org/message-id/96420364a3d055172776752a1de80714%40smtp.hushmail.com
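To illustrate the timing issue outside of Postgres, here is a small standalone sketch. It shows that a file-scope static object is constructed during static initialization (before main() in a program, inside dlopen for a shared library), while a function-local static is constructed only on first use. All names (event_log, eager_info, lazy_info, get_lazy) are made up for this example, and the lazy variant is just a generic C++ pattern, not the fix that was applied here.

```cpp
#include <string>
#include <vector>

// Records the order of events. A function-local static is used so the
// log is guaranteed to exist before anything writes to it.
static std::vector<std::string>& event_log() {
    static std::vector<std::string> log;
    return log;
}

struct eager_info {
    eager_info() { event_log().push_back("eager constructor"); }
};

// File-scope object: its constructor runs during static initialization,
// the same phase in which the type_info constructors ran.
static eager_info EAGER;

struct lazy_info {
    lazy_info() { event_log().push_back("lazy constructor"); }
};

// Function-local static: constructed on the first call to get_lazy(),
// not when the program or library is loaded.
static lazy_info& get_lazy() {
    static lazy_info instance;
    return instance;
}
```

In this sketch the log already contains the eager entry before any user code runs; the lazy entry appears only after get_lazy() is called.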

Hardcoding all the type_info struct members inside the constructor, instead 
of looking them up in the catalog, fixes the problem.  
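A minimal sketch of what such hardcoding might look like follows; it is an illustration, not the actual patch. The Oid typedef and the OID constants are stand-ins so the example compiles without the Postgres headers (in MADlib they come from postgres.h and catalog/pg_type.h). The values mirror the pg_type entries for int4 and float8; byval = true for float8 assumes a build where FLOAT8PASSBYVAL is set.

```cpp
#include <cstdint>

// Stand-ins for definitions normally provided by the Postgres headers.
typedef unsigned int Oid;
static const Oid INT4OID   = 23;   // pg_type OID of int4
static const Oid FLOAT8OID = 701;  // pg_type OID of float8

struct type_info {
    Oid     oid;
    int16_t len;
    bool    byval;
    char    align;

    // No catalog lookup: the constructor only assigns constants, so it
    // can run safely while the shared library is being loaded.
    explicit type_info(Oid oid)
      : oid(oid), len(0), byval(false), align('\0')
    {
        switch (oid) {
        case INT4OID:    // typlen = 4, typalign = 'i'
            len = 4; byval = true; align = 'i';
            break;
        case FLOAT8OID:  // typlen = 8, typalign = 'd'
            // Assumption: float8 is pass-by-value on this build.
            len = 8; byval = true; align = 'd';
            break;
        default:
            // Types not listed are left zeroed in this sketch.
            break;
        }
    }
};

static type_info INT4TI(INT4OID);
static type_info FLOAT8TI(FLOAT8OID);
```

The trade-off is that the constants must be kept in line with the catalog by hand, which is reasonable for built-in types whose pg_type entries are fixed.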




> Postgres 10 support for MADlib with large tables
> ------------------------------------------------
>
>                 Key: MADLIB-1185
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1185
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: DB Abstraction Layer
>            Reporter: Nikhil
>             Fix For: v1.13
>
>
> Running MADlib on Postgres 10 with a large dataset (98000 rows with a double 
> precision array column) causes the database to crash.
> Repro steps:
> {code}
> 1. create table foo (id integer, x double precision[], y integer);
> 2. Insert at least 1 million rows like these
>   id   |            x            | y
> -------+-------------------------+---
>  97440 | {1,0.2,0,1,0,1,0,0,0,0} | 1
> 3. Now running any C madlib UDF followed by a count(*) of foo will cause the 
> database to crash
> select madlib.poisson_random(1); select count(*) from foo;
> or
> select madlib.svec_plus('{1}:{5}', '{1}:{4}'); select count(*) from foo;
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
