[
https://issues.apache.org/jira/browse/MADLIB-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16284304#comment-16284304
]
Nikhil edited comment on MADLIB-1185 at 12/11/17 5:39 PM:
----------------------------------------------------------
*+RCA+*
The exception comes from this code in PGException_proto.hpp:
{code}
class PGException : public std::runtime_error {
public:
    explicit
    PGException()
      : std::runtime_error("The backend raised an exception.") { }
    // FIXME: Do something useful with inErrorData
    PGException(ErrorData* /* inErrorData */)
      : std::runtime_error("The backend raised an exception.") { }
};
{code}
The root cause of the problem lies in the type_info constructor in the
following files: viterbi.cpp, lda.cpp, svd.cpp, matrix_ops.cpp and arima.cpp.
Each of these files defines a type_info struct like this:
{code}
typedef struct __type_info{
    Oid oid;
    int16_t len;
    bool byval;
    char align;
    __type_info(Oid oid):oid(oid)
    {
        madlib_get_typlenbyvalalign(oid, &len, &byval, &align);
    }
} type_info;
static type_info FLOAT8TI(FLOAT8OID);
{code}
madlib_get_typlenbyvalalign is a MADlib wrapper around the PostgreSQL function
get_typlenbyvalalign. The wrapper catches the exception and does not print the
actual error raised by PostgreSQL, so we had to replace all calls to
madlib_get_typlenbyvalalign with get_typlenbyvalalign to see it. After that,
we saw the following exception:
{code}
ERROR: invalid cache ID: 74
CONTEXT: parallel worker
{code}
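Had PGException kept the backend's message, this error would have been visible
without swapping out the wrapper. The following is only a minimal sketch of
that idea, in the spirit of the FIXME in PGException_proto.hpp. ErrorDataSketch
and PGExceptionSketch are hypothetical stand-ins, not MADlib or PostgreSQL
types, so that the example compiles without the PostgreSQL headers; the real
ErrorData carries more fields than the single message modelled here.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for PostgreSQL's ErrorData; only the primary
// message text is modelled.
struct ErrorDataSketch {
    const char* message;  // may be null
};

// Hypothetical variant of PGException that keeps the backend's message
// instead of discarding it.
class PGExceptionSketch : public std::runtime_error {
public:
    PGExceptionSketch()
      : std::runtime_error("The backend raised an exception.") { }

    explicit PGExceptionSketch(const ErrorDataSketch* inErrorData)
      : std::runtime_error(describe(inErrorData)) { }

private:
    // Builds the what() text, appending the backend message when present.
    static std::string describe(const ErrorDataSketch* inErrorData) {
        std::string text = "The backend raised an exception";
        if (inErrorData != nullptr && inErrorData->message != nullptr) {
            text += ": ";
            text += inErrorData->message;
        } else {
            text += ".";
        }
        return text;
    }
};
```

With something along these lines, what() for the failure above would include
"invalid cache ID: 74" rather than only the generic text.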
get_typlenbyvalalign calls SearchSysCache1, and the type_info constructor uses
it to assign values to the struct members len, byval and align.
The problem is that when you open a psql session and call any C MADlib UDF
for the first time, PostgreSQL calls dlopen on libmadlib.so. Loading the
library runs the constructors of all static type_info objects (such as
FLOAT8TI), which in turn call SearchSysCache1.
It is not recommended to call SearchSysCache1 during library initialization.
Here is a relevant PostgreSQL thread about it:
https://www.postgresql.org/message-id/96420364a3d055172776752a1de80714%40smtp.hushmail.com
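The load-time behaviour described above can be reproduced in plain C++ without
PostgreSQL. This is only an illustrative sketch: g_backendReadySketch,
TypeInfoProbe, FLOAT8_PROBE and probeSawBackendReady are hypothetical names,
and the boolean flag merely simulates "the backend is ready for catalog
lookups".

```cpp
// Simulated "backend is ready for catalog lookups" flag. A plain bool with a
// constant initializer is set up before any constructor runs.
static bool g_backendReadySketch = false;

struct TypeInfoProbe {
    bool sawBackendReady;
    // Stands in for the type_info constructor: records whether the backend
    // was ready at the moment this object was constructed.
    TypeInfoProbe() : sawBackendReady(g_backendReadySketch) { }
};

// Namespace-scope object, like FLOAT8TI: its constructor runs when the
// program or shared library is loaded, before any function in it is called.
static TypeInfoProbe FLOAT8_PROBE;

// Stands in for the first UDF call, which only happens after loading.
bool probeSawBackendReady() {
    g_backendReadySketch = true;  // by now the backend is usable
    return FLOAT8_PROBE.sawBackendReady;
}
```

probeSawBackendReady() returns false: the object was already constructed by
the time the first function ran, so any lookup made in its constructor happened
too early.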
Hardcoding all the type_info struct members inside the constructor fixes the
problem.
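One possible shape of such a hardcoded struct is sketched below. It is not the
actual MADlib patch: Oid, FLOAT8OID_SKETCH, type_info_sketch and
FLOAT8TI_SKETCH are stand-ins defined here so the example compiles without the
PostgreSQL headers. The float8 values (length 8, alignment 'd') match the
pg_type catalog; byval = true is an assumption that holds only on builds where
float8 is passed by value (FLOAT8PASSBYVAL, typically 64-bit).

```cpp
#include <cstdint>

// Stand-ins so the sketch compiles without PostgreSQL headers.
typedef unsigned int Oid;
static constexpr Oid FLOAT8OID_SKETCH = 701;  // OID of float8 in pg_type

struct type_info_sketch {
    Oid oid;
    std::int16_t len;
    bool byval;
    char align;

    // No catalog lookup: every member is supplied as a constant, so
    // constructing the object never touches the syscache.
    constexpr type_info_sketch(Oid inOid, std::int16_t inLen,
                               bool inByval, char inAlign)
      : oid(inOid), len(inLen), byval(inByval), align(inAlign) { }
};

// float8: 8 bytes, double alignment ('d').
// Assumption: passed by value (true on FLOAT8PASSBYVAL builds).
static constexpr type_info_sketch FLOAT8TI_SKETCH(FLOAT8OID_SKETCH, 8, true, 'd');
```

Because the constructor is constexpr and all arguments are constants, the
object is initialized at compile time and no code needs to run for it when the
library is loaded.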
> Postgres 10 support for MADlib with large tables
> ------------------------------------------------
>
> Key: MADLIB-1185
> URL: https://issues.apache.org/jira/browse/MADLIB-1185
> Project: Apache MADlib
> Issue Type: Bug
> Components: DB Abstraction Layer
> Reporter: Nikhil
> Fix For: v1.13
>
>
> Running MADlib on PostgreSQL 10 with a large dataset (98000 rows with a
> double array column) causes the database to crash.
> Repro Steps
> {code}
> 1. create table foo (id integer, x double precision[], y integer);
> 2. Insert at least 1 million rows like these
> id | x | y
> -------+-------------------------+---
> 97440 | {1,0.2,0,1,0,1,0,0,0,0} | 1
> 3. Now running any C madlib UDF followed by a count(*) of foo will cause the
> database to crash
> select madlib.poisson_random(1); select count(*) from foo;
> or
> select madlib.svec_plus('{1}:{5}', '{1}:{4}'); select count(*) from foo;
> {code}