On Mon, Jan 23, 2012 at 3:22 PM, Bridget Frey <bridget.f...@redfin.com> wrote:
> Hello,
> We upgraded to postgres 9.1.2 two weeks ago, and we are also experiencing an
> issue that seems very similar to the one reported as bug 6200. We see
> approximately 2 dozen alloc errors per day across 3 slaves, and we are
> getting one segfault approximately every 3 days. We did not experience this
> issue before our upgrade (we were on version 8.4, and used skytools for
> replication).
>
> We are attempting to get a core dump on segfault (our last attempt did not
> work due to a config issue for the core dump). We're also attempting to
> repro the alloc errors on a test setup, but it seems like we may need quite
> a bit of load to trigger the issue. We're not certain that the alloc issues
> and the segfaults are "the same issue" - but it seems that it may be since
> the OP for bug 6200 sees the same behavior. We have seen no issues on the
> master, all alloc errors and segfaults have been on the slaves.
>
> We've seen the alloc errors on a few different tables, but most frequently
> on logins. Rows are added to the logins table one-by-one, and updates
> generally happen one row at a time. The table is pretty basic, it looks
> like this...
>
> CREATE TABLE logins
> (
>   login_id bigserial NOT NULL,
>   <snip - a bunch of columns>
>   CONSTRAINT logins_pkey PRIMARY KEY (login_id ),
>   <snip - some other constraints...>
> )
> WITH (
>   FILLFACTOR=80,
>   OIDS=FALSE
> );
>
> The queries that trigger the alloc error on this table look like this (we
> use hibernate hence the funny underscoring...)
> select login0_.login_id as login1_468_0_, l...
> from logins login0_ where
> login0_.login_id=$1
>
> The alloc error in the logs looks like this:
> -01-12_080925.log:2012-01-12 17:33:46 PST [16034]: [7-1] [24/25934] ERROR:
> invalid memory alloc request size 18446744073709551613
>
> The alloc error is nearly always for size 18446744073709551613 - though we
> have seen one time where it was a different amount...

Hmm, that number in hex works out to 0xfffffffffffffffd, which makes it sound an awful lot like the system (for some unknown reason) attempted to allocate -3 bytes of memory. I've seen something like this once before on a customer system running a modified version of PostgreSQL. In that case, the problem turned out to be page corruption. Circumstances didn't permit determination of the root cause of the page corruption, however, nor was I able to figure out exactly how the corruption I saw resulted in an allocation request like this.

It would be nice to figure out where in the code this is happening and put in a higher-level guard so that we get a better error message. You might want to compile a modified PostgreSQL executable that puts an extremely long sleep (like a year) just before this error is reported. Then, when the system hangs at that point, you can attach a debugger and pull a stack backtrace. Or you could insert an abort() at that point in the code and get a backtrace from the core dump.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs