On Sun, Apr 6, 2014 at 1:23 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rover_firefly&dt=2014-04-06%2017%3A04%3A00
>
> TRAP: FailedAssertion("!(FastPathStrongRelationLocks->count[fasthashcode] > 0)", File: "lock.c", Line: 1240)
> [53418a51.6a08:2] LOG: server process (PID 27631) was terminated by signal 6
> [53418a51.6a08:3] DETAIL: Failed process was running: create table gc1() inherits (c1);
Uggh. That's unfortunate, but not terribly surprising: I didn't think that missing volatile was very likely to be the cause of this. Have we been getting random failures of this type ever since the fastlock stuff went in, and we're only just now noticing? Or did some recent change expose this problem? I'm a bit suspicious of the patches to static-ify stuff, since that might cause the compiler to think it can move things across function calls that it hadn't thought movable before; but the FastPathStrongRelationLocks references would seem to be the obvious candidate for that kind of movement, and volatile-izing it ought to have fixed that. I would think.

One thing I noticed, looking at this particular failure, is that at the time "create table gc1() inherits (c1)" failed the assertion, another backend was inside "select blockme()", and specifically inside "select count(*) from tenk1 a, tenk1 b, tenk1 c". I can't help suspecting that the bug is somehow concurrency-related, so the presence of concurrent activity seems like a clue, but I can't figure out the relationship. blockme() shouldn't be taking any strong relation locks. Unless autovacuum decided to truncate a table just then, the process that failed should have been the only one in the system holding any strong relation lock, so if there's a race, what is it racing against?

In the failure on prairiedog back on March 25th...

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-03-25%2003%3A15%3A03

...there's a lot more concurrent activity. The process that failed the assertion was running "CREATE TABLE t3 (name TEXT, n INTEGER)"; concurrently, the following other queries were running:

SELECT oid AS clsoid, relname, relnatts + 10 AS x INTO selinto_schema.tmp2 FROM pg_class WHERE relname like '%b%';
CREATE TEMP TABLE foo (id integer);
create temp table t3 as select generate_series(-1000,1000) as x;
CREATE TEMPORARY TABLE bitwise_test( i2 INT2, i4 INT4, i8 INT8, i INTEGER, x INT2, y BIT(4) );
DROP TABLE savepoints;
create temp table tt1(f1 int);
CREATE TEMP TABLE arrtest2 (i integer ARRAY[4], f float8[], n numeric[], t text[], d timestamp[]);
COMMIT PREPARED 'regress-one';

All but the first and last of those take a strong relation lock, so some kind of race could certainly account for that failure. It's also interesting that COMMIT PREPARED shows up as being involved here; that code path is presumably executed much more rarely than the regular commit and abort paths, and might therefore be thought more likely to harbor a bug. In particular, there's this code in LockRefindAndRelease:

    /*
     * Decrement strong lock count.  This logic is needed only for 2PC.
     */
    if (decrement_strong_lock_count
        && ConflictsWithRelationFastPath(&lock->tag, lockmode))
    {
        uint32      fasthashcode = FastPathStrongLockHashPartition(hashcode);

        SpinLockAcquire(&FastPathStrongRelationLocks->mutex);
        FastPathStrongRelationLocks->count[fasthashcode]--;
        SpinLockRelease(&FastPathStrongRelationLocks->mutex);
    }

I notice that this code lacks an Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0), and I think we should add one. If this code somehow manages to decrement one of the counts when it's already zero, the count wraps around to the largest value a uint32 can hold; the next process whose strong lock gets mapped to this partition then increments the count from that maximum value back around to zero. Then, when that process goes to release its strong lock, it finds that the count is already zero and goes boom.
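Concretely, what I have in mind is nothing more than the missing Assert dropped into the existing block, along these lines (an untested sketch of the code quoted above):

    if (decrement_strong_lock_count
        && ConflictsWithRelationFastPath(&lock->tag, lockmode))
    {
        uint32      fasthashcode = FastPathStrongLockHashPartition(hashcode);

        SpinLockAcquire(&FastPathStrongRelationLocks->mutex);
        /* proposed addition: catch a bogus decrement at the point where it happens */
        Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0);
        FastPathStrongRelationLocks->count[fasthashcode]--;
        SpinLockRelease(&FastPathStrongRelationLocks->mutex);
    }

That way, if the 2PC path ever does try to decrement a zero count, the assertion fires in the backend doing the bogus decrement rather than in some unrelated backend much later.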
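To spell out the wraparound arithmetic, here's a tiny standalone illustration, with made-up names and a made-up partition count rather than actual backend code, of how one spurious decrement at zero sets up exactly this kind of delayed assertion failure:

    #include <assert.h>
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    #define N_PARTITIONS 1024              /* made-up partition count */

    static uint32_t count[N_PARTITIONS];   /* stands in for the strong-lock counts */

    int
    main(void)
    {
        uint32_t fasthashcode = 42;        /* pretend lock-tag hash % N_PARTITIONS */

        /* Buggy path: decrement a counter that is already zero. */
        count[fasthashcode]--;
        printf("after bogus decrement: %" PRIu32 "\n", count[fasthashcode]);   /* 4294967295 */

        /* Some innocent backend later takes a strong lock that maps here... */
        count[fasthashcode]++;             /* wraps back around to zero */

        /* ...and goes boom when it releases that lock. */
        assert(count[fasthashcode] > 0);   /* fails, mirroring the TRAP above */
        count[fasthashcode]--;
        return 0;
    }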
This theory could even explain the new crash, since the COMMIT PREPARED stuff has already happened by the point where rover_firefly failed. And since the lock tags are hashed to create fasthashcode, variation in which objects got which OIDs due to concurrency in the regression tests could cause the failure to move around or even escape detection altogether. Now, even if the 2PC code is the problem, that doesn't explain exactly what's wrong with the above logic, but it would help narrow down where to look.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers