On Sun, Apr 6, 2014 at 1:23 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rover_firefly&dt=2014-04-06%2017%3A04%3A00
>
> TRAP: FailedAssertion("!(FastPathStrongRelationLocks->count[fasthashcode] > 0)", File: "lock.c", Line: 1240)
> [53418a51.6a08:2] LOG: server process (PID 27631) was terminated by signal 6
> [53418a51.6a08:3] DETAIL: Failed process was running: create table gc1() inherits (c1);
Uggh. That's unfortunate, but not terribly surprising: I didn't think that missing volatile was very likely to be the cause of this. Have we been getting random failures of this type ever since the fastlock stuff went in, and we're only just now noticing? Or did some recent change expose this problem? I'm a bit suspicious of the patches to static-ify stuff, since that might cause the compiler to think it can move things across function calls that it hadn't thought movable before; but the FastPathStrongRelationLocks references would seem to be the obvious candidate for that kind of movement, and volatile-izing it ought to have fixed that. I would think.

One thing I noticed, looking at this particular failure, is that at the time "create table gc1() inherits (c1)" failed the assertion, another backend was inside "select blockme()", and specifically inside "select count(*) from tenk1 a, tenk1 b, tenk1 c". I can't help suspecting that the bug is somehow concurrency-related, so the presence of concurrent activity seems like a clue, but I can't figure out the relationship. blockme() shouldn't be taking any strong relation locks. Unless autovacuum decided to truncate a table just then, the process that failed should have been the only one in the system holding any strong relation lock, so if there's a race, what is it racing against?

In the failure on prairiedog back on March 25th...

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-03-25%2003%3A15%3A03

...there's a lot more concurrent activity. The process that failed the assertion was running "CREATE TABLE t3 (name TEXT, n INTEGER)"; concurrently, the following other queries were running:

SELECT oid AS clsoid, relname, relnatts + 10 AS x INTO selinto_schema.tmp2 FROM pg_class WHERE relname like '%b%';
CREATE TEMP TABLE foo (id integer);
create temp table t3 as select generate_series(-1000,1000) as x;
CREATE TEMPORARY TABLE bitwise_test( i2 INT2, i4 INT4, i8 INT8, i INTEGER, x INT2, y BIT(4) );
DROP TABLE savepoints;
create temp table tt1(f1 int);
CREATE TEMP TABLE arrtest2 (i integer ARRAY[4], f float8[], n numeric[], t text[], d timestamp[]);
COMMIT PREPARED 'regress-one';

All but the first and last of those take a strong relation lock, so some kind of race could certainly account for that failure. It's also interesting that COMMIT PREPARED shows up as being involved here; that code path is presumably executed much more rarely than the regular commit and abort paths, and might therefore be thought more likely to harbor a bug. In particular, there's this code in LockRefindAndRelease:

    /*
     * Decrement strong lock count.  This logic is needed only for 2PC.
     */
    if (decrement_strong_lock_count
        && ConflictsWithRelationFastPath(&lock->tag, lockmode))
    {
        uint32      fasthashcode = FastPathStrongLockHashPartition(hashcode);

        SpinLockAcquire(&FastPathStrongRelationLocks->mutex);
        FastPathStrongRelationLocks->count[fasthashcode]--;
        SpinLockRelease(&FastPathStrongRelationLocks->mutex);
    }

I notice that this code lacks an Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0), and I think we should add one. If this code somehow manages to decrement one of the counts when it's already zero, the count wraps around to the largest value a uint32 can hold; the next process whose strong lock gets mapped to this partition then increments the count from that maximum value back around to zero. Then, when that process goes to release its strong lock, it finds that the count is already zero and goes boom.
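Concretely, what I have in mind is nothing more than the missing Assert dropped into the existing block, along these lines (an untested sketch of the code quoted above):

    if (decrement_strong_lock_count
        && ConflictsWithRelationFastPath(&lock->tag, lockmode))
    {
        uint32      fasthashcode = FastPathStrongLockHashPartition(hashcode);

        SpinLockAcquire(&FastPathStrongRelationLocks->mutex);
        /* proposed addition: catch a bogus decrement at the point where it happens */
        Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0);
        FastPathStrongRelationLocks->count[fasthashcode]--;
        SpinLockRelease(&FastPathStrongRelationLocks->mutex);
    }

That way, if the 2PC path ever does try to decrement a zero count, the assertion fires in the backend doing the bogus decrement rather than in some unrelated backend much later.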
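To spell out the wraparound arithmetic, here's a tiny standalone illustration, with made-up names and a made-up partition count rather than actual backend code, of how one spurious decrement at zero sets up exactly this kind of delayed assertion failure:

    #include <assert.h>
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    #define N_PARTITIONS 1024              /* made-up partition count */

    static uint32_t count[N_PARTITIONS];   /* stands in for the strong-lock counts */

    int
    main(void)
    {
        uint32_t fasthashcode = 42;        /* pretend lock-tag hash % N_PARTITIONS */

        /* Buggy path: decrement a counter that is already zero. */
        count[fasthashcode]--;
        printf("after bogus decrement: %" PRIu32 "\n", count[fasthashcode]);   /* 4294967295 */

        /* Some innocent backend later takes a strong lock that maps here... */
        count[fasthashcode]++;             /* wraps back around to zero */

        /* ...and goes boom when it releases that lock. */
        assert(count[fasthashcode] > 0);   /* fails, mirroring the TRAP above */
        count[fasthashcode]--;
        return 0;
    }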
This theory could even explain the new crash, since the COMMIT PREPARED stuff has already happened by the point where rover_firefly failed. And since the lock tags are hashed to create fasthashcode, variation in which objects got which OIDs due to concurrency in the regression tests could cause the failure to move around or even escape detection altogether. Now, even if the 2PC code is the problem, that doesn't explain exactly what's wrong with the above logic, but it would help narrow down where to look.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers