I suspected $SUBJECT from the beginning, and I've now put in enough work to be able to prove it. The attached test program reliably fails within a few minutes of being started, when run with 8 worker processes on an 8-core PPC machine. It's a pretty simple "token passing ring" protocol, and at some point one of the processes sees its latch set without seeing its flag set, so it goes back to sleep and the token stops getting passed.
The original worry about the latch code was that there was some internal race condition of this sort, but I think we were failing to see the forest for the trees. The specification of how to use a latch reads like this:

 * The correct pattern to wait for an event is:
 *
 * for (;;)
 * {
 *	   ResetLatch();
 *	   if (work to do)
 *		   Do Stuff();
 *
 *	   WaitLatch();
 * }
 *
 * It's important to reset the latch *before* checking if there's work to
 * do. Otherwise, if someone sets the latch between the check and the
 * ResetLatch call, you will miss it and Wait will block.
 *
 * To wake up the waiter, you must first set a global flag or something
 * else that the main loop tests in the "if (work to do)" part, and call
 * SetLatch *after* that.

Any protocol of that sort has *obviously* got a race condition between changes of the latch state and changes of the separate flag state, if run on a weak-memory-ordering machine.

At least on the hardware I'm testing, it seems that the key requirement for a failure to be seen is that there's a lot of contention for the cache line containing the flag, but not for the cache line containing the latch. This allows the write to the flag to take longer to propagate to main memory than the write to the latch. I could not get it to fail until I added the extra shared-memory traffic in the delay loop around line 175. This probably also explains why it doesn't fail with small numbers of workers --- you need enough contending CPUs to saturate the memory hardware. (On Red Hat's test machines, 7 or 8 workers would reliably fail within a few minutes, but 6 or fewer didn't always.)

I'm not sure whether we need to do anything about this for 9.1; it may be that none of the existing uses of latches are subject to the kind of contention that would create a risk. But I'm entirely convinced that we need to insert some memory-barrier instructions into the latch code before we expand our uses of latches much more.
Since we were already finding that memory barriers will be needed for performance reasons elsewhere, I think it's time to bite the bullet and develop that infrastructure.

			regards, tom lane
/*
 * Simple test program to exercise SetLatch/WaitLatch.
 *
 * This just passes a "token" flag around among some number of worker
 * processes.
 *
 * To use: compile this into a .so, then execute
 *
 *		CREATE FUNCTION start_worker() RETURNS void AS '/path/to/.so' LANGUAGE c;
 *
 * Then, for each of up to MAX_WORKERS sessions, execute
 *
 *		psql -c "SELECT start_worker()" dbname &
 *
 * About one session per physical CPU should be good.
 *
 * Watch the postmaster log for awhile to confirm nobody drops the ball.
 * If anyone does, all the processes will exit with a timeout failure
 * after about a minute. If you get bored, "pg_ctl stop -m immediate"
 * is the easiest way out.
 *
 * Be sure to restart the postmaster between attempts, since there's no
 * logic for re-initializing a workspace that's already present.
 *
 * Note: this will NOT work with the Windows implementation of latches,
 * since we are cheesy about how we initialize the shared memory. Doesn't
 * matter for now, since Windows doesn't run on any machines with weak
 * memory ordering anyway.
 */

#include "postgres.h"

#include "fmgr.h"
#include "miscadmin.h"
#include "storage/latch.h"
#include "storage/shmem.h"

PG_MODULE_MAGIC;

#define MAX_WORKERS 16

#define BIAS 500000				/* to reduce time-to-failure */

/*
 * Arrays have padding to try to ensure active values are in different cache
 * lines.
 */
typedef struct
{
	int			flag;
	int			pad2[4];
	int			junk;
	char		pad[333];
} TokStruct;

typedef struct
{
	Latch		latch;
	char		pad[333];
} LatchStruct;

typedef struct
{
	int			num_workers;
	TokStruct	flags[MAX_WORKERS];
	LatchStruct latches[MAX_WORKERS];
} MySharedSpace;

volatile long gsum = 0;

Datum		start_worker(PG_FUNCTION_ARGS);

PG_FUNCTION_INFO_V1(start_worker);

Datum
start_worker(PG_FUNCTION_ARGS)
{
	volatile MySharedSpace *workspace;
	bool		found;
	int			i;
	int			my_number;
	long		counter = 0;
	unsigned short xseed[3];

	/* Set up non-interlocked RNG */
	xseed[0] = xseed[1] = xseed[2] = 42;

	/* Create or access the shared data */
	workspace = (MySharedSpace *)
		ShmemInitStruct("Latch Testing", sizeof(MySharedSpace), &found);

	if (!found)
	{
		/* I'm the first, initialize */
		memset((void *) workspace, 0, sizeof(MySharedSpace));
		for (i = 0; i < MAX_WORKERS; i++)
			InitSharedLatch(&(workspace->latches[i].latch));
		/* Give the token to the first worker (ie, me) */
		workspace->flags[0].flag = 1;
	}

	/*
	 * Add self to the arrays. This assumes no two would-be workers do it
	 * concurrently, which should be OK if they're started manually. Note
	 * that we do NOT initialize flags[my_number], as that was set up
	 * correctly during the workspace creation above.
	 */
	my_number = workspace->num_workers;
	if (my_number >= MAX_WORKERS)
		elog(ERROR, "too many workers");
	OwnLatch(&(workspace->latches[my_number].latch));
	workspace->num_workers++;

	/*
	 * And play ball ...
	 */
	for (;;)
	{
		ResetLatch(&(workspace->latches[my_number].latch));

		/* If I have the token ... */
		if (workspace->flags[my_number].flag)
		{
			int			next_worker;
			int			cnt;

			/* delete it ... */
			workspace->flags[my_number].flag = 0;

			/*
			 * ... and pass it on to the next guy.
			 * Note we might fetch a stale value of num_workers here, but
			 * that's okay; it'll just mean the recently-added guy doesn't
			 * get the token during this cycle.
			 */
			next_worker = my_number + 1;
			if (next_worker >= workspace->num_workers)
				next_worker = 0;
			workspace->flags[next_worker].flag = 1;
			SetLatch(&(workspace->latches[next_worker].latch));

			/* Report that we're still alive every so often */
			counter++;
			if (counter % (256 * 1024) == 0)
				elog(LOG, "worker %d still alive after %ld cycles",
					 my_number, counter);

			CHECK_FOR_INTERRUPTS();

			/*
			 * Compute for a little while, hoping our latch will be set
			 * again before we wait. The delay is random with a range that
			 * grows slowly as we run.
			 *
			 * While we're at it, create a lot of r/w traffic on the
			 * TokStructs to try to saturate their cache lines. This is
			 * needed to cause other CPUs to not see flag updates promptly
			 * compared to latch updates.
			 */
			cnt = (counter + BIAS) / 64 +
				(int) (pg_erand48(xseed) * ((counter + BIAS) / 64));
			for (i = 0; i < cnt; i++)
			{
				int			k = i % MAX_WORKERS;

				gsum += workspace->flags[k].flag;
				workspace->flags[k].junk++;
			}
		}

		if (!(WaitLatch(&(workspace->latches[my_number].latch),
						WL_LATCH_SET | WL_TIMEOUT,
						60000000L) & WL_LATCH_SET))
			elog(ERROR, "WaitLatch timed out in worker %d after %ld cycles, with flag %d latch %d",
				 my_number, counter,
				 workspace->flags[my_number].flag,
				 workspace->latches[my_number].latch.is_set);
	}

	PG_RETURN_NULL();
}
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers