On Tue, Jul 15, 2014 at 3:58 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
> On Fri, Jun 27, 2014 at 11:51 AM, Alvaro Herrera <alvhe...@2ndquadrant.com>
> wrote:
>
>> Jeff Janes wrote:
>>
>> > This problem was initially fairly easy to reproduce, but since I
>> > started adding instrumentation specifically to catch it, it has become
>> > devilishly hard to reproduce.
>> >
>> > I think my next step will be to also log each of the values which goes
>> > into the complex if (...) expression that decides on the deletion.
>>
>> Could you please to reproduce it after updating to latest? I pushed
>> fixes that should close these issues. Maybe you want to remove the
>> instrumentation you added, to make failures more likely.
>>
>
> There are still some problems in 9.4, but I haven't been able to diagnose
> them and wanted to do more research on it. The announcement of upcoming
> back-branches for 9.3 spurred me to try it there, and I have problems with
> 9.3 (12c5bbdcbaa292b2a4b09d298786) as well. The move of truncation to the
> checkpoint seems to have made the problem easier to reproduce. On an 8
> core machine, this test fell over after about 20 minutes, which is much
> faster than it usually reproduces.
>
> This the error I get:
>
> 2084 UPDATE 2014-07-15 15:26:20.608 PDT:ERROR: could not access status of
> transaction 85837221
> 2084 UPDATE 2014-07-15 15:26:20.608 PDT:DETAIL: Could not open file
> "pg_multixact/members/14031": No such file or directory.
> 2084 UPDATE 2014-07-15 15:26:20.608 PDT:CONTEXT: SQL statement "SELECT 1
> FROM ONLY "public"."foo_parent" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR
> KEY SHARE OF x"
>
> The testing harness is attached as 3 patches that must be made to the test
> server, and 2 scripts. The script do.sh sets up the database (using fixed
> paths, so be careful) and then invokes count.pl in a loop to do the
> actual work.
>

Sorry, after a long time when I couldn't do much testing on this, I've now
been able to get back to it.

It looks like what is happening is that checkPoint.nextMultiOffset wraps
around from 2^32 to 0, even if 0 is still being used. At that point it
starts deleting member files that are still needed.

Is there some interlock which is supposed to prevent
checkPoint.nextMultiOffset from lapping itself? I haven't been able to
find it. It seems like the interlock applies only to MultiXid, not the
Offset.

Thanks,

Jeff
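
To make the lapping concern concrete, here is a minimal toy model in Python. It is only a sketch: it assumes the offset is a 32-bit unsigned counter that wraps modulo 2^32, and the names OFFSET_SPACE, advance, and would_lap are hypothetical, not taken from the PostgreSQL source. It shows the kind of check that would refuse an allocation before the counter overtakes the oldest offset still in use.

```python
# Toy model only: every name here is hypothetical, not PostgreSQL code.
OFFSET_SPACE = 2 ** 32  # assumption: offsets are 32-bit unsigned values


def advance(next_offset: int, nmembers: int) -> int:
    """Advance the counter by nmembers, wrapping modulo 2^32."""
    return (next_offset + nmembers) % OFFSET_SPACE


def would_lap(oldest_offset: int, next_offset: int, nmembers: int) -> bool:
    """Return True if allocating nmembers would overrun live offsets.

    The live range is [oldest_offset, next_offset) on a circle of
    size 2^32, so all distances are computed modulo OFFSET_SPACE.
    """
    in_use = (next_offset - oldest_offset) % OFFSET_SPACE
    free = OFFSET_SPACE - in_use
    # Keep one slot free so a full circle is never confused with empty.
    return nmembers >= free
```

For example, with the oldest live offset at 0 and the counter 10 short of 2^32, would_lap(0, 2**32 - 10, 20) is true, while advance(2**32 - 10, 20) silently returns 10, which is the wraparound described above.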