On August 19, 2014 10:24:20 PM CEST, Jeff Janes <jeff.ja...@gmail.com> wrote:
>On Tue, Jul 15, 2014 at 3:58 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>
>> On Fri, Jun 27, 2014 at 11:51 AM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
>>
>>> Jeff Janes wrote:
>>>
>>> > This problem was initially fairly easy to reproduce, but since I
>>> > started adding instrumentation specifically to catch it, it has become
>>> > devilishly hard to reproduce.
>>> >
>>> > I think my next step will be to also log each of the values which goes
>>> > into the complex if (...) expression that decides on the deletion.
>>>
>>> Could you please try to reproduce it after updating to latest? I pushed
>>> fixes that should close these issues. Maybe you want to remove the
>>> instrumentation you added, to make failures more likely.
>>>
>>
>> There are still some problems in 9.4, but I haven't been able to diagnose
>> them and wanted to do more research on it. The announcement of upcoming
>> back-branches for 9.3 spurred me to try it there, and I have problems with
>> 9.3 (12c5bbdcbaa292b2a4b09d298786) as well. The move of truncation to the
>> checkpoint seems to have made the problem easier to reproduce. On an 8
>> core machine, this test fell over after about 20 minutes, which is much
>> faster than it usually reproduces.
>>
>> This is the error I get:
>>
>> 2084 UPDATE 2014-07-15 15:26:20.608 PDT:ERROR: could not access status of transaction 85837221
>> 2084 UPDATE 2014-07-15 15:26:20.608 PDT:DETAIL: Could not open file "pg_multixact/members/14031": No such file or directory.
>> 2084 UPDATE 2014-07-15 15:26:20.608 PDT:CONTEXT: SQL statement "SELECT 1 FROM ONLY "public"."foo_parent" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
>>
>> The testing harness is attached as 3 patches that must be made to the test
>> server, and 2 scripts. The script do.sh sets up the database (using fixed
>> paths, so be careful) and then invokes count.pl in a loop to do the
>> actual work.
>>
>
>Sorry, after a long time when I couldn't do much testing on this, I've now
>been able to get back to it.
>
>It looks like what is happening is that checkPoint.nextMultiOffset wraps
>around from 2^32 to 0, even if 0 is still being used. At that point it
>starts deleting member files that are still needed.
>
>Is there some interlock which is supposed to prevent
>checkPoint.nextMultiOffset from lapping itself? I haven't been able to find
>it. It seems like the interlock applies only to MultiXid, not the Offset.

There is none (and there never has been one either). I've complained about it a couple of times but nobody, me included, had time and energy to fix that :(

Andres

---
Please excuse brevity and formatting - I am writing this on my mobile phone.
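
To make the lapping problem concrete, here is a toy Python model of a 32-bit offset counter. It is only a sketch and not PostgreSQL code: the names `advance`, `would_overrun`, and `oldest_offset` are hypothetical, and the guard shown is an assumption about what an interlock on the offset could look like, not something that exists in the server.

```python
# Toy model only; nothing here is taken from the PostgreSQL source.
OFFSET_SPACE = 2**32  # offsets are 32-bit unsigned, so they live in [0, 2^32)


def advance(next_offset: int, nmembers: int) -> int:
    """Allocate nmembers slots; the counter silently wraps modulo 2^32."""
    return (next_offset + nmembers) % OFFSET_SPACE


def would_overrun(oldest_offset: int, next_offset: int, nmembers: int) -> bool:
    """Hypothetical guard: True if allocating nmembers slots would reach
    or pass the oldest offset that is still in use."""
    used = (next_offset - oldest_offset) % OFFSET_SPACE
    free = OFFSET_SPACE - used
    return nmembers >= free


# Offset 0 is still in use, and the counter is two slots short of 2^32.
next_offset = OFFSET_SPACE - 2
print(advance(next_offset, 5))           # counter has wrapped past 0
print(would_overrun(0, next_offset, 5))  # the guard would have refused
```

Without a check of this kind, the counter wraps past offsets that are still live, which matches the symptom of member files being removed while still needed.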