On Monday, April 1, 2013, Jeff Davis wrote:

> On Mon, 2013-04-01 at 10:37 -0700, Jeff Janes wrote:
> > Over 10,000 cycles of crash and recovery, I encountered two cases of
> > checksum failures after recovery, example:
> >
> > 14264 SELECT 2013-03-28 13:08:38.980 PDT:WARNING: page verification
> > failed, calculated checksum 7017 but expected 1098
> > 14264 SELECT 2013-03-28 13:08:38.980 PDT:ERROR: invalid page in block
> > 77 of relation base/16384/2088965
> >
> > 14264 SELECT 2013-03-28 13:08:38.980 PDT:STATEMENT: select sum(count)
> > from foo
>
> It would be nice to know whether that's an index or a heap page.

It is a heap page for the table jjanes.public.foo.

> > In both cases, the bad block (77 in this case) is the same block that
> > was intentionally partially-written during the "crash". However, that
> > block should have been restored from the WAL FPW, so its fragmented
> > nature should not have been present in order to be detected. Any idea
> > what is going on?
>
> Not right now. My primary suspect is what's going on in
> visibilitymap_set() and heap_xlog_visible(), which is more complex than
> some of the other code. That would require some VACUUM activity, which
> isn't in your workload -- do you think autovacuum may kick in sometimes?

Yes, a modification to my test harness that I failed to mention is that it
now sleeps for 2 minutes after every 100 rounds of crash/recovery,
specifically so that autovac has a chance to kick in and run to completion.
I made that change so as to avoid wrap-around shut-downs on long running
tests.

However "foo" is truncated at the beginning of every test, so I don't think
this would be relevant to that table, as any poisoned fruits of the autovac
would be discarded with the truncation.

Cheers,

Jeff