On Thu, Feb 22, 2018 at 04:55:38PM +0900, Michael Paquier wrote:
> I am definitely ready to buy that it can be possible to have garbage
> being read the length field which can cause allocate_recordbuf to fail
> as that's the only code path in xlogreader.c which does such an
> allocation.  Still, it seems to me that we should first try to see if
> there are strange allocation patterns that happen and see if it is
> possible to have a reproduceable test case or a pattern which gives us
> confidence that we are on the right track.  One idea I have to
> monitor those allocations like the following:
> --- a/src/backend/access/transam/xlogreader.c
> +++ b/src/backend/access/transam/xlogreader.c
> @@ -162,6 +162,10 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
>     newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
>     newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));
>
> +#ifndef FRONTEND
> +   elog(LOG, "Allocation for xlogreader increased to %u", newSize);
> +#endif
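For reference, the two sizing lines in that hunk can be checked in
isolation.  This is only a minimal sketch: recordbuf_size() is a
hypothetical helper name, not something in xlogreader.c, and it assumes
the default 8kB values for both BLCKSZ and XLOG_BLCKSZ (both are
configure-time options, so other builds would give other numbers).

```c
#include <stdint.h>

/* Assumed defaults; both are configure-time options in PostgreSQL. */
#define BLCKSZ      8192u
#define XLOG_BLCKSZ 8192u
#define Max(x, y)   ((x) > (y) ? (x) : (y))

/*
 * Hypothetical helper mirroring the two sizing lines shown in the hunk:
 * bump the requested length up to the next multiple of XLOG_BLCKSZ, then
 * enforce a floor of five blocks.  Note that a length which is already
 * an exact multiple still gains one extra block.
 */
static uint32_t
recordbuf_size(uint32_t reclength)
{
    uint32_t    newSize = reclength;

    newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
    newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));
    return newSize;
}
```

Under those assumptions, a 24-byte request gives 40960 bytes (5 WAL
pages), and a request of 100000 bytes gives 106496, so only record
lengths well beyond a few pages would push the logged value above the
floor.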

So, I have been playing a bit more with that and defined the following
strategy to see if it is possible to create inconsistencies:
- Use a primary and a standby.
- Set max_wal_size and min_wal_size to their minimum of 80MB so that
  segment recycling takes effect more quickly.
- Create a single table with a UUID column to increase the likelihood
  of random data in INSERT records and FPWs, and insert enough data to
  trigger a full WAL recycling.
- Every 5 seconds, insert a set of tuples into the table.  Using 110 to
  120 tuples generates enough data for a bit more than a full WAL page.
  Then restart the primary.  This causes the standby to catch up,
  normally with a streamed page that is not completely initialized, as
  it fetches the page in the middle.

With the monitoring mentioned in the quoted block above, I have let the
whole thing run for a couple of hours, but I have not been able to catch
any problems, except the usual "invalid record length at 0/XXX: wanted
24, got 0".  The allocation for recordbuf did not get higher than 40960
bytes either, which matches 5 WAL pages.

Another, evil, idea that I have on top of all those things is to
directly hexedit the WAL segment of the standby just at the limit where
it would receive a record from the primary, and insert garbage data into
it that would make the validation tests in xlogreader.c blow up for the
record allocation.
--
Michael