On Wed, Aug 19, 2009 at 9:49 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Jeff Janes <jeff.ja...@gmail.com> writes:
> > If I read the code correctly, the only thing that is irrevocable is
> > that it writes into
> > rdt->next, and if it saved an old copy of rdt first, then it could
> > revoke the changes just
> > by doing rdt_old->next=NULL.  If that were done, then I think this
> > code could be
> > moved out of the section holding the WALInsertLock.
>
> Hmm, I recall that the changes are ... or were ... more complex.
> The tricky case I think is where we have to go back and redo the
> block-backup decisions after discovering that the checkpoint REDO
> pointer has just moved.
>
> If you can get the work out of the WALInsertLock section for just a
> few more instructions, it would definitely be worth doing.
>

I've attached a patch which removes the iteration over the blocks to be
backed up from the critical section of XLogInsert.  Now those blocks are
looped over in only one piece of code, which both computes the CRC and builds
the linked list, rather than having parallel loops.

I've used an elog statement (not shown in the patch) to demonstrate that the
"goto begin;" after detecting the REDO race actually does get executed under a
standard workload (pgbench -c10).  Two to four of the 10 backends execute
that code path for each checkpoint on my single-CPU machine.  By doing a
kill -9 on a process, to simulate a crash, during the window after the goto
begin is exercised but before the precipitating checkpoint completes, I can
force it to use the written WAL records in recovery.  The database
automatically recovers and the results are self-consistent.

I cannot imagine any other races, rare events, or action at a distance that
could come into play with this code change, so I cannot think of anything
else to test at the moment.

I could not detect a speed difference with pgbench, but since I cannot get
pgbench to be XLogInsert-bound, that is not surprising.  Using the only
XLogInsert-bound test case I know of, parallel COPY into a skinny, unindexed
table, with 8 parallel copies on a 4 x dual-core x86_64 machine and fsync
turned off (to approximately simulate an SSD, which I do not have), I get a
speed improvement of 2-4% with the patch over unpatched head.  Maybe with
more CPUs the benefit would be greater.

That small improvement is probably not very attractive by itself; however, I
think the patch makes the overall code a bit cleaner, so it may be warranted
on that ground.  Indeed, my motivation for working on this is that I kept
beating my head against the complexity of the old code, and thought that
simplifying it would make future work easier.

Cheers,

Jeff
Index: xlog.c
===================================================================
RCS file: /home/jjanes/pgrepo/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.352
diff -c -r1.352 xlog.c
*** xlog.c	10 Sep 2009 09:42:10 -0000	1.352
--- xlog.c	10 Sep 2009 19:27:08 -0000
***************
*** 540,548 ****
  	bool		dtbuf_bkp[XLR_MAX_BKP_BLOCKS];
  	BkpBlock	dtbuf_xlg[XLR_MAX_BKP_BLOCKS];
  	XLogRecPtr	dtbuf_lsn[XLR_MAX_BKP_BLOCKS];
! 	XLogRecData dtbuf_rdt1[XLR_MAX_BKP_BLOCKS];
! 	XLogRecData dtbuf_rdt2[XLR_MAX_BKP_BLOCKS];
! 	XLogRecData dtbuf_rdt3[XLR_MAX_BKP_BLOCKS];
  	pg_crc32	rdata_crc;
  	uint32		len,
  				write_len;
--- 540,550 ----
  	bool		dtbuf_bkp[XLR_MAX_BKP_BLOCKS];
  	BkpBlock	dtbuf_xlg[XLR_MAX_BKP_BLOCKS];
  	XLogRecPtr	dtbuf_lsn[XLR_MAX_BKP_BLOCKS];
! 	XLogRecData dtbuf_rdt1[XLR_MAX_BKP_BLOCKS];	/*xlog header of backed up block*/
! 	XLogRecData dtbuf_rdt2[XLR_MAX_BKP_BLOCKS];	/*part of block before the hole*/
! 	XLogRecData dtbuf_rdt3[XLR_MAX_BKP_BLOCKS];	/*part of block after the hole*/
! 	XLogRecData dummy_node;	/* head node for back-up block chain*/
! 	XLogRecData *rdt2;	/* tail pointer for back-up block chain*/
  	pg_crc32	rdata_crc;
  	uint32		len,
  				write_len;
***************
*** 663,696 ****
  
  	/*
  	 * Now add the backup block headers and data into the CRC
  	 */
  	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
  	{
! 		if (dtbuf_bkp[i])
! 		{
! 			BkpBlock   *bkpb = &(dtbuf_xlg[i]);
! 			char	   *page;
  
! 			COMP_CRC32(rdata_crc,
! 					   (char *) bkpb,
! 					   sizeof(BkpBlock));
! 			page = (char *) BufferGetBlock(dtbuf[i]);
! 			if (bkpb->hole_length == 0)
! 			{
! 				COMP_CRC32(rdata_crc,
! 						   page,
! 						   BLCKSZ);
! 			}
! 			else
! 			{
! 				/* must skip the hole */
! 				COMP_CRC32(rdata_crc,
! 						   page,
! 						   bkpb->hole_offset);
  				COMP_CRC32(rdata_crc,
! 						   page + (bkpb->hole_offset + bkpb->hole_length),
! 						   BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
! 			}
  		}
  	}
  
--- 665,740 ----
  
  	/*
  	 * Now add the backup block headers and data into the CRC
+ 	 * Also make a separate chain of entries for the backup blocks.  
+ 	 * Once we know we do not need to repeat the process due to races,
+ 	 * the two chains are stitched together so that we  don't need 
+ 	 * to special-case them in the write loop.  At the exit of this
+ 	 * loop, write_len includes the backup block data.
+ 	 *
+ 	 * Also set the appropriate info bits to show which buffers were backed
+ 	 * up. The i'th XLR_SET_BKP_BLOCK bit corresponds to the i'th distinct
+ 	 * buffer value (ignoring InvalidBuffer) appearing in the rdata chain.
+ 	 * Doing this here is safe, for the same reason setting rdt->data = NULL
+ 	 * is safe
  	 */
+ 	rdt2 = &dummy_node;
+ 	dummy_node.next=NULL;
+ 	write_len = len;
  	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
  	{
! 		BkpBlock   *bkpb;
! 		char	   *page;
  
! 		if (!dtbuf_bkp[i]) continue;
! 
! 		info |= XLR_SET_BKP_BLOCK(i);
! 		bkpb = &(dtbuf_xlg[i]);
! 
! 		COMP_CRC32(rdata_crc,
! 				   (char *) bkpb,
! 				   sizeof(BkpBlock));
! 		rdt2->next = &(dtbuf_rdt1[i]);
! 		rdt2 = rdt2->next;
! 
! 		rdt2->data = (char *) bkpb;
! 		rdt2->len = sizeof(BkpBlock);
! 		write_len += sizeof(BkpBlock);
! 
! 		rdt2->next = &(dtbuf_rdt2[i]);
! 		rdt2 = rdt2->next;
! 
! 		page = (char *) BufferGetBlock(dtbuf[i]);
! 
! 		if (bkpb->hole_length == 0)
! 		{
  				COMP_CRC32(rdata_crc,
! 					   page,
! 					   BLCKSZ);
! 			rdt2->data = page;
! 			rdt2->len = BLCKSZ;
! 			write_len += BLCKSZ;
! 			rdt2->next = NULL;
! 		}
! 		else
! 		{
! 			/* must skip the hole */
! 			COMP_CRC32(rdata_crc,
! 					   page,
! 					   bkpb->hole_offset);
! 			COMP_CRC32(rdata_crc,
! 					   page + (bkpb->hole_offset + bkpb->hole_length),
! 					   BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
! 			rdt2->data = page;
! 			rdt2->len = bkpb->hole_offset;
! 			write_len += bkpb->hole_offset;
! 
! 			rdt2->next = &(dtbuf_rdt3[i]);
! 			rdt2 = rdt2->next;
! 
! 			rdt2->data = page + (bkpb->hole_offset + bkpb->hole_length);
! 			rdt2->len = BLCKSZ - (bkpb->hole_offset + bkpb->hole_length);
! 			write_len += rdt2->len;
! 			rdt2->next = NULL;
  		}
  	}
  
***************
*** 758,820 ****
  		goto begin;
  	}
  
! 	/*
! 	 * Make additional rdata chain entries for the backup blocks, so that we
! 	 * don't need to special-case them in the write loop.  Note that we have
! 	 * now irrevocably changed the input rdata chain.  At the exit of this
! 	 * loop, write_len includes the backup block data.
! 	 *
! 	 * Also set the appropriate info bits to show which buffers were backed
! 	 * up. The i'th XLR_SET_BKP_BLOCK bit corresponds to the i'th distinct
! 	 * buffer value (ignoring InvalidBuffer) appearing in the rdata chain.
! 	 */
! 	write_len = len;
! 	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
! 	{
! 		BkpBlock   *bkpb;
! 		char	   *page;
! 
! 		if (!dtbuf_bkp[i])
! 			continue;
! 
! 		info |= XLR_SET_BKP_BLOCK(i);
! 
! 		bkpb = &(dtbuf_xlg[i]);
! 		page = (char *) BufferGetBlock(dtbuf[i]);
! 
! 		rdt->next = &(dtbuf_rdt1[i]);
! 		rdt = rdt->next;
! 
! 		rdt->data = (char *) bkpb;
! 		rdt->len = sizeof(BkpBlock);
! 		write_len += sizeof(BkpBlock);
! 
! 		rdt->next = &(dtbuf_rdt2[i]);
! 		rdt = rdt->next;
! 
! 		if (bkpb->hole_length == 0)
! 		{
! 			rdt->data = page;
! 			rdt->len = BLCKSZ;
! 			write_len += BLCKSZ;
! 			rdt->next = NULL;
! 		}
! 		else
! 		{
! 			/* must skip the hole */
! 			rdt->data = page;
! 			rdt->len = bkpb->hole_offset;
! 			write_len += bkpb->hole_offset;
! 
! 			rdt->next = &(dtbuf_rdt3[i]);
! 			rdt = rdt->next;
! 
! 			rdt->data = page + (bkpb->hole_offset + bkpb->hole_length);
! 			rdt->len = BLCKSZ - (bkpb->hole_offset + bkpb->hole_length);
! 			write_len += rdt->len;
! 			rdt->next = NULL;
! 		}
! 	}
  
  	/*
  	 * If we backed up any full blocks and online backup is not in progress,
--- 802,809 ----
  		goto begin;
  	}
  
! 	/* Stitch the two chains together.  We have now irrevocably changed the rdata chain */
! 	rdt->next = dummy_node.next;
  
  	/*
  	 * If we backed up any full blocks and online backup is not in progress,
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
