Tom Lane wrote:

Manfred Spraul <[EMAIL PROTECTED]> writes:

Tom Lane wrote:

Don't you have to put it in a specific place in the loop to make that
work?  If not, why not?

Rep;nop is just a short delay - that's all.

That view seems to me to be directly contradicted by this statement:

The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation in most situations, which greatly improves processor performance.

It's not apparent to me how a short delay translates into avoiding a
memory order violation (possibly some docs on what that means exactly
might help...). I suspect strongly that there needs to be some near
proximity between the PAUSE instruction and the lock-test instruction
for this to work as advertised. It would help if Intel were less coy
about what the instruction really does.

My guess: Pentium 4 cpu support something like 250 uops in flight - it will have a dozend of the spinlock loops in it's pipeline. When the spinlock is released, it must figure out which of the loops should get it, and gets lost. My guess is that rep;nop delays the cpu buy at least 100 cpu ticks, and thus the pipeline will be empty before it proceeds. I don't have a Pentium 4, and the HP testdrive is down. Someone around who could run my test app?

This instruction does not change the architectural state of the
processor (that is, it performs essentially a delaying noop

This can be rephrased as "we're not telling you what this instruction
really does, because its interesting effects are below the level of the
instruction set architecture". Great. How are we supposed to know
how to use it?

There was a w_spinlock.pdf document with reference code. google still finds it, but the links are dead :-(

I think a separate function is better than adding it into TAS: if it's part of tas, then it would automatically be included by every SpinLockAcquire call - unnecessary .text bloat.

Why do you think it's unnecessary? One thing that I find particularly vague in the quoted documentation is the statement that the PAUSE instruction is needed to avoid a delay when *exiting* the spin-wait loop. Doesn't this mean that a PAUSE is needed in the success path when the first TAS succeeds (i.e, the normal no-contention path)?


If not, why not? If so, does it go before or after the lock

Neither: somewhere in the failure path.

Also, if the principal effect is a "short delay", do we really need it
at all considering that our inner loop in s_lock is rather more than
an "xchgb" followed by a conditional branch?  There will be time for
the write queue to drain while we're incrementing and testing our
spin counter (which I trust is in a register...).

The reason I'm so full of questions is that I spent some time several
days ago looking at exactly this issue, and came away with only the
conclusion that I had to find some more-detailed documentation before
I could figure out what we should do about the spinlocks for Xeons.

I'll try to find some more docs and post links.

The 2nd thing I would change is to add a nonatomic test in the slow path: locked instructions generate lots of bus traffic, and that's a waste of resources.

Another question: regardless of the placement of rep;nop - 10% speedup means that the postgres spends far too much time in the spinlock code. I've looked at the oprofile dumps, and something like 1.2% of the total cpu time is spent it the TAS macro in LWLockAcquire. That's the hottest instruction in the whole profile, it eats more cpu cycles than the memcpy() calls that transfer data to/from kernel.
Is there an easy way find out which LWLock is contended?
 * skel.cpp. skeleton for rdtsc benchmarks
 * Copyright (C) 1999, 2001 by Manfred Spraul.
 *	All rights reserved except the rights granted by the GPL.
 * Redistribution of this file is permitted under the terms of the GNU 
 * General Public License (GPL) version 2 or later.
 * $Header: /pub/home/manfred/cvs-tree/timetest/rep_nop.cpp,v 1.1 2001/04/07 19:38:33 manfred Exp $

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <getopt.h>

// disable local interrupts during benchmark
#undef USE_CLI

// define a cache flushing function

#ifdef USE_CLI
#include <sys/io.h>
#define CLI	"cli\n\t"
#define STI	"sti\n\t"
#define CLI
#define STI
#define iopl(a)	do { } while(0)

// Intel recommends that a serializing instruction
// should be called before and after rdtsc.
// CPUID is a serializing instruction.
// ".align 128:" P 4 L2 cache line size
#define read_rdtsc_before(time)		\
	__asm__ __volatile__(		\
		".align 128\n\t"	\
		"xor %%eax,%%eax\n\t"	\
		CLI			\
		"cpuid\n\t"		\
		"rdtsc\n\t"		\
		"mov %%eax,(%0)\n\t"	\
		"mov %%edx,4(%0)\n\t"	\
		"xor %%eax,%%eax\n\t"	\
		"cpuid\n\t"		\
		: /* no output */	\
		: "S"(&time)		\
		: "eax", "ebx", "ecx", "edx", "memory")

#define read_rdtsc_after(time)		\
	__asm__ __volatile__(		\
		"xor %%eax,%%eax\n\t"	\
		"cpuid\n\t"		\
		"rdtsc\n\t"		\
		"mov %%eax,(%0)\n\t"	\
		"mov %%edx,4(%0)\n\t"	\
		"xor %%eax,%%eax\n\t"	\
		"cpuid\n\t"		\
		STI			\
		: /* no output */	\
		: "S"(&time)		\
		: "eax", "ebx", "ecx", "edx", "memory")

#define BUILD_TESTFNC(name, text, instructions) \
void name##_dummy(void)				\
{						\
	__asm__ __volatile__(			\
		".align 4096\n\t"		\
		"xor %%eax, %%eax\n\t"		\
		: : : "eax");			\
}						\
static unsigned long name##_best = 1024*1024*1024; \
static void name(void) \
{ \
	unsigned long long time; \
	unsigned long long time2; \
	read_rdtsc_before(time); \
	instructions; \
	read_rdtsc_after(time2); \
	if(time2-time < name##_best) { \
		printf( text ":\t%10Ld ticks; \n", \
			time2-time-zerotest_best); \
		name##_best = time2-time; \
	} \

void filler(void)
static int i = 3;
static int j;
	j = i*i;

#define DO_3(x) \
	do { x; x; x; } while(0)

#define DO_10(x) \
	do { x; x; x; x; x; x; x; x; x; x;} while(0)

#define DO_50(x) \
	do { DO_10(x); DO_10(x);DO_10(x); DO_10(x);DO_10(x);} while(0)

#define DO_T(y) do { \
	DO_3(filler()); \
	y; \
	DO_3(filler());} while(0)

#define DRAIN_SZ	(4*1024*1024)
int other[3*DRAIN_SZ] __attribute ((aligned (4096)));
static inline void drain_cache(void)
	int i;
	for(i=0;i<DRAIN_SZ;i++) other[DRAIN_SZ+i]=0;
	for(i=0;i<DRAIN_SZ;i++) if(other[DRAIN_SZ+i]!=0) break;
static inline void drain_cache(void)

#define DO_TEST(x) \
	do { \
		int i; \
		for(i=0;i<5000;i++) \
			x; \
	} while(0)


#define REP_NOP()	__asm__ __volatile__ ("rep;nop\n\t": : : "memory");
#define NOP()		__asm__ __volatile__ ("nop\n\t": : : "memory");

BUILD_TESTFNC(zerotest,"zerotest", DO_T((void)0));
BUILD_TESTFNC(rnop,"rep nop", DO_T(REP_NOP()));
BUILD_TESTFNC(nop,"nop", DO_T(NOP()));

int main()
	if(geteuid() == 0) {
		int res = nice(-20);
		if(res < 0) {
			return 1;
		printf("MOVETEST, reniced to (-20).\n");	
	} else
		printf("MOVETEST called by non-superuser, running with normal priority.\n");
	for(;;) {
	return 0;
Index: ./src/backend/storage/lmgr/s_lock.c
RCS file: /projects/cvsroot/pgsql-server/src/backend/storage/lmgr/s_lock.c,v
retrieving revision 1.22
diff -u -r1.22 s_lock.c
--- ./src/backend/storage/lmgr/s_lock.c 23 Dec 2003 22:15:07 -0000      1.22
+++ ./src/backend/storage/lmgr/s_lock.c 27 Dec 2003 10:28:54 -0000
@@ -76,7 +76,7 @@
         * The select() delays are measured in centiseconds (0.01 sec) because 10
         * msec is a common resolution limit at the OS level.
-#define SPINS_PER_DELAY                100
+#define SPINS_PER_DELAY                1000
 #define NUM_DELAYS                     1000
 #define MIN_DELAY_CSEC         1
 #define MAX_DELAY_CSEC         100
@@ -86,7 +86,7 @@
        int                     cur_delay = MIN_DELAY_CSEC;
        struct timeval delay;
-       while (TAS(lock))
+       while (!S_LOCK_FREE(lock) || TAS(lock))
                if (++spins > SPINS_PER_DELAY)
@@ -111,6 +111,7 @@
                        spins = 0;
+               CPU_DELAY();
Index: ./src/include/storage/s_lock.h
RCS file: /projects/cvsroot/pgsql-server/src/include/storage/s_lock.h,v
retrieving revision 1.123
diff -u -r1.123 s_lock.h
--- ./src/include/storage/s_lock.h      23 Dec 2003 22:15:07 -0000      1.123
+++ ./src/include/storage/s_lock.h      27 Dec 2003 10:28:55 -0000
@@ -52,6 +52,15 @@
  *     in assembly language to execute a hardware atomic-test-and-set
  *     instruction.  Equivalent OS-supplied mutex routines could be used too.
+ *     Additionally, a platform specific delay function can be defined:
+ *
+ *     void CPU_DELAY(void)
+ *             Notification that the cpu is executing a busy loop.
+ *
+ *     Some platforms need such an indication. One example are platforms
+ *     that implement SMT, i.e. multiple logical threads that share 
+ *     execution resources in one physical cpu.
+ *
  *     If no system-specific TAS() is available (ie, HAVE_SPINLOCKS is not
  *     defined), then we fall back on an emulation that uses SysV semaphores
  *     (see spin.c).  This emulation will be MUCH MUCH slower than a proper TAS()
@@ -115,6 +124,17 @@
        return (int) _res;
+#define HAS_CPU_DELAY
+#define CPU_DELAY()    cpu_delay()
+static __inline__ void
+       __asm__ __volatile__(
+               " rep; nop                      \n"
+               : : : "memory");
 #endif  /* __i386__ || __x86_64__ */
@@ -715,6 +735,9 @@
 #define TAS(lock)              tas(lock)
 #endif  /* TAS */
+#ifndef HAS_CPU_DELAY
+#define CPU_DELAY()    do { } while(0)
  * Platform-independent out-of-line support routines
---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Reply via email to