Re: [PERFORM] Wierd context-switching issue on Xeon

2004-05-20 Thread Josh Berkus
Guys, Oh, you wanted a fix? That seems harder :-(. AFAICS we need a redesign that causes less load on the BufMgrLock. FWIW, we've been pursuing two routes of quick patch fixes. 1) Dave Cramer and I have been testing setting varying rates of spin_delay in an effort to find a sweet spot

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-05-20 Thread Tom Lane
Josh Berkus [EMAIL PROTECTED] writes: I'm really curious, BTW, about how all of Jan's changes to buffer usage in 7.5 affect this issue. Has anyone tested it on a recent snapshot? Won't help. (1) Theoretical argument: the problem case is select-only and touches few enough buffers that it need

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-05-19 Thread Bruce Momjian
Did we ever come to a conclusion about excessive SMP context switching under load? --- Dave Cramer wrote: Robert, The real question is does it help under real life circumstances ? Did you do the tests with Tom's sql

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-05-19 Thread Robert Creager
When grilled further on (Wed, 19 May 2004 21:20:20 -0400 (EDT)), Bruce Momjian [EMAIL PROTECTED] confessed: Did we ever come to a conclusion about excessive SMP context switching under load? I just figured out what was causing the problem on my system Monday. I'm using the pg_autovacuum

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-05-19 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes: Did we ever come to a conclusion about excessive SMP context switching under load? Yeah: it's bad. Oh, you wanted a fix? That seems harder :-(. AFAICS we need a redesign that causes less load on the BufMgrLock. However, the traditional solution to

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-05-19 Thread Tom Lane
Robert Creager [EMAIL PROTECTED] writes: I just figured out what was causing the problem on my system Monday. I'm using the pg_autovacuum daemon, and it was not vacuuming my db. Do you have the post-7.4.2 datatype fixes for pg_autovacuum? regards, tom lane

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-05-19 Thread Robert Creager
When grilled further on (Wed, 19 May 2004 22:42:26 -0400), Tom Lane [EMAIL PROTECTED] confessed: Robert Creager [EMAIL PROTECTED] writes: I just figured out what was causing the problem on my system Monday. I'm using the pg_autovacuum daemon, and it was not vacuuming my db. Do you have

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-05-19 Thread Tom Lane
Robert Creager [EMAIL PROTECTED] writes: Tom Lane [EMAIL PROTECTED] confessed: Do you have the post-7.4.2 datatype fixes for pg_autovacuum? No. I'm still running 7.4.1 w/associated contrib. I guess an upgrade is in order then. I'm currently downloading 7.4.2 to see what the change is that

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-05-19 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes: Tom Lane wrote: ... The SMP issue seems to be not with whether there is instantaneous contention for the locked datastructure, but with the cost of making it possible for processor B to acquire a lock recently held by processor A. I see. I don't

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-05-19 Thread Bruce Momjian
OK, added to TODO: * Investigate SMP context switching issues --- Tom Lane wrote: Bruce Momjian [EMAIL PROTECTED] writes: Tom Lane wrote: ... The SMP issue seems to be not with whether there is

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-05-19 Thread Matthew T. O'Connor
On Wed, 2004-05-19 at 21:59, Robert Creager wrote: When grilled further on (Wed, 19 May 2004 21:20:20 -0400 (EDT)), Bruce Momjian [EMAIL PROTECTED] confessed: Did we ever come to a conclusion about excessive SMP context switching under load? I just figured out what was causing the

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-05-19 Thread Christopher Browne
In an attempt to throw the authorities off his trail, [EMAIL PROTECTED] (Tom Lane) transmitted: ObQuote: Research is what I am doing when I don't know what I am doing. - attributed to Werner von Braun, but has anyone got a definitive reference?

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-05-02 Thread Robert Creager
Found some co-workers at work yesterday to load up my library... The sample period is 5 minutes long (vs 2 minutes previously):
Context switches      -    avg      max
Default 7.4.1 code    :  48784   107354
Default patch - 10    :  20400    28160
patch at 100          :  38574    85372
patch

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-05-01 Thread Dave Cramer
No, don't go away and be quiet. Keep testing, it may be that under normal operation the context switching goes up but under the conditions that you were seeing the high CS it may not be as bad. As others have mentioned the real solution to this is to rewrite the buffer management so that the lock

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-30 Thread Robert Creager
When grilled further on (Thu, 29 Apr 2004 11:21:51 -0700), Josh Berkus [EMAIL PROTECTED] confessed: spins_per_delay was not beneficial. Instead, try increasing them, one step at a time: (take baseline measurement at 100) 250 500 1000 1500 2000 3000 5000 ... until you find an

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-29 Thread ohp
], Dirk_Lutzebäck [EMAIL PROTECTED], [EMAIL PROTECTED], Tom Lane [EMAIL PROTECTED], Joe Conway [EMAIL PROTECTED], scott.marlowe [EMAIL PROTECTED], Bruce Momjian [EMAIL PROTECTED], [EMAIL PROTECTED], Neil Conway [EMAIL PROTECTED] Subject: Re: [PERFORM] Wierd context-switching issue

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-29 Thread Josh Berkus
Rob, I would like to see the same, as I have a system that exhibits the same behavior on a production db that's running 7.4.1. If you checked the thread follow-ups, you'd see that *decreasing* spins_per_delay was not beneficial. Instead, try increasing them, one step at a time: (take

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-28 Thread Robert Creager
When grilled further on (Wed, 21 Apr 2004 10:29:43 -0700), Josh Berkus [EMAIL PROTECTED] confessed: Dave, After some testing if you use the current head code for s_lock.c which has some mods in it to alleviate this situation, and change SPINS_PER_DELAY to 10 you can drastically reduce

Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-27 Thread Josh Berkus
Dave, Are you testing this with Tom's code, you need to do a baseline measurement with 10 and then increase it, you will still get lots of cs, but it will be less. No, that was just a test of 1000 straight up. Tom outlined a method, but I didn't see any code that would help me find a

Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-27 Thread Dave Cramer
Josh, I think you can safely increase by orders of magnitude here, instead of by +100, my wild ass guess is that the sweet spot is the spin time should be approximately the time it takes to consume the resource. So if you have a really fast machine then the spin count should be higher. Also you

Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-27 Thread Josh Berkus
Dave, But... you need a baseline first. A baseline on CS? I have that -- -Josh Berkus Aglio Database Solutions San Francisco ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-26 Thread Kenneth Marshall
On Wed, Apr 21, 2004 at 02:51:31PM -0400, Tom Lane wrote: The context swap storm is happening because of contention at the next level up (LWLocks rather than spinlocks). It could be an independent issue that just happens to be triggered by the same sort of access pattern. I put forward a

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-26 Thread Josh Berkus
Magus, It would be interesting to see what a locking implementation ala FUTEX style would give on a 2.6 kernel, as I understood it that would work cross process with some work. I'm working on testing a FUTEX patch, but am having some trouble with it. Will let you know the results

Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-26 Thread Josh Berkus
Dave, Yeah, I did some more testing myself, and actually get better numbers with increasing spins per delay to 1000, but my suspicion is that it is highly dependent on finding the right delay for the processor you are on. Well, it certainly didn't help here: procs

Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-26 Thread Dave Cramer
Are you testing this with Tom's code, you need to do a baseline measurement with 10 and then increase it, you will still get lots of cs, but it will be less. Dave On Mon, 2004-04-26 at 20:03, Josh Berkus wrote: Dave, Yeah, I did some more testing myself, and actually get better numbers

Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-25 Thread Andrew McMillan
On Thu, 2004-04-22 at 10:37 -0700, Josh Berkus wrote: Tom, The tricky part is that a slow adaptation rate means we can't have every backend figuring this out for itself --- the right value would have to be maintained globally, and I'm not sure how to do that without adding a lot of

Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-22 Thread Dave Cramer
Yeah, I did some more testing myself, and actually get better numbers with increasing spins per delay to 1000, but my suspicion is that it is highly dependent on finding the right delay for the processor you are on. My hypothesis is that if you spin approximately the same or more time than the

Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-22 Thread Dave Cramer
More data. On a dual Xeon with HTT enabled: I tried increasing the NUM_SPINS to 1000 and it works better.
NUM_SPINLOCKS   CS     ID    pgbench
100             250K   59%   230 TPS
1000            125K   55%   228 TPS
This is certainly heading in the right direction? Although it

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-22 Thread Tom Lane
Paul Tuckfield [EMAIL PROTECTED] writes: I used the taskset command: taskset 01 -p pid for backend of test_run.sql 1 taskset 01 -p pid for backend of test_run.sql 1 I guess that 0 and 1 are the two cores (pipelines? hyper-threads?) on the first Xeon processor in the box. AFAICT, what

Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-22 Thread Tom Lane
Dave Cramer [EMAIL PROTECTED] writes: My hypothesis is that if you spin approximately the same or more time than the average time it takes to get finished with the shared resource then this should reduce cs. The only thing we use spinlocks for nowadays is to protect LWLocks, so the average

Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-22 Thread Bruce Momjian
Josh Berkus wrote: Tom, Having to recompile to run on single- vs dual-processor machines doesn't seem like it would fly. Oh, I don't know. Many applications require compiling for a target architecture; SQL Server, for example, won't use a 2nd processor without re-installation. I'm

Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-22 Thread Josh Berkus
Tom, Having to recompile to run on single- vs dual-processor machines doesn't seem like it would fly. Oh, I don't know. Many applications require compiling for a target architecture; SQL Server, for example, won't use a 2nd processor without re-installation. I'm not sure about Oracle. It

Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-22 Thread Josh Berkus
Tom, The tricky part is that a slow adaptation rate means we can't have every backend figuring this out for itself --- the right value would have to be maintained globally, and I'm not sure how to do that without adding a lot of overhead. This may be a moot point, since you've stated that

Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-22 Thread Rod Taylor
On Thu, 2004-04-22 at 13:55, Tom Lane wrote: Josh Berkus [EMAIL PROTECTED] writes: This may be a moot point, since you've stated that changing the loop timing won't solve the problem, but what about making the test part of make? I don't think too many systems are going to change

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-22 Thread Anjan Dave
Lane Cc: [EMAIL PROTECTED]; Neil Conway Subject: Re: [PERFORM] Wierd context-switching issue on Xeon Anjan, Quad 2.0GHz XEON with highest load we have seen on the applications, DB performing great - Can you run Tom's test? It takes

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-21 Thread pginfo
Hi, Dual Xeon P4 2.8 linux RedHat AS 3 kernel 2.4.21-4-EL-smp 2 GB ram I can see the same problem:
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 0 96212 61056 172024 0 0

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-21 Thread ohp
context-switching issue on Xeon Here is a test case. To set up, run the test_setup.sql script once; then launch two copies of the test_run.sql script. (For those of you with more than two CPUs, see whether you need one per CPU to make trouble, or whether two test_runs are enough.) Check

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-21 Thread Dave Cramer
After some testing if you use the current head code for s_lock.c which has some mods in it to alleviate this situation, and change SPINS_PER_DELAY to 10 you can drastically reduce the cs with tom's test. I am seeing a slight degradation in throughput using pgbench -c 10 -t 1000 but it might be

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-21 Thread Josh Berkus
Dave, After some testing if you use the current head code for s_lock.c which has some mods in it to alleviate this situation, and change SPINS_PER_DELAY to 10 you can drastically reduce the cs with tom's test. I am seeing a slight degradation in throughput using pgbench -c 10 -t 1000 but it

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-21 Thread Paul Tuckfield
Dave: Why would test and set increase context switches: Note that it *does not increase* context switches when the two threads are on the two cores of a single Xeon processor. (use taskset to force affinity on linux) Scenario: If the two test and set processes are testing and setting the same

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-21 Thread Tom Lane
Paul Tuckfield [EMAIL PROTECTED] writes: I wonder do the threads stall so badly when pinging cache lines back and forth, that the kernel sees it as an opportunity to put the process to sleep? or do these worst case misses cause an interrupt? No; AFAICS the kernel could not even be aware of

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-21 Thread Dave Cramer
FYI, I am doing my testing on non-hyperthreading dual Athlons. Also, the test and set is attempting to set the same resource, and not simply a bit. It's really a lock;xchg in assembler. Also we are using the PAUSE mnemonic, so we should not be seeing any cache coherency issues, as the cache

Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-21 Thread Dave Cramer
attached. -- Dave Cramer 519 939 0336 ICQ # 14675561 Index: backend/storage/lmgr/s_lock.c === RCS file: /usr/local/cvs/pgsql-server/src/backend/storage/lmgr/s_lock.c,v retrieving revision 1.16 diff -c -r1.16 s_lock.c ***

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-21 Thread Tom Lane
Kenneth Marshall [EMAIL PROTECTED] writes: If the context swap storm derives from LWLock contention, maybe using a random order to assign buffer locks in buf_init.c would prevent simple adjacency of buffer allocation to cause the storm. Good try, but no cigar ;-). The test cases I've been

Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-21 Thread Tom Lane
Dave Cramer [EMAIL PROTECTED] writes: diff -c -r1.16 s_lock.c *** backend/storage/lmgr/s_lock.c 8 Aug 2003 21:42:00 - 1.16 --- backend/storage/lmgr/s_lock.c 21 Apr 2004 20:27:34 - *** *** 76,82 * The select() delays are measured in centiseconds

Re: [PERFORM] Wierd context-switching issue on Xeon patch for 7.4.1

2004-04-21 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes: For BSDOS it has: #if (CLIENT_OS == OS_FREEBSD) || (CLIENT_OS == OS_BSDOS) || \ (CLIENT_OS == OS_OPENBSD) || (CLIENT_OS == OS_NETBSD) { /* comment out if inappropriate for your *bsd - cyp (25/may/1999) */ int ncpus; size_t len =
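The fragment quoted above detects the CPU count at run time, which bears on the "recompile for single vs dual processor" question raised elsewhere in the thread. On a uniprocessor, spinning cannot help, because the lock holder cannot run while the waiter spins. A minimal sketch of a run-time choice, assuming a system where sysconf() accepts _SC_NPROCESSORS_ONLN (Linux and most BSDs do). The function names and the policy are hypothetical and are not taken from s_lock.c; 100 is only the baseline value quoted in the thread.

```c
#include <assert.h>
#include <unistd.h>

#define DEFAULT_SPINS_PER_DELAY 100   /* baseline value quoted in the thread */

/* Number of online CPUs, or 1 if the system will not report it. */
static long online_cpus(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return (n < 1) ? 1 : n;
}

/*
 * Hypothetical policy, for illustration only: with a single CPU,
 * sleep after the first failed attempt; otherwise use the baseline.
 */
static int choose_spins_per_delay(long ncpus)
{
    assert(ncpus >= 1);
    return (ncpus == 1) ? 1 : DEFAULT_SPINS_PER_DELAY;
}
```

A caller would evaluate choose_spins_per_delay(online_cpus()) once at startup. Whether any multiprocessor value actually helps is exactly what the measurements in this thread leave open.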

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-20 Thread ohp
[EMAIL PROTECTED], Bruce Momjian [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], Neil Conway [EMAIL PROTECTED] Subject: Re: [PERFORM] Wierd context-switching issue on Xeon I wrote: Here is a test case. Hmmm ... I've been able to reproduce the CS storm on a dual Athlon

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-20 Thread Dave Cramer
Dual Athlon With one process running 30 cs/second with two process running 15000 cs/second Dave On Tue, 2004-04-20 at 08:46, Jeff wrote: On Apr 19, 2004, at 8:01 PM, Tom Lane wrote: [test case] Quad P3-700Mhz, ServerWorks, pg 7.4.2 - 1 process: 10-30 cs / second

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-20 Thread Matt Clark
; [EMAIL PROTECTED]; [EMAIL PROTECTED]; Neil Conway Subject: Re: [PERFORM] Wierd context-switching issue on Xeon Here is a test case. To set up, run the test_setup.sql script once; then launch two copies of the test_run.sql script. (For those of you with more than two CPUs, see whether

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-20 Thread Sven Geisler
PROTECTED] Sent: Sunday, April 18, 2004 11:47 PM Subject: Re: [PERFORM] Wierd context-switching issue on Xeon After some further digging I think I'm starting to understand what's up here, and the really fundamental answer is that a multi-CPU Xeon MP box sucks for running Postgres. I did

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-20 Thread Dirk Lutzebäck
Dirk Lutzebaeck wrote: c) Dual XEON DP, non-bigmem, HT on, E7500 Intel chipset (Supermicro) performs well and I could not observe context switch peaks here (one user active), almost no extra semop calls Did Tom's test here: with 2 processes I'll reach 200k+ CS with peaks to 300k CS. Bummer..

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-20 Thread Paul Tuckfield
I tried to test how this is related to cache coherency, by forcing affinity of the two test_run.sql processes to the two cores (pipelines? threads) of a single hyperthreaded xeon processor in an smp xeon box. When the processes are allowed to run on distinct chips in the smp box, the CS storm

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-20 Thread Paul Tuckfield
Ooops, what I meant to say was that 2 threads bound to one (hyperthreaded) cpu does *NOT* cause the storm, even on an smp xeon. Therefore, the context switches may be a result of cache coherency related delays. (2 threads on one hyperthreaded cpu presumably have tightly coupled L1/L2 cache.)

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-20 Thread Josh Berkus
Dirk, Tom, OK, off IRC, I have the following reports: Linux 2.4.21 or 2.4.20 on dual Pentium III : problem verified Linux 2.4.21 or 2.4.20 on dual Pentium II : problem cannot be reproduced Solaris 2.6 on 6 cpu e4500 (using 8 processes) : problem not reproduced -- -Josh Berkus Aglio Database

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-20 Thread J. Andrew Rogers
I verified problem on a Dual Opteron server. I temporarily killed the normal load, so the server was largely idle when the test was run. Hardware: 2x Opteron 242 Rioworks HDAMA server board 4Gb RAM OS Kernel: RedHat9 + XFS 1 proc: 10-15 cs/sec 2 proc: 400,000-420,000 cs/sec j. andrew
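The cs/sec figures reported in this thread come from vmstat. On Linux the same number can be derived from the cumulative "ctxt" counter in /proc/stat, sampled twice. A rough sketch, with hypothetical helper names that are not part of any posted patch:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse the cumulative context-switch count from /proc/stat-style text.
 * Returns -1 if no line starts with "ctxt".
 */
static long long parse_ctxt(const char *stat_text)
{
    const char *p = stat_text;

    while (p != NULL && *p != '\0') {
        unsigned long long value;

        if (sscanf(p, "ctxt %llu", &value) == 1)
            return (long long) value;
        p = strchr(p, '\n');          /* advance to the next line */
        if (p != NULL)
            p++;
    }
    return -1;
}

/* Read the counter from a file such as /proc/stat; -1 on failure. */
static long long read_ctxt(const char *path)
{
    char line[256];
    long long result = -1;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    while (result < 0 && fgets(line, sizeof line, f) != NULL)
        result = parse_ctxt(line);
    fclose(f);
    return result;
}

/* Context switches per second between two samples of the counter. */
static double cs_per_second(long long before, long long after, double seconds)
{
    assert(seconds > 0.0);
    return (double) (after - before) / seconds;
}
```

Sampling read_ctxt("/proc/stat") before and after a run with one test_run.sql process, then again with two, should give numbers comparable to the reports here.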

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-20 Thread Anjan Dave
To: Tom Lane; Josh Berkus Cc: [EMAIL PROTECTED]; Neil Conway Subject: Re: [PERFORM] Wierd context-switching issue on Xeon Dirk Lutzebaeck wrote: c) Dual XEON DP, non-bigmem, HT on, E7500 Intel chipset (Supermicro) performs well and I could not observe context switch peaks here (one user

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-20 Thread Bruce Momjian
Dirk Lutzebäck wrote: Dirk Lutzebaeck wrote: c) Dual XEON DP, non-bigmem, HT on, E7500 Intel chipset (Supermicro) performs well and I could not observe context switch peaks here (one user active), almost no extra semop calls Did Tom's test here: with 2 processes I'll reach 200k+ CS

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-20 Thread Josh Berkus
Anjan, Quad 2.0GHz XEON with highest load we have seen on the applications, DB performing great - Can you run Tom's test? It takes a particular pattern of data access to reproduce the issue. -- Josh Berkus Aglio Database Solutions San Francisco ---(end of

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-20 Thread Dave Cramer
I modified the code in s_lock.c to remove the spins:
#define SPINS_PER_DELAY 1
and it doesn't exhibit the behaviour. This effectively changes the code to:
while(TAS(lock)) select(1); // 10ms
Can anyone explain why executing TAS 100 times would increase context switches?
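For readers without the source at hand, the spin-then-sleep pattern under discussion can be sketched as follows. This is an illustration only, not the s_lock.c code: it uses C11 atomic_flag in place of the platform TAS() macro, and the names delay_10ms, spin_acquire, spin_release and the max_delays give-up bound are assumptions added so the loop can be exercised in isolation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/select.h>

/* Sleep for roughly 10 ms, using select() with no descriptors as a delay. */
static void delay_10ms(void)
{
    struct timeval tv = { .tv_sec = 0, .tv_usec = 10000 };

    select(0, NULL, NULL, NULL, &tv);
}

/*
 * Try to take the lock, retrying test-and-set up to spins_per_delay times
 * between sleeps. Returns the number of sleeps taken before the lock was
 * acquired, or -1 if max_delays sleeps were not enough.
 */
static int spin_acquire(atomic_flag *lock, int spins_per_delay, int max_delays)
{
    int spins = 0;
    int delays = 0;

    assert(spins_per_delay > 0);

    while (atomic_flag_test_and_set(lock)) {    /* the TAS step */
        if (++spins >= spins_per_delay) {
            if (delays >= max_delays)
                return -1;                      /* give up (sketch only) */
            delay_10ms();
            delays++;
            spins = 0;
        }
    }
    return delays;
}

/* Release the lock so another waiter's test-and-set can succeed. */
static void spin_release(atomic_flag *lock)
{
    atomic_flag_clear(lock);
}
```

With spins_per_delay set to 1 the loop sleeps after every failed attempt, which corresponds to the while(TAS(lock)) select(1) form quoted above. Larger values keep the waiter on the CPU longer before it yields.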

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-20 Thread Joe Conway
Joe Conway wrote: In isolation, test_run.sql should do essentially no syscalls at all once it's past the initial ramp-up. On a machine that's functioning per expectations, multiple copies of test_run show a relatively low rate of semop() calls --- a few per second, at most --- and maybe a

Re: RESOLVED: Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-19 Thread Dirk Lutzebäck
Josh, I cannot reproduce the excessive semop() on a Dual XEON DP on a non-bigmem kernel, HT on. Interesting to know if the problem is related to XEON MP (as Tom wrote) or bigmem. Josh Berkus wrote: Dirk, I'm not sure if this semop() problem is still an issue but the database behaves a bit

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-19 Thread Anjan Dave
was mentioned... Thanks, Anjan -Original Message- From: Greg Stark [mailto:[EMAIL PROTECTED] Sent: Sun 4/18/2004 8:40 PM To: Tom Lane Cc: [EMAIL PROTECTED]; Josh Berkus; [EMAIL PROTECTED]; Neil Conway Subject: Re: [PERFORM] Wierd context

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-19 Thread J. Andrew Rogers
I decided to check the context-switching behavior here for baseline since we have a rather diverse set of postgres server hardware, though nothing using Xeon MP that is also running a postgres instance, and everything looks normal under load. Some platforms are better than others, but nothing is

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-19 Thread Tom Lane
Josh Berkus [EMAIL PROTECTED] writes: The other thing I'd like your comment on, Tom, is that Dirk appears to have reported that when he installed a non-bigmem kernel, the issue went away. Dirk, is this correct? I'd be really surprised if that had anything to do with it. AFAIR Dirk's test

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-19 Thread Joe Conway
scott.marlowe wrote: On Mon, 19 Apr 2004, Bruce Momjian wrote: I have BSD on a SuperMicro dual Xeon, so if folks want another hardware/OS combination to test, I can give out logins to my machine. I can probably do some nighttime testing on a dual 2800MHz non-MP Xeon machine as well. It's a Dell

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-19 Thread Josh Berkus
Joe, I've got a quad 2.8Ghz MP Xeon (IBM x445) that I could test on. Does anyone have a test set that can reliably reproduce the problem? Unfortunately we can't seem to come up with one. So far we have 2 machines that exhibit the issue, and their databases are highly confidential (State

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-19 Thread Tom Lane
Josh Berkus [EMAIL PROTECTED] writes: I've got a quad 2.8Ghz MP Xeon (IBM x445) that I could test on. Does anyone have a test set that can reliably reproduce the problem? Unfortunately we can't seem to come up with one. It does seem to require a database which is in the many GB ( 10GB), and

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-19 Thread Tom Lane
Here is a test case. To set up, run the test_setup.sql script once; then launch two copies of the test_run.sql script. (For those of you with more than two CPUs, see whether you need one per CPU to make trouble, or whether two test_runs are enough.) Check that you get a

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-19 Thread Tom Lane
I wrote: Here is a test case. Hmmm ... I've been able to reproduce the CS storm on a dual Athlon, which seems to pretty much let the Xeon per se off the hook. Anybody got a multiple Opteron to try? Totally non-Intel CPUs? It would be interesting to see results with non-Linux kernels, too.

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-19 Thread Joe Conway
Tom Lane wrote: Here is a test case. To set up, run the test_setup.sql script once; then launch two copies of the test_run.sql script. (For those of you with more than two CPUs, see whether you need one per CPU to make trouble, or whether two test_runs are enough.) Check that you get a

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-19 Thread Robert Creager
When grilled further on (Mon, 19 Apr 2004 20:53:09 -0400), Tom Lane [EMAIL PROTECTED] confessed: I wrote: Here is a test case. Hmmm ... I've been able to reproduce the CS storm on a dual Athlon, which seems to pretty much let the Xeon per se off the hook. Anybody got a multiple Opteron

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-19 Thread jelle
Same problem with dual 1Ghz P3's running Postgres 7.4.2, linux 2.4.x, and 2GB ram, under load, with long transactions (i.e. 1 cannot serialize rollback per minute). 200K was the worst observed with vmstat. Finally moved DB to a single xeon box. ---(end of

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-18 Thread Tom Lane
After some further digging I think I'm starting to understand what's up here, and the really fundamental answer is that a multi-CPU Xeon MP box sucks for running Postgres. I did a bunch of oprofile measurements on a machine belonging to one of Josh's clients, using a test case that involved heavy

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-18 Thread Dave Cramer
So the kernel/OS is irrelevant here ? this happens on any dual xeon? What about hyperthreading does it still happen if HTT is turned off ? Dave On Sun, 2004-04-18 at 17:47, Tom Lane wrote: After some further digging I think I'm starting to understand what's up here, and the really

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-18 Thread Greg Stark
Tom Lane [EMAIL PROTECTED] writes: So in the short term I think we have to tell people that Xeon MP is not the most desirable SMP platform to run Postgres on. (Josh thinks that the specific motherboard chipset being used in these machines might share some of the blame too. I don't have any

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-18 Thread Tom Lane
Dave Cramer [EMAIL PROTECTED] writes: So the kernel/OS is irrelevant here ? this happens on any dual xeon? I believe so. The context-switch behavior might possibly be a little more pleasant on other kernels, but the underlying spinlock problem is not dependent on the kernel. What about

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-18 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes: There's nothing about the way Postgres spinlocks are coded that affects this? No. AFAICS our spinlock sequences are pretty much equivalent to the way the Linux kernel codes its spinlocks, so there's no deep dark knowledge to be mined there. We could

Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-18 Thread Tom Lane
What about hyperthreading does it still happen if HTT is turned off ? The problem comes from keeping the caches synchronized between multiple physical CPUs. AFAICS enabling HTT wouldn't make it worse, because a hyperthreaded processor still only has one cache. Also, I forgot to say that

RESOLVED: Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-16 Thread Dirk Lutzebäck
Tom, Josh, I think we have the problem resolved after I found the following note from Tom: A large number of semops may mean that you have excessive contention on some lockable resource, but I don't have enough info to guess what resource. This was the key to look at: we were missing all

Re: RESOLVED: Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-16 Thread Tom Lane
Dirk Lutzebäck [EMAIL PROTECTED] writes: This was the key to look at: we were missing all indices on table which is used heavily and does lots of locking. After recreating the missing indices the production system performed normal. No, more excessive semop() calls, load

Re: [PERFORM] Wierd context-switching issue on Xeon

2003-11-25 Thread Josh Berkus
Tom, Strictly a WAG ... but what this sounds like to me is disastrously bad behavior of the spinlock code under heavy contention. We thought we'd fixed the spinlock code for SMP machines awhile ago, but maybe hyperthreading opens some new vistas for misbehavior ... Yeah, I thought of that