Hi Tom,

Good info - it's the same in Solaris; the routine is uiomove (Sherry wrote it).
- Luke

Msg is shrt cuz m on ma treo

-----Original Message-----
From: Tom Lane [mailto:[EMAIL PROTECTED]
Sent: Monday, March 05, 2007 07:43 PM Eastern Standard Time
To: Mark Kirkwood
Cc: Pavan Deolasee; Gavin Sherry; Luke Lonergan; PGSQL Hackers; Doug Rady; Sherry Moore
Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant

Mark Kirkwood <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> But what I wanted to see was the curve of
>> elapsed time vs shared_buffers?
> ...
> Looks *very* similar.

Yup, thanks for checking.

I've been poking into this myself. I find that I can reproduce the behavior to some extent even with a slow disk drive (this machine is a dual 2.8GHz Xeon EM64T running Fedora Core 5; the dd-to-/dev/null test shows the disk read speed as 43MB/sec or so). The test case is a several-gig table with no indexes, fully vacuumed so that neither VACUUM nor COUNT(*) has to do anything but seqscan as fast as it can. Given a *freshly started* postmaster, I see

regression=# show shared_buffers;
 shared_buffers
----------------
 128MB
(1 row)

regression=# \timing
Timing is on.
regression=# vacuum lineitem;
VACUUM
Time: 63652.333 ms
regression=# vacuum lineitem;
VACUUM
Time: 63562.303 ms
regression=# select count(*) from lineitem;
  count
----------
 10240000
(1 row)

Time: 63142.174 ms
regression=# vacuum lineitem;
VACUUM
Time: 61638.421 ms
regression=# vacuum lineitem;
VACUUM
Time: 61785.905 ms

I didn't show it here, but you can repeat the VACUUM all you want before the SELECT and its times are stable; and you can repeat it all you want after the SELECT and the times are stable but a couple of seconds lower. Restart the postmaster and it goes back to the slower behavior. (I'm keeping autovac off so it doesn't change the results.)

I decided to get down and dirty with oprofile, and soon found that the user-space CPU consumption is indistinguishable in both states:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 240000
GLOBAL_POWER_E...|
  samples|      %|
------------------
   141065 73.8193 /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
    26368 13.7984 /home/tgl/testversion/bin/postgres
    12765  6.6799 /libata
     2238  1.1711 /lib64/libc-2.4.so
     1112  0.5819 /dm_mod

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 240000
GLOBAL_POWER_E...|
  samples|      %|
------------------
   113177 70.2169 /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
    26284 16.3070 /home/tgl/testversion/bin/postgres
    12004  7.4475 /libata
     2093  1.2985 /lib64/libc-2.4.so
      996  0.6179 /dm_mod

Inside the kernel, there's only one routine that's significantly different:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 240000
samples  %        symbol name
57779    40.9591  copy_user_generic
18175    12.8841  __delay
3994      2.8313  _raw_spin_lock
2388      1.6928  put_page
2184      1.5482  mwait_idle
2083      1.4766  _raw_write_unlock
1909      1.3533  _raw_write_lock

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 240000
samples  %        symbol name
37437    33.0783  copy_user_generic
17891    15.8080  __delay
3372      2.9794  _raw_spin_lock
2218      1.9598  mwait_idle
2067      1.8263  _raw_write_unlock
1837      1.6231  _raw_write_lock
1531      1.3527  put_page

So that's part of the mystery: apparently copy_user_generic is coded in such a way that it's faster to copy into memory that's already in the processor cache. This strikes me as something that probably could/should be fixed in the kernel; I don't see any good reason why overwriting a whole cache line oughtn't be the same speed either way.
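To see the effect in isolation, here is a minimal userspace sketch of the same asymmetry (illustrative only; this is not the kernel routine, and the sizes are arbitrary). It copies the same total volume twice: once into a destination small enough to stay resident in cache, and once striding through a region much larger than L2, so every write lands in cold lines.

/*
 * cachecopy.c - userspace analogy to the copy_user_generic observation
 * above.  Illustrative sketch only: not the kernel code, and the sizes
 * are arbitrary.  Build: cc -O2 cachecopy.c -o cachecopy (add -lrt on
 * older glibc for clock_gettime).
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define CHUNK   (8 * 1024)           /* one 8K block per copy */
#define NCOPIES (200 * 1024)         /* ~1.6GB copied in total */
#define SMALL   (2 * CHUNK)          /* stays resident in L2 */
#define LARGE   (256 * 1024 * 1024)  /* far larger than L2 */

/* Copy NCOPIES chunks into dst, wrapping at dstsize; return seconds. */
static double
copy_loop(char *dst, size_t dstsize, const char *src)
{
    struct timespec t0, t1;
    size_t off = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < NCOPIES; i++)
    {
        memcpy(dst + off, src, CHUNK);
        off += CHUNK;
        if (off + CHUNK > dstsize)
            off = 0;                 /* wrap, rewriting the same lines */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int
main(void)
{
    char *src = malloc(CHUNK);
    char *small = malloc(SMALL);
    char *large = malloc(LARGE);

    if (src == NULL || small == NULL || large == NULL)
        return 1;
    memset(src, 'x', CHUNK);
    memset(small, 0, SMALL);         /* fault all pages in up front */
    memset(large, 0, LARGE);

    printf("cache-resident destination: %.2f s\n",
           copy_loop(small, SMALL, src));
    printf("cache-busting destination:  %.2f s\n",
           copy_loop(large, LARGE, src));
    return 0;
}

If copy_user_generic has the same property, the first pass should come out noticeably faster, which is the same asymmetry the two kernel profiles show.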
The other thing that was bothering me is why the SELECT changes VACUUM's behavior. A debugging printout added to ReadBuffer gave the answer: after postmaster start, we see things like

read block 353094 into buffer 11386
read block 353095 into buffer 11387
read block 353096 into buffer 11388
read block 353097 into buffer 11389
read block 353098 into buffer 11390
read block 353099 into buffer 11391
read block 353100 into buffer 11392
read block 353101 into buffer 11393
read block 353102 into buffer 11394
read block 353103 into buffer 11395

and after the SELECT it behaves like

read block 336761 into buffer 9403
read block 336762 into buffer 9402
read block 336763 into buffer 9403
read block 336764 into buffer 9402
read block 336765 into buffer 9403
read block 336766 into buffer 9402
read block 336767 into buffer 9403
read block 336768 into buffer 9402
read block 336769 into buffer 9403
read block 336770 into buffer 9402

What's going on is that VACUUM puts each buffer it's finished with on the tail of the freelist. In the post-SELECT state, there are just two buffers cycling through the freelist (not sure why not only one, but it doesn't matter), so the cache footprint is limited. But immediately after postmaster start, (nearly) all the buffers are in the freelist, so they all cycle through VACUUM's usage. In any real-world situation, of course, the freelist is going to be nearly empty most of the time, so I don't think this part is worth changing. (The recycling pattern is sketched in code after the message.)

So I now concede Luke's argument that this behavior is related to L2 cache usage. But the next question is whether we ought to change regular seqscan's behavior to mimic VACUUM. I'm very worried about pessimizing other cases if we do. ISTM there's a fairly clear case that this might be fixable at the kernel level. Moreover, the issue only arises because of the way the copy-from-kernel-buffer-to-userspace routine behaves, which means that if we ever move to a regime where we stop relying on OS caching and ask for direct DMA into our buffers, the advantage would disappear anyway. Lastly, I think the case where a win is possible is fairly narrow --- as soon as you've got anything but the one seqscan going on, it's not going to help.

regards, tom lane
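To make the freelist recycling described above concrete, here is a minimal sketch (a simplified illustration, not PostgreSQL's actual buffer manager; the structures and names are hypothetical):

/*
 * freelist.c - simplified illustration of the freelist recycling
 * described in the message above; not PostgreSQL's buffer manager,
 * and the names are hypothetical.  Buffers are taken from the head
 * of the freelist and, when the scan is done with one, pushed back
 * on the tail.  (No overflow/empty checks; fine for this fixed demo.)
 */
#include <stdio.h>

#define NBUFFERS 8

static int freelist[NBUFFERS];  /* ring holding free buffer ids */
static int head, tail;

/* Put a buffer back on the tail, as VACUUM does when finished with it. */
static void
freelist_push(int buf)
{
    freelist[tail] = buf;
    tail = (tail + 1) % NBUFFERS;
}

/* Take the next buffer from the head. */
static int
freelist_pop(void)
{
    int buf = freelist[head];
    head = (head + 1) % NBUFFERS;
    return buf;
}

/* Scan-style loop: grab a buffer, read a block into it, give it back. */
static void
scan(int nblocks, long firstblock)
{
    for (int i = 0; i < nblocks; i++)
    {
        int buf = freelist_pop();
        printf("read block %ld into buffer %d\n", firstblock + i, buf);
        freelist_push(buf);
    }
}

int
main(void)
{
    /* Freshly started postmaster: every buffer is on the freelist,
     * so the scan cycles through all of them. */
    for (int b = 0; b < NBUFFERS; b++)
        freelist_push(b);
    printf("-- full freelist (fresh start)\n");
    scan(10, 353094);

    /* Post-SELECT state: only two buffers are free, so the scan
     * ping-pongs between them and the cache footprint stays small. */
    head = tail = 0;                /* reset to an empty list */
    freelist_push(9403);
    freelist_push(9402);
    printf("-- two-entry freelist (after the SELECT)\n");
    scan(10, 336761);
    return 0;
}

Run as-is, the first loop prints every buffer id in turn, while the second reproduces the 9403/9402 ping-pong seen in the trace above.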