I've gotten some insight into the stats collection issues by monitoring Bruce's test case with oprofile (http://oprofile.sourceforge.net/).
Test conditions: PG CVS HEAD, built with --enable-debug --disable-cassert
(debug symbols are needed for oprofile), on current Fedora Core 5 (Linux
kernel 2.6.16-1.2133_FC5 #1 SMP, gcc 4.1.1), running on a dual Xeon EM64T
with hyperthreading enabled.  All configuration parameters are at their
defaults except for stats_command_string.

I extended the test script to be 240000 "SELECT 1;" commands so that it
would run long enough to get a trustworthy profile.  In the profile data
below, I cut off each listing where the percentages fell below about 1%
of global runtime.

With stats_command_string off:

21.7 sec to run script (with oprofile off)
22.7 sec to run script (with oprofile active)

These numbers stay pretty consistent over repeated trials.

oprofile global report:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.03 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 240000
GLOBAL_POWER_E...|
  samples|      %|
------------------
   156544 47.2570 /usr/lib/debug/lib/modules/2.6.16-1.2133_FC5/vmlinux
    84114 25.3921 /home/tgl/testversion/bin/postgres
    63873 19.2818 /lib64/libc-2.4.so
    13621  4.1119 /home/tgl/testversion/bin/psql
    10101  3.0493 /home/tgl/testversion/lib/libpq.so.5.0
     1324  0.3997 /usr/bin/oprofiled

kernel (note percentages here are percentage of kernel time, not total
time):

samples  %        symbol name
19326    12.3454  schedule
11738     7.4982  _raw_spin_lock
11631     7.4299  __switch_to
 9897     6.3222  mwait_idle
 7153     4.5693  __find_first_bit
 6395     4.0851  try_to_wake_up
 3972     2.5373  unix_stream_recvmsg
 3813     2.4357  system_call
 3739     2.3885  thread_return
 3612     2.3073  find_busiest_group
 3446     2.2013  resched_task

postgres:

samples  %        symbol name
 6415     7.6266  base_yyparse
 6115     7.2699  AllocSetAlloc
 3525     4.1907  hash_search

libc:

samples  %        symbol name
 9276    14.5226  memcpy
 8686    13.5989  _int_malloc
 5865     9.1823  strncpy
 5538     8.6703  strlen
 5286     8.2758  strspn
 4351     6.8120  vfprintf

With stats_command_string on:

45.2 sec to run script (with oprofile off)
47.5 sec to run script (with oprofile active)

Also repeatable.

oprofile global report:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.03 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 240000
GLOBAL_POWER_E...|
  samples|      %|
------------------
   785756 71.9386 /usr/lib/debug/lib/modules/2.6.16-1.2133_FC5/vmlinux
   135820 12.4348 /home/tgl/testversion/bin/postgres
   105828  9.6889 /lib64/libc-2.4.so
    19903  1.8222 /home/tgl/testversion/lib/libpq.so.5.0
    19293  1.7663 /home/tgl/testversion/bin/psql
     7489  0.6856 /ip_conntrack
     4555  0.4170 /usr/bin/oprofiled

kernel:

samples  %        symbol name
62337     7.9334  security_port_sid
61566     7.8353  schedule
37897     4.8230  _raw_spin_lock
36410     4.6338  __switch_to
26580     3.3827  mwait_idle
21797     2.7740  __find_first_bit
19118     2.4331  avc_has_perm_noaudit
18929     2.4090  sys_select
17005     2.1642  system_call
15979     2.0336  do_select
15762     2.0060  fget
14838     1.8884  try_to_wake_up
13488     1.7166  find_busiest_group
11956     1.5216  _raw_read_lock
11685     1.4871  thread_return
10639     1.3540  resched_task
10542     1.3416  copy_user_generic

postgres:

samples  %        symbol name
 7817     5.7554  PgstatBufferMain
 7541     5.5522  base_yyparse
 7247     5.3357  AllocSetAlloc
 6725     4.9514  hash_search

libc:

samples  %        symbol name
15012    14.1853  strncpy
12265    11.5896  _int_malloc
11949    11.2910  memcpy

This system is running with SELinux enabled (in permissive mode).
I presume that if I turned it off, security_port_sid() and
avc_has_perm_noaudit() would drop off the radar screen, but things
wouldn't change much otherwise.

Also, you'd expect to see PgstatCollectorMain() somewhere, but it's not
in the oprofile report at all.  I believe gcc sees that it's a static,
singly-referenced function and inlines it into PgstatBufferMain, so the
time shown above for PgstatBufferMain should be read as the joint time
for those two functions.  It looks like all of the pgstat_recv_foo
functions got inlined into it too.  The only pgstat functions that show
separately in the oprofile output are

 1214     0.8938  pgstat_add_backend
  860     0.6332  pgstat_report_tabstat
  749     0.5515  pgstat_report_activity
  434     0.3195  pgstat_write_statsfile
  330     0.2430  pgstat_setheader
  196     0.1443  pgstat_count_xact_commit
  106     0.0780  pgstat_send
   91     0.0670  pgstat_get_db_entry

So the total time spent directly in pgstat.c is less than 9% of the
postgres executable's time, or only about 1% of total runtime.

Comparing the sample counts between the two cases ought to be
legitimate, because in both cases the oprofile run covered the full run
of the test script.  However, the second case shows twice as many hits
in psql and libpq.so, which ought to be doing the same amount of work in
both runs, so I think there's some measurement artifact involved there.
In any case, the kernel time has gone up far more than the userland time
(a result confirmed by "top" and "vmstat").

Conclusions:

* Kernel time is what's hurting us.

* Most of the kernel time is associated with process scheduling.  In our
  current system, a simple command with no stats overhead involves two
  process dispatches (the backend and psql) per command round trip.
  Enabling stats_command_string increases that to six, because we'll
  also dispatch the buffer process and the collector process twice per
  command.

* If you are using SELinux, UDP datagrams are expensive, because the
  kernel permission-checks each one separately (even with a bound
  socket, apparently).  We might want to think about replacing the UDP
  mechanism with a Unix socket where available; a sketch of what I mean
  is below.

In short, it sure looks like the separate buffer process is a huge loss,
at least on this platform.  Getting rid of it should reduce the kernel
overhead by about a third, since the six process dispatches per command
would become four.  It's hard to tell how much userland time it would
save, but the userland time isn't the problem.

			regards, tom lane
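P.S. For concreteness, here is the sort of substitution I have in mind
for that last point.  This is a minimal standalone sketch, not pgstat.c
itself: it assumes an AF_UNIX datagram socketpair keeps the
message-boundary, fire-and-forget semantics we rely on from UDP while
sidestepping the per-datagram port checks.  In the real thing the
descriptors would be created before the collector is forked and the
sending end made nonblocking, and Windows would presumably have to stay
on the UDP path since it has no socketpair().

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
	int			sv[2];
	char		buf[64];
	ssize_t		n;
	const char *msg = "fake pgstat message";	/* stand-in payload */

	/* One end for backends to write, one for the collector to read */
	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
	{
		perror("socketpair");
		exit(1);
	}

	/* backend side: fire-and-forget, as with the current UDP send */
	if (send(sv[0], msg, strlen(msg), 0) < 0)
		perror("send");

	/* collector side: each recv() returns exactly one whole message */
	n = recv(sv[1], buf, sizeof(buf), 0);
	if (n >= 0)
		printf("collector got %zd bytes: %.*s\n", n, (int) n, buf);

	close(sv[0]);
	close(sv[1]);
	return 0;
}

A side benefit of socketpair() over a named Unix socket is that there's
no filesystem entry to choose, bind, and clean up.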