Hi,

We have brought this up before, but I wanted to revisit the idea of a customized memcpy routine in corosync to improve performance (obviously after 1.0, perhaps targeted at 1.1).
A quick Google search turns up:
http://www.vik.cc/daniel/portfolio/memcpy.htm

I used this implementation by including the C file in exec/main.c and lib/libcoroipcc.c in corosync (the libc functions have lazy bindings, or weak or something, and this forces the included memcpy routine to be used instead of libc's). I then collected the following benchmark results with oprofile on a single-node 2.4 GHz Xeon, which involves minimal use of the network stack. The test is evsbench, since cpgbench appears broken at the moment.

libc memcpy (RHEL 5.3):

evsbench results:
[r...@bench-01 test]# ./evsbench
Init result 1
Join result 1
401958 Writes    1 bytes per write 10.002 Seconds runtime 40189.024 TP/s  0.040 MB/s.
316524 Writes 1001 bytes per write 10.001 Seconds runtime 31649.352 TP/s 31.681 MB/s.
279601 Writes 2001 bytes per write 10.002 Seconds runtime 27954.517 TP/s 55.937 MB/s.
238100 Writes 3001 bytes per write 10.002 Seconds runtime 23805.201 TP/s 71.439 MB/s.
214647 Writes 4001 bytes per write 10.001 Seconds runtime 21462.556 TP/s 85.872 MB/s.
178253 Writes 5001 bytes per write 10.002 Seconds runtime 17821.761 TP/s 89.127 MB/s.
151699 Writes 6001 bytes per write 10.001 Seconds runtime 15168.430 TP/s 91.026 MB/s.
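As an aside, a replacement copy routine can be sanity-checked outside corosync before it is wired in. The following is only a minimal sketch under stated assumptions: sketch_memcpy is a hypothetical stand-in (a plain byte loop, not the routine from the URL above), and copy_matches_libc, time_copies and report are illustrative names that do not exist in corosync. The stand-in is deliberately not named memcpy, because a compiler may turn a copy loop into a call to memcpy, which would then call itself.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Common signature shared by libc memcpy and any replacement routine. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* Hypothetical stand-in for a replacement routine: a plain byte loop.
 * This is NOT the implementation from the URL; swap in the real one. */
void *sketch_memcpy(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (len--) {
        *d++ = *s++;
    }
    return dst;
}

/* Returns 1 if fn copies len bytes exactly as libc memcpy does.
 * One extra guard byte past the end detects an overrun. */
int copy_matches_libc(copy_fn fn, size_t len)
{
    unsigned char *src = malloc(len + 1);
    unsigned char *a = calloc(1, len + 1);
    unsigned char *b = calloc(1, len + 1);
    int ok = 0;
    size_t i;

    assert(fn != NULL);
    if (src != NULL && a != NULL && b != NULL) {
        for (i = 0; i <= len; i++) {
            src[i] = (unsigned char)(i * 31u + 7u);
        }
        fn(a, src, len);
        memcpy(b, src, len);
        ok = (memcmp(a, b, len + 1) == 0);
    }
    free(src);
    free(a);
    free(b);
    return ok;
}

/* CPU seconds spent on iters copies of len bytes, or -1.0 on failure. */
double time_copies(copy_fn fn, size_t len, long iters)
{
    unsigned char *src = calloc(1, len ? len : 1);
    unsigned char *dst = calloc(1, len ? len : 1);
    volatile unsigned char sink = 0;
    clock_t start, end;
    long i;

    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    start = clock();
    for (i = 0; i < iters; i++) {
        src[0] = (unsigned char)i;  /* vary the input each round */
        fn(dst, src, len);
        sink ^= dst[0];             /* keep the copy observable */
    }
    end = clock();
    free(src);
    free(dst);
    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Times both routines at the message sizes evsbench reports. */
void report(void)
{
    static const size_t sizes[] = { 1, 1001, 2001, 3001, 4001, 5001, 6001 };
    size_t i;

    for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        printf("%zu bytes: libc %.3f s, sketch %.3f s\n", sizes[i],
               time_copies(memcpy, sizes[i], 1000000L),
               time_copies(sketch_memcpy, sizes[i], 1000000L));
    }
}
```

Calling copy_matches_libc for each size and then report() from a small main gives a first impression, but a microbenchmark like this says nothing about locking or IPC overhead, so the evsbench and oprofile numbers remain the real measure.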
Oprofile percentages of app use for the system:

samples  %        app name
1149541  41.1079  /root/trunk/exec/corosync
 982566  35.1368  /usr/lib/debug/lib/modules/2.6.18-128.el5PAE/vmlinux
 390398  13.9607  /root/trunk/test/evsbench
 210283   7.5198  /usr/bin/oprofiled

Breakdown of corosync:

samples  %        image name                                     symbol name
306301   26.6455  libc-2.5.so                                    memcpy
158528   13.7905  libpthread-2.5.so                              pthread_spin_lock
151389   13.1695  libpthread-2.5.so                              pthread_mutex_lock
142589   12.4040  libpthread-2.5.so                              __pthread_mutex_unlock_usercnt
 36381    3.1648  [vdso] (tgid:10967 range:0x25a000-0x25b000)    (no symbols)
 36274    3.1555  libc-2.5.so                                    malloc
 33355    2.9016  libc-2.5.so                                    _int_malloc

Breakdown of evsbench:

samples  %        image name                                     symbol name
193998   49.6924  libc-2.5.so                                    memcpy
 32028    8.2039  [vdso] (tgid:10971 range:0xa34000-0xa35000)    (no symbols)
 24339    6.2344  libc-2.5.so                                    vfprintf
 15562    3.9862  libpthread-2.5.so                              pthread_spin_lock
 13029    3.3374  libpthread-2.5.so                              __pthread_mutex_unlock_usercnt
 11219    2.8737  libpthread-2.5.so                              pthread_mutex_lock
 10110    2.5897  libc-2.5.so                                    semop
  9588    2.4560  libc-2.5.so                                    _IO_default_xsputn

With the URL memcpy routine:

evsbench results:
428449 Writes    1 bytes per write 10.002 Seconds runtime 42836.791 TP/s   0.043 MB/s.
350277 Writes 1001 bytes per write 10.002 Seconds runtime 35020.829 TP/s  35.056 MB/s.
298791 Writes 2001 bytes per write 10.001 Seconds runtime 29876.088 TP/s  59.782 MB/s.
260680 Writes 3001 bytes per write 10.001 Seconds runtime 26065.341 TP/s  78.222 MB/s.
241360 Writes 4001 bytes per write 10.002 Seconds runtime 24131.186 TP/s  96.549 MB/s.
199283 Writes 5001 bytes per write 10.002 Seconds runtime 19924.441 TP/s  99.642 MB/s.
166882 Writes 6001 bytes per write 10.002 Seconds runtime 16684.933 TP/s 100.126 MB/s.
Result: evsbench throughput (MB/s) and messages per second (TP/s) are 10% or so higher, ranging from roughly 7% to 12% depending on message size.

Oprofile percentages of app use for the system:

samples  %        app name
1172931  41.1805  /root/trunk/exec/corosync
1084976  38.0925  /usr/lib/debug/lib/modules/2.6.18-128.el5PAE/vmlinux
 297212  10.4348  /root/trunk/test/evsbench
 229355   8.0524  /usr/bin/oprofiled

Breakdown of corosync:

samples  %        image name                                     symbol name
237635   20.2599  corosync                                       memcpy
172623   14.7172  libpthread-2.5.so                              pthread_mutex_lock
161879   13.8012  libpthread-2.5.so                              __pthread_mutex_unlock_usercnt
156196   13.3167  libpthread-2.5.so                              pthread_spin_lock
 42768    3.6463  [vdso] (tgid:13295 range:0x166000-0x167000)    (no symbols)
 40774    3.4762  libc-2.5.so                                    malloc
 37471    3.1946  libc-2.5.so                                    free
 36413    3.1044  libc-2.5.so                                    _int_malloc
 17141    1.4614  libtotem_pg.so.3.0.0                           totemsrp_mcast

Breakdown of evsbench:

samples  %        image name                                     symbol name
 83174   27.9847  libevs.so.3.0.0                                memcpy
 33458   11.2573  [vdso] (tgid:13298 range:0x7bf000-0x7c0000)    (no symbols)
 26347    8.8647  libc-2.5.so                                    vfprintf
 16409    5.5210  libpthread-2.5.so                              pthread_spin_lock
 13464    4.5301  libpthread-2.5.so                              __pthread_mutex_unlock_usercnt
 12543    4.2202  libpthread-2.5.so                              pthread_mutex_lock
 10462    3.5200  libc-2.5.so                                    _IO_default_xsputn
 10050    3.3814  libc-2.5.so                                    semop

Conclusion:

The end results show that evsbench is ~10% faster with the URL routine. In corosync, memcpy's share of the samples drops from 26.6% to 20.3%, about 6 percentage points or a relative reduction of roughly 24%. In evsbench, its share drops from 49.7% to 28.0%, about 22 percentage points or a relative reduction of roughly 44%.

There is plenty of low-hanging fruit for optimization, especially for 10gig interconnect support (targeted for 1.1), and this one seems especially easy to integrate.

Comments welcome.

Regards
-steve
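P.S. The percentage figures can be rechecked from the evsbench and oprofile numbers quoted above. A minimal sketch in C follows; relative_reduction, relative_gain and print_summary are illustrative names chosen here, not part of corosync or oprofile, and the inputs are copied from the two runs (memcpy share of samples, and TP/s at 1001 bytes per write).

```c
#include <assert.h>
#include <stdio.h>

/* Fraction by which a value shrank, e.g. a profile share of samples. */
double relative_reduction(double before, double after)
{
    assert(before > 0.0);
    return (before - after) / before;
}

/* Fraction by which a value grew, e.g. evsbench TP/s. */
double relative_gain(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before;
}

/* Prints the comparisons for the two runs quoted in this mail. */
void print_summary(void)
{
    /* memcpy share of samples (%), libc run vs. URL-routine run */
    printf("corosync memcpy share: %.1f%% relative reduction\n",
           100.0 * relative_reduction(26.6455, 20.2599));
    printf("evsbench memcpy share: %.1f%% relative reduction\n",
           100.0 * relative_reduction(49.6924, 27.9847));
    /* evsbench TP/s at 1001 bytes per write */
    printf("TP/s at 1001 bytes:    %.1f%% gain\n",
           100.0 * relative_gain(31649.352, 35020.829));
}
```

Note that the shares are percentages of each run's own sample total, and the second run also moved more messages in the same ten seconds, so these ratios are only a rough guide to the CPU time saved per message.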
