Hi,

We have brought this up before, but I wanted to revisit the idea of a
customized memcpy routine for corosync to improve performance
(obviously after 1.0, perhaps targeted at 1.1).

A quick Google search turns up:
http://www.vik.cc/daniel/portfolio/memcpy.htm

I tried this implementation by including its C file in exec/main.c and
lib/libcoroipcc.c in corosync, then profiled with oprofile on a single
node 2.4 GHz Xeon (a setup that involves minimal use of the network
stack).  The libc functions have lazy bindings (or are weak, or
something similar), so this forces the included memcpy routine to be
used instead of libc's.  The test is evsbench, since cpgbench appears
to be broken at the moment.
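
To make the idea concrete, here is a minimal sketch of the kind of
program-local copy routine being discussed.  It is not the routine from
the linked page; the name sketch_memcpy and the word-at-a-time strategy
are assumptions chosen only for illustration, and the may_alias
attribute is a GCC/Clang extension.  In the experiment above the
routine would be named memcpy so that it takes the place of libc's.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension: a word type that is allowed to alias any object. */
typedef unsigned long __attribute__((__may_alias__)) word_t;

/* Hypothetical name; NOT the routine from the linked page. */
void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Word copies are only used when both pointers share the same
     * misalignment, so that aligning d also aligns s. */
    if (((uintptr_t)d % sizeof(word_t)) == ((uintptr_t)s % sizeof(word_t))) {
        /* Copy leading bytes until d is word aligned. */
        while (n > 0 && ((uintptr_t)d % sizeof(word_t)) != 0) {
            *d++ = *s++;
            n--;
        }
        /* Copy one machine word per iteration. */
        while (n >= sizeof(word_t)) {
            *(word_t *)d = *(const word_t *)s;
            d += sizeof(word_t);
            s += sizeof(word_t);
            n -= sizeof(word_t);
        }
    }

    /* Copy whatever is left one byte at a time. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

Like memcpy, this sketch assumes the source and destination do not
overlap.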

libc memcpy (RHEL 5.3):

evsbench results:
[r...@bench-01 test]# ./evsbench
Init result 1
Join result 1
401958 Writes     1 bytes per write  10.002 Seconds runtime 40189.024 TP/s   0.040 MB/s.
316524 Writes  1001 bytes per write  10.001 Seconds runtime 31649.352 TP/s  31.681 MB/s.
279601 Writes  2001 bytes per write  10.002 Seconds runtime 27954.517 TP/s  55.937 MB/s.
238100 Writes  3001 bytes per write  10.002 Seconds runtime 23805.201 TP/s  71.439 MB/s.
214647 Writes  4001 bytes per write  10.001 Seconds runtime 21462.556 TP/s  85.872 MB/s.
178253 Writes  5001 bytes per write  10.002 Seconds runtime 17821.761 TP/s  89.127 MB/s.
151699 Writes  6001 bytes per write  10.001 Seconds runtime 15168.430 TP/s  91.026 MB/s.


Oprofile percentages of app use for the system:
  1149541 41.1079 /root/trunk/exec/corosync
   982566 35.1368 /usr/lib/debug/lib/modules/2.6.18-128.el5PAE/vmlinux
   390398 13.9607 /root/trunk/test/evsbench 
   210283  7.5198 /usr/bin/oprofiled

Breakdowns of corosync:
samples  %        image name               symbol name
306301   26.6455  libc-2.5.so              memcpy
158528   13.7905  libpthread-2.5.so        pthread_spin_lock
151389   13.1695  libpthread-2.5.so        pthread_mutex_lock
142589   12.4040  libpthread-2.5.so        __pthread_mutex_unlock_usercnt
36381     3.1648  [vdso] (tgid:10967 range:0x25a000-0x25b000) (no symbols)
36274     3.1555  libc-2.5.so              malloc
33355     2.9016  libc-2.5.so              _int_malloc

Breakdowns of evsbench:
193998   49.6924  libc-2.5.so              memcpy
32028     8.2039  [vdso] (tgid:10971 range:0xa34000-0xa35000) (no symbols)
24339     6.2344  libc-2.5.so              vfprintf
15562     3.9862  libpthread-2.5.so        pthread_spin_lock
13029     3.3374  libpthread-2.5.so        __pthread_mutex_unlock_usercnt
11219     2.8737  libpthread-2.5.so        pthread_mutex_lock
10110     2.5897  libc-2.5.so              semop
9588      2.4560  libc-2.5.so              _IO_default_xsputn

With the URL memcpy routine:
evsbench results:
428449 Writes     1 bytes per write  10.002 Seconds runtime 42836.791 TP/s   0.043 MB/s.
350277 Writes  1001 bytes per write  10.002 Seconds runtime 35020.829 TP/s  35.056 MB/s.
298791 Writes  2001 bytes per write  10.001 Seconds runtime 29876.088 TP/s  59.782 MB/s.
260680 Writes  3001 bytes per write  10.001 Seconds runtime 26065.341 TP/s  78.222 MB/s.
241360 Writes  4001 bytes per write  10.002 Seconds runtime 24131.186 TP/s  96.549 MB/s.
199283 Writes  5001 bytes per write  10.002 Seconds runtime 19924.441 TP/s  99.642 MB/s.
166882 Writes  6001 bytes per write  10.002 Seconds runtime 16684.933 TP/s 100.126 MB/s.

Result: evsbench throughput (MB/s) and messages/sec (TP/s) are roughly
7% to 12% higher, depending on message size, or about 10% overall.
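
For anyone who wants to check the per-size gains, the arithmetic is
simply (new - old) / old.  The sketch below applies it to the TP/s
figures from the two evsbench runs in this mail; the function names
percent_gain and print_gains are made up for illustration.

```c
#include <stdio.h>

/* Percentage change of `candidate` relative to `baseline`. */
double percent_gain(double baseline, double candidate)
{
    return (candidate - baseline) / baseline * 100.0;
}

/* Print the TP/s gain for each message size, using the figures
 * reported by the two evsbench runs in this mail. */
void print_gains(void)
{
    static const int bytes[] = { 1, 1001, 2001, 3001, 4001, 5001, 6001 };
    static const double libc_tps[] = {
        40189.024, 31649.352, 27954.517, 23805.201,
        21462.556, 17821.761, 15168.430
    };
    static const double url_tps[] = {
        42836.791, 35020.829, 29876.088, 26065.341,
        24131.186, 19924.441, 16684.933
    };
    size_t i;

    for (i = 0; i < sizeof(bytes) / sizeof(bytes[0]); i++) {
        printf("%5d bytes: %+.1f%%\n", bytes[i],
               percent_gain(libc_tps[i], url_tps[i]));
    }
}
```

Calling print_gains() from a small main() prints one line per message
size.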

Oprofile percentages of app use for the system:
  1172931 41.1805 /root/trunk/exec/corosync
  1084976 38.0925 /usr/lib/debug/lib/modules/2.6.18-128.el5PAE/vmlinux
   297212 10.4348 /root/trunk/test/evsbench
   229355  8.0524 /usr/bin/oprofiled

Breakdown of corosync:
samples  %        image name               symbol name
237635   20.2599  corosync                 memcpy
172623   14.7172  libpthread-2.5.so        pthread_mutex_lock
161879   13.8012  libpthread-2.5.so        __pthread_mutex_unlock_usercnt
156196   13.3167  libpthread-2.5.so        pthread_spin_lock
42768     3.6463  [vdso] (tgid:13295 range:0x166000-0x167000) (no symbols)
40774     3.4762  libc-2.5.so              malloc
37471     3.1946  libc-2.5.so              free
36413     3.1044  libc-2.5.so              _int_malloc
17141     1.4614  libtotem_pg.so.3.0.0     totemsrp_mcast

breakdown of evsbench:
samples  %        image name               symbol name
83174    27.9847  libevs.so.3.0.0          memcpy
33458    11.2573  [vdso] (tgid:13298 range:0x7bf000-0x7c0000) (no symbols)
26347     8.8647  libc-2.5.so              vfprintf
16409     5.5210  libpthread-2.5.so        pthread_spin_lock
13464     4.5301  libpthread-2.5.so        __pthread_mutex_unlock_usercnt
12543     4.2202  libpthread-2.5.so        pthread_mutex_lock
10462     3.5200  libc-2.5.so              _IO_default_xsputn
10050     3.3814  libc-2.5.so              semop

Conclusion:

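The end results show that evsbench is ~10% faster.  With the URL
routine, memcpy's share of the corosync samples drops from 26.6% to
20.3% (about 6.4 percentage points, or roughly a 24% relative
reduction), and its share of the evsbench samples drops from 49.7% to
28.0% (about 21.7 percentage points, or roughly a 44% relative
reduction).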
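
Since a drop in percentage points and a relative reduction are easy to
mix up, here is a small sketch that computes both from the oprofile
shares listed in the breakdowns above.  The helper names point_drop and
relative_drop are hypothetical.

```c
/* Absolute drop in a profile share, in percentage points. */
double point_drop(double before, double after)
{
    return before - after;
}

/* Relative drop, as a percentage of the original share. */
double relative_drop(double before, double after)
{
    return (before - after) / before * 100.0;
}

/* memcpy shares from the oprofile breakdowns in this mail:
 *   corosync: point_drop(26.6455, 20.2599), relative_drop(26.6455, 20.2599)
 *   evsbench: point_drop(49.6924, 27.9847), relative_drop(49.6924, 27.9847)
 */
```

Note that these are shares of each process's samples, not absolute CPU
time, since the total sample counts differ between the two runs.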

There is plenty of low-hanging fruit for optimization, especially for
10 Gbit interconnect support (targeted for 1.1), and this one seems
especially easy to integrate.

Comments welcome.

Regards
-steve

