Re: [O-MPI devel] TCP performance

2005-11-29 Thread Tim S. Woodall

George Bosilca wrote:

Tim,

It looks a little bit better. Here are the latencies for 1- to 4-byte
messages as well as for the maximum message size in Netpipe (8 MB).


old ob1:
   0:       1 bytes    694 times -->    0.06 Mbps in    137.54 usec
   1:       2 bytes    727 times -->    0.11 Mbps in    140.54 usec
   2:       3 bytes    711 times -->    0.16 Mbps in    141.54 usec
   3:       4 bytes    470 times -->    0.22 Mbps in    140.55 usec
 121: 8388605 bytes      3 times -->  889.75 Mbps in  71929.97 usec
 122: 8388608 bytes      3 times -->  889.72 Mbps in  71932.47 usec
 123: 8388611 bytes      3 times -->  889.59 Mbps in  71943.16 usec

new ob1:
   0:       1 bytes    760 times -->    0.07 Mbps in    116.08 usec
   1:       2 bytes    861 times -->    0.13 Mbps in    116.73 usec
   2:       3 bytes    856 times -->    0.20 Mbps in    116.69 usec
   3:       4 bytes    571 times -->    0.26 Mbps in    117.48 usec
 121: 8388605 bytes      3 times -->  890.37 Mbps in  71880.14 usec
 122: 8388608 bytes      3 times -->  890.33 Mbps in  71883.64 usec
 123: 8388611 bytes      3 times -->  890.40 Mbps in  71878.00 usec

teg:
   0:       1 bytes    867 times -->    0.07 Mbps in    114.91 usec
   1:       2 bytes    870 times -->    0.13 Mbps in    115.99 usec
   2:       3 bytes    862 times -->    0.20 Mbps in    114.37 usec
   3:       4 bytes    582 times -->    0.26 Mbps in    115.20 usec
 121: 8388605 bytes      3 times -->  893.42 Mbps in  71634.49 usec
 122: 8388608 bytes      3 times -->  893.22 Mbps in  71651.18 usec
 123: 8388611 bytes      3 times -->  893.24 Mbps in  71649.35 usec

uniq:
   0:       1 bytes    870 times -->    0.07 Mbps in    114.59 usec
   1:       2 bytes    872 times -->    0.13 Mbps in    114.20 usec
   2:       3 bytes    875 times -->    0.20 Mbps in    114.52 usec
   3:       4 bytes    582 times -->    0.27 Mbps in    113.70 usec
 121: 8388605 bytes      3 times -->  893.41 Mbps in  71635.64 usec
 122: 8388608 bytes      3 times -->  893.57 Mbps in  71623.01 usec
 123: 8388611 bytes      3 times -->  893.39 Mbps in  71637.67 usec

raw tcp:
   0:       1 bytes   1081 times -->    0.08 Mbps in     90.74 usec
   1:       2 bytes   1102 times -->    0.17 Mbps in     90.88 usec
   2:       3 bytes   1100 times -->    0.25 Mbps in     90.66 usec
   3:       4 bytes    735 times -->    0.34 Mbps in     89.21 usec
 121: 8388605 bytes      3 times -->  894.90 Mbps in  71516.32 usec
 122: 8388608 bytes      3 times -->  894.99 Mbps in  71508.84 usec
 123: 8388611 bytes      3 times -->  894.96 Mbps in  71511.51 usec

The changes seem to remove around 20 microseconds ... that's not bad.
However, we are still 25 microseconds away from the raw TCP stream, and
those 25 microseconds have to come from somewhere in the TCP BTL,
because the request management in Open MPI is quite fast. I think I
have an explanation. The Netpipe TCP test does not use select or poll
at all; it simply blocks on the receive until it gets the full message.
Since we do a select/poll before reading from the socket, that should
explain at least part of the difference. But there is still a small
question: why is ob1 3 microseconds slower than teg or uniq?


Due to the structure of the PML/BTL interface, the BTL code does an
extra recv call. The cost of this varies; on odin it appears to be
closer to 0.5 usec... Have to think about this a bit - we may be able
to remove it.
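
For illustration, a minimal sketch of the pattern in question (this is
not the actual btl_tcp source; the fragment header layout and function
names below are made up): reading the fragment header and the payload
as two separate recv() calls, versus gathering both with one readv().

#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Illustrative fragment header; not the real BTL header layout. */
struct frag_hdr {
    uint32_t tag;
    uint32_t len;        /* payload bytes that follow the header */
};

/* Current pattern: one recv for the header, a second recv for the payload. */
static ssize_t recv_frag_two_calls(int fd, void *payload, size_t max)
{
    struct frag_hdr hdr;
    if (recv(fd, &hdr, sizeof(hdr), MSG_WAITALL) != (ssize_t) sizeof(hdr))
        return -1;                                    /* recv #1 */
    if (hdr.len > max)
        return -1;
    return recv(fd, payload, hdr.len, MSG_WAITALL);   /* recv #2 */
}

/* Possible alternative: gather the header and the first part of the
 * payload in one readv(); a short read may still need a follow-up recv. */
static ssize_t recv_frag_one_call(int fd, struct frag_hdr *hdr,
                                  void *payload, size_t max)
{
    struct iovec iov[2] = {
        { .iov_base = hdr,     .iov_len = sizeof(*hdr) },
        { .iov_base = payload, .iov_len = max          },
    };
    return readv(fd, iov, 2);                         /* one system call */
}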


Tim




Re: [O-MPI devel] TCP performance

2005-11-29 Thread George Bosilca

Tim,

It looks a little bit better. Here are the latencies for 1- to 4-byte
messages as well as for the maximum message size in Netpipe (8 MB).


old ob1:
   0:       1 bytes    694 times -->    0.06 Mbps in    137.54 usec
   1:       2 bytes    727 times -->    0.11 Mbps in    140.54 usec
   2:       3 bytes    711 times -->    0.16 Mbps in    141.54 usec
   3:       4 bytes    470 times -->    0.22 Mbps in    140.55 usec
 121: 8388605 bytes      3 times -->  889.75 Mbps in  71929.97 usec
 122: 8388608 bytes      3 times -->  889.72 Mbps in  71932.47 usec
 123: 8388611 bytes      3 times -->  889.59 Mbps in  71943.16 usec

new ob1:
   0:       1 bytes    760 times -->    0.07 Mbps in    116.08 usec
   1:       2 bytes    861 times -->    0.13 Mbps in    116.73 usec
   2:       3 bytes    856 times -->    0.20 Mbps in    116.69 usec
   3:       4 bytes    571 times -->    0.26 Mbps in    117.48 usec
 121: 8388605 bytes      3 times -->  890.37 Mbps in  71880.14 usec
 122: 8388608 bytes      3 times -->  890.33 Mbps in  71883.64 usec
 123: 8388611 bytes      3 times -->  890.40 Mbps in  71878.00 usec

teg:
   0:       1 bytes    867 times -->    0.07 Mbps in    114.91 usec
   1:       2 bytes    870 times -->    0.13 Mbps in    115.99 usec
   2:       3 bytes    862 times -->    0.20 Mbps in    114.37 usec
   3:       4 bytes    582 times -->    0.26 Mbps in    115.20 usec
 121: 8388605 bytes      3 times -->  893.42 Mbps in  71634.49 usec
 122: 8388608 bytes      3 times -->  893.22 Mbps in  71651.18 usec
 123: 8388611 bytes      3 times -->  893.24 Mbps in  71649.35 usec

uniq:
   0:       1 bytes    870 times -->    0.07 Mbps in    114.59 usec
   1:       2 bytes    872 times -->    0.13 Mbps in    114.20 usec
   2:       3 bytes    875 times -->    0.20 Mbps in    114.52 usec
   3:       4 bytes    582 times -->    0.27 Mbps in    113.70 usec
 121: 8388605 bytes      3 times -->  893.41 Mbps in  71635.64 usec
 122: 8388608 bytes      3 times -->  893.57 Mbps in  71623.01 usec
 123: 8388611 bytes      3 times -->  893.39 Mbps in  71637.67 usec

raw tcp:
   0:       1 bytes   1081 times -->    0.08 Mbps in     90.74 usec
   1:       2 bytes   1102 times -->    0.17 Mbps in     90.88 usec
   2:       3 bytes   1100 times -->    0.25 Mbps in     90.66 usec
   3:       4 bytes    735 times -->    0.34 Mbps in     89.21 usec
 121: 8388605 bytes      3 times -->  894.90 Mbps in  71516.32 usec
 122: 8388608 bytes      3 times -->  894.99 Mbps in  71508.84 usec
 123: 8388611 bytes      3 times -->  894.96 Mbps in  71511.51 usec

The changes seem to remove around 20 microseconds ... that's not bad.
However, we are still 25 microseconds away from the raw TCP stream, and
those 25 microseconds have to come from somewhere in the TCP BTL,
because the request management in Open MPI is quite fast. I think I
have an explanation. The Netpipe TCP test does not use select or poll
at all; it simply blocks on the receive until it gets the full message.
Since we do a select/poll before reading from the socket, that should
explain at least part of the difference. But there is still a small
question: why is ob1 3 microseconds slower than teg or uniq?
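
To make the comparison concrete, here is a minimal sketch of the two
receive styles (illustrative only; neither function is taken from
Netpipe or from the TCP BTL):

#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>

/* Netpipe style: block inside recv() until the full message has arrived. */
static ssize_t recv_blocking(int fd, void *buf, size_t len)
{
    return recv(fd, buf, len, MSG_WAITALL);           /* single syscall path */
}

/* Progress-engine style: wait for readability first, then read.  The extra
 * poll()/select() round trip per message is the suspected source of part of
 * the remaining ~25 usec gap against raw TCP. */
static ssize_t recv_polled(int fd, void *buf, size_t len)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    if (poll(&pfd, 1, -1) <= 0)                       /* syscall #1 */
        return -1;
    return recv(fd, buf, len, MSG_DONTWAIT);          /* syscall #2; may be a
                                                         short read that needs
                                                         another pass */
}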


  Thanks,
george.

PS: I will print the graphs again and send them around this evening
because the rest of the data is on my laptop (and it's at home right
now).


On Nov 29, 2005, at 12:36 PM, Tim S. Woodall wrote:


George,

Can you try out the changes I just committed on the trunk? We were
doing more select/recvs than necessary.

Thanks,
Tim


George Bosilca wrote:
I ran Netpipe on 4 different clusters with different OSes and Ethernet
devices. The result is that nearly the same behaviour shows up every
time for small messages. Basically, our latency is really bad. Attached
are 2 of the graphs, one from a Mac OS X cluster (wotan) and one from a
32-bit Linux 2.6.10 cluster. The graphs are for Netpipe compiled over
TCP, and for Open MPI with all the PMLs (uniq, teg and ob1). Here is
the global trend:

- we are always slower than native TCP (what a guess!)

- uniq is faster than teg by a small fraction of a microsecond (it's
more visible on fast networks).

- teg and uniq are always better than ob1 in terms of latency.

- the behaviour of ob1 differs between wotan and boba. On boba the
performance is a lot closer to the other PMLs, while on wotan it's
around 40 microseconds slower (it nearly doubles the raw TCP latency).

On the same nodes I ran other Netpipe tests with SM and MX and the
results are pretty good. So I think we have this latency problem only
on TCP. I will take a look to see how exactly this happens, but any
help is welcome.

  george.

"We must accept finite disappointment, but we must never lose  
infinite

hope."
  Martin Luther King





Re: [O-MPI devel] TCP performance

2005-11-29 Thread Tim S. Woodall

George,

Can you try out the changes I just committed on the trunk? We were
doing more select/recvs than necessary.

Thanks,
Tim


George Bosilca wrote:
I ran Netpipe on 4 different clusters with different OSes and Ethernet
devices. The result is that nearly the same behaviour shows up every
time for small messages. Basically, our latency is really bad. Attached
are 2 of the graphs, one from a Mac OS X cluster (wotan) and one from a
32-bit Linux 2.6.10 cluster. The graphs are for Netpipe compiled over
TCP, and for Open MPI with all the PMLs (uniq, teg and ob1). Here is
the global trend:

- we are always slower than native TCP (what a guess!)

- uniq is faster than teg by a small fraction of a microsecond (it's
more visible on fast networks).

- teg and uniq are always better than ob1 in terms of latency.

- the behaviour of ob1 differs between wotan and boba. On boba the
performance is a lot closer to the other PMLs, while on wotan it's
around 40 microseconds slower (it nearly doubles the raw TCP latency).

On the same nodes I ran other Netpipe tests with SM and MX and the
results are pretty good. So I think we have this latency problem only
on TCP. I will take a look to see how exactly this happens, but any
help is welcome.


  george.

"We must accept finite disappointment, but we must never lose infinite
hope."
  Martin Luther King



