On Tuesday, May 31, 2005 Serguei Osokine wrote:
> We tried to use UDP to transfer stuff over a gigabit LAN inside
> the cluster. Pretty soon we discovered that with small (~1500 byte) 
> packets the CPU was the bottleneck, because you can send only so many
> packets per second, and the resulting throughput was nowhere close to
> a gigabit. 
> ...
> (Datagrams smaller than MTU sucked performance-wise when compared to
> TCP, but that is another story - gigabit cards tend to offload plenty
> of TCP functionality from the CPU, so it was not that the UDP was
> particularly bad, but rather that TCP performance was very good.)

        An update for anyone who still cares after two and a half years:
it turns out that UDP *was* particularly bad. We discovered this almost
by accident, and it looks like a Windows problem - probably in the
WinSock UDP implementation.

        As it turned out, the chart of CPU use versus sending rate has a
clear 'hockey stick' shape - CPU use stays near zero up to some midpoint
and then starts to grow linearly, which is unexpected by itself. What is
even funnier, the sendto() call time is always the same regardless of the
sending rate (controlled by sleeps between sends) and regardless of the
CPU usage, and it is this call time that limits the single-thread
sending performance.

        Think about it: your sendto() call *always* takes the same
number of microseconds, whether your CPU use is 0% or 100%, and it is
this time that caps your per-thread sending rate long before 1 GBit/s
can be reached. You can use multiple threads, but then the steep incline
of the 'hockey stick' CPU chart maxes out all CPUs, and you bottleneck
on CPU before reaching 1 GBit/s anyway - just as I described in my
previous message two and a half years ago.
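
        For reference, the kind of loop that shows this effect looks
roughly like the sketch below; sock, dest and sleepMs are placeholders
rather than anything from our actual code, and error handling is
omitted:

        char buf[1400];                 /* payload below the 1500-byte MTU */
        LARGE_INTEGER freq, t0, t1;
        double totalUsec = 0.0;
        long calls = 0;

        QueryPerformanceFrequency(&freq);
        for (;;) {
            QueryPerformanceCounter(&t0);
            sendto(sock, buf, sizeof(buf), 0,
                   (struct sockaddr*)&dest, sizeof(dest));
            QueryPerformanceCounter(&t1);

            /* per-call time in microseconds; the running average stays
               the same no matter how long we sleep between sends */
            totalUsec += (double)(t1.QuadPart - t0.QuadPart) * 1e6
                         / (double)freq.QuadPart;
            calls++;                    /* report totalUsec/calls elsewhere */

            Sleep(sleepMs);             /* sending rate controlled by sleeps */
        }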

        To make a long story short, the lucky break came when I decided
to try a nonblocking socket with this call:

        u_long nbFlag = 1;                    /* nonzero enables nonblocking mode */
        ioctlsocket(sock, FIONBIO, &nbFlag);

- and the socket buffer was set to 1 MB (note that the buffer increase
alone did not have any effect). That was done purely out of desperation:
I saw no reason why the CPU use of a nonblocking socket would be any
better than that of a blocking one (especially after the IO completion
approach had failed to reduce the CPU use).

        I mean, if the socket were actually going to block, then some
extra time would be spent in select() calls (as I saw happening when I
tried to call the blocking sendto() only after select() gave me the
go-ahead; that only increased the thread overhead without affecting the
sendto() timing or the CPU use). And if the socket were not going to
block, then what difference could there be between the blocking and
nonblocking timings, right?

        Wrong. Using nonblocking sockets led to a result that shocked
me: the sendto() calls never wanted to block, always returning success;
the time spent in these calls decreased by a factor of ten(!); the CPU
use dropped to just a few percent regardless of the sending rate
(including rates that were previously unachievable because the CPUs
maxed out long before reaching them) - and all these UDP packets were
actually going out on the wire and were successfully delivered to the
destination.
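
        In case anyone wants to reproduce this, the working setup boiled
down to roughly the following (a sketch only - WSAStartup(), error
handling and the actual send loop are left out, and the helper name is
made up for illustration):

        SOCKET make_udp_sender(void)
        {
            SOCKET sock = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

            /* 1 MB send buffer - on its own this changed nothing */
            int sndBuf = 1024 * 1024;
            setsockopt(sock, SOL_SOCKET, SO_SNDBUF,
                       (const char*)&sndBuf, sizeof(sndBuf));

            /* ...but together with nonblocking mode it made sendto()
               roughly ten times cheaper and kept the CPU use near zero */
            u_long nbFlag = 1;
            ioctlsocket(sock, FIONBIO, &nbFlag);

            return sock;
        }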

        If anyone can offer a rational explanation, go ahead. I can
only speculate that there is some bug in the Windows UDP stack that
leads to polling under certain conditions, causing the CPU load to
increase unnecessarily - all of this CPU load is clearly visible in
Performance Monitor in some system thread and shows up as kernel CPU
load in Task Manager. So the blocking sendto() thread is not consuming
much CPU time by itself; it is probably mostly waiting inside sendto()
for some event to happen while the kernel thread loops instead of
sleeping. As to why this would happen for the blocking socket only,
your guess is as good as mine. Like I said, once you switch to
nonblocking, the socket does not show the slightest desire to wait for
anything, so it is not as if the blocking sendto() call were actually
expected to block - remember, its execution time seems to be constant
and does not depend on the sending rate or on the CPU load at all.

        But that polling by itself does not explain the sendto() time
dropping tenfold in the nonblocking socket case even when the data
sending rate is low and the CPU load is zero in both blocking and
nonblocking cases. Oh well. Maybe at the same time there is also 
some scheduler problem that delays the sendto() thread wakeup when 
the socket is in blocking mode. Who knows.

        Sorry for the off-topic message, but since it all started here,
I thought it would be appropriate to provide the update in the same
place. Besides, even though UDP is getting more and more popular these
days, this behaviour still seems to be virtually unknown - I could not
find anything about it on Google, and I saw people asking questions
about high CPU use with large UDP streams even in such unlikely places
as the Sun Java forums, apparently assuming that it must be Java causing
it in their applications. No - I'd say this points in the direction of
some Windows 2000/XP UDP implementation bug...

        Best wishes -
        S.Osokine.
        16 Dec 2007.


-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Behalf Of Serguei Osokine
Sent: Tuesday, May 31, 2005 10:19 AM
To: Peer-to-peer development.
Subject: RE: [p2p-hackers] MTU in the real world


On Tuesday, May 31, 2005 David Barrett wrote:
> With this in mind, have you tried using a MTU bigger than 1500 bytes
> and been bitten by it?

        Yes. That was not your typical everyday situation, but I think
some on this list might find it entertaining anyway:

        We tried to use UDP to transfer stuff over a gigabit LAN inside
the cluster. Pretty soon we discovered that with small (~1500 byte) 
packets the CPU was the bottleneck, because you can send only so many
packets per second, and the resulting throughput was nowhere close to
a gigabit. (You have to send almost 100K such packets a second to
achieve a gigabit throughput, and we were doing several times less
on our 2-CPU 2.4GHz Win XP boxes.)
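
        (To spell the arithmetic out: 1 GBit/s is 125 MB/s, and
125,000,000 bytes divided into 1500-byte packets is roughly 83,000
packets per second - hence the "almost 100K".)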

        So then we tried to increase the UDP datagram size. The gigabit
switch did not support jumbo frames, by the way, so we were fragmenting
as soon as we exceeded 1500 bytes. The throughput went up, and was
pretty decent with 64-KB datagrams (I don't remember the exact numbers,
but it was close to a gigabit and generally everything was peachy).

        Which is when the funny things started to happen. In the middle
of a test, the communication channel would just shut down and nothing
would be delivered over it for a minute or two (though both the sender
and the receiver kept looking fine and no errors were returned by the
socket calls - the sender was sending data, but the receiver's
recvfrom() call was not getting it); after that pause the channel would
wake up as if nothing had happened (except for several gigabytes of
lost data), work normally for a few minutes, after which this shutdown
would repeat, and so on.

        It took us a while to figure out what was going on, but here is
the scoop: the gigabit LAN had a fairly small, but nonetheless nonzero,
packet loss rate. When one 1500-byte frame of a 64-KB datagram is lost,
the rest of the datagram's frames (some 62 KB) have to be buffered
somewhere in case the missing frame arrives and the datagram can be
fully reassembled. That arrival will never happen, but the socket layer
does not know that, so it has to keep the partial datagram for a while,
discarding all its frames if the missing frame does not arrive before
some timeout (RFC 1122 recommends this timeout value to be between 60
and 120 seconds, and this seems to be in line with what we saw).

        Now, the gigabit link sends quite a lot of data - 100 MB+ per
second. Even with a 0.01% loss rate, you're losing about 10,000 bytes
per second. That is no big deal by itself, but every lost 1500-byte
frame causes you to store 62 KB of partial datagram, so with the loss
rate above you have to store about 400 KB of new data every second. If
this data expires in 120 seconds, you need about 50 MB for the partial
datagram storage in the socket layer - and proportionally more if your
loss rate is higher than 0.01%. And this amount of memory is something
that the socket layer in Win XP simply does not have. So as soon as it
runs out of memory for the partially assembled datagrams, it stops the
data delivery and waits for the memory to be released. Apparently once
it gets enough free memory back, it switches the data delivery on again.
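
        (Spelling the numbers out: 100 MB/s at a 0.01% loss rate is
about 10 KB of lost data per second, or roughly seven lost 1500-byte
frames per second; each lost frame strands about 62 KB of partial
datagram, giving roughly 400 KB of new stuck data every second; over a
120-second reassembly timeout that adds up to almost 50 MB.)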

        This approach does seem funny, and I don't see any compelling
reason for the socket layer to handle the situation in this
all-or-nothing fashion - either it works normally, or it shuts down the
data delivery completely. It might have handled this a bit more
gracefully, I'd think. But this was Windows, and there was no arguing
with it. (We were stuck with Windows for unrelated reasons.)

        So the bottom line was that we had to go with TCP, because
there was no way we could build a UDP transport that would be both fast
enough and actually work on our hardware/OS combination. And the
"would work" part was definitely related to the attempt to send
datagrams exceeding the MTU. (Datagrams smaller than the MTU sucked
performance-wise when compared to TCP, but that is another story -
gigabit cards tend to offload plenty of TCP functionality from the CPU,
so it was not that UDP was particularly bad, but rather that TCP
performance was very good.)

        Best wishes -
        S.Osokine.
        31 May 2005.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Behalf Of David Barrett
Sent: Tuesday, May 31, 2005 3:11 AM
To: Peer-to-peer development.
Subject: [p2p-hackers] MTU in the real world


I've read in multiple places that it's best to have a UDP MTU of under 
1500 bytes.  However, it sounds like most of this is based on 
theoretical analysis, and not on real-world experience.

With this in mind, have you tried using a MTU bigger than 1500 bytes and 
been bitten by it?  Basically, do you know of any empirical analysis (of
any level of formality) of a real-world UDP application that supports or
refutes the 1500-byte rule of thumb?

Furthermore, I've read that if you "connect" your UDP socket to the 
remote side and then start sending large packets and backing off slowly, 
the socket layer will compute the "real" MTU between two endpoints, and 
you can obtain it through "getsockopt".  Do you know of anyone who's 
tried this, and the results?

-david