James MacLean wrote:
David Sommerseth wrote:
James MacLean wrote:
Hi Folks,
I have searched around a bit but have not come up with a solid
suggestion for increasing performance in the following environment:
. +150 clients always on, always via COAX modem 15Mb/s down 1.5Mb/s up.
. OpenVPN-2.0.9 and 2.1rc13 tested, setup as single server
. Server Kernel 2.6.25.4
. Server 64bit
. Server CPU % rarely goes above 30
. Server is fed over a 10G link
Currently we get what appears to be only between 5 and 6 MB/s average
using this setup.
If the only activity is over a single tunnel we can get the expected max
(about 14Mb/s to the remote site) for the COAX sites. Once traffic
builds during the day, that number drops.
We know that if we hit it locally we can get 160Mb/s. We also know that
if we do hit it locally and are getting the 160Mb/s, the COAX tunnels
suffer, dropping to almost half of their normal tunnel throughput of
almost 14Mb/s.
So in my small mind, I am thinking we are seeing around 48Mb/s
(6MB/s*8) used, but that we should be able to get over 150Mb/s. The CPU
isn't hurting. It almost feels like there is a governor slowing down the
traffic :).
Important settings from the latest config:
verb 1
dev tap
tun-mtu 1500
tun-mtu-extra 32
mssfix 1468
proto udp
ca SSCert.pem
cert servercert.pem
key serverkey.pem
dh dh1024.pem
tls-auth ./tlspass
keepalive 30 63
ping-timer-rem
persist-tun 1
persist-key 1
cipher none
tcp-queue-limit 4096
sndbuf 131072
rcvbuf 131072
Anyone have any words of wisdom :) ?
Have you tried different ciphers and/or cipher key sizes? I know you
say the server does not suffer from too high a load, but it could be
inefficiency in the cipher algorithm. If that's the case it might just
as well be an OpenSSL issue. It's a shot in the dark, but it would be
good to rule this one out. The default is Blowfish, so I do not really
expect an improvement.
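If you want to rule the cipher path out completely, it could be worth
measuring raw cipher throughput on the server hardware, outside of OpenVPN.
The sketch below is just my own quick test harness, nothing from the OpenVPN
tree; it assumes the OpenSSL EVP interface, uses an arbitrary iteration
count, and compares Blowfish-CBC (the OpenVPN default) against AES-128-CBC:

/* bench_cipher.c - rough single-core cipher throughput check.
 * Build with something like: gcc bench_cipher.c -o bench_cipher -lcrypto */
#include <openssl/evp.h>
#include <stdio.h>
#include <time.h>

static void bench (const char *name, const EVP_CIPHER *cipher)
{
  unsigned char key[32] = {0}, iv[16] = {0};     /* dummy key/IV, test only */
  static unsigned char in[1500], out[1500 + 64]; /* one MTU-sized packet */
  int outlen, i, iterations = 200000;
  double secs, mbit;
  clock_t start;
  EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new ();

  EVP_EncryptInit_ex (ctx, cipher, NULL, key, iv);
  start = clock ();
  for (i = 0; i < iterations; i++)
    EVP_EncryptUpdate (ctx, out, &outlen, in, sizeof (in));
  secs = (double) (clock () - start) / CLOCKS_PER_SEC;

  mbit = (double) iterations * sizeof (in) * 8.0 / secs / 1e6;
  printf ("%-12s %8.1f Mbit/s (single core)\n", name, mbit);
  EVP_CIPHER_CTX_free (ctx);
}

int main (void)
{
  bench ("BF-CBC", EVP_bf_cbc ());           /* OpenVPN's default cipher */
  bench ("AES-128-CBC", EVP_aes_128_cbc ());
  return 0;
}

If both come out far above the throughput you are actually seeing, the
cipher is unlikely to be the bottleneck.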
Do you know if threads are enabled in your OpenVPN setup?
(compile/configure setting). I believe the default is not to use
threads.
Does the performance drop if you have 150+ clients connected but
passive (not sending any traffic over the tunnel) and only 1 client
sending traffic?
kind regards,
David Sommerseth
Hi David,
I had hoped that "cipher none" would have the least overhead. Perhaps
there is a better one to try?
Hehe ... no, "cipher none" should indeed have the very least overhead. I
would be very much surprised if anything goes through OpenSSL at that point.
But I probably don't need to say anything about the security implications of
doing that. Anyway, for testing and debugging it is a good approach!
Threads are enabled in the build, but I only ever see one in the running
program. Maybe 64bit is showing it differently, or "ps axms" and "ps
-eLf" are not the way to display them?
ps -eLf should display all threads, afaik.
I'm not sure how the threads are really implemented, but when I dig into
the code it seems to be initialised as a single thread. I cannot find
traces in the code that indicate that multiple threads are implemented.
But it seems like the code is getting ready for it.
Please correct me if my suspicion is wrong, but it looks like the core
behaviour of the threaded and non-threaded binaries is almost the same,
with no thread being spawned per connection. If that is the case, I'm not
sure the threaded model has any performance impact at all, unless OpenSSL
encryption is running in its own separate thread (I have not investigated
this).
Performance seems fine if they are doing nothing. We can get the full
expected bandwidth from a single client, or even from a small number of
clients. But when the general use of the tunnels picks up, that's when
they appear to suffer.
I regret I do not have much in-depth info, but I'm really not sure which
direction I should be aiming in :).
Hmm ... that just seems to indicate that there is a drastic performance
drop once too many clients are using the tunnels.
When I look at the code, which is quite complex in the part where clients
connect, it seems like OpenVPN has its own way of scheduling when and how
to handle the clients. And it might be that you've found a limit in the
implementation.
This code is taken from mudp.c
  /* per-packet event loop */
  while (true)
    {
      perf_push (PERF_EVENT_LOOP);

      /* set up and do the io_wait() */
      multi_get_timeout (&multi, &multi.top.c2.timeval);
      io_wait (&multi.top, p2mp_iow_flags (&multi));
      MULTI_CHECK_SIG (&multi);

      /* check on status of coarse timers */
      multi_process_per_second_timers (&multi);

      /* timeout? */
      if (multi.top.c2.event_set_status == ES_TIMEOUT)
        {
          multi_process_timeout (&multi, MPP_PRE_SELECT|MPP_CLOSE_ON_SIGNAL);
        }
      else
        {
          /* process I/O */
          multi_process_io_udp (&multi);
          MULTI_CHECK_SIG (&multi);
        }

      perf_pop ();
    }
This seems to me to be the main loop. Here the OpenVPN server appears to
be listening for traffic on the network connections and processing each
packet, no matter which client sent it - analysing the packet and then
letting a connection "object" take care of the further processing. This is
just a wild guess, as I have only spent 10-15 min looking through the code.
But a lot of the processing magic happens in multi_process_io_udp(), and a
couple of levels deeper a scheduling function is called.
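To make the model I am describing a bit more concrete, here is a much
simplified sketch of how a single-process UDP server typically demultiplexes
all clients over one socket. This is my own illustration, not code from the
OpenVPN tree, and the client table and helper names are invented:

/* Single-threaded UDP demux sketch: all clients share one socket and one
 * event loop, and the packet's source address selects the per-client state. */
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>

#define MAX_CLIENTS 256

struct client_ctx {                 /* stand-in for a per-client "connection object" */
  struct sockaddr_in addr;
  unsigned long packets;
  int in_use;
};

static struct client_ctx clients[MAX_CLIENTS];

/* Find (or create) the per-client context for this source address. */
static struct client_ctx *lookup_client (const struct sockaddr_in *from)
{
  int i, free_slot = -1;
  for (i = 0; i < MAX_CLIENTS; i++) {
    if (clients[i].in_use
        && clients[i].addr.sin_addr.s_addr == from->sin_addr.s_addr
        && clients[i].addr.sin_port == from->sin_port)
      return &clients[i];
    if (!clients[i].in_use && free_slot < 0)
      free_slot = i;
  }
  if (free_slot < 0)
    return NULL;
  clients[free_slot].addr = *from;
  clients[free_slot].packets = 0;
  clients[free_slot].in_use = 1;
  return &clients[free_slot];
}

int main (void)
{
  int sock = socket (AF_INET, SOCK_DGRAM, 0);
  struct sockaddr_in local;
  char buf[2048];

  memset (&local, 0, sizeof (local));
  local.sin_family = AF_INET;
  local.sin_port = htons (1194);              /* illustrative port */
  local.sin_addr.s_addr = htonl (INADDR_ANY);
  bind (sock, (struct sockaddr *) &local, sizeof (local));

  for (;;) {
    fd_set rfds;
    struct sockaddr_in from;
    socklen_t fromlen = sizeof (from);
    ssize_t len;
    struct client_ctx *c;

    FD_ZERO (&rfds);
    FD_SET (sock, &rfds);

    /* io_wait()-style wait: one wakeup per packet, regardless of which
     * client sent it */
    if (select (sock + 1, &rfds, NULL, NULL, NULL) <= 0)
      continue;

    len = recvfrom (sock, buf, sizeof (buf), 0,
                    (struct sockaddr *) &from, &fromlen);
    if (len <= 0)
      continue;

    /* the source address selects the per-client context, which would then
     * do the real work (decrypt, route, ...); here we just count and echo */
    c = lookup_client (&from);
    if (c) {
      c->packets++;
      sendto (sock, buf, (size_t) len, 0,
              (struct sockaddr *) &from, fromlen);
    }
  }
  return 0;
}

The important property is that every packet from every client funnels
through the same loop in the same thread, so any per-packet overhead scales
with the total packet rate, not with the rate of any single client.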
If this really is true, it might be that this model works very well for a
good number of clients, until you reach a limit around 150+, where the cost
of this rescheduling starts to become too high. If the scheduling is not
efficient enough (a small "sleep" in between, waiting for I/O, inefficient
or too many code jumps, etc.), you will not see the load on the server
increase very much - but you will most probably feel the performance loss
on the client side. With fewer active clients this will of course go
better, as the internal scheduler has fewer clients to switch between.
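Just to put a rough number on it (my own back-of-the-envelope estimate, not
a measurement): to fill 150Mb/s with ~1500 byte packets the server has to
get through about

  150,000,000 / (1500 * 8) = 12,500 packets per second

i.e. each pass through that per-packet loop (io_wait(), lookup, processing,
send) has on average less than 80 microseconds to complete, and any extra
scheduling or book-keeping per packet eats directly into that budget without
necessarily showing up as high CPU load.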
In addition, I see that the code path is quite long, doing a lot of jumps
between a lot of functions, and this of course also adds some penalty - even
though each function seems to be optimised.
This is of course a way to avoid forking or starting a new thread per
client which works independently and is task-switched by the OS. But to be
honest, I think the OS scheduler might be much more efficient at that kind
of scheduling and switching than a separate internal scheduler.
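Just to illustrate the contrast I mean, a thread-per-client model would
look roughly like the fragment below. This is purely my own illustration,
not a proposal for the OpenVPN code base, and every name in it is invented;
the point is only that each client's work runs in its own thread and the OS
scheduler, not an internal event loop, decides who runs next:

/* Thread-per-client sketch (illustration only).  A reader thread would
 * still pull packets off the shared UDP socket, but instead of processing
 * them inline it hands each one to the owning client's thread. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct packet {
  char data[2048];
  size_t len;
};

struct client_thread {
  pthread_t tid;
  pthread_mutex_t lock;
  pthread_cond_t ready;
  struct packet mailbox;            /* one-slot queue, just for illustration */
  int has_packet;
};

/* Each client owns one of these loops; it sleeps until the reader thread
 * drops a packet into its mailbox, then processes it independently. */
static void *client_worker (void *arg)
{
  struct client_thread *ct = arg;
  for (;;) {
    struct packet p;
    pthread_mutex_lock (&ct->lock);
    while (!ct->has_packet)
      pthread_cond_wait (&ct->ready, &ct->lock);
    p = ct->mailbox;
    ct->has_packet = 0;
    pthread_mutex_unlock (&ct->lock);

    /* decrypt/route/send would happen here, concurrently with other clients */
    printf ("worker %p processed %zu bytes\n", (void *) ct, p.len);
  }
  return NULL;
}

static struct client_thread *spawn_client (void)
{
  struct client_thread *ct = calloc (1, sizeof (*ct));
  pthread_mutex_init (&ct->lock, NULL);
  pthread_cond_init (&ct->ready, NULL);
  pthread_create (&ct->tid, NULL, client_worker, ct);
  return ct;
}

/* Reader side: hand the packet to the client's thread and return to the
 * socket immediately, instead of processing it in the event loop itself. */
static void dispatch (struct client_thread *ct, const char *buf, size_t len)
{
  pthread_mutex_lock (&ct->lock);
  memcpy (ct->mailbox.data, buf, len);
  ct->mailbox.len = len;
  ct->has_packet = 1;
  pthread_cond_signal (&ct->ready);
  pthread_mutex_unlock (&ct->lock);
}

int main (void)
{
  struct client_thread *ct = spawn_client ();
  dispatch (ct, "hello", 5);        /* stand-in for a packet from the socket */
  sleep (1);                        /* give the worker time to run, then exit */
  return 0;
}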
Can anyone with deeper knowledge than me verify or correct me? I would
like to understand this part of the code much better.
kind regards,
David Sommerseth