[OpenSIPS-Devel] Allocating TCP workers to process requests

Dan Pascu Fri, 10 Jan 2020 04:01:23 -0800

I noticed some unexpected behavior related to how TCP workers are allocated to 
process requests. This was highlighted during a DNS outage, because of how 
opensips was configured.


Here are the relevant bits from my configuration to lay out the context:

I listen on UDP, TCP and TLS and I start 5 UDP and 5 TCP worker processes, but 
allow them to grow up to 25 based on load:

listen = udp:IP:5060
listen = tcp:IP:5060
listen = tls:IP:5061

auto_scaling_profile = SIP_WORKERS
     scale up to 25 on 80% for 4 cycles within 5
     scale down to 5 on 20% for 10 cycles

tcp_workers = 5 use_auto_scaling_profile SIP_WORKERS
udp_workers = 5 use_auto_scaling_profile SIP_WORKERS


DNS is configured to use only 1 server and only 1 attempt with a timeout of 5 
seconds per request:

dns                   = yes
rev_dns               = no
dns_use_search_list   = no
disable_dns_blacklist = yes
dns_retr_time         = 5
dns_retr_no           = 1
dns_servers_no        = 1

This means that every time a domain is looked up while the DNS server is down, 
it will do 3 requests (NAPTR, SRV and A) and each will take 5 seconds to time 
out. In other words, a DNS lookup for a domain will time out after 15 seconds.
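
As a quick sanity check of that figure, here is a small Python sketch of the 
arithmetic. The variable names are mine and only mirror the settings above; it 
assumes the three lookups run one after another and each hits the full timeout.

```python
# Values copied from the DNS settings above.
dns_retr_time = 5    # seconds before a single DNS request times out
dns_retr_no = 1      # attempts per request
dns_servers_no = 1   # number of DNS servers tried

# Assumption: NAPTR, SRV and A are queried sequentially, not in parallel.
record_types = ["NAPTR", "SRV", "A"]

per_query = dns_retr_time * dns_retr_no * dns_servers_no
per_domain = per_query * len(record_types)
print(per_domain)  # prints 15
```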

I have 1 device that connects over TLS and registers an account that uses RLS 
and has 30 contacts stored.

Now the event was that the main DNS server was down, and because of my 
configuration I didn't fall back to the secondary one from resolv.conf, so all 
DNS requests failed.

During this time I noticed that whenever RLS kicked in it would attempt to send 
SUBSCRIBEs to the 30 contacts and fail for each of them. The whole thing would 
take approximately 7.5 minutes, during which time it would always use the 1st 
TCP worker, raising its load and 1 minute load to 100%, while the 10 minutes 
load would stay at 77%. This was in line with the fact that RLS was triggered 
every 10 minutes and spent 7.5 minutes stuck in DNS timeouts, so it was busy 
approximately 75% of the time.
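
The numbers can be cross-checked with a short Python sketch. It is only 
back-of-the-envelope arithmetic with names I made up, assuming each of the 30 
contacts costs one full 15 second DNS lookup timeout and nothing else.

```python
contacts = 30
lookup_timeout_s = 15      # 3 DNS requests x 5 seconds each
rls_interval_s = 10 * 60   # RLS is triggered every 10 minutes

# Assumption: the contacts are handled one after another in the same worker.
stall_s = contacts * lookup_timeout_s
busy_fraction = stall_s / rls_interval_s

print(stall_s / 60, busy_fraction)  # prints 7.5 0.75
```

The 75% result is close to the 77% reported as the 10 minutes load.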

The fact that RLS always used TCP worker 1 is not unexpected as the SIP device 
I mentioned was the only one connected to the proxy and the only one sending 
requests, so the proxy was mostly idle doing RLS every 10 minutes, besides the 
occasional REGISTER/SUBSCRIBE from the device.

The unexpected behavior is that during the 7.5 minutes when RLS tried to send 
SUBSCRIBEs to the contacts, any REGISTER or INVITE received by the proxy would 
not be processed. They seem to be scheduled on the same 1st TCP worker that is 
already loaded 100% from the RLS processing that is going on. I never see any 
log message from my script about processing the REGISTER or INVITE and they 
just time out on the client. If I send the REGISTER or INVITE during the 2.5 
minutes when RLS is not trying to send SUBSCRIBEs to the contacts, then I see 
the REGISTER and INVITE being processed and logging from the script, but the 
INVITE also fails due to DNS failure.

If I change my outbound proxy to prefer UDP, then I see the REGISTER and INVITE 
being processed, but if I use TCP or TLS I do not see them being processed 
unless I'm in the 2.5 minute window when the proxy is not doing RLS (actually I 
never checked, but it's possible that the requests that arrived in the 7.5 
minute window were actually processed and logged when the RLS processing window 
ended; I never waited that long and they always timed out on the client in 30 
seconds).

Now I can understand that RLS does everything in a single worker (it does a 
database lookup for the contacts and then loops over all of them trying to send 
a SUBSCRIBE for each), even though it could be argued that it could be 
optimized to delegate each send to a different worker.

What puzzles me is why opensips is allocating the incoming requests it receives 
to a TCP worker that is already busy and shows a 100% load in opensips-cli, 
while it has 4 other TCP workers that are completely idle. Or, if my conclusion 
is wrong, what exactly happens during the 7.5 minutes in which RLS uses TCP 
worker 1 trying to send out the SUBSCRIBEs and failing, such that no incoming 
request is processed by the other 4 idle TCP workers and it just times out?

That is not to say that I do not see the other TCP workers' pids in syslog at 
all, but they only appear very rarely and the idle workers do not seem to ever 
be used during the 7.5 minute busy window when the 1st worker is 100% loaded. 
So some worker allocation seems to happen when processing multiple incoming 
requests that arrive in parallel, but while RLS is sending out the SUBSCRIBEs 
it never seems to try to use the idle workers for incoming requests.

--
Dan





_______________________________________________
Devel mailing list
[email protected]
http://lists.opensips.org/cgi-bin/mailman/listinfo/devel
