On 20 Jan 2020, at 10:12, Bogdan-Andrei Iancu wrote:

> Hi Dan,
> 
> If you are able to reproduce the issue, while having OpenSIPS "blocked" do 
> "trap" and "mi get_statistics load:" in cli. Let's see exactly what is going 
> on. Eventually run these commands at 1 min intervals (considering that you 
> mentioned the 7.5 mins blocking time).

I observed this behavior when my main DNS server accidentally died. In order to 
reproduce it, I'd have to voluntarily kill it, which I don't want to do as this 
is the main DNS server for all the domains we use, not just some test server. 
The only other alternative would be o setup a whole parallel platform with 
separate domains and name servers and proxies just to test this which is 
something I do not have the time for, just to make this test.

I posted this hoping that someone could take a look at the TCP worker 
allocation and spot if there is something wrong with it.

> 
> Regards,
> 
> Bogdan-Andrei Iancu
> 
> OpenSIPS Founder and Developer
>  https://www.opensips-solutions.com
> OpenSIPS Summit, Amsterdam, May 2020
>  https://www.opensips.org/events/Summit-2020Amsterdam/
> OpenSIPS Bootcamp, Miami, March 2020
>  https://opensips.org/training/OpenSIPS_Bootcamp_2020/
> 
> On 1/17/20 3:27 PM, Dan Pascu wrote:
>> On 17 Jan 2020, at 12:19, Bogdan-Andrei Iancu wrote:
>> 
>>> Hi Dan,
>>> 
>>> So, 29 of the SUBSCRIBE requests are forwarded via TCP to another servers, 
>>> let's call it S1.
>>> 
>>> Now, looking again at the blocking stage:  if the UAC tries to send new 
>>> traffic via the same TCP conn, it is clear it will block, as the conn, on 
>>> the OpenSIPS side, is blocked by the DNS queries.
>> Figured that much.
>> 
>>> Still, opening a new TCP conn should do it - the conn will be assigned to 
>>> the next TCP worker process.
>> That's what I thought, but not what I observed. During the busy window 
>> nothing worked except the RLS trying to send using 1 worker, while 4 other 
>> workers were idle and no other incoming traffic was processed.
>> 
>>> I think it is very unlikely (still not impossible) to have this new conn 
>>> assigned also to the same proc as the first one (the blocked one). Have you 
>>> tried checking the internal load of the TCP processes ("get_statistics 
>>> load:") ?
>> It's in my original post. The load was 100%/100%/77% during the busy window.
>> 
>> My question is why is it still possible (even if unlikely) to allocate some 
>> job to this TCP worker that is already busy. When a request comes in and a 
>> worker is selected to process it, isn't that worker marked as busy, so it's 
>> not used by anything else until it finishes processing the current request, 
>> when it should be released back to the pool and only then be able to be used 
>> for something else?
>> 
>>> Regards,
>>> 
>>> Bogdan-Andrei Iancu
>>> 
>>> OpenSIPS Founder and Developer
>>>  https://www.opensips-solutions.com
>>> OpenSIPS Summit, Amsterdam, May 2020
>>>  https://www.opensips.org/events/Summit-2020Amsterdam/
>>> OpenSIPS Bootcamp, Miami, March 2020
>>>  https://opensips.org/training/OpenSIPS_Bootcamp_2020/
>>> 
>>> On 1/16/20 5:23 PM, Dan Pascu wrote:
>>>> On 15 Jan 2020, at 15:52, Bogdan-Andrei Iancu wrote:
>>>> 
>>>>> Hi Dan,
>>>>> 
>>>>> OK, so the Presence Server is on a different instance than the RLS server 
>>>>> and the routing between is DNS based  - thanks for clarifying this for me.
>>>> No. The server handles both RLS and presence, but the contacts for the 
>>>> account are not on this server they are handled by remote servers. So when 
>>>> the RLS subscribe comes in, it finds the 30 contacts out of which only 1 
>>>> is local. The other 29 are on some other server. But overall it doesn't 
>>>> matter. Because dns=yes in config, even for the local contact it will 
>>>> still to DNS lookups before forwarding it to itself.
>>>> 
>>>>> Again, as a fist step, I'm trying to understand the scenario you have 
>>>>> (including how RLS works) so I can understand what is going on inside of 
>>>>> OpenSIPS.
>>>>> 
>>>>> So we have the 30 SUBSCRIBE requests being sent out from one TCP worker 
>>>>> process. And due the DNS failure, this takes 7.5 minutes. And you claim 
>>>>> that during these 7.5 minutes OpenSIPS was not able to handle any other 
>>>>> traffic ? If so, same questions:
>>>>>     - the remaining traffic was coming via the same TCP conn as the 
>>>>> original received SUBSCRIBE ?
>>>> Yes and no. Initially was yes, because it was just a client connected over 
>>>> TLS and sending all requests over that. So when I attempted to send an 
>>>> INVITE it was sent over the same TLS connection. But when that failed, I 
>>>> tried to change my outbound proxy to TCP and also tried to dial the 
>>>> address as sip:user@domain;transport=tcp which should have forced a 
>>>> separate TCP connection. But neither worked, the only thing that worked 
>>>> was to use UDP (either through the outbound proxy or explicitly in the 
>>>> URI).
>>>> 
>>>>>     - if NO on above, the remaining traffic was also RLS related ?
>>>> No. It was INVITE/REGISTER
>>>> 
>>>>>     - if NO on above, the remaining traffic also TCP/TLS or was also UDP ?
>>>> TCP/TLS didn't work in any way during the 7.5 minute busy window, UDP 
>>>> worked.
>>>> 
>>>>> Just as a fact, there is no kind of dependency/sync between the TCP 
>>>>> workers, in a way to block all the other procs while one of them is busy.
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> Bogdan-Andrei Iancu
>>>>> 
>>>>> OpenSIPS Founder and Developer
>>>>>  https://www.opensips-solutions.com
>>>>> OpenSIPS Summit, Amsterdam, May 2020
>>>>>  https://www.opensips.org/events/Summit-2020Amsterdam/
>>>>> OpenSIPS Bootcamp, Miami, March 2020
>>>>>  https://opensips.org/training/OpenSIPS_Bootcamp_2020/
>>>>> 
>>>>> On 1/15/20 1:13 PM, Dan Pascu wrote:
>>>>>> On 15 Jan 2020, at 12:07, Bogdan-Andrei Iancu wrote:
>>>>>> 
>>>>>>> Hi Dan,
>>>>>>> 
>>>>>>> I'm not so familiar with how the RLS works, so bear with me while 
>>>>>>> trying to understand the scenario.
>>>>>>> 
>>>>>>> In your case, the initial subscribe (to RLS), from the end point, will 
>>>>>>> be received by opensips via one of the TCP workers (the alg for 
>>>>>>> distributing the new TCP conns to a TCP worker is rather simple, 
>>>>>>> filling in the workers from first to last, so yes, the first conn will 
>>>>>>> land all the time into the first worker).
>>>>>>> 
>>>>>>> Now, while handling this initial SUBSCRIBE (in RLS module), the module 
>>>>>>> will fire the 30 new SUBSCRIBE requests (to the real entities) - these 
>>>>>>> SUBSCRIBEs (I assume) will go via loopback to OpenSIPS again, as 
>>>>>>> presence server - it this correct?
>>>>>> No. All (except maybe one) will go to a different server (this is a test 
>>>>>> server and all contacts are in different domains). But even for the one 
>>>>>> that is local it still does the DNS lookup because I have dns = yes in 
>>>>>> settings that forces DNS lookups even for local domains.
>>>>>> 
>>>>>>> if so, is there any DNS lookup involved while these SUBSCRIBEs are 
>>>>>>> looping (I would say no, as RLS uses the `presence_server` param (but 
>>>>>>> it can be FQDN I guess :-/ ) ?
>>>>>> Yes. See above. There is a DNS lookup (3 actually NAPTR/SRV/A) involved 
>>>>>> for every destination. That's why it's blocked for 7.5 minutes out of 
>>>>>> every 10 minutes.
>>>>>> 
>>>>>>> Aslo, this looping is done via TCP (between opensips RLS and opensips 
>>>>>>> Presence Server) ?
>>>>>> Yes. TLS actually.
>>>>>> 
>>>>>> In case I didn't convey my message well, I want to reiterate that I do 
>>>>>> not see this as an RLS issue (except maybe that RLS could be optimized 
>>>>>> to use different workers for the different SUBSCRIBEs it sends out, 
>>>>>> instead of sending all of them from just one worker). The problem I see 
>>>>>> is that while this 1st worker is occupied for 7.5 minutes out of 10 with 
>>>>>> RLS, no incoming request is processed, as if it is allocated to the same 
>>>>>> worker that is already busy with RLS. RLS in this case is just a useful 
>>>>>> trigger that highlights the problem because it keeps 1 worker busy 75% 
>>>>>> of the time.
>>>>>> 
>>>>>> My question is about why are no other incoming requests processed during 
>>>>>> the 7.5 minutes while RLS keeps the 1st TCP worker busy (because of DNS 
>>>>>> failures), but I see incoming requests processed when I'm in the 2.5 
>>>>>> minute window when RLS is idle and not using TCP worker 1. For me it 
>>>>>> looks like some allocation problem with TCP workers as there are 4 idle 
>>>>>> workers, yet the incoming messages are not allocated to any of them.
>>>>>> 
>>>>>>> Best regards,
>>>>>>> 
>>>>>>> Bogdan-Andrei Iancu
>>>>>>> 
>>>>>>> OpenSIPS Founder and Developer
>>>>>>>  https://www.opensips-solutions.com
>>>>>>> OpenSIPS Summit, Amsterdam, May 2020
>>>>>>>  https://www.opensips.org/events/Summit-2020Amsterdam/
>>>>>>> OpenSIPS Bootcamp, Miami, March 2020
>>>>>>>  https://opensips.org/training/OpenSIPS_Bootcamp_2020/
>>>>>>> 
>>>>>>> On 1/10/20 1:59 PM, Dan Pascu wrote:
>>>>>>>> I noticed some unexpected behavior related to how TCP workers are 
>>>>>>>> allocated to process requests. This was highlighted during dome DNS 
>>>>>>>> outage due to how opensips was configured.
>>>>>>>> 
>>>>>>>> Here are the relevant bits from my configuration to layout the context:
>>>>>>>> 
>>>>>>>> I listen on UDP, TCP and TLS and I start 5 UDP and 5 TCP worker 
>>>>>>>> processes, but allow them to grow up to 25 based on load:
>>>>>>>> 
>>>>>>>> listen = udp:IP:5060
>>>>>>>> listen = tcp:IP:5060
>>>>>>>> listen = tls:IP:5061
>>>>>>>> 
>>>>>>>> auto_scaling_profile = SIP_WORKERS
>>>>>>>>      scale up to 25 on 80% for 4 cycles within 5
>>>>>>>>      scale down to 5 on 20% for 10 cycles
>>>>>>>> 
>>>>>>>> tcp_workers = 5 use_auto_scaling_profile SIP_WORKERS
>>>>>>>> udp_workers = 5 use_auto_scaling_profile SIP_WORKERS
>>>>>>>> 
>>>>>>>> 
>>>>>>>> DNS is configured to use only 1 server and only 1 attempt with a 
>>>>>>>> timeout of 5 seconds per request:
>>>>>>>> 
>>>>>>>> dns                   = yes
>>>>>>>> rev_dns               = no
>>>>>>>> dns_use_search_list   = no
>>>>>>>> disable_dns_blacklist = yes
>>>>>>>> dns_retr_time         = 5
>>>>>>>> dns_retr_no           = 1
>>>>>>>> dns_servers_no        = 1
>>>>>>>> 
>>>>>>>> This means that every time a domain is looked up but the DNS server is 
>>>>>>>> down, it will do 3 requests (NAPTR, SRV and A) and each will take 5 
>>>>>>>> seconds to timeout. In other words a DNS lookup for a domain will 
>>>>>>>> timeout after 15 seconds.
>>>>>>>> 
>>>>>>>> I have 1 device that connects over TLS and registers an account that 
>>>>>>>> uses RLS and has 30 contacts stored.
>>>>>>>> 
>>>>>>>> Now the event was that the main DNS server was down, and because of my 
>>>>>>>> configuration I didn't fallback to the secondary one from resolv.conf 
>>>>>>>> so all DNS requests failed.
>>>>>>>> 
>>>>>>>> During this time I noticed that whenever RLS kicked in it would 
>>>>>>>> attempt to send SUBSCRIBEs to the 30 contacts and fail for each of 
>>>>>>>> them, and the whole thing would take approximately 7.5 minutes, during 
>>>>>>>> which time it would always use the 1st TCP worker which would increase 
>>>>>>>> it's load and 1 minute load to 100% and the 10 minutes load would stay 
>>>>>>>> at 77%. This was in line with the fact that RLS was triggered every 10 
>>>>>>>> minutes and spend 7.5 minutes stuck in DNS timeouts, so it was busy 
>>>>>>>> approximately 75% of the time.
>>>>>>>> 
>>>>>>>> The fact that RLS always used TCP worker 1 is not unexpected as the 
>>>>>>>> SIP device I mentioned was the only one connected to the proxy and the 
>>>>>>>> only one sending requests, so the proxy was mostly idle doing RLS 
>>>>>>>> every 10 minutes, besides the occasional REGISTER/SUBSCRIBE from the 
>>>>>>>> device.
>>>>>>>> 
>>>>>>>> The unexpected behavior is that during the 7.5 minutes when RLS tried 
>>>>>>>> to send SUBSCRIBEs to the contacts, any REGISTER or INVITE received by 
>>>>>>>> the proxy would not be processed. They seem to be scheduled on the 
>>>>>>>> same 1st TCP worker that is already loaded 100% from the RLS 
>>>>>>>> processing that is going on. I never see any log message from my 
>>>>>>>> script about processing the REGISTER or INVITE and they just timeout 
>>>>>>>> on the client. If I send the REGISTER or INVITE during the 2.5 minutes 
>>>>>>>> when RLS is not trying to send SUBSCRIBEs to the contacts, then I see 
>>>>>>>> the REGISTER and INVITE being processed and logging from the script, 
>>>>>>>> but the INVITE also fails due to DNS failure.
>>>>>>>> 
>>>>>>>> If I change my outbound proxy to prefer UDP, then I see the REGISTER 
>>>>>>>> and INVITE being processed, but if I use TCP or TLS I do not see them 
>>>>>>>> being processed unless I'm in the 2.5 minute window when the proxy is 
>>>>>>>> not doing RLS (actually I never checked but it's possible that the 
>>>>>>>> requests that arrived in the 7.5 minute window were actually processed 
>>>>>>>> and logged when the RLS processing window ended, but I never waited 
>>>>>>>> that long and they always timeout out on the client in 30 seconds).
>>>>>>>> 
>>>>>>>> Now I can understand that RLS does all in a single worker (it does a 
>>>>>>>> database lookup for the contacts and then loops all of them trying to 
>>>>>>>> send a SUBSCRIBE for each), even though it could be argued that it 
>>>>>>>> could be optimized to delegate each sending out to a different worker.
>>>>>>>> 
>>>>>>>> What puzzles me is why is opensips allocating the incoming requests it 
>>>>>>>> receives to a TCP worker that is already busy and shows a 100% load in 
>>>>>>>> opensips-cli, while it has 4 other TCP workers that are completely 
>>>>>>>> idle. Or if my conclusion is wrong, what exactly happens that during 
>>>>>>>> the 7.5 minutes where RLS uses TCP worker 1 trying to send out the 
>>>>>>>> SUBSCRIBEs and failing, that no incoming request is processed by the 
>>>>>>>> other 4 idle TCP workers and it just times out?
>>>>>>>> 
>>>>>>>> That is not to say that I do not see the other TCP worker's pid in 
>>>>>>>> syslog at all, but they only appear very rarely and the idle workers 
>>>>>>>> do not seem to ever be used during the 7.5 minute busy window when the 
>>>>>>>> 1st worker is 100% loaded. So some worker allocation seems to happen 
>>>>>>>> when processing multiple incoming requests that arrive in parallel, 
>>>>>>>> but while RLS is sending out the SUBSCRIBEs it never seems to try to 
>>>>>>>> use the idle workers for incoming requests.
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Dan
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> _______________________________________________
>>>>>>>> Devel mailing list
>>>>>>>> [email protected]
>>>>>>>> http://lists.opensips.org/cgi-bin/mailman/listinfo/devel
>>>>>> --
>>>>>> Dan
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>> --
>>>> Dan
>>>> 
>>>> 
>>>> 
>>>> 
>> 
>> --
>> Dan
>> 
>> 
>> 
>> 
> 


--
Dan





_______________________________________________
Devel mailing list
[email protected]
http://lists.opensips.org/cgi-bin/mailman/listinfo/devel

Reply via email to