On 20 Jan 2020, at 10:12, Bogdan-Andrei Iancu wrote:

> Hi Dan,
>
> If you are able to reproduce the issue, then while OpenSIPS is "blocked",
> do "trap" and "mi get_statistics load:" in the cli. Let's see exactly what
> is going on. Eventually run these commands at 1 min intervals (considering
> that you mentioned the 7.5 mins blocking time).
I observed this behavior when my main DNS server accidentally died. In order
to reproduce it, I'd have to voluntarily kill it, which I don't want to do,
as this is the main DNS server for all the domains we use, not just some
test server. The only other alternative would be to set up a whole parallel
platform with separate domains and name servers and proxies, which is
something I do not have the time for, just to make this test. I posted this
hoping that someone could take a look at the TCP worker allocation and spot
if there is something wrong with it.

> Regards,
>
> Bogdan-Andrei Iancu
>
> OpenSIPS Founder and Developer
> https://www.opensips-solutions.com
> OpenSIPS Summit, Amsterdam, May 2020
> https://www.opensips.org/events/Summit-2020Amsterdam/
> OpenSIPS Bootcamp, Miami, March 2020
> https://opensips.org/training/OpenSIPS_Bootcamp_2020/
>
> On 1/17/20 3:27 PM, Dan Pascu wrote:
>> On 17 Jan 2020, at 12:19, Bogdan-Andrei Iancu wrote:
>>
>>> Hi Dan,
>>>
>>> So, 29 of the SUBSCRIBE requests are forwarded via TCP to another server,
>>> let's call it S1.
>>>
>>> Now, looking again at the blocking stage: if the UAC tries to send new
>>> traffic via the same TCP conn, it is clear it will block, as the conn, on
>>> the OpenSIPS side, is blocked by the DNS queries.
>> Figured that much.
>>
>>> Still, opening a new TCP conn should do it - the conn will be assigned to
>>> the next TCP worker process.
>> That's what I thought, but not what I observed. During the busy window
>> nothing worked except the RLS trying to send using 1 worker, while 4 other
>> workers were idle and no other incoming traffic was processed.
>>
>>> I think it is very unlikely (still not impossible) to have this new conn
>>> assigned also to the same proc as the first one (the blocked one). Have
>>> you tried checking the internal load of the TCP processes
>>> ("get_statistics load:")?
>> It's in my original post. The load was 100%/100%/77% during the busy
>> window.
>>
>> My question is why it is still possible (even if unlikely) to allocate
>> some job to this TCP worker that is already busy. When a request comes in
>> and a worker is selected to process it, isn't that worker marked as busy,
>> so it's not used by anything else until it finishes processing the current
>> request, when it should be released back to the pool and only then be able
>> to be used for something else?
>>
>>> Regards,
>>>
>>> Bogdan-Andrei Iancu
>>>
>>> On 1/16/20 5:23 PM, Dan Pascu wrote:
>>>> On 15 Jan 2020, at 15:52, Bogdan-Andrei Iancu wrote:
>>>>
>>>>> Hi Dan,
>>>>>
>>>>> OK, so the Presence Server is on a different instance than the RLS
>>>>> server and the routing between them is DNS based - thanks for
>>>>> clarifying this for me.
>>>> No. The server handles both RLS and presence, but the contacts for the
>>>> account are not on this server; they are handled by remote servers. So
>>>> when the RLS subscribe comes in, it finds the 30 contacts, out of which
>>>> only 1 is local. The other 29 are on some other server. But overall it
>>>> doesn't matter. Because of dns=yes in the config, even for the local
>>>> contact it will still do DNS lookups before forwarding it to itself.
>>>>
>>>>> Again, as a first step, I'm trying to understand the scenario you have
>>>>> (including how RLS works) so I can understand what is going on inside
>>>>> OpenSIPS.
>>>>>
>>>>> So we have the 30 SUBSCRIBE requests being sent out from one TCP worker
>>>>> process. And due to the DNS failure, this takes 7.5 minutes. And you
>>>>> claim that during these 7.5 minutes OpenSIPS was not able to handle any
>>>>> other traffic?
>>>>> If so, same questions:
>>>>> - the remaining traffic was coming via the same TCP conn as the
>>>>>   originally received SUBSCRIBE?
>>>> Yes and no. Initially it was yes, because it was just a client connected
>>>> over TLS and sending all requests over that. So when I attempted to send
>>>> an INVITE it was sent over the same TLS connection. But when that
>>>> failed, I tried to change my outbound proxy to TCP and also tried to
>>>> dial the address as sip:user@domain;transport=tcp, which should have
>>>> forced a separate TCP connection. But neither worked; the only thing
>>>> that worked was to use UDP (either through the outbound proxy or
>>>> explicitly in the URI).
>>>>
>>>>> - if NO on above, the remaining traffic was also RLS related?
>>>> No. It was INVITE/REGISTER.
>>>>
>>>>> - if NO on above, was the remaining traffic also TCP/TLS, or was it
>>>>>   UDP?
>>>> TCP/TLS didn't work in any way during the 7.5 minute busy window; UDP
>>>> worked.
>>>>
>>>>> Just as a fact, there is no kind of dependency/sync between the TCP
>>>>> workers that would block all the other procs while one of them is busy.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Bogdan-Andrei Iancu
>>>>>
>>>>> On 1/15/20 1:13 PM, Dan Pascu wrote:
>>>>>> On 15 Jan 2020, at 12:07, Bogdan-Andrei Iancu wrote:
>>>>>>
>>>>>>> Hi Dan,
>>>>>>>
>>>>>>> I'm not so familiar with how RLS works, so bear with me while trying
>>>>>>> to understand the scenario.
>>>>>>>
>>>>>>> In your case, the initial subscribe (to RLS), from the end point,
>>>>>>> will be received by opensips via one of the TCP workers (the alg for
>>>>>>> distributing the new TCP conns to the TCP workers is rather simple,
>>>>>>> filling in the workers from first to last, so yes, the first conn
>>>>>>> will land all the time on the first worker).
>>>>>>>
>>>>>>> Now, while handling this initial SUBSCRIBE (in the RLS module), the
>>>>>>> module will fire the 30 new SUBSCRIBE requests (to the real
>>>>>>> entities) - these SUBSCRIBEs (I assume) will go via loopback to
>>>>>>> OpenSIPS again, as presence server - is this correct?
>>>>>> No. All (except maybe one) will go to a different server (this is a
>>>>>> test server and all contacts are in different domains). But even for
>>>>>> the one that is local it still does the DNS lookup, because I have
>>>>>> dns = yes in the settings, which forces DNS lookups even for local
>>>>>> domains.
>>>>>>
>>>>>>> If so, is there any DNS lookup involved while these SUBSCRIBEs are
>>>>>>> looping (I would say no, as RLS uses the `presence_server` param
>>>>>>> (but it can be an FQDN I guess :-/ ))?
>>>>>> Yes. See above. There is a DNS lookup (3 actually: NAPTR/SRV/A)
>>>>>> involved for every destination. That's why it's blocked for 7.5
>>>>>> minutes out of every 10 minutes.
>>>>>>
>>>>>>> Also, is this looping done via TCP (between the opensips RLS and the
>>>>>>> opensips Presence Server)?
>>>>>> Yes. TLS actually.
>>>>>>
>>>>>> In case I didn't convey my message well, I want to reiterate that I
>>>>>> do not see this as an RLS issue (except maybe that RLS could be
>>>>>> optimized to use different workers for the different SUBSCRIBEs it
>>>>>> sends out, instead of sending all of them from just one worker). The
>>>>>> problem I see is that while this 1st worker is occupied for 7.5
>>>>>> minutes out of 10 with RLS, no incoming request is processed, as if
>>>>>> it is allocated to the same worker that is already busy with RLS.
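(Side note: the "fill the workers from first to last" assignment described
above can be modeled with a toy sketch - my own illustration, not the actual
OpenSIPS code - which shows why a new connection should normally land on an
idle worker rather than the blocked one:)

```python
# Toy model of fill-first connection-to-worker assignment
# (hypothetical sketch, not the actual OpenSIPS implementation).

def assign_worker(workers):
    """Pick the first worker that is not busy; None if all are busy."""
    for w in workers:
        if not w["busy"]:
            return w
    return None

# 5 TCP workers; worker 1 is stuck in DNS timeouts sending RLS SUBSCRIBEs
workers = [{"id": i, "busy": False} for i in range(1, 6)]
workers[0]["busy"] = True

# Under this model a new conn should be assigned to worker 2
new_conn_worker = assign_worker(workers)
print(new_conn_worker["id"])  # -> 2
```

Under this model the observed behavior (new connections still waiting on
worker 1) would not happen, which is exactly the puzzle being discussed.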
>>>>>> RLS in this case is just a useful trigger that highlights the
>>>>>> problem, because it keeps 1 worker busy 75% of the time.
>>>>>>
>>>>>> My question is about why no other incoming requests are processed
>>>>>> during the 7.5 minutes while RLS keeps the 1st TCP worker busy
>>>>>> (because of DNS failures), but I do see incoming requests processed
>>>>>> when I'm in the 2.5 minute window when RLS is idle and not using TCP
>>>>>> worker 1. To me it looks like some allocation problem with the TCP
>>>>>> workers, as there are 4 idle workers, yet the incoming messages are
>>>>>> not allocated to any of them.
>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Bogdan-Andrei Iancu
>>>>>>>
>>>>>>> On 1/10/20 1:59 PM, Dan Pascu wrote:
>>>>>>>> I noticed some unexpected behavior related to how TCP workers are
>>>>>>>> allocated to process requests. This was highlighted during a DNS
>>>>>>>> outage, due to how opensips was configured.
>>>>>>>>
>>>>>>>> Here are the relevant bits from my configuration to lay out the
>>>>>>>> context:
>>>>>>>>
>>>>>>>> I listen on UDP, TCP and TLS and I start 5 UDP and 5 TCP worker
>>>>>>>> processes, but allow them to grow up to 25 based on load:
>>>>>>>>
>>>>>>>> listen = udp:IP:5060
>>>>>>>> listen = tcp:IP:5060
>>>>>>>> listen = tls:IP:5061
>>>>>>>>
>>>>>>>> auto_scaling_profile = SIP_WORKERS
>>>>>>>>     scale up to 25 on 80% for 4 cycles within 5
>>>>>>>>     scale down to 5 on 20% for 10 cycles
>>>>>>>>
>>>>>>>> tcp_workers = 5 use_auto_scaling_profile SIP_WORKERS
>>>>>>>> udp_workers = 5 use_auto_scaling_profile SIP_WORKERS
>>>>>>>>
>>>>>>>> DNS is configured to use only 1 server and only 1 attempt, with a
>>>>>>>> timeout of 5 seconds per request:
>>>>>>>>
>>>>>>>> dns = yes
>>>>>>>> rev_dns = no
>>>>>>>> dns_use_search_list = no
>>>>>>>> disable_dns_blacklist = yes
>>>>>>>> dns_retr_time = 5
>>>>>>>> dns_retr_no = 1
>>>>>>>> dns_servers_no = 1
>>>>>>>>
>>>>>>>> This means that every time a domain is looked up while the DNS
>>>>>>>> server is down, it will do 3 requests (NAPTR, SRV and A) and each
>>>>>>>> will take 5 seconds to time out. In other words a DNS lookup for a
>>>>>>>> domain will time out after 15 seconds.
>>>>>>>>
>>>>>>>> I have 1 device that connects over TLS and registers an account
>>>>>>>> that uses RLS and has 30 contacts stored.
>>>>>>>>
>>>>>>>> Now the event was that the main DNS server was down, and because of
>>>>>>>> my configuration I didn't fall back to the secondary one from
>>>>>>>> resolv.conf, so all DNS requests failed.
>>>>>>>>
>>>>>>>> During this time I noticed that whenever RLS kicked in, it would
>>>>>>>> attempt to send SUBSCRIBEs to the 30 contacts and fail for each of
>>>>>>>> them, and the whole thing would take approximately 7.5 minutes,
>>>>>>>> during which time it would always use the 1st TCP worker, which
>>>>>>>> would increase its load and 1 minute load to 100%, while the 10
>>>>>>>> minute load would stay at 77%.
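(The 7.5 minute figure follows directly from the numbers above - 3 failed
queries per lookup at 5 seconds each, times 30 contacts:)

```python
# Back-of-the-envelope check of the DNS timeout math from the config above.
dns_retr_time = 5       # seconds per query attempt (dns_retr_time = 5)
queries_per_lookup = 3  # NAPTR, SRV and A, each timing out independently
contacts = 30           # contacts RLS sends SUBSCRIBEs to

lookup_timeout = dns_retr_time * queries_per_lookup  # 15 s per destination
total = contacts * lookup_timeout                    # 450 s overall
print(total / 60)  # -> 7.5 minutes, matching the observed busy window
```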
>>>>>>>> This was in line with the fact that RLS was triggered every 10
>>>>>>>> minutes and spent 7.5 minutes stuck in DNS timeouts, so it was busy
>>>>>>>> approximately 75% of the time.
>>>>>>>>
>>>>>>>> The fact that RLS always used TCP worker 1 is not unexpected, as
>>>>>>>> the SIP device I mentioned was the only one connected to the proxy
>>>>>>>> and the only one sending requests, so the proxy was mostly idle,
>>>>>>>> doing RLS every 10 minutes besides the occasional
>>>>>>>> REGISTER/SUBSCRIBE from the device.
>>>>>>>>
>>>>>>>> The unexpected behavior is that during the 7.5 minutes when RLS
>>>>>>>> tried to send SUBSCRIBEs to the contacts, any REGISTER or INVITE
>>>>>>>> received by the proxy would not be processed. They seem to be
>>>>>>>> scheduled on the same 1st TCP worker that is already loaded 100%
>>>>>>>> from the ongoing RLS processing. I never see any log message from
>>>>>>>> my script about processing the REGISTER or INVITE, and they just
>>>>>>>> time out on the client. If I send the REGISTER or INVITE during the
>>>>>>>> 2.5 minutes when RLS is not trying to send SUBSCRIBEs to the
>>>>>>>> contacts, then I see the REGISTER and INVITE being processed and
>>>>>>>> logged from the script, but the INVITE still fails due to the DNS
>>>>>>>> failure.
>>>>>>>>
>>>>>>>> If I change my outbound proxy to prefer UDP, then I see the
>>>>>>>> REGISTER and INVITE being processed, but if I use TCP or TLS I do
>>>>>>>> not see them being processed unless I'm in the 2.5 minute window
>>>>>>>> when the proxy is not doing RLS (actually I never checked, so it's
>>>>>>>> possible that the requests that arrived in the 7.5 minute window
>>>>>>>> were processed and logged after the RLS processing window ended,
>>>>>>>> but I never waited that long and they always timed out on the
>>>>>>>> client in 30 seconds).
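(The load figures reported by opensips-cli are consistent with that timing:
a worker busy 7.5 out of every 10 minutes averages ~75% over a 10 minute
window, close to the observed 77%, while any 1 minute window falling fully
inside the busy phase reads 100%:)

```python
# Expected worker load given a 7.5 min busy phase in each 10 min RLS cycle.
busy_minutes, cycle_minutes = 7.5, 10.0
ten_min_load = round(100 * busy_minutes / cycle_minutes)
print(ten_min_load)  # -> 75 (%); opensips-cli reported 77%
```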
>>>>>>>>
>>>>>>>> Now I can understand that RLS does it all in a single worker (it
>>>>>>>> does a database lookup for the contacts and then loops over all of
>>>>>>>> them trying to send a SUBSCRIBE for each), even though it could be
>>>>>>>> argued that it could be optimized to delegate each send to a
>>>>>>>> different worker.
>>>>>>>>
>>>>>>>> What puzzles me is why opensips is allocating the incoming requests
>>>>>>>> it receives to a TCP worker that is already busy and shows a 100%
>>>>>>>> load in opensips-cli, while it has 4 other TCP workers that are
>>>>>>>> completely idle. Or, if my conclusion is wrong, what exactly
>>>>>>>> happens during the 7.5 minutes when RLS uses TCP worker 1 trying
>>>>>>>> and failing to send out the SUBSCRIBEs, such that no incoming
>>>>>>>> request is processed by the other 4 idle TCP workers and it just
>>>>>>>> times out?
>>>>>>>>
>>>>>>>> That is not to say that I do not see the other TCP workers' pids in
>>>>>>>> syslog at all, but they only appear very rarely, and the idle
>>>>>>>> workers never seem to be used during the 7.5 minute busy window
>>>>>>>> when the 1st worker is 100% loaded. So some worker allocation does
>>>>>>>> seem to happen when processing multiple incoming requests that
>>>>>>>> arrive in parallel, but while RLS is sending out the SUBSCRIBEs it
>>>>>>>> never seems to use the idle workers for incoming requests.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Dan

--
Dan

_______________________________________________
Devel mailing list
[email protected]
http://lists.opensips.org/cgi-bin/mailman/listinfo/devel
