On Wed, Oct 9, 2024 at 11:36 AM Joe Fletcher <jflet...@sscinc.com> wrote:
> Just a thought: what are the open files limits set to on the system? > > > > I’ve seen plenty of cases where increasing descriptor limits can help in > these instances. > > > > > > > > *From:* Thierry Bordaz via 389-users <389-users@lists.fedoraproject.org> > *Sent:* Wednesday, October 9, 2024 10:30 AM > *To:* Gary Waters <gwat...@caltech.edu>; General discussion list for the > 389 Directory server project. <389-users@lists.fedoraproject.org>; Marc > Sauton <msau...@redhat.com>; James Chapman <jacha...@redhat.com> > *Cc:* gwaters-...@caltech.edu; Thierry Bordaz <tbor...@redhat.com> > *Subject:* [389-users] Re: Inconsistent Ldap connection issues > > > > **** The e-mail below is from an external source. Please do not open > attachments or click links from an unknown or suspicious origin. **** > > > > Hi Gary, > > My understanding is that a single client opens ~10K connections in ~6min, > that is an average of 26 connections per second. It creates a set of > established connections that spike up to 961 parallel connections. The > problem being that from time to time the client reports 'can not contact > LDAP server' and while you are running 2.4 this problem did not occur in > 1.4.3. Am I correct ? > > My first guess is that Marc is correct and tuning of tcp param should be > the culprit. Now it does not explain why it occurs only in 2.4 but may be > you were running 1.4.3 on RHEL7 and are running 2.4 on RHEL8. > > 2.4 contains significant changes around that area. I do not recall all > commit in connection handling side. AFAIK what is "new" is a dedicated > listener thread (polling on ports 389/636/ldapi for new connections) and a > multiple polling threads (polling the established connections) > My understanding is that your problem impacts the listener thread as it is > responsible to accept new connections and it is the new problem you detect > in 2.4. @james did I miss something ? > JC: There has been a lot of work around connection management around this version, the "accept" thread that is responsible for polling the ldap/ldaps ports for connection attempts was introduced in a previous version. This thread will fail to accept a new connection when the number of existing connections hits a certain limit. However, when this happens you should see this message in the error logs: "Not listening for new connections - too many fds open", so IMHO I think the error message you see is related to something else. > If the culprit is not in TCP tuning, @james do you remember known bugs > that would limit the listener thread ? > JC: I failed to reproduce this issue locally with the same version and file descriptor limit (8192) and 10k connections. So I am not aware of anything else that might cause this error message. > best regards > thierry > > On 10/8/24 7:36 PM, Gary Waters wrote: > > Hi Thierry, > > Ah yes of course. Here is 1 run of their web app load test, it is 6 > minutes long, and it should mostly be only the test it self. I will start > looking for > > We encountered 2 "Can not contact ldap server" errors during this run. > > 2 cant contact ldap server errors in this run > > ----------- Access Log Output ------------ > > Start of Logs: 08/Oct/2024:09:53:35.810833927 > End of Logs: 08/Oct/2024:09:59:52.361830449 > > Processed Log Time: 0 Hours, 6 Minutes, 16.550998016 Seconds > > Restarts: 1 > Secure Protocol Versions: > - TLS1.2 128-bit AES-GCM (9833 connections) > > Peak Concurrent Connections: 689 > Total Operations: 86412 > Total Results: 86412 > Overall Performance: 100.0% > > Total Connections: 9933 (26.38/sec) (1582.73/min) > - LDAP Connections: 9933 (26.38/sec) (1582.73/min) > - LDAPI Connections: 0 (0.00/sec) (0.00/min) > - LDAPS Connections: 0 (0.00/sec) (0.00/min) > - StartTLS Extended Ops: 9833 (26.11/sec) (1566.80/min) > > Searches: 66647 (176.99/sec) (10619.60/min) > Modifications: 0 (0.00/sec) (0.00/min) > Adds: 0 (0.00/sec) (0.00/min) > Deletes: 0 (0.00/sec) (0.00/min) > Mod RDNs: 0 (0.00/sec) (0.00/min) > Compares: 0 (0.00/sec) (0.00/min) > Binds: 9932 (26.38/sec) (1582.57/min) > > Average wtime (wait time): 0.001407368 > Average optime (op time): 0.003186859 > Average etime (elapsed time): 0.004591048 > > Multi-factor Authentications: 0 > Proxied Auth Operations: 0 > Persistent Searches: 0 > Internal Operations: 0 > Entry Operations: 0 > Extended Operations: 9833 > Abandoned Requests: 0 > Smart Referrals Received: 0 > > VLV Operations: 0 > VLV Unindexed Searches: 0 > VLV Unindexed Components: 0 > SORT Operations: 0 > > Entire Search Base Queries: 0 > Paged Searches: 0 > Unindexed Searches: 0 > Unindexed Components: 0 > Invalid Attribute Filters: 0 > FDs Taken: 9933 > FDs Returned: 9932 > Highest FD Taken: 961 > > Broken Pipes: 0 > Connections Reset By Peer: 0 > Resource Unavailable: 0 > Max BER Size Exceeded: 0 > > Binds: 9932 > Unbinds: 9225 > ----------------------------------- > - LDAP v2 Binds: 0 > - LDAP v3 Binds: 9932 > - AUTOBINDs(LDAPI): 0 > - SSL Client Binds: 0 > - Failed SSL Client Binds: 0 > - SASL Binds: 0 > - Directory Manager Binds: 0 > - Anonymous Binds: 99 > > ================ > > After this run I bumped up these from 4096, > > net.ipv4.tcp_max_syn_backlog = 6144 > net.core.somaxconn = 6144 > > Yet we still get the ldap errors (this one and the start tls request error > previously mentioned.) > > Should I bump up the > > On 10/8/24 03:47, Thierry Bordaz wrote: > > Hi Gary, > > I meant that the access logs covered 5hours. It would be helpful to > capture/focus on the logs from the few minutes before/after the time when > the problem occurred. Then check from those limited logs if there is a > pattern or unexpected events (long operation, no operation, abandon, ...) > > best regards > thierry > > On 10/7/24 7:37 PM, Gary Waters wrote: > > Hi Thierry, > > Ok I ll decrease the timeout to 15 seconds then. > > Reducing the size of the logs will help. > > Which log and how do I do this ? > > Thanks Marc and Theirry! > > -Gary > > On 10/7/24 00:26, Thierry Bordaz wrote: > > Hi, > > Those slap_poll error means that the server was unable to send back PDU to > the client. It can occur if the client sends a request and does not read > fast enough the results. The timeout is high 30s (30000), could it be that > the problem is on the client side (app) ? > > I suggest that you focus on the timestamp when the application reports a > failure. Then look in the access/error logs from 1-3min before and after > the time of the failure. Logconv from that limited scope will be more > helpful than a global one. > > The pattern looks to be an app opens a connection, switch to secure > connection (start-tls), issue 6-8 SRCH then close. etime/wtime/optime looks > fine but as it is an average (over 1M op) it is not helpful. Reducing the > size of the logs will help. > I found interesting the abandon op as it is possibly related to a > performance issue. > > best regards > thierry > > > > On 10/4/24 11:54 PM, Gary Waters via 389-users wrote: > > Hi Marc, > > I have made nsslapd-listen-backlog-size to 512. > > For the ioblocktimeout, I increased it because of an error I was seeing: > > [30/Sep/2024:16:26:55.987681019 -0700] - ERR - slapd_poll - (743) - Timed > out > [30/Sep/2024:16:34:49.646922635 -0700] - ERR - slapd_poll - (568) - Timed > out > > Googling stated that I should increase the ioblocktimeout. So I bumped it > up from 20000 to 30000. > > Since then, those slapd_poll timed out errors have not occurred. Should I > have changed something else? > > What should I increase these to? > > net.core.somaxconn = 4096 > net.ipv4.tcp_max_syn_backlog = 4096 > > Thanks so much for your help! > > -Gary > > On 10/4/24 11:55, Marc Sauton wrote: > > tune up nsslapd-listen-backlog-size > > and verify the net.core.somaxconn and net.ipv4.tcp_max_syn_backlog are > high enough ( sysctl -a ) > > possibly tune down the nsslapd-ioblocktimeout value > > Thanks, > > M. > > > > On Fri, Oct 4, 2024 at 11:06 AM gwaters-web--- via 389-users < > 389-users@lists.fedoraproject.org> wrote: > > Hello, > > We are experiencing a new issue since we upgraded from 389-ds-base from > 1.4~ish to 2.0.15 on RHEL 8. I couldnt figure how to fix it, so I > switched to RHEL9 and are on 2.4.5-9. > > The issue is during a performance load test of a web application. The > app logs into a website and does some things that searches against ldap, > and does some transactions. This app has been performing fine for years, > the app has changed so it could be something there, but I am not sure > about that because of the percentage of the traffic that is successful. > > The errors for the web app are "Can't contact Ldap Server" and sometimes > "Can't contact LDAP server. Start TLS request accepted.Server willing to > negotiate SSL. (0xFFFF [-1])". Out of the 128k connections below, these > errors will happen like 5 or 6 times, so its wildly inconsistent and > random. > > I did a logconv analysis with 6 hours of a day of testing, see below. > One thing that really stood out to me was the peak concurrent > connections = 22.. That peak is so low, I dont know how these errors are > happening. > > I dont see any errors in the access log ( grepping for err=1). > I looked for cache warnings/errors in the access/errors logs, but didnt > find any. I dont see things like unavailable connections in the access > logs. > > Suggestions on what to change or look for in the logs ? > > Thanks, > Gary > > > information: > Machine Size: 16G of ram, 4 core AMD (its an EC2.m5.large, gp3 disk type) > > kernel: > Linux 5.14.0-427.35.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC > packages: > 389-ds-base-libs-2.4.5-9.el9_4.x86_64 > 389-ds-base-2.4.5-9.el9_4.x86_64 > > single instance of dirsrv running > dirsrv modifcations from default: > > nsslapd-logging-backend: dirsrv-log,syslog > nsslapd-maxdescriptors: 8192 > nsslapd-listen-backlog-size: 256 > nsslapd-allow-hashed-passwords: on > nsslapd-idletimeout: 30 > nsslapd-ioblocktimeout: 30000 > nsslapd-sizelimit: -1 > nsslapd-auditlog-logging-enabled: off > nsslapd-lookthroughlimit: -1 > > dirsrv.systemd: > limitNOFILE=8192 > > >Total Log Lines Analszed: 2694287 > > > > > > > > ---------- Access Log Output ------------ > > > > Start of Logs: 26/Sep/2024:10:07:32.089983378 > > End of Logs: 26/Sep/2024:15:54:29.895403688 > > > > Processed Log Time: 5 Hours, 46 Minutes, 57.805426688 Seconds > > > > Restarts: 0 > > Secure Protocol Versions: > > - TLS1.2 128-bit AES-GCM (123117 connections) > > > > Peak Concurrent Connections: 22 > > Total Operations: 1097043 > > Total Results: 1097044 > > Overall Performance: 100.0% > > > > Total Connections: 128646 (6.18/sec) (370.78/min) > > - LDAP Connections: 128646 (6.18/sec) (370.78/min) > > - LDAPI Connections: 0 (0.00/sec) (0.00/min) > > - LDAPS Connections: 0 (0.00/sec) (0.00/min) > > - StartTLS Extended Ops: 123116 (5.91/sec) (354.84/min) > > > > Searches: 845279 (40.60/sec) (2436.22/min) > > Modifications: 0 (0.00/sec) (0.00/min) > > Adds: 0 (0.00/sec) (0.00/min) > > Deletes: 0 (0.00/sec) (0.00/min) > > Mod RDNs: 0 (0.00/sec) (0.00/min) > > Compares: 0 (0.00/sec) (0.00/min) > > Binds: 128647 (6.18/sec) (370.78/min) > > > > Average wtime (wait time): 0.001560856 > > Average optime (op time): 0.003310453 > > Average etime (elapsed time): 0.004868040 > > > > Multi-factor Authentications: 0 > > Proxied Auth Operations: 0 > > Persistent Searches: 0 > > Internal Operations: 0 > > Entry Operations: 0 > > Extended Operations: 123116 > > Abandoned Requests: 1 > > Smart Referrals Received: 0 > > > > VLV Operations: 0 > > VLV Unindexed Searches: 0 > > VLV Unindexed Components: 0 > > SORT Operations: 0 > > > > Entire Search Base Queries: 0 > > Paged Searches: 0 > > Unindexed Searches: 0 > > Unindexed Components: 0 > > Invalid Attribute Filters: 0 > > FDs Taken: 128646 > > FDs Returned: 129318 > > Highest FD Taken: 968 > > > > Broken Pipes: 0 > > Connections Reset By Peer: 0 > > Resource Unavailable: 0 > > Max BER Size Exceeded: 0 > > > > Binds: 128647 > > Unbinds: 119206 > > ------------------------------------- > > - LDAP v2 Binds: 0 > > - LDAP v3 Binds: 128647 > > - AUTOBINDs(LDAPI): 0 > > - SSL Client Binds: 0 > > - Failed SSL Client Binds: 0 > > - SASL Binds: 0 > > - Dir > > -- > _______________________________________________ > 389-users mailing list -- 389-users@lists.fedoraproject.org > To unsubscribe send an email to 389-users-le...@lists.fedoraproject.org > Fedora Code of Conduct: > https://docs.fedoraproject.org/en-US/project/code-of-conduct/ > <https://urldefense.com/v3/__https:/docs.fedoraproject.org/en-US/project/code-of-conduct/__;!!JE2nyJ8!1qCobdltMGmX_XrFdPGoEKXfZdGVOi6A5VWjbWtmSJ7AwkqY-PydSmvJfs-4UYWfupW5of5u87v7mBMWQCBDEiq_mj_I$> > List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines > <https://urldefense.com/v3/__https:/fedoraproject.org/wiki/Mailing_list_guidelines__;!!JE2nyJ8!1qCobdltMGmX_XrFdPGoEKXfZdGVOi6A5VWjbWtmSJ7AwkqY-PydSmvJfs-4UYWfupW5of5u87v7mBMWQCBDEsTfSxSj$> > List Archives: > https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org > <https://urldefense.com/v3/__https:/lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org__;!!JE2nyJ8!1qCobdltMGmX_XrFdPGoEKXfZdGVOi6A5VWjbWtmSJ7AwkqY-PydSmvJfs-4UYWfupW5of5u87v7mBMWQCBDEslJO53N$> > Do not reply to spam, report it: > https://pagure.io/fedora-infrastructure/new_issue > <https://urldefense.com/v3/__https:/pagure.io/fedora-infrastructure/new_issue__;!!JE2nyJ8!1qCobdltMGmX_XrFdPGoEKXfZdGVOi6A5VWjbWtmSJ7AwkqY-PydSmvJfs-4UYWfupW5of5u87v7mBMWQCBDEvlOa8tQ$> > > > >
-- _______________________________________________ 389-users mailing list -- 389-users@lists.fedoraproject.org To unsubscribe send an email to 389-users-le...@lists.fedoraproject.org Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue