[389-users] Re: Inconsistent Ldap connection issues

James Chapman via 389-users Wed, 09 Oct 2024 09:25:12 -0700

On Wed, Oct 9, 2024 at 11:36 AM Joe Fletcher <jflet...@sscinc.com> wrote:


> Just a thought: what are the open files limits set to on the system?
>
>
>
> I’ve seen plenty of cases where increasing descriptor limits can help in
> these instances.
>
>
>
>
>
>
>
> *From:* Thierry Bordaz via 389-users <389-users@lists.fedoraproject.org>
> *Sent:* Wednesday, October 9, 2024 10:30 AM
> *To:* Gary Waters <gwat...@caltech.edu>; General discussion list for the
> 389 Directory server project. <389-users@lists.fedoraproject.org>; Marc
> Sauton <msau...@redhat.com>; James Chapman <jacha...@redhat.com>
> *Cc:* gwaters-...@caltech.edu; Thierry Bordaz <tbor...@redhat.com>
> *Subject:* [389-users] Re: Inconsistent Ldap connection issues
>
>
>
> **** The e-mail below is from an external source. Please do not open
> attachments or click links from an unknown or suspicious origin. ****
>
>
>
> Hi Gary,
>
> My understanding is that a single client opens ~10K connections in ~6min,
> that is an average of 26 connections per second. It creates a set of
> established connections that spike up to 961 parallel connections. The
> problem being that from time to time the client reports 'can not contact
> LDAP server' and while you are running 2.4 this problem did not occur in
> 1.4.3. Am I correct ?
>
> My first guess is that Marc is correct and tuning of tcp param should be
> the culprit. Now it does not explain why it occurs only in 2.4 but may be
> you were running 1.4.3 on RHEL7 and are running 2.4 on RHEL8.
>
> 2.4 contains significant changes around that area. I do not recall all
> commit in connection handling side. AFAIK what is "new" is a dedicated
> listener thread (polling on ports 389/636/ldapi for new connections) and a
> multiple polling threads (polling the established connections)
> My understanding is that your problem impacts the listener thread as it is
> responsible to accept new connections and it is the new problem you detect
> in 2.4. @james did I miss something ?
>
JC: There has been a lot of work around connection management around this
version, the "accept" thread that is responsible for polling the ldap/ldaps
ports for connection attempts was introduced in a previous version. This
thread will fail to accept a new connection when the number of existing
connections hits a certain limit. However, when this happens you should see
this message in the error logs: "Not listening for new connections - too
many fds open", so IMHO I think the error message you see is related to
something else.

> If the culprit is not in TCP tuning, @james do you remember known bugs
> that would limit the listener thread ?
>
JC: I failed to reproduce this issue locally with the same version and file
descriptor limit (8192) and 10k connections. So I am not aware of anything
else that might cause this error message.

> best regards
> thierry
>
> On 10/8/24 7:36 PM, Gary Waters wrote:
>
> Hi Thierry,
>
> Ah yes of course. Here is 1 run of their web app load test, it is 6
> minutes long, and it should mostly be only the test it self. I will start
> looking for
>
> We encountered 2 "Can not contact ldap server" errors during this run.
>
> 2 cant contact ldap server errors in this run
>
> ----------- Access Log Output ------------
>
> Start of Logs:    08/Oct/2024:09:53:35.810833927
> End of Logs:      08/Oct/2024:09:59:52.361830449
>
> Processed Log Time:  0 Hours, 6 Minutes, 16.550998016 Seconds
>
> Restarts:                      1
> Secure Protocol Versions:
>   - TLS1.2 128-bit AES-GCM (9833 connections)
>
> Peak Concurrent Connections:   689
> Total Operations:              86412
> Total Results:                 86412
> Overall Performance:           100.0%
>
> Total Connections:             9933          (26.38/sec)  (1582.73/min)
>  - LDAP Connections:           9933          (26.38/sec)  (1582.73/min)
>  - LDAPI Connections:          0             (0.00/sec)  (0.00/min)
>  - LDAPS Connections:          0             (0.00/sec)  (0.00/min)
>  - StartTLS Extended Ops:      9833          (26.11/sec)  (1566.80/min)
>
> Searches:                      66647         (176.99/sec)  (10619.60/min)
> Modifications:                 0             (0.00/sec)  (0.00/min)
> Adds:                          0             (0.00/sec)  (0.00/min)
> Deletes:                       0             (0.00/sec)  (0.00/min)
> Mod RDNs:                      0             (0.00/sec)  (0.00/min)
> Compares:                      0             (0.00/sec)  (0.00/min)
> Binds:                         9932          (26.38/sec)  (1582.57/min)
>
> Average wtime (wait time):     0.001407368
> Average optime (op time):      0.003186859
> Average etime (elapsed time):  0.004591048
>
> Multi-factor Authentications:  0
> Proxied Auth Operations:       0
> Persistent Searches:           0
> Internal Operations:           0
> Entry Operations:              0
> Extended Operations:           9833
> Abandoned Requests:            0
> Smart Referrals Received:      0
>
> VLV Operations:                0
> VLV Unindexed Searches:        0
> VLV Unindexed Components:      0
> SORT Operations:               0
>
> Entire Search Base Queries:    0
> Paged Searches:                0
> Unindexed Searches:            0
> Unindexed Components:          0
> Invalid Attribute Filters:     0
> FDs Taken:                     9933
> FDs Returned:                  9932
> Highest FD Taken:              961
>
> Broken Pipes:                  0
> Connections Reset By Peer:     0
> Resource Unavailable:          0
> Max BER Size Exceeded:         0
>
> Binds:                         9932
> Unbinds:                       9225
> -----------------------------------
>  - LDAP v2 Binds:              0
>  - LDAP v3 Binds:              9932
>  - AUTOBINDs(LDAPI):           0
>  - SSL Client Binds:           0
>  - Failed SSL Client Binds:    0
>  - SASL Binds:                 0
>  - Directory Manager Binds:    0
>  - Anonymous Binds:            99
>
> ================
>
> After this run I bumped up these from 4096,
>
> net.ipv4.tcp_max_syn_backlog = 6144
> net.core.somaxconn = 6144
>
> Yet we still get the ldap errors (this one and the start tls request error
> previously mentioned.)
>
> Should I bump up the
>
> On 10/8/24 03:47, Thierry Bordaz wrote:
>
> Hi Gary,
>
> I meant that the access logs covered 5hours. It would be helpful to
> capture/focus on the logs from the few minutes before/after the time when
> the problem occurred. Then check from those limited logs if there is a
> pattern or unexpected events (long operation, no operation, abandon, ...)
>
> best regards
> thierry
>
> On 10/7/24 7:37 PM, Gary Waters wrote:
>
> Hi Thierry,
>
> Ok I ll decrease the timeout to 15 seconds then.
>
> Reducing the size of the logs will help.
>
> Which log and how do I do this ?
>
> Thanks Marc and Theirry!
>
> -Gary
>
> On 10/7/24 00:26, Thierry Bordaz wrote:
>
> Hi,
>
> Those slap_poll error means that the server was unable to send back PDU to
> the client. It can occur if the client sends a request and does not read
> fast enough the results. The timeout is high 30s (30000), could it be that
> the problem is on the client side (app) ?
>
> I suggest that you focus on the timestamp when the application reports a
> failure. Then look in the access/error logs from 1-3min before and after
> the time of the failure. Logconv from that limited scope will be more
> helpful than a global one.
>
> The pattern looks to be an app opens a connection, switch to secure
> connection (start-tls), issue 6-8 SRCH then close. etime/wtime/optime looks
> fine but as it is an average (over 1M op) it is not helpful. Reducing the
> size of the logs will help.
> I found interesting the abandon op as it is possibly related to a
> performance issue.
>
> best regards
> thierry
>
>
>
> On 10/4/24 11:54 PM, Gary Waters via 389-users wrote:
>
> Hi Marc,
>
> I have made nsslapd-listen-backlog-size to 512.
>
> For the ioblocktimeout, I increased it because of an error I was seeing:
>
> [30/Sep/2024:16:26:55.987681019 -0700] - ERR - slapd_poll - (743) - Timed
> out
> [30/Sep/2024:16:34:49.646922635 -0700] - ERR - slapd_poll - (568) - Timed
> out
>
> Googling stated that I should increase the ioblocktimeout. So I bumped it
> up from 20000 to 30000.
>
> Since then, those slapd_poll timed out errors have not occurred. Should I
> have changed something else?
>
> What should I increase these to?
>
> net.core.somaxconn = 4096
> net.ipv4.tcp_max_syn_backlog = 4096
>
> Thanks so much for your help!
>
> -Gary
>
> On 10/4/24 11:55, Marc Sauton wrote:
>
> tune up nsslapd-listen-backlog-size
>
> and verify the net.core.somaxconn and net.ipv4.tcp_max_syn_backlog are
> high enough ( sysctl -a )
>
> possibly tune down the nsslapd-ioblocktimeout value
>
> Thanks,
>
> M.
>
>
>
> On Fri, Oct 4, 2024 at 11:06 AM gwaters-web--- via 389-users <
> 389-users@lists.fedoraproject.org> wrote:
>
> Hello,
>
> We are experiencing a new issue since we upgraded from 389-ds-base from
> 1.4~ish to 2.0.15 on RHEL 8. I couldnt figure how to fix it, so I
> switched to RHEL9 and are on 2.4.5-9.
>
> The issue is during a performance load test of a web application. The
> app logs into a website and does some things that searches against ldap,
> and does some transactions. This app has been performing fine for years,
> the app has changed so it could be something there, but I am not sure
> about that because of the percentage of the traffic that is successful.
>
> The errors for the web app are "Can't contact Ldap Server" and sometimes
> "Can't contact LDAP server. Start TLS request accepted.Server willing to
> negotiate SSL. (0xFFFF [-1])". Out of the 128k connections below, these
> errors will happen like 5 or 6 times, so its wildly inconsistent and
> random.
>
> I did a logconv analysis with 6 hours of a day of testing, see below.
> One thing that really stood out to me was the peak concurrent
> connections = 22.. That peak is so low, I dont know how these errors are
> happening.
>
> I dont see any errors in the access log ( grepping for err=1).
> I looked for cache warnings/errors in the access/errors logs, but didnt
> find any. I dont see things like unavailable connections in the access
> logs.
>
> Suggestions on what to change or look for in the logs ?
>
> Thanks,
> Gary
>
>
> information:
> Machine Size: 16G of ram, 4 core AMD  (its an EC2.m5.large, gp3 disk type)
>
> kernel:
> Linux  5.14.0-427.35.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC
> packages:
> 389-ds-base-libs-2.4.5-9.el9_4.x86_64
> 389-ds-base-2.4.5-9.el9_4.x86_64
>
> single instance of dirsrv running
> dirsrv modifcations from default:
>
> nsslapd-logging-backend: dirsrv-log,syslog
> nsslapd-maxdescriptors: 8192
> nsslapd-listen-backlog-size: 256
> nsslapd-allow-hashed-passwords: on
> nsslapd-idletimeout: 30
> nsslapd-ioblocktimeout: 30000
> nsslapd-sizelimit: -1
> nsslapd-auditlog-logging-enabled: off
> nsslapd-lookthroughlimit: -1
>
> dirsrv.systemd:
> limitNOFILE=8192
>
>  >Total Log Lines Analszed:  2694287
>  >
>  >
>  >
>  > ---------- Access Log Output ------------
>  >
>  > Start of Logs:    26/Sep/2024:10:07:32.089983378
>  > End of Logs:      26/Sep/2024:15:54:29.895403688
>  >
>  > Processed Log Time:  5 Hours, 46 Minutes, 57.805426688 Seconds
>  >
>  > Restarts:                      0
>  > Secure Protocol Versions:
>  >   - TLS1.2 128-bit AES-GCM (123117 connections)
>  >
>  > Peak Concurrent Connections:   22
>  > Total Operations:              1097043
>  > Total Results:                 1097044
>  > Overall Performance:           100.0%
>  >
>  > Total Connections:             128646        (6.18/sec) (370.78/min)
>  >  - LDAP Connections:           128646        (6.18/sec) (370.78/min)
>  >  - LDAPI Connections:          0             (0.00/sec) (0.00/min)
>  >  - LDAPS Connections:          0             (0.00/sec) (0.00/min)
>  >  - StartTLS Extended Ops:      123116        (5.91/sec) (354.84/min)
>  >
>  > Searches:                      845279        (40.60/sec) (2436.22/min)
>  > Modifications:                 0             (0.00/sec) (0.00/min)
>  > Adds:                          0             (0.00/sec) (0.00/min)
>  > Deletes:                       0             (0.00/sec) (0.00/min)
>  > Mod RDNs:                      0             (0.00/sec) (0.00/min)
>  > Compares:                      0             (0.00/sec) (0.00/min)
>  > Binds:                         128647        (6.18/sec) (370.78/min)
>  >
>  > Average wtime (wait time):     0.001560856
>  > Average optime (op time):      0.003310453
>  > Average etime (elapsed time):  0.004868040
>  >
>  > Multi-factor Authentications:  0
>  > Proxied Auth Operations:       0
>  > Persistent Searches:           0
>  > Internal Operations:           0
>  > Entry Operations:              0
>  > Extended Operations:           123116
>  > Abandoned Requests:            1
>  > Smart Referrals Received:      0
>  >
>  > VLV Operations:                0
>  > VLV Unindexed Searches:        0
>  > VLV Unindexed Components:      0
>  > SORT Operations:               0
>  >
>  > Entire Search Base Queries:    0
>  > Paged Searches:                0
>  > Unindexed Searches:            0
>  > Unindexed Components:          0
>  > Invalid Attribute Filters:     0
>  > FDs Taken:                     128646
>  > FDs Returned:                  129318
>  > Highest FD Taken:              968
>  >
>  > Broken Pipes:                  0
>  > Connections Reset By Peer:     0
>  > Resource Unavailable:          0
>  > Max BER Size Exceeded:         0
>  >
>  > Binds:                         128647
>  > Unbinds:                       119206
>  > -------------------------------------
>  >  - LDAP v2 Binds:              0
>  >  - LDAP v3 Binds:              128647
>  >  - AUTOBINDs(LDAPI):           0
>  >  - SSL Client Binds:           0
>  >  - Failed SSL Client Binds:    0
>  >  - SASL Binds:                 0
>  >  - Dir
>
> --
> _______________________________________________
> 389-users mailing list -- 389-users@lists.fedoraproject.org
> To unsubscribe send an email to 389-users-le...@lists.fedoraproject.org
> Fedora Code of Conduct:
> https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> <https://urldefense.com/v3/__https:/docs.fedoraproject.org/en-US/project/code-of-conduct/__;!!JE2nyJ8!1qCobdltMGmX_XrFdPGoEKXfZdGVOi6A5VWjbWtmSJ7AwkqY-PydSmvJfs-4UYWfupW5of5u87v7mBMWQCBDEiq_mj_I$>
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> <https://urldefense.com/v3/__https:/fedoraproject.org/wiki/Mailing_list_guidelines__;!!JE2nyJ8!1qCobdltMGmX_XrFdPGoEKXfZdGVOi6A5VWjbWtmSJ7AwkqY-PydSmvJfs-4UYWfupW5of5u87v7mBMWQCBDEsTfSxSj$>
> List Archives:
> https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org
> <https://urldefense.com/v3/__https:/lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org__;!!JE2nyJ8!1qCobdltMGmX_XrFdPGoEKXfZdGVOi6A5VWjbWtmSJ7AwkqY-PydSmvJfs-4UYWfupW5of5u87v7mBMWQCBDEslJO53N$>
> Do not reply to spam, report it:
> https://pagure.io/fedora-infrastructure/new_issue
> <https://urldefense.com/v3/__https:/pagure.io/fedora-infrastructure/new_issue__;!!JE2nyJ8!1qCobdltMGmX_XrFdPGoEKXfZdGVOi6A5VWjbWtmSJ7AwkqY-PydSmvJfs-4UYWfupW5of5u87v7mBMWQCBDEvlOa8tQ$>
>
>
>
>

-- 
_______________________________________________
389-users mailing list -- 389-users@lists.fedoraproject.org
To unsubscribe send an email to 389-users-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue

[389-users] Re: Inconsistent Ldap connection issues

Reply via email to