[389-users] Re: Inconsistent Ldap connection issues

Gary Waters via 389-users Mon, 14 Oct 2024 11:45:15 -0700

JC: There has been a lot of work around connection management aroundthis version, the "accept" thread that is responsible for polling theldap/ldaps ports for connection attempts was introduced in a previousversion. This thread will fail to accept a new connection when thenumber of existing connections hits a certain limit. However, whenthis happens you should see this message in the error logs: "Notlistening for new connections - too many fds open", so IMHO I thinkthe error message you see is related to something else.

I have the 6 minute log when the event occurs, but I am not seeing "Notlistening" or "fds" in the logs.

JC: I failed to reproduce this issue locally with the same version andfile descriptor limit (8192) and 10k connections. So I am not aware ofanything else that might cause this error message.

How are you testing this ? Can you give me your load testing script??? Ihave used the "ldclt" script, and I see it does things, but based on theoutput I think its doing add/delete/modify/searches, but I cant tell howmany connects and disconnects. If I had a script, to emulate this, I cansee if I can get the developer or maybe I tweak their bonsai code.


Thanks so much,

Gary


On 10/9/24 09:24, James Chapman wrote:



On Wed, Oct 9, 2024 at 11:36 AM Joe Fletcher <jflet...@sscinc.com> wrote:

    Just a thought: what are the open files limits set to on the system?

    I’ve seen plenty of cases where increasing descriptor limits can
    help in these instances.

    *From:*Thierry Bordaz via 389-users
    <389-users@lists.fedoraproject.org>
    *Sent:* Wednesday, October 9, 2024 10:30 AM
    *To:* Gary Waters <gwat...@caltech.edu>; General discussion list
    for the 389 Directory server project.
    <389-users@lists.fedoraproject.org>; Marc Sauton
    <msau...@redhat.com>; James Chapman <jacha...@redhat.com>
    *Cc:* gwaters-...@caltech.edu; Thierry Bordaz <tbor...@redhat.com>
    *Subject:* [389-users] Re: Inconsistent Ldap connection issues

    **** The e-mail below is from an external source. Please do not
    open attachments or click links from an unknown or suspicious
    origin. ****

    Hi Gary,

    My understanding is that a single client opens ~10K connections in
    ~6min, that is an average of 26 connections per second. It creates
    a set of established connections that spike up to 961 parallel
    connections. The problem being that from time to time the client
    reports 'can not contact LDAP server' and while you are running
    2.4 this problem did not occur in 1.4.3. Am I correct ?

    My first guess is that Marc is correct and tuning of tcp param
    should be the culprit. Now it does not explain why it occurs only
    in 2.4 but may be you were running 1.4.3 on RHEL7 and are running
    2.4 on RHEL8.

    2.4 contains significant changes around that area. I do not recall
    all commit in connection handling side. AFAIK what is "new" is a
    dedicated listener thread (polling on ports 389/636/ldapi for new
    connections) and a multiple polling threads (polling the
    established connections)
    My understanding is that your problem impacts the listener thread
    as it is responsible to accept new connections and it is the new
    problem you detect in 2.4. @james did I miss something ?

JC: There has been a lot of work around connection management aroundthis version, the "accept" thread that is responsible for polling theldap/ldaps ports for connection attempts was introduced in a previousversion. This thread will fail to accept a new connection when thenumber of existing connections hits a certain limit. However, whenthis happens you should see this message in the error logs: "Notlistening for new connections - too many fds open", so IMHO I thinkthe error message you see is related to something else.


    If the culprit is not in TCP tuning, @james do you remember known
    bugs that would limit the listener thread ?

JC: I failed to reproduce this issue locally with the same version andfile descriptor limit (8192) and 10k connections. So I am not aware ofanything else that might cause this error message.


    best regards
    thierry

    On 10/8/24 7:36 PM, Gary Waters wrote:

        Hi Thierry,

        Ah yes of course. Here is 1 run of their web app load test, it
        is 6 minutes long, and it should mostly be only the test it
        self. I will start looking for

        We encountered 2 "Can not contact ldap server" errors during
        this run.

        2 cant contact ldap server errors in this run

        ----------- Access Log Output ------------

        Start of Logs:    08/Oct/2024:09:53:35.810833927
        End of Logs:      08/Oct/2024:09:59:52.361830449

        Processed Log Time:  0 Hours, 6 Minutes, 16.550998016 Seconds

        Restarts:                      1
        Secure Protocol Versions:
          - TLS1.2 128-bit AES-GCM (9833 connections)

        Peak Concurrent Connections:   689
        Total Operations:              86412
        Total Results:                 86412
        Overall Performance:           100.0%

        Total Connections:             9933 (26.38/sec)  (1582.73/min)
         - LDAP Connections:           9933 (26.38/sec)  (1582.73/min)
         - LDAPI Connections:          0 (0.00/sec)  (0.00/min)
         - LDAPS Connections:          0 (0.00/sec)  (0.00/min)
         - StartTLS Extended Ops:      9833 (26.11/sec)  (1566.80/min)

        Searches:                      66647 (176.99/sec)  (10619.60/min)
        Modifications:                 0 (0.00/sec)  (0.00/min)
        Adds:                          0 (0.00/sec)  (0.00/min)
        Deletes:                       0 (0.00/sec)  (0.00/min)
        Mod RDNs:                      0 (0.00/sec)  (0.00/min)
        Compares:                      0 (0.00/sec)  (0.00/min)
        Binds:                         9932 (26.38/sec)  (1582.57/min)

        Average wtime (wait time):     0.001407368
        Average optime (op time):      0.003186859
        Average etime (elapsed time):  0.004591048

        Multi-factor Authentications:  0
        Proxied Auth Operations:       0
        Persistent Searches:           0
        Internal Operations:           0
        Entry Operations:              0
        Extended Operations:           9833
        Abandoned Requests:            0
        Smart Referrals Received:      0

        VLV Operations:                0
        VLV Unindexed Searches:        0
        VLV Unindexed Components:      0
        SORT Operations:               0

        Entire Search Base Queries:    0
        Paged Searches:                0
        Unindexed Searches:            0
        Unindexed Components:          0
        Invalid Attribute Filters:     0
        FDs Taken:                     9933
        FDs Returned:                  9932
        Highest FD Taken:              961

        Broken Pipes:                  0
        Connections Reset By Peer:     0
        Resource Unavailable:          0
        Max BER Size Exceeded:         0

        Binds:                         9932
        Unbinds:                       9225
        -----------------------------------
         - LDAP v2 Binds:              0
         - LDAP v3 Binds:              9932
         - AUTOBINDs(LDAPI):           0
         - SSL Client Binds:           0
         - Failed SSL Client Binds:    0
         - SASL Binds:                 0
         - Directory Manager Binds:    0
         - Anonymous Binds:            99

        ================

        After this run I bumped up these from 4096,

        net.ipv4.tcp_max_syn_backlog = 6144
        net.core.somaxconn = 6144

        Yet we still get the ldap errors (this one and the start tls
        request error previously mentioned.)

        Should I bump up the

        On 10/8/24 03:47, Thierry Bordaz wrote:

            Hi Gary,

            I meant that the access logs covered 5hours. It would be
            helpful to capture/focus on the logs from the few minutes
            before/after the time when the problem occurred. Then
            check from those limited logs if there is a pattern or
            unexpected events (long operation, no operation, abandon, ...)

            best regards
            thierry

            On 10/7/24 7:37 PM, Gary Waters wrote:

                Hi Thierry,

                Ok I ll decrease the timeout to 15 seconds then.

                    Reducing the size of the logs will help.

                Which log and how do I do this ?

                Thanks Marc and Theirry!

                -Gary

                On 10/7/24 00:26, Thierry Bordaz wrote:

                    Hi,

                    Those slap_poll error means that the server was
                    unable to send back PDU to the client. It can
                    occur if the client sends a request and does not
                    read fast enough the results. The timeout is high
                    30s (30000), could it be that the problem is on
                    the client side (app) ?

                    I suggest that you focus on the timestamp when the
                    application reports a failure. Then look in the
                    access/error logs from 1-3min before and after the
                    time of the failure. Logconv from that limited
                    scope will be more helpful than a global one.

                    The pattern looks to be an app opens a connection,
                    switch to secure connection (start-tls), issue 6-8
                    SRCH then close. etime/wtime/optime looks fine but
                    as it is an average (over 1M op) it is not
                    helpful. Reducing the size of the logs will help.
                    I found interesting the abandon op as it is
                    possibly related to a performance issue.

                    best regards
                    thierry

                    On 10/4/24 11:54 PM, Gary Waters via 389-users wrote:

                        Hi Marc,

                        I have made nsslapd-listen-backlog-size to 512.

                        For the ioblocktimeout, I increased it because
                        of an error I was seeing:

                        [30/Sep/2024:16:26:55.987681019 -0700] - ERR -
                        slapd_poll - (743) - Timed out
                        [30/Sep/2024:16:34:49.646922635 -0700] - ERR -
                        slapd_poll - (568) - Timed out

                        Googling stated that I should increase the
                        ioblocktimeout. So I bumped it up from 20000
                        to 30000.

                        Since then, those slapd_poll timed out errors
                        have not occurred. Should I have changed
                        something else?

                        What should I increase these to?

                        net.core.somaxconn = 4096
                        net.ipv4.tcp_max_syn_backlog = 4096

                        Thanks so much for your help!

                        -Gary

                        On 10/4/24 11:55, Marc Sauton wrote:

                            tune up nsslapd-listen-backlog-size

                            and verify the net.core.somaxconn
                            and net.ipv4.tcp_max_syn_backlog are high
                            enough ( sysctl -a )

                            possibly tune down
                            the nsslapd-ioblocktimeout value

                            Thanks,

                            M.

                            On Fri, Oct 4, 2024 at 11:06 AM
                            gwaters-web--- via 389-users
                            <389-users@lists.fedoraproject.org> wrote:

                                Hello,

                                We are experiencing a new issue since
                                we upgraded from 389-ds-base from
                                1.4~ish to 2.0.15 on RHEL 8. I couldnt
                                figure how to fix it, so I
                                switched to RHEL9 and are on 2.4.5-9.

                                The issue is during a performance load
                                test of a web application. The
                                app logs into a website and does some
                                things that searches against ldap,
                                and does some transactions. This app
                                has been performing fine for years,
                                the app has changed so it could be
                                something there, but I am not sure
                                about that because of the percentage
                                of the traffic that is successful.

                                The errors for the web app are "Can't
                                contact Ldap Server" and sometimes
                                "Can't contact LDAP server. Start TLS
                                request accepted.Server willing to
                                negotiate SSL. (0xFFFF [-1])". Out of
                                the 128k connections below, these
                                errors will happen like 5 or 6 times,
                                so its wildly inconsistent and random.

                                I did a logconv analysis with 6 hours
                                of a day of testing, see below.
                                One thing that really stood out to me
                                was the peak concurrent
                                connections = 22.. That peak is so
                                low, I dont know how these errors are
                                happening.

                                I dont see any errors in the access
                                log ( grepping for err=1).
                                I looked for cache warnings/errors in
                                the access/errors logs, but didnt
                                find any. I dont see things like
                                unavailable connections in the access
                                logs.

                                Suggestions on what to change or look
                                for in the logs ?

                                Thanks,
                                Gary


                                information:
                                Machine Size: 16G of ram, 4 core AMD 
                                (its an EC2.m5.large, gp3 disk type)

                                kernel:
                                Linux  5.14.0-427.35.1.el9_4.x86_64 #1
                                SMP PREEMPT_DYNAMIC
                                packages:
                                389-ds-base-libs-2.4.5-9.el9_4.x86_64
                                389-ds-base-2.4.5-9.el9_4.x86_64

                                single instance of dirsrv running
                                dirsrv modifcations from default:

                                nsslapd-logging-backend: dirsrv-log,syslog
                                nsslapd-maxdescriptors: 8192
                                nsslapd-listen-backlog-size: 256
                                nsslapd-allow-hashed-passwords: on
                                nsslapd-idletimeout: 30
                                nsslapd-ioblocktimeout: 30000
                                nsslapd-sizelimit: -1
                                nsslapd-auditlog-logging-enabled: off
                                nsslapd-lookthroughlimit: -1

                                dirsrv.systemd:
                                limitNOFILE=8192

                                 >Total Log Lines Analszed: 2694287
                                 >
                                 >
                                 >
                                 > ---------- Access Log Output
                                ------------
                                 >
                                 > Start of Logs:
                                26/Sep/2024:10:07:32.089983378
                                 > End of Logs:
                                26/Sep/2024:15:54:29.895403688
                                 >
                                 > Processed Log Time:  5 Hours, 46
                                Minutes, 57.805426688 Seconds
                                 >
                                 > Restarts: 0
                                 > Secure Protocol Versions:
                                 >   - TLS1.2 128-bit AES-GCM (123117
                                connections)
                                 >
                                 > Peak Concurrent Connections: 22
                                 > Total Operations: 1097043
                                 > Total Results: 1097044
                                 > Overall Performance: 100.0%
                                 >
                                 > Total Connections: 128646       
                                (6.18/sec) (370.78/min)
                                 >  - LDAP Connections: 128646       
                                (6.18/sec) (370.78/min)
                                 >  - LDAPI Connections: 0            
                                (0.00/sec) (0.00/min)
                                 >  - LDAPS Connections: 0            
                                (0.00/sec) (0.00/min)
                                 >  - StartTLS Extended Ops:
                                123116        (5.91/sec) (354.84/min)
                                 >
                                 > Searches: 845279        (40.60/sec)
                                (2436.22/min)
                                 > Modifications: 0            
                                (0.00/sec) (0.00/min)
                                 > Adds: 0             (0.00/sec)
                                (0.00/min)
                                 > Deletes: 0             (0.00/sec)
                                (0.00/min)
                                 > Mod RDNs: 0             (0.00/sec)
                                (0.00/min)
                                 > Compares: 0             (0.00/sec)
                                (0.00/min)
                                 > Binds: 128647        (6.18/sec)
                                (370.78/min)
                                 >
                                 > Average wtime (wait time): 0.001560856
                                 > Average optime (op time): 0.003310453
                                 > Average etime (elapsed time):
                                0.004868040
                                 >
                                 > Multi-factor Authentications: 0
                                 > Proxied Auth Operations: 0
                                 > Persistent Searches: 0
                                 > Internal Operations: 0
                                 > Entry Operations: 0
                                 > Extended Operations: 123116
                                 > Abandoned Requests: 1
                                 > Smart Referrals Received: 0
                                 >
                                 > VLV Operations: 0
                                 > VLV Unindexed Searches: 0
                                 > VLV Unindexed Components: 0
                                 > SORT Operations: 0
                                 >
                                 > Entire Search Base Queries: 0
                                 > Paged Searches: 0
                                 > Unindexed Searches: 0
                                 > Unindexed Components: 0
                                 > Invalid Attribute Filters: 0
                                 > FDs Taken: 128646
                                 > FDs Returned: 129318
                                 > Highest FD Taken: 968
                                 >
                                 > Broken Pipes: 0
                                 > Connections Reset By Peer: 0
                                 > Resource Unavailable: 0
                                 > Max BER Size Exceeded: 0
                                 >
                                 > Binds: 128647
                                 > Unbinds: 119206
                                 > -------------------------------------
                                 >  - LDAP v2 Binds: 0
                                 >  - LDAP v3 Binds: 128647
                                 >  - AUTOBINDs(LDAPI): 0
                                 >  - SSL Client Binds: 0
                                 >  - Failed SSL Client Binds: 0
                                 >  - SASL Binds: 0
                                 >  - Dir

--_______________________________________________

                                389-users mailing list --
                                389-users@lists.fedoraproject.org
                                To unsubscribe send an email to
                                389-users-le...@lists.fedoraproject.org
                                Fedora Code of Conduct:
                                
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
                                
<https://urldefense.com/v3/__https:/docs.fedoraproject.org/en-US/project/code-of-conduct/__;!!JE2nyJ8!1qCobdltMGmX_XrFdPGoEKXfZdGVOi6A5VWjbWtmSJ7AwkqY-PydSmvJfs-4UYWfupW5of5u87v7mBMWQCBDEiq_mj_I$>
                                List Guidelines:
                                
https://fedoraproject.org/wiki/Mailing_list_guidelines
                                
<https://urldefense.com/v3/__https:/fedoraproject.org/wiki/Mailing_list_guidelines__;!!JE2nyJ8!1qCobdltMGmX_XrFdPGoEKXfZdGVOi6A5VWjbWtmSJ7AwkqY-PydSmvJfs-4UYWfupW5of5u87v7mBMWQCBDEsTfSxSj$>
                                List Archives:
                                
https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org
                                
<https://urldefense.com/v3/__https:/lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org__;!!JE2nyJ8!1qCobdltMGmX_XrFdPGoEKXfZdGVOi6A5VWjbWtmSJ7AwkqY-PydSmvJfs-4UYWfupW5of5u87v7mBMWQCBDEslJO53N$>
                                Do not reply to spam, report it:
                                
https://pagure.io/fedora-infrastructure/new_issue
                                
<https://urldefense.com/v3/__https:/pagure.io/fedora-infrastructure/new_issue__;!!JE2nyJ8!1qCobdltMGmX_XrFdPGoEKXfZdGVOi6A5VWjbWtmSJ7AwkqY-PydSmvJfs-4UYWfupW5of5u87v7mBMWQCBDEvlOa8tQ$>

-- 
_______________________________________________
389-users mailing list -- 389-users@lists.fedoraproject.org
To unsubscribe send an email to 389-users-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue

[389-users] Re: Inconsistent Ldap connection issues

Reply via email to