Re: [Freeipa-users] Freeipa 4.2.0 hangs intermittently

thierry bordaz Mon, 05 Sep 2016 00:39:46 -0700


Hi Rakesh,

Were you able to get a pstack or a full stack trace with gdb (http://www.port389.org/docs/389ds/FAQ/faq.html#debugging-crashes) when the server hangs?

If it happens with 500 threads as well as with 30, using 30 threads is a better choice for debugging this issue. I will try to reproduce it using 150 parallel 'ipa user-find p-testipa' commands.
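A minimal sketch of how 150 parallel 'ipa user-find p-testipa' commands could be launched, assuming Python 3 on a host that already holds a valid Kerberos ticket. The helper names `run_once` and `run_parallel` and the 60 second timeout are illustrative assumptions, not part of any FreeIPA tooling; only the command itself comes from this thread.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_once(cmd, timeout):
    """Run one command; return its exit code, or None if it timed out."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        return result.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising this.
        return None


def run_parallel(cmd, count, timeout=60):
    """Start `count` copies of `cmd` concurrently and collect the exit codes."""
    # One worker thread per command, so all copies are started together.
    with ThreadPoolExecutor(max_workers=count) as pool:
        futures = [pool.submit(run_once, cmd, timeout) for _ in range(count)]
        return [f.result() for f in futures]
```

Calling `run_parallel(["ipa", "user-find", "p-testipa"], count=150)` would start the 150 searches at once. A `None` entry in the result means that copy did not finish within the timeout, which is how a hang would probably look from the client side.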

One thing I am unsure about is whether the CPU consumption stays high (you mentioned 340% CPU usage) for as long as the hang lasts, or whether, after a sudden jump to 340% (which marks the beginning of the hang), it drops while the server stays hung.
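One way to answer the CPU question is to sample the process at fixed intervals while the hang is being reproduced. The following is only a sketch, assuming Linux and Python 3; the function names and the 5 second interval are illustrative assumptions. It reads the utime and stime counters (fields 14 and 15) from /proc/<pid>/stat, which cover all threads of the process.

```python
import os
import time


def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' before counting fields.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is the state (field 3), so utime/stime (fields 14/15)
    # are at indexes 11 and 12.
    return int(fields[11]) + int(fields[12])


def cpu_percent(ticks_before, ticks_after, seconds, ticks_per_second):
    """CPU usage over the interval; 100.0 means one fully busy core."""
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_second * seconds)


def sample(pid, interval=5.0, samples=12):
    """Print the CPU usage of `pid` once per `interval` seconds (Linux only)."""
    hz = os.sysconf("SC_CLK_TCK")
    path = "/proc/%d/stat" % pid
    with open(path) as f:
        before, t0 = cpu_ticks(f.read()), time.monotonic()
    for _ in range(samples):
        time.sleep(interval)
        with open(path) as f:
            after, t1 = cpu_ticks(f.read()), time.monotonic()
        usage = cpu_percent(before, after, t1 - t0, hz)
        print(time.strftime("%H:%M:%S"), "%.0f%%" % usage)
        before, t0 = after, t1
```

Running `sample(<PID of ns-slapd>)` during the test would show whether the usage stays near 340% for the whole hang or falls back after the initial spike.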


thanks
thierry

On 09/04/2016 08:40 PM, Rakesh Rajasekharan wrote:

strace on the slapd process actually had this in the output..
FUTEX_WAIT_PRIVATE

and checking for the number of threads slapd had.. there were 5015 threads

ps -efL|grep slapd|wc -l
5015

strace on most of the threads gave this output

strace -p 67411
Process 67411 attached

futex(0x7f3f0226b41c, FUTEX_WAIT_PRIVATE, 1, NULL) = -1 EAGAIN(Resource temporarily unavailable)

futex(0x7f3f0226b41c, FUTEX_WAIT_PRIVATE, 2, NULL^CProcess 67411 detached
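The thread count and the futex waits above can also be summarized from /proc without attaching strace to each thread. This is a sketch only, assuming Linux and Python 3; the function names are illustrative assumptions. Note that `ps -efL|grep slapd|wc -l` also counts the grep process itself, so it can slightly overstate the number. Threads blocked in futex(FUTEX_WAIT_PRIVATE) would normally show up in state S (sleeping), while threads burning CPU show up as R.

```python
import collections
import os


def thread_state(status_text):
    """Extract the one-letter state (R, S, D, ...) from /proc/.../status text."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            return line.split()[1]
    return "?"


def summarize(states):
    """Count how many threads are in each state."""
    return dict(collections.Counter(states))


def thread_states(pid):
    """Return {state: count} over every thread of `pid` (Linux only)."""
    task_dir = "/proc/%d/task" % pid
    states = []
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "status")) as f:
                states.append(thread_state(f.read()))
        except OSError:
            # The thread exited between listdir() and open(); skip it.
            continue
    return summarize(states)
```

The sum of the counts returned by `thread_states(<PID of ns-slapd>)` is the number of threads, and a result dominated by S with a handful of R threads would be consistent with most workers waiting on a lock while a few spin.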

On Sun, Sep 4, 2016 at 5:34 PM, Rakesh Rajasekharan <[email protected]> wrote:


    I have again got the issue of IPA hanging.. The issue came up when
    I tried to run ipa-client-install on 142 clients simultaneously


    None of the IPA commands are responding,  and I see this error

    ipa user-find p-testipa
    ipa: ERROR: Insufficient access: SASL(-1): generic failure: GSSAPI
    Error: Unspecified GSS failure.  Minor code may provide more
    information (KDC returned error string: PROCESS_TGS)

     KRB5_TRACE=/dev/stdout kinit admin
    [41178] 1472984115.233214: Getting initial credentials for [email protected]
    [41178] 1472984115.235257: Sending request (167 bytes) to XYZ.COM
    [41178] 1472984115.235419: Initiating TCP connection to stream 10.1.3.36:88
    [41178] 1472984115.235685: Sending TCP request to stream 10.1.3.36:88
    [41178] 1472984120.238914: Received answer (174 bytes) from stream 10.1.3.36:88
    [41178] 1472984120.238925: Terminating TCP connection to stream 10.1.3.36:88
    [41178] 1472984120.238993: Response was from master KDC
    [41


    Running an ldapsearch to see the db.. does not give any results
    and just hangs there

    ldapsearch -x -D 'cn=Directory Manager' -W -s one -b
    'cn=kerberos,dc=xyz,dc=com'
    Enter LDAP Password:

    Even a plain ldapsearch -x does not respond.
    At this point, I am sure that slapd is the one causing issues

    Running an strace against the hung slapd itself seems to get stuck;
    it does not proceed after saying "attaching to process"

    In some other posts, I read that Thierry suggested increasing the
    nsslapd-threadnumber value

    It was set to 30, I think that might be too low.

    I have raised it to  500

    Now after restarting the service .. ldapsearch starts responding.
    But running the test to add a sudden high number of clients again
    left ns-slapd in a hung state

    When i attempted adding the clients.. the ns-slapd cpu usage shot
    up to 340% and after that ns-slapd stopped responding

    So now, at least I know what might be causing the issue and I can
    now easily reproduce it.

    Is there a way I can make ns-slapd handle a sudden bump in
    incoming requests for ipa-client-install?

    Thanks
    Rakesh






    On Mon, Aug 29, 2016 at 11:18 PM, Rich Megginson
    <[email protected]> wrote:

        On 08/29/2016 10:53 AM, Rakesh Rajasekharan wrote:

        Hi Thierry,

        My machine has 30GB RAM ..and  389-ds version is 1.3.4

        ldapsearch shows the values for nsslapd-cachememsize updated
        to 200MB.

        ldapsearch -LLL -o ldif-wrap=no -D "cn=directory manager" -w
        'mypassword' -b 'cn=userRoot,cn=ldbm
        database,cn=plugins,cn=config'|grep nsslapd-cachememsize
        nsslapd-cachememsize: 209715200


        So, it seems to have updated though seeing that
        warning(WARNING: ipaca: entry cache size 10485760B is less
        than db size 11599872B) in the log confuses me a bit.

        There's one more entry that I found from the ldapsearch to be
        a bit low

        nsslapd-dncachememsize: 10485760
        maxdncachesize: 10485760

        Should I update these as well to a higher value?

        At the time when the issue happened, the memory usage as well
        as the overall load of the system was very low .
        I will try reproducing the issue at least in my QA
        env, probably by trying to mock simultaneous parallel logins
        to a large number of hosts


        To monitor your cache sizes, please use the dbmon.sh tool
        provided with your distro.  If that is not available with your
        particular distro, see
        https://github.com/richm/scripts/wiki/dbmon.sh



        thanks
        Rakesh




        On Mon, Aug 29, 2016 at 8:16 PM, thierry bordaz
        <[email protected]> wrote:

            Hi Rakesh,

            Those tunings may depend on the memory available on your
            machine.
            nsslapd-cachememsize allows the entry cache to consume up
            to 200MB, but its actual memory footprint is known to go
            above that. 200MB for both looks pretty good to me. How
            large is your machine? What is your version of 389-ds?

            Those warnings do not change your settings. They just
            report that the entry caches of 'ipaca' and 'retrocl' are
            small, but that is fine. The size of the entry cache is
            important mostly in userRoot.
            You may double check the actual values, after restart,
            with ldapsearch on 'cn=userRoot,cn=ldbm
            database,cn=plugins,cn=config' and 'cn=config,cn=ldbm
            database,cn=plugins,cn=config'.

            A first step is to measure the response time of DS, to
            know whether it is responsible for the hang or not.
            The logs, and possibly a pstack taken during those
            intermittent hangs, will help to determine that.

            regards
            thierry





            On 08/29/2016 04:25 PM, Rakesh Rajasekharan wrote:

            I tried increasing the nsslapd-dbcachesize and
            nsslapd-cachememsize in my QA envs to 200MB.

            However, in my log files, I still see this message
            [29/Aug/2016:04:34:37 +0000] - WARNING: ipaca: entry
            cache size 10485760B is less than db size 11599872B; We
            recommend to increase the entry cache size
            nsslapd-cachememsize.
            [29/Aug/2016:04:34:37 +0000] - WARNING: changelog: entry
            cache size 2097152B is less than db size 441647104B; We
            recommend to increase the entry cache size
            nsslapd-cachememsize.

            these are my ldif files that i used to modify the values
            modify entry cache size
            cat modify-cache-mem-size.ldif
            dn: cn=userRoot,cn=ldbm database,cn=plugins,cn=config
            changetype: modify
            replace: nsslapd-cachememsize
            nsslapd-cachememsize: 209715200

            modify db cache size
            cat modfy-db-cache-size.ldif
            dn: cn=config,cn=ldbm database,cn=plugins,cn=config
            changetype: modify
            replace: nsslapd-dbcachesize
            nsslapd-dbcachesize: 209715200

            After modifying, I restarted the IPA services

            Is there anything else that I need to take care of, as
            the logs suggest it's still not getting the updated values?

            Thanks
            Rakesh

            On Mon, Aug 29, 2016 at 6:07 PM, Rakesh Rajasekharan
            <[email protected]> wrote:

                Hi Thierry,

                Because of the issues, we had to revert to the OpenLDAP
                setup we were running earlier in production.

                I have now done a few TCP related changes in
                sysctl.conf and have also increased the
                nsslapd-dbcachesize and nsslapd-cachememsize to 200MB

                I will again start migrating hosts back to IPA and
                see if I face the earlier issue.

                I will update back once I have something


                Thanks,
                Rakesh



                On Thu, Aug 25, 2016 at 2:17 PM, thierry bordaz
                <[email protected]> wrote:



                    On 08/25/2016 10:15 AM, Rakesh Rajasekharan wrote:

                    All of the troubleshooting seems fine.


                    However, running logconv.pl
                    gives me this output

                    ----- Recommendations -----

                     1.  You have unindexed components, this can be
                    caused from a search on an unindexed attribute,
                    or your returned results exceeded the
                    allidsthreshold. Unindexed components are not
                    recommended. To refuse unindexed searches,
                    switch 'nsslapd-require-index' to 'on' under
                    your database entry (e.g. cn=UserRoot,cn=ldbm
                    database,cn=plugins,cn=config).

                     2.  You have a significant difference between
                    binds and unbinds. You may want to investigate
                    this difference.


                    I feel, this could be a pointer to things going
                    slow.. and IPA hanging. I think i now have
                    something that I can try and nail down this issue.

                    On a sidenote, I was earlier running openldap
                    and migrated over to Freeipa,

                    Thanks
                    Rakesh



                    On Wed, Aug 24, 2016 at 12:38 PM, Petr Spacek
                    <[email protected]> wrote:

                        On 23.8.2016 18:44, Rakesh Rajasekharan wrote:
                        > I think there's something seriously wrong with my system
                        >
                        > not able to run any IPA commands
                        >
                        > klist
                        > Ticket cache: KEYRING:persistent:0:0
                        > Default principal: [email protected]
                        >
                        > Valid starting  Expires Service principal
                        > 2016-08-23T16:26:36 2016-08-24T16:26:22 krbtgt/[email protected]
                        >
                        >
                        > [root@prod-ipa-master-1a :~] ipactl status
                        > Directory Service: RUNNING
                        > krb5kdc Service: RUNNING
                        > kadmin Service: RUNNING
                        > ipa_memcached Service: RUNNING
                        > httpd Service: RUNNING
                        > pki-tomcatd Service: RUNNING
                        > ipa-otpd Service: RUNNING
                        > ipa: INFO: The ipactl command was successful
                        >
                        >
                        >
                        > [root@prod-ipa-master :~] ipa user-find p-testuser
                        > ipa: ERROR: Kerberos error: ('Unspecified GSS failure. Minor code may
                        > provide more information', 851968)/("Cannot contact any KDC for realm
                        > 'XYZ.COM'", -1765328228)


                    Hi Rakesh,

                        Since you have a reproducible test case, would
                        you rerun the command above?
                        While it is being processed, you may monitor the
                        DS process load (top). If it is high, you may
                        take some pstacks of it.
                        Also, would you attach the part of the DS access
                        logs taken during the command?

                        regards
                        thierry

                        >

                        This is weird because the server seems to
                        be up.

                        Please follow
                        http://www.freeipa.org/page/Troubleshooting#Authentication.2FKerberos

                        Petr^2 Spacek

                        >
                        >
                        > Thanks
                        >
                        > Rakesh
                        >
                        > On Tue, Aug 23, 2016 at 10:01 PM, Rakesh Rajasekharan <[email protected]> wrote:
                        >
                        >> I changed the logging level to 4, modifying nsslapd-accesslog-level
                        >>
                        >> But the hang is still there, though I don't see the segfault now
                        >>
                        >>
                        >>
                        >>
                        >> On Tue, Aug 23, 2016 at 9:02 PM, Rakesh Rajasekharan <[email protected]> wrote:
                        >>
                        >>> My disk was getting filled too fast
                        >>>
                        >>> logs under /var/log/dirsrv were reaching around 5 GB, quickly filling up the disk
                        >>>
                        >>> Is there a way to make the logging less verbose?
                        >>>
                        >>>
                        >>>
                        >>> On Tue, Aug 23, 2016 at 6:41 PM, Petr Spacek <[email protected]> wrote:
                        >>>
                        >>>> On 23.8.2016 15:07, Rakesh Rajasekharan wrote:
                        >>>>> I was able to fix that, maybe temporarily... when I checked the network,
                        >>>>> there was another process running and consuming a lot of network bandwidth
                        >>>>> (I have no idea who did that. I need to seriously start restricting
                        >>>>> people's access to this machine)
                        >>>>>
                        >>>>> after killing that, performance improved drastically
                        >>>>>
                        >>>>> But now, suddenly I started experiencing the same hang.
                        >>>>>
                        >>>>> This time, I get the following error when I check dmesg
                        >>>>>
                        >>>>> [  301.236976] ns-slapd[3124]: segfault at 0 ip 00007f1de416951c sp 00007f1dee1dba70 error 4 in libcos-plugin.so[7f1de4166000+b000]
                        >>>>> [ 1116.248431] TCP: request_sock_TCP: Possible SYN flooding on port 88. Sending cookies. Check SNMP counters.
                        >>>>> [11831.397037] ns-slapd[22550]: segfault at 0 ip 00007f533d82251c sp 00007f5347894a70 error 4 in libcos-plugin.so[7f533d81f000+b000]
                        >>>>> [11832.727989] ns-slapd[22606]: segfault at 0 ip 00007f6231eb951c sp 00007f623bf2ba70 error 4 in libcos-plugin.so[7f6231eb6000+b00
                        >>>>
                        >>>> Okay, this one is serious. The LDAP
                        server crashed.
                        >>>>
                        >>>> 1. Make sure all your packages are
                        up-to-date.
                        >>>>
                        >>>> Please see
                        >>>> http://directory.fedoraproject.org/docs/389ds/FAQ/faq.html#debugging-crashes
                        >>>> for further instructions on how to debug this.
                        >>>>
                        >>>> Petr^2 Spacek
                        >>>>
                        >>>>>
                        >>>>> and in /var/log/dirsrv/example-com/errors
                        >>>>>
                        >>>>> [23/Aug/2016:12:49:36 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 3291138 (rc: 32)
                        >>>>> [23/Aug/2016:12:49:36 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 3291139 (rc: 32)
                        >>>>> [23/Aug/2016:12:49:36 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 3291140 (rc: 32)
                        >>>>> [23/Aug/2016:12:49:36 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 3291141 (rc: 32)
                        >>>>> [23/Aug/2016:12:49:36 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 3291142 (rc: 32)
                        >>>>> [23/Aug/2016:12:49:36 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 3291143 (rc: 32)
                        >>>>> [23/Aug/2016:12:49:36 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 3291144 (rc: 32)
                        >>>>> [23/Aug/2016:12:49:36 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 3291145 (rc: 32)
                        >>>>> [23/Aug/2016:12:49:50 +0000] - Retry count exceeded in delete
                        >>>>> [23/Aug/2016:12:49:50 +0000] DSRetroclPlugin - delete_changerecord: could not delete change record 3292734 (rc: 51)
                        >>>>>
                        >>>>>
                        >>>>> Can I do something about this error? I tried to restart ipa a couple
                        >>>>> of times but that did not help
                        >>>>>
                        >>>>> Thanks
                        >>>>> Rakesh
                        >>>>>
                        >>>>> On Mon, Aug 22, 2016 at 2:27 PM, Petr Spacek <[email protected]> wrote:
                        >>>>>
                        >>>>>> On 19.8.2016 19:32, Rakesh Rajasekharan wrote:
                        >>>>>>> I am running my setup on AWS cloud, and entropy is low at around 180.
                        >>>>>>>
                        >>>>>>> I plan to increase it by installing haveged. But would low entropy by any
                        >>>>>>> chance cause this issue of intermittent hangs?
                        >>>>>>> Also, the hang is mostly observed when registering around 20 clients
                        >>>>>>> together
                        >>>>>>
                        >>>>>> Possibly, I'm not sure. If you want to dig into this, I would do this:
                        >>>>>> 1. look what process hangs on client (using pstree command or so)
                        >>>>>> $ pstree
                        >>>>>>
                        >>>>>> 2. look to what server and port is the hanging client connected to
                        >>>>>> $ lsof -p <PID of the hanging process>
                        >>>>>>
                        >>>>>> 3. jump to server and see what process is bound to the target port
                        >>>>>> $ netstat -pn
                        >>>>>>
                        >>>>>> 4. see where the process is hanging
                        >>>>>> $ strace -p <PID of the hanging process>
                        >>>>>>
                        >>>>>> I hope it helps.
                        >>>>>>
                        >>>>>> Petr^2 Spacek
                        >>>>>>
                        >>>>>>> On Fri, Aug 19, 2016 at 7:24 PM, Rakesh Rajasekharan <[email protected]> wrote:
                        >>>>>>>
                        >>>>>>>> yes there seems to be something that's worrying.. I have faced this today
                        >>>>>>>> as well.
                        >>>>>>>> There are a few hosts, around 280 odd, left and when I try adding them to IPA,
                        >>>>>>>> the slowness begins..
                        >>>>>>>>
                        >>>>>>>> all the ipa commands like ipa user-find.. etc become very slow in
                        >>>>>>>> responding.
                        >>>>>>>>
                        >>>>>>>> the SYN_RECV connections are not many though, just around 80-90, and today that was
                        >>>>>>>> around 20 only
                        >>>>>>>>
                        >>>>>>>>
                        >>>>>>>> I have for now increased tcp_max_syn_backlog to 5000.
                        >>>>>>>> For now the slowness seems to have gone.. but I will try adding the
                        >>>>>>>> clients again tomorrow and see how it goes
                        >>>>>>>>
                        >>>>>>>> Thanks
                        >>>>>>>> Rakesh
                        >>>>>>>>
                        >>>>>>>> The issues
                        >>>>>>>>
                        >>>>>>>> On Fri, Aug 19, 2016 at 12:58 PM, Petr Spacek <[email protected]> wrote:
                        >>>>>>>>
                        >>>>>>>>> On 18.8.2016 17:23, Rakesh Rajasekharan wrote:
                        >>>>>>>>>> Hi
                        >>>>>>>>>>
                        >>>>>>>>>> I am migrating to freeipa from openldap and have around 4000 clients
                        >>>>>>>>>>
                        >>>>>>>>>> I had opened another thread on that, but chose to start a new one here
                        >>>>>>>>>> as it's a separate issue
                        >>>>>>>>>>
                        >>>>>>>>>> I was able to change the nsslapd-maxdescriptors by adding an ldif file
                        >>>>>>>>>>
                        >>>>>>>>>> cat nsslapd-modify.ldif
                        >>>>>>>>>> dn: cn=config
                        >>>>>>>>>> changetype: modify
                        >>>>>>>>>> replace: nsslapd-maxdescriptors
                        >>>>>>>>>> nsslapd-maxdescriptors: 17000
                        >>>>>>>>>>
                        >>>>>>>>>> and running the ldapmodify command
                        >>>>>>>>>>
                        >>>>>>>>>> I have now started moving clients running openldap to Freeipa and
                        >>>>>>>>>> have today moved close to 2000 clients
                        >>>>>>>>>>
                        >>>>>>>>>> However, I have noticed that IPA
                        hangs intermittently.
                        >>>>>>>>>>
                        >>>>>>>>>> running a kinit admin returns the below error
                        >>>>>>>>>> kinit: Generic error (see e-text) while getting initial credentials
                        >>>>>>>>>>
                        >>>>>>>>>> from the /var/log/messages, I see this entry
                        >>>>>>>>>>
                        >>>>>>>>>> prod-ipa-master-int kernel: [104090.315801] TCP: request_sock_TCP: Possible SYN flooding on port 88. Sending cookies. Check SNMP counters.
                        >>>>>>>>>
                        >>>>>>>>> I would be worried about this message. Maybe kernel/firewall is doing
                        >>>>>>>>> something fishy behind your back and blocking some connections or so.
                        >>>>>>>>>
                        >>>>>>>>> Petr^2 Spacek
                        >>>>>>>>>
                        >>>>>>>>>
                        >>>>>>>>>> Aug 18 13:00:01 prod-ipa-master-int systemd[1]: Started Session 4885 of user root.
                        >>>>>>>>>> Aug 18 13:00:01 prod-ipa-master-int systemd[1]: Starting Session 4885 of user root.
                        >>>>>>>>>> Aug 18 13:01:01 prod-ipa-master-int systemd[1]: Started Session 4886 of user root.
                        >>>>>>>>>> Aug 18 13:01:01 prod-ipa-master-int systemd[1]: Starting Session 4886 of user root.
                        >>>>>>>>>> Aug 18 13:02:40 prod-ipa-master-int python[28984]: ansible-command Invoked with creates=None executable=None shell=True args= removes=None warn=True chdir=None
                        >>>>>>>>>> Aug 18 13:04:37 prod-ipa-master-int sssd_be: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (KDC returned error string: PROCESS_TGS)
                        >>>>>>>>>>
                        >>>>>>>>>> Could it be possible that it's due to the initial load of adding the clients
                        >>>>>>>>>> or is there something else that I need to take care of?



        --
        Manage your subscription for the Freeipa-users mailing list:
        https://www.redhat.com/mailman/listinfo/freeipa-users
        <https://www.redhat.com/mailman/listinfo/freeipa-users>
        Go to http://freeipa.org for more info on the project
