Hi list,
Just closing the loop on this one.
This issue finally got resolved for us after installing the latest FreeIPA
update available for CentOS 7:

OS version: CentOS Linux release 7.4.1708 (Core)

ipa-server-trust-ad-4.5.0-22.el7.centos.x86_64
ipa-common-4.5.0-22.el7.centos.noarch
sssd-ipa-1.15.2-50.el7_4.8.x86_64
ipa-client-4.5.0-22.el7.centos.x86_64
ipa-server-4.5.0-22.el7.centos.x86_64
ipa-client-common-4.5.0-22.el7.centos.noarch
ipa-server-dns-4.5.0-22.el7.centos.noarch
ipa-python-compat-4.5.0-22.el7.centos.noarch
ipa-server-common-4.5.0-22.el7.centos.noarch

389-ds-base-libs-1.3.6.1-24.el7_4.x86_64
389-ds-base-snmp-1.3.6.1-24.el7_4.x86_64
389-ds-base-1.3.6.1-24.el7_4.x86_64

pki-ca-10.4.1-17.el7_4.noarch
pki-tools-10.4.1-17.el7_4.x86_64
pki-server-10.4.1-17.el7_4.noarch
pki-kra-10.4.1-17.el7_4.noarch
pki-base-10.4.1-17.el7_4.noarch
pki-base-java-10.4.1-17.el7_4.noarch
pki-symkey-10.4.1-17.el7_4.x86_64

Many thanks for such a great product!

Guillermo


On Mon, Jul 18, 2016 at 2:37 PM, Guillermo Fuentes
<guillermo.fuentes@modernizingmedicine.com> wrote:

> Hi all,
>
> Did any ipa/sssd developer have a chance to take a look at this issue?
>
> Updating to the latest version available for CentOS 7 didn't fix it:
> ipa-debuginfo-4.2.0-15.0.1.el7_2.6.1.x86_64
> ipa-python-4.2.0-15.0.1.el7.centos.17.x86_64
> ipa-server-dns-4.2.0-15.0.1.el7.centos.17.x86_64
> sssd-ipa-1.13.0-40.el7_2.9.x86_64
> python-libipa_hbac-1.13.0-40.el7_2.9.x86_64
> ipa-admintools-4.2.0-15.0.1.el7.centos.17.x86_64
> ipa-server-4.2.0-15.0.1.el7.centos.17.x86_64
> libipa_hbac-1.13.0-40.el7_2.9.x86_64
> ipa-server-trust-ad-4.2.0-15.0.1.el7.centos.17.x86_64
> ipa-client-4.2.0-15.0.1.el7.centos.17.x86_64
>
> 389-ds-base-libs-1.3.4.0-32.el7_2.x86_64
> 389-ds-base-1.3.4.0-32.el7_2.x86_64
> 389-ds-base-debuginfo-1.3.4.0-30.el7_2.x86_64
>
>
> Please let me know if you need more information or how I can help to get
> it fixed.
>
> Thanks so much,
> Guillermo
>
>
> On Mon, Jun 13, 2016 at 6:30 PM, Rich Megginson <rmegg...@redhat.com>
> wrote:
>
>> On 06/13/2016 01:13 PM, Guillermo Fuentes wrote:
>>
>>> Hi Rich,
>>>
>>> After I started running the stack traces, the problem hasn't happened as
>>> frequently as it used to, but today I was able to capture some. As they
>>> aren't similar, I'll send them over to you in a separate email.
>>>
>>> This is what I did to start capturing the stack traces (CentOS 7):
>>> # yum install -y --enablerepo=base-debuginfo 389-ds-base-debuginfo
>>> ipa-debuginfo slapi-nis-debuginfo nspr-debuginfo
>>> # yum install -y gdb
>>> # systemctl stop ipa.service ; sleep 10; systemctl start ipa.service
>>> # mkdir -p /var/log/stacktraces
>>>
>>> Set up a crontab entry to run the following every minute:
>>> gdb -ex 'set confirm off' -ex 'set pagination off' -ex 'thread apply
>>> all bt full' -ex 'quit' /usr/sbin/ns-slapd `pidof ns-slapd` >
>>> /var/log/stacktraces/stacktrace.`date +%s`.txt 2>&1
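For reference, the cron entry itself can be sketched as below. This is a hypothetical /etc/cron.d layout, not the exact file used above; the one non-obvious detail is that `%` is special in crontab lines and must be escaped as `\%`, otherwise `date +%s` is silently cut off:

```shell
# Hypothetical /etc/cron.d/ns-slapd-stacktrace entry (assumes root cron and
# the debuginfo packages installed above). In crontab syntax a literal "%"
# must be written as "\%", or cron treats it as a newline.
CRON_LINE='* * * * * root gdb -ex "set confirm off" -ex "set pagination off" -ex "thread apply all bt full" -ex quit /usr/sbin/ns-slapd $(pidof ns-slapd) > /var/log/stacktraces/stacktrace.$(date +\%s).txt 2>&1'
echo "$CRON_LINE"
```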
>>>
>>
>> It looks similar to https://fedorahosted.org/389/ticket/48341 but you
>> already have that fix.
>>
>> One of the problems is that ids_sasl_check_bind acquires the connection
>> lock and holds it for a very long time, which causes the main loop to
>> block on that connection. This is similar to the above problem, and also
>> to https://fedorahosted.org/389/ticket/48882.  Basically, anything that
>> holds the connection c_mutex lock for too long can hang the server.  In
>> your case, this stack trace:
>>
>> poll sss_cli_make_request_nochecks sss_cli_check_socket
>> sss_pac_make_request sssdpac_verify krb5int_authdata_verify
>> rd_req_decoded_opt krb5_rd_req_decoded kg_accept_krb5
>> krb5_gss_accept_sec_context_ext krb5_gss_accept_sec_context
>> gss_accept_sec_context gssapi_server_mech_step sasl_server_step
>> sasl_server_start ids_sasl_check_bind do_bind connection_dispatch_operation
>> _pt_root start_thread clone
>>
>> I'm not sure if this particular situation is known/fixed.  Perhaps there
>> is a way to make the poll() called by sss_cli_make_request_nochecks()
>> have a smaller timeout?
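The failure mode described above can be illustrated with a minimal Python sketch (not FreeIPA code: the lock merely stands in for the per-connection c_mutex, and the sleep stands in for the long poll() inside sss_cli_make_request_nochecks). A worker thread that blocks while holding the connection lock stalls any other thread, such as the main event loop, that needs the same lock:

```python
import threading
import time

conn_lock = threading.Lock()  # stands in for the per-connection c_mutex

def sasl_bind_worker():
    # Simulates ids_sasl_check_bind: grabs the connection lock, then
    # blocks (as if in a long poll() waiting on the sssd socket).
    with conn_lock:
        time.sleep(0.5)

t = threading.Thread(target=sasl_bind_worker)
t.start()
time.sleep(0.1)  # let the worker grab the lock first

# The "main loop" now needs the same lock to service the connection.
start = time.monotonic()
conn_lock.acquire()  # stalls until the worker's "poll()" returns
stalled = time.monotonic() - start
conn_lock.release()
t.join()
print(f"main loop stalled for ~{stalled:.1f}s")
```

A shorter poll() timeout shrinks the worst-case hold time of the lock, which is exactly why the question above about reducing it matters.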
>>
>> Does this look familiar to any ipa/sssd developer?
>>
>>
>>
>>> Thank you so much for your help,
>>>
>>> Guillermo
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Jun 1, 2016 at 6:52 PM, Guillermo Fuentes
>>> <guillermo.fuen...@modernizingmedicine.com> wrote:
>>>
>>>> I'm now taking stack traces every minute and waiting for it to hang
>>>> again so I can check them. It usually happens under load but it's
>>>> unpredictable. Most likely tomorrow.
>>>> GUILLERMO FUENTES
>>>> SR. SYSTEMS ADMINISTRATOR
>>>>
>>>> 561-880-2998 x1337
>>>>
>>>> guillermo.fuen...@modmed.com
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Jun 1, 2016 at 2:03 PM, Rich Megginson <rmegg...@redhat.com>
>>>> wrote:
>>>>
>>>>> On 06/01/2016 10:37 AM, Guillermo Fuentes wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We are experiencing an issue similar to the one discussed in the
>>>>>> following thread, but we are running FreeIPA 4.2 on CentOS 7.2:
>>>>>> https://www.redhat.com/archives/freeipa-users/2015-February/msg00205.html
>>>>>>
>>>>>
>>>>> Are your stack traces similar?
>>>>>
>>>>>
>>>>>> The LDAP service stops responding to queries (hangs). LDAP connections
>>>>>> on the server sometimes climb to 10 times the normal amount, and load
>>>>>> drops to 0. Then the connections start to drop until they return to a
>>>>>> normal level and the LDAP service starts responding to queries again.
>>>>>> The whole episode lasts between 3 and 5 minutes:
>>>>>>
>>>>>> Time,LDAP conn, Opened files(ns-slapd), File
>>>>>> Desc(ns-slapd),Threads(ns-slapd),Load1,Load5,Load15
>>>>>> 8:54:03,101,353,216,142,0.43,0.20,0.16
>>>>>> 8:55:02,108,359,221,142,0.19,0.18,0.15
>>>>>> 8:56:03,110,361,224,142,0.07,0.15,0.14
>>>>>> 8:57:14,117,383,246,142,0.15,0.16,0.15
>>>>>> 8:58:04,276,371,234,142,0.05,0.13,0.14
>>>>>> 8:59:05,469,371,234,142,0.02,0.11,0.13
>>>>>> 9:00:08,719,371,234,142,0.01,0.09,0.12
>>>>>> 9:01:18,1060,371,234,142,0.00,0.07,0.12
>>>>>> 9:02:10,742,371,233,142,0.10,0.09,0.12
>>>>>> 9:03:06,365,372,235,142,0.13,0.10,0.13
>>>>>> 9:04:04,262,379,242,142,0.87,0.29,0.19
>>>>>> 9:05:02,129,371,233,142,0.51,0.31,0.20
>>>>>> 9:06:03,126,377,240,142,0.42,0.33,0.22
>>>>>> 9:07:03,125,377,238,142,0.17,0.27,0.21
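A sampler producing one CSV row in the same shape as the table above can be sketched as follows. This is a hypothetical script, not the collector actually used: it assumes Linux /proc, omits the connection-count column (which would need ss/netstat against port 389), and falls back to the current shell's PID so the sketch stays runnable even where ns-slapd is absent:

```shell
# Hypothetical one-row sampler in the spirit of the table above.
# Falls back to the current shell's PID when ns-slapd is not running.
PID=$(pidof ns-slapd 2>/dev/null || echo $$)
TS=$(date +%H:%M:%S)
FDS=$(ls /proc/"$PID"/fd 2>/dev/null | wc -l)        # open file descriptors
THREADS=$(ls /proc/"$PID"/task 2>/dev/null | wc -l)  # thread count
LOAD=$(cut -d' ' -f1-3 /proc/loadavg | tr ' ' ',')   # Load1,Load5,Load15
echo "$TS,$FDS,$THREADS,$LOAD"
```

Run from cron every minute and appended to a file, this yields a time series like the one quoted above.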
>>>>>>
>>>>>> Nothing is logged in the errors log file of the server having the
>>>>>> problem (ipa1 in this example).
>>>>>> On the replicas, this is logged:
>>>>>> 8:59:05 -0400] NSMMReplicationPlugin - agmt="cn=meToipa1.example.com"
>>>>>> (ipa1:389): Unable to receive the response for a startReplication
>>>>>> extended operation to consumer (Timed out). Will retry later.
>>>>>> 9:01:05 -0400] NSMMReplicationPlugin - agmt="cn=meToipa1.example.com"
>>>>>> (ipa1:389): Unable to receive the response for a startReplication
>>>>>> extended operation to consumer (Timed out). Will retry later.
>>>>>>
>>>>>> Nothing is logged in the access log file until after ns-slapd starts
>>>>>> responding again:
>>>>>> ...
>>>>>> 8:57:00 -0400] conn=12384 fd=234 slot=234 connection from 172.20.0.1
>>>>>> to 172.20.2.45
>>>>>> 8:57:00 -0400] conn=12385 fd=235 slot=235 connection from 172.20.0.1
>>>>>> to 172.20.2.45
>>>>>> 8:57:00 -0400] conn=12386 fd=236 slot=236 connection from 172.20.0.1
>>>>>> to 172.20.2.45
>>>>>> 8:57:00 -0400] conn=12387 fd=237 slot=237 connection from 172.20.0.1
>>>>>> to 172.20.2.45
>>>>>> 8:57:00 -0400] conn=10384 op=1227 EXT oid="2.16.840.1.113730.3.5.12"
>>>>>> name="replication-multimaster-extop"
>>>>>> 8:57:00 -0400] conn=12324 op=8 RESULT err=0 tag=101 nentries=1 etime=0
>>>>>> 8:57:00 -0400] conn=8838 op=2545 EXT oid="2.16.840.1.113730.3.5.12"
>>>>>> name="replication-multimaster-extop"
>>>>>> 8:57:00 -0400] conn=8838 op=2545 RESULT err=0 tag=120 nentries=0
>>>>>> etime=0
>>>>>> 8:57:00 -0400] conn=10384 op=1227 RESULT err=0 tag=120 nentries=0
>>>>>> etime=0
>>>>>> 8:57:00 -0400] conn=12382 op=-1 fd=170 closed - B1
>>>>>> 8:57:00 -0400] conn=12383 op=0 SRCH base="" scope=0
>>>>>> filter="(objectClass=*)" attrs="supportedSASLMechanisms
>>>>>> defaultnamingcontext namingContexts schemanamingcontext saslrealm"
>>>>>> 8:57:00 -0400] conn=12384 op=-1 fd=234 closed - B1
>>>>>> 8:57:00 -0400] conn=12385 op=0 SRCH base="" scope=0
>>>>>> filter="(objectClass=*)" attrs="supportedSASLMechanisms
>>>>>> defaultnamingcontext namingContexts schemanamingcontext saslrealm"
>>>>>> 8:57:00 -0400] conn=12383 op=0 RESULT err=0 tag=101 nentries=1 etime=0
>>>>>> 8:57:00 -0400] conn=12386 op=-1 fd=236 closed - B1
>>>>>> 8:57:00 -0400] conn=12385 op=0 RESULT err=0 tag=101 nentries=1 etime=0
>>>>>> 8:57:00 -0400] conn=12387 op=0 SRCH base="" scope=0
>>>>>> filter="(objectClass=*)" attrs="supportedSASLMechanisms
>>>>>> defaultnamingcontext namingContexts schemanamingcontext saslrealm"
>>>>>> 8:57:00 -0400] conn=12387 op=0 RESULT err=0 tag=101 nentries=1 etime=0
>>>>>> 8:57:00 -0400] conn=12385 op=1 BIND dn="" method=sasl version=3
>>>>>> mech=GSSAPI
>>>>>> 8:57:00 -0400] conn=12387 op=1 BIND dn="" method=sasl version=3
>>>>>> mech=GSSAPI
>>>>>> 8:57:00 -0400] conn=12385 op=1 RESULT err=14 tag=97 nentries=0
>>>>>> etime=0, SASL bind in progress
>>>>>> 8:57:00 -0400] conn=12383 op=1 BIND dn="" method=sasl version=3
>>>>>> mech=GSSAPI
>>>>>> 8:57:00 -0400] conn=10384 op=1228 EXT oid="2.16.840.1.113730.3.5.5"
>>>>>> name="Netscape Replication End Session"
>>>>>> 8:57:00 -0400] conn=10384 op=1228 RESULT err=0 tag=120 nentries=0
>>>>>> etime=0
>>>>>> 8:57:00 -0400] conn=12383 op=1 RESULT err=14 tag=97 nentries=0
>>>>>> etime=0, SASL bind in progress
>>>>>> 9:02:00 -0400] conn=12388 fd=170 slot=170 connection from 172.20.0.1
>>>>>> to 172.20.2.45
>>>>>> 9:02:00 -0400] conn=12389 fd=234 slot=234 SSL connection from
>>>>>> 172.20.0.24 to 172.20.2.45
>>>>>> 9:02:00 -0400] conn=12390 fd=236 slot=236 connection from local to
>>>>>> /var/run/slapd-EXAMPLE-COM.socket
>>>>>> 9:02:00 -0400] conn=12391 fd=238 slot=238 connection from 172.20.0.1
>>>>>> to 172.20.2.45
>>>>>> 9:02:00 -0400] conn=12392 fd=239 slot=239 SSL connection from
>>>>>> 172.20.0.24 to 172.20.2.45
>>>>>> 9:02:00 -0400] conn=12393 fd=240 slot=240 connection from local to
>>>>>> /var/run/slapd-EXAMPLE-COM.socket
>>>>>> 9:02:00 -0400] conn=12394 fd=241 slot=241 connection from 172.20.0.1
>>>>>> to 172.20.2.45
>>>>>> 9:02:00 -0400] conn=12395 fd=242 slot=242 SSL connection from
>>>>>> 172.20.0.24 to 172.20.2.45
>>>>>> 9:02:00 -0400] conn=12396 fd=243 slot=243 connection from 172.20.0.1
>>>>>> to 172.20.2.45
>>>>>> 9:02:00 -0400] conn=12397 fd=244 slot=244 SSL connection from
>>>>>> 172.20.0.24 to 172.20.2.45
>>>>>> 9:02:00 -0400] conn=12398 fd=245 slot=245 connection from 172.20.0.1
>>>>>> to 172.20.2.45
>>>>>> 9:02:00 -0400] conn=12400 fd=247 slot=247 connection from 172.20.0.1
>>>>>> to 172.20.2.45
>>>>>> 9:02:00 -0400] conn=12401 fd=248 slot=248 connection from 172.20.0.1
>>>>>> to 172.20.2.45
>>>>>> ...
>>>>>> 9:02:00 -0400] conn=12390 op=0 BIND dn="" method=sasl version=3
>>>>>> mech=GSSAPI
>>>>>> 9:02:00 -0400] conn=12388 op=-1 fd=170 closed - B1
>>>>>> 9:02:00 -0400] conn=12393 op=0 BIND dn="" method=sasl version=3
>>>>>> mech=GSSAPI
>>>>>> 9:02:00 -0400] conn=12391 op=0 SRCH base="" scope=0
>>>>>> filter="(objectClass=*)" attrs="supportedSASLMechanisms
>>>>>> defaultnamingcontext namingContexts schemanamingcontext saslrealm"
>>>>>> 9:02:00 -0400] conn=12394 op=-1 fd=241 closed - B1
>>>>>> 9:02:00 -0400] conn=12391 op=0 RESULT err=0 tag=101 nentries=1 etime=0
>>>>>> 9:02:00 -0400] conn=12396 op=0 SRCH base="" scope=0
>>>>>> filter="(objectClass=*)" attrs="supportedSASLMechanisms
>>>>>> defaultnamingcontext namingContexts schemanamingcontext saslrealm"
>>>>>> 9:02:00 -0400] conn=12396 op=0 RESULT err=0 tag=101 nentries=1 etime=0
>>>>>> 9:02:00 -0400] conn=12398 op=-1 fd=245 closed - B1
>>>>>> 9:02:00 -0400] conn=12400 op=0 SRCH base="" scope=0
>>>>>> filter="(objectClass=*)" attrs="supportedSASLMechanisms
>>>>>> defaultnamingcontext namingContexts schemanamingcontext saslrealm"
>>>>>> 9:02:00 -0400] conn=12400 op=0 RESULT err=0 tag=101 nentries=1 etime=0
>>>>>> 9:02:00 -0400] conn=12401 op=-1 fd=248 closed - B1
>>>>>> 9:02:00 -0400] conn=12391 op=1 ABANDON targetop=NOTFOUND msgid=1
>>>>>> 9:02:00 -0400] conn=12396 op=1 ABANDON targetop=NOTFOUND msgid=1
>>>>>> 9:02:00 -0400] conn=12400 op=1 ABANDON targetop=NOTFOUND msgid=1
>>>>>> 9:02:00 -0400] conn=12391 op=2 UNBIND
>>>>>> 9:02:00 -0400] conn=12396 op=2 UNBIND
>>>>>> 9:02:00 -0400] conn=12391 op=2 fd=238 closed - U1
>>>>>> 9:02:00 -0400] conn=12396 op=2 fd=243 closed - U1
>>>>>> 9:02:00 -0400] conn=12400 op=2 UNBIND
>>>>>> 9:02:00 -0400] conn=12400 op=2 fd=247 closed - U1
>>>>>> ...
>>>>>>
>>>>>>
>>>>>> Environment:
>>>>>> # cat /etc/redhat-release
>>>>>> CentOS Linux release 7.2.1511 (Core)
>>>>>>
>>>>>> # rpm -qa ipa*
>>>>>> ipa-server-4.2.0-15.0.1.el7.centos.6.1.x86_64
>>>>>> ipa-python-4.2.0-15.0.1.el7.centos.6.1.x86_64
>>>>>> ipa-admintools-4.2.0-15.0.1.el7.centos.6.1.x86_64
>>>>>> ipa-server-trust-ad-4.2.0-15.0.1.el7.centos.6.1.x86_64
>>>>>> ipa-client-4.2.0-15.0.1.el7.centos.6.1.x86_64
>>>>>> ipa-server-dns-4.2.0-15.0.1.el7.centos.6.1.x86_64
>>>>>>
>>>>>> # rpm -qa 389*
>>>>>> 389-ds-base-libs-1.3.4.0-30.el7_2.x86_64
>>>>>> 389-ds-base-1.3.4.0-30.el7_2.x86_64
>>>>>>
>>>>>> We have 4 FreeIPA servers with replication working fine between them.
>>>>>> ipa1 is handling LDAP authentication for more than 400 clients and has
>>>>>> been tuned as recommended in
>>>>>> https://access.redhat.com/documentation/en-US/Red_Hat_Directory_Server/8.2/html/Performance_Tuning_Guide/system-tuning.html
>>>>>>
>>>>>> Is this a known issue?
>>>>>> Any idea what can be causing ns-slapd to hang?
>>>>>>
>>>>>> Thanks in advance!
>>>>>>
>>>>>> Guillermo
>>>>>>
>>>>>> --
>>>>> Manage your subscription for the Freeipa-users mailing list:
>>>>> https://www.redhat.com/mailman/listinfo/freeipa-users
>>>>> Go to http://freeipa.org for more info on the project
>>>>>
>>>>
>>
>>
>
_______________________________________________
FreeIPA-users mailing list -- freeipa-users@lists.fedorahosted.org
To unsubscribe send an email to freeipa-users-le...@lists.fedorahosted.org
