On 20-04-2020 15:16, thierry bordaz wrote:
> On 4/20/20 3:02 PM, Kees Bakker wrote:
>> On 20-04-2020 14:51, Rob Crittenden wrote:
>>> Kees Bakker via FreeIPA-users wrote:
>>>> On 20-04-2020 09:58, Kees Bakker via FreeIPA-users wrote:
>>>>> On 20-04-2020 09:09, Florence Blanc-Renaud wrote:
>>>>>> On 4/20/20 8:28 AM, Kees Bakker via FreeIPA-users wrote:
>>>>>>> Hey,
>>>>>>>
>>>>>>> I'm looking for advice how to analyse/debug this.
>>>>>>>
>>>>>>> On one of the masters the dirsrv is unresponsive. It runs, but every
>>>>>>> attempt to connect it hangs.
>>>>>>>
>>>>>>> The command "systemctl status" does not show anything alarming
>>>>>>>
>>>>>>> ● dirsrv@EXAMPLE-COM.service - 389 Directory Server EXAMPLE-COM.
>>>>>>>      Loaded: loaded (/usr/lib/systemd/system/dirsrv@.service; enabled; 
>>>>>>> vendor preset: disabled)
>>>>>>>      Active: active (running) since vr 2020-04-17 13:46:25 CEST; 1h 
>>>>>>> 33min ago
>>>>>>>     Process: 3123 ExecStartPre=/usr/sbin/ds_systemd_ask_password_acl 
>>>>>>> /etc/dirsrv/slapd-%i/dse.ldif (code=exited, status=0/SUCCESS)
>>>>>>>    Main PID: 3134 (ns-slapd)
>>>>>>>      Status: "slapd started: Ready to process requests"
>>>>>>>      CGroup: 
>>>>>>> /system.slice/system-dirsrv.slice/dirsrv@EXAMPLE-COM.service
>>>>>>>              └─3134 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-EXAMPLE-COM 
>>>>>>> -i /var/run/dirsrv/slapd-EXAMPLE-COM.pid
>>>>>>>
>>>>>>> apr 17 15:13:54 linge.example.com ns-slapd[3134]: GSSAPI client step 1
>>>>>>> apr 17 15:13:54 linge.example.com ns-slapd[3134]: GSSAPI client step 1
>>>>>>> apr 17 15:13:54 linge.example.com ns-slapd[3134]: GSSAPI client step 1
>>>>>>> apr 17 15:13:54 linge.example.com ns-slapd[3134]: GSSAPI client step 1
>>>>>>> apr 17 15:13:54 linge.example.com ns-slapd[3134]: GSSAPI client step 2
>>>>>>> apr 17 15:18:54 linge.example.com ns-slapd[3134]: GSSAPI client step 1
>>>>>>> apr 17 15:18:54 linge.example.com ns-slapd[3134]: GSSAPI client step 1
>>>>>>> apr 17 15:18:55 linge.example.com ns-slapd[3134]: GSSAPI client step 1
>>>>>>> apr 17 15:18:55 linge.example.com ns-slapd[3134]: GSSAPI client step 1
>>>>>>> apr 17 15:18:55 linge.example.com ns-slapd[3134]: GSSAPI client step 2
>>>>>>>
>>>>>>> However, an ldapsearch command hangs forever
>>>>>>>
>>>>>>> [root@rotte ~]# ldapsearch -H ldaps://linge.example.com -D 
>>>>>>> uid=keesbtest,cn=users,cn=accounts,dc=example,dc=com -W -LLL -o 
>>>>>>> ldif-wrap=no -b cn=users,cn=accounts,dc=example,dc=com 
>>>>>>> '(&(objectClass=person)(memberOf=cn=admins,cn=groups,cn=accounts,dc=example,dc=com))'
>>>>>>>  uid
>>>>>>> Enter LDAP Password:
>>>>>>>
>>>>>>> Even if I use the socket 
>>>>>>> (ldapi://%2fvar%2frun%2fslapd-EXAMPLE-COM.socket) the ldapsearch
>>>>>>> command hangs.
>>>>>>>
>>>>>>> "ipactl status" hangs
>>>>>>>
>>>>>>> "kinit" hangs
>>>>>>>
>>>>>>>
>>>>>> Hi,
>>>>>> you can start by having a look at dirsrv error log in
>>>>>> /var/log/dirsrv-slapd-YOUR_DOMAIN/errors, and the journal.
>>>>>>
>>>>>> The FAQ page of 389 also explains a few troubleshooting steps:
>>>>>> http://www.port389.org/docs/389ds/FAQ/faq.html#Troubleshooting
>>>>> I did exactly that, look at the "errors" log, but there was no clue, at 
>>>>> least
>>>>> not for me. Strange enough it kept running for a few hours and then it
>>>>> was hanging again.
>>>>>
>>>>> I tried the command "ipctl restart", but that was hanging forever.
>>>>> However "systemctl restart dirsrv@MY-DOMAIN" was able to restart
>>>>> it after several minutes. Meanwhile the sn-slapd process was using 100%
>>>>> CPU.
>>>>>
>>>>> Another remark I want to make. Every ldap connection (ldapsearch, 
>>>>> whatever)
>>>>> hangs for ever. No timeout, nothing.
>>>>>
>>>>> When it rains, it pours, they say. There is another master with the same 
>>>>> symptom.
>>>>> I'm getting nervous now.
>>>>>
>>>>> Thanks for the Troubleshooting link. I'll have to dive into the deep, I 
>>>>> guess.
>>>> Could it be a deadlock?
>>>>
>>>> [root@linge ~]# grep -a1 '^#0.*lock' slapd-stacktrace.1587374239.txt
>>>> Thread 23 (Thread 0x7ff8ff265700 (LWP 14474)):
>>>> #0  0x00007ff929430144 in pthread_rwlock_rdlock () at 
>>>> /lib64/libpthread.so.0
>>>> #1  0x00007ff9190cc49c in map_rdlock () at 
>>>> /usr/lib64/dirsrv/plugins/schemacompat-plugin.so
>>>> -- 
>>>> Thread 7 (Thread 0x7ff8f7255700 (LWP 14490)):
>>>> #0  0x00007ff929430144 in pthread_rwlock_rdlock () at 
>>>> /lib64/libpthread.so.0
>>>> #1  0x00007ff9190cc49c in map_rdlock () at 
>>>> /usr/lib64/dirsrv/plugins/schemacompat-plugin.so
>>>> -- 
>>>> Thread 4 (Thread 0x7ff8f5a52700 (LWP 14493)):
>>>> #0  0x00007ff929430144 in pthread_rwlock_rdlock () at 
>>>> /lib64/libpthread.so.0
>>>> #1  0x00007ff9190cc49c in map_rdlock () at 
>>>> /usr/lib64/dirsrv/plugins/schemacompat-plugin.so
>>>> -- 
>>>> Thread 2 (Thread 0x7ff92c355700 (LWP 15679)):
>>>> #0  0x00007ff92943035e in pthread_rwlock_wrlock () at 
>>>> /lib64/libpthread.so.0
>>>> #1  0x00007ff9190b6639 in backend_be_pre_write_cb () at 
>>>> /usr/lib64/dirsrv/plugins/schemacompat-plugin.so
>>>>
>>>> Without debuginfo the trace of these threads look like this:
>>>>
>>>> Thread 23 (Thread 0x7ff8ff265700 (LWP 14474)):
>>>> #0  0x00007ff929430144 in pthread_rwlock_rdlock () at 
>>>> /lib64/libpthread.so.0
>>>> #1  0x00007ff9190cc49c in map_rdlock () at 
>>>> /usr/lib64/dirsrv/plugins/schemacompat-plugin.so
>>>> #2  0x00007ff9190b8745 in backend_search_cb () at 
>>>> /usr/lib64/dirsrv/plugins/schemacompat-plugin.so
>>>> #3  0x00007ff92bcd9028 in plugin_call_func () at 
>>>> /usr/lib64/dirsrv/libslapd.so.0
>>>> #4  0x00007ff92bcd92e3 in plugin_call_plugins () at 
>>>> /usr/lib64/dirsrv/libslapd.so.0
>>>> #5  0x00007ff92bccc0d7 in op_shared_search () at 
>>>> /usr/lib64/dirsrv/libslapd.so.0
>>>> #6  0x0000562454427bbe in do_search ()
>>>> #7  0x000056245441595a in connection_threadmain ()
>>>> #8  0x00007ff929a8cc5b in _pt_root () at /lib64/libnspr4.so
>>>> #9  0x00007ff92942ce65 in start_thread () at /lib64/libpthread.so.0
>>>> #10 0x00007ff928ad888d in clone () at /lib64/libc.so.6
>>>>
>>>> Thread 7 (Thread 0x7ff8f7255700 (LWP 14490)):
>>>> #0  0x00007ff929430144 in pthread_rwlock_rdlock () at 
>>>> /lib64/libpthread.so.0
>>>> #1  0x00007ff9190cc49c in map_rdlock () at 
>>>> /usr/lib64/dirsrv/plugins/schemacompat-plugin.so
>>>> #2  0x00007ff9190b8745 in backend_search_cb () at 
>>>> /usr/lib64/dirsrv/plugins/schemacompat-plugin.so
>>>> #3  0x00007ff92bcd9028 in plugin_call_func () at 
>>>> /usr/lib64/dirsrv/libslapd.so.0
>>>> #4  0x00007ff92bcd92e3 in plugin_call_plugins () at 
>>>> /usr/lib64/dirsrv/libslapd.so.0
>>>> #5  0x00007ff92bccc0d7 in op_shared_search () at 
>>>> /usr/lib64/dirsrv/libslapd.so.0
>>>> #6  0x0000562454427bbe in do_search ()
>>>> #7  0x000056245441595a in connection_threadmain ()
>>>> #8  0x00007ff929a8cc5b in _pt_root () at /lib64/libnspr4.so
>>>> #9  0x00007ff92942ce65 in start_thread () at /lib64/libpthread.so.0
>>>> #10 0x00007ff928ad888d in clone () at /lib64/libc.so.6
>>>>
>>>> Thread 4 (Thread 0x7ff8f5a52700 (LWP 14493)):
>>>> #0  0x00007ff929430144 in pthread_rwlock_rdlock () at 
>>>> /lib64/libpthread.so.0
>>>> #1  0x00007ff9190cc49c in map_rdlock () at 
>>>> /usr/lib64/dirsrv/plugins/schemacompat-plugin.so
>>>> #2  0x00007ff9190b8745 in backend_search_cb () at 
>>>> /usr/lib64/dirsrv/plugins/schemacompat-plugin.so
>>>> #3  0x00007ff92bcd9028 in plugin_call_func () at 
>>>> /usr/lib64/dirsrv/libslapd.so.0
>>>> #4  0x00007ff92bcd92e3 in plugin_call_plugins () at 
>>>> /usr/lib64/dirsrv/libslapd.so.0
>>>> #5  0x00007ff92bccc0d7 in op_shared_search () at 
>>>> /usr/lib64/dirsrv/libslapd.so.0
>>>> #6  0x0000562454427bbe in do_search ()
>>>> #7  0x000056245441595a in connection_threadmain ()
>>>> #8  0x00007ff929a8cc5b in _pt_root () at /lib64/libnspr4.so
>>>> #9  0x00007ff92942ce65 in start_thread () at /lib64/libpthread.so.0
>>>> #10 0x00007ff928ad888d in clone () at /lib64/libc.so.6
>>>>
>>>> Thread 2 (Thread 0x7ff92c355700 (LWP 15679)):
>>>> #0  0x00007ff92943035e in pthread_rwlock_wrlock () at 
>>>> /lib64/libpthread.so.0
>>>> #1  0x00007ff9190b6639 in backend_be_pre_write_cb () at 
>>>> /usr/lib64/dirsrv/plugins/schemacompat-plugin.so
>>>> #2  0x00007ff92bcd9028 in plugin_call_func () at 
>>>> /usr/lib64/dirsrv/libslapd.so.0
>>>> #3  0x00007ff92bcd92e3 in plugin_call_plugins () at 
>>>> /usr/lib64/dirsrv/libslapd.so.0
>>>> #4  0x00007ff91b08ac88 in ldbm_back_delete () at 
>>>> /usr/lib64/dirsrv/plugins/libback-ldbm.so
>>>> #5  0x00007ff92bc89feb in op_shared_delete () at 
>>>> /usr/lib64/dirsrv/libslapd.so.0
>>>> #6  0x00007ff92bc8a1c8 in delete_internal_pb () at 
>>>> /usr/lib64/dirsrv/libslapd.so.0
>>>> #7  0x00007ff9196f131e in delete_changerecord () at 
>>>> /usr/lib64/dirsrv/plugins/libretrocl-plugin.so
>>>> #8  0x00007ff9196f1605 in changelog_trim_thread_fn () at 
>>>> /usr/lib64/dirsrv/plugins/libretrocl-plugin.so
>>>> #9  0x00007ff929a8cc5b in _pt_root () at /lib64/libnspr4.so
>>>> #10 0x00007ff92942ce65 in start_thread () at /lib64/libpthread.so.0
>>>> #11 0x00007ff928ad888d in clone () at /lib64/libc.so.6
>>>>
>>>> With the following (stripped) debuginfo you can see that it is hanging 
>>>> while deleting
>>>> a changerecord.
>>>>
>>>> Thread 2 (Thread 0x7ff92c355700 (LWP 15679)):
>>>> #0  0x00007ff92943035e in pthread_rwlock_wrlock () at 
>>>> /lib64/libpthread.so.0
>>>> #1  0x00007ff9190b6639 in backend_be_pre_write_cb () at 
>>>> /usr/lib64/dirsrv/plugins/schemacompat-plugin.so
>>>> #2  0x00007ff92bcd9028 in plugin_call_func (list=0x5624565a3340, 
>>>> operation=operation@entry=453, pb=pb@entry=0x5624629c28
>>>> 40, call_one=call_one@entry=0) at ldap/servers/slapd/plugin.c:2028
>>>>          n = <optimized out>
>>>>          func = 0x7ff9190b65a0 <backend_be_pre_write_cb>
>>>>          rc = <optimized out>
>>>>          return_value = 0
>>>>          count = 0
>>>> #3  0x00007ff92bcd92e3 in plugin_call_list (pb=0x5624629c2840, 
>>>> operation=453, list=<optimized out>) at ldap/servers/slap
>>>> d/plugin.c:1972
>>>>          p = 0x5624564ea2c0
>>>>          locked = <optimized out>
>>>>          plugin_list_number = 3
>>>>          rc = 0
>>>>          do_op = <optimized out>
>>>> #4  0x00007ff92bcd92e3 in plugin_call_plugins (pb=pb@entry=0x5624629c2840, 
>>>> whichfunction=whichfunction@entry=453) at lda
>>>> p/servers/slapd/plugin.c:442
>>>>          p = 0x5624564ea2c0
>>>>          locked = <optimized out>
>>>>          plugin_list_number = 3
>>>>          rc = 0
>>>>          do_op = <optimized out>
>>>> #5  0x00007ff91b08ac88 in ldbm_back_delete (pb=0x5624629c2840) at 
>>>> ldap/servers/slapd/back-ldbm/ldbm_delete.c:373
>>>>          be = 0x5624564fc000
>>>>          inst = 0x56245635c680
>>>>          li = 0x5624562d3a40
>>>>          e = 0x562462377810
>>>>          tombstone = 0x0
>>>>          original_tombstone = 0x0
>>>>          tmptombstone = 0x0
>>>>          dn = 0x56245d90fa70 "changenumber=879530,cn=changelog"
>>>>          txn = {back_txn_txn = 0x562459c2cc60}
>>>>          parent_txn = 0x0
>>>>          retval = 0
>>>>          msg = <optimized out>
>>>>          errbuf = 0x0
>>>>          retry_count = <optimized out>
>>>>          disk_full = 0
>>>>          parent_found = <optimized out>
>>>>          ruv_c_init = 0
>>>>          parent_modify_c = {old_entry = 0x0, new_entry = 0x0, smods = 0x0, 
>>>> attr_encrypt = 0}
>>>>          ruv_c = {old_entry = 0x0, new_entry = 0x0, smods = 0x0, 
>>>> attr_encrypt = 0}
>>>> ...
>>>>
>>> What distribution are you using and what version of 389-ds?
>>>
>> Centos7
>>
>> [root@linge ~]# rpm -qa 389\*
>> 389-ds-base-debuginfo-1.3.9.1-13.el7_7.x86_64
>> 389-ds-base-libs-1.3.9.1-13.el7_7.x86_64
>> 389-ds-base-1.3.9.1-13.el7_7.x86_64
>>
>> [root@linge ~]# rpm -qa centos-release
>> centos-release-7-7.1908.0.el7.centos.x86_64
>>
>>
> This is a known bug [1].
> With the same bug there are two deadlock scenario but only one is fixed (for 
> example in  slapi-nis-0.56.4-1 [2]).
> A fix for the second one is under tests.
>
> At the moment I would recommend the workaround [3]. The drawback is a growth 
> of the retroCL database but unless you have a very high rate of update it 
> should not be a concern.
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1751295
> [2] https://koji.fedoraproject.org/koji/buildinfo?buildID=1457771
> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1751295#c5

OK, thanks for the help. I will try that.

Two remarks.
1) I did an update a few days ago. (1.3.8.4-18.el7_6 => 1.3.9.1-13.el7_7). Then 
two days after the update this hang started to happen.
2) the nasty thing about this is, that 398-ds was still alive, but all 
connections to it hang forever (even "ipactl restart")
-- 
Kees
_______________________________________________
FreeIPA-users mailing list -- freeipa-users@lists.fedorahosted.org
To unsubscribe send an email to freeipa-users-le...@lists.fedorahosted.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedorahosted.org/archives/list/freeipa-users@lists.fedorahosted.org

Reply via email to