On 8/9/23 18:55, Harry G Coin wrote:
Thierry asked for a recap summary below, so forgive the 'top post'.  Here it is:

4.9.10 default install on two systems, call them primary (the one with kasp.db) and secondary, but otherwise multi-master; 1 Gb link between them, modest/old CPUs and drives, 5 GB memory, with DNS/DNSSEC and adtrust (aimed at local Samba share support only).  Unremarkable initial install.  Normal operations, GUI, etc.

A Python program using the ldap2 backend on the primary starts loading a few dozen default domains with A/AAAA and associated PTR records.  It first does a dns find/show to check for existence and, if absent, adds the domain/subdomain and the missing A/AAAA and associated PTR records.  The logs show extensive DNSSEC traffic: notifies sent back and forth between primary and secondary by bind9 (which already has the data in LDAP, so why 'notify' via bind at all?), serial numbers going up, DNSSEC updates.  Every now and then the program checks whether DNSSEC keys need rotating or whether new zones have appeared, but that is fairly infrequent and seems unrelated.
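The check-then-add pattern described above can be sketched with a toy simulation; an in-memory dict stands in for the LDAP-backed DNS tree, and the names and addresses are illustrative only, not the actual ipalib/ldap2 calls:

```python
# Toy simulation of the loader: check for an existing record first
# (the dns find/show step), then write only if it is absent, so each
# genuinely new record produces exactly one LDAP add (and one notify).
import ipaddress

zones: dict[str, dict[str, set[str]]] = {}

def ensure_record(zone: str, name: str, rtype: str, value: str) -> bool:
    """Add a record if missing; return True when a write happened."""
    records = zones.setdefault(zone, {})
    rrset = records.setdefault(f"{name}/{rtype}", set())
    if value in rrset:
        return False          # find/show found it: no write, no notify
    rrset.add(value)          # absent: one add, one serial bump downstream
    return True

def ensure_ptr(addr: str, fqdn: str) -> bool:
    """Derive the *.arpa owner name for an A/AAAA value and ensure its PTR."""
    rev = ipaddress.ip_address(addr).reverse_pointer  # e.g. 1.2.0.192.in-addr.arpa
    name, zone = rev.split(".", 1)
    return ensure_record(zone, name, "PTR", fqdn)
```

The point of the sketch is that the load itself is idempotent; every write the loader issues corresponds to a record that really was missing.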

After no more than a few minutes of adding records, "writeback to ldap failed" appears in the primary's log.  Nothing in any log indicates anything else amiss, and 'systemctl is-system-running' reports 'running'.  Login attempts on the GUI fail 'for an unknown reason', named/bind9 queries for A/AAAA seem to work, and anything that calls ns-slapd times out or hangs forever.  CPU usage is near 0.


Did you get a pstack of ns-slapd at that time?



'systemctl restart ipa' and/or a reboot restores operations; HOWEVER, there will be at least a 10-minute wait with ns-slapd at 100% CPU until the reboot process forcibly kills it.

I guess most ns-slapd workers (the threads running the requests) have stopped, but there's no telling which one is eating CPU. A 'top -H' and a pstack would help.


Upgrading to 4.9.11 moved the 'writeback to ldap failed' message to the secondary instead of the primary, with the same further consequences.

Alexander's dsconf suggestion changed the symptoms: it broke DNSSEC updates with an LDAP timeout error message.

There is nothing remarkable whatsoever about this two-node setup. I suspect that test environments using the latest processors and all-NVMe storage are just too performant to manifest it, or that the test environments don't have DNSSEC enabled and don't add a few thousand records across a few dozen subdomains.

I need some way forward; it's dead in the water now.  My present 'plan', such as it is, is to move the FreeIPA VMs to faster systems with more memory and 10 Gb interconnects in hopes of not hitting this. But of course this is one of those 'sword hanging over everyone's head by a thread', 'don't breathe on it wrong or you'll die' situations that needs an answer before trust can come back.

I appreciate the focus!


On 8/9/23 11:24, Thierry Bordaz wrote:

On 8/9/23 17:15, Harry G Coin wrote:

On 8/9/23 01:00, Alexander Bokovoy wrote:
On Tue, 08 Aug 2023, Harry G Coin wrote:
Thanks for your help.  Details below. The problem 'moved' in what I hope is a diagnostically useful way, but the system remains broken.

On 8/8/23 08:54, Alexander Bokovoy wrote:
On Tue, 08 Aug 2023, Harry G Coin wrote:

On 8/8/23 02:43, Alexander Bokovoy wrote:
pstack $(pgrep ns-slapd) > ns-slapd.log
Tried an upgrade from 4.9.10 to 4.9.11; the "writeback to ldap failed" error moved from the primary instance (on which the DNS records were being added) to the replica, which hung in the same fashion.  Here's the log you asked for from attempting 'systemctl restart dirsrv@...'; it just hangs at 100% CPU for about 10 minutes.

Thank you. Are you using schema compat for some legacy clients?


This is a fresh install of 4.9.10 from about a week ago, upgraded to 4.9.11 yesterday: just two FreeIPA instances with no appreciable user load, using the install defaults. The in-house system then starts loading lots of DNS records via the Python ldap2 interface on the first of the two systems; the replica produced what you see in this post.

There is no 'private' information involved of any sort. It's supposed to field DNS calls from the public, but it was so unreliable I had to deploy unbound on other servers, so all FreeIPA does is IXFR to unbound for the heavy load.  There may be fewer than 16 other in-house lab systems, maybe 2 or 3 with any activity, that use it for DNS.  The only other clue is that these run on VMs on older servers and have no software installed beyond FreeIPA, what FreeIPA needs to run, and the in-house program that loads the DNS.

Just to exclude potential problems with schema compat, it can be
disabled if you are not using it.

How?  The installs just use all the defaults, other than enabling DNSSEC and PTR records for all A/AAAA.

I'm officially in 'desperation mode': not being able to populate DNS in FreeIPA reduces everyone to pencil, paper, and coffee, with full project stoppage until it's fixed or at least worked around.  So anything that 'might help' can be sacrificed so that at least 'something' works 'somewhat'.  If the old AD support needs to be 'broken' or 'off' so that most of the rest 'works, sort of', then how do I do that?

Really, this can't be hard to reproduce: it's just two instances with a 1 Gb link between them, each with a pair of old rusty hard drives in an LVM mirror using a COW file system, DNSSEC on, and one of them loading lots of DNS with reverse pointers for each A/AAAA; maybe 200 to 600 PTR records per *.arpa zone and 10 to 200 records per subdomain, maybe 200 domains total.  A couple of Python for loops and, hey presto, you'll see FreeIPA lock up without notice in your lab as well.  I can't imagine that triggering these race conditions should be difficult when the only significant load is DNS adds/finds/shows.
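The "couple of Python for loops" reproducer implied here can be sketched as a pure record generator; zone names, counts, and the 10.0.0.0 address range are made up for illustration, and the actual adds through the ipalib/ldap2 API are not shown:

```python
# Sketch of a reproducer load: n_zones forward zones, each host getting
# an A record plus the matching PTR owner name (as 'PTR for all A/AAAA'
# would create). Feeding these tuples to the DNS API at full speed is
# the kind of load described above.
import ipaddress

def generate_records(n_zones=200, hosts_per_zone=50):
    """Yield (rtype, owner_name, value) tuples for forward + reverse records."""
    for z in range(n_zones):
        zone = f"lab{z}.example.test."
        for h in range(hosts_per_zone):
            addr = ipaddress.IPv4Address("10.0.0.0") + z * 256 + h + 1
            fqdn = f"host{h}.{zone}"
            yield ("A", fqdn, str(addr))
            # reverse_pointer gives e.g. 1.0.0.10.in-addr.arpa
            yield ("PTR", addr.reverse_pointer + ".", fqdn)
```

With the defaults this emits 200 zones times 50 hosts times 2 records, i.e. 20,000 adds, in the ballpark of the volumes mentioned above.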

I appreciate the help, and have become officially fearful about FreeIPA.  Maybe it's seldom used this extensively for DNS, and my use case is an outlier?  Why are so few seeing this? It's a fully default package install, with no custom changes to the OS, FreeIPA, or other packages.  I don't get it.

Thanks for any leads or help!


Hi Harry,


I agree with Mark: nothing suspicious in thread 30; it is flushing its txn. The discussion is quite long; would you mind re-explaining the current symptoms?
Is it hanging during an update? Consuming CPU?
Could you run top -H -p <pid> -n 5 -d 3

If it is hanging, could you run 'db_stat -CA -h /dev/shm/slapd-<inst>/ -N'?


Regards,
Thierry




I don't think it is about named per se; it is a bit of an unfortunate
interop inside ns-slapd between different plugins. bind-dyndb-ldap
relies on the syncrepl extension, whose implementation in ns-slapd
uses the retro changelog content. The retro changelog plugin triggers some
updates that cause the schema compatibility plugin to lock itself up,
depending on the order of updates that the retro changelog captures. We
fixed that in the slapi-nis package some time ago and it *should* be
ignoring the retro changelog changes, but somehow they still propagate
into it. There are a few places in ns-slapd which were addressed just
recently, and those updates might help (out later this year in RHEL).
Disabling schema compat would be the best.
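For reference, disabling the compat tree can presumably be done with FreeIPA's stock management tool; a sketch, assuming a default install (the directory server instance name varies per realm):

```shell
# Sketch: disable the slapi-nis schema compatibility tree on each server.
# Commands assume a stock FreeIPA install; the realm name is an example.
ipa-compat-manage disable

# Restart the directory server so the plugin change takes effect;
# the instance name is derived from your realm (dots become dashes).
systemctl restart dirsrv@EXAMPLE-COM.service
```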

What's worse, every reboot attempt waits the full '9 min 29 sec' before systemd forcibly terminates ns-slapd to finish the 'stop job'.

That's why I'm so troubled by all this: there is no interference from anything other than what FreeIPA itself puts out there, and it just locks up with a message that gives no indication of what to do about it, with nothing in any logs and 'systemctl is-system-running' reporting 'running'.

You could easily replicate this: imagine a simple validation test that sets up two FreeIPA nodes, turns on DNSSEC, creates some domains, then adds A, AAAA, and *.arpa records using the ldap2 API on one of the nodes.  Limit the net speed between the nodes to a typical 1 Gb link, with at most 4 processor cores of some older vintage and 5 GB of memory.  It locks up less than 2 minutes after the DNS load starts.

What's really odd is that bind9/named keeps blasting out change notifications for some of the updated domains, then a few lines later, with no intervening activity in any log or by any program affecting the zone, publishes further change notifications with a new serial number for the same zone. This happens for all the zones that get modifications.  I'm thinking 'rr' computations?  I wonder if those entries, being auto-generated internally, are creating a 'flow control' issue between the primary and the replica.
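For context on why each extra internal write produces another round of notifies: every zone write bumps the SOA serial, and secondaries decide whether a zone has changed by RFC 1982 serial-number comparison. A minimal toy illustration of that arithmetic (not bind's implementation):

```python
# Toy RFC 1982 serial-number arithmetic (32-bit). A secondary treats the
# zone as changed whenever the announced serial is "greater" in this
# wrapping sense, so every internally generated write yields another
# notify-and-transfer cycle.
SERIAL_BITS = 32
HALF = 2 ** (SERIAL_BITS - 1)
MOD = 2 ** SERIAL_BITS

def serial_gt(a: int, b: int) -> bool:
    """True if serial a is newer than serial b under RFC 1982."""
    return a != b and ((a - b) % MOD) < HALF

def bump(serial: int) -> int:
    """Increment with 32-bit wraparound, as each zone write would."""
    return (serial + 1) % MOD
```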

This is something the retro changelog is responsible for, as it is the
data store used by the syncrepl protocol implementation. If these
'changes' appear again and again, it means the retro changelog plugin marks
them as new for this particular syncrepl client (bind-dyndb-ldap).

All threads other than thread 30 are normal (idle) ones, but this
one blocks the database backend in the log flush sequence while
writing the retro changelog entry for this updated DNS record:

Thread 30 (Thread 0x7f0e583ff700 (LWP 1438)):
#0  0x00007f0e9bf7d8af in fdatasync () at target:/lib64/libc.so.6
#1  0x00007f0e91cbe6b5 in __os_fsync () at target:/lib64/libdb-5.3.so
#2  0x00007f0e91ca598c in __log_flush_int () at target:/lib64/libdb-5.3.so
#3  0x00007f0e91ca7dd0 in __log_flush () at target:/lib64/libdb-5.3.so
#4  0x00007f0e91ca7f73 in __log_flush_pp () at target:/lib64/libdb-5.3.so
#5  0x00007f0e8afe1304 in bdb_txn_commit (li=<optimized out>, txn=0x7f0e583fd028, use_lock=1) at ldap/servers/slapd/back-ldbm/db-bdb/bdb_layer.c:2772
#6  0x00007f0e8af95515 in dblayer_txn_commit (be=0x7f0e88424f00, txn=<optimized out>) at ldap/servers/slapd/back-ldbm/dblayer.c:736
#7  0x00007f0e8afa7ebe in ldbm_back_add (pb=0x7f0e85748860) at ldap/servers/slapd/back-ldbm/ldbm_add.c:1242
#8  0x00007f0e9d7d7728 in op_shared_add (pb=pb@entry=0x7f0e85748860) at ldap/servers/slapd/add.c:692
#9  0x00007f0e9d7d7bbe in add_internal_pb (pb=pb@entry=0x7f0e85748860) at ldap/servers/slapd/add.c:407
#10 0x00007f0e9d7d8975 in slapi_add_internal_pb (pb=pb@entry=0x7f0e85748860) at ldap/servers/slapd/add.c:331
#11 0x00007f0e8960f8bf in write_replog_db (newsuperior=0x0, modrdn_mods=0x0, newrdn=0x0, post_entry=<optimized out>, log_e=0x7f0e4df5b9c0, curtime=1691511446, flag=0, log_m=0x7f0e5ec0d440, dn=0x7f0e57a40740 "idnsname=8.0.f.0.0.0.0.0.0.0.0.1.0.0.c.f.ip6.arpa.,cn=dns,dc=1,dc=quietfountain,dc=com", optype=<optimized out>, pb=0x7f0e66a09580) at ldap/servers/plugins/retrocl/retrocl_po.c:369
#12 0x00007f0e8960f8bf in retrocl_postob (pb=0x7f0e66a09580, optype=<optimized out>) at ldap/servers/plugins/retrocl/retrocl_po.c:697
#13 0x00007f0e9d83cc79 in plugin_call_func (list=0x7f0e924aae00, operation=operation@entry=561, pb=pb@entry=0x7f0e66a09580, call_one=call_one@entry=0) at ldap/servers/slapd/plugin.c:2032
#14 0x00007f0e9d83cec4 in plugin_call_list (pb=0x7f0e66a09580, operation=561, list=<optimized out>) at ldap/servers/slapd/plugin.c:1973
#15 0x00007f0e9d83cec4 in plugin_call_plugins (pb=pb@entry=0x7f0e66a09580, whichfunction=whichfunction@entry=561) at ldap/servers/slapd/plugin.c:442
#16 0x00007f0e8afc3658 in ldbm_back_modify (pb=<optimized out>) at ldap/servers/slapd/back-ldbm/ldbm_modify.c:1002
#17 0x00007f0e9d828300 in op_shared_modify (pb=pb@entry=0x7f0e66a09580, pw_change=pw_change@entry=0, old_pw=0x0) at ldap/servers/slapd/modify.c:1025
#18 0x00007f0e9d829a00 in do_modify (pb=pb@entry=0x7f0e66a09580) at ldap/servers/slapd/modify.c:380
#19 0x0000564ed703475b in connection_dispatch_operation (pb=0x7f0e66a09580, op=<optimized out>, conn=<optimized out>) at ldap/servers/slapd/connection.c:651
#20 0x0000564ed703475b in connection_threadmain (arg=<optimized out>) at ldap/servers/slapd/connection.c:1803
#21 0x00007f0e9a24b968 in _pt_root () at target:/lib64/libnspr4.so
#22 0x00007f0e99be61ca in start_thread () at target:/lib64/libpthread.so.0
#23 0x00007f0e9be90e73 in clone () at target:/lib64/libc.so.6

Mark, Thierry, any hints here? (For full trace see thread
https://lists.fedorahosted.org/archives/list/freeipa-users@lists.fedorahosted.org/thread/TMRXHCORFU3QRQL6FSZTS4OIHYOAVXWF/)



