On 8/9/23 18:55, Harry G Coin wrote:
Thierry asked for a recap summary below, so forgive the 'top post'.  Here it is:

4.9.10 default install on two systems, call them primary (the one with kasp.db) and secondary, but otherwise multi-master; 1 Gb link between them, modest/old CPUs and drives, 5 GB memory, with DNS/DNSSEC and adtrust (aimed at local Samba share support only).  Unremarkable initial install.  Normal operations, GUI, etc.

A Python program using the ldap2 backend on the primary starts loading a few dozen default domains with A/AAAA and associated PTR records.  It first does a dns find/show to check for existence and, if absent, adds the domain/subdomain and the missing A/AAAA and associated PTR records.  The logs show extensive DNSSEC traffic: notifies sent back and forth between primary and secondary by bind9 (which already has the data in LDAP, so why 'notify' via bind at all?), serial numbers going up, DNSSEC updates.  Every now and then the program checks whether DNSSEC keys need rotating or whether new zones have appeared, but that is fairly infrequent and seems unrelated.
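The check-then-add pattern described above can be sketched with a toy simulation; an in-memory dict stands in for the LDAP-backed DNS tree, and the names and addresses are illustrative only, not the actual ipalib/ldap2 calls:

```python
# Toy simulation of the loader: check for an existing record first
# (the dns find/show step), then write only if it is absent, so each
# genuinely new record produces exactly one LDAP add (and one notify).
import ipaddress

zones: dict[str, dict[str, set[str]]] = {}

def ensure_record(zone: str, name: str, rtype: str, value: str) -> bool:
    """Add a record if missing; return True when a write happened."""
    records = zones.setdefault(zone, {})
    rrset = records.setdefault(f"{name}/{rtype}", set())
    if value in rrset:
        return False          # find/show found it: no write, no notify
    rrset.add(value)          # absent: one add, one serial bump downstream
    return True

def ensure_ptr(addr: str, fqdn: str) -> bool:
    """Derive the *.arpa owner name for an A/AAAA value and ensure its PTR."""
    rev = ipaddress.ip_address(addr).reverse_pointer  # e.g. 1.2.0.192.in-addr.arpa
    name, zone = rev.split(".", 1)
    return ensure_record(zone, name, "PTR", fqdn)
```

The point of the sketch is that the load itself is idempotent; every write the loader issues corresponds to a record that really was missing.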

After no more than a few minutes of adding records, "writeback to ldap failed" appears in the primary's log.  Nothing in any log indicates anything else amiss, and 'systemctl is-system-running' reports 'running'.  Login attempts on the GUI fail 'for an unknown reason', named/bind9 queries for A/AAAA seem to work, and anything that calls ns-slapd times out or hangs forever.  CPU usage is near 0.


Did you get a pstack of ns-slapd at that time?



'systemctl restart ipa' and/or a reboot restores operations; HOWEVER, there will be at least a 10-minute wait with ns-slapd at 100% CPU until the reboot process forcibly kills it.

I guess most ns-slapd workers (the threads running the requests) have stopped, but there's no telling which one is eating CPU. A 'top -H' and a pstack would help.


Upgrading to 4.9.11 moved the 'writeback to ldap failed' message to the secondary instead of the primary, with the same further consequences.

Alexander's dsconf suggestion changed the symptoms: it broke DNSSEC updates with an LDAP timeout error message.

There is nothing remarkable whatsoever about this two-node setup. I suspect that test environments using the latest processors and all-NVMe storage are just too performant to manifest it, or that the test environments don't have DNSSEC enabled and don't add a few thousand records across a few dozen subdomains.

I need some way forward; it's dead in the water now.  My present 'plan', such as it is, is to move the FreeIPA VMs to faster systems with more memory and 10 Gb interconnects in hopes of not hitting this. But of course this is one of those 'sword hanging over everyone's head by a thread', 'don't breathe on it wrong or you'll die' situations that needs an answer before trust can come back.

I appreciate the focus!


On 8/9/23 11:24, Thierry Bordaz wrote:

On 8/9/23 17:15, Harry G Coin wrote:

On 8/9/23 01:00, Alexander Bokovoy wrote:
On Tue, 08 Aug 2023, Harry G Coin wrote:
Thanks for your help.  Details below. The problem 'moved' in what I hope is a diagnostically useful way, but the system remains broken.

On 8/8/23 08:54, Alexander Bokovoy wrote:
On Tue, 08 Aug 2023, Harry G Coin wrote:

On 8/8/23 02:43, Alexander Bokovoy wrote:
pstack $(pgrep ns-slapd) > ns-slapd.log
Tried an upgrade from 4.9.10 to 4.9.11; the "writeback to ldap failed" error moved from the primary instance (on which the DNS records were being added) to the replica, which hung in the same fashion.  Here's the log you asked for from attempting 'systemctl restart dirsrv@...'; it just hangs at 100% CPU for about 10 minutes.

Thank you. Are you using schema compat for some legacy clients?


This is a fresh install of 4.9.10 from about a week ago, upgraded to 4.9.11 yesterday: just two FreeIPA instances with no appreciable user load, using the install defaults. The in-house system then starts loading lots of DNS records via the Python ldap2 interface on the first of the two systems; the replica produced what you see in this post.

There is no 'private' information involved of any sort. It's supposed to field DNS calls from the public, but it was so unreliable I had to deploy unbound on other servers, so all FreeIPA does is IXFR to unbound for the heavy load.  There may be fewer than 16 other in-house lab systems, maybe 2 or 3 with any activity, that use it for DNS.  The only other clue is that these run on VMs on older servers and have no software installed beyond FreeIPA, what FreeIPA needs to run, and the in-house program that loads the DNS.

Just to exclude potential problems with schema compat, it can be
disabled if you are not using it.

How?  The installs just use all the defaults, other than enabling DNSSEC and PTR records for all A/AAAA.

I'm officially in 'desperation mode': not being able to populate DNS in FreeIPA reduces everyone to pencil, paper, and coffee, with full project stoppage until it's fixed or at least worked around.  So anything that 'might help' can be sacrificed so that at least 'something' works 'somewhat'.  If the old AD support needs to be 'broken' or 'off' so that most of the rest 'works, sort of', then how do I do that?

Really, this can't be hard to reproduce: it's just two instances with a 1 Gb link between them, each with a pair of old rusty hard drives in an LVM mirror using a COW file system, DNSSEC on, and one of them loading lots of DNS with reverse pointers for each A/AAAA; maybe 200 to 600 PTR records per *.arpa zone and 10 to 200 records per subdomain, maybe 200 domains total.  A couple of Python for loops and, hey presto, you'll see FreeIPA lock up without notice in your lab as well.  I can't imagine that triggering these race conditions should be difficult when the only significant load is DNS adds/finds/shows.
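The "couple of Python for loops" reproducer implied here can be sketched as a pure record generator; zone names, counts, and the 10.0.0.0 address range are made up for illustration, and the actual adds through the ipalib/ldap2 API are not shown:

```python
# Sketch of a reproducer load: n_zones forward zones, each host getting
# an A record plus the matching PTR owner name (as 'PTR for all A/AAAA'
# would create). Feeding these tuples to the DNS API at full speed is
# the kind of load described above.
import ipaddress

def generate_records(n_zones=200, hosts_per_zone=50):
    """Yield (rtype, owner_name, value) tuples for forward + reverse records."""
    for z in range(n_zones):
        zone = f"lab{z}.example.test."
        for h in range(hosts_per_zone):
            addr = ipaddress.IPv4Address("10.0.0.0") + z * 256 + h + 1
            fqdn = f"host{h}.{zone}"
            yield ("A", fqdn, str(addr))
            # reverse_pointer gives e.g. 1.0.0.10.in-addr.arpa
            yield ("PTR", addr.reverse_pointer + ".", fqdn)
```

With the defaults this emits 200 zones times 50 hosts times 2 records, i.e. 20,000 adds, in the ballpark of the volumes mentioned above.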

I appreciate the help, and have become officially fearful about FreeIPA.  Maybe it's seldom used this extensively for DNS, and my use case is an outlier?  Why are so few seeing this? It's a fully default package install, with no custom changes to the OS, FreeIPA, or other packages.  I don't get it.

Thanks for any leads or help!


Hi Harry,


I agree with Mark: nothing suspicious in thread 30; it is flushing its txn. The discussion is quite long; would you mind re-explaining the current symptoms?
Is it hanging during an update? Consuming CPU?
Could you run top -H -p <pid> -n 5 -d 3

If it is hanging, could you run 'db_stat -CA -h /dev/shm/slapd-<inst>/ -N'?


Regards,
Thierry




I don't think it is about named per se; it is a bit of an unfortunate
interop inside ns-slapd between different plugins. bind-dyndb-ldap
relies on the syncrepl extension, whose implementation in ns-slapd
uses the retro changelog content. The retro changelog plugin triggers some
updates that cause the schema compatibility plugin to lock itself up,
depending on the order of updates that the retro changelog captures. We
fixed that in the slapi-nis package some time ago and it *should* be
ignoring the retro changelog changes, but somehow they still propagate
into it. There are a few places in ns-slapd which were addressed just
recently, and those updates might help (out later this year in RHEL).
Disabling schema compat would be the best.
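For reference, disabling the compat tree can presumably be done with FreeIPA's stock management tool; a sketch, assuming a default install (the directory server instance name varies per realm):

```shell
# Sketch: disable the slapi-nis schema compatibility tree on each server.
# Commands assume a stock FreeIPA install; the realm name is an example.
ipa-compat-manage disable

# Restart the directory server so the plugin change takes effect;
# the instance name is derived from your realm (dots become dashes).
systemctl restart dirsrv@EXAMPLE-COM.service
```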

What's worse, every reboot attempt waits the full '9 min 29 sec' before systemd forcibly terminates ns-slapd to finish the 'stop job'.

That's why I'm so troubled by all this: there is no interference from anything other than what FreeIPA itself puts out there, and it just locks up with a message that gives no indication of what to do about it, with nothing in any logs and 'systemctl is-system-running' reporting 'running'.

You could easily replicate this: imagine a simple validation test that sets up two FreeIPA nodes, turns on DNSSEC, creates some domains, then adds A, AAAA, and *.arpa records using the ldap2 API on one of the nodes.  Limit the net speed between the nodes to a typical 1 Gb link, with at most 4 processor cores of some older vintage and 5 GB of memory.  It locks up less than 2 minutes after the DNS load starts.

What's really odd is that bind9/named keeps blasting out change notifications for some of the updated domains, then a few lines later, with no intervening activity in any log or by any program affecting the zone, publishes further change notifications with a new serial number for the same zone. This happens for all the zones that get modifications.  I'm thinking 'rr' computations?  I wonder if those entries, being auto-generated internally, are creating a 'flow control' issue between the primary and the replica.
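For context on why each extra internal write produces another round of notifies: every zone write bumps the SOA serial, and secondaries decide whether a zone has changed by RFC 1982 serial-number comparison. A minimal toy illustration of that arithmetic (not bind's implementation):

```python
# Toy RFC 1982 serial-number arithmetic (32-bit). A secondary treats the
# zone as changed whenever the announced serial is "greater" in this
# wrapping sense, so every internally generated write yields another
# notify-and-transfer cycle.
SERIAL_BITS = 32
HALF = 2 ** (SERIAL_BITS - 1)
MOD = 2 ** SERIAL_BITS

def serial_gt(a: int, b: int) -> bool:
    """True if serial a is newer than serial b under RFC 1982."""
    return a != b and ((a - b) % MOD) < HALF

def bump(serial: int) -> int:
    """Increment with 32-bit wraparound, as each zone write would."""
    return (serial + 1) % MOD
```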

This is something the retro changelog is responsible for, as it is the
data store used by the syncrepl protocol implementation. If these
'changes' appear again and again, it means the retro changelog plugin marks
them as new for this particular syncrepl client (bind-dyndb-ldap).

All threads other than thread 30 are normal (idle) ones, but this
one blocks the database backend in the log flush sequence while
writing the retro changelog entry for this updated DNS record:

Thread 30 (Thread 0x7f0e583ff700 (LWP 1438)):
#0  0x00007f0e9bf7d8af in fdatasync () at target:/lib64/libc.so.6
#1  0x00007f0e91cbe6b5 in __os_fsync () at target:/lib64/libdb-5.3.so
#2  0x00007f0e91ca598c in __log_flush_int () at target:/lib64/libdb-5.3.so
#3  0x00007f0e91ca7dd0 in __log_flush () at target:/lib64/libdb-5.3.so
#4  0x00007f0e91ca7f73 in __log_flush_pp () at target:/lib64/libdb-5.3.so
#5  0x00007f0e8afe1304 in bdb_txn_commit (li=<optimized out>, txn=0x7f0e583fd028, use_lock=1) at ldap/servers/slapd/back-ldbm/db-bdb/bdb_layer.c:2772
#6  0x00007f0e8af95515 in dblayer_txn_commit (be=0x7f0e88424f00, txn=<optimized out>) at ldap/servers/slapd/back-ldbm/dblayer.c:736
#7  0x00007f0e8afa7ebe in ldbm_back_add (pb=0x7f0e85748860) at ldap/servers/slapd/back-ldbm/ldbm_add.c:1242
#8  0x00007f0e9d7d7728 in op_shared_add (pb=pb@entry=0x7f0e85748860) at ldap/servers/slapd/add.c:692
#9  0x00007f0e9d7d7bbe in add_internal_pb (pb=pb@entry=0x7f0e85748860) at ldap/servers/slapd/add.c:407
#10 0x00007f0e9d7d8975 in slapi_add_internal_pb (pb=pb@entry=0x7f0e85748860) at ldap/servers/slapd/add.c:331
#11 0x00007f0e8960f8bf in write_replog_db (newsuperior=0x0, modrdn_mods=0x0, newrdn=0x0, post_entry=<optimized out>, log_e=0x7f0e4df5b9c0, curtime=1691511446, flag=0, log_m=0x7f0e5ec0d440, dn=0x7f0e57a40740 "idnsname=8.0.f.0.0.0.0.0.0.0.0.1.0.0.c.f.ip6.arpa.,cn=dns,dc=1,dc=quietfountain,dc=com", optype=<optimized out>, pb=0x7f0e66a09580) at ldap/servers/plugins/retrocl/retrocl_po.c:369
#12 0x00007f0e8960f8bf in retrocl_postob (pb=0x7f0e66a09580, optype=<optimized out>) at ldap/servers/plugins/retrocl/retrocl_po.c:697
#13 0x00007f0e9d83cc79 in plugin_call_func (list=0x7f0e924aae00, operation=operation@entry=561, pb=pb@entry=0x7f0e66a09580, call_one=call_one@entry=0) at ldap/servers/slapd/plugin.c:2032
#14 0x00007f0e9d83cec4 in plugin_call_list (pb=0x7f0e66a09580, operation=561, list=<optimized out>) at ldap/servers/slapd/plugin.c:1973
#15 0x00007f0e9d83cec4 in plugin_call_plugins (pb=pb@entry=0x7f0e66a09580, whichfunction=whichfunction@entry=561) at ldap/servers/slapd/plugin.c:442
#16 0x00007f0e8afc3658 in ldbm_back_modify (pb=<optimized out>) at ldap/servers/slapd/back-ldbm/ldbm_modify.c:1002
#17 0x00007f0e9d828300 in op_shared_modify (pb=pb@entry=0x7f0e66a09580, pw_change=pw_change@entry=0, old_pw=0x0) at ldap/servers/slapd/modify.c:1025
#18 0x00007f0e9d829a00 in do_modify (pb=pb@entry=0x7f0e66a09580) at ldap/servers/slapd/modify.c:380
#19 0x0000564ed703475b in connection_dispatch_operation (pb=0x7f0e66a09580, op=<optimized out>, conn=<optimized out>) at ldap/servers/slapd/connection.c:651
#20 0x0000564ed703475b in connection_threadmain (arg=<optimized out>) at ldap/servers/slapd/connection.c:1803
#21 0x00007f0e9a24b968 in _pt_root () at target:/lib64/libnspr4.so
#22 0x00007f0e99be61ca in start_thread () at target:/lib64/libpthread.so.0
#23 0x00007f0e9be90e73 in clone () at target:/lib64/libc.so.6

Mark, Thierry, any hints here? (For full trace see thread
https://lists.fedorahosted.org/archives/list/freeipa-users@lists.fedorahosted.org/thread/TMRXHCORFU3QRQL6FSZTS4OIHYOAVXWF/)



