On Wed, Oct 19, 2016 at 06:59:57PM +0200, thierry bordaz wrote: > > > On 10/19/2016 06:28 PM, Andrew E. Bruno wrote: > > On Wed, Oct 19, 2016 at 05:41:37PM +0200, Ludwig Krispenz wrote: > > > On 10/19/2016 05:02 PM, Ludwig Krispenz wrote: > > > > On 10/19/2016 03:48 PM, Andrew E. Bruno wrote: > > > > > On Wed, Oct 19, 2016 at 10:13:26AM +0200, Ludwig Krispenz wrote: > > > > > > On 10/18/2016 08:52 PM, Andrew E. Bruno wrote: > > > > > > > We had one of our replicas fail today with the following errors: > > > > > > > > > > > > > > > > > > > > > [18/Oct/2016:13:40:47 -0400] > > > > > > > agmt="cn=meTosrv-m14-32.cbls.ccr.buffalo.edu" > > > > > > > (srv-m14-32:389) - Can't locate CSN 58065ef3000100030000 in > > > > > > > the changelog (DB rc=-30988). If replication stops, the > > > > > > > consumer may need to be reinitialized. > > > > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin - > > > > > > > changelog program - _cl5WriteOperationTxn: retry (49) the > > > > > > > transaction (csn=58065f74000500040000) failed (rc=-30993 > > > > > > > (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a > > > > > > > deadlock)) > > > > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin - > > > > > > > changelog program - _cl5WriteOperationTxn: failed to write > > > > > > > entry with csn (58065f74000500040000); db error - -30993 > > > > > > > BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a > > > > > > > deadlock > > > > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin - > > > > > > > write_changelog_and_ruv: can't add a change for > > > > > > > uid=janedoe,cn=users,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu > > > > > > > (uniqid: 939bca48-2ced11e5-ac0b8f7e-e0b1a377, optype: 64) to > > > > > > > changelog csn 58065f74000500040000 > > > > > > > [18/Oct/2016:13:43:07 -0400] - > > > > > > > SLAPI_PLUGIN_BE_TXN_POST_MODRDN_FN plugin returned error but > > > > > > > did not set SLAPI_RESULT_CODE > > > > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin - > > > > > > > process_postop: Failed to apply update > > > > > > > (58065f74000500040000) error (1). Aborting replication > > > > > > > session(conn=1314106 op=1688559) > > > > > > > [18/Oct/2016:13:43:12 -0400] - cos_cache_change_notify: > > > > > > > modified entry is NULL--updating cache just in case > > > > > > > [18/Oct/2016:13:43:12 -0400] - Skipping CoS Definition > > > > > > > cn=Password > > > > > > > Policy,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu--no CoS > > > > > > > Templates found, which should be added before the CoS > > > > > > > Definition. > > > > > > > [18/Oct/2016:13:43:20 -0400] - Operation error fetching Null > > > > > > > DN (4a729f9a-955a11e6-aaffa516-e778e883), error -30993. > > > > > > > [18/Oct/2016:13:43:20 -0400] - dn2entry_ext: Failed to get > > > > > > > id for changenumber=30856302,cn=changelog from entryrdn > > > > > > > index (-30993) > > > > > > > [18/Oct/2016:13:43:20 -0400] - Operation error fetching > > > > > > > changenumber=30856302,cn=changelog (null), error -30993. > > > > > > > [18/Oct/2016:13:43:20 -0400] DSRetroclPlugin - replog: an > > > > > > > error occured while adding change number 30856302, dn = > > > > > > > changenumber=30856302,cn=changelog: Operations error. > > > > > > > [18/Oct/2016:13:43:20 -0400] retrocl-plugin - > > > > > > > retrocl_postob: operation failure [1] > > > > > > > [18/Oct/2016:13:43:20 -0400] NSMMReplicationPlugin - > > > > > > > process_postop: Failed to apply update > > > > > > > (58065f9f000000600000) error (1). Aborting replication > > > > > > > session(conn=1901274 op=5) > > > > > > > [18/Oct/2016:13:43:24 -0400] - ldbm_back_seq deadlock retry > > > > > > > BAD 1601, err=0 BDB0062 Successful return: 0 > > > > > > > [18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin - > > > > > > > changelog program - _cl5WriteOperationTxn: retry (49) the > > > > > > > transaction (csn=58065f7c000a00040000) failed (rc=-30993 > > > > > > > (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a > > > > > > > deadlock)) > > > > > > > [18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin - > > > > > > > changelog program - _cl5WriteOperationTxn: failed to write > > > > > > > entry with csn (58065f7c000a00040000); db error - -30993 > > > > > > > BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a > > > > > > > deadlock > > > > > > > [18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin - > > > > > > > write_changelog_and_ruv: can't add a change for > > > > > > > uid=janedoe,cn=users,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu > > > > > > > (uniqid: 4080421a-2d0211e5-ac0b8f7e-e0b1a377, optype: 64) to > > > > > > > changelog csn 58065f7c000a00040000 > > > > > > > > > > > > > > > > > > > > > ns-slapd was hung so we restarted and now it's stuck and > > > > > > > won't come back up. It > > > > > > > hangs up here: > > > > > > > > > > > > > > [18/Oct/2016:14:12:31 -0400] - Skipping CoS Definition > > > > > > > cn=Password > > > > > > > Policy,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu--no CoS > > > > > > > Templates found, which should be added before the CoS > > > > > > > Definition. > > > > > > > [18/Oct/2016:14:12:31 -0400] NSMMReplicationPlugin - > > > > > > > changelog program - _cl5NewDBFile: PR_DeleteSemaphore: > > > > > > > /var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/a32992ce-71b811e5-9d33a516-e778e883.sema; > > > > > > > NSPR error - -5943 > > > > > > > [18/Oct/2016:14:12:32 -0400] NSMMReplicationPlugin - > > > > > > > changelog program - _cl5NewDBFile: PR_DeleteSemaphore: > > > > > > > /var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/986efe12-71b811e5-9d33a516-e778e883.sema; > > > > > > > NSPR error - -5943 > > > > > > > > > > > > > > > > > > > > > Tried deleting the semaphore files and restarting but no > > > > > > > luck. Attached > > > > > > > is a stacktrace of the stuck ns-slapd process. > > > > > > > > > > > > > > Here's the versions were running: > > > > > > > > > > > > > > ipa-server-4.2.0-15.0.1.el7.centos.19.x86_64 > > > > > > > 389-ds-base-1.3.4.0-33.el7_2.x86_64 > > > > > > > > > > > > > > FWIW, we were experimenting with the new life-cycle > > > > > > > management features, > > > > > > > specifically "preserved" users and deleted the user > > > > > > > "janedoe" when this > > > > > > > happened. From the errors above looks like this host failed to > > > > > > > replicate the change? Not sure if this is related or not. > > > > > > > > > > > > > > Is it possible to recover the database? Thanks in advance > > > > > > > for any pointers. > > > > > > from the stack trace the process is not hanging, it is trying to > > > > > > recover. > > > > > > After a crash/kill the changelog does not contai a RUV and it is > > > > > > reconstructed by reading all records in the changelog, if this > > > > > > is large it > > > > > > can take some time. > > > > > > If you look at that part of the stack repeatedly, > > > > > > > > > > > > #4 0x00007f4e88daeba5 in cl5DBData2Entry (data=<optimized out>, > > > > > > len=<optimized out>, entry=entry@entry=0x7ffff6598910) at > > > > > > ldap/servers/plugins/replication/cl5_api.c:2342 > > > > > > rc = <optimized out> > > > > > > version = <optimized out> > > > > > > pos = 0x7f4e9839d091 "" > > > > > > strCSN = 0x0 > > > > > > op = 0x7ffff6598980 > > > > > > add_mods = 0x7f4e983a5e80 > > > > > > rawDN = 0x7f4e98396e20 > > > > > > "fqdn=cpn-k08-29-02.cbls.ccr.buffalo.edu,cn=computers,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu" > > > > > > s = > > > > > > "\300\037>\230N\177\000\000@\210Y\366\377\177\000\000@\210Y\366\377" > > > > > > #5 0x00007f4e88daf5d6 in _cl5GetNextEntry > > > > > > (entry=entry@entry=0x7ffff6598910, iterator=0x7f4e983a5e80) at > > > > > > ldap/servers/plugins/replication/cl5_api.c:5291 > > > > > > rc = 0 > > > > > > it = 0x7f4e983a5e80 > > > > > > key = {data = 0x0, size = 21, ulen = 0, dlen = 0, doff > > > > > > = 0, app_data = 0x0, flags = 16} > > > > > > data = {data = 0x7f4e9839cff0, size = 335, ulen = 0, > > > > > > dlen = 0, doff = 0, app_data = 0x0, flags = 16} > > > > > > #6 0x00007f4e88dafb34 in _cl5ConstructRUV (purge=1, > > > > > > obj=0x7f4e983e1fc0, replGen=0x7ffff6598910 > > > > > > "\200\211Y\366\377\177") at > > > > > > ldap/servers/plugins/replication/cl5_api.c:4306 > > > > > > > > > > > > > > > > > > you should see some progress in which entry is handled > > > > > > > > > > > Ludwig, thanks very much for the help. As you pointed out just > > > > > needed to let it > > > > > finish. ns-slapd eventually came back up once it finished reading the > > > > > changelog. Still seeing some errors related to the > > > > > NSMMReplicationPlugin failed > > > > > to apply update and from the managed-entries-plugin. Can these safely > > > > > be > > > > > ignored or are they indicative of a more serious problem? > > > > > > > > > > [19/Oct/2016:09:28:46 -0400] - Operation error fetching Null DN > > > > > (e73b48a4-95ff11e6-8bc7a516-e778e883), error -30993. > > > > > [19/Oct/2016:09:28:46 -0400] - dn2entry_ext: Failed to get id for > > > > > changenumber=30856335,cn=changelog from entryrdn index (-30993) > > > > > [19/Oct/2016:09:28:46 -0400] - Operation error fetching > > > > > changenumber=30856335,cn=changelog (null), error -30993. > > > > > [19/Oct/2016:09:28:46 -0400] DSRetroclPlugin - replog: an error > > > > > occured while adding change number 30856335, dn = > > > > > changenumber=30856335,cn=changelog: Operations error. > > > > > [19/Oct/2016:09:28:46 -0400] retrocl-plugin - retrocl_postob: > > > > > operation failure [1] > > > > > [19/Oct/2016:09:28:46 -0400] NSMMReplicationPlugin - process_postop: > > > > > Failed to apply update (5806acf7000000600000) error (1). Aborting > > > > > replication session(conn=9 op=28) > > > > > [19/Oct/2016:09:28:54 -0400] - Operation error fetching Null DN > > > > > (e73b48a7-95ff11e6-8bc7a516-e778e883), error -30993. > > > > > [19/Oct/2016:09:28:59 -0400] - dn2entry_ext: Failed to get id for > > > > > changenumber=30856337,cn=changelog from entryrdn index (-30993) > > > > > [19/Oct/2016:09:29:17 -0400] - Operation error fetching > > > > > changenumber=30856337,cn=changelog (null), error -30993. > > > > > [19/Oct/2016:09:29:17 -0400] DSRetroclPlugin - replog: an error > > > > > occured while adding change number 30856337, dn = > > > > > changenumber=30856337,cn=changelog: Operations error. > > > > > [19/Oct/2016:09:29:17 -0400] retrocl-plugin - retrocl_postob: > > > > > operation failure [1] > > > > > [19/Oct/2016:09:29:17 -0400] NSMMReplicationPlugin - process_postop: > > > > > Failed to apply update (5806acf7000000600000) error (1). Aborting > > > > > replication session(conn=15 op=5) > > > > > [19/Oct/2016:09:29:20 -0400] - Retry count exceeded in delete > > > > > [19/Oct/2016:09:29:20 -0400] DSRetroclPlugin - delete_changerecord: > > > > > could not delete change record 30712389 (rc: 51) > > > > > [19/Oct/2016:09:29:28 -0400] - Operation error fetching Null DN > > > > > (0afe8e82-960011e6-8bc7a516-e778e883), error -30993. > > > > > [19/Oct/2016:09:29:28 -0400] - dn2entry_ext: Failed to get id for > > > > > changenumber=30856351,cn=changelog from entryrdn index (-30993) > > > > > [19/Oct/2016:09:29:28 -0400] - Operation error fetching > > > > > changenumber=30856351,cn=changelog (null), error -30993. > > > > > [19/Oct/2016:09:29:28 -0400] DSRetroclPlugin - replog: an error > > > > > occured while adding change number 30856351, dn = > > > > > changenumber=30856351,cn=changelog: Operations error. > > > > > [19/Oct/2016:09:29:28 -0400] retrocl-plugin - retrocl_postob: > > > > > operation failure [1] > > > > > [19/Oct/2016:09:29:28 -0400] NSMMReplicationPlugin - process_postop: > > > > > Failed to apply update (5806acf7000000600000) error (1). Aborting > > > > > replication session(conn=25 op=5) > > > > > [19/Oct/2016:09:29:31 -0400] - Retry count exceeded in delete > > > > > [19/Oct/2016:09:29:31 -0400] DSRetroclPlugin - delete_changerecord: > > > > > could not delete change record 30712865 (rc: 51) > > > > > [19/Oct/2016:09:29:39 -0400] - Operation error fetching Null DN > > > > > (0afe8e90-960011e6-8bc7a516-e778e883), error -30993. > > > > > [19/Oct/2016:09:29:39 -0400] - dn2entry_ext: Failed to get id for > > > > > changenumber=30856364,cn=changelog from entryrdn index (-30993) > > > > > [19/Oct/2016:09:29:39 -0400] - Operation error fetching > > > > > changenumber=30856364,cn=changelog (null), error -30993. > > > > > [19/Oct/2016:09:29:39 -0400] DSRetroclPlugin - replog: an error > > > > > occured while adding change number 30856364, dn = > > > > > changenumber=30856364,cn=changelog: Operations error. > > > > > [19/Oct/2016:09:29:39 -0400] retrocl-plugin - retrocl_postob: > > > > > operation failure [1] > > > > > [19/Oct/2016:09:29:39 -0400] NSMMReplicationPlugin - process_postop: > > > > > Failed to apply update (5806acf7000000600000) error (1). Aborting > > > > > replication session(conn=35 op=5) > > > > > [19/Oct/2016:09:29:42 -0400] - Retry count exceeded in delete > > > > > [19/Oct/2016:09:29:42 -0400] DSRetroclPlugin - delete_changerecord: > > > > > could not delete change record 30713364 (rc: 51) > > > > > [19/Oct/2016:09:29:42 -0400] - Operation error fetching Null DN > > > > > (0afe8ea0-960011e6-8bc7a516-e778e883), error -30993. > > > > > [19/Oct/2016:09:29:42 -0400] - dn2entry_ext: Failed to get id for > > > > > changenumber=30856379,cn=changelog from entryrdn index (-30993) > > > > > [19/Oct/2016:09:29:42 -0400] - Operation error fetching > > > > > changenumber=30856379,cn=changelog (null), error -30993. > > > > > [19/Oct/2016:09:29:42 -0400] DSRetroclPlugin - replog: an error > > > > > occured while adding change number 30856379, dn = > > > > > changenumber=30856379,cn=changelog: Operations error. > > > > > [19/Oct/2016:09:29:42 -0400] retrocl-plugin - retrocl_postob: > > > > > operation failure [1] > > > > > [19/Oct/2016:09:29:42 -0400] NSMMReplicationPlugin - process_postop: > > > > > Failed to apply update (5806acf7000000600000) error (1). Aborting > > > > > replication session(conn=36 op=5) > > > > > [19/Oct/2016:09:29:50 -0400] - Operation error fetching Null DN > > > > > (0afe8ea2-960011e6-8bc7a516-e778e883), error -30993. > > > > > [19/Oct/2016:09:29:50 -0400] - dn2entry_ext: Failed to get id for > > > > > changenumber=30856380,cn=changelog from entryrdn index (-30993) > > > > > [19/Oct/2016:09:29:50 -0400] - Operation error fetching > > > > > changenumber=30856380,cn=changelog (null), error -30993. > > > > > [19/Oct/2016:09:29:50 -0400] DSRetroclPlugin - replog: an error > > > > > occured while adding change number 30856380, dn = > > > > > changenumber=30856380,cn=changelog: Operations error. > > > > > [19/Oct/2016:09:29:50 -0400] retrocl-plugin - retrocl_postob: > > > > > operation failure [1] > > > > this doesn't look good. There could be cancelled ops which would be > > > > repeated, but the failing repl op is always with the same csn: > > > > 5806acf7000000600000 > > > > so it seems incoming replication is stuck. > > > > you could try to find out which entry is affected (grep for the csn in > > > > the access log and look at the operation) and what kind of modification > > > > it is to check what could be going wrong. > > > > Here's what was in the access logs for that csn: > > > > access.20161018-113116:[19/Oct/2016:09:28:46 -0400] conn=9 op=28 RESULT > > err=1 tag=103 nentries=0 etime=3 csn=5806acf7000000600000 > > access.20161018-113116:[19/Oct/2016:09:29:17 -0400] conn=15 op=5 RESULT > > err=1 tag=103 nentries=0 etime=24 csn=5806acf7000000600000 > > access.20161018-113116:[19/Oct/2016:09:29:28 -0400] conn=25 op=5 RESULT > > err=1 tag=103 nentries=0 etime=0 csn=5806acf7000000600000 > > access.20161018-113116:[19/Oct/2016:09:29:39 -0400] conn=35 op=5 RESULT > > err=1 tag=103 nentries=0 etime=0 csn=5806acf7000000600000 > > access.20161018-113116:[19/Oct/2016:09:29:42 -0400] conn=36 op=5 RESULT > > err=1 tag=103 nentries=0 etime=0 csn=5806acf7000000600000 > > access.20161018-113116:[19/Oct/2016:09:29:50 -0400] conn=37 op=5 RESULT > > err=1 tag=103 nentries=0 etime=0 csn=5806acf7000000600000 > > access.20161018-113116:[19/Oct/2016:09:29:54 -0400] conn=44 op=5 RESULT > > err=1 tag=103 nentries=0 etime=1 csn=5806acf7000000600000 > > access.20161018-113116:[19/Oct/2016:09:29:58 -0400] conn=45 op=5 RESULT > > err=1 tag=103 nentries=0 etime=0 csn=5806acf7000000600000 > > access.20161018-113116:[19/Oct/2016:09:30:06 -0400] conn=46 op=5 RESULT > > err=1 tag=103 nentries=0 etime=4 csn=5806acf7000000600000 > > access.20161018-113116:[19/Oct/2016:09:30:12 -0400] conn=48 op=5 RESULT > > err=0 tag=103 nentries=0 etime=2 csn=5806acf7000000600000 > > > > > > Interestingly, right before the first op=28 was the modification of user > > janedoe, we were testing deleting/preserving this user: > > > > [19/Oct/2016:09:28:43 -0400] conn=9 op=27 MOD > > dn="cn=MasterCRL,ou=crlIssuingPoints,ou=ca,o=ipaca" > > [19/Oct/2016:09:28:43 -0400] conn=10 op=15 RESULT err=0 tag=103 nentries=0 > > etime=0 csn=58065f7c000300030000 > > [19/Oct/2016:09:28:43 -0400] conn=10 op=16 MODRDN > > dn="uid=janedoe,cn=users,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu" > > newrdn="uid=janedoe" newsuperior="cn=deleted > > users,cn=accounts,cn=provisioning,dc=cbls,dc=ccr,dc=buffalo,dc=edu" > > [19/Oct/2016:09:28:43 -0400] conn=9 op=27 RESULT err=0 tag=103 nentries=0 > > etime=0 csn=5806a973000000600000 > > [19/Oct/2016:09:28:43 -0400] conn=9 op=28 MOD > > dn="cn=MasterCRL,ou=crlIssuingPoints,ou=ca,o=ipaca" > > [19/Oct/2016:09:28:46 -0400] conn=9 op=28 RESULT err=1 tag=103 nentries=0 > > etime=3 csn=5806acf7000000600000 > > [19/Oct/2016:09:28:46 -0400] conn=9 op=-1 fd=88 closed - B4 > Note that janedoe operation and 5806acf7000000600000 operation were on > different suffixes but are sharing the retroCL. > According to > > [19/Oct/2016:09:30:06 -0400] managed-entries-plugin - mep_mod_post_op: Unable > to find config for origin entry "uid=janedoe,cn=deleted > users,cn=accounts,cn=provisioning,dc=cbls,dc=ccr,dc=buffalo,dc=edu". > > We may think that operation 'conn=10 op=16' completed (but likely failed) on > 9:30:06 and only then 5806acf7000000600000 > was successfully replicated. > > There may be interaction between both operation, 'janedoe' that was to fail > but preventing other updates to complete. > > It would be interesting to get full logs (access/errors): > 19/Oct/2016:09:28:43 -> 19/Oct/2016:09:30:20 > and also the dump of 'janedoe' entry.
I'd be happy to provide more logs. Any chance I can send them off list? > > > > > > > the information of what is the change with csn 5806acf7000000600000, it > > > should be found in the changelog of the server with the replicaid 96. > > Where can I find/query the changelog? > > You may dump it with: dbscan -f > /var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/xxx.db > > > > > there is also the possibility that your retro changelog got corrupted. > > > could > > > you try to query the retrocl: ldapsearch ..... -b "cn=changelog" dn > > Guessing the changelog is too big or I need to increase some limit? > > > > ldapsearch -Y GSSAPI -b "cn=changelog" dn > > # extended LDIF > > # > > # LDAPv3 > > # base <cn=changelog> with scope subtree > > # filter: (objectclass=*) > > # requesting: dn > > # > > > > # search result > > search: 4 > > result: 11 Administrative limit exceeded > > Can you do the same operation being 'cn=directory manager' ? > > > > > > > and before rebuilding or reimporting the database it would be worth to try > > > to recreate the retro changelog > > I'm not seeing anymore errors in the logs. Is there any way to verify if > > replication has caught up? > > > > Thanks again for all the help. > > > > > > > -- Manage your subscription for the Freeipa-users mailing list: https://www.redhat.com/mailman/listinfo/freeipa-users Go to http://freeipa.org for more info on the project