On Wed, Oct 19, 2016 at 05:41:37PM +0200, Ludwig Krispenz wrote:
> 
> On 10/19/2016 05:02 PM, Ludwig Krispenz wrote:
> > 
> > On 10/19/2016 03:48 PM, Andrew E. Bruno wrote:
> > > On Wed, Oct 19, 2016 at 10:13:26AM +0200, Ludwig Krispenz wrote:
> > > > On 10/18/2016 08:52 PM, Andrew E. Bruno wrote:
> > > > > We had one of our replicas fail today with the following errors:
> > > > > 
> > > > > 
> > > > > [18/Oct/2016:13:40:47 -0400]
> > > > > agmt="cn=meTosrv-m14-32.cbls.ccr.buffalo.edu"
> > > > > (srv-m14-32:389) - Can't locate CSN 58065ef3000100030000 in
> > > > > the changelog (DB rc=-30988). If replication stops, the
> > > > > consumer may need to be reinitialized.
> > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin -
> > > > > changelog program - _cl5WriteOperationTxn: retry (49) the
> > > > > transaction (csn=58065f74000500040000) failed (rc=-30993
> > > > > (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a
> > > > > deadlock))
> > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin -
> > > > > changelog program - _cl5WriteOperationTxn: failed to write
> > > > > entry with csn (58065f74000500040000); db error - -30993
> > > > > BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a
> > > > > deadlock
> > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin -
> > > > > write_changelog_and_ruv: can't add a change for
> > > > > uid=janedoe,cn=users,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu
> > > > > (uniqid: 939bca48-2ced11e5-ac0b8f7e-e0b1a377, optype: 64) to
> > > > > changelog csn 58065f74000500040000
> > > > > [18/Oct/2016:13:43:07 -0400] -
> > > > > SLAPI_PLUGIN_BE_TXN_POST_MODRDN_FN plugin returned error but
> > > > > did not set SLAPI_RESULT_CODE
> > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin -
> > > > > process_postop: Failed to apply update
> > > > > (58065f74000500040000) error (1).  Aborting replication
> > > > > session(conn=1314106 op=1688559)
> > > > > [18/Oct/2016:13:43:12 -0400] - cos_cache_change_notify:
> > > > > modified entry is NULL--updating cache just in case
> > > > > [18/Oct/2016:13:43:12 -0400] - Skipping CoS Definition
> > > > > cn=Password
> > > > > Policy,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu--no CoS
> > > > > Templates found, which should be added before the CoS
> > > > > Definition.
> > > > > [18/Oct/2016:13:43:20 -0400] - Operation error fetching Null
> > > > > DN (4a729f9a-955a11e6-aaffa516-e778e883), error -30993.
> > > > > [18/Oct/2016:13:43:20 -0400] - dn2entry_ext: Failed to get
> > > > > id for changenumber=30856302,cn=changelog from entryrdn
> > > > > index (-30993)
> > > > > [18/Oct/2016:13:43:20 -0400] - Operation error fetching
> > > > > changenumber=30856302,cn=changelog (null), error -30993.
> > > > > [18/Oct/2016:13:43:20 -0400] DSRetroclPlugin - replog: an
> > > > > error occured while adding change number 30856302, dn =
> > > > > changenumber=30856302,cn=changelog: Operations error.
> > > > > [18/Oct/2016:13:43:20 -0400] retrocl-plugin -
> > > > > retrocl_postob: operation failure [1]
> > > > > [18/Oct/2016:13:43:20 -0400] NSMMReplicationPlugin -
> > > > > process_postop: Failed to apply update
> > > > > (58065f9f000000600000) error (1).  Aborting replication
> > > > > session(conn=1901274 op=5)
> > > > > [18/Oct/2016:13:43:24 -0400] - ldbm_back_seq deadlock retry
> > > > > BAD 1601, err=0 BDB0062 Successful return: 0
> > > > > [18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin -
> > > > > changelog program - _cl5WriteOperationTxn: retry (49) the
> > > > > transaction (csn=58065f7c000a00040000) failed (rc=-30993
> > > > > (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a
> > > > > deadlock))
> > > > > [18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin -
> > > > > changelog program - _cl5WriteOperationTxn: failed to write
> > > > > entry with csn (58065f7c000a00040000); db error - -30993
> > > > > BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a
> > > > > deadlock
> > > > > [18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin -
> > > > > write_changelog_and_ruv: can't add a change for
> > > > > uid=janedoe,cn=users,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu
> > > > > (uniqid: 4080421a-2d0211e5-ac0b8f7e-e0b1a377, optype: 64) to
> > > > > changelog csn 58065f7c000a00040000
> > > > > 
> > > > > 
> > > > > ns-slapd was hung so we restarted and now it's stuck and
> > > > > won't come back up. It
> > > > > hangs up here:
> > > > > 
> > > > > [18/Oct/2016:14:12:31 -0400] - Skipping CoS Definition
> > > > > cn=Password
> > > > > Policy,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu--no CoS
> > > > > Templates found, which should be added before the CoS
> > > > > Definition.
> > > > > [18/Oct/2016:14:12:31 -0400] NSMMReplicationPlugin -
> > > > > changelog program - _cl5NewDBFile: PR_DeleteSemaphore: 
> > > > > /var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/a32992ce-71b811e5-9d33a516-e778e883.sema;
> > > > > NSPR error - -5943
> > > > > [18/Oct/2016:14:12:32 -0400] NSMMReplicationPlugin -
> > > > > changelog program - _cl5NewDBFile: PR_DeleteSemaphore: 
> > > > > /var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/986efe12-71b811e5-9d33a516-e778e883.sema;
> > > > > NSPR error - -5943
> > > > > 
> > > > > 
> > > > > Tried deleting the semaphore files and restarting but no
> > > > > luck. Attached
> > > > > is a stacktrace of the stuck ns-slapd process.
> > > > > 
> > > > > Here's the versions were running:
> > > > > 
> > > > > ipa-server-4.2.0-15.0.1.el7.centos.19.x86_64
> > > > > 389-ds-base-1.3.4.0-33.el7_2.x86_64
> > > > > 
> > > > > FWIW, we were experimenting with the new life-cycle
> > > > > management features,
> > > > > specifically "preserved" users and deleted the user
> > > > > "janedoe" when this
> > > > > happened.  From the errors above looks like this host failed to
> > > > > replicate the change?  Not sure if this is related or not.
> > > > > 
> > > > > Is it possible to recover the database? Thanks in advance
> > > > > for any pointers.
> > > > from the stack trace the process is not hanging, it is trying to
> > > > recover.
> > > > After a crash/kill  the changelog does not contai a RUV and it is
> > > > reconstructed by reading all records in the changelog, if this
> > > > is large it
> > > > can take some time.
> > > > If you look at that part of the stack repeatedly,
> > > > 
> > > > #4  0x00007f4e88daeba5 in cl5DBData2Entry (data=<optimized out>,
> > > > len=<optimized out>, entry=entry@entry=0x7ffff6598910) at
> > > > ldap/servers/plugins/replication/cl5_api.c:2342
> > > >          rc = <optimized out>
> > > >          version = <optimized out>
> > > >          pos = 0x7f4e9839d091 ""
> > > >          strCSN = 0x0
> > > >          op = 0x7ffff6598980
> > > >          add_mods = 0x7f4e983a5e80
> > > >          rawDN = 0x7f4e98396e20 
> > > > "fqdn=cpn-k08-29-02.cbls.ccr.buffalo.edu,cn=computers,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu"
> > > >          s =
> > > > "\300\037>\230N\177\000\000@\210Y\366\377\177\000\000@\210Y\366\377"
> > > > #5  0x00007f4e88daf5d6 in _cl5GetNextEntry
> > > > (entry=entry@entry=0x7ffff6598910, iterator=0x7f4e983a5e80) at
> > > > ldap/servers/plugins/replication/cl5_api.c:5291
> > > >          rc = 0
> > > >          it = 0x7f4e983a5e80
> > > >          key = {data = 0x0, size = 21, ulen = 0, dlen = 0, doff
> > > > = 0, app_data = 0x0, flags = 16}
> > > >          data = {data = 0x7f4e9839cff0, size = 335, ulen = 0,
> > > > dlen = 0, doff = 0, app_data = 0x0, flags = 16}
> > > > #6  0x00007f4e88dafb34 in _cl5ConstructRUV (purge=1,
> > > > obj=0x7f4e983e1fc0, replGen=0x7ffff6598910
> > > > "\200\211Y\366\377\177") at
> > > > ldap/servers/plugins/replication/cl5_api.c:4306
> > > > 
> > > > 
> > > > you should see some progress in which entry is handled
> > > > 
> > > Ludwig, thanks very much for the help. As you pointed out just
> > > needed to let it
> > > finish.  ns-slapd eventually came back up once it finished reading the
> > > changelog. Still seeing some errors related to the
> > > NSMMReplicationPlugin failed
> > > to apply update and from the managed-entries-plugin. Can these safely be
> > > ignored or are they indicative of a more serious problem?
> > > 
> > > [19/Oct/2016:09:28:46 -0400] - Operation error fetching Null DN
> > > (e73b48a4-95ff11e6-8bc7a516-e778e883), error -30993.
> > > [19/Oct/2016:09:28:46 -0400] - dn2entry_ext: Failed to get id for
> > > changenumber=30856335,cn=changelog from entryrdn index (-30993)
> > > [19/Oct/2016:09:28:46 -0400] - Operation error fetching
> > > changenumber=30856335,cn=changelog (null), error -30993.
> > > [19/Oct/2016:09:28:46 -0400] DSRetroclPlugin - replog: an error
> > > occured while adding change number 30856335, dn =
> > > changenumber=30856335,cn=changelog: Operations error.
> > > [19/Oct/2016:09:28:46 -0400] retrocl-plugin - retrocl_postob:
> > > operation failure [1]
> > > [19/Oct/2016:09:28:46 -0400] NSMMReplicationPlugin - process_postop:
> > > Failed to apply update (5806acf7000000600000) error (1).  Aborting
> > > replication session(conn=9 op=28)
> > > [19/Oct/2016:09:28:54 -0400] - Operation error fetching Null DN
> > > (e73b48a7-95ff11e6-8bc7a516-e778e883), error -30993.
> > > [19/Oct/2016:09:28:59 -0400] - dn2entry_ext: Failed to get id for
> > > changenumber=30856337,cn=changelog from entryrdn index (-30993)
> > > [19/Oct/2016:09:29:17 -0400] - Operation error fetching
> > > changenumber=30856337,cn=changelog (null), error -30993.
> > > [19/Oct/2016:09:29:17 -0400] DSRetroclPlugin - replog: an error
> > > occured while adding change number 30856337, dn =
> > > changenumber=30856337,cn=changelog: Operations error.
> > > [19/Oct/2016:09:29:17 -0400] retrocl-plugin - retrocl_postob:
> > > operation failure [1]
> > > [19/Oct/2016:09:29:17 -0400] NSMMReplicationPlugin - process_postop:
> > > Failed to apply update (5806acf7000000600000) error (1).  Aborting
> > > replication session(conn=15 op=5)
> > > [19/Oct/2016:09:29:20 -0400] - Retry count exceeded in delete
> > > [19/Oct/2016:09:29:20 -0400] DSRetroclPlugin - delete_changerecord:
> > > could not delete change record 30712389 (rc: 51)
> > > [19/Oct/2016:09:29:28 -0400] - Operation error fetching Null DN
> > > (0afe8e82-960011e6-8bc7a516-e778e883), error -30993.
> > > [19/Oct/2016:09:29:28 -0400] - dn2entry_ext: Failed to get id for
> > > changenumber=30856351,cn=changelog from entryrdn index (-30993)
> > > [19/Oct/2016:09:29:28 -0400] - Operation error fetching
> > > changenumber=30856351,cn=changelog (null), error -30993.
> > > [19/Oct/2016:09:29:28 -0400] DSRetroclPlugin - replog: an error
> > > occured while adding change number 30856351, dn =
> > > changenumber=30856351,cn=changelog: Operations error.
> > > [19/Oct/2016:09:29:28 -0400] retrocl-plugin - retrocl_postob:
> > > operation failure [1]
> > > [19/Oct/2016:09:29:28 -0400] NSMMReplicationPlugin - process_postop:
> > > Failed to apply update (5806acf7000000600000) error (1).  Aborting
> > > replication session(conn=25 op=5)
> > > [19/Oct/2016:09:29:31 -0400] - Retry count exceeded in delete
> > > [19/Oct/2016:09:29:31 -0400] DSRetroclPlugin - delete_changerecord:
> > > could not delete change record 30712865 (rc: 51)
> > > [19/Oct/2016:09:29:39 -0400] - Operation error fetching Null DN
> > > (0afe8e90-960011e6-8bc7a516-e778e883), error -30993.
> > > [19/Oct/2016:09:29:39 -0400] - dn2entry_ext: Failed to get id for
> > > changenumber=30856364,cn=changelog from entryrdn index (-30993)
> > > [19/Oct/2016:09:29:39 -0400] - Operation error fetching
> > > changenumber=30856364,cn=changelog (null), error -30993.
> > > [19/Oct/2016:09:29:39 -0400] DSRetroclPlugin - replog: an error
> > > occured while adding change number 30856364, dn =
> > > changenumber=30856364,cn=changelog: Operations error.
> > > [19/Oct/2016:09:29:39 -0400] retrocl-plugin - retrocl_postob:
> > > operation failure [1]
> > > [19/Oct/2016:09:29:39 -0400] NSMMReplicationPlugin - process_postop:
> > > Failed to apply update (5806acf7000000600000) error (1).  Aborting
> > > replication session(conn=35 op=5)
> > > [19/Oct/2016:09:29:42 -0400] - Retry count exceeded in delete
> > > [19/Oct/2016:09:29:42 -0400] DSRetroclPlugin - delete_changerecord:
> > > could not delete change record 30713364 (rc: 51)
> > > [19/Oct/2016:09:29:42 -0400] - Operation error fetching Null DN
> > > (0afe8ea0-960011e6-8bc7a516-e778e883), error -30993.
> > > [19/Oct/2016:09:29:42 -0400] - dn2entry_ext: Failed to get id for
> > > changenumber=30856379,cn=changelog from entryrdn index (-30993)
> > > [19/Oct/2016:09:29:42 -0400] - Operation error fetching
> > > changenumber=30856379,cn=changelog (null), error -30993.
> > > [19/Oct/2016:09:29:42 -0400] DSRetroclPlugin - replog: an error
> > > occured while adding change number 30856379, dn =
> > > changenumber=30856379,cn=changelog: Operations error.
> > > [19/Oct/2016:09:29:42 -0400] retrocl-plugin - retrocl_postob:
> > > operation failure [1]
> > > [19/Oct/2016:09:29:42 -0400] NSMMReplicationPlugin - process_postop:
> > > Failed to apply update (5806acf7000000600000) error (1).  Aborting
> > > replication session(conn=36 op=5)
> > > [19/Oct/2016:09:29:50 -0400] - Operation error fetching Null DN
> > > (0afe8ea2-960011e6-8bc7a516-e778e883), error -30993.
> > > [19/Oct/2016:09:29:50 -0400] - dn2entry_ext: Failed to get id for
> > > changenumber=30856380,cn=changelog from entryrdn index (-30993)
> > > [19/Oct/2016:09:29:50 -0400] - Operation error fetching
> > > changenumber=30856380,cn=changelog (null), error -30993.
> > > [19/Oct/2016:09:29:50 -0400] DSRetroclPlugin - replog: an error
> > > occured while adding change number 30856380, dn =
> > > changenumber=30856380,cn=changelog: Operations error.
> > > [19/Oct/2016:09:29:50 -0400] retrocl-plugin - retrocl_postob:
> > > operation failure [1]
> > this doesn't look good. There could be cancelled ops which would be
> > repeated, but the failing repl op is always with the same csn:
> > 5806acf7000000600000
> > so it seems incoming replication is stuck.
> > you could try to find out which entry is affected (grep for the csn in
> > the access log and look at the operation) and what kind of modification
> > it is to check what could be going wrong.


Here's what was in the access logs for that csn:

access.20161018-113116:[19/Oct/2016:09:28:46 -0400] conn=9 op=28 RESULT err=1 
tag=103 nentries=0 etime=3 csn=5806acf7000000600000
access.20161018-113116:[19/Oct/2016:09:29:17 -0400] conn=15 op=5 RESULT err=1 
tag=103 nentries=0 etime=24 csn=5806acf7000000600000
access.20161018-113116:[19/Oct/2016:09:29:28 -0400] conn=25 op=5 RESULT err=1 
tag=103 nentries=0 etime=0 csn=5806acf7000000600000
access.20161018-113116:[19/Oct/2016:09:29:39 -0400] conn=35 op=5 RESULT err=1 
tag=103 nentries=0 etime=0 csn=5806acf7000000600000
access.20161018-113116:[19/Oct/2016:09:29:42 -0400] conn=36 op=5 RESULT err=1 
tag=103 nentries=0 etime=0 csn=5806acf7000000600000
access.20161018-113116:[19/Oct/2016:09:29:50 -0400] conn=37 op=5 RESULT err=1 
tag=103 nentries=0 etime=0 csn=5806acf7000000600000
access.20161018-113116:[19/Oct/2016:09:29:54 -0400] conn=44 op=5 RESULT err=1 
tag=103 nentries=0 etime=1 csn=5806acf7000000600000
access.20161018-113116:[19/Oct/2016:09:29:58 -0400] conn=45 op=5 RESULT err=1 
tag=103 nentries=0 etime=0 csn=5806acf7000000600000
access.20161018-113116:[19/Oct/2016:09:30:06 -0400] conn=46 op=5 RESULT err=1 
tag=103 nentries=0 etime=4 csn=5806acf7000000600000
access.20161018-113116:[19/Oct/2016:09:30:12 -0400] conn=48 op=5 RESULT err=0 
tag=103 nentries=0 etime=2 csn=5806acf7000000600000


Interestingly, right before the first op=28 was the modification of user
janedoe, we were testing deleting/preserving this user:

[19/Oct/2016:09:28:43 -0400] conn=9 op=27 MOD 
dn="cn=MasterCRL,ou=crlIssuingPoints,ou=ca,o=ipaca"
[19/Oct/2016:09:28:43 -0400] conn=10 op=15 RESULT err=0 tag=103 nentries=0 
etime=0 csn=58065f7c000300030000
[19/Oct/2016:09:28:43 -0400] conn=10 op=16 MODRDN 
dn="uid=janedoe,cn=users,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu" 
newrdn="uid=janedoe" newsuperior="cn=deleted 
users,cn=accounts,cn=provisioning,dc=cbls,dc=ccr,dc=buffalo,dc=edu"
[19/Oct/2016:09:28:43 -0400] conn=9 op=27 RESULT err=0 tag=103 nentries=0 
etime=0 csn=5806a973000000600000
[19/Oct/2016:09:28:43 -0400] conn=9 op=28 MOD 
dn="cn=MasterCRL,ou=crlIssuingPoints,ou=ca,o=ipaca"
[19/Oct/2016:09:28:46 -0400] conn=9 op=28 RESULT err=1 tag=103 nentries=0 
etime=3 csn=5806acf7000000600000
[19/Oct/2016:09:28:46 -0400] conn=9 op=-1 fd=88 closed - B4


> the information of what is the change with csn 5806acf7000000600000, it
> should be found in the changelog of the server with the replicaid 96.

Where can I find/query the changelog?

> 
> there is also the possibility that your retro changelog got corrupted. could
> you try to query the retrocl: ldapsearch ..... -b "cn=changelog" dn

Guessing the changelog is too big or I need to increase some limit? 

ldapsearch -Y GSSAPI -b "cn=changelog" dn
# extended LDIF
#
# LDAPv3
# base <cn=changelog> with scope subtree
# filter: (objectclass=*)
# requesting: dn 
#

# search result
search: 4
result: 11 Administrative limit exceeded


> 
> and before rebuilding or reimporting the database it would be worth to try
> to recreate the retro changelog

I'm not seeing anymore errors in the logs. Is there any way to verify if
replication has caught up? 

Thanks again for all the help.



-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project

Reply via email to