Re: [389-users] changelog deadlock replication failures with DNA
On 06/18/2013 12:35 AM, Mahadevan, Venkat wrote: I do not know why your environment is prone to trigger db_deadlock (lot of replica agreements, VM, slow disks...). I think the best way to progress is that you fill a ticket/bug so that we may track the issue. Note this bug is possibly affecting all operations (ADD/MOD/MODRDN/DEL) --- Hi Thierry, We are running in a VMWare environment, is this known to cause issues and are there any tuning steps you can recommend or things to watch out for? Thanks and sorry for so many questions. cheers, VM Hi Mahadevan, Running in a VM should not be an issue if it follows the usual tuning recommendation. But of course it adds a layer when investigating a RC. What is weird is to hit so frequent (50) deadlock in a single DB call. In addition it looks like it always happen when updating the CL so I can think to only RA threads, CL trimming and Write thread (the one holding the backend lock) being in competition. You may try to disable the RA for a test to check if they are responsible of that. You may also remove the CL maxage to prevent the CL trimming thread to be in the loop. Also you may check if you have pb with disk IO. Or tune the dbcache to be in memory. Sorry I have no easy answer to your question because it requires deeper investigations. best regards thierry -- 389 users mailing list 389-users@lists.fedoraproject.org https://admin.fedoraproject.org/mailman/listinfo/389-users
Re: [389-users] changelog deadlock replication failures with DNA
On 06/18/2013 01:51 AM, thierry bordaz wrote: On 06/18/2013 12:35 AM, Mahadevan, Venkat wrote: I do not know why your environment is prone to trigger db_deadlock (lot of replica agreements, VM, slow disks...). I think the best way to progress is that you fill a ticket/bug so that we may track the issue. Note this bug is possibly affecting all operations (ADD/MOD/MODRDN/DEL) --- Hi Thierry, We are running in a VMWare environment, is this known to cause issues and are there any tuning steps you can recommend or things to watch out for? Thanks and sorry for so many questions. cheers, VM Hi Mahadevan, Running in a VM should not be an issue if it follows the usual tuning recommendation. But of course it adds a layer when investigating a RC. What is weird is to hit so frequent (50) deadlock in a single DB call. In addition it looks like it always happen when updating the CL so I can think to only RA threads, CL trimming and Write thread (the one holding the backend lock) being in competition. Search threads also. Searches acquire a read lock on the pages used by the entry and the indexes. You may try to disable the RA for a test to check if they are responsible of that. You may also remove the CL maxage to prevent the CL trimming thread to be in the loop. Also you may check if you have pb with disk IO. Or tune the dbcache to be in memory. Sorry I have no easy answer to your question because it requires deeper investigations. best regards thierry -- 389 users mailing list 389-users@lists.fedoraproject.org https://admin.fedoraproject.org/mailman/listinfo/389-users -- 389 users mailing list 389-users@lists.fedoraproject.org https://admin.fedoraproject.org/mailman/listinfo/389-users
Re: [389-users] changelog deadlock replication failures with DNA
Running in a VM should not be an issue if it follows the usual tuning recommendation. But of course it adds a layer when investigating a RC. What is weird is to hit so frequent (50) deadlock in a single DB call. In addition it looks like it always happen when updating the CL so I can think to only RA threads, CL trimming and Write thread (the one holding the backend lock) being in competition. Search threads also. Searches acquire a read lock on the pages used by the entry and the indexes. You may try to disable the RA for a test to check if they are responsible of that. You may also remove the CL maxage to prevent the CL trimming thread to be in the loop. Also you may check if you have pb with disk IO. Or tune the dbcache to be in memory. Sorry I have no easy answer to your question because it requires deeper investigations. Thanks Thierry and Rich for the good suggestions. I will give them both a try. In the meantime, the environment is running well without the DNA plugin, so I will probably go with that for now and assign uid/gid numbers manually via a script after the entry is added to the directory. This seems to be a reasonable workaround right now. Thanks again! cheers, VM -- 389 users mailing list 389-users@lists.fedoraproject.org https://admin.fedoraproject.org/mailman/listinfo/389-users
Re: [389-users] changelog deadlock replication failures with DNA
Hi Mahadevan, I think you hit a new bug that I was able to reproduce. The problem is an incorrect handling of operation return code when there is an error in SLAPI_PLUGIN_BE_TXN_POST_ADD_FN. When the operation succeeds in the DB, DS (as a master) updates the CL and RUV in SLAPI_PLUGIN_BE_TXN_POST_ADD_FN phase. At this moment the operation is successful. However the update of CL/RUV may fail (your case). It correctly aborts the update of the CL AND the txn related to the operation in the DB (like if the operation never happened)... but it looks like there is a bug where it keeps the operation returned code. The operation return code should be a failure, but it is successful. I reproduced the problem with a debug version where I simulate a DB_DEADLOCK failure [17/Jun/2013:11:07:21 +0200] - slapd started. Listening on All Interfaces port 24502 for LDAP requests [17/Jun/2013:11:08:23 +0200] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (51bed2070001); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock [17/Jun/2013:11:08:23 +0200] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for uid=td,dc=com (uniqid: 74589a01-d72d11e2-b4c2bdbf-b0a22af5, optype: 16) to changelog csn 51bed2070001 [tbordaz@pctbordaz userRoot]$ tail -20 /var/log/dirsrv/slapd-master_cl/access ... [17/Jun/2013:11:08:23 +0200] conn=2 op=1 ADD dn=uid=td,dc=com [17/Jun/2013:11:08:23 +0200] conn=2 op=1 RESULT err=0 tag=105 nentries=0 etime=0 csn=51bed2070001 I do not know why your environment is prone to trigger db_deadlock (lot of replica agreements, VM, slow disks...). I think the best way to progress is that you fill a ticket/bug so that we may track the issue. Note this bug is possibly affecting all operations (ADD/MOD/MODRDN/DEL) best regards thierry On 06/15/2013 01:03 AM, Mahadevan, Venkat wrote: If the operation fails to write into the changelog, the operation fails. In your case, it means that the ldapclient should receive an error. So it is like if the operation never happened and is not replicated. Hi Thierry, That’s what I would expect to, but that does not seem to be the case. In my access log on the master server below, the client successfully adds the entry to the master server and receives an error code return of 0: [12/Jun/2013:13:36:59 -0700] conn=19048 op=0 BIND dn=cn=Directory Manager method=128 version=3 [12/Jun/2013:13:36:59 -0700] conn=19048 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn=cn=directory manager [12/Jun/2013:13:36:59 -0700] conn=19048 op=1 ADD dn=uid=jmeter325,dc=tst,dc=id,dc=ubc,dc=ca [12/Jun/2013:13:37:04 -0700] conn=19048 op=1 RESULT err=0 tag=105 nentries=0 etime=5 csn=51b8dbec002b02bd In the error log at the same time the above happens: [12/Jun/2013:13:37:02 -0700] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication update vector for replica dc=tst,dc=id,dc=ubc,dc=ca: LDAP error - 51 [12/Jun/2013:13:37:04 -0700] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=51b8dbec002b02bd) failed (rc=-30994 (DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock)) [12/Jun/2013:13:37:04 -0700] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (51b8dbec002b02bd); db error - -30994 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock [12/Jun/2013:13:37:04 -0700] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for uid=jmeter325,dc=tst,dc=id,dc=ubc,dc=ca (uniqid: c0a7827c-d39f11e2-879add5e-44f44922, optype: 16) to changelog csn 51b8dbec002b02bd The net result is that the entry is successfully added to the master server but will never replication to any of the consumers. cheers, VM -- 389 users mailing list 389-users@lists.fedoraproject.org https://admin.fedoraproject.org/mailman/listinfo/389-users
Re: [389-users] changelog deadlock replication failures with DNA
I do not know why your environment is prone to trigger db_deadlock (lot of replica agreements, VM, slow disks...). I think the best way to progress is that you fill a ticket/bug so that we may track the issue. Note this bug is possibly affecting all operations (ADD/MOD/MODRDN/DEL) - Hi Thierry, Thanks for the good explanation. I will file a bug report on this. cheers, VM -- 389 users mailing list 389-users@lists.fedoraproject.org https://admin.fedoraproject.org/mailman/listinfo/389-users
Re: [389-users] changelog deadlock replication failures with DNA
I do not know why your environment is prone to trigger db_deadlock (lot of replica agreements, VM, slow disks...). I think the best way to progress is that you fill a ticket/bug so that we may track the issue. Note this bug is possibly affecting all operations (ADD/MOD/MODRDN/DEL) --- Hi Thierry, We are running in a VMWare environment, is this known to cause issues and are there any tuning steps you can recommend or things to watch out for? Thanks and sorry for so many questions. cheers, VM -- 389 users mailing list 389-users@lists.fedoraproject.org https://admin.fedoraproject.org/mailman/listinfo/389-users
Re: [389-users] changelog deadlock replication failures with DNA
If the operation fails to write into the changelog, the operation fails. In your case, it means that the ldapclient should receive an error. So it is like if the operation never happened and is not replicated. Hi Thierry, That’s what I would expect to, but that does not seem to be the case. In my access log on the master server below, the client successfully adds the entry to the master server and receives an error code return of 0: [12/Jun/2013:13:36:59 -0700] conn=19048 op=0 BIND dn=cn=Directory Manager method=128 version=3 [12/Jun/2013:13:36:59 -0700] conn=19048 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn=cn=directory manager [12/Jun/2013:13:36:59 -0700] conn=19048 op=1 ADD dn=uid=jmeter325,dc=tst,dc=id,dc=ubc,dc=ca [12/Jun/2013:13:37:04 -0700] conn=19048 op=1 RESULT err=0 tag=105 nentries=0 etime=5 csn=51b8dbec002b02bd In the error log at the same time the above happens: [12/Jun/2013:13:37:02 -0700] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication update vector for replica dc=tst,dc=id,dc=ubc,dc=ca: LDAP error - 51 [12/Jun/2013:13:37:04 -0700] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=51b8dbec002b02bd) failed (rc=-30994 (DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock)) [12/Jun/2013:13:37:04 -0700] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (51b8dbec002b02bd); db error - -30994 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock [12/Jun/2013:13:37:04 -0700] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for uid=jmeter325,dc=tst,dc=id,dc=ubc,dc=ca (uniqid: c0a7827c-d39f11e2-879add5e-44f44922, optype: 16) to changelog csn 51b8dbec002b02bd The net result is that the entry is successfully added to the master server but will never replication to any of the consumers. cheers, VM -- 389 users mailing list 389-users@lists.fedoraproject.org https://admin.fedoraproject.org/mailman/listinfo/389-users