Re: [389-users] changelog deadlock replication failures with DNA

2013-06-18 Thread thierry bordaz

On 06/18/2013 12:35 AM, Mahadevan, Venkat wrote:



I do not know why your environment is prone to trigger db_deadlock 
(lot of replica agreements, VM, slow disks...).
I think the best way to progress is that you fill a ticket/bug so that 
we may track the issue. Note this bug is possibly affecting all 
operations (ADD/MOD/MODRDN/DEL)


---

Hi Thierry,

We are running in a VMWare environment, is this known to cause issues 
and are there any tuning steps you can recommend or things


to watch out for? Thanks and sorry for so many questions.

cheers,

VM


Hi Mahadevan,

   Running in a VM should not be an issue if it follows the usual
   tuning recommendation. But of course it adds a layer when
   investigating a RC.
   What is weird is to hit so frequent (50) deadlock in a single DB
   call. In addition it looks like it always happen when updating the
   CL so I can think to only RA threads, CL trimming and Write thread
   (the one holding the backend lock) being in competition. You may try
   to disable the RA for a test to check if they are responsible of
   that. You may also remove the CL maxage to prevent the CL trimming
   thread to be in the loop. Also you may check if you have pb with
   disk IO. Or tune the dbcache to be in memory.
   Sorry I have no easy answer to your question because it requires
   deeper investigations.

best regards
thierry
--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users

Re: [389-users] changelog deadlock replication failures with DNA

2013-06-18 Thread Rich Megginson

On 06/18/2013 01:51 AM, thierry bordaz wrote:

On 06/18/2013 12:35 AM, Mahadevan, Venkat wrote:



I do not know why your environment is prone to trigger db_deadlock 
(lot of replica agreements, VM, slow disks...).
I think the best way to progress is that you fill a ticket/bug so 
that we may track the issue. Note this bug is possibly affecting all 
operations (ADD/MOD/MODRDN/DEL)


---

Hi Thierry,

We are running in a VMWare environment, is this known to cause issues 
and are there any tuning steps you can recommend or things


to watch out for? Thanks and sorry for so many questions.

cheers,

VM


Hi Mahadevan,

Running in a VM should not be an issue if it follows the usual
tuning recommendation. But of course it adds a layer when
investigating a RC.
What is weird is to hit so frequent (50) deadlock in a single DB
call. In addition it looks like it always happen when updating the
CL so I can think to only RA threads, CL trimming and Write thread
(the one holding the backend lock) being in competition.

Search threads also. Searches acquire a read lock on the pages used by 
the entry and the indexes.


You may try to disable the RA for a test to check if they are
responsible of that. You may also remove the CL maxage to prevent
the CL trimming thread to be in the loop. Also you may check if
you have pb with disk IO. Or tune the dbcache to be in memory.
Sorry I have no easy answer to your question because it requires
deeper investigations.

best regards
thierry


--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users


--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users

Re: [389-users] changelog deadlock replication failures with DNA

2013-06-18 Thread Mahadevan, Venkat
Running in a VM should not be an issue if it follows the usual tuning 
recommendation. But of course it adds a layer when investigating a RC.
What is weird is to hit so frequent (50) deadlock in a single DB call. In 
addition it looks like it always happen when updating the CL so I can think to 
only RA threads, CL trimming and Write thread (the one holding the backend 
lock) being in competition.
Search threads also.  Searches acquire a read lock on the pages used by the 
entry and the indexes.

You may try to disable the RA for a test to check if they are responsible of 
that. You may also remove the CL maxage to prevent the CL trimming thread to be 
in the loop. Also you may check if you have pb with disk IO. Or tune the 
dbcache to be in memory.
Sorry I have no easy answer to your question because it requires deeper 
investigations.


Thanks Thierry and Rich for the good suggestions. I will give them both a try. 
In the meantime, the environment is running well
without the DNA plugin, so I will probably go with that for now and assign 
uid/gid numbers manually via a script after the entry
is added to the directory. This seems to be a reasonable workaround right now. 
Thanks again!

cheers,

VM

--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users

Re: [389-users] changelog deadlock replication failures with DNA

2013-06-17 Thread thierry bordaz

Hi Mahadevan,

   I think you hit a new bug that I was able to reproduce.

   The problem is an incorrect handling of operation return code when
   there is an error in SLAPI_PLUGIN_BE_TXN_POST_ADD_FN.
   When the operation succeeds in the DB, DS (as a master) updates the
   CL and RUV in SLAPI_PLUGIN_BE_TXN_POST_ADD_FN phase.
   At this moment the operation is successful. However the update of
   CL/RUV may fail (your case). It correctly aborts the update of the
   CL AND the txn related to the operation in the DB (like if the
   operation never happened)... but it looks like there is a bug where
   it keeps the operation returned code.
   The operation return code should be a failure, but it is successful.

   I reproduced the problem with a debug version where I simulate a
   DB_DEADLOCK failure

   [17/Jun/2013:11:07:21 +0200] - slapd started. Listening on All
   Interfaces port 24502 for LDAP requests
   [17/Jun/2013:11:08:23 +0200] NSMMReplicationPlugin - changelog
   program - _cl5WriteOperationTxn: failed to write entry with csn
   (51bed2070001); db error - -30993 BDB0068
   DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
   [17/Jun/2013:11:08:23 +0200] NSMMReplicationPlugin -
   write_changelog_and_ruv: can't add a change for uid=td,dc=com
   (uniqid: 74589a01-d72d11e2-b4c2bdbf-b0a22af5, optype: 16) to
   changelog csn 51bed2070001


   [tbordaz@pctbordaz userRoot]$ tail -20
   /var/log/dirsrv/slapd-master_cl/access
   ...
   [17/Jun/2013:11:08:23 +0200] conn=2 op=1 ADD dn=uid=td,dc=com
   [17/Jun/2013:11:08:23 +0200] conn=2 op=1 RESULT err=0 tag=105
   nentries=0 etime=0 csn=51bed2070001


   I do not know why your environment is prone to trigger db_deadlock
   (lot of replica agreements, VM, slow disks...).
   I think the best way to progress is that you fill a ticket/bug so
   that we may track the issue. Note this bug is possibly affecting all
   operations (ADD/MOD/MODRDN/DEL)

best regards
thierry
On 06/15/2013 01:03 AM, Mahadevan, Venkat wrote:


 If the operation fails to write into the changelog, the operation 
fails. In your case, it means that the ldapclient should receive an 
error.

 So it is like if the operation never happened and is not replicated.

Hi Thierry,

That’s what I would expect to, but that does not seem to be the case. 
In my access log on the master server below,


the client successfully adds the entry to the master server and 
receives an error code return of 0:


[12/Jun/2013:13:36:59 -0700] conn=19048 op=0 BIND dn=cn=Directory 
Manager method=128 version=3


[12/Jun/2013:13:36:59 -0700] conn=19048 op=0 RESULT err=0 tag=97 
nentries=0 etime=0 dn=cn=directory manager


[12/Jun/2013:13:36:59 -0700] conn=19048 op=1 ADD 
dn=uid=jmeter325,dc=tst,dc=id,dc=ubc,dc=ca


[12/Jun/2013:13:37:04 -0700] conn=19048 op=1 RESULT err=0 tag=105 
nentries=0 etime=5 csn=51b8dbec002b02bd


In the error log at the same time the above happens:

[12/Jun/2013:13:37:02 -0700] NSMMReplicationPlugin - 
replica_replace_ruv_tombstone: failed to update replication update 
vector for replica dc=tst,dc=id,dc=ubc,dc=ca: LDAP error - 51


[12/Jun/2013:13:37:04 -0700] NSMMReplicationPlugin - changelog program 
- _cl5WriteOperationTxn: retry (49) the transaction 
(csn=51b8dbec002b02bd) failed (rc=-30994 (DB_LOCK_DEADLOCK: Locker 
killed to resolve a deadlock))


[12/Jun/2013:13:37:04 -0700] NSMMReplicationPlugin - changelog program 
- _cl5WriteOperationTxn: failed to write entry with csn 
(51b8dbec002b02bd); db error - -30994 DB_LOCK_DEADLOCK: Locker 
killed to resolve a deadlock


[12/Jun/2013:13:37:04 -0700] NSMMReplicationPlugin - 
write_changelog_and_ruv: can't add a change for 
uid=jmeter325,dc=tst,dc=id,dc=ubc,dc=ca (uniqid: 
c0a7827c-d39f11e2-879add5e-44f44922, optype: 16) to changelog csn 
51b8dbec002b02bd


The net result is that the entry is successfully added to the master 
server but will never replication to any of the consumers.


cheers,

VM



--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users

Re: [389-users] changelog deadlock replication failures with DNA

2013-06-17 Thread Mahadevan, Venkat
I do not know why your environment is prone to trigger db_deadlock (lot of 
replica agreements, VM, slow disks...).
I think the best way to progress is that you fill a ticket/bug so that we may 
track the issue. Note this bug is possibly affecting all operations 
(ADD/MOD/MODRDN/DEL)
-
Hi Thierry,

Thanks for the good explanation. I will file a bug report on this.

cheers,

VM
--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users

Re: [389-users] changelog deadlock replication failures with DNA

2013-06-17 Thread Mahadevan, Venkat

I do not know why your environment is prone to trigger db_deadlock (lot of 
replica agreements, VM, slow disks...).
I think the best way to progress is that you fill a ticket/bug so that we may 
track the issue. Note this bug is possibly affecting all operations 
(ADD/MOD/MODRDN/DEL)
---

Hi Thierry,

We are running in a VMWare environment, is this known to cause issues and are 
there any tuning steps you can recommend or things
to watch out for? Thanks and sorry for so many questions.

cheers,

VM
--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users

Re: [389-users] changelog deadlock replication failures with DNA

2013-06-14 Thread Mahadevan, Venkat
 If the operation fails to write into the changelog, the operation fails. In 
 your case, it means that the ldapclient should receive an error.
 So it is like if the operation never happened and is not replicated.

Hi Thierry,

That’s what I would expect to, but that does not seem to be the case. In my 
access log on the master server below,
the client successfully adds the entry to the master server and receives an 
error code return of 0:

[12/Jun/2013:13:36:59 -0700] conn=19048 op=0 BIND dn=cn=Directory Manager 
method=128 version=3
[12/Jun/2013:13:36:59 -0700] conn=19048 op=0 RESULT err=0 tag=97 nentries=0 
etime=0 dn=cn=directory manager
[12/Jun/2013:13:36:59 -0700] conn=19048 op=1 ADD 
dn=uid=jmeter325,dc=tst,dc=id,dc=ubc,dc=ca
[12/Jun/2013:13:37:04 -0700] conn=19048 op=1 RESULT err=0 tag=105 nentries=0 
etime=5 csn=51b8dbec002b02bd

In the error log at the same time the above happens:

[12/Jun/2013:13:37:02 -0700] NSMMReplicationPlugin - 
replica_replace_ruv_tombstone: failed to update replication update vector for 
replica dc=tst,dc=id,dc=ubc,dc=ca: LDAP error - 51
[12/Jun/2013:13:37:04 -0700] NSMMReplicationPlugin - changelog program - 
_cl5WriteOperationTxn: retry (49) the transaction (csn=51b8dbec002b02bd) 
failed (rc=-30994 (DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
[12/Jun/2013:13:37:04 -0700] NSMMReplicationPlugin - changelog program - 
_cl5WriteOperationTxn: failed to write entry with csn (51b8dbec002b02bd); 
db error - -30994 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[12/Jun/2013:13:37:04 -0700] NSMMReplicationPlugin - write_changelog_and_ruv: 
can't add a change for uid=jmeter325,dc=tst,dc=id,dc=ubc,dc=ca (uniqid: 
c0a7827c-d39f11e2-879add5e-44f44922, optype: 16) to changelog csn 
51b8dbec002b02bd

The net result is that the entry is successfully added to the master server but 
will never replication to any of the consumers.

cheers,

VM
--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users