Hello,

   Thanks, Oleg, for the access to the VMs.
   I confirm that you hit https://fedorahosted.org/389/ticket/47788
   I updated this ticket with the details found in your tests.

   Unfortunately, we have no fix for this ticket yet, although it is an
   important one.
   In your test, on the master a user entry (and its related group, etc.)
   was successfully deleted/updated, but on one replica the deletion of
   the user entry was skipped while the related (group) updates were
   applied successfully. I agree that it is a timing problem and should occur rarely.
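   For reference, one way to re-check that inconsistent state from the outside is a small script like the sketch below (using the ldap3 Python module; the hostnames come from your logs, the bind DN and password are assumptions to be adjusted):

# Sketch only: check on each server whether the deleted user entry is still
# there and whether it is still a member of cn=ipausers. Hostnames are taken
# from the logs; the bind DN and password are assumptions.
from ldap3 import Server, Connection, BASE

SUFFIX = "dc=bagam,dc=net"
USER_DN = "uid=onmaster,cn=users,cn=accounts," + SUFFIX
GROUP_DN = "cn=ipausers,cn=groups,cn=accounts," + SUFFIX
HOSTS = ["f22master.bagam.net", "f22replica1.bagam.net",
         "f22replica2.bagam.net", "f22replica3.bagam.net"]

for host in HOSTS:
    conn = Connection(Server(host), "cn=Directory Manager", "Secret123",
                      auto_bind=True)
    user_exists = conn.search(USER_DN, "(objectClass=*)", search_scope=BASE)
    still_member = conn.search(GROUP_DN, "(member=%s)" % USER_DN,
                               search_scope=BASE)
    print(host, "user present:", user_exists, "still in ipausers:", still_member)
    conn.unbind()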

   thanks
   thierry

On 06/17/2015 12:58 PM, Oleg Fayans wrote:
Hi Ludwig,

On 06/17/2015 11:06 AM, Ludwig Krispenz wrote:
Hi Oleg,

can you give a bit more info on the scenarios in which this happens? Does it always happen, or is it a timing problem?
I guess it is a timing problem. It happened yesterday; today I was unable to reproduce it. The scenario is very simple: create user1, make sure it's there, turn off a replica, then create another user on the master and delete user1 on the master, then turn the replica back on. I still have an infrastructure with 2 replicas having a user that was deleted on the master. Now all the user (and other data) manipulations on this very setup work as intended.

Ludwig

On 06/16/2015 07:02 PM, thierry bordaz wrote:
Hello


On Master:
    User 'onmaster' was deleted

[16/Jun/2015:10:16:45 -0400] conn=402 op=19 SRCH base="cn=otp,dc=bagam,dc=net" scope=1 filter="(&(objectClass=ipatoken)(ipatokenOwner=uid=onmaster,cn=users,cn=accounts,dc=bagam,dc=net))" attrs="ipatokenNotAfter description ipatokenOwner objectClass ipatokenDisabled ipatokenVendor managedBy ipatokenModel ipatokenNotBefore ipatokenUniqueID ipatokenSerial"
[16/Jun/2015:10:16:45 -0400] conn=402 op=19 RESULT err=0 tag=101 nentries=0 etime=0
[16/Jun/2015:10:16:45 -0400] conn=402 op=20 DEL dn="uid=onmaster,cn=users,cn=accounts,dc=bagam,dc=net"
[16/Jun/2015:10:16:45 -0400] conn=402 op=21 UNBIND
[16/Jun/2015:10:16:45 -0400] conn=402 op=21 fd=120 closed - U1
[16/Jun/2015:10:16:45 -0400] conn=402 op=20 RESULT err=0 tag=107 nentries=0 etime=0 csn=55802fcf000300040000

    The replication agreement failed to replicate it to replica2
[16/Jun/2015:10:18:36 -0400] NSMMReplicationPlugin - agmt="cn=f22master.bagam.net-to-f22replica2.bagam.net" (f22replica2:389): Consumer failed to replay change (uniqueid b8242e18-143111e5-b1d0d0c3-ae5854ff, CSN 55802fcf000300040000): Operations error (1). Will retry later.
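
If it helps to watch this from the master side without tailing the errors log, the last replay status of each agreement is also exposed under cn=config; a rough sketch (same ldap3 and credential assumptions as above, attribute names as used by 389 DS):

# Sketch: list replication agreements on the master with their consumer host
# and last update status. Bind credentials are assumptions.
from ldap3 import Server, Connection, SUBTREE

conn = Connection(Server("f22master.bagam.net"), "cn=Directory Manager",
                  "Secret123", auto_bind=True)
conn.search("cn=mapping tree,cn=config",
            "(objectClass=nsds5replicationAgreement)",
            search_scope=SUBTREE,
            attributes=["nsDS5ReplicaHost", "nsds5replicaLastUpdateStatus"])
for entry in conn.entries:
    print(entry)   # prints the agreement DN, consumer host and last status
conn.unbind()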


On replica2:

    The replicated operation failed
[16/Jun/2015:10:18:27 -0400] conn=8 op=4 RESULT err=0 tag=101 nentries=1 etime=0
[16/Jun/2015:10:18:27 -0400] conn=8 op=5 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
[16/Jun/2015:10:18:27 -0400] conn=8 op=5 RESULT err=0 tag=120 nentries=0 etime=0
[16/Jun/2015:10:18:27 -0400] conn=8 op=6 DEL dn="uid=onmaster,cn=users,cn=accounts,dc=bagam,dc=net"
[16/Jun/2015:10:18:35 -0400] conn=8 op=6 RESULT err=1 tag=107 nentries=0 etime=8 csn=55802fcf000300040000

    because the database failed to apply the update.
The failures were E_AGAIN or E_DB_DEADLOCK. In such a situation, DS retries after a small delay.
    The problem is that it retried 50 times without success.
[16/Jun/2015:10:18:34 -0400] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=55802fcf000300040000) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
[16/Jun/2015:10:18:34 -0400] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (55802fcf000300040000); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[16/Jun/2015:10:18:34 -0400] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for uid=onmaster,cn=users,cn=accounts,dc=bagam,dc=net (uniqid: b8242e18-143111e5-b1d0d0c3-ae5854ff, optype: 32) to changelog csn 55802fcf000300040000
[16/Jun/2015:10:18:34 -0400] - SLAPI_PLUGIN_BE_TXN_POST_DELETE_FN plugin returned error code but did not set SLAPI_RESULT_CODE
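
For illustration only, the retry behaviour described above boils down to a bounded retry loop around the changelog write; the sketch below is not the actual _cl5WriteOperationTxn code, just the shape of the logic (the limit of 50 comes from the log, the delay value is a guess):

import time

MAX_RETRIES = 50           # matches the "retry (49)" counter in the log above
RETRY_DELAY_SECONDS = 0.1  # "small delay" between attempts; value is a guess

class RetryableDBError(Exception):
    """Stands in for E_AGAIN / DB_LOCK_DEADLOCK (-30993) conditions."""

def write_changelog_entry(write_txn, csn):
    """Illustrative only: retry a deadlock-prone transactional write."""
    for _ in range(MAX_RETRIES):
        try:
            write_txn(csn)          # hypothetical callable doing the DB write
            return True
        except RetryableDBError:
            time.sleep(RETRY_DELAY_SECONDS)
    # After 50 failed attempts the change is not in the changelog and the
    # whole replicated DEL is reported as err=1, as seen in the access log.
    return False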


The MAIN issue here is that replica2 successfully applied other updates from the same replica after 55802fcf000300040000 (e.g. csn=55802fcf000400040000). I do not know whether the master was able to detect this failure and replay this update, but I am afraid it did not!
It looks like you hit https://fedorahosted.org/389/ticket/47788
Is it possible to access your VMs?

[16/Jun/2015:10:18:27 -0400] conn=8 op=6 DEL dn="uid=onmaster,cn=users,cn=accounts,dc=bagam,dc=net"
[16/Jun/2015:10:18:35 -0400] conn=8 op=6 RESULT err=1 tag=107 nentries=0 etime=8 csn=55802fcf000300040000
[16/Jun/2015:10:18:35 -0400] conn=8 op=7 MOD dn="cn=ipausers,cn=groups,cn=accounts,dc=bagam,dc=net"
[16/Jun/2015:10:18:36 -0400] conn=8 op=7 RESULT err=0 tag=103 nentries=0 etime=1 csn=55802fcf000400040000
[16/Jun/2015:10:18:36 -0400] conn=8 op=8 DEL dn="cn=onmaster,cn=groups,cn=accounts,dc=bagam,dc=net"
[16/Jun/2015:10:18:37 -0400] conn=8 op=8 RESULT err=0 tag=107 nentries=0 etime=1 csn=55802fcf000700040000
[16/Jun/2015:10:18:37 -0400] conn=8 op=9 MOD dn="cn=ipausers,cn=groups,cn=accounts,dc=bagam,dc=net"
[16/Jun/2015:10:18:37 -0400] conn=8 op=9 RESULT err=0 tag=103 nentries=0 etime=0 csn=55802fd0000000060000
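
A quick way to spot this pattern (a CSN that failed with err=1 followed by later CSNs applied with err=0 on the same replication connection) is to scan the access log, for example with something like the sketch below. It assumes the "RESULT err=... csn=..." format shown above; the path in the usage comment follows the usual instance layout and should be adjusted to your instance name:

import re

# Sketch: flag CSNs that failed (err != 0) while later CSNs on the same
# connection were applied successfully -- the inconsistency described above.
RESULT_RE = re.compile(r"(conn=\d+) op=\d+ RESULT err=(\d+) .*csn=(\w+)")

def find_skipped_csns(access_log_path):
    failed, applied_after = {}, {}
    with open(access_log_path) as log:
        for line in log:
            m = RESULT_RE.search(line)
            if not m:
                continue
            conn, err, csn = m.group(1), int(m.group(2)), m.group(3)
            if err != 0:
                failed.setdefault(conn, []).append(csn)
            else:
                for bad in failed.get(conn, []):
                    if csn > bad:  # equal-length hex CSNs compare by time/seq/rid
                        applied_after.setdefault(bad, []).append(csn)
    return applied_after  # e.g. {'55802fcf000300040000': ['55802fcf000400040000', ...]}

# find_skipped_csns("/var/log/dirsrv/slapd-BAGAM-NET/access")  # assumed instance name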




On 06/16/2015 04:49 PM, Oleg Fayans wrote:
Hi all,

I've bumped into a strange problem: only part of the changes made on the master during a replica outage gets replicated after the replica recovers.

Namely: when I delete an existing user on the master while the node is offline, this change does not reach the node when it's back online. User creation, however, gets replicated as expected.

Steps to reproduce:

1. Create the following topology:

replica1 <-> master <-> replica2 <-> replica3

2. Create user1 on master, make sure it appears on all replicas
3. Turn off replica2
4. On master delete user1 and create user2, make sure the changes get replicated to replica1
5. Turn on replica2

Expected results:

A minute or so after replica2 is back up,
1. user1 does not exist on either replica2 or replica3
2. user2 exists on both replica2 and replica3

Actual results:
1. user1 coexists with user2 on replica2 and replica3
2. master and replica1 have only user2
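
For what it's worth, a rough automation of the steps above could look like the sketch below. The host names match the ones in the logs, but the admin password, the kinit handling and the use of ipactl over ssh are assumptions about the setup:

# Sketch: automate the reproduction steps (create/delete users on the master,
# stop/start replica2, then check where user1 and user2 survived).
import subprocess, time

MASTER = "f22master.bagam.net"
REPLICA2 = "f22replica2.bagam.net"
ADMIN_PW = "Secret123"   # assumed admin password

def ipa(host, *args):
    # kinit as admin and run an ipa command on the given host over ssh
    cmd = "echo '%s' | kinit admin >/dev/null && ipa %s" % (ADMIN_PW, " ".join(args))
    return subprocess.run(["ssh", host, cmd], capture_output=True, text=True)

ipa(MASTER, "user-add", "user1", "--first=U", "--last=One")   # step 2
time.sleep(60)                                        # let it replicate everywhere
subprocess.run(["ssh", REPLICA2, "ipactl", "stop"])   # step 3: replica2 offline
ipa(MASTER, "user-del", "user1")                      # step 4
ipa(MASTER, "user-add", "user2", "--first=U", "--last=Two")
subprocess.run(["ssh", REPLICA2, "ipactl", "start"])  # step 5: replica2 back
time.sleep(60)                                        # give replication a chance
for host in (MASTER, "f22replica1.bagam.net", REPLICA2, "f22replica3.bagam.net"):
    for user in ("user1", "user2"):
        present = ipa(host, "user-show", user).returncode == 0
        print(host, user, "present" if present else "absent")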


In my case, though, the topology was as follows:
$ ipa topologysegment-find realm
------------------
3 segments matched
------------------
  Segment name: f22master.bagam.net-to-f22replica3.bagam.net
  Left node: f22master.bagam.net
  Right node: f22replica3.bagam.net
  Connectivity: both

  Segment name: replica1-to-replica2
  Left node: f22replica1.bagam.net
  Right node: f22replica2.bagam.net
  Connectivity: both

  Segment name: replica2-to-master
  Left node: f22replica2.bagam.net
  Right node: f22master.bagam.net
  Connectivity: both
----------------------------
Number of entries returned 3
----------------------------
And I was turning off replica2, leaving replica1 offline, but that does not really matter.

The dirsrv error message most likely to be relevant is:
Consumer failed to replay change (uniqueid b8242e18-143111e5-b1d0d0c3-ae5854ff, CSN 55802fcf000300040000): Operations error (1). Will retry later

I attach dirsrv error and access logs from all nodes, in case they are useful.

--
Oleg Fayans
Quality Engineer
FreeIPA team
RedHat.


