On 09/06/2012 06:09 PM, Martin Kosek wrote:
> On 09/06/2012 06:05 PM, Martin Kosek wrote:
>> On 09/06/2012 05:55 PM, Rob Crittenden wrote:
>>> Rob Crittenden wrote:
>>>> Rob Crittenden wrote:
>>>>> Martin Kosek wrote:
>>>>>> On 09/05/2012 08:06 PM, Rob Crittenden wrote:
>>>>>>> Rob Crittenden wrote:
>>>>>>>> Martin Kosek wrote:
>>>>>>>>> On 07/05/2012 08:39 PM, Rob Crittenden wrote:
>>>>>>>>>> Martin Kosek wrote:
>>>>>>>>>>> On 07/03/2012 04:41 PM, Rob Crittenden wrote:
>>>>>>>>>>>> Deleting a replica can leave a replica update vector (RUV) on the
>>>>>>>>>>>> other servers. This can confuse things if the replica is re-added,
>>>>>>>>>>>> and it also causes the server to calculate changes against a
>>>>>>>>>>>> server that may no longer exist.
>>>>>>>>>>>>
>>>>>>>>>>>> 389-ds-base provides a new task that propagates itself to all
>>>>>>>>>>>> available replicas to clean this RUV data.
>>>>>>>>>>>>
>>>>>>>>>>>> This patch will create this task at deletion time to hopefully
>>>>>>>>>>>> clean things up.
>>>>>>>>>>>>
>>>>>>>>>>>> It isn't perfect. If any replica is down or unavailable at the
>>>>>>>>>>>> time the cleanruv task fires, and then comes back up, the old RUV
>>>>>>>>>>>> data may be re-propagated around.
>>>>>>>>>>>>
>>>>>>>>>>>> To make things easier in this case I've added two new commands to
>>>>>>>>>>>> ipa-replica-manage. The first lists the replication ids of all the
>>>>>>>>>>>> servers we have a RUV for. Using this you can call clean_ruv with
>>>>>>>>>>>> the replication id of a server that no longer exists to try the
>>>>>>>>>>>> cleanallruv step again.
>>>>>>>>>>>>
>>>>>>>>>>>> This is quite dangerous though. If you run cleanruv against a
>>>>>>>>>>>> replica id that does exist it can cause a loss of data. I believe
>>>>>>>>>>>> I've put in enough scary warnings about this.
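[Editor's note: the CLEANALLRUV mechanism described above is driven by adding a task entry under cn=tasks,cn=config. A minimal sketch of how such an entry could be built, assuming the attribute names used by the 389-ds-base cleanallruv task; the helper name and suffix DN are illustrative, not taken from the patch.]

```python
def make_cleanallruv_entry(replica_id, suffix_dn):
    """Build the (dn, attrs) pair for a CLEANALLRUV task entry.

    389-ds-base watches cn=cleanallruv,cn=tasks,cn=config and
    propagates the cleanup to every reachable replica on its own.
    """
    # int() up front: a rid read from the RUV over LDAP is a string,
    # and formatting it with %d later would raise a TypeError.
    rid = int(replica_id)
    cn = "clean %d" % rid
    dn = "cn=%s,cn=cleanallruv,cn=tasks,cn=config" % cn
    attrs = {
        "objectclass": ["top", "extensibleObject"],
        "cn": [cn],
        "replica-base-dn": [suffix_dn],
        "replica-id": [str(rid)],
    }
    return dn, attrs
```

The entry would then be added over an authenticated connection (e.g. `conn.addEntry` in ipaldap); an "entry already exists" error at that point means a task for the same rid is already running, a case discussed later in this thread.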
>>>>>>>>>>>>
>>>>>>>>>>>> rob
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Good work there, this should make cleaning RUVs much easier than
>>>>>>>>>>> with the previous version.
>>>>>>>>>>>
>>>>>>>>>>> This is what I found during review:
>>>>>>>>>>>
>>>>>>>>>>> 1) The list_ruv and clean_ruv command help is quite lost in the
>>>>>>>>>>> man page. I think it would help if, for example, we indented all
>>>>>>>>>>> the info for the commands. Otherwise a user could easily overlook
>>>>>>>>>>> the new commands in the man page.
>>>>>>>>>>>
>>>>>>>>>>> 2) I would rename the new commands to clean-ruv and list-ruv to
>>>>>>>>>>> make them consistent with the rest of the commands (re-initialize,
>>>>>>>>>>> force-sync).
>>>>>>>>>>>
>>>>>>>>>>> 3) It would be nice to be able to run the clean_ruv command in an
>>>>>>>>>>> unattended way (for better testing), i.e. respect the --force
>>>>>>>>>>> option as we already do for ipa-replica-manage del. This fix would
>>>>>>>>>>> aid test automation in the future.
>>>>>>>>>>>
>>>>>>>>>>> 4) (minor) The new question (and the del one too) does not react
>>>>>>>>>>> too well to CTRL+D:
>>>>>>>>>>>
>>>>>>>>>>> # ipa-replica-manage clean_ruv 3 --force
>>>>>>>>>>> Clean the Replication Update Vector for
>>>>>>>>>>> vm-055.idm.lab.bos.redhat.com:389
>>>>>>>>>>>
>>>>>>>>>>> Cleaning the wrong replica ID will cause that server to no
>>>>>>>>>>> longer replicate so it may miss updates while the process
>>>>>>>>>>> is running. It would need to be re-initialized to maintain
>>>>>>>>>>> consistency. Be very careful.
>>>>>>>>>>> Continue to clean?
[no]: unexpected error:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 5) Help for the clean_ruv command without the required parameter
>>>>>>>>>>> is quite confusing, as it reports that the command is wrong rather
>>>>>>>>>>> than the parameter:
>>>>>>>>>>>
>>>>>>>>>>> # ipa-replica-manage clean_ruv
>>>>>>>>>>> Usage: ipa-replica-manage [options]
>>>>>>>>>>>
>>>>>>>>>>> ipa-replica-manage: error: must provide a command [clean_ruv |
>>>>>>>>>>> force-sync | disconnect | connect | del | re-initialize | list |
>>>>>>>>>>> list_ruv]
>>>>>>>>>>>
>>>>>>>>>>> It seems you just forgot to specify the error message in the
>>>>>>>>>>> command definition.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 6) When the remote replica is down, the clean_ruv command fails
>>>>>>>>>>> with an unexpected error:
>>>>>>>>>>>
>>>>>>>>>>> [root@vm-086 ~]# ipa-replica-manage clean_ruv 5
>>>>>>>>>>> Clean the Replication Update Vector for
>>>>>>>>>>> vm-055.idm.lab.bos.redhat.com:389
>>>>>>>>>>>
>>>>>>>>>>> Cleaning the wrong replica ID will cause that server to no
>>>>>>>>>>> longer replicate so it may miss updates while the process
>>>>>>>>>>> is running. It would need to be re-initialized to maintain
>>>>>>>>>>> consistency. Be very careful.
>>>>>>>>>>> Continue to clean? [no]: y
>>>>>>>>>>> unexpected error: {'desc': 'Operations error'}
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> /var/log/dirsrv/slapd-IDM-LAB-BOS-REDHAT-COM/errors:
>>>>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin -
>>>>>>>>>>> cleanAllRUV_task: failed to connect to repl agreement connection
>>>>>>>>>>> (cn=meTovm-055.idm.lab.bos.redhat.com,cn=replica,
>>>>>>>>>>> cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,cn=mapping
>>>>>>>>>>> tree,cn=config), error 105
>>>>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin -
>>>>>>>>>>> cleanAllRUV_task: replica (cn=meTovm-055.idm.lab.
>>>>>>>>>>> bos.redhat.com,cn=replica,cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,cn=mapping
>>>>>>>>>>> tree, cn=config) has not been cleaned. You will need to rerun the
>>>>>>>>>>> CLEANALLRUV task on this replica.
>>>>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin -
>>>>>>>>>>> cleanAllRUV_task: Task failed (1)
>>>>>>>>>>>
>>>>>>>>>>> In this case I think we should inform the user that the command
>>>>>>>>>>> failed, possibly because of disconnected replicas, and that they
>>>>>>>>>>> could enable the replicas and try again.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 7) (minor) "pass" is now redundant in replication.py:
>>>>>>>>>>> +        except ldap.INSUFFICIENT_ACCESS:
>>>>>>>>>>> +            # We can't make the server we're removing read-only but
>>>>>>>>>>> +            # this isn't a show-stopper
>>>>>>>>>>> +            root_logger.debug("No permission to switch replica to read-only, continuing anyway")
>>>>>>>>>>> +            pass
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I think this addresses everything.
>>>>>>>>>>
>>>>>>>>>> rob
>>>>>>>>>
>>>>>>>>> Thanks, almost there!
I just found one more issue which needs to be fixed before we push:
>>>>>>>>>
>>>>>>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com --force
>>>>>>>>> Directory Manager password:
>>>>>>>>>
>>>>>>>>> Unable to connect to replica vm-055.idm.lab.bos.redhat.com, forcing
>>>>>>>>> removal
>>>>>>>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc':
>>>>>>>>> "Can't contact LDAP server"}
>>>>>>>>> Forcing removal on 'vm-086.idm.lab.bos.redhat.com'
>>>>>>>>>
>>>>>>>>> There were issues removing a connection: %d format: a number is
>>>>>>>>> required, not str
>>>>>>>>>
>>>>>>>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc':
>>>>>>>>> "Can't contact LDAP server"}
>>>>>>>>>
>>>>>>>>> This is the traceback I retrieved:
>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>   File "/sbin/ipa-replica-manage", line 425, in del_master
>>>>>>>>>     del_link(realm, r, hostname, options.dirman_passwd, force=True)
>>>>>>>>>   File "/sbin/ipa-replica-manage", line 271, in del_link
>>>>>>>>>     repl1.cleanallruv(replica_id)
>>>>>>>>>   File "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py", line 1094, in cleanallruv
>>>>>>>>>     root_logger.debug("Creating CLEANALLRUV task for replica id %d" % replicaId)
>>>>>>>>>
>>>>>>>>> The problem here is that you don't convert replica_id to int in
>>>>>>>>> this part:
>>>>>>>>> +    replica_id = None
>>>>>>>>> +    if repl2:
>>>>>>>>> +        replica_id = repl2._get_replica_id(repl2.conn, None)
>>>>>>>>> +    else:
>>>>>>>>> +        servers = get_ruv(realm, replica1, dirman_passwd)
>>>>>>>>> +        for (netloc, rid) in servers:
>>>>>>>>> +            if netloc.startswith(replica2):
>>>>>>>>> +                replica_id = rid
>>>>>>>>> +                break
>>>>>>>>>
>>>>>>>>> Martin
>>>>>>>>>
>>>>>>>>
>>>>>>>> Updated patch using new mechanism in 389-ds-base.
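[Editor's note: the "%d format: a number is required, not str" failure above comes from handing the rid straight out of the RUV, where python-ldap returns it as a string, to cleanallruv. A sketch of the lookup with the conversion in place; the function name is illustrative, and get_ruv's return shape of (netloc, rid) pairs is assumed from the snippet quoted above.]

```python
def find_replica_id(servers, replica2):
    """Return the replica id for host ``replica2`` as an int, or None.

    ``servers`` is a list of (netloc, rid) pairs parsed from the RUV.
    The rid arrives as a string, so it must be converted before it
    reaches 'Creating CLEANALLRUV task for replica id %d' % replica_id.
    """
    for netloc, rid in servers:
        if netloc.startswith(replica2):
            return int(rid)
    return None
```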
This should more
>>>>>>>> thoroughly clean out RUV data when a replica is being deleted, and
>>>>>>>> provide for a way to delete RUV data afterwards too if necessary.
>>>>>>>>
>>>>>>>> rob
>>>>>>>
>>>>>>> Rebased patch
>>>>>>>
>>>>>>> rob
>>>>>>>
>>>>>>
>>>>>> 0) As I wrote in a review for your patch 1041, the changelog entry
>>>>>> slipped elsewhere.
>>>>>>
>>>>>> 1) The following KeyboardInterrupt except clause looks suspicious. I
>>>>>> know why you have it there, but since it is generally a bad thing to
>>>>>> do, a comment explaining why it is needed would be useful.
>>>>>>
>>>>>> @@ -256,6 +263,17 @@ def del_link(realm, replica1, replica2, dirman_passwd, force=False):
>>>>>>          repl1.delete_agreement(replica2)
>>>>>>          repl1.delete_referral(replica2)
>>>>>>
>>>>>> +    if type1 == replication.IPA_REPLICA:
>>>>>> +        if repl2:
>>>>>> +            ruv = repl2._get_replica_id(repl2.conn, None)
>>>>>> +        else:
>>>>>> +            ruv = get_ruv_by_host(realm, replica1, replica2, dirman_passwd)
>>>>>> +
>>>>>> +        try:
>>>>>> +            repl1.cleanallruv(ruv)
>>>>>> +        except KeyboardInterrupt:
>>>>>> +            pass
>>>>>> +
>>>>>>
>>>>>> Maybe you just wanted to do some cleanup and then "raise" again?
>>>>>
>>>>> No, it is there because it is safe to break out of it. The task will
>>>>> continue to run. I added some verbiage.
>>>>>
>>>>>>
>>>>>> 2) This is related to 1): "ipa-replica-manage del" may wait
>>>>>> indefinitely when some remote replica is down, right?
>>>>>>
>>>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
>>>>>> Deleting a master is irreversible.
>>>>>> To reconnect to the remote master you will need to prepare a new
>>>>>> replica file and re-install.
>>>>>> Continue to delete?
[no]: y
>>>>>> ipa: INFO: Setting agreement
>>>>>> cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping
>>>>>> tree,cn=config schedule to 2358-2359 0 to force synch
>>>>>> ipa: INFO: Deleting schedule 2358-2359 0 from agreement
>>>>>> cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping
>>>>>> tree,cn=config
>>>>>> ipa: INFO: Replication Update in progress: FALSE: status: 0 Replica
>>>>>> acquired successfully: Incremental update succeeded: start: 0: end: 0
>>>>>> Background task created to clean replication data
>>>>>>
>>>>>> ... after about a minute I hit CTRL+C
>>>>>>
>>>>>> ^CDeleted replication agreement from 'vm-086.idm.lab.bos.redhat.com' to
>>>>>> 'vm-055.idm.lab.bos.redhat.com'
>>>>>> Failed to cleanup vm-055.idm.lab.bos.redhat.com DNS entries: NS record
>>>>>> does not contain 'vm-055.idm.lab.bos.redhat.com.'
>>>>>> You may need to manually remove them from the tree
>>>>>>
>>>>>> I think it would be better to inform the user that some remote replica
>>>>>> is down, or at least that we are waiting for the task to complete.
>>>>>> Something like this:
>>>>>>
>>>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
>>>>>> ...
>>>>>> Background task created to clean replication data
>>>>>> Replication data clean-up may take a very long time if some replica is
>>>>>> unreachable
>>>>>> Hit CTRL+C to interrupt the wait
>>>>>> ^C Clean up wait interrupted
>>>>>> ....
>>>>>> [continue with del]
>>>>>
>>>>> Yup, did this in #1.
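[Editor's note: the wait behaviour agreed on here — warn that cleanup can take a long time, poll the task, and treat CTRL+C as "stop waiting", since the server-side task keeps running — could be sketched roughly as below. The function name and the polling callback are illustrative stand-ins for checking the task entry's status in the directory.]

```python
import time

def wait_for_cleanallruv(task_is_done, poll_seconds=2):
    """Wait for a server-side CLEANALLRUV task, tolerating CTRL+C.

    ``task_is_done`` is a callable returning True once the task entry
    reports completion. Returns True if the task finished, False if
    the wait was interrupted; either way the task keeps running on
    the server.
    """
    print("Background task created to clean replication data.")
    print("This may take a long time if any replica is unreachable.")
    print("Hit CTRL+C to stop waiting; the task will continue to run.")
    try:
        while not task_is_done():
            time.sleep(poll_seconds)
    except KeyboardInterrupt:
        # Safe to break out: the cleanup continues server-side.
        print("Wait for task interrupted, continuing with the deletion.")
        return False
    return True
```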
>>>>>
>>>>>>
>>>>>> 3) (minor) When there is a cleanruv task running and you run
>>>>>> "ipa-replica-manage del", there is an unexpected error message caused
>>>>>> by a duplicate task object in LDAP:
>>>>>>
>>>>>> # ipa-replica-manage del vm-072.idm.lab.bos.redhat.com --force
>>>>>> Unable to connect to replica vm-072.idm.lab.bos.redhat.com, forcing
>>>>>> removal
>>>>>> FAIL
>>>>>> Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc':
>>>>>> "Can't contact LDAP server"}
>>>>>> Forcing removal on 'vm-086.idm.lab.bos.redhat.com'
>>>>>>
>>>>>> There were issues removing a connection: This entry already exists <<<<<<<<<
>>>>>>
>>>>>> Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc':
>>>>>> "Can't contact LDAP server"}
>>>>>> Failed to cleanup vm-072.idm.lab.bos.redhat.com DNS entries: NS record
>>>>>> does not contain 'vm-072.idm.lab.bos.redhat.com.'
>>>>>> You may need to manually remove them from the tree
>>>>>>
>>>>>>
>>>>>> I think it should be enough to just catch "entry already exists" in the
>>>>>> cleanallruv function, and in such a case print a relevant error message
>>>>>> and bail out. Thus, self.conn.checkTask(dn, dowait=True) would not be
>>>>>> called too.
>>>>>
>>>>> Good catch, fixed.
>>>>>
>>>>>>
>>>>>>
>>>>>> 4) (minor): In the make_readonly function, there is a redundant "pass"
>>>>>> statement:
>>>>>>
>>>>>> +    def make_readonly(self):
>>>>>> +        """
>>>>>> +        Make the current replication agreement read-only.
>>>>>> +        """
>>>>>> +        dn = DN(('cn', 'userRoot'), ('cn', 'ldbm database'),
>>>>>> +                ('cn', 'plugins'), ('cn', 'config'))
>>>>>> +
>>>>>> +        mod = [(ldap.MOD_REPLACE, 'nsslapd-readonly', 'on')]
>>>>>> +        try:
>>>>>> +            self.conn.modify_s(dn, mod)
>>>>>> +        except ldap.INSUFFICIENT_ACCESS:
>>>>>> +            # We can't make the server we're removing read-only but
>>>>>> +            # this isn't a show-stopper
>>>>>> +            root_logger.debug("No permission to switch replica to read-only, continuing anyway")
>>>>>> +            pass <<<<<<<<<<<<<<<
>>>>>
>>>>> Yeah, this is one of my common mistakes. I put in a pass initially,
>>>>> then add logging in front of it and forget to delete the pass. It's
>>>>> gone now.
>>>>>
>>>>>>
>>>>>>
>>>>>> 5) In clean_ruv, I think allowing a --force option to bypass the
>>>>>> user_input would be helpful (at least for test automation):
>>>>>>
>>>>>> +    if not ipautil.user_input("Continue to clean?", False):
>>>>>> +        sys.exit("Aborted")
>>>>>
>>>>> Yup, added.
>>>>>
>>>>> rob
>>>>
>>>> Slightly revised patch. I still had a window open with one unsaved
>>>> change.
>>>>
>>>> rob
>>>>
>>>
>>> Apparently there were two unsaved changes, one of which was lost. This
>>> adds in the 'entry already exists' fix.
>>>
>>> rob
>>>
>>
>> Just one last thing (otherwise the patch is OK) - I don't think this is
>> what we want :-)
>>
>> # ipa-replica-manage clean-ruv 8
>> Clean the Replication Update Vector for vm-055.idm.lab.bos.redhat.com:389
>>
>> Cleaning the wrong replica ID will cause that server to no
>> longer replicate so it may miss updates while the process
>> is running. It would need to be re-initialized to maintain
>> consistency. Be very careful.
>> Continue to clean?
[no]: y <<<<<<
>> Aborted
>>
>>
>> Nor this exception (you are checking for the wrong exception):
>>
>> # ipa-replica-manage clean-ruv 8
>> Clean the Replication Update Vector for vm-055.idm.lab.bos.redhat.com:389
>>
>> Cleaning the wrong replica ID will cause that server to no
>> longer replicate so it may miss updates while the process
>> is running. It would need to be re-initialized to maintain
>> consistency. Be very careful.
>> Continue to clean? [no]:
>> unexpected error: This entry already exists
>>
>> This is the exception:
>>
>> Traceback (most recent call last):
>>   File "/sbin/ipa-replica-manage", line 651, in <module>
>>     main()
>>   File "/sbin/ipa-replica-manage", line 648, in main
>>     clean_ruv(realm, args[1], options)
>>   File "/sbin/ipa-replica-manage", line 373, in clean_ruv
>>     thisrepl.cleanallruv(ruv)
>>   File "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py", line 1136, in cleanallruv
>>     self.conn.addEntry(e)
>>   File "/usr/lib/python2.7/site-packages/ipaserver/ipaldap.py", line 503, in addEntry
>>     self.__handle_errors(e, arg_desc=arg_desc)
>>   File "/usr/lib/python2.7/site-packages/ipaserver/ipaldap.py", line 321, in __handle_errors
>>     raise errors.DuplicateEntry()
>> ipalib.errors.DuplicateEntry: This entry already exists
>>
>> Martin
>>
>
>
> On another matter, I just noticed that CLEANALLRUV is not proceeding if I
> have a winsync replica defined (and it is even up):
>
> # ipa-replica-manage list
> dc.ad.test: winsync <<<<<<<
> vm-072.idm.lab.bos.redhat.com: master
> vm-086.idm.lab.bos.redhat.com: master
>
> [06/Sep/2012:11:59:10 -0400] NSMMReplicationPlugin - CleanAllRUV Task:
> Waiting for all the replicas to receive all the deleted replica updates...
> [06/Sep/2012:11:59:10 -0400] NSMMReplicationPlugin - CleanAllRUV Task:
> Failed to contact agmt (agmt="cn=meTodc.ad.test" (dc:389)) error (10),
> will retry later.
> [06/Sep/2012:11:59:10 -0400] NSMMReplicationPlugin - CleanAllRUV Task:
> Not all replicas caught up, retrying in 10 seconds
> [06/Sep/2012:11:59:20 -0400] NSMMReplicationPlugin - CleanAllRUV Task:
> Failed to contact agmt (agmt="cn=meTodc.ad.test" (dc:389)) error (10),
> will retry later.
> [06/Sep/2012:11:59:20 -0400] NSMMReplicationPlugin - CleanAllRUV Task:
> Not all replicas caught up, retrying in 20 seconds
> [06/Sep/2012:11:59:40 -0400] NSMMReplicationPlugin - CleanAllRUV Task:
> Failed to contact agmt (agmt="cn=meTodc.ad.test" (dc:389)) error (10),
> will retry later.
> [06/Sep/2012:11:59:40 -0400] NSMMReplicationPlugin - CleanAllRUV Task:
> Not all replicas caught up, retrying in 40 seconds
>
> I don't think this is OK. Adding Rich to CC to follow up on this one.
>
> Martin
>
And now the actual CC.

Martin

_______________________________________________
Freeipa-devel mailing list
Freeipa-devel@redhat.com
https://www.redhat.com/mailman/listinfo/freeipa-devel