Ludwig, that was perfect. I found some entries that seemingly had certs added very frequently, which I think was certmonger either going rogue or, more likely, a misconfiguration. Removing these and their corresponding tombstone entries reduced the directory size from 120MB to about 2MB. After that the replica installation proceeded without a problem.
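For anyone hitting the same thing, a minimal sketch of one way to spot such entries in an LDIF export (e.g. from ldapsearch or db2ldif). The attribute names are the standard 389-ds ones; the cert-count threshold is an arbitrary choice:

```python
# Sketch: scan an LDIF export for entries carrying an unusually large
# number of userCertificate values, plus tombstone/conflict entries.
# The threshold of 5 certs is arbitrary -- tune it for your directory.
def suspicious_entries(ldif_text, cert_threshold=5):
    hits = []
    for block in ldif_text.strip().split("\n\n"):
        lines = [l for l in block.splitlines() if not l.startswith("#")]
        if not lines:
            continue
        dn = next((l[4:] for l in lines if l.startswith("dn: ")), None)
        certs = sum(1 for l in lines
                    if l.lower().startswith("usercertificate"))
        tombstone = any(l == "objectClass: nsTombstone" for l in lines)
        conflict = any(l.lower().startswith("nsds5replconflict")
                       for l in lines)
        if certs >= cert_threshold or tombstone or conflict:
            hits.append((dn, certs, tombstone or conflict))
    return hits
```

Feed it the output of a subtree ldapsearch and it returns (dn, cert count, tombstone/conflict flag) tuples for anything worth a closer look.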
In terms of the large db file under the cldb/ directory, I've enabled changelog trimming and will wait and see what happens. The next step is placing the CA role on the replica server, but I'll start a new thread if there are issues there.

Many thanks again.
Mike

On 16 November 2017 at 12:34, Ludwig Krispenz <lkris...@redhat.com> wrote:
>
> On 11/15/2017 04:55 PM, Mike Johnson wrote:
>>
>> Thank you Ludwig. I did ask on #389 on freenode. The first response I
>> got said lkrispen (presumably you) was the expert in this area.
>
> :-)
>>
>> I have since cleaned up some nsTombstone/nsds5ReplConflict records
>> according to the docs:
>>
>> https://access.redhat.com/documentation/en-us/red_hat_directory_server/9.0/html/administration_guide/managing_replication-solving_common_replication_conflicts
>>
>> This allowed me to raise the domain level on the master to 1.
>>
>> I'll revert to a clean snapshot of the replica and capture logs from
>> both sides.
>> Mike
>
> I looked into the data you sent (off-list) and it looks like you really
> have a problem with a large entry (or maybe more). In the consumer error
> log we see again:
>
> [15/Nov/2017:18:03:47.578017800 +0000] - ERR - sasl_io_start_packet - SASL
> encrypted packet length exceeds maximum allowed limit (length=16777279,
> limit=2097152). Change the nsslapd-maxsasliosize attribute in cn=config
> to increase limit.
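For reference, the trimming and SASL I/O changes look roughly like this as ldapmodify input (a sketch: the 7-day age and 20MB limit are arbitrary values, adjust to taste):

```ldif
# Enable changelog trimming: keep roughly 7 days of changes
# (cn=changelog5,cn=config is the 389-ds 1.3.x changelog entry).
dn: cn=changelog5,cn=config
changetype: modify
replace: nsslapd-changelogmaxage
nsslapd-changelogmaxage: 7d

# Raise the SASL I/O limit from the 2MB default named in the error above.
dn: cn=config
changetype: modify
replace: nsslapd-maxsasliosize
nsslapd-maxsasliosize: 20971520
```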
>
> and in the corresponding access log:
>
> [15/Nov/2017:18:03:46.868648393 +0000] conn=5 op=510 EXT
> oid="2.16.840.1.113730.3.5.6" name="replication-multimaster-extop"
> [15/Nov/2017:18:03:46.868737845 +0000] conn=5 op=510 RESULT err=0 tag=120
> nentries=0 etime=0
> [15/Nov/2017:18:03:46.868854476 +0000] conn=5 op=511 EXT
> oid="2.16.840.1.113730.3.5.6" name="replication-multimaster-extop"
> [15/Nov/2017:18:03:46.868924086 +0000] conn=5 op=511 RESULT err=0 tag=120
> nentries=0 etime=0
> [15/Nov/2017:18:03:47.579925711 +0000] conn=5 op=-1 fd=64 closed - The
> value requested is too large to be stored in the data buffer provided.
>
> so the total init was progressing and 506 entries were successfully sent.
>
> You can try to confirm that there is a large entry, or try to find the
> largest entry in the database, and then follow the suggestion and raise
> nsslapd-maxsasliosize; you may then also run into the maxbersize limit.
>
> To see the order in which total init sends entries you can do the
> following search:
> ldapsearch -D "cn=directory manager" -w ... -b "<your suffix>" "(parentid>=1)"
>
>> On 15 November 2017 at 15:17, Ludwig Krispenz via FreeIPA-users
>> <freeipa-users@lists.fedorahosted.org> wrote:
>>>
>>> On 11/15/2017 07:40 AM, Mike Johnson via FreeIPA-users wrote:
>>>>
>>>> I should add that I deleted/moved the large DB file as it was on the
>>>> single remaining master, with no replication agreements left.
>>>
>>> yes, but that should be unrelated.
>>>
>>>> Is it worth asking on the 389-users list as well?
>>>
>>> you can do this to get another audience, but I think you also need
>>> feedback from the IPA people.
>>>
>>> The basic failure seems to be the failure of the total init, and that
>>> seems to fail because of:
>>> [14/Nov/2017:16:18:51.936433927 +0000] - ERR - sasl_io_start_packet -
>>> SASL encrypted packet length exceeds maximum allowed limit
>>> (length=16777279, limit=2097152).
>>> Change the nsslapd-maxsasliosize attribute in cn=config to increase
>>> limit.
>>>
>>> now you can try to increase the settings and retry the reinit, but if
>>> it is in the replica install phase I do not know if there is a way to
>>> change the default during install.
>>>
>>> For the next occurrence, could you provide access and error logs from
>>> both instances for the time of failure?
>>>
>>> Regards,
>>> Ludwig
>>>
>>>> Thanks
>>>> Mike
>>>>
>>>> On 14 November 2017 at 16:48, Mike Johnson <m.d.john...@kuub.org> wrote:
>>>>>
>>>>> Pastebin for the dirsrv/errors log file during/after the failed join --
>>>>> https://pastebin.com/gJR1SZWZ
>>>>>
>>>>> On 14 November 2017 at 16:40, Mike Johnson <m.d.john...@kuub.org> wrote:
>>>>>>
>>>>>> Ludwig, thank you for the prompt, helpful reply.
>>>>>>
>>>>>> I've deleted the stale replication agreements, cleaned the dangling
>>>>>> RUVs and renamed the huge file. It recreated the file but it's
>>>>>> nowhere near as big as it was.
>>>>>>
>>>>>> Now, on the second issue, it doesn't appear to be listening on port
>>>>>> 636.
>>>>>>
>>>>>> The steps I'm following are, broadly:
>>>>>>
>>>>>> yum install ipa-server
>>>>>> ipa-replica-install ./replica-info-id5.prod.mydomain.com.gpg
>>>>>>
>>>>>> I did not join the replica machine as a client before initiating the
>>>>>> replication; I understand this is correct?
>>>>>>
>>>>>> Presumably the directory starts on the replica during the
>>>>>> replica-install process?
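On Ludwig's suggestion above to find the largest entry in the database, a rough sketch that ranks entries by size in an LDIF export (the export would come from db2ldif or a full ldapsearch; the byte size of the LDIF text is only an approximation of the on-wire entry size):

```python
# Rank entries in an LDIF export by serialized size -- a rough proxy
# for locating the entry that blows past nsslapd-maxsasliosize.
def largest_entries(ldif_text, top=3):
    ranked = []
    for block in ldif_text.strip().split("\n\n"):
        dn = next((l[4:] for l in block.splitlines()
                   if l.startswith("dn: ")), "<no dn>")
        ranked.append((len(block.encode("utf-8")), dn))
    # Largest first; returns (size_in_bytes, dn) pairs.
    return sorted(ranked, reverse=True)[:top]
```

Anything in the multi-megabyte range is a candidate for the SASL packet that exceeded the 2MB limit.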
>>>>>>
>>>>>> journalctl on the replica shows many of the following after I try to
>>>>>> install:
>>>>>> ERR - NSMMReplicationPlugin - replica_replace_ruv_tombstone - Failed
>>>>>> to update replication update vector for replica
>>>>>> dc=prod,dc=mydomain,dc=com: LDAP error - 1
>>>>>>
>>>>>> This is the state of things after trying to install the replica:
>>>>>> [root@id5 ~]# netstat -ltnp
>>>>>> Active Internet connections (only servers)
>>>>>> Proto Recv-Q Send-Q Local Address    Foreign Address  State   PID/Program name
>>>>>> tcp        0      0 0.0.0.0:111      0.0.0.0:*        LISTEN  1/systemd
>>>>>> tcp        0      0 0.0.0.0:22       0.0.0.0:*        LISTEN  1139/sshd
>>>>>> tcp        0      0 127.0.0.1:25     0.0.0.0:*        LISTEN  1332/master
>>>>>> tcp6       0      0 :::111           :::*             LISTEN  1/systemd
>>>>>> tcp6       0      0 :::22            :::*             LISTEN  1139/sshd
>>>>>> tcp6       0      0 ::1:25           :::*             LISTEN  1332/master
>>>>>> tcp6       0      0 :::389           :::*             LISTEN  1964/ns-slapd
>>>>>>
>>>>>> I note that port 389 is showing as tcp6, but I can see it with v4
>>>>>> from the master.
>>>>>>
>>>>>> What I have noticed is that the master is very, very slow. In
>>>>>> particular the httpd process running under the ipaapi user is sitting
>>>>>> at 100% load most of the time. I suspect timeouts may be occurring if
>>>>>> it's taking a long time for the master to respond to requests.
>>>>>>
>>>>>> Grateful for any more guidance.
>>>>>> Mike
>>>>>>
>>>>>> On 14 November 2017 at 12:23, Ludwig Krispenz via FreeIPA-users
>>>>>> <freeipa-users@lists.fedorahosted.org> wrote:
>>>>>>>
>>>>>>> On 11/14/2017 11:40 AM, Mike Johnson via FreeIPA-users wrote:
>>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> I've got a small environment which had until recently 2 IPA servers.
>>>>>>>> Both CentOS 7.4.1708.
>>>>>>>>
>>>>>>>> Version info:
>>>>>>>>
>>>>>>>> id1:
>>>>>>>> Name    : ipa-server
>>>>>>>> Version : 4.5.0
>>>>>>>> Release : 21.el7.centos.2.2
>>>>>>>> Kernel  : 3.10.0-693.5.2.el7.x86_64
>>>>>>>> 389-ds-base is at version 1.3.6.1
>>>>>>>>
>>>>>>>> id5:
>>>>>>>> Name    : ipa-server
>>>>>>>> Version : 4.5.0
>>>>>>>> Release : 21.el7.centos.2.2
>>>>>>>> Kernel  : 3.10.0-693.5.2.el7.x86_64
>>>>>>>> 389-ds-base is at version 1.3.6.1
>>>>>>>>
>>>>>>>> I recently had an issue with high IO/load, and noted that the
>>>>>>>> following file:
>>>>>>>> /var/lib/dirsrv/slapd-PROD-MYDOMAIN-COM/cldb/<long-filename>.db
>>>>>>>> was huge (5GB-ish) in a very small 2-master environment. This is on
>>>>>>>> the master. My understanding is that the entries in this file, which
>>>>>>>> have timestamps from months ago, exist because of failed
>>>>>>>> replication. I don't understand how to clear this without breaking
>>>>>>>> things.
>>>>>>>
>>>>>>> looks like you do not have changelog trimming enabled. If you enable
>>>>>>> trimming now, this will reduce the content but not necessarily the
>>>>>>> file size; it will, however, prevent it from growing.
>>>>>>> If you stop the server and remove the file, it will be recreated.
>>>>>>> What can happen then is that changes required to update another
>>>>>>> replica are missing, and replication will ask you to reinit the
>>>>>>> other server.
>>>>>>>
>>>>>>> Now, the second problem should be unrelated. It looks like the total
>>>>>>> init tries to connect to port 636 and fails, and the normal repl
>>>>>>> session fails because the init didn't happen. Could you verify that
>>>>>>> id5 is listening on 636, or whether you have any errors in its error
>>>>>>> logs.
>>>>>>>>
>>>>>>>> Second issue; not sure if related:
>>>>>>>>
>>>>>>>> I've since lost the replica (id2) but I've prepared a new machine
>>>>>>>> (id5) to be a new replica of id1.
>>>>>>>> I've cleaned the RUVs and deleted
>>>>>>>> the replication agreements, but when I join the new machine to the
>>>>>>>> existing one using `ipa-replica-install` I get the following on the
>>>>>>>> replica:
>>>>>>>>
>>>>>>>> ################
>>>>>>>> Starting replication, please wait until this has completed.
>>>>>>>> Update in progress, 10 seconds elapsed
>>>>>>>> [ldap://id1.prod.mydomain.com:389] reports: Update failed! Status:
>>>>>>>> [-11 connection error: Unknown connection error (-11) - Total update
>>>>>>>> aborted]
>>>>>>>>
>>>>>>>> [error] RuntimeError: Failed to start replication
>>>>>>>> Your system may be partly configured.
>>>>>>>> Run /usr/sbin/ipa-server-install --uninstall to clean up.
>>>>>>>>
>>>>>>>> ipa.ipapython.install.cli.install_tool(CompatServerReplicaInstall):
>>>>>>>> ERROR Failed to start replication
>>>>>>>> ipa.ipapython.install.cli.install_tool(CompatServerReplicaInstall):
>>>>>>>> ERROR The ipa-replica-install command failed. See
>>>>>>>> /var/log/ipareplica-install.log for more information
>>>>>>>> [root@id5 ~]# ipa-replica-manage re-initialize --from
>>>>>>>> id1.prod.mydomain.com
>>>>>>>> Re-run /usr/sbin/ipa-replica-manage with --verbose option to get
>>>>>>>> more information
>>>>>>>> Unexpected error: cannot connect to
>>>>>>>> 'ldaps://id5.prod.mydomain.com:636':
>>>>>>>> ################
>>>>>>>>
>>>>>>>> and the following on the master:
>>>>>>>>
>>>>>>>> ################
>>>>>>>> [14/Nov/2017:10:05:28.671905981 +0000] - INFO - NSMMReplicationPlugin
>>>>>>>> - repl5_tot_run - Beginning total update of replica
>>>>>>>> "agmt="cn=meToid5.prod.mydomain.com" (id5:389)".
>>>>>>>> [14/Nov/2017:10:05:38.031033860 +0000] - ERR - NSMMReplicationPlugin
>>>>>>>> - repl5_tot_log_operation_failure -
>>>>>>>> agmt="cn=meToid5.prod.mydomain.com" (id5:389): Received error -1
>>>>>>>> (Can't contact LDAP server): for total update operation
>>>>>>>> [14/Nov/2017:10:05:38.032272148 +0000] - ERR - NSMMReplicationPlugin
>>>>>>>> - release_replica - agmt="cn=meToid5.prod.mydomain.com" (id5:389):
>>>>>>>> Unable to send endReplication extended operation (Can't contact LDAP
>>>>>>>> server)
>>>>>>>> [14/Nov/2017:10:05:38.095893236 +0000] - ERR - NSMMReplicationPlugin
>>>>>>>> - repl5_tot_run - Total update failed for replica
>>>>>>>> "agmt="cn=meToid5.prod.mydomain.com" (id5:389)", error (-11)
>>>>>>>> [14/Nov/2017:10:05:38.113388624 +0000] - INFO - NSMMReplicationPlugin
>>>>>>>> - bind_and_check_pwp - agmt="cn=meToid5.prod.mydomain.com" (id5:389):
>>>>>>>> Replication bind with GSSAPI auth resumed
>>>>>>>> [14/Nov/2017:10:05:38.425682940 +0000] - WARN - NSMMReplicationPlugin
>>>>>>>> - repl5_inc_run - agmt="cn=meToid5.prod.mydomain.com" (id5:389): The
>>>>>>>> remote replica has a different database generation ID than the local
>>>>>>>> database. You may have to reinitialize the remote replica, or the
>>>>>>>> local replica.
>>>>>>>> ################
>>>>>>>>
>>>>>>>> I've checked the firewalls on both machines, and gone as far as
>>>>>>>> flushing all the iptables rules to get it to work. No luck.
>>>>>>>>
>>>>>>>> I'm also getting hundreds of the last line ("different database
>>>>>>>> generation ID") but my understanding is that this is only logged
>>>>>>>> because the replica is yet to be set up.
>>>>>>>>
>>>>>>>> Would anyone please be able to provide some guidance? I've been at
>>>>>>>> this for a few days now!
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Mike
>>>>>>>> _______________________________________________
>>>>>>>> FreeIPA-users mailing list -- freeipa-users@lists.fedorahosted.org
>>>>>>>> To unsubscribe send an email to
>>>>>>>> freeipa-users-le...@lists.fedorahosted.org
>>>>>>>
>>>>>>> --
>>>>>>> Red Hat GmbH, http://www.de.redhat.com/, Registered seat: Grasbrunn,
>>>>>>> Commercial register: Amtsgericht Muenchen, HRB 153243,
>>>>>>> Managing Directors: Charles Cachera, Michael Cunningham,
>>>>>>> Michael O'Neill, Eric Shander