Ludwig, that was perfect. I found some entries that seemingly had certs added very frequently, which I think was certmonger either going rogue or, more likely, a misconfiguration. Removing these and their corresponding tombstone entries reduced the directory size from 120MB to about 2MB. After that the replica installation proceeded without a problem.
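For anyone hitting the same thing, a minimal sketch of one way to spot such entries in an LDIF export (e.g. from ldapsearch or db2ldif). The attribute names are the standard 389-ds ones; the cert-count threshold is an arbitrary choice:

```python
# Sketch: scan an LDIF export for entries carrying an unusually large
# number of userCertificate values, plus tombstone/conflict entries.
# The threshold of 5 certs is arbitrary -- tune it for your directory.
def suspicious_entries(ldif_text, cert_threshold=5):
    hits = []
    for block in ldif_text.strip().split("\n\n"):
        lines = [l for l in block.splitlines() if not l.startswith("#")]
        if not lines:
            continue
        dn = next((l[4:] for l in lines if l.startswith("dn: ")), None)
        certs = sum(1 for l in lines
                    if l.lower().startswith("usercertificate"))
        tombstone = any(l == "objectClass: nsTombstone" for l in lines)
        conflict = any(l.lower().startswith("nsds5replconflict")
                       for l in lines)
        if certs >= cert_threshold or tombstone or conflict:
            hits.append((dn, certs, tombstone or conflict))
    return hits
```

Feed it the output of a subtree ldapsearch and it returns (dn, cert count, tombstone/conflict flag) tuples for anything worth a closer look.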
In terms of the large db file under the cldb/ directory, I've enabled changelog trimming and will wait and see what happens. The next step is placing the CA role on the replica server, but I'll start a new thread if there are issues there.

Many thanks again.
Mike

On 16 November 2017 at 12:34, Ludwig Krispenz <lkris...@redhat.com> wrote:
>
> On 11/15/2017 04:55 PM, Mike Johnson wrote:
>>
>> Thank you Ludwig. I did ask on #389 on freenode. The first response I
>> got said lkrispen (presumably you) was the expert in this area.
>
> :-)
>>
>> I have since cleaned up some nsTombstone/nsds5ReplConflict records
>> according to the docs:
>>
>> https://access.redhat.com/documentation/en-us/red_hat_directory_server/9.0/html/administration_guide/managing_replication-solving_common_replication_conflicts
>>
>> This allowed me to raise the domain level on the master to 1.
>>
>> I'll revert to a clean snapshot of the replica and capture logs from
>> both sides.
>> Mike
>
> I looked into the data you sent (off-list) and it looks like you really
> have a problem with a large entry (or maybe more). In the consumer error
> log we see again:
>
> [15/Nov/2017:18:03:47.578017800 +0000] - ERR - sasl_io_start_packet - SASL
> encrypted packet length exceeds maximum allowed limit (length=16777279,
> limit=2097152). Change the nsslapd-maxsasliosize attribute in cn=config
> to increase limit.
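For reference, the trimming and SASL I/O changes look roughly like this as ldapmodify input (a sketch: the 7-day age and 20MB limit are arbitrary values, adjust to taste):

```ldif
# Enable changelog trimming: keep roughly 7 days of changes
# (cn=changelog5,cn=config is the 389-ds 1.3.x changelog entry).
dn: cn=changelog5,cn=config
changetype: modify
replace: nsslapd-changelogmaxage
nsslapd-changelogmaxage: 7d

# Raise the SASL I/O limit from the 2MB default named in the error above.
dn: cn=config
changetype: modify
replace: nsslapd-maxsasliosize
nsslapd-maxsasliosize: 20971520
```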
>
> and in the corresponding access log:
>
> [15/Nov/2017:18:03:46.868648393 +0000] conn=5 op=510 EXT
> oid="2.16.840.1.113730.3.5.6" name="replication-multimaster-extop"
> [15/Nov/2017:18:03:46.868737845 +0000] conn=5 op=510 RESULT err=0 tag=120
> nentries=0 etime=0
> [15/Nov/2017:18:03:46.868854476 +0000] conn=5 op=511 EXT
> oid="2.16.840.1.113730.3.5.6" name="replication-multimaster-extop"
> [15/Nov/2017:18:03:46.868924086 +0000] conn=5 op=511 RESULT err=0 tag=120
> nentries=0 etime=0
> [15/Nov/2017:18:03:47.579925711 +0000] conn=5 op=-1 fd=64 closed - The
> value requested is too large to be stored in the data buffer provided.
>
> so the total init was progressing and 506 entries were successfully sent.
>
> You can try to confirm that there is a large entry, or try to find the
> largest entry in the database, and then follow the suggestion and raise
> nsslapd-maxsasliosize; you may then also run into the maxbersize limit.
>
> To see the order in which total init sends entries you can do the
> following search:
> ldapsearch -D "cn=directory manager" -w ... -b "<your suffix>" "(parentid>=1)"
>
>> On 15 November 2017 at 15:17, Ludwig Krispenz via FreeIPA-users
>> <freeipa-users@lists.fedorahosted.org> wrote:
>>>
>>> On 11/15/2017 07:40 AM, Mike Johnson via FreeIPA-users wrote:
>>>>
>>>> I should add that I deleted/moved the large DB file as it was on the
>>>> single remaining master, with no replication agreements left.
>>>
>>> yes, but that should be unrelated.
>>>
>>>> Is it worth asking on the 389-users list as well?
>>>
>>> you can do this to get another audience, but I think you also need
>>> feedback from the IPA people.
>>>
>>> The basic failure seems to be the failure of the total init, and that
>>> seems to fail because of:
>>> [14/Nov/2017:16:18:51.936433927 +0000] - ERR - sasl_io_start_packet -
>>> SASL encrypted packet length exceeds maximum allowed limit
>>> (length=16777279, limit=2097152).
>>> Change the nsslapd-maxsasliosize attribute in cn=config to increase
>>> limit.
>>>
>>> now you can try to increase the settings and retry the reinit, but if
>>> it is in the replica install phase I do not know if there is a way to
>>> change the default during install.
>>>
>>> For the next occurrence, could you provide access and error logs from
>>> both instances for the time of failure?
>>>
>>> Regards,
>>> Ludwig
>>>
>>>> Thanks
>>>> Mike
>>>>
>>>> On 14 November 2017 at 16:48, Mike Johnson <m.d.john...@kuub.org> wrote:
>>>>>
>>>>> Pastebin for the dirsrv/errors log file during/after the failed join --
>>>>> https://pastebin.com/gJR1SZWZ
>>>>>
>>>>> On 14 November 2017 at 16:40, Mike Johnson <m.d.john...@kuub.org> wrote:
>>>>>>
>>>>>> Ludwig, thank you for the prompt, helpful reply.
>>>>>>
>>>>>> I've deleted the stale replication agreements, cleaned the dangling
>>>>>> RUVs and renamed the huge file. It recreated the file but it's
>>>>>> nowhere near as big as it was.
>>>>>>
>>>>>> Now, on the second issue, it doesn't appear to be listening on port
>>>>>> 636.
>>>>>>
>>>>>> The steps I'm following are, broadly:
>>>>>>
>>>>>> yum install ipa-server
>>>>>> ipa-replica-install ./replica-info-id5.prod.mydomain.com.gpg
>>>>>>
>>>>>> I did not join the replica machine as a client before initiating the
>>>>>> replication; I understand this is correct?
>>>>>>
>>>>>> Presumably the directory starts on the replica during the
>>>>>> replica-install process?
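On Ludwig's suggestion above to find the largest entry in the database, a rough sketch that ranks entries by size in an LDIF export (the export would come from db2ldif or a full ldapsearch; the byte size of the LDIF text is only an approximation of the on-wire entry size):

```python
# Rank entries in an LDIF export by serialized size -- a rough proxy
# for locating the entry that blows past nsslapd-maxsasliosize.
def largest_entries(ldif_text, top=3):
    ranked = []
    for block in ldif_text.strip().split("\n\n"):
        dn = next((l[4:] for l in block.splitlines()
                   if l.startswith("dn: ")), "<no dn>")
        ranked.append((len(block.encode("utf-8")), dn))
    # Largest first; returns (size_in_bytes, dn) pairs.
    return sorted(ranked, reverse=True)[:top]
```

Anything in the multi-megabyte range is a candidate for the SASL packet that exceeded the 2MB limit.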
>>>>>>
>>>>>> journalctl on the replica shows many of the following after I try to
>>>>>> install:
>>>>>> ERR - NSMMReplicationPlugin - replica_replace_ruv_tombstone - Failed
>>>>>> to update replication update vector for replica
>>>>>> dc=prod,dc=mydomain,dc=com: LDAP error - 1
>>>>>>
>>>>>> This is the state of things after trying to install the replica:
>>>>>> [root@id5 ~]# netstat -ltnp
>>>>>> Active Internet connections (only servers)
>>>>>> Proto Recv-Q Send-Q Local Address    Foreign Address  State   PID/Program name
>>>>>> tcp        0      0 0.0.0.0:111      0.0.0.0:*        LISTEN  1/systemd
>>>>>> tcp        0      0 0.0.0.0:22       0.0.0.0:*        LISTEN  1139/sshd
>>>>>> tcp        0      0 127.0.0.1:25     0.0.0.0:*        LISTEN  1332/master
>>>>>> tcp6       0      0 :::111           :::*             LISTEN  1/systemd
>>>>>> tcp6       0      0 :::22            :::*             LISTEN  1139/sshd
>>>>>> tcp6       0      0 ::1:25           :::*             LISTEN  1332/master
>>>>>> tcp6       0      0 :::389           :::*             LISTEN  1964/ns-slapd
>>>>>>
>>>>>> I note that port 389 is showing as tcp6, but I can see it with v4
>>>>>> from the master.
>>>>>>
>>>>>> What I have noticed is that the master is very, very slow. In
>>>>>> particular the httpd process running under the ipaapi user is sitting
>>>>>> at 100% load most of the time. I suspect timeouts may be occurring if
>>>>>> it's taking a long time for the master to respond to requests.
>>>>>>
>>>>>> Grateful for any more guidance.
>>>>>> Mike
>>>>>>
>>>>>> On 14 November 2017 at 12:23, Ludwig Krispenz via FreeIPA-users
>>>>>> <freeipa-users@lists.fedorahosted.org> wrote:
>>>>>>>
>>>>>>> On 11/14/2017 11:40 AM, Mike Johnson via FreeIPA-users wrote:
>>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> I've got a small environment which had until recently 2 IPA servers.
>>>>>>>> Both CentOS 7.4.1708.
>>>>>>>>
>>>>>>>> Version info:
>>>>>>>>
>>>>>>>> id1:
>>>>>>>> Name    : ipa-server
>>>>>>>> Version : 4.5.0
>>>>>>>> Release : 21.el7.centos.2.2
>>>>>>>> Kernel  : 3.10.0-693.5.2.el7.x86_64
>>>>>>>> 389-ds-base is at version 1.3.6.1
>>>>>>>>
>>>>>>>> id5:
>>>>>>>> Name    : ipa-server
>>>>>>>> Version : 4.5.0
>>>>>>>> Release : 21.el7.centos.2.2
>>>>>>>> Kernel  : 3.10.0-693.5.2.el7.x86_64
>>>>>>>> 389-ds-base is at version 1.3.6.1
>>>>>>>>
>>>>>>>> I recently had an issue with high IO/load, and noted that the
>>>>>>>> following file:
>>>>>>>> /var/lib/dirsrv/slapd-PROD-MYDOMAIN-COM/cldb/<long-filename>.db
>>>>>>>> was huge (5GB-ish) in a very small 2-master environment. This is on
>>>>>>>> the master. My understanding is that the entries in this file, which
>>>>>>>> have timestamps from months ago, exist because of failed
>>>>>>>> replication. I don't understand how to clear this without breaking
>>>>>>>> things.
>>>>>>>
>>>>>>> looks like you do not have changelog trimming enabled. If you enable
>>>>>>> trimming now, this will reduce the content but not necessarily the
>>>>>>> file size; it will, however, prevent it from growing.
>>>>>>> If you stop the server and remove the file, it will be recreated.
>>>>>>> What can happen then is that changes required to update another
>>>>>>> replica are missing, and replication will ask you to reinit the
>>>>>>> other server.
>>>>>>>
>>>>>>> Now, the second problem should be unrelated. It looks like the total
>>>>>>> init tries to connect to port 636 and fails, and the normal repl
>>>>>>> session fails because the init didn't happen. Could you verify that
>>>>>>> id5 is listening on 636, or whether you have any errors in its error
>>>>>>> logs.
>>>>>>>>
>>>>>>>> Second issue; not sure if related:
>>>>>>>>
>>>>>>>> I've since lost the replica (id2) but I've prepared a new machine
>>>>>>>> (id5) to be a new replica of id1.
>>>>>>>> I've cleaned the RUVs and deleted
>>>>>>>> the replication agreements, but when I join the new machine to the
>>>>>>>> existing one using `ipa-replica-install` I get the following on the
>>>>>>>> replica:
>>>>>>>>
>>>>>>>> ################
>>>>>>>> Starting replication, please wait until this has completed.
>>>>>>>> Update in progress, 10 seconds elapsed
>>>>>>>> [ldap://id1.prod.mydomain.com:389] reports: Update failed! Status:
>>>>>>>> [-11 connection error: Unknown connection error (-11) - Total update
>>>>>>>> aborted]
>>>>>>>>
>>>>>>>> [error] RuntimeError: Failed to start replication
>>>>>>>> Your system may be partly configured.
>>>>>>>> Run /usr/sbin/ipa-server-install --uninstall to clean up.
>>>>>>>>
>>>>>>>> ipa.ipapython.install.cli.install_tool(CompatServerReplicaInstall):
>>>>>>>> ERROR Failed to start replication
>>>>>>>> ipa.ipapython.install.cli.install_tool(CompatServerReplicaInstall):
>>>>>>>> ERROR The ipa-replica-install command failed. See
>>>>>>>> /var/log/ipareplica-install.log for more information
>>>>>>>> [root@id5 ~]# ipa-replica-manage re-initialize --from
>>>>>>>> id1.prod.mydomain.com
>>>>>>>> Re-run /usr/sbin/ipa-replica-manage with --verbose option to get
>>>>>>>> more information
>>>>>>>> Unexpected error: cannot connect to
>>>>>>>> 'ldaps://id5.prod.mydomain.com:636':
>>>>>>>> ################
>>>>>>>>
>>>>>>>> and the following on the master:
>>>>>>>>
>>>>>>>> ################
>>>>>>>> [14/Nov/2017:10:05:28.671905981 +0000] - INFO - NSMMReplicationPlugin
>>>>>>>> - repl5_tot_run - Beginning total update of replica
>>>>>>>> "agmt="cn=meToid5.prod.mydomain.com" (id5:389)".
>>>>>>>> [14/Nov/2017:10:05:38.031033860 +0000] - ERR - NSMMReplicationPlugin
>>>>>>>> - repl5_tot_log_operation_failure -
>>>>>>>> agmt="cn=meToid5.prod.mydomain.com" (id5:389): Received error -1
>>>>>>>> (Can't contact LDAP server): for total update operation
>>>>>>>> [14/Nov/2017:10:05:38.032272148 +0000] - ERR - NSMMReplicationPlugin
>>>>>>>> - release_replica - agmt="cn=meToid5.prod.mydomain.com" (id5:389):
>>>>>>>> Unable to send endReplication extended operation (Can't contact LDAP
>>>>>>>> server)
>>>>>>>> [14/Nov/2017:10:05:38.095893236 +0000] - ERR - NSMMReplicationPlugin
>>>>>>>> - repl5_tot_run - Total update failed for replica
>>>>>>>> "agmt="cn=meToid5.prod.mydomain.com" (id5:389)", error (-11)
>>>>>>>> [14/Nov/2017:10:05:38.113388624 +0000] - INFO - NSMMReplicationPlugin
>>>>>>>> - bind_and_check_pwp - agmt="cn=meToid5.prod.mydomain.com" (id5:389):
>>>>>>>> Replication bind with GSSAPI auth resumed
>>>>>>>> [14/Nov/2017:10:05:38.425682940 +0000] - WARN - NSMMReplicationPlugin
>>>>>>>> - repl5_inc_run - agmt="cn=meToid5.prod.mydomain.com" (id5:389): The
>>>>>>>> remote replica has a different database generation ID than the local
>>>>>>>> database. You may have to reinitialize the remote replica, or the
>>>>>>>> local replica.
>>>>>>>> ################
>>>>>>>>
>>>>>>>> I've checked the firewalls on both machines, and gone as far as
>>>>>>>> flushing all the iptables rules to get it to work. No luck.
>>>>>>>>
>>>>>>>> I'm also getting hundreds of the last line ("different database
>>>>>>>> generation ID") but my understanding is that this is only logged
>>>>>>>> because the replica is yet to be set up.
>>>>>>>>
>>>>>>>> Would anyone please be able to provide some guidance? I've been at
>>>>>>>> this for a few days now!
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Mike
>>>>>>>> _______________________________________________
>>>>>>>> FreeIPA-users mailing list -- freeipa-users@lists.fedorahosted.org
>>>>>>>> To unsubscribe send an email to
>>>>>>>> freeipa-users-le...@lists.fedorahosted.org
>>>>>>>
>>>>>>> --
>>>>>>> Red Hat GmbH, http://www.de.redhat.com/, Registered seat: Grasbrunn,
>>>>>>> Commercial register: Amtsgericht Muenchen, HRB 153243,
>>>>>>> Managing Directors: Charles Cachera, Michael Cunningham,
>>>>>>> Michael O'Neill, Eric Shander