On 11/11/2015 04:20 PM, Andrew Krause wrote:
Yesterday I came in to 3 of my 4 freeipa replicas in an unusable state and
replication was not connecting any of the hosts to each other. My
first/primary host was still servicing authentication requests, but the others
were in varying states of usability. I've investigated logs on all 4 nodes and
the only thing I can see is messages like this from when the problem started
until I restarted all 4 with ipactl stop/ipactl start:
[09/Nov/2015:19:17:16 -0700] NSMMReplicationPlugin -
agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to
receive the response for a startReplication extended operation to consumer (Timed out).
Will retry later.
[09/Nov/2015:19:19:16 -0700] NSMMReplicationPlugin -
agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to
receive the response for a startReplication extended operation to consumer (Timed out).
Will retry later.
[09/Nov/2015:19:21:19 -0700] NSMMReplicationPlugin -
agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to
receive the response for a startReplication extended operation to consumer (Timed out).
Will retry later.
[09/Nov/2015:19:23:19 -0700] NSMMReplicationPlugin -
agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to
receive the response for a startReplication extended operation to consumer (Timed out).
Will retry later.
[09/Nov/2015:19:25:21 -0700] NSMMReplicationPlugin -
agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to
receive the response for a startReplication extended operation to consumer (Timed out).
Will retry later.
[09/Nov/2015:19:27:21 -0700] NSMMReplicationPlugin -
agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to
receive the response for a startReplication extended operation to consumer (Timed out).
Will retry later.
[09/Nov/2015:19:29:26 -0700] NSMMReplicationPlugin -
agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to
receive the response for a startReplication extended operation to consumer (Timed out).
Will retry later.
[09/Nov/2015:19:31:26 -0700] NSMMReplicationPlugin -
agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to
receive the response for a startReplication extended operation to consumer (Timed out).
Will retry later.
[09/Nov/2015:19:32:37 -0700] NSMMReplicationPlugin -
agmt="cn=meToabcloc2papp08.somedomain.com" (abcloc2papp08:389): Warning:
Attempting to release replica, but unable to receive endReplication extended operation
response from the replica. Error -5 (Timed out)
[09/Nov/2015:19:33:29 -0700] NSMMReplicationPlugin -
agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to
receive the response for a startReplication extended operation to consumer (Timed out).
Will retry later.
[09/Nov/2015:19:34:37 -0700] NSMMReplicationPlugin -
agmt="cn=meToa.somedomain.com" (abcloc2papp08:389): Unable to receive the
response for a startReplication extended operation to consumer (Timed out). Will retry
later.
[09/Nov/2015:19:35:28 -0700] NSMMReplicationPlugin -
agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to
receive the response for a startReplication extended operation to consumer (Timed out).
Will retry later.
[09/Nov/2015:19:36:41 -0700] NSMMReplicationPlugin -
agmt="cn=meToabcloc2papp08.somedomain.com" (abcloc2papp08:389): Unable to
receive the response for a startReplication extended operation to consumer (Timed out).
Will retry later.
We've already looked into our network and there was no outage/interruption between sites
during the timeframe in question. The only corrective action that was taken was to
restart each node. Does anyone know any way I can investigate further what caused this
issue? I don't like giving "I don't know" answers for why replication stopped
working and did not resume by itself.
Hi Andrew,
There are fairly periodic (every minute or two) networking issues
where the primary host fails to complete the replication protocol with
abcloc[12]papp08.
There may also be a problem with the 3rd replica, but it does not
appear in this portion of the logs.
Most of the time this prevents the primary master from establishing a
replication session, so these replicas are likely lagging behind.
The replicas are reachable but do not answer fast enough and the
protocol times out.
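
To check how regular the timeouts are, something like the following could be run over the errors log of the primary. It is only a rough sketch: the regular expression is written against the entries quoted above, and the sample is rebuilt from the first three of them. The function name `timeout_gaps` is made up for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime

# One errors-log entry: timestamp, plugin name, agreement, peer, message.
# The message runs until the next "[dd/Mon/yyyy:" timestamp or end of text,
# so entries that were wrapped over several lines are still matched.
ENTRY = re.compile(
    r'\[(?P<ts>[^\]]+)\]\s+NSMMReplicationPlugin\s+-\s+'
    r'agmt="(?P<agmt>[^"]+)"\s+\((?P<peer>[^)]+)\):\s+'
    r'(?P<msg>.*?)(?=\[\d{2}/\w{3}/\d{4}:|\Z)',
    re.DOTALL,
)


def timeout_gaps(log_text):
    """Return {peer: [seconds between consecutive 'Timed out' entries]}."""
    stamps = defaultdict(list)
    for match in ENTRY.finditer(log_text):
        message = " ".join(match.group("msg").split())  # undo line wrapping
        if "Timed out" not in message:
            continue
        when = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        stamps[match.group("peer")].append(when)
    return {
        peer: [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]
        for peer, times in stamps.items()
    }


# Sample rebuilt from the first three entries quoted in the original mail.
TEMPLATE = (
    '[{ts} -0700] NSMMReplicationPlugin - '
    'agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): '
    'Unable to receive the response for a startReplication extended '
    'operation to consumer (Timed out). Will retry later.'
)
sample = "\n".join(
    TEMPLATE.format(ts=ts)
    for ts in ("09/Nov/2015:19:17:16", "09/Nov/2015:19:19:16",
               "09/Nov/2015:19:21:19")
)

print(timeout_gaps(sample))  # {'abcloc1papp08:389': [120, 123]}
```

Grouping by peer makes it easier to see whether one replica or all of them are slow to answer.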
The default replication timeout is 10 minutes, but it can be tuned in
each replica agreement with the nsds5ReplicaTimeout attribute.
Is that value set?
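
As a rough illustration of how the value could be checked and changed, the sketch below builds an ldapsearch command line that lists the agreements with their nsds5ReplicaTimeout, and an LDIF fragment that could be fed to ldapmodify. Several things here are assumptions and not taken from this thread: the bind DN "cn=Directory Manager", the search base, the example agreement DN (including the dc=somedomain,dc=com suffix) and the value 300. The attribute is expressed in seconds. Nothing is executed against a server; the helper names are hypothetical.

```python
def search_command(base="cn=mapping tree,cn=config"):
    """Build an ldapsearch argv listing agreements and their timeout.

    Assumes the OpenLDAP client tools and a bind as Directory Manager;
    add -H <uri> if the server is not the default one in ldap.conf.
    """
    return [
        "ldapsearch", "-x", "-D", "cn=Directory Manager", "-W",
        "-b", base,
        "(objectClass=nsds5replicationagreement)",
        "cn", "nsds5ReplicaTimeout",
    ]


def timeout_ldif(agreement_dn, seconds):
    """Return LDIF that replaces nsds5ReplicaTimeout on one agreement."""
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return "\n".join([
        f"dn: {agreement_dn}",
        "changetype: modify",
        "replace: nsds5ReplicaTimeout",
        f"nsds5ReplicaTimeout: {seconds}",
        "",
    ])


# Hypothetical agreement DN: copy the real one from the search output.
AGREEMENT_DN = (
    "cn=meToabcloc1papp08.somedomain.com,cn=replica,"
    "cn=dc\\3Dsomedomain\\2Cdc\\3Dcom,cn=mapping tree,cn=config"
)

print(" ".join(search_command()))
print(timeout_ldif(AGREEMENT_DN, 300))  # 300 is only an example value
```

If the search returns no nsds5ReplicaTimeout for an agreement, the attribute is not set and the default applies.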
Since it was working fine before, it would be interesting to check the
replicas' logs (maybe enable replication logging on them) when the
timeout occurs.
Also, if the problem continues, take periodic pstacks of the replicas
(at an interval shorter than the nsds5ReplicaTimeout value), because
something may be keeping them busy and unable to answer fast enough.
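
A minimal sketch of such periodic sampling, assuming pstack is installed on the replica and that the pid of the ns-slapd process is already known (for example from pidof ns-slapd). The choice of four samples per timeout window and the helper names are arbitrary assumptions, not a recommendation from the 389 project.

```python
import subprocess
import time


def sampling_interval(timeout_s, samples_per_window=4):
    """Pick a spacing that gives several samples within one timeout window."""
    return max(1, timeout_s // samples_per_window)


def capture_stacks(pid, timeout_s, count,
                   run=subprocess.run, sleep=time.sleep):
    """Run `pstack <pid>` `count` times, spaced below the timeout value.

    Returns a list of (local timestamp, pstack output) tuples. `run` and
    `sleep` are injectable so the loop can be exercised without pstack.
    """
    interval = sampling_interval(timeout_s)
    dumps = []
    for i in range(count):
        result = run(["pstack", str(pid)], capture_output=True, text=True)
        dumps.append((time.strftime("%Y-%m-%dT%H:%M:%S"), result.stdout))
        if i < count - 1:
            sleep(interval)
    return dumps


# Example (not executed here): 8 samples with a 120 second timeout,
# written to one file per sample for later comparison.
#
# for stamp, text in capture_stacks(pid=12345, timeout_s=120, count=8):
#     with open(f"pstack-{stamp}.txt", "w") as out:
#         out.write(text)
```

Comparing consecutive dumps shows whether the same threads stay stuck in the same place while the supplier is waiting for an answer.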
thanks
thierry
--
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project