Hi Andrew,
The primary was able to reach the replicas so I agree there is no
networking issue.
But the fact that the replicas did not answer the primary master within
the expected delay is seen by the primary master as a network issue, so
it backs off for some time.
Unless the timeout is incorrectly tuned (which I doubt), I think the
problem came from the replicas. For some reason they are not responding
fast enough, and there is no obvious explanation for that. Periodic
pstacks of the replicas, if the problem happens again, would likely give
us the culprit.
thanks
thierry
On 11/12/2015 10:03 PM, Andrew Krause wrote:
There were 0 networking issues. These errors were reported in the logs for
approximately 10 hours, but there were no instances of connectivity loss. One of the
replicas that experienced this issue is actually on the same physical hardware
as the single node that had no issues. The errors continued to report
constantly in logs from the start of the incident until I restarted freeIPA
services. Replication immediately resumed and has had no such issue since.
I’m fairly confident this also was not caused by load since the node that
continued to work services 90% or more of our authentication requests. The
other 3 nodes are basically just a hot standby. At this point we’re hoping it
was a fluke; we’ve tightened our monitoring and awareness since we have no way
to explain the root cause.
On Nov 12, 2015, at 2:38 AM, thierry bordaz <[email protected]> wrote:
On 11/11/2015 04:20 PM, Andrew Krause wrote:
Yesterday I came in to 3 of my 4 freeipa replicas in an unusable state and
replication was not connecting any of the hosts to each other. My
first/primary host was still servicing authentication requests, but the others
were in varying states of usability. I’ve investigated logs on all 4 nodes and
the only thing I can see is messages like this from when the problem started
until I restarted all 4 with ipactl stop/ipactl start:
[09/Nov/2015:19:17:16 -0700] NSMMReplicationPlugin -
agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to
receive the response for a startReplication extended operation to consumer (Timed out).
Will retry later.
[09/Nov/2015:19:19:16 -0700] NSMMReplicationPlugin -
agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to
receive the response for a startReplication extended operation to consumer (Timed out).
Will retry later.
[09/Nov/2015:19:21:19 -0700] NSMMReplicationPlugin -
agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to
receive the response for a startReplication extended operation to consumer (Timed out).
Will retry later.
[09/Nov/2015:19:23:19 -0700] NSMMReplicationPlugin -
agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to
receive the response for a startReplication extended operation to consumer (Timed out).
Will retry later.
[09/Nov/2015:19:25:21 -0700] NSMMReplicationPlugin -
agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to
receive the response for a startReplication extended operation to consumer (Timed out).
Will retry later.
[09/Nov/2015:19:27:21 -0700] NSMMReplicationPlugin -
agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to
receive the response for a startReplication extended operation to consumer (Timed out).
Will retry later.
[09/Nov/2015:19:29:26 -0700] NSMMReplicationPlugin -
agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to
receive the response for a startReplication extended operation to consumer (Timed out).
Will retry later.
[09/Nov/2015:19:31:26 -0700] NSMMReplicationPlugin -
agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to
receive the response for a startReplication extended operation to consumer (Timed out).
Will retry later.
[09/Nov/2015:19:32:37 -0700] NSMMReplicationPlugin -
agmt="cn=meToabcloc2papp08.somedomain.com" (abcloc2papp08:389): Warning:
Attempting to release replica, but unable to receive endReplication extended operation
response from the replica. Error -5 (Timed out)
[09/Nov/2015:19:33:29 -0700] NSMMReplicationPlugin -
agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to
receive the response for a startReplication extended operation to consumer (Timed out).
Will retry later.
[09/Nov/2015:19:34:37 -0700] NSMMReplicationPlugin -
agmt="cn=meToa.somedomain.com" (abcloc2papp08:389): Unable to receive the
response for a startReplication extended operation to consumer (Timed out). Will retry
later.
[09/Nov/2015:19:35:28 -0700] NSMMReplicationPlugin -
agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to
receive the response for a startReplication extended operation to consumer (Timed out).
Will retry later.
[09/Nov/2015:19:36:41 -0700] NSMMReplicationPlugin -
agmt="cn=meToabcloc2papp08.somedomain.com" (abcloc2papp08:389): Unable to
receive the response for a startReplication extended operation to consumer (Timed out).
Will retry later.
We’ve already looked into our network and there was no outage/interruption
between sites during the timeframe in question. The only corrective action
that was taken was to restart each node. Does anyone know any way I can
investigate further what caused this issue? I don’t like giving “I don’t know”
answers for why replication stopped working and did not resume by itself.
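One way to dig further into an excerpt like the one above is to group the
timeout entries by consumer and look at the spacing between them. What
follows is only a minimal sketch in Python, not a FreeIPA or 389-ds tool: the
function names are hypothetical, and it assumes the entries follow exactly
the layout shown in the excerpt (bracketed timestamp, NSMMReplicationPlugin,
agmt="...", then the consumer in parentheses).

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches the start of an error-log entry such as:
# [09/Nov/2015:19:17:16 -0700] NSMMReplicationPlugin - agmt="cn=..." (host:389): ...
ENTRY_RE = re.compile(
    r'\[(?P<ts>[^\]]+)\]\s+NSMMReplicationPlugin\s+-\s+'
    r'agmt="(?P<agmt>[^"]+)"\s+\((?P<consumer>[^)]+)\)'
)


def summarize_timeouts(log_text):
    """Group 'Timed out' replication entries by consumer, keeping their timestamps."""
    # Entries are wrapped across several lines in the excerpt, so collapse whitespace.
    flat = " ".join(log_text.split())
    # Split in front of each "[dd/Mon/yyyy:" timestamp so every chunk holds one entry.
    # (Splitting on an empty match needs Python 3.7 or later.)
    chunks = re.split(r'(?=\[\d{2}/\w{3}/\d{4}:)', flat)
    by_consumer = defaultdict(list)
    for chunk in chunks:
        m = ENTRY_RE.match(chunk)
        if m and "Timed out" in chunk:
            # %b expects English month abbreviations (C/English locale).
            ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
            by_consumer[m.group("consumer")].append(ts)
    return by_consumer


def gaps_in_seconds(timestamps):
    """Seconds between consecutive entries, in chronological order."""
    ordered = sorted(timestamps)
    return [int((b - a).total_seconds()) for a, b in zip(ordered, ordered[1:])]
```

Feeding it the primary's error log and printing gaps_in_seconds() for each
consumer shows how regularly the retries fail and when each consumer first
stopped answering, which can then be lined up against that replica's own logs.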
Hi Andrew,
There are quite periodic (every minute or couple of minutes) networking issues
where the primary host fails to process the replication protocol with
abcloc[12]papp08. There may be a problem with the 3rd replica, but it is not
present in this portion of the logs.
Most of the time it prevents the primary master from establishing a replication
session, so these replicas are likely lagging behind.
The replicas are reachable but do not answer fast enough, and the protocol times
out.
The default replication timeout is 10m, but it can be tuned in each replica
agreement with the nsds5ReplicaTimeout attribute.
Is the value set?
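To check whether the value is set, one option is to list the replication
agreements under cn=config and read nsds5ReplicaTimeout from each. The Python
sketch below only builds the ldapsearch command line and parses its LDIF
output. The function names are hypothetical, the bind DN and host are
assumptions to adapt, and the parser is deliberately simple (it skips folded
lines and does not decode base64 values).

```python
def build_timeout_query(host="localhost", bind_dn="cn=Directory Manager"):
    """Build an ldapsearch command that lists nsds5ReplicaTimeout per agreement."""
    return [
        "ldapsearch", "-x", "-LLL",
        "-H", f"ldap://{host}",
        "-D", bind_dn, "-W",  # -W prompts for the bind password
        "-b", "cn=config",
        "(objectClass=nsds5ReplicationAgreement)",
        "cn", "nsds5ReplicaTimeout",
    ]


def parse_timeouts(ldif_text):
    """Map each agreement cn to its nsds5ReplicaTimeout (None when not set)."""
    result = {}
    # LDIF entries are separated by blank lines.
    for entry in ldif_text.strip().split("\n\n"):
        attrs = {}
        for line in entry.splitlines():
            # Skip folded continuation lines, which start with a space.
            if ": " in line and not line.startswith(" "):
                key, value = line.split(": ", 1)
                attrs[key.lower()] = value
        if "cn" in attrs:
            timeout = attrs.get("nsds5replicatimeout")
            result[attrs["cn"]] = int(timeout) if timeout is not None else None
    return result
```

Running the command returned by build_timeout_query() on the primary (for
example through subprocess.run) and passing its output to parse_timeouts()
gives one line of evidence per agreement. A value of None means the attribute
is absent and the server default applies.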
As it was working fine before, it would be interesting to check the replica
logs (maybe enable replication logging for them) when the timeout occurs.
Also, if the problem continues, take periodic pstacks of the replicas (at an
interval shorter than the nsds5ReplicaTimeout value), because there may be
something that makes them busy and unable to answer fast enough.
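The periodic pstacks could be collected with something like the Python sketch
below. It is only an illustration under stated assumptions: the helper names
are hypothetical, it assumes the pstack command is installed on the replica
and that the caller already knows the PID of the directory server process
(ns-slapd), and sampling four times per timeout period is an arbitrary choice.

```python
import subprocess
import time
from datetime import datetime
from pathlib import Path


def sampling_interval(replica_timeout_s, samples_per_timeout=4):
    """Pick an interval well under the replication timeout (at least 1 second)."""
    return max(1, replica_timeout_s // samples_per_timeout)


def capture_stacks(pid, interval_s, count, out_dir=".",
                   run=subprocess.run, sleep=time.sleep):
    """Run `pstack <pid>` `count` times, `interval_s` apart, one file per sample."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for i in range(count):
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        result = run(["pstack", str(pid)], capture_output=True, text=True)
        # The sample index keeps file names unique even within the same second.
        target = out_path / f"pstack-{pid}-{stamp}-{i:03d}.txt"
        target.write_text(result.stdout)
        written.append(target)
        if i < count - 1:
            sleep(interval_s)
    return written
```

For example, with a 120 second timeout, capture_stacks(pid, sampling_interval(120), 20)
would take a sample every 30 seconds for about ten minutes. Threads that show
the same stack across several consecutive samples are the first place to look.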
thanks
thierry
--
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project