Yesterday I came in to find 3 of my 4 FreeIPA replicas in an unusable state, and replication was not connecting any of the hosts to each other. My first/primary host was still servicing authentication requests, but the others were in varying states of usability. I’ve investigated the logs on all 4 nodes, and the only thing I can see is messages like these, logged from when the problem started until I restarted all 4 with ipactl stop/ipactl start:

[09/Nov/2015:19:17:16 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[09/Nov/2015:19:19:16 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[09/Nov/2015:19:21:19 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[09/Nov/2015:19:23:19 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[09/Nov/2015:19:25:21 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[09/Nov/2015:19:27:21 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[09/Nov/2015:19:29:26 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[09/Nov/2015:19:31:26 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[09/Nov/2015:19:32:37 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc2papp08.somedomain.com" (abcloc2papp08:389): Warning: Attempting to release replica, but unable to receive endReplication extended operation response from the replica. Error -5 (Timed out)
[09/Nov/2015:19:33:29 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[09/Nov/2015:19:34:37 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc2papp08.somedomain.com" (abcloc2papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[09/Nov/2015:19:35:28 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc1papp08.somedomain.com" (abcloc1papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[09/Nov/2015:19:36:41 -0700] NSMMReplicationPlugin - agmt="cn=meToabcloc2papp08.somedomain.com" (abcloc2papp08:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.

We’ve already looked into our network, and there was no outage or interruption between sites during the timeframe in question. The only corrective action taken was to restart each node.

Does anyone know how I can further investigate what caused this issue? I don’t like giving “I don’t know” answers for why replication stopped working and did not resume by itself.
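
In case it helps with correlating the failures across nodes, here is a minimal sketch (not part of our actual tooling) that tallies the "Timed out" messages per replication agreement and reports the first and last occurrence. It assumes one log entry per line, the entry layout shown in the excerpts above, and an English locale for the month abbreviation; the function name summarize_timeouts is hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime

# One replication-plugin entry per line, in the layout quoted above:
# [timestamp] NSMMReplicationPlugin - agmt="<name>" (<host:port>): <message>
ENTRY = re.compile(
    r'^\[(?P<ts>[^\]]+)\] NSMMReplicationPlugin - '
    r'agmt="(?P<agmt>[^"]+)" \((?P<peer>[^)]+)\): (?P<msg>.*)$'
)

TS_FORMAT = "%d/%b/%Y:%H:%M:%S %z"  # e.g. 09/Nov/2015:19:17:16 -0700


def summarize_timeouts(lines):
    """Return {agreement: (first_seen, last_seen, count)} for timeout entries.

    Lines that do not match the expected layout are skipped.
    """
    seen = defaultdict(list)
    for line in lines:
        match = ENTRY.match(line.strip())
        if match is None or "Timed out" not in match.group("msg"):
            continue
        stamp = datetime.strptime(match.group("ts"), TS_FORMAT)
        seen[match.group("agmt")].append(stamp)
    return {
        agmt: (min(stamps), max(stamps), len(stamps))
        for agmt, stamps in seen.items()
    }
```

Running this against the errors log on each node (typically /var/log/dirsrv/slapd-INSTANCE/errors, where INSTANCE is a placeholder for the local instance name) and comparing the first_seen values should show which agreement started timing out first.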