Like the ACL manager, neutron-server can bounce from one host to another. These services have floating IPs that move with them between hosts, so other services can reconnect to them after a move. At any one time, there will be only one ACL manager and one neutron-server process running.
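The reconnect behaviour described above can be sketched as a simple retry loop with capped exponential backoff. This is a minimal illustration, not Calico's actual code; the function name, the `connect` callable, and the backoff parameters are all hypothetical:

```python
import time

def with_reconnect(connect, initial_backoff=0.5, max_backoff=30.0):
    """Keep retrying `connect` until it succeeds.

    When a service's floating IP moves to another host, in-flight
    connections drop; retrying with capped exponential backoff lets
    the client heal automatically once the service is back up on the
    new host. `connect` should raise OSError on failure and return a
    connected object on success.
    """
    backoff = initial_backoff
    while True:
        try:
            return connect()
        except OSError:
            time.sleep(backoff)
            backoff = min(backoff * 2, max_backoff)
```

Because the floating IP follows the service, the retry loop needs no host discovery: reconnecting to the same address eventually reaches the service wherever it lands.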
As long as reconnects are done properly, then things should heal themselves automatically when the service moves from one host to another.

On Mon, Mar 9, 2015 at 1:24 PM, Matthew Dupre <[email protected]> wrote:
> Good bug, that.
>
> I took a look through this latest set of diags, and I think I've
> identified the problem. In short, it looks like the ACL Manager (10.16.0.13,
> started at 17:55:09) is connecting to two different plugins (10.16.0.13 on
> the REQ-REP socket that does the start-of-day synchronization, and
> 10.16.0.11 on the PUB-SUB socket that actually carries the data), and thereby
> fails to receive the security rules.
>
> This sounds believable to me, given the comment below about floating IPs
> that move around. Does this sound plausible to you, Nick?
>
> You could probably confirm by looking at the ACL Manager's connections with
> netstat or similar while the VMs are improperly programmed. We'll have a
> think about how we can stop this from happening tomorrow, but in the
> meantime it would be helpful if you could confirm the above hypothesis.
>
> There's one other thing I'd like to pull out that might be important: the
> neutron-server on 10.16.0.13 which received the start-of-day query failed
> to query the DB (connection refused) and was never heard from again after
> 17:55:11.9. I think this is OK - if the ACL Manager wasn't confused, it
> should have been able to restart itself and get the right state - but I
> felt I should flag it.
>
> I've put a detailed breakdown of what I think the important points in the
> log are in the GitHub issue Cory raised
> (https://github.com/Metaswitch/calico/issues/255).
>
> Thanks,
>
> Matt
>
> > -----Original Message-----
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of Nick Bartos
> > Sent: 09 March 2015 16:10
> > To: Cory Benfield
> > Cc: [email protected]
> > Subject: Re: [Calico] Intermittent broken iptables rules
> >
> > Thanks!
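One way to avoid the split Matt describes - the REQ-REP and PUB-SUB channels landing on different plugins - is to resolve the plugin's (possibly floating) address once and derive both ZeroMQ endpoint URLs from that single resolution, so the two channels cannot diverge. A minimal sketch; the function name and port numbers are illustrative, not Calico's actual values:

```python
import socket

def plugin_endpoints(plugin_host, req_port=9901, sub_port=9902):
    """Resolve the plugin's address exactly once and build both
    ZeroMQ endpoint URLs from the same resolved IP, so the
    start-of-day REQ-REP channel and the PUB-SUB data channel
    always point at the same host."""
    ip = socket.gethostbyname(plugin_host)
    return ("tcp://%s:%d" % (ip, req_port),
            "tcp://%s:%d" % (ip, sub_port))
```

Confirming the hypothesis with `netstat -tnp` (or `ss -tnp`) on the ACL manager host should likewise show both established connections targeting the same plugin IP.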
> > As far as I can tell, this is the last show-stopper bug we
> > have.
> >
> > On Mon, Mar 9, 2015 at 2:28 AM, Cory Benfield
> > <[email protected]> wrote:
> >
> > > On Sat, Mar 07, 2015 at 18:06:35, Nick Bartos wrote:
> > > > Yup, that did it. Here is a log of all cluster nodes during the
> > > > test and ACL manager restart:
> > > >
> > > > https://gist.github.com/nbartos/5d5651bef6f0bed2ddd4/raw/31bbab8577556f7b131f408e986c16788d2cf9d6/acl-manager-restart-cmessages.xz
> > > >
> > > > Note that part of the node evacuation test does end up restarting
> > > > the ACL manager on another node (with a floating IP address that
> > > > migrates with it). I wonder if, when the ACL manager comes up
> > > > after a restart on another node, some other service it needs was
> > > > also being migrated at the same time and was not yet available.
> > >
> > > It's certainly possible.
> > >
> > > I'll dive into the code today and see if I can find anything that
> > > matches your symptoms and logs.
> > >
> > > Cory
_______________________________________________
Calico mailing list
[email protected]
http://lists.projectcalico.org/listinfo/calico
