Hi Nick,

OK, thanks for the clarification.  I've made a fix to the mechanism driver to 
ensure that ACL Manager doesn't think the start of day audit is complete when 
it hasn't succeeded - it's available in the branch issue255sync.  I'm not 
sufficiently clear on exactly what's going on during the failover window to be 
sure it'll always fix this, but I do expect it would have worked in at least 
the set of diags I looked at (the most recent).


Please could you give this (https://github.com/Metaswitch/calico/pull/257) a 
spin, and let us know whether the issue is still present?


Thanks,

Matt

________________________________
From: Nick Bartos <[email protected]>
Sent: 09 March 2015 20:31
To: Matthew Dupre
Cc: Cory Benfield; [email protected]
Subject: Re: [Calico] Intermittent broken iptables rules

Like the ACL manager, neutron-server can bounce from one host to another.  
These services have floating IPs that move between hosts, so they can accept 
reconnects from other services.  At any one point in time, there will be only 
one ACL manager and one neutron-server process running.

As long as reconnects are done properly, then things should heal themselves 
automatically when the service moves from one host to another.

On Mon, Mar 9, 2015 at 1:24 PM, Matthew Dupre <[email protected]> wrote:
Good bug that.

I took a look through this latest set of diags, and I think I've identified the 
problem.  In short, it looks like ACL Manager (10.16.0.13, started at 17:55:09) 
is connecting to two different plugins (10.16.0.13 on the REQ-REP socket that 
does the start of day synchronization, and 10.16.0.11 on the PUB-SUB that 
actually carries the data), and thereby fails to receive the security rules.

This sounds believable to me, given the comment below about floating IPs that 
move around.  Does this sound plausible to you Nick?

You could probably confirm by looking at ACL Manager's connections with netstat 
or similar while the VMs are improperly programmed.  Tomorrow we'll have a 
think about how we can stop this from happening, but in the meantime it would 
be helpful if you could confirm the above hypothesis.
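
A minimal sketch of that netstat check, in case it helps.  It parses saved 
`netstat -tn` output and reports which remote IPs ACL Manager is connected to 
on the plugin ports.  The port numbers (9901, 9902), the function names and the 
sample rows are placeholders made up for illustration, not the real Calico 
values; substitute whatever ports the plugin actually listens on.

```python
from collections import defaultdict

# Hypothetical sample of `netstat -tn` output; ports 9901/9902 are placeholders.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.16.0.13:41000        10.16.0.13:9901         ESTABLISHED
tcp        0      0 10.16.0.13:41002        10.16.0.11:9902         ESTABLISHED
"""


def remote_peers_by_port(netstat_output, ports):
    """Map each remote port of interest to the set of remote IPs connected on it."""
    peers = defaultdict(set)
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows look like: proto recv-q send-q local remote state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "ESTABLISHED":
            continue
        ip, _, port = fields[4].rpartition(":")
        if port.isdigit() and int(port) in ports:
            peers[int(port)].add(ip)
    return dict(peers)


def split_across_hosts(peers):
    """True if the connections of interest go to more than one remote IP."""
    all_ips = set().union(*peers.values())
    return len(all_ips) > 1


peers = remote_peers_by_port(SAMPLE, {9901, 9902})
print(peers, split_across_hosts(peers))
```

If the two sockets show different remote IPs while the rules are wrong, that 
would support the hypothesis above.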

There's one other thing I'd like to pull out that might be important - the 
neutron-server on 10.16.0.13 which received the start of day query failed to 
query the DB (connection refused) and was not heard from again after 
17:55:11.9.  I think this is OK - if ACL Manager wasn't confused it should have 
been able to restart itself and get the right state, but I felt I should flag 
it.

I've put a detailed breakdown of what I think the important points in the log 
are in the github issue Cory raised 
(https://github.com/Metaswitch/calico/issues/255).

Thanks,

Matt

> -----Original Message-----
> From: [email protected] On Behalf Of Nick Bartos
> Sent: 09 March 2015 16:10
> To: Cory Benfield
> Cc: [email protected]
> Subject: Re: [Calico] Intermittent broken iptables rules
>
> Thanks!  As far as I can tell, this is the last show stopper bug we
> have.
>
> On Mon, Mar 9, 2015 at 2:28 AM, Cory Benfield <[email protected]> wrote:
>
> > On Sat, Mar 07, 2015 at 18:06:35, Nick Bartos wrote:
> > > Yup, that did it.  Here is a log of all cluster nodes during the
> > > test and acl manager restart:
> > >
> > > https://gist.github.com/nbartos/5d5651bef6f0bed2ddd4/raw/31bbab8577556f7b131f408e986c16788d2cf9d6/acl-manager-restart-cmessages.xz
> > >
> > >
> > > Note that part of the node evacuation test does end up restarting
> > > the acl manager on another node (with a floating IP address that
> > > migrates with it).  I wonder if perhaps when the acl manager comes
> > > up after a restart on another node, some other service it needs
> > > was also getting migrated at the same time and is not available or
> > > something.
> >
> > It's certainly possible.
> >
> > I'll dive into the code today and see if I can find anything that
> > matches your symptoms and logs.
> >
> > Cory
> >
> _______________________________________________
> Calico mailing list
> [email protected]
> http://lists.projectcalico.org/listinfo/calico
