The VMs are in different security groups, and none of the VMs are working. Would the VM in the icmp255 security group explain why the ones in the default security group (which I believe allows TCP 22 and 80 and ICMP -1 to 0.0.0.0/0) have problems as well?
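For reference, the mismatch Cory describes in the quoted message below (an ICMP rule that carries a 'port' value where Felix expects an icmp_type field) can be spotted with a few lines of Python. This is only a sketch: the function name is made up for illustration, the dict layout is copied from the ACLs quoted below, and it is not Felix's actual validation code.

```python
def find_icmp_rules_with_port(acls):
    """Return (direction, rule) pairs for ICMP rules that carry a 'port' value.

    Hypothetical helper, not part of Felix. It assumes the ACL layout shown
    in the quoted message: 'inbound' and 'outbound' lists of rule dicts.
    """
    flagged = []
    for direction in ("inbound", "outbound"):
        for rule in acls.get(direction, []):
            if rule.get("protocol") == "icmp" and rule.get("port") is not None:
                flagged.append((direction, rule))
    return flagged


# The ACLs from the quoted message, written as plain string literals.
acls = {
    "inbound": [
        {"protocol": "icmp", "cidr": "10.16.3.4/32", "group": None, "port": 255},
    ],
    "outbound": [
        {"protocol": None, "cidr": "0.0.0.0/0", "group": None, "port": None},
    ],
    "inbound_default": "deny",
    "outbound_default": "deny",
}

for direction, rule in find_icmp_rules_with_port(acls):
    print("%s ICMP rule has a 'port' field: %r" % (direction, rule))
```

With the ACLs above, only the single inbound ICMP rule is flagged; the outbound allow-all rule has no protocol and no port, so it passes.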
I believe the purpose of the icmp255 security group is to test that security groups are working by blocking access. I can look at how that test creates the group to see whether there is a way of specifying an ICMP type correctly. There has to be something more going on here, though; otherwise the failures would be consistent (since that security group is always used for one of the tests) instead of intermittent.

On Sat, Mar 7, 2015 at 8:09 AM, Cory Benfield <[email protected]> wrote:

> Right, so here are the ACLs Felix believes it has for one of your VMs,
> formatted vaguely cleanly:
>
> {
>     u'inbound': [
>         {u'protocol': u'icmp', u'cidr': u'10.16.3.4/32', u'group': None, u'port': 255}
>     ],
>     u'outbound': [
>         {u'protocol': None, u'cidr': u'0.0.0.0/0', u'group': None, u'port': None}
>     ],
>     u'inbound_default': u'deny',
>     u'outbound_default': u'deny'
> }
>
> This corresponds to the security group: allow all egress to anywhere,
> allow ICMP ingress from 10.16.3.4/32 to 'port' 255. I put quotes around
> 'port' because in the ICMP case the 'port' actually corresponds to the ICMP
> type.
>
> Felix later says:
>
> Invalid IPv4 inbound rule for 09f79ce5-a7 with port for protocol icmp :
> {u'protocol': u'icmp', u'cidr': u'10.16.3.4/32', u'group': None, u'port': 255}
>
> Felix doesn't expect messages of this form for ICMP: it expects that
> it'll get icmp_type fields instead. That's somewhat surprising, because I
> can't find any indication in the code that any other component ever sends
> such a field. This appears to simply be a bug to me. I'm tracking it under
> GitHub issue #248.
> ------------------------------
> *From:* Nick Bartos <[email protected]>
> *Sent:* 07 March 2015 15:35
> *To:* Cory Benfield
> *Cc:* [email protected]
> *Subject:* Re: [Calico] Intermittent broken iptables rules
>
> All of the VMs were broken in both runs (and at the time, all VMs were
> on a single hypervisor).
> The IPs for the VMs were all in the 10.x.3.x/24 range.
>
> On Sat, Mar 7, 2015 at 2:04 AM, Cory Benfield <[email protected]> wrote:
>
>> Nick,
>>
>> So that I can try to filter to the relevant logs, what's the IP address
>> of the VM that got broken?
>>
>> Cory
>>
>> ________________________________________
>> From: [email protected] <[email protected]> on behalf of Nick Bartos <[email protected]>
>> Sent: 07 March 2015 00:11
>> To: [email protected]
>> Subject: Re: [Calico] Intermittent broken iptables rules
>>
>> I spoke too soon. I was able to get it to happen on 0.12.1 (seemingly
>> exact same problem as before, all VMs on the host are inaccessible).
>> Although it did take a lot more tries to get it to fail. Here is the diag
>> for that run:
>>
>> https://gist.github.com/nbartos/5d5651bef6f0bed2ddd4/raw/aaf82a08332d02d2b00fc473383b7be2174364b0/calico-0.12.1-2015-03-07_00-06-04.tar.xz
>>
>> On Fri, Mar 6, 2015 at 3:49 PM, Nick Bartos <[email protected]> wrote:
>>
>> > I'm attempting to recreate the problem with 0.12.1 (with the two
>> > "mech_calico: Handle updates that indicate port migration" patches), but so
>> > far I cannot.
>> >
>> > Since 0.13 was just released without any apparently relevant code changes
>> > since 5332761, it appears that this is a regression in 0.13 from 0.12.1.
>> >
>> > On Fri, Mar 6, 2015 at 2:19 PM, Nick Bartos <[email protected]> wrote:
>> >
>> >> A bit more information: I'm running calico/master (hash 5332761). Also,
>> >> the problem seems to actually be the ipset.
>> >>
>> >> This section of iptables:
>> >> -A felix-to-3b2c20ab-b7 -m conntrack --ctstate INVALID -j DROP
>> >> -A felix-to-3b2c20ab-b7 -m conntrack --ctstate RELATED,ESTABLISHED -j RETURN
>> >> -A felix-to-3b2c20ab-b7 -m set --match-set felix-to-port-3b2c20ab-b7 src,dst -j RETURN
>> >> -A felix-to-3b2c20ab-b7 -m set --match-set felix-to-addr-3b2c20ab-b7 src -j RETURN
>> >> -A felix-to-3b2c20ab-b7 -p icmp -m set --match-set felix-to-icmp-3b2c20ab-b7 src -j RETURN
>> >> -A felix-to-3b2c20ab-b7 -j DROP
>> >>
>> >> References these ipsets that do not have any members:
>> >>
>> >> Name: felix-to-port-3b2c20ab-b7
>> >> Type: hash:net,port
>> >> Revision: 5
>> >> Header: family inet hashsize 1024 maxelem 65536
>> >> Size in memory: 16760
>> >> References: 1
>> >> Members:
>> >>
>> >> Name: felix-to-addr-3b2c20ab-b7
>> >> Type: hash:net
>> >> Revision: 4
>> >> Header: family inet hashsize 1024 maxelem 65536
>> >> Size in memory: 16760
>> >> References: 1
>> >> Members:
>> >>
>> >> Name: felix-to-icmp-3b2c20ab-b7
>> >> Type: hash:net
>> >> Revision: 4
>> >> Header: family inet hashsize 1024 maxelem 65536
>> >> Size in memory: 16760
>> >> References: 1
>> >> Members:
>> >>
>> >> On Fri, Mar 6, 2015 at 2:10 PM, Nick Bartos <[email protected]> wrote:
>> >>
>> >>> Sometimes, after one of our tests that moves around various services and
>> >>> VMs, the iptables rules become broken such that VMs do not get traffic to
>> >>> them, and they cannot ping their gateway. This is completely intermittent,
>> >>> most of the time it does work.
>> >>>
>> >>> I verified it was an iptables problem by inserting allow rules for the
>> >>> VM's IP in both the INPUT and FORWARD chains (and in both source and dest
>> >>> directions) at the top of the filter table. Additionally, I verified that
>> >>> the VM had the correct IP and MAC that were referenced in the iptables
>> >>> rules.
>> >>>
>> >>> Here is the collection from diags.sh:
>> >>>
>> >>> https://gist.github.com/nbartos/5d5651bef6f0bed2ddd4/raw/b9881cf648c033b9d2a10dc62060a8243b1acfe7/10.16.0.8-has-vms-2015-03-06_21-45-20.tar.gz
>> >>>
>> >>> Additionally, here is another instance of the problem with the full
>> >>> cluster install logs included.
>> >>>
>> >>> https://gist.github.com/nbartos/5d5651bef6f0bed2ddd4/raw/042e34eebe2dc86f452f1f7be353c0e7b3520cfc/functional-test-3.7.14615-11071.log.xz
>> >>>
>> >>> I'm trying to figure out exactly what the problem is, but I'm not sure
>> >>> exactly what the different felix chains are supposed to represent yet.
>> >>>
>> _______________________________________________
>> Calico mailing list
>> [email protected]
>> http://lists.projectcalico.org/listinfo/calico
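
The empty-ipset symptom reported earlier in this thread (iptables rules matching against sets whose "Members:" section is blank) can be checked for mechanically. Below is a minimal sketch that scans text in the format shown in the thread and returns the names of sets with no members. The function name and the "example-nonempty" set are made up for illustration; the other two sets are copied from the thread. It only parses text and does not run ipset itself.

```python
def empty_ipsets(listing):
    """Return names of sets with no members in `ipset list`-style text.

    Hypothetical helper: it relies only on the "Name:" and "Members:"
    lines that appear in the listing quoted in this thread.
    """
    empty = []
    name = None
    in_members = False
    has_members = False
    for raw in listing.splitlines():
        line = raw.strip()
        if line.startswith("Name:"):
            # A new set begins; record the previous one if it had no members.
            if name is not None and not has_members:
                empty.append(name)
            name = line.split(":", 1)[1].strip()
            in_members = False
            has_members = False
        elif line == "Members:":
            in_members = True
        elif in_members and line:
            # Any non-blank line after "Members:" counts as a member entry.
            has_members = True
    if name is not None and not has_members:
        empty.append(name)
    return empty


SAMPLE = """\
Name: felix-to-port-3b2c20ab-b7
Type: hash:net,port
Revision: 5
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 16760
References: 1
Members:

Name: felix-to-addr-3b2c20ab-b7
Type: hash:net
Revision: 4
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 16760
References: 1
Members:

Name: example-nonempty
Type: hash:net
Members:
10.16.3.4
"""

print(empty_ipsets(SAMPLE))
```

On the sample text, the two felix sets from the thread are reported as empty and the made-up "example-nonempty" set is not.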
