On man, 2018-01-22 at 15:54 +0100, Anders K. Pedersen | Cohaesio wrote:
> On tor, 2018-01-11 at 10:18 -0800, Wei Wang wrote:
> > On Thu, Jan 11, 2018 at 9:25 AM, Anders K. Pedersen | Cohaesio
> > <a...@cohaesio.com> wrote:
> > > On tir, 2017-12-26 at 12:05 +0100, Anders K. Pedersen | Cohaesio
> > > wrote:
> > > > Hello,
> > > > 
> > > > On one of our border routers, Netfilter is occasionally logging
> > > > packets
> > > > with "OUT=lo" (output interface lo) even though the packet should
> > > > be
> > > > going out via a regular interface. This behavior is present on
> > > > Linux
> > > > 4.13.0 to 4.14.9, and a bisection of the problem points to
> > > > 
> > > > [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call
> > > > dst_dev_put() properly
> > > > 
> > > > as the first bad commit. This commit adds dst_dev_put() calls
> > > > before
> > > > some dst_release() calls, and dst_dev_put() does
> > > > 
> > > >     dst->dev = dev_net(dst->dev)->loopback_dev;
> > > > 
> > > > (among other things), which fits the problem we're seeing.
> > > > 
> > > > The essential part of our nftables rule set that shows this
> > > > behavior is
> > > > 
> > > > chain forward {
> > > >                 type filter hook forward priority 0;
> > > > 
> > > >                 meta oif { $internal_interfaces } accept
> > > > 
> > > >                 meta oif lo ip daddr != 127.0.0.0/8 \
> > > >                         log group 0 snaplen 80 prefix "oif-lo"
> > > > counter
> > > > 
> > > >                 ip saddr { $our_ip_series } \
> > > >                         flow table acct_out \
> > > >                         { meta oif . rt nexthop . ip saddr
> > > > timeout 12m counter } \
> > > >                         accept
> > > > 
> > > >                 log group 0 snaplen 80 prefix "DROP" counter drop
> > > > }
> > > > 
> > > > The router only does stateless packet filtering and no
> > > > redirection or
> > > > rewriting of the packets (connection tracking, NAT, ipvs etc. are
> > > > not
> > > > even compiled for this kernel).
> > > > 
> > > > As a result of this problem we see packets that should be going
> > > > to an
> > > > internal interface (and thus accepted by the first rule above)
> > > > being
> > > > logged and dropped by the last rule. Some examples:
> > > > 
> > > > Dec 22 11:57:02 cix4 oif-lo IN=eth10 OUT=lo
> > > > MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118
> > > > DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF
> > > > PROTO=TCP SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705
> > > > WINDOW=3295 ACK URGP=0 MARK=0
> > > > Dec 22 11:57:02 cix4 DROP IN=eth10 OUT=lo
> > > > MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118
> > > > DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF
> > > > PROTO=TCP SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705
> > > > WINDOW=3295 ACK URGP=0 MARK=0
> > > > 
> > > > Dec 22 12:47:07 cix4 oif-lo IN=eth10 OUT=lo
> > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18
> > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF
> > > > PROTO=TCP SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103
> > > > WINDOW=65535 ACK PSH URGP=0 MARK=0
> > > > Dec 22 12:47:07 cix4 DROP IN=eth10 OUT=lo
> > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18
> > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF
> > > > PROTO=TCP SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103
> > > > WINDOW=65535 ACK PSH URGP=0 MARK=0
> > > > 
> > > > Dec 22 12:53:56 cix4 oif-lo IN=eth10 OUT=lo
> > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34
> > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF
> > > > PROTO=TCP SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612
> > > > WINDOW=65535 ACK PSH URGP=0 MARK=0
> > > > Dec 22 12:53:56 cix4 DROP IN=eth10 OUT=lo
> > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34
> > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF
> > > > PROTO=TCP SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612
> > > > WINDOW=65535 ACK PSH URGP=0 MARK=0
> > > > 
> > > > It also happens for outbound traffic, where the packets are
> > > > logged and
> > > > counted in the acct_out flow table with "meta oif" = "lo", but a
> > > > correct "rt nexthop" - an example:
> > > > 
> > > > Dec 22 12:29:13 cix4 oif-lo IN=team0.20 OUT=lo
> > > > MAC=3c:fd:fe:15:db:a8:00:24:a8:ff:f0:00:08:00 SRC=212.97.129.25
> > > > DST=95.166.119.129 LEN=40 TOS=00 PREC=0x00 TTL=62 ID=19481 DF
> > > > PROTO=TCP SPT=443 DPT=52560 SEQ=3034827396 ACK=2862814901
> > > > WINDOW=12618 ACK URGP=0 MARK=0
> > > > 
> > > > # nft list flow table filter acct_out|tr ',' '\n'|grep lo
> > > >         flow table acct_out {
> > > >  "lo" . 94.101.208.217 . 212.97.129.25 expires 3m17s : counter
> > > > packets 1 bytes 40
> > > > 
> > > > I don't know if these packets are actually sent out on the
> > > > correct
> > > > outbound interface thanks to the proper nexthop (the MAC=
> > > > information
> > > > in the Netfilter log is from the received packet and thus not
> > > > useful
> > > > here).
> > > > 
> > > > I tried running a tcpdump on the lo interface to see if these
> > > > packets
> > > > would show up there, but during the three days I had it running,
> > > > it
> > > > only logged one such packet, while Netfilter logs 20+ outbound
> > > > packets
> > > > every day, and the one packet logged by tcpdump was *not* logged
> > > > by
> > > > Netfilter.
> > > 
> > > Further testing of the individual parts of the first bad commit
> > > shows
> > > that the five first additions of the dst_dev_put() call doesn't
> > > trigger
> > > the problem, while the last one does (also without the first five),
> > > so
> > > the problematic part is:
> > > 
> > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > index 3dee004..d986d80 100644
> > > --- a/net/ipv4/route.c
> > > +++ b/net/ipv4/route.c
> > > @@ -1369,6 +1372,7 @@ static bool rt_cache_route(struct fib_nh *nh,
> > > struct rtable *rt)
> > >         prev = cmpxchg(p, orig, rt);
> > >         if (prev == orig) {
> > >                 if (orig) {
> > > +                       dst_dev_put(&orig->dst);
> > >                         dst_release(&orig->dst);
> > >                         rt_free(orig);
> > >                 }
> > > 
> > 
> > Hi Anders,
> > 
> > First of all, the bad commit you pointed to is part of the effort to
> > remove dst garbage collector. And it is expected that certain routing
> > behavior will change after this whole patch set.
> > In your specific case:
> > Specifically for the one line change you pointed above,
> > rt_cache_route() is called to replace an existing cached route with a
> > new one because the existing cached route is already invalid. And the
> > criteria to determine if it is valid is in rt_cache_valid(). If a
> > route becomes invalid, it means the route is either being deleted, or
> > a new route is being inserted to the same table, or there is certain
> > netdev event happening that could make this route unusable.
> 
> Hi Wei,
> 
> I looked into this and activated debug logging for all route changes
> from the Quagga/zebra routing daemons, and none of the logged packets
> with OUT=lo happened within a minute of a route change for a prefix
> that covered the destination IP in the packet.
> 
> Also, some of the logged packets with OUT=lo are for destination IPs
> that are on the subnet of a directly connected interface on the router,
> so there are no dynamic route changes and there haven't been any link
> changes either.
> 
> Which other "certain netdev event" could cause this?
> 
> >  In all
> > cases, a route re-lookup is required. dst_dev_put() here does cleanup
> > work to free up the device refcount and all packets using this route
> > will be dropped. And as you said, you don't really see the packets
> > going out of lo.
> 
> I wrote that we don't see the same packets being logged *and* show up
> in a "tcpdump -i lo", but we do see packets in the tcpdump that should
> have been routed out a regular interface - it's just not the same
> packets that netfilter is logging (and it happens less frequently). It
> looks like it depends on how far the packet is in the path through the
> kernel, when dst_dev_put() is called.

I spent some more time testing this in order to understand the scenario
that causes this. It turns out that the packets being dropped due to
OUT=lo only happen for the __mkroute_input() -> rt_set_nexthop() ->
rt_cache_route() -> dst_dev_put() call chain, when rt_cache_valid()
returns false due to rt_is_expired(), which compares rth->rt_genid
with rt_genid_ipv4(dev_net(rth->dst.dev)).

The latter is bumped every time *any* fib route is changed (and in
other cases), which thus invalidates everything in the route cache. On
a router with full BGP feeds that happens quite frequently, and I
believe the scenario we're seeing is:

- Packet A received for destination X is assigned rth R

- Some route in the FIB is added, deleted, or changed, so rt_genid is
bumped

- Packet B received for destination X before packet A has been
forwarded

- __mkroute_input() -> rt_cache_valid() -> rt_is_expired() determines
that rth R is no longer valid

- A new rth S is created and rth R is invalidated via __mkroute_input()
-> rt_set_nexthop() -> rt_cache_route() -> dst_dev_put()

- Packet A now has rth R with dst->dev = dst->output = dst_discard_out
and dev_net(dst->dev)->loopback_dev as it moves through the kernel
forwarding (and Netfilter) logic, so it gets dropped

To avoid this I developed the patch below, which updates rth->rt_genid
to the current value of rt_genid_ipv4(dev_net(rth->dst.dev)), if the
information in it is the same as what would be put into a new rth. This
avoids invalidating the old rth and allocating a new one with the same
information, and packet A from the scenario above is no longer dropped.

The patch has a lot of net_warn_ratelimited() for debugging, but none
of them have been triggered in my testing. I also revived and abused
the in_hit counter to validate that the code path is being hit. After a
few days with the patch there haven't been any more packet drops, and
we have the following statistics:

# rtstat -k entries,in_hit,in_slow_tot,in_slow_mc,in_no_route
rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|
 entries|  in_hit|in_slow_|in_slow_|in_no_ro|
        |        |     tot|      mc|     ute|
    1015| 2305460| 1177662|  758764| 1080447|

Would a cleaned up (i.e. without the warnings and counter) version of
this patch be appropriate for submission, or are there cases where this
would break things?

--- linux-4.14.18/include/net/route.h.orig      2017-11-12 19:46:13.000000000 
+0100
+++ linux-4.14.18/include/net/route.h   2018-02-11 11:20:34.133788420 +0100
@@ -96,6 +96,7 @@
 };
 
 struct rt_cache_stat {
+        unsigned int in_hit;
         unsigned int in_slow_tot;
         unsigned int in_slow_mc;
         unsigned int in_no_route;
--- linux-4.14.18/net/ipv4/route.c.orig 2018-02-11 11:12:41.801093116 +0100
+++ linux-4.14.18/net/ipv4/route.c      2018-02-14 07:27:54.290124737 +0100
@@ -295,7 +295,7 @@
        seq_printf(seq,"%08x  %08x %08x %08x %08x %08x %08x %08x "
                   " %08x %08x %08x %08x %08x %08x %08x %08x %08x \n",
                   dst_entries_get_slow(&ipv4_dst_ops),
-                  0, /* st->in_hit */
+                  st->in_hit, /* 0, / * st->in_hit */
                   st->in_slow_tot,
                   st->in_slow_mc,
                   st->in_no_route,
@@ -1668,6 +1668,7 @@
 {
        struct fib_nh_exception *fnhe;
        struct rtable *rth;
+       struct fib_nh *nh;
        int err;
        struct in_device *out_dev;
        bool do_cache;
@@ -1713,23 +1714,110 @@
 
        fnhe = find_exception(&FIB_RES_NH(*res), daddr);
        if (do_cache) {
+               nh = &FIB_RES_NH(*res);
                if (fnhe) {
                        rth = rcu_dereference(fnhe->fnhe_rth_input);
                        if (rth && rth->dst.expires &&
                            time_after(jiffies, rth->dst.expires)) {
-                               ip_del_fnhe(&FIB_RES_NH(*res), daddr);
+                               ip_del_fnhe(nh, daddr);
                                fnhe = NULL;
                        } else {
                                goto rt_cache;
                        }
                }
 
-               rth = rcu_dereference(FIB_RES_NH(*res).nh_rth_input);
+               rth = rcu_dereference(nh->nh_rth_input);
 
 rt_cache:
                if (rt_cache_valid(rth)) {
                        skb_dst_set_noref(skb, &rth->dst);
                        goto out;
+               } else if (rth &&
+                          rth->dst.obsolete == DST_OBSOLETE_FORCE_CHK &&
+                          rth->dst.error == 0 &&
+                          rth->dst.dev == out_dev->dev &&
+                          rth->dst.ops == &ipv4_dst_ops &&
+#ifdef CONFIG_XFRM
+                          rth->dst.xfrm == NULL &&
+#endif
+                          rth->dst.flags ==
+                               ((IN_DEV_CONF_GET(in_dev, NOPOLICY) ?
+                                       DST_NOPOLICY : 0) |
+                                (IN_DEV_CONF_GET(out_dev, NOXFRM) ?
+                                       DST_NOXFRM : 0)) &&
+                          rth->dst.lwtstate == nh->nh_lwtstate &&
+                          rth->rt_type == res->type &&
+                          rth->rt_table_id ==
+                               (res->table ? res->table->tb_id : 0) &&
+                          rth->rt_gateway ==
+                               ((nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) ?
+                                nh->nh_gw : 0) &&
+                          rth->rt_uses_gateway ==
+                               ((nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) ?
+                                1 : 0)) {
+                       rth->rt_genid = rt_genid_ipv4(dev_net(rth->dst.dev));
+                       skb_dst_set_noref(skb, &rth->dst);
+                       RT_CACHE_STAT_INC(in_hit);
+                       if (rth->dst.header_len != 0)
+                               net_warn_ratelimited("rth->dst.header_len (%x) 
!= 0\n", rth->dst.header_len);
+                       if (rth->dst.trailer_len != 0)
+                               net_warn_ratelimited("rth->dst.trailer_len (%x) 
!= 0\n", rth->dst.trailer_len);
+                       if (rth->rt_is_input != 1)
+                               net_warn_ratelimited("rth->rt_is_input != 1\n");
+                       if (rth->rt_flags != 0)
+                               net_warn_ratelimited("rth->rt_flags (%x) != 
0\n", rth->rt_flags);
+                       if (rth->rt_iif != 0)
+                               net_warn_ratelimited("rth->rt_iif (%x) != 0\n", 
rth->rt_iif);
+                       if (rth->rt_pmtu != 0)
+                               net_warn_ratelimited("rth->rt_pmtu (%x) != 
0\n", rth->rt_pmtu);
+                       if (rth->dst.output != ip_output)
+                               net_warn_ratelimited("rth->dst.output != 
ip_output\n");
+                       if (rth->dst.input != ip_forward)
+                               net_warn_ratelimited("rth->dst.input != 
ip_forward\n");
+                       goto out;
+               } else if (rth &&
+                          rth->dst.obsolete == DST_OBSOLETE_FORCE_CHK) {
+                       if (rth->dst.error != 0)
+                               net_warn_ratelimited("rth->dst.error (%x) != 
0\n", rth->dst.error);
+                       if (rth->dst.dev != out_dev->dev)
+                               net_warn_ratelimited("mkroute_input: 
rth->dst.dev != out_dev->dev\n");
+                       if (rth->dst.ops != &ipv4_dst_ops)
+                               net_warn_ratelimited("mkroute_input: 
rth->dst.ops != &ipv4_dst_ops\n");
+#ifdef CONFIG_XFRM
+                       if (rth->dst.xfrm != NULL)
+                               net_warn_ratelimited("mkroute_input: 
rth->dst.xfrm != NULL\n");
+#endif
+                       if (rth->dst.flags !=
+                               ((IN_DEV_CONF_GET(in_dev, NOPOLICY) ?
+                                       DST_NOPOLICY : 0) |
+                               (IN_DEV_CONF_GET(out_dev, NOXFRM) ?
+                                       DST_NOXFRM : 0)))
+                               net_warn_ratelimited("rth->dst.flags (%x) != 
%x\n",
+                                       rth->dst.flags,
+                                       (IN_DEV_CONF_GET(in_dev, NOPOLICY) ?
+                                               DST_NOPOLICY : 0) |
+                                       (IN_DEV_CONF_GET(out_dev, NOXFRM) ?
+                                               DST_NOXFRM : 0));
+                       if (rth->rt_type != res->type)
+                               net_warn_ratelimited("rth->rt_type (%x) != 
res->type (%x)\n",
+                                       rth->rt_type, res->type);
+                       if (rth->rt_table_id !=
+                               (res->table ? res->table->tb_id : 0))
+                               net_warn_ratelimited("rth->rt_table_id (%x) != 
(%x)\n",
+                                       rth->rt_table_id, res->table ? 
res->table->tb_id : 0);
+                       if (rth->rt_gateway !=
+                               ((nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) ?
+                                nh->nh_gw : 0))
+                               net_warn_ratelimited("rth->rt_gateway (%x) != 
(%x)\n",
+                                       rth->rt_gateway, (nh->nh_gw && 
nh->nh_scope == RT_SCOPE_LINK) ? nh->nh_gw : 0);
+                       if (rth->rt_uses_gateway !=
+                               ((nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) ?
+                                1 : 0))
+                               net_warn_ratelimited("rth->rt_uses_gateway (%x) 
!= (%x)\n",
+                                       rth->rt_uses_gateway, (nh->nh_gw && 
nh->nh_scope == RT_SCOPE_LINK) ? 1 : 0);
+                       if (rth->dst.lwtstate != nh->nh_lwtstate)
+                               net_warn_ratelimited("rth->dst.lwtstate (%x) != 
nh->nh_lwtstate (%x)\n",
+                                       rth->dst.lwtstate, nh->nh_lwtstate);
                }
        }
 

Regards,
Anders K. Pedersen

Reply via email to