Hi Christian,

Thanks for your review and for trying out the patch. Some responses inline.

On Thu, Nov 19, 2015 at 12:42 PM, Christian Franke <
[email protected]> wrote:

>
> While the code looks pretty clean in general, I skimmed over it and took
> it for a spin and came up with some questions. The functional ones are
> up here, the code-related ones are inline below.
>
> For my tests, I used a very simple topology:
>
>      ___
>     /   \
> [R1]-----[R2]----[R3]
>
> R1 and R2 are in the same AS running OSPF and OSPFv3 over both links
> between them and having an iBGP session between their loopback IPs. R3
> is in a different AS and has an eBGP session to R2. The peering network
> between R2 and R3 is announced into the IGP so that R1 can resolve the
> routes originated from R3 recursively.
>
> With this topology, I see the following in R1's BGP log every minute:
>
> BGP: Performing BGP general scanning
> BGP: scanning IPv4 Unicast routing tables
> BGP: scanning IPv6 Unicast routing tables
> BGP: Zebra send: IPv4 route add 10.10.0.0/16 nexthop 100.100.253.3
> metric 0 count 1
> BGP: Zebra send: IPv6 route add 3ffe::/16 nexthop 2001:db8:fffd::3 metric 0
>
> These are the networks originated from R3. For some reason, R1 sends
> them to zebra every minute whenever the BGP scan is run. This behavior
> differs from the one in Quagga 0.99.24.1 where bgpd doesn't send
> anything to zebra in the steady state during/after bgp scan.
>
> This is a real issue imho since it:
> a) causes massive load on zebra - it will possibly receive a new
> full-table from bgpd on each scan in many common deployments
> b) causes load on netlink and packet drops - zebra doesn't currently do
> make-before-break/atomic updates
>
> This behavior is the same directly after this patch and for the top of
> the take-2 branch.
>

With BGP nexthop tracking, the scan thread should be completely eliminated.
The networks that BGP originates through 'network' configuration statements
also use the same mechanism. There is probably something amiss with the
patch; we're looking into it.


>
> Also - this is only a minor issue - the 'show ip bgp nexthop' command
> seems to have a problem regarding determination of nexthop types. This
> was fixed in the later commit 394b230ac36bce4ba38d494dcf550b4cc03dc3bf
> "bgpd: bug in nexthop display". Since the bgp_nexthop.c wasn't touched
> by much in the meantime it should be straightforward to squash those two
> together.
>
> Overall it works as advertised otherwise and I see snappy updates to IGP
> changes. I think once the issue regarding the BGP scan and the memory
> leak I described below are resolved, this direly needed functionality
> can and should be added to mainline Quagga.
>

Thank you. We also saw the existing method of periodic scanning for nexthop
reachability (sending a message to zebra and waiting for the response) as a
significant impediment to improved scaling and performance of BGP.


> -Christian
>
>
> > +
> > +  rn = bgp_node_lookup(bgp_nexthop_cache_table[family2afi(p.family)],
> &p);
> > +  if (!rn || !rn->info)
> > +    {
> > +      if (BGP_DEBUG(nht, NHT))
> > +     {
> > +       char buf[INET6_ADDRSTRLEN];
> > +       prefix2str(&p, buf, INET6_ADDRSTRLEN);
> > +       zlog_debug("parse nexthop update(%s): rn not found", buf);
>
> rn should be unlocked here if rn && !rn->info.
>
> > +     }
> > +      return;
> > +    }
> > +
>
> > +      bnc->flags &= ~BGP_NEXTHOP_VALID;
> > +      bnc_nexthop_free(bnc);
> > +      bnc->nexthop = NULL;
> > +    }
> > +
> > +  evaluate_paths(bnc);
>
> Where is rn unlocked? I looked at the lock count and it appears to me
> like the reference is incremented by 1 on each received nexthop_update.
>

These two issues you pointed out have been addressed in a subsequent patch
that Donald is including in his 'take X' branch.


> > diff --git a/zebra/zebra_rib.c b/zebra/zebra_rib.c
> > +/*
> > + * All meta queues have been processed. Trigger next-hop evaluation.
> > + */
> > +static void
> > +meta_queue_process_complete (struct work_queue *dummy)
> > +{
> > +  zebra_evaluate_rnh_table(0, AF_INET);
> > +#ifdef HAVE_IPV6
> > +  zebra_evaluate_rnh_table(0, AF_INET6);
> > +#endif /* HAVE_IPV6 */
> > +}
>
> What are the guarantees that zebra_evaluate_rnh_table will run in a
> timely fashion? Will the meta_queue definitively get empty at some point
> even in the face of high update churn?
>

In the face of very high route-processing churn in zebra, it may take a
while for the meta queue to drain. We didn't consider it worthwhile to
start evaluating nexthops, and potentially add to that churn, before one
round of processing completes. We haven't noticed any convergence issues
in testing with a few hundred peers and a large number of routes.


>
> > +++ b/zebra/zebra_rnh.c
> >
> > +#define lookup_rnh_table(v, f)                        \
> > +({                                            \
> > +  struct zebra_vrf *zvrf;                        \
> > +  struct route_table *t = NULL;                  \
> > +  zvrf = zebra_vrf_lookup(v);                    \
> > +  if (zvrf)                                      \
> > +    t = zvrf->rnh_table[family2afi(f)];               \
> > +  t;                                             \
> > +})
>
> Is there really a point to make this a macro? I would expect any
> reasonable compiler to inline this code anyway if it was declared as a
> static function and inlining made sense performance-wise.
>
> > +struct rnh *
> > +zebra_add_rnh (struct prefix *p, vrf_id_t vrfid)
> > +{
> > +  struct route_table *table;
> > +  struct route_node *rn;
> > +  struct rnh *rnh = NULL;
> > +
> > +  if (IS_ZEBRA_DEBUG_NHT)
> > +    {
> > +      char buf[INET6_ADDRSTRLEN];
> > +      prefix2str(p, buf, INET6_ADDRSTRLEN);
>
> INET6_ADDRSTRLEN is not really a great value to use here since
> prefix2str writes address+prefixlen into the returned string. Maybe it
> would be good to have a generic constant for this, since it seems like
> the buffersize used for prefix2str varies widely throughout Quagga, so
> obviously people are not really clear on what buffersize to use with
> prefix2str.
>
> > +int
> > +zebra_evaluate_rnh_table (vrf_id_t vrfid, int family)
> > +{
> > +  struct route_table *ptable;
> > +  struct route_table *ntable;
> > +  struct route_node *prn;
> > +  struct route_node *nrn;
> > +  struct rnh *rnh;
> > +  struct zserv *client;
> > +  struct listnode *node;
> > +  struct rib *rib;
> > +
> > +  ntable = lookup_rnh_table(vrfid, family);
> > +  if (!ntable)
> > +    {
> > +      zlog_debug("evaluate_rnh_table: rnh table not found\n");
> > +      return -1;
> > +    }
> > +
> > +  ptable = zebra_vrf_table(family2afi(family), SAFI_UNICAST, vrfid);
> > +  if (!ptable)
> > +    {
> > +      zlog_debug("evaluate_rnh_table: prefix table not found\n");
> > +      return -1;
> > +    }
> > +
> > +  for (nrn = route_top (ntable); nrn; nrn = route_next (nrn))
> > +    {
> > +      if (!nrn->info)
> > +       continue;
> > +
> > +      prn = route_node_match(ptable, &nrn->p);
> > +      if (!prn)
> > +     rib = NULL;
> > +      else
> > +     {
> > +       RNODE_FOREACH_RIB(prn, rib)
> > +         {
> > +           if (CHECK_FLAG (rib->status, RIB_ENTRY_REMOVED))
> > +             continue;
> > +           if (CHECK_FLAG (rib->flags, ZEBRA_FLAG_SELECTED))
> > +             break;
> > +         }
> > +     }
>
> Shouldn't this have similar semantics like the resolution in
> nexthop_active_ipv4/_ipv6 in zebra_rib.c? E.g. not matching against BGP
> routes and progressing upwards to less specific routes if the result of
> the longest prefix match has all its ribs with RIB_ENTRY_REMOVED and/or
> w/o ZEBRA_FLAG_SELECTED? Not matching against BGP here is obviously
> problematic because it is what is usually needed for iBGP, but not
> necessarily what is needed e.g. for PIM, I think.
>
> Looking at e.g. this (admittedly somewhat synthetic) case:
>
> # show ip route
> B>  10.10.0.0/16 [200/0] via 100.100.253.3 (recursive), 00:00:33
>   *                        via 100.100.254.2, veth0, 00:00:33
>   *                        via 100.100.252.2, veth2, 00:00:33
> O>* 100.100.253.0/24 [110/20] via 100.100.254.2, veth0, 00:01:52
>   *                           via 100.100.252.2, veth2, 00:01:52
> S   100.100.253.3/32 [1/0] via 192.168.1.1 inactive
>
> It works absolutely fine with 0.99.24.1 while bgpd won't send the
> 10.10.0.0/16 route down to zebra with this patch applied, because zebra
> claims that the nexthop is not usable.
>

On the first point, there are scenarios where a BGP nexthop can be resolved
over a BGP route; one example is BGP labeled-unicast.
draft-ietf-rtgwg-bgp-routing-large-dc mentions a use case too.

On the second point, it is a somewhat artificial scenario, as you say. It
is possible that the right change is to walk up the tree as done in
zebra_rib.c. We'll get back to you on this.


>
> > +int
> > +zebra_dispatch_rnh_table (vrf_id_t vrfid, int family, struct zserv
> *client)
>
> What is this used for? I don't see any call to it.
>
> > diff --git a/zebra/zserv.c b/zebra/zserv.c
> >
> > +/* Nexthop register */
> > +static int
> > +zserv_nexthop_register (struct zserv *client, int sock, u_short length,
> vrf_id_t vrf_id)
> > +{
> > +  struct rnh *rnh;
> > +  struct stream *s;
> > +  struct prefix p;
> > +  u_short l = 0;
> > +
> > +  if (IS_ZEBRA_DEBUG_NHT)
> > +    zlog_debug("nexthop_register msg from client %s: length=%d\n",
> > +            zebra_route_string(client->proto), length);
> > +
> > +  s = client->ibuf;
> > +
> > +  while (l < length)
> > +    {
> > +      p.family = stream_getw(s);
> > +      p.prefixlen = stream_getc(s);
> > +      l += 3;
> > +      stream_get(&p.u.prefix, s, PSIZE(p.prefixlen));
> > +      l += PSIZE(p.prefixlen);
> > +      rnh = zebra_add_rnh(&p, 0);
> > +      zebra_add_rnh_client(rnh, client, vrf_id);
>
> What are the semantics here with respect to the vrf_id? It seems that
> the rnh_table is only maintained for vrf 0, at least 0 is hardcoded in
> most places.
>
> Also, it appears to me that a client that sends a nexthop_register with
> e.g. vrf_id == 1 will receive one nexthop_update with vrf_id == 1
> triggered from the zerba_add_rnh_client(..., vrf_id). This update does
> possibly not list any resolving nexthops since it is sent before any
> call to zebra_evaluate_rnh_table. From there on, any update will be sent
> with vrf_id == 0 since the updates are originated by the calls to
> zebra_evaluate_rnh_table which are always done with vrf_id 0.
>
> My expectation as a user of this API would be that if I send a
> nexthop_register with vrf_id == 1, the lookup/tracking would be done in
> the rib of vrf 1.
>

Yes, with the support for VRFs in BGP that we're working on, the nexthop
tracking will be per VRF. Of course, Quagga currently does not have VRF
support other than in Zebra, but that's what we're adding in BGP right now,
hence some of the other mails on that topic.


> _______________________________________________
> Quagga-dev mailing list
> [email protected]
> https://lists.quagga.net/mailman/listinfo/quagga-dev
>