On 11/13/2015 08:18 PM, Donald Sharp wrote:
> From: Pradosh Mohapatra <[email protected]>
>
> Add next hop tracking support to Quagga. Complete documentation in
doc/next-hop-tracking.txt.
>
> Signed-off-by: Pradosh Mohapatra <[email protected]>
> Signed-off-by: Daniel Walton <[email protected]>
> Signed-off-by: Dinesh Dutt <[email protected]>
> Signed-off-by: Donald Sharp <[email protected]>

I am glad to see that there is progress towards getting this
functionality into Quagga - I still remember that I was really
nonplussed to realize that Quagga did not have this when I started
working on the RIB code.

While the code looks pretty clean in general, I skimmed over it and took
it for a spin and came up with some questions. The functional ones are
up here, the code-related ones are inline below.

For my tests, I used a very simple topology:

     ___
    /   \
[R1]-----[R2]----[R3]

R1 and R2 are in the same AS running OSPF and OSPFv3 over both links
between them and having an iBGP session between their loopback IPs. R3
is in a different AS and has an eBGP session to R2. The peering network
between R2 and R3 is announced into the IGP so that R1 can resolve the
routes originated from R3 recursively.

With this topology, I see the following in R1's BGP log every minute:

BGP: Performing BGP general scanning
BGP: scanning IPv4 Unicast routing tables
BGP: scanning IPv6 Unicast routing tables
BGP: Zebra send: IPv4 route add 10.10.0.0/16 nexthop 100.100.253.3
metric 0 count 1
BGP: Zebra send: IPv6 route add 3ffe::/16 nexthop 2001:db8:fffd::3 metric 0

These are the networks originated from R3. For some reason, R1 sends
them to zebra every minute whenever the BGP scan is run. This behavior
differs from the one in Quagga 0.99.24.1 where bgpd doesn't send
anything to zebra in the steady state during/after bgp scan.

This is a real issue imho since it:
a) causes massive load on zebra - it will possibly receive a new
full-table from bgpd on each scan in many common deployments
b) causes load on netlink and packet drops - zebra doesn't currently do
make-before-break/atomic updates

This behavior is the same directly after this patch and for the top of
the take-2 branch.

Also - this is only a minor issue - the 'show ip bgp nexthop' command
seems to have a problem determining nexthop types. This was fixed in
the later commit 394b230ac36bce4ba38d494dcf550b4cc03dc3bf
"bgpd: bug in nexthop display". Since bgp_nexthop.c isn't touched much
in the meantime, it should be straightforward to squash those two
together.

Overall, it otherwise works as advertised, and I see snappy reactions
to IGP changes. I think once the issue regarding the BGP scan and the
memory leak I describe below are resolved, this direly needed
functionality can and should be added to mainline Quagga.

-Christian

> @@ -391,6 +364,7 @@ bgp_nexthop_lookup (afi_t afi, struct peer *peer, struct 
> bgp_info *ri,
>    return bnc->valid;
>  }
>  
> +#if BGP_SCAN_NEXTHOP

I would prefer if the code were just removed instead of adding
defines. With BGP_SCAN_NEXTHOP set, it doesn't compile, so I do not
really see the point of keeping it.

> +static int
> +show_ip_bgp_nexthop_table (struct vty *vty, int detail)
> +{
[...]
> +#ifdef HAVE_CLOCK_MONOTONIC
> +     tbuf = time(NULL) - (bgp_clock() - bnc->last_update);
> +     vty_out (vty, "  Last update: %s", ctime(&tbuf));
> +#else
> +     vty_out (vty, "  Last update: %s", ctime(&bnc->uptime));
> +#endif /* HAVE_CLOCK_MONOTONIC */

Having this here in bgpd feels somewhat weird. I thought that lib/
already abstracted this differentiation away, but maybe I am mistaken.

It also looks a bit like it would be possible to write a function that
dumps a nexthop_cache_table that can be called for IPv4 and IPv6.

> +++ b/bgpd/bgp_nht.c
> +void
> +bgp_parse_nexthop_update (void)
> +{
> +  struct stream *s;
> +  struct bgp_node *rn;
> +  struct bgp_nexthop_cache *bnc;
> +  struct nexthop *nexthop;
> +  struct nexthop *oldnh;
> +  struct nexthop *nhlist_head = NULL;
> +  struct nexthop *nhlist_tail = NULL;
> +  uint32_t metric;
> +  u_char nexthop_num;
> +  struct prefix p;
> +  int i;
> +
> +  s = zclient->ibuf;
> +
> +  memset(&p, 0, sizeof(struct prefix));
> +  p.family = stream_getw(s);
> +  p.prefixlen = stream_getc(s);
> +  switch (p.family)
> +    {
> +    case AF_INET:
> +      p.u.prefix4.s_addr = stream_get_ipv4 (s);
> +      break;
> +    case AF_INET6:
> +      stream_get(&p.u.prefix6, s, 16);
> +      break;
> +    default:
> +      break;
> +    }

The send_client function in zebra/zebra_rnh.c will send PSIZE(prefixlen)
bytes of the prefix while this function always reads 4/16 bytes
depending on the family. bgpd currently only registers /32 or /128 so
this is not really a major issue, but it is an inconsistency.

> +
> +  rn = bgp_node_lookup(bgp_nexthop_cache_table[family2afi(p.family)], &p);
> +  if (!rn || !rn->info)
> +    {
> +      if (BGP_DEBUG(nht, NHT))
> +     {
> +       char buf[INET6_ADDRSTRLEN];
> +       prefix2str(&p, buf, INET6_ADDRSTRLEN);
> +       zlog_debug("parse nexthop update(%s): rn not found", buf);

rn should be unlocked here if rn && !rn->info.

> +     }
> +      return;
> +    }
> +
> +  bnc = rn->info;
> +  bnc->last_update = bgp_clock();
> +  bnc->change_flags = 0;
> +  metric = stream_getl (s);
> +  nexthop_num = stream_getc (s);
> +
> +  /* debug print the input */
> +  if (BGP_DEBUG(nht, NHT))
> +    {
> +      char buf[INET6_ADDRSTRLEN];
> +      prefix2str(&p, buf, INET6_ADDRSTRLEN);
> +      zlog_debug("parse nexthop update(%s): metric=%d, #nexthop=%d", buf,
> +              metric, nexthop_num);
> +    }
> +
> +  if (metric != bnc->metric)
> +    bnc->change_flags |= BGP_NEXTHOP_METRIC_CHANGED;
> +
> +  if(nexthop_num != bnc->nexthop_num)
> +    bnc->change_flags |= BGP_NEXTHOP_CHANGED;
> +
> +  if (nexthop_num)
> +    {
> +      bnc->flags |= BGP_NEXTHOP_VALID;
> +      bnc->metric = metric;
> +      bnc->nexthop_num = nexthop_num;
> +
> +      for (i = 0; i < nexthop_num; i++)
> +     {
> +       nexthop = nexthop_new();
> +       nexthop->type = stream_getc (s);
> +       switch (nexthop->type)
> +         {
> +         case ZEBRA_NEXTHOP_IPV4:
> +           nexthop->gate.ipv4.s_addr = stream_get_ipv4 (s);
> +           break;
> +         case ZEBRA_NEXTHOP_IFINDEX:
> +         case ZEBRA_NEXTHOP_IFNAME:
> +           nexthop->ifindex = stream_getl (s);
> +           break;
> +            case ZEBRA_NEXTHOP_IPV4_IFINDEX:
> +         case ZEBRA_NEXTHOP_IPV4_IFNAME:
> +           nexthop->gate.ipv4.s_addr = stream_get_ipv4 (s);
> +           nexthop->ifindex = stream_getl (s);
> +           break;
> +#ifdef HAVE_IPV6
> +            case ZEBRA_NEXTHOP_IPV6:
> +           stream_get (&nexthop->gate.ipv6, s, 16);
> +           break;
> +            case ZEBRA_NEXTHOP_IPV6_IFINDEX:
> +         case ZEBRA_NEXTHOP_IPV6_IFNAME:
> +           stream_get (&nexthop->gate.ipv6, s, 16);
> +           nexthop->ifindex = stream_getl (s);
> +           break;
> +#endif
> +            default:
> +              /* do nothing */
> +              break;
> +         }
> +
> +       if (nhlist_tail)
> +         {
> +           nhlist_tail->next = nexthop;
> +           nhlist_tail = nexthop;
> +         }
> +       else
> +         {
> +           nhlist_tail = nexthop;
> +           nhlist_head = nexthop;
> +         }
> +
> +       /* No need to evaluate the nexthop if we have already determined
> +        * that there has been a change.
> +        */
> +       if (bnc->change_flags & BGP_NEXTHOP_CHANGED)
> +         continue;
> +
> +       for (oldnh = bnc->nexthop; oldnh; oldnh = oldnh->next)
> +           if (nexthop_same_no_recurse(oldnh, nexthop))
> +               break;
> +
> +       if (!oldnh)
> +         bnc->change_flags |= BGP_NEXTHOP_CHANGED;
> +     }
> +      bnc_nexthop_free(bnc);
> +      bnc->nexthop = nhlist_head;
> +    }
> +  else
> +    {
> +      bnc->flags &= ~BGP_NEXTHOP_VALID;
> +      bnc_nexthop_free(bnc);
> +      bnc->nexthop = NULL;
> +    }
> +
> +  evaluate_paths(bnc);

Where is rn unlocked? I looked at the lock count and it appears to me
like the reference is incremented by 1 on each received nexthop_update.

> +++ b/doc/next-hop-tracking.txt

Thank you for writing up this documentation, it is quite helpful to
understand how the parts play together.

> diff --git a/lib/zebra.h b/lib/zebra.h
>  /* Marker value used in new Zserv, in the byte location corresponding
>   * the command value in the old zserv header. To allow old and new
> @@ -516,6 +519,7 @@ extern const char *zserv_command_string (unsigned int 
> command);
>  #define CHECK_FLAG(V,F)      ((V) & (F))
>  #define SET_FLAG(V,F)        (V) |= (F)
>  #define UNSET_FLAG(V,F)      (V) &= ~(F)
> +#define RESET_FLAG(V)        (V) = 0

Maybe make this RESET_FLAGS?

> diff --git a/zebra/debug.c b/zebra/debug.c
> +  install_element (CONFIG_NODE, &debug_zebra_nht_cmd);
>    install_element (CONFIG_NODE, &debug_zebra_packet_cmd);
>    install_element (CONFIG_NODE, &debug_zebra_packet_direct_cmd);
>    install_element (CONFIG_NODE, &debug_zebra_packet_detail_cmd);
> @@ -375,6 +404,7 @@ zebra_debug_init (void)
>    install_element (CONFIG_NODE, &debug_zebra_rib_q_cmd);
>    install_element (CONFIG_NODE, &debug_zebra_fpm_cmd);
>    install_element (CONFIG_NODE, &no_debug_zebra_events_cmd);
> +  install_element (CONFIG_NODE, &no_debug_zebra_nht_cmd);
>    install_element (CONFIG_NODE, &no_debug_zebra_packet_cmd);
>    install_element (CONFIG_NODE, &no_debug_zebra_kernel_cmd);
>    install_element (CONFIG_NODE, &no_debug_zebra_rib_cmd);

Saving the debug flag to the startup config currently does not work.
Imho, it would be neat if it did.

> diff --git a/zebra/zebra_rib.c b/zebra/zebra_rib.c
> +/*
> + * All meta queues have been processed. Trigger next-hop evaluation.
> + */
> +static void
> +meta_queue_process_complete (struct work_queue *dummy)
> +{
> +  zebra_evaluate_rnh_table(0, AF_INET);
> +#ifdef HAVE_IPV6
> +  zebra_evaluate_rnh_table(0, AF_INET6);
> +#endif /* HAVE_IPV6 */
> +}

What are the guarantees that zebra_evaluate_rnh_table will run in a
timely fashion? Will the meta_queue definitively get empty at some point
even in the face of high update churn?

> +++ b/zebra/zebra_rnh.c
>
> +#define lookup_rnh_table(v, f)                        \
> +({                                            \
> +  struct zebra_vrf *zvrf;                        \
> +  struct route_table *t = NULL;                  \
> +  zvrf = zebra_vrf_lookup(v);                    \
> +  if (zvrf)                                      \
> +    t = zvrf->rnh_table[family2afi(f)];               \
> +  t;                                             \
> +})

Is there really a point in making this a macro? I would expect any
reasonable compiler to inline this code anyway if it were declared as
a static function and inlining made sense performance-wise.

> +struct rnh *
> +zebra_add_rnh (struct prefix *p, vrf_id_t vrfid)
> +{
> +  struct route_table *table;
> +  struct route_node *rn;
> +  struct rnh *rnh = NULL;
> +
> +  if (IS_ZEBRA_DEBUG_NHT)
> +    {
> +      char buf[INET6_ADDRSTRLEN];
> +      prefix2str(p, buf, INET6_ADDRSTRLEN);

INET6_ADDRSTRLEN is not really a great value to use here, since
prefix2str writes the address plus the prefix length into the returned
string. Maybe it would be good to have a generic constant for this:
the buffer size used with prefix2str varies widely throughout Quagga,
so people are obviously not clear on what size to use.

> +int
> +zebra_evaluate_rnh_table (vrf_id_t vrfid, int family)
> +{
> +  struct route_table *ptable;
> +  struct route_table *ntable;
> +  struct route_node *prn;
> +  struct route_node *nrn;
> +  struct rnh *rnh;
> +  struct zserv *client;
> +  struct listnode *node;
> +  struct rib *rib;
> +
> +  ntable = lookup_rnh_table(vrfid, family);
> +  if (!ntable)
> +    {
> +      zlog_debug("evaluate_rnh_table: rnh table not found\n");
> +      return -1;
> +    }
> +
> +  ptable = zebra_vrf_table(family2afi(family), SAFI_UNICAST, vrfid);
> +  if (!ptable)
> +    {
> +      zlog_debug("evaluate_rnh_table: prefix table not found\n");
> +      return -1;
> +    }
> +
> +  for (nrn = route_top (ntable); nrn; nrn = route_next (nrn))
> +    {
> +      if (!nrn->info)
> +       continue;
> +
> +      prn = route_node_match(ptable, &nrn->p);
> +      if (!prn)
> +     rib = NULL;
> +      else
> +     {
> +       RNODE_FOREACH_RIB(prn, rib)
> +         {
> +           if (CHECK_FLAG (rib->status, RIB_ENTRY_REMOVED))
> +             continue;
> +           if (CHECK_FLAG (rib->flags, ZEBRA_FLAG_SELECTED))
> +             break;
> +         }
> +     }

Shouldn't this have semantics similar to the resolution in
nexthop_active_ipv4/_ipv6 in zebra_rib.c? E.g. not matching against
BGP routes, and walking up to less specific routes if the longest
prefix match has all its ribs RIB_ENTRY_REMOVED and/or without
ZEBRA_FLAG_SELECTED? Not matching against BGP here is somewhat
problematic, though: it is what is usually needed for iBGP, but not
necessarily what is needed e.g. for PIM, I think.

Looking at e.g. this (admittedly somewhat synthetic) case:

# show ip route
B>  10.10.0.0/16 [200/0] via 100.100.253.3 (recursive), 00:00:33
  *                        via 100.100.254.2, veth0, 00:00:33
  *                        via 100.100.252.2, veth2, 00:00:33
O>* 100.100.253.0/24 [110/20] via 100.100.254.2, veth0, 00:01:52
  *                           via 100.100.252.2, veth2, 00:01:52
S   100.100.253.3/32 [1/0] via 192.168.1.1 inactive

It works absolutely fine with 0.99.24.1 while bgpd won't send the
10.10.0.0/16 route down to zebra with this patch applied, because zebra
claims that the nexthop is not usable.

> +int
> +zebra_dispatch_rnh_table (vrf_id_t vrfid, int family, struct zserv *client)

What is this used for? I don't see any call to it.

> diff --git a/zebra/zserv.c b/zebra/zserv.c
>
> +/* Nexthop register */
> +static int
> +zserv_nexthop_register (struct zserv *client, int sock, u_short length, 
> vrf_id_t vrf_id)
> +{
> +  struct rnh *rnh;
> +  struct stream *s;
> +  struct prefix p;
> +  u_short l = 0;
> +
> +  if (IS_ZEBRA_DEBUG_NHT)
> +    zlog_debug("nexthop_register msg from client %s: length=%d\n",
> +            zebra_route_string(client->proto), length);
> +
> +  s = client->ibuf;
> +
> +  while (l < length)
> +    {
> +      p.family = stream_getw(s);
> +      p.prefixlen = stream_getc(s);
> +      l += 3;
> +      stream_get(&p.u.prefix, s, PSIZE(p.prefixlen));
> +      l += PSIZE(p.prefixlen);
> +      rnh = zebra_add_rnh(&p, 0);
> +      zebra_add_rnh_client(rnh, client, vrf_id);

What are the semantics here with respect to the vrf_id? It seems that
the rnh_table is only maintained for vrf 0, at least 0 is hardcoded in
most places.

Also, it appears to me that a client that sends a nexthop_register
with e.g. vrf_id == 1 will receive one nexthop_update with vrf_id == 1,
triggered from the zebra_add_rnh_client(..., vrf_id). This update
possibly does not list any resolving nexthops, since it is sent before
any call to zebra_evaluate_rnh_table. From there on, any update will
be sent with vrf_id == 0, since the updates are originated by the
calls to zebra_evaluate_rnh_table, which are always done with vrf_id 0.

My expectation as a user of this API would be that if I send a
nexthop_register with vrf_id == 1, the lookup/tracking would be done in
the rib of vrf 1.

_______________________________________________
Quagga-dev mailing list
[email protected]
https://lists.quagga.net/mailman/listinfo/quagga-dev
