On Fri, Aug 11, 2023 at 6:56 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>
> On 8/4/23 19:58, Mike Pattrick wrote:
> > Diagnosing connectivity issues involving a bond can be complicated by a
> > lack of logging in LACP. It is difficult to determine the health of
> > sending and receving LACP packets. This is further complicated by the
> > tendency of some switches to toggle the carrier on interfaces that
> > experience issues with LACP, which can cause confusion about why an
> > interface is flapping down and up again.
> >
> > With this patch, OVS will log when LACP packets aren't sent or recieved
> > on time.
> >
> > Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2223306
> > Signed-off-by: Mike Pattrick <m...@redhat.com>
>
> Hi, Mike.  Thanks for the improvements!
>
> Not a full review, but I have a few comments inline.
>
> > ---
> > v2: Added ratelimits to logs
> > ---
> >  lib/lacp.c | 26 ++++++++++++++++++++++++++
> >  1 file changed, 26 insertions(+)
> >
> > diff --git a/lib/lacp.c b/lib/lacp.c
> > index 3252f17eb..22001a39d 100644
> > --- a/lib/lacp.c
> > +++ b/lib/lacp.c
> > @@ -143,6 +143,8 @@ struct member {
> >      uint32_t count_link_expired;    /* Num of times link expired */
> >      uint32_t count_link_defaulted;  /* Num of times link defaulted */
> >      uint32_t count_carrier_changed; /* Num of times link status changed */
> > +    long long int last_tx;
> > +    long long int last_rx;
>
> Some simple comments would be nice here.
>
> >  };
> >
> >  static struct ovs_mutex mutex;
> > @@ -387,6 +389,13 @@ lacp_process_packet(struct lacp *lacp, const void 
> > *member_,
> >          goto out;
> >      }
> >
> > +    if (member->last_rx && member->status != LACP_CURRENT) {
> > +        long long int delay = time_msec() - member->last_rx;
> > +        VLOG_DBG_RL(&rl, "%s: %s recieved PDU after expiry, delayed by 
> > %lldms "
> > +                    "seconds.", lacp->name, member->name, delay);
>
> "delayed by X ms seconds" ?
> The "seconds" part should not be there.
>
> > +    }
> > +
> > +    member->last_rx = time_msec();
> >      member->status = LACP_CURRENT;
> >      tx_rate = lacp->fast ? LACP_FAST_TIME_TX : LACP_SLOW_TIME_TX;
> >      timer_set_duration(&member->rx, LACP_RX_MULTIPLIER * tx_rate);
> > @@ -511,6 +520,7 @@ lacp_member_carrier_changed(const struct lacp *lacp, 
> > const void *member_,
> >                              bool carrier_up)
> >      OVS_EXCLUDED(mutex)
> >  {
> > +    static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5);
> >      struct member *member;
> >      if (!lacp) {
> >          return;
> > @@ -524,6 +534,11 @@ lacp_member_carrier_changed(const struct lacp *lacp, 
> > const void *member_,
> >
> >      if (member->status == LACP_CURRENT || member->lacp->active) {
> >          member_set_expired(member);
> > +        VLOG_DBG_RL(&rl, "%s: Expiring LACP due to %s carrier change.",
> > +                    lacp->name, member->name);
> > +        /* Do not warn about long LACP RX/TX interval if interface was 
> > down */
>
> Period at the end.
>
> > +        member->last_rx = 0;
> > +        member->last_tx = 0;
> >      }
> >
> >      if (member->carrier_up != carrier_up) {
> > @@ -594,6 +609,7 @@ lacp_member_is_current(const struct lacp *lacp, const 
> > void *member_)
> >  void
> >  lacp_run(struct lacp *lacp, lacp_send_pdu *send_pdu) OVS_EXCLUDED(mutex)
> >  {
> > +    static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5);
> >      struct member *member;
> >
> >      lacp_lock();
> > @@ -603,6 +619,8 @@ lacp_run(struct lacp *lacp, lacp_send_pdu *send_pdu) 
> > OVS_EXCLUDED(mutex)
> >
> >              if (member->status == LACP_CURRENT) {
> >                  member_set_expired(member);
> > +                VLOG_DBG_RL(&rl, "%s: Expired member %s because LACP PDU 
> > was "
> > +                            "not received on time.", lacp->name, 
> > member->name);
> >                  member->count_link_expired++;
> >              } else if (member->status == LACP_EXPIRED) {
> >                  member_set_defaulted(member);
> > @@ -642,6 +660,13 @@ lacp_run(struct lacp *lacp, lacp_send_pdu *send_pdu) 
> > OVS_EXCLUDED(mutex)
> >                          ? LACP_FAST_TIME_TX
> >                          : LACP_SLOW_TIME_TX);
> >
> > +            /* Log if we exceed the tx timer by double the tx rate. */
> > +            if (member->last_tx &&
> > +                (time_msec() - member->last_tx) > (duration * 2)) {
> > +                VLOG_INFO_RL(&rl, "%s: Member %s failed to send LACP PCU "
> > +                             "on time.", lacp->name, member->name);
> > +            }
> > +            member->last_tx = time_msec();
> >              timer_set_duration(&member->tx, duration);
> >              seq_change(connectivity_seq_get());
> >          }
> > @@ -800,6 +825,7 @@ member_set_expired(struct member *member) 
> > OVS_REQUIRES(mutex)
> >      member->partner.state &= ~LACP_STATE_SYNC;
> >
> >      timer_set_duration(&member->rx, LACP_RX_MULTIPLIER * 
> > LACP_FAST_TIME_TX);
> > +    member->last_tx = 0;
>
> Why we clearing this for expired ports?  Shouldn't we still transmit even
> if we do not receive anything from the other side?

The goal here was to log cases where PDU RX was fine but TX was slow.
However, I think this can be improved, and I'll address it in the next
version.


Thank you,
Mike



>
> Not sure how to correctly achieve that, but is seems that last_tx shoudl
> be cleared whenever member_may_tx() turns false.
>
> Best regards, Ilya Maximets.
>

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to