On Fri, Aug 11, 2023 at 6:56 PM Ilya Maximets <i.maxim...@ovn.org> wrote: > > On 8/4/23 19:58, Mike Pattrick wrote: > > Diagnosing connectivity issues involving a bond can be complicated by a > > lack of logging in LACP. It is difficult to determine the health of > > sending and receving LACP packets. This is further complicated by the > > tendency of some switches to toggle the carrier on interfaces that > > experience issues with LACP, which can cause confusion about why an > > interface is flapping down and up again. > > > > With this patch, OVS will log when LACP packets aren't sent or recieved > > on time. > > > > Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2223306 > > Signed-off-by: Mike Pattrick <m...@redhat.com> > > Hi, Mike. Thanks for the improvements! > > Not a full review, but I have a few comments inline. > > > --- > > v2: Added ratelimits to logs > > --- > > lib/lacp.c | 26 ++++++++++++++++++++++++++ > > 1 file changed, 26 insertions(+) > > > > diff --git a/lib/lacp.c b/lib/lacp.c > > index 3252f17eb..22001a39d 100644 > > --- a/lib/lacp.c > > +++ b/lib/lacp.c > > @@ -143,6 +143,8 @@ struct member { > > uint32_t count_link_expired; /* Num of times link expired */ > > uint32_t count_link_defaulted; /* Num of times link defaulted */ > > uint32_t count_carrier_changed; /* Num of times link status changed */ > > + long long int last_tx; > > + long long int last_rx; > > Some simple comments would be nice here. > > > }; > > > > static struct ovs_mutex mutex; > > @@ -387,6 +389,13 @@ lacp_process_packet(struct lacp *lacp, const void > > *member_, > > goto out; > > } > > > > + if (member->last_rx && member->status != LACP_CURRENT) { > > + long long int delay = time_msec() - member->last_rx; > > + VLOG_DBG_RL(&rl, "%s: %s recieved PDU after expiry, delayed by > > %lldms " > > + "seconds.", lacp->name, member->name, delay); > > "delayed by X ms seconds" ? > The "seconds" part should not be there. > > > + } > > + > > + member->last_rx = time_msec(); > > member->status = LACP_CURRENT; > > tx_rate = lacp->fast ? LACP_FAST_TIME_TX : LACP_SLOW_TIME_TX; > > timer_set_duration(&member->rx, LACP_RX_MULTIPLIER * tx_rate); > > @@ -511,6 +520,7 @@ lacp_member_carrier_changed(const struct lacp *lacp, > > const void *member_, > > bool carrier_up) > > OVS_EXCLUDED(mutex) > > { > > + static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5); > > struct member *member; > > if (!lacp) { > > return; > > @@ -524,6 +534,11 @@ lacp_member_carrier_changed(const struct lacp *lacp, > > const void *member_, > > > > if (member->status == LACP_CURRENT || member->lacp->active) { > > member_set_expired(member); > > + VLOG_DBG_RL(&rl, "%s: Expiring LACP due to %s carrier change.", > > + lacp->name, member->name); > > + /* Do not warn about long LACP RX/TX interval if interface was > > down */ > > Period at the end. > > > + member->last_rx = 0; > > + member->last_tx = 0; > > } > > > > if (member->carrier_up != carrier_up) { > > @@ -594,6 +609,7 @@ lacp_member_is_current(const struct lacp *lacp, const > > void *member_) > > void > > lacp_run(struct lacp *lacp, lacp_send_pdu *send_pdu) OVS_EXCLUDED(mutex) > > { > > + static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5); > > struct member *member; > > > > lacp_lock(); > > @@ -603,6 +619,8 @@ lacp_run(struct lacp *lacp, lacp_send_pdu *send_pdu) > > OVS_EXCLUDED(mutex) > > > > if (member->status == LACP_CURRENT) { > > member_set_expired(member); > > + VLOG_DBG_RL(&rl, "%s: Expired member %s because LACP PDU > > was " > > + "not received on time.", lacp->name, > > member->name); > > member->count_link_expired++; > > } else if (member->status == LACP_EXPIRED) { > > member_set_defaulted(member); > > @@ -642,6 +660,13 @@ lacp_run(struct lacp *lacp, lacp_send_pdu *send_pdu) > > OVS_EXCLUDED(mutex) > > ? LACP_FAST_TIME_TX > > : LACP_SLOW_TIME_TX); > > > > + /* Log if we exceed the tx timer by double the tx rate. */ > > + if (member->last_tx && > > + (time_msec() - member->last_tx) > (duration * 2)) { > > + VLOG_INFO_RL(&rl, "%s: Member %s failed to send LACP PCU " > > + "on time.", lacp->name, member->name); > > + } > > + member->last_tx = time_msec(); > > timer_set_duration(&member->tx, duration); > > seq_change(connectivity_seq_get()); > > } > > @@ -800,6 +825,7 @@ member_set_expired(struct member *member) > > OVS_REQUIRES(mutex) > > member->partner.state &= ~LACP_STATE_SYNC; > > > > timer_set_duration(&member->rx, LACP_RX_MULTIPLIER * > > LACP_FAST_TIME_TX); > > + member->last_tx = 0; > > Why we clearing this for expired ports? Shouldn't we still transmit even > if we do not receive anything from the other side?
The goal here was to log cases where PDU RX was fine but TX was slow. However, I think this can be improved, and I'll address it in the next version. Thank you, Mike > > Not sure how to correctly achieve that, but is seems that last_tx shoudl > be cleared whenever member_may_tx() turns false. > > Best regards, Ilya Maximets. > _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev