On Wed, Aug 06, 2025 at 07:05:19PM +0200, Paul Menzel wrote:
> Dear Maciej,
> 
> 
> Thank you for your patch.
> 
> 
> Am 06.08.25 um 18:58 schrieb Maciej Fijalkowski:
> > Currently ixgbe driver checks periodically in its watchdog subtask if
> > there is anything to be transmitted (considering both Tx and XDP rings)
> > under state of carrier not being 'ok'. Such event is interpreted as Tx
> > hang and therefore results in interface reset.
> 
> For people grepping through commit messages, add some more details how the
> hang manifests?

Hi Paul,

I didn't want to repeat too much of things here that are included under
the link where original report took place (see lore link from Closes:
tag).

I know this adds a level of indirection for future reader, but I assumed
lore link will not be gone and it is safe to rely on it.

> 
> > This is currently problematic for ndo_xdp_xmit() as it is allowed to
> > produce descriptors when interface is going through reset or its carrier
> > is turned off.
> > 
> > Furthermore, XDP rings should not really be objects of Tx hang
> > detection. This mechanism is rather a matter of ndo_tx_timeout() being
> > called from dev_watchdog against Tx rings exposed to networking stack.
> > 
> > Taking into account issues described above, let us have a two fold fix -
> > do not respect XDP rings in local ixgbe watchdog and do not produce Tx
> > descriptors in ndo_xdp_xmit callback when there is some problem with
> > carrier currently. For now, keep the Tx hang checks in clean Tx irq
> > routine, but adjust it to not execute for XDP rings.
> 
> Do you have a reproducer for this?

Again, the original report has it. xdp-trafficgen was used to trigger this
problem.

I am not sure if it's worth re-spinning, especially that Marcus and Tobias
might be angry at me that it still didn't make it to mainline:P

> 
> > Cc: Tobias Böhm <[email protected]>
> > Reported-by: Marcus Wichelmann <[email protected]>
> > Closes: 
> > https://lore.kernel.org/netdev/[email protected]/
> > Fixes: 6453073987ba ("ixgbe: add initial support for xdp redirect")
> > Fixes: 33fdc82f0883 ("ixgbe: add support for XDP_TX action")
> > Reviewed-by: Aleksandr Loktionov <[email protected]>
> > Tested-by: Marcus Wichelmann <[email protected]>
> > Signed-off-by: Maciej Fijalkowski <[email protected]>
> > ---
> > v1->v2:
> > * collect tags
> > * fix typos (Dawid)
> > ---
> >   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 34 ++++++-------------
> >   1 file changed, 11 insertions(+), 23 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
> > b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > index 03d31e5b131d..7c0db3b3ee8e 100644
> > --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > @@ -967,10 +967,6 @@ static void ixgbe_update_xoff_rx_lfc(struct 
> > ixgbe_adapter *adapter)
> >     for (i = 0; i < adapter->num_tx_queues; i++)
> >             clear_bit(__IXGBE_HANG_CHECK_ARMED,
> >                       &adapter->tx_ring[i]->state);
> > -
> > -   for (i = 0; i < adapter->num_xdp_queues; i++)
> > -           clear_bit(__IXGBE_HANG_CHECK_ARMED,
> > -                     &adapter->xdp_ring[i]->state);
> >   }
> >   static void ixgbe_update_xoff_received(struct ixgbe_adapter *adapter)
> > @@ -1264,10 +1260,13 @@ static bool ixgbe_clean_tx_irq(struct 
> > ixgbe_q_vector *q_vector,
> >                                total_bytes);
> >     adapter->tx_ipsec += total_ipsec;
> > +   if (ring_is_xdp(tx_ring))
> > +           return !!budget;
> > +
> >     if (check_for_tx_hang(tx_ring) && ixgbe_check_tx_hang(tx_ring)) {
> >             /* schedule immediate reset if we believe we hung */
> >             struct ixgbe_hw *hw = &adapter->hw;
> > -           e_err(drv, "Detected Tx Unit Hang %s\n"
> > +           e_err(drv, "Detected Tx Unit Hang\n"
> >                     "  Tx Queue             <%d>\n"
> >                     "  TDH, TDT             <%x>, <%x>\n"
> >                     "  next_to_use          <%x>\n"
> > @@ -1275,16 +1274,14 @@ static bool ixgbe_clean_tx_irq(struct 
> > ixgbe_q_vector *q_vector,
> >                     "tx_buffer_info[next_to_clean]\n"
> >                     "  time_stamp           <%lx>\n"
> >                     "  jiffies              <%lx>\n",
> > -                   ring_is_xdp(tx_ring) ? "(XDP)" : "",
> >                     tx_ring->queue_index,
> >                     IXGBE_READ_REG(hw, IXGBE_TDH(tx_ring->reg_idx)),
> >                     IXGBE_READ_REG(hw, IXGBE_TDT(tx_ring->reg_idx)),
> >                     tx_ring->next_to_use, i,
> >                     tx_ring->tx_buffer_info[i].time_stamp, jiffies);
> > -           if (!ring_is_xdp(tx_ring))
> > -                   netif_stop_subqueue(tx_ring->netdev,
> > -                                       tx_ring->queue_index);
> > +           netif_stop_subqueue(tx_ring->netdev,
> > +                               tx_ring->queue_index);
> >             e_info(probe,
> >                    "tx hang %d detected on queue %d, resetting adapter\n",
> > @@ -1297,9 +1294,6 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector 
> > *q_vector,
> >             return true;
> >     }
> > -   if (ring_is_xdp(tx_ring))
> > -           return !!budget;
> > -
> >   #define TX_WAKE_THRESHOLD (DESC_NEEDED * 2)
> >     txq = netdev_get_tx_queue(tx_ring->netdev, tx_ring->queue_index);
> >     if (!__netif_txq_completed_wake(txq, total_packets, total_bytes,
> > @@ -7796,12 +7790,9 @@ static void ixgbe_check_hang_subtask(struct 
> > ixgbe_adapter *adapter)
> >             return;
> >     /* Force detection of hung controller */
> > -   if (netif_carrier_ok(adapter->netdev)) {
> > +   if (netif_carrier_ok(adapter->netdev))
> >             for (i = 0; i < adapter->num_tx_queues; i++)
> >                     set_check_for_tx_hang(adapter->tx_ring[i]);
> > -           for (i = 0; i < adapter->num_xdp_queues; i++)
> > -                   set_check_for_tx_hang(adapter->xdp_ring[i]);
> > -   }
> >     if (!(adapter->flags & IXGBE_FLAG_MSIX_ENABLED)) {
> >             /*
> > @@ -8016,13 +8007,6 @@ static bool ixgbe_ring_tx_pending(struct 
> > ixgbe_adapter *adapter)
> >                     return true;
> >     }
> > -   for (i = 0; i < adapter->num_xdp_queues; i++) {
> > -           struct ixgbe_ring *ring = adapter->xdp_ring[i];
> > -
> > -           if (ring->next_to_use != ring->next_to_clean)
> > -                   return true;
> > -   }
> > -
> >     return false;
> >   }
> > @@ -10825,6 +10809,10 @@ static int ixgbe_xdp_xmit(struct net_device *dev, 
> > int n,
> >     if (unlikely(test_bit(__IXGBE_DOWN, &adapter->state)))
> >             return -ENETDOWN;
> > +   if (!netif_carrier_ok(adapter->netdev) ||
> > +       !netif_running(adapter->netdev))
> > +           return -ENETDOWN;
> > +
> 
> I am no expert here, but should the commit be split into two?

fixing producer on one commit and consumer on other means that first
commit would still contain a broken driver, which would be not a real
*fix*. you can think of ixgbe_xdp_xmit() as a producer of descriptors and
ixgbe_clean_tx_irq() as consumer (in reality HW is the consumer, but i
hope this analogy makes some sense to you).

> 
> >     if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
> >             return -EINVAL;
> 
> 
> Kind regards,
> 
> Paul

Reply via email to