[Bug 279245] igc(4) I226 (and I225) hangups

bugzilla-noreply Thu, 23 May 2024 02:12:39 -0700

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=279245


            Bug ID: 279245
           Summary: igc(4) I226 (and I225) hangups
           Product: Base System
           Version: 13.2-RELEASE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: b...@freebsd.org
          Reporter: freebsd_em...@congenio.de

When using an I226 under OpnSense (FreeBSD 13.2-RELEASE kernel; I also tried
FreeBSD 14.0-RELEASE), I experience connection hangups about once per day with
no identifiable trigger (the maximum was 3 times within one hour, and I have
also gone three days without any).

This problem manifests as a dead connection (no packets are received, none are
sent), but the low-level counters (dev.igc.0.mac_stats) still increase.
The condition can be cleared by bringing the interface down and up again or by
briefly disconnecting the cable.

There are reports on this and other related problems all over the internet for
different OSes, see:

Windows:
https://forums.evga.com/PSA-Intel-I226V-25GbE-on-Raptor-Lake-Motherboards-Has-a-Connection-Drop-Issue-No-Fix-m3595279.aspx
OpnSense (FreeBSD):
https://forum.opnsense.org/index.php?topic=40404.msg199288#msg199288
pfSense (FreeBSD):
https://forum.netgate.com/topic/181571/chinese-i226-v-on-23-05-1-problems

My specific variant is an I226-V, rev.4, built into a Minisforum MS-01:

igc0@pci0:87:0:0:       class=0x020000 rev=0x04 hdr=0x00 vendor=0x8086 device=0x125c subvendor=0x8086 subdevice=0x0000
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller I226-V'
    class      = network
    subclass   = ethernet


However, there are reports of the I226-LM connected to the same machine showing
the same behaviour, see: https://forum.opnsense.org/index.php?topic=40556

igc1@pci0:88:0:0:       class=0x020000 rev=0x04 hdr=0x00 vendor=0x8086 device=0x125b subvendor=0x8086 subdevice=0x0000
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller I226-LM'
    class      = network
    subclass   = ethernet

This seems to indicate that at least the I226 family (which is a successor to
the problem-ridden I225 using the same driver module) is affected by this
problem.
I tried every setting I could think of to make this go away, such as reducing
the speed from 2.5 to 1 Gbps and disabling EEE (which is off by default
anyway), but to no avail.

Interestingly, the Minisforum MS-01 has gained much interest in the last few
months, and there was a specific review on YouTube where the creator states in
a comment that he is not seeing this problem
(https://www.youtube.com/watch?v=_wgX1sDab-M). However, he runs OpnSense under
a Proxmox hypervisor, thus using the Linux driver modules (OpnSense itself uses
the virtualized virtio NICs).

This, and the reports of gamers stating they had "micro-hangs" manifesting as
short lags in online games, got me thinking.
So I compared the Linux and FreeBSD drivers and found that the Linux driver
has a specific routine to catch, log and clear "TX hang" conditions, see
from line 3150 here:
https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/intel/igc/igc_main.c,
which reads:

        if (test_bit(IGC_RING_FLAG_TX_DETECT_HANG, &tx_ring->flags)) {
                struct igc_hw *hw = &adapter->hw;

                /* Detect a transmit hang in hardware, this serializes the
                * check with the clearing of time_stamp and movement of i
                */
                clear_bit(IGC_RING_FLAG_TX_DETECT_HANG, &tx_ring->flags);
                if (tx_buffer->next_to_watch &&
                    time_after(jiffies, tx_buffer->time_stamp +
                    (adapter->tx_timeout_factor * HZ)) &&
                    !(rd32(IGC_STATUS) & IGC_STATUS_TXOFF) &&
                    (rd32(IGC_TDH(tx_ring->reg_idx)) != readl(tx_ring->tail)) &&
                    !tx_ring->oper_gate_closed) {
                        /* detected Tx unit hang */
                        netdev_err(tx_ring->netdev,
                                   "Detected Tx Unit Hang\n"
                                   "  Tx Queue             <%d>\n"
                                   "  TDH                  <%x>\n"
                                   "  TDT                  <%x>\n"
                                   "  next_to_use          <%x>\n"
                                   "  next_to_clean        <%x>\n"
                                   "buffer_info[next_to_clean]\n"
                                   "  time_stamp           <%lx>\n"
                                   "  next_to_watch        <%p>\n"
                                   "  jiffies              <%lx>\n"
                                   "  desc.status          <%x>\n",
                                   tx_ring->queue_index,
                                   rd32(IGC_TDH(tx_ring->reg_idx)),
                                   readl(tx_ring->tail),
                                   tx_ring->next_to_use,
                                   tx_ring->next_to_clean,
                                   tx_buffer->time_stamp,
                                   tx_buffer->next_to_watch,
                                   jiffies,
                                   tx_buffer->next_to_watch->wb.status);
                        netif_stop_subqueue(tx_ring->netdev,
                                            tx_ring->queue_index);

                        /* we are about to reset, no point in enabling stuff */
                        return true;
                }
        }
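
Condensed, the check quoted above declares a hang only when work is still
pending, that work is older than the timeout, the transmitter is not paused
(TXOFF), and the hardware head and tail pointers differ. The following is a
minimal, self-contained sketch of that predicate, not driver code: the struct
tx_ring_snapshot and all of its field names are hypothetical and exist only for
illustration, and the oper_gate_closed condition from the Linux code is left
out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical snapshot of one TX ring. The field names are illustrative
 * assumptions and do not match any structure in the Linux or FreeBSD drivers.
 */
struct tx_ring_snapshot {
	bool     work_pending;  /* a descriptor still awaits write-back */
	uint64_t oldest_stamp;  /* tick at which that descriptor was queued */
	uint64_t now;           /* current tick, assumed >= oldest_stamp */
	uint64_t timeout_ticks; /* grace period before a hang is declared */
	bool     tx_paused;     /* transmitter paused (TXOFF status bit) */
	uint32_t hw_head;       /* value read from the TDH register */
	uint32_t hw_tail;       /* value read from the TDT register */
};

/* Returns true only if every condition for a TX hang holds at once. */
bool
tx_hang_detected(const struct tx_ring_snapshot *s)
{
	assert(s != NULL);

	if (!s->work_pending)
		return (false);  /* nothing queued, so nothing can hang */
	if (s->now - s->oldest_stamp <= s->timeout_ticks)
		return (false);  /* still within the grace period */
	if (s->tx_paused)
		return (false);  /* a paused transmitter is not a hang */
	if (s->hw_head == s->hw_tail)
		return (false);  /* hardware has consumed the whole ring */
	return (true);
}
```

The pause check matters: without it, a link partner that sends pause frames for
a long time would be misreported as a hang.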

There is also a routine to reset the adapter:

/**
 * igc_tx_timeout - Respond to a Tx Hang
 * @netdev: network interface device structure
 * @txqueue: queue number that timed out
 **/
static void igc_tx_timeout(struct net_device *netdev,
                           unsigned int __always_unused txqueue)
{
        struct igc_adapter *adapter = netdev_priv(netdev);
        struct igc_hw *hw = &adapter->hw;

        /* Do the reset outside of interrupt context */
        adapter->tx_timeout_count++;
        schedule_work(&adapter->reset_task);
        wr32(IGC_EICS,
             (adapter->eims_enable_mask & ~adapter->eims_other));
}

I did not see anything to this effect in the FreeBSD igc driver module.
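
To illustrate the pattern in the Linux routine (count the event, then defer the
actual reset to a context where it is safe), here is a minimal sketch. It is
not a proposed patch: struct tx_watchdog, the function names and the reinit
callback are all hypothetical, and the callback merely stands in for whatever
reinitializes the adapter, comparable to bringing the interface down and up
again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical watchdog state; not taken from any real driver. */
struct tx_watchdog {
	uint32_t timeout_count; /* number of hangs reported so far */
	bool     reset_pending; /* a deferred reset has been requested */
};

/*
 * Called from the detection path. It only records the event and requests a
 * reset; it does not touch the hardware itself.
 */
void
tx_watchdog_report_hang(struct tx_watchdog *wd)
{
	assert(wd != NULL);
	wd->timeout_count++;
	wd->reset_pending = true;
}

/*
 * Called later from a context where reinitializing is safe. Runs the
 * caller-supplied reinit callback at most once per pending request and
 * returns true if a reset was performed.
 */
bool
tx_watchdog_service(struct tx_watchdog *wd, void (*reinit)(void *), void *arg)
{
	assert(wd != NULL && reinit != NULL);
	if (!wd->reset_pending)
		return (false);
	wd->reset_pending = false;
	reinit(arg);
	return (true);
}

/* Demonstration stub: counts how often a reset would have happened. */
void
demo_reinit(void *arg)
{
	(*(int *)arg)++;
}
```

Several hang reports that arrive before the service call collapse into a single
reset, which keeps a burst of detections from reinitializing the adapter
repeatedly.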

Intel themselves do not offer an OEM driver for FreeBSD in their Intel Network
Connections 29.1 package.

So, my theory is that there is a hardware idiosyncrasy in this Intel adapter
family which causes packet flow to stop sometimes.
This is handled in the Linux driver module by checking whether pending
transmit work has gone uncompleted for longer than a timeout.
That detection and handling would hardly be there if there were no problem, so
I take the existence of the problem as a given.

I suspect that the same handling is contained in the Windows drivers, too -
which I cannot ascertain because I cannot look at the source code.
However, this would be in line with the observed "micro-hangs" under Windows
from other users.

Alas, under FreeBSD, there is no handling of this condition, which might
explain the total packet loss after it occurs.
If it were fixed in FreeBSD, it would be a great benefit for applications like
pfSense and OpnSense, since these adapters are currently essentially unusable.
A potential fix would still produce "micro-hangs" once in a while, but that
is far better than losing the connection completely.

-- 
You are receiving this mail because:
You are the assignee for the bug.
