I used Claude Opus 4.6 to develop a stress-test suite whose primary
goal is to break VF stability. The suite focuses on aggressive edge
cases, specifically cyclic VF migration between network namespaces
while VLAN filtering is active, a sequence known to trigger state
machine regressions. The following output shows the failures on an
unpatched iavf driver (prior to the 'fix VLAN filter state machine
races' patch):
# echo 8 > /sys/class/net/enp65s0f0np0/device/sriov_numvfs
# ./tools/testing/selftests/drivers/net/iavf_vlan_state.sh
================================================
iavf VLAN state machine test suite
================================================
VF1: enp65s0f0v0 (0000:41:01.0) -> iavf-t1-6502
VF2: enp65s0f0v1 (0000:41:01.1) -> iavf-t2-6502
PF: enp65s0f0np0 (0000:41:00.0)
MAX: 8 user VLANs per VF
================================================
PASS state: basic add/remove
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
FAIL state: 8 VLANs add/remove (only 7 created)
PASS state: VLAN persists across down/up
PASS state: 5 VLANs persist across down/up
PASS state: rapid add/del same VLAN x100
PASS state: add during remove (REMOVING race)
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
PASS state: bulk 8 add then remove
PASS state: 20x rapid down/up with VLAN
PASS state: add VLAN while down
PASS state: remove VLAN while down
PASS state: down -> remove -> up
PASS state: add VLANs while down, verify all after up
PASS state: double add same VLAN (idempotent)
PASS state: double remove same VLAN
PASS state: interleaved add/remove different VIDs
PASS state: remove+re-add loop x50
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
FAIL state: stress 8 VLANs (fill to max) (expected 8, got 7)
PASS state: VLAN VID 1 (common edge case)
PASS state: VLAN VID 4094 (max)
PASS state: concurrent VLAN adds (4 parallel)
PASS state: concurrent VLAN deletes (4 parallel)
PASS state: add/del storm (200 ops, 5 VIDs)
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
FAIL state: over-limit VLAN rejected, existing survive (fill: expected 8, got 7)
PASS reset: VLANs recover after VF PCI FLR
PASS reset: 5 VLANs recover after VF PCI FLR
PASS reset: rapid VF resets x5 with VLANs
PASS reset: VLANs survive PF link flap
PASS reset: 5 VLANs survive PF link flap
PASS reset: VLANs survive 3x PF link flap
PASS reset: VLANs survive PF PCI FLR
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
FAIL reset: all 8 VLANs recover after VF FLR (VLAN 107 gone)
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
FAIL reset: all 8 VLANs survive PF link flap (VLAN 107 gone)
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
FAIL reset: all 8 VLANs survive PF PCI FLR (VLAN 107 gone)
PASS reset: FLR during VLAN add/del (race)
PASS reset: VF driver unbind/bind cycle
PASS ping: basic VLAN traffic
PASS ping: 5 VLANs simultaneously
PASS ping: survives VF down/up
PASS ping: survives 10x rapid VF flap
PASS ping: survives VF PCI FLR
PASS ping: survives PF link flap
PASS ping: survives PF PCI FLR
PASS ping: stable while adding/removing other VLANs
PASS ping: all 3 VLANs work after down/up
PASS ping: parallel VLAN churn from both VFs
PASS ping: VLANs work after rapid add/del churn
PASS ping: VLANs survive repeated NS move cycle
PASS ping: all VLANs survive PF link flap
PASS ping: VLAN isolation (no cross-VLAN leakage)
PASS ping: traffic works with spoofchk enabled
PASS ping: port VLAN (PF-assigned pvid)
PASS dmesg: no call traces / BUGs / stalls
================================================
PASS 46 | FAIL 6 | SKIP 0 | TOTAL 52
================================================
RESULT: FAIL -- check dmesg
All six failures are preceded by the same error: adding VLAN 107 on
enp65s0f0v0 fails with "Input/output error", so every test that needs
all 8 VLANs ends up with only 7. The underlying cause is a breakdown in
state synchronization between the VF and the PF, which prevents the
driver from keeping the hardware state consistent during rapid
configuration cycles.
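For anyone who wants to poke at this by hand, the fill-to-maximum step
can be approximated with a few lines of shell. This is only a rough
sketch, not code from the test suite: the function name fill_vlans and
the base VID of 100 are my assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: fill a VF with VLAN sub-interfaces and report how many exist.
# The function name and the default base VID (100) are assumptions and are
# not taken from the test suite.

# fill_vlans DEV COUNT [BASE_VID]
# Prints the number of VLAN devices that were created and could be found.
fill_vlans() {
    dev=$1 count=$2 base=${3:-100}
    created=0 i=0
    while [ "$i" -lt "$count" ]; do
        vid=$((base + i))
        # Create DEV.VID, then confirm that the device really exists.
        if ip link add link "$dev" name "$dev.$vid" type vlan id "$vid" &&
           ip link show dev "$dev.$vid" >/dev/null 2>&1; then
            created=$((created + 1))
        fi
        i=$((i + 1))
    done
    echo "$created"
}
```

Run as root with something like "fill_vlans enp65s0f0v0 8". A result
below 8 would correspond to the "expected 8, got 7" failures in the
output above.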
...................
Patched kernel:
# echo 8 > /sys/class/net/enp65s0f0np0/device/sriov_numvfs
# ./tools/testing/selftests/drivers/net/iavf_vlan_state.sh
================================================
iavf VLAN state machine test suite
================================================
VF1: enp65s0f0v0 (0000:41:01.0) -> iavf-t1-6573
VF2: enp65s0f0v1 (0000:41:01.1) -> iavf-t2-6573
PF: enp65s0f0np0 (0000:41:00.0)
MAX: 8 user VLANs per VF
================================================
PASS state: basic add/remove
PASS state: 8 VLANs add/remove
PASS state: VLAN persists across down/up
PASS state: 5 VLANs persist across down/up
PASS state: rapid add/del same VLAN x100
PASS state: add during remove (REMOVING race)
PASS state: bulk 8 add then remove
PASS state: 20x rapid down/up with VLAN
PASS state: add VLAN while down
PASS state: remove VLAN while down
PASS state: down -> remove -> up
PASS state: add VLANs while down, verify all after up
PASS state: double add same VLAN (idempotent)
PASS state: double remove same VLAN
PASS state: interleaved add/remove different VIDs
PASS state: remove+re-add loop x50
PASS state: stress 8 VLANs (fill to max)
PASS state: VLAN VID 1 (common edge case)
PASS state: VLAN VID 4094 (max)
PASS state: concurrent VLAN adds (4 parallel)
PASS state: concurrent VLAN deletes (4 parallel)
PASS state: add/del storm (200 ops, 5 VIDs)
PASS state: over-limit VLAN rejected, existing survive
PASS reset: VLANs recover after VF PCI FLR
PASS reset: 5 VLANs recover after VF PCI FLR
PASS reset: rapid VF resets x5 with VLANs
PASS reset: VLANs survive PF link flap
PASS reset: 5 VLANs survive PF link flap
PASS reset: VLANs survive 3x PF link flap
PASS reset: VLANs survive PF PCI FLR
PASS reset: all 8 VLANs recover after VF FLR
PASS reset: all 8 VLANs survive PF link flap
PASS reset: all 8 VLANs survive PF PCI FLR
PASS reset: FLR during VLAN add/del (race)
PASS reset: VF driver unbind/bind cycle
PASS ping: basic VLAN traffic
PASS ping: 5 VLANs simultaneously
PASS ping: survives VF down/up
PASS ping: survives 10x rapid VF flap
PASS ping: survives VF PCI FLR
PASS ping: survives PF link flap
PASS ping: survives PF PCI FLR
PASS ping: stable while adding/removing other VLANs
PASS ping: all 3 VLANs work after down/up
PASS ping: parallel VLAN churn from both VFs
PASS ping: VLANs work after rapid add/del churn
PASS ping: VLANs survive repeated NS move cycle
PASS ping: all VLANs survive PF link flap
PASS ping: VLAN isolation (no cross-VLAN leakage)
PASS ping: traffic works with spoofchk enabled
PASS ping: port VLAN (PF-assigned pvid)
PASS dmesg: no call traces / BUGs / stalls
================================================
PASS 52 | FAIL 0 | SKIP 0 | TOTAL 52
================================================
RESULT: OK
Additionally, interface up/down performance with active VLAN
filtering is significantly improved. The previous bottleneck was a
synchronous VLAN filtering cycle (VF -> PF -> HW -> PF -> VF) that
used AdminQ for per-VLAN updates and introduced substantial latency.
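A crude way to compare the two kernels is to time a handful of down/up
cycles with VLANs configured. The sketch below is my own illustration,
not part of the test suite; the name time_flaps is hypothetical, and it
relies on GNU date supporting %N. It only measures how long the ip
commands take to return, which may not capture filter programming that
completes asynchronously.

```shell
#!/bin/sh
# Sketch only: time N down/up cycles of an interface, in milliseconds.
# Relies on GNU date's %N (nanoseconds); the function name is hypothetical.

# time_flaps DEV [CYCLES]
# Prints the elapsed wall-clock time for CYCLES down/up pairs.
time_flaps() {
    dev=$1 cycles=${2:-10}
    start=$(date +%s%N)
    n=0
    while [ "$n" -lt "$cycles" ]; do
        ip link set dev "$dev" down
        ip link set dev "$dev" up
        n=$((n + 1))
    done
    end=$(date +%s%N)
    # Convert nanoseconds to milliseconds.
    echo $(( (end - start) / 1000000 ))
}
```

For example, "time_flaps enp65s0f0v0 20" after adding several VLANs to
the VF, run once on each kernel.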
Test suite:
https://github.com/torvalds/linux/commit/5c60850c33da80a1c2497fb6bc31f956316197a9
Regards,
Petr