On 2021/7/22 23:46, Thomas Monjalon wrote: > 22/07/2021 15:50, fengchengwen: >> Hi, all >> >> I notice ethdev support dev_reset ops, which could be used to recover >> from >> errors, and only 13+ drivers support this function. >> And also there is event for reset: RTE_ETH_EVENT_INTR_RESET, and only 6 >> drivers support it (most of them are VF). >> >> This provides users with two ways to handle hardware errors: >> a. driver report RTE_ETH_EVENT_INTR_RESET, and application do reset ops. >> b. application detect errors (the detection method is unclear), and call >> reset ops to recover. >> >> According to the design of this API, error handling is assigned to the >> application, and the driver is only responsible for reporting events. This >> simplifies the driver design (for example, the driver does not need to >> maintain >> mutex locks). >> >> As we know, many modern NICs come with firmware, have PCIE interfaces, >> support SR-IOV, the hardware errors can have: firmware reboot/PF reset/ >> VF reset/FLR, but these errors(particularly firmware/PF) are not addressed in >> most drivers. >> >> Question 1: what do we think of these errors(particularly firmware/PF)? >> Do >> we think that the probability is very low and that there is no need to deal >> with >> them? > > Even rare errors must be managed.
Because intel and mlx NIC are widely used, I look at i40e/mlx5 driver code, and found: 1) i40e PF driver, it only show logs when detect global reset and other error: if (icr0 & I40E_PFINT_ICR0_GRST_MASK) PMD_DRV_LOG(INFO, "ICR0: global reset requested"); if (icr0 & I40E_PFINT_ICR0_PCI_EXCEPTION_MASK) PMD_DRV_LOG(INFO, "ICR0: PCI exception activated"); if (icr0 & I40E_PFINT_ICR0_STORM_DETECT_MASK) PMD_DRV_LOG(INFO, "ICR0: a change in the storm control state"); @Beilei Why not report RESET_EVENT in these cases ? or these error are very rarely so just report it. And also, the application may still send or recv packet, These resets, if not handled correctly, do not cause the hardware/driver to hang ? 2) mlx5 PF driver, I notice there is a mlx5_dev_interrupt_device_fatal which will remove the device. @Matan Why not report RESET_EVENT in these cases ? so the application can be recovered quickly. > >> Question 2: I prefer to put error handling in the application layer, >> because >> doing it in the driver can make the driver complex, but there is no app to >> register the INTR_RESET event handler. I think we can build a standard >> handler >> in testpmd, What do you think? > > Absolutely. As any ethdev API, it must be tested with testpmd. > > > . >