Hi all,

I mailed netdev and intel-wired-lan about stability issues with the i40e driver on 4.9 kernels, and Todd Fujinaka suggested that I mail this list about our issue instead.
We have been running 4.9 kernels for several months on CentOS 7.3 and for a few weeks on CentOS 7.4. After we replaced our 10GbE copper cards (X540-AT2 with the ixgbe driver) with X710 10GbE SFP cards using the i40e driver, we noticed severe instabilities on our servers. On several servers the links were marked down and up again without any obvious reason, except for a lot of errors in kernel.log:

[..snip..]
2017-10-04T15:50:46.839998+02:00kernel: i40e 0000:04:00.1 eth0: tx_timeout recovery level 3, hung_queue 11
2017-10-04T15:50:50.119447+02:00kernel: i40e 0000:04:00.0: Query for DCB configuration failed, err I40E_ERR_ADMIN_QUEUE_ERROR aq_err I40E_AQ_RC_EPERM
2017-10-04T15:50:50.119455+02:00kernel: i40e 0000:04:00.0: DCB init failed -53, disabled
2017-10-04T15:50:50.301798+02:00kernel: i40e 0000:04:00.0 eth1: NIC Link is Down
2017-10-04T15:50:50.423744+02:00kernel: i40e 0000:04:00.1: Query for DCB configuration failed, err I40E_ERR_ADMIN_QUEUE_ERROR aq_err I40E_AQ_RC_EPERM
2017-10-04T15:50:50.423752+02:00kernel: i40e 0000:04:00.1: DCB init failed -53, disabled
2017-10-04T15:50:50.600812+02:00kernel: i40e 0000:04:00.1 eth0: NIC Link is Down
2017-10-04T15:50:50.764799+02:00kernel: i40e 0000:04:00.1 eth0: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
2017-10-04T15:50:53.234804+02:00kernel: i40e 0000:04:00.0 eth1: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
2017-10-04T15:51:17.201808+02:00kernel: i40e 0000:04:00.1: TX driver issue detected, PF reset issued
[..snip..]

We run the BIRD Internet Routing Daemon on our servers to establish BGP peerings with routers, and we have also observed flapping on those BGP peerings.
At the same time as the BGP peering stability issues, we saw these kernel errors:

2017-10-06T07:36:10.526657+02:00 kernel: [60720.957855] i40e 0000:04:00.1: DCB init failed -53, disabled
2017-10-06T07:36:12.127091+02:00 kernel: [60722.553258] i40e 0000:04:00.1: TX driver issue detected, PF reset issued
2017-10-06T07:36:12.509188+02:00 kernel: [60722.891523] i40e 0000:04:00.1: Query for DCB configuration failed, err I40E_ERR_ADMIN_QUEUE_ERROR aq_err I40E_AQ_RC_EPERM

We decided to go back to the 3.10 kernel from CentOS, but that process wasn't smooth, as the latest firmware gave us problems with speed detection. We rolled back to a firmware two versions older and the speed detection issue was resolved. We have been running 3.10 for several weeks without any problems. Even though we want certain functionality from kernel 4.9, we decided to switch back to 3.10, as the stability of our systems has higher priority.

I need to mention that in all occurrences of the issue we didn't see any traffic anomalies, such as DDoS attacks.

I have opened https://communities.intel.com/message/501682#501682 and there you can find all the error messages and other information.

Todd Fujinaka asked me to provide reproduction steps, but we only hit the issue when we had real customer traffic on our servers.

Has anyone seen these errors and observed this kind of instability?

Since we noticed the issues, I have been following the netdev ML, and I know that there are a lot of improvements/patches queued up for 4.14. I am hoping those patches fix our issue and, most importantly, are sent to linux-stable for inclusion in the 4.9 kernel.

Cheers,
Pavlos
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired