Hi all,

I mailed netdev and intel-wired-lan about stability issues with the i40e driver
on 4.9 kernels, and Todd Fujinaka suggested that I mail this list instead.

We have been running 4.9 kernels for several months on CentOS 7.3 and for a few
weeks on CentOS 7.4. After we replaced our 10GbE copper cards (X540-AT2 with the
ixgbe driver) with X710 10GbE SFP cards using the i40e driver, we noticed severe
instabilities on our servers.

On several servers the links were marked down and up again without any obvious
reason, except for a lot of errors in kernel.log:

[..snip..]
2017-10-04T15:50:46.839998+02:00kernel: i40e 0000:04:00.1 eth0: tx_timeout recovery level 3, hung_queue 11

2017-10-04T15:50:50.119447+02:00kernel: i40e 0000:04:00.0: Query for DCB configuration failed, err I40E_ERR_ADMIN_QUEUE_ERROR aq_err I40E_AQ_RC_EPERM

2017-10-04T15:50:50.119455+02:00kernel: i40e 0000:04:00.0: DCB init failed -53, disabled

2017-10-04T15:50:50.301798+02:00kernel: i40e 0000:04:00.0 eth1: NIC Link is Down

2017-10-04T15:50:50.423744+02:00kernel: i40e 0000:04:00.1: Query for DCB configuration failed, err I40E_ERR_ADMIN_QUEUE_ERROR aq_err I40E_AQ_RC_EPERM

2017-10-04T15:50:50.423752+02:00kernel: i40e 0000:04:00.1: DCB init failed -53, disabled

2017-10-04T15:50:50.600812+02:00kernel: i40e 0000:04:00.1 eth0: NIC Link is Down

2017-10-04T15:50:50.764799+02:00kernel: i40e 0000:04:00.1 eth0: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None

2017-10-04T15:50:53.234804+02:00kernel: i40e 0000:04:00.0 eth1: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None

2017-10-04T15:51:17.201808+02:00kernel: i40e 0000:04:00.1: TX driver issue detected, PF reset issued
[..snip..]

We run the BIRD routing daemon on our servers to establish BGP peerings with
routers, and we have also observed flapping on those peerings. At the same time
as the BGP peering stability issues, we saw these kernel errors:

2017-10-06T07:36:10.526657+02:00 kernel: [60720.957855] i40e 0000:04:00.1: DCB init failed -53, disabled

2017-10-06T07:36:12.127091+02:00 kernel: [60722.553258] i40e 0000:04:00.1: TX driver issue detected, PF reset issued

2017-10-06T07:36:12.509188+02:00 kernel: [60722.891523] i40e 0000:04:00.1: Query for DCB configuration failed, err I40E_ERR_ADMIN_QUEUE_ERROR aq_err I40E_AQ_RC_EPERM
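
For anyone who wants to check their own logs for the same pattern, here is a
minimal sketch (not something taken from our production tooling) that counts
the i40e messages quoted above per PCI device. The names summarize_i40e,
LINE_RE, PATTERNS and SAMPLE are made up for illustration, and the regular
expression assumes only the two syslog layouts shown in the excerpts; other
syslog formats (for example with a hostname field) would need adjusting.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpts above: ISO timestamp, "kernel:",
# an optional "[uptime]" field, then "i40e <PCI address>[ <ifname>]: <message>".
LINE_RE = re.compile(
    r"^(?P<ts>\S+?)\s*kernel:\s*(?:\[\s*[\d.]+\]\s*)?"
    r"i40e (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"
    r"(?: (?P<ifname>\S+))?: (?P<msg>.*)$"
)

# Illustrative labels mapped to substrings of the driver messages seen above.
PATTERNS = {
    "tx_timeout": "tx_timeout recovery",
    "pf_reset": "TX driver issue detected, PF reset issued",
    "dcb_query_failed": "Query for DCB configuration failed",
    "dcb_init_failed": "DCB init failed",
    "link_down": "NIC Link is Down",
    "link_up": "NIC Link is Up",
}


def summarize_i40e(lines):
    """Count known i40e messages per (PCI address, label)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not an i40e line in the assumed format
        for label, needle in PATTERNS.items():
            if needle in match.group("msg"):
                counts[(match.group("pci"), label)] += 1
                break
    return counts


# A few lines copied from the excerpts above, with the mail wrapping undone.
SAMPLE = [
    "2017-10-04T15:50:46.839998+02:00kernel: i40e 0000:04:00.1 eth0: "
    "tx_timeout recovery level 3, hung_queue 11",
    "2017-10-04T15:50:50.301798+02:00kernel: i40e 0000:04:00.0 eth1: "
    "NIC Link is Down",
    "2017-10-06T07:36:10.526657+02:00 kernel: [60720.957855] "
    "i40e 0000:04:00.1: DCB init failed -53, disabled",
    "2017-10-06T07:36:12.127091+02:00 kernel: [60722.553258] "
    "i40e 0000:04:00.1: TX driver issue detected, PF reset issued",
]

if __name__ == "__main__":
    for (pci, label), n in sorted(summarize_i40e(SAMPLE).items()):
        print(pci, label, n)
```

Feeding it a whole kernel.log (one line per list element) and comparing the
timestamps of the tx_timeout and pf_reset entries with the BGP session flaps in
the BIRD log is one way to confirm that the two coincide.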

We decided to go back to the 3.10 kernel from CentOS, but that process wasn't
smooth, as the latest firmware gave us problems with speed detection. We rolled
back to a firmware two versions older and the speed detection issue was
resolved. We have been running 3.10 for several weeks without any problems.

Even though we want certain functionality from kernel 4.9, we decided to switch
back to 3.10, as the stability of our systems has higher priority.

I should mention that in all occurrences of the issue we didn't see any
anomalies, such as DDoS attacks.

I have opened https://communities.intel.com/message/501682#501682 and there you
can find all the error messages and other information.

Todd Fujinaka asked me to provide reproduction steps, but we only got the issues
when we had real customer traffic on our servers.

Has anyone seen those errors and observed this kind of instability?

Since we noticed the issues, I have been following the netdev ML, and I know
that there are a lot of improvements/patches queued up for 4.14. I am hoping
those patches fix our issue and, most importantly, are sent to linux-stable for
inclusion in the 4.9 kernel.

Cheers,
Pavlos

_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel