This is just about the last part of your post, about the 4.9 kernel and CentOS.

Are you using the stable 4.9 kernel or are you hoping patches get pulled into 
the CentOS 4.9 kernel? If it's the latter, you need to file a bug with Red Hat 
to have the patches pulled into RHEL, and then CentOS should get those changes 
as well. We have no direct control over the RHEL/CentOS kernels.

If it's the former, someone (most likely you, since you're the one who needs 
the patches) has to identify the patches that should be pulled into the stable 
4.9 kernel and email the maintainer of the stable kernels.

I never said Intel is not monitoring the communities. I said the networking 
group is not monitoring the communities. At the very least, I am not monitoring 
the communities at all and only look when someone points things out to me.

Also, if you're running HP hardware, you may want to file a bug with HP as the 
firmware updates have to come from HP and this may be a firmware issue.

Todd Fujinaka
Software Application Engineer
Datacenter Engineering Group
Intel Corporation
todd.fujin...@intel.com

-----Original Message-----
From: Pavlos Parissis [mailto:pavlos.paris...@gmail.com] 
Sent: Wednesday, October 25, 2017 2:45 PM
To: e1000-devel@lists.sourceforge.net
Subject: [E1000-devel] Instability of i40e driver on 4.9 kernel

Hi all,

I mailed netdev and intel-wired-lan about stability issues with the i40e
driver on 4.9 kernels, and Todd Fujinaka suggested mailing this list about our
issue instead.

We have been running 4.9 kernels for several months on CentOS 7.3 and for a
few weeks on CentOS 7.4. After we replaced our 10GbE copper cards (X540-AT2
with the ixgbe driver) with X710 10GbE SFP cards using the i40e driver, we
noticed severe instability on our servers.

On several servers the links were marked down and up again without any obvious
reason, except for a lot of errors in kernel.log:

[..snip..]
2017-10-04T15:50:46.839998+02:00kernel: i40e 0000:04:00.1 eth0: tx_timeout recovery level 3, hung_queue 11
2017-10-04T15:50:50.119447+02:00kernel: i40e 0000:04:00.0: Query for DCB configuration failed, err I40E_ERR_ADMIN_QUEUE_ERROR aq_err I40E_AQ_RC_EPERM
2017-10-04T15:50:50.119455+02:00kernel: i40e 0000:04:00.0: DCB init failed -53, disabled
2017-10-04T15:50:50.301798+02:00kernel: i40e 0000:04:00.0 eth1: NIC Link is Down
2017-10-04T15:50:50.423744+02:00kernel: i40e 0000:04:00.1: Query for DCB configuration failed, err I40E_ERR_ADMIN_QUEUE_ERROR aq_err I40E_AQ_RC_EPERM
2017-10-04T15:50:50.423752+02:00kernel: i40e 0000:04:00.1: DCB init failed -53, disabled
2017-10-04T15:50:50.600812+02:00kernel: i40e 0000:04:00.1 eth0: NIC Link is Down
2017-10-04T15:50:50.764799+02:00kernel: i40e 0000:04:00.1 eth0: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
2017-10-04T15:50:53.234804+02:00kernel: i40e 0000:04:00.0 eth1: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
2017-10-04T15:51:17.201808+02:00kernel: i40e 0000:04:00.1: TX driver issue detected, PF reset issued
[..snip..]

We run the BIRD Internet routing daemon on our servers to establish BGP
peerings with routers, and we have also observed those peerings flapping.
Whenever we had BGP peering stability issues, we also saw kernel errors:

2017-10-06T07:36:10.526657+02:00 kernel: [60720.957855] i40e 0000:04:00.1: DCB init failed -53, disabled
2017-10-06T07:36:12.127091+02:00 kernel: [60722.553258] i40e 0000:04:00.1: TX driver issue detected, PF reset issued
2017-10-06T07:36:12.509188+02:00 kernel: [60722.891523] i40e 0000:04:00.1: Query for DCB configuration failed, err I40E_ERR_ADMIN_QUEUE_ERROR aq_err I40E_AQ_RC_EPERM
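
For anyone triaging similar logs, a minimal Python sketch that tallies i40e
messages per PCI function might look like the following. The category labels
and the substrings used to match them are assumptions chosen for illustration,
based only on the messages quoted in this mail, and the script assumes each
log entry sits on a single line.

```python
import re
import sys
from collections import Counter

# Matches "i40e <pci-address>[ <ifname>]: <message>" anywhere in a log line.
# The optional interface name (e.g. "eth0") is skipped.
I40E_LINE = re.compile(
    r"i40e (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"
    r"(?: \S+)?: (?P<msg>.*)"
)

# Hypothetical category labels mapped to substrings seen in the quoted logs.
CATEGORIES = {
    "tx_timeout": "tx_timeout recovery",
    "dcb_query_failed": "Query for DCB configuration failed",
    "dcb_init_failed": "DCB init failed",
    "link_down": "NIC Link is Down",
    "link_up": "NIC Link is Up",
    "pf_reset": "PF reset issued",
}


def classify(message):
    """Return the first category whose substring occurs in the message."""
    for label, needle in CATEGORIES.items():
        if needle in message:
            return label
    return "other"


def tally(lines):
    """Count i40e messages per (PCI address, category) pair."""
    counts = Counter()
    for line in lines:
        match = I40E_LINE.search(line)
        if match:
            counts[(match.group("pci"), classify(match.group("msg")))] += 1
    return counts


if __name__ == "__main__":
    # Usage: python tally_i40e.py /path/to/kernel.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for (pci, label), count in sorted(tally(log).items()):
            print(f"{pci} {label}: {count}")
```

Comparing the per-function counts against the timestamps of the BGP flaps
could help show whether the resets and the flaps line up.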

We decided to go back to the 3.10 kernel from CentOS, but that process wasn't
smooth, as the latest firmware gave us problems with speed detection. We rolled
back to a firmware two versions older and the speed detection issue was
resolved. We have been running 3.10 for several weeks without any problems.

Even though we want certain functionality from kernel 4.9, we decided to switch
back to 3.10, as the stability of our systems has higher priority.

I need to mention that in all occurrences of the issue we didn't see any
anomalies, such as DDoS attacks.

I have opened https://communities.intel.com/message/501682#501682 and there you 
can find all the error messages and other information.

Todd Fujinaka asked me to provide reproduction steps, but we only got the 
issues when we had real customer traffic on our servers.

Has anyone seen those errors and observed this kind of instability?

Since we noticed the issues, I have been following the netdev ML and I know
that there are a lot of improvements/patches queued up for 4.14. I am hoping
those patches fix our issue and, most importantly, are sent to linux-stable for
inclusion in the 4.9 kernel.

Cheers,
Pavlos

