Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

2017-01-30 Thread Kevin Stange
On 01/30/2017 06:12 PM, Adi Pircalabu wrote:
> On 31/01/17 10:49, Kevin Stange wrote:
>> You said 3.x kernels specifically. The kernel on Xen Made Easy now is a
>> 4.4 kernel.  Any chance you have tested with that one?
> 
> Not yet, however the future Xen nodes we'll deploy will run CentOS 7 and
> Xen with kernel 4.4.

I'll keep you (and others here) posted on my own experience with that
4.4 build over the next few weeks and report any issues.  I'm hoping
something changed between 3.18 and 4.4 that fixed the underlying problems.

>> Did you ever try without MTU=9000 (default 1500 instead)?
> 
> Yes, also with all sorts of configuration combinations like LACP rate
> slow/fast, "options ixgbe LRO=0,0" and so on. No improvement.

Alright, I'll assume that probably won't help then.  I tried it on one
box which hasn't had the issue again yet, but that doesn't guarantee
anything.
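
The change itself was just dropping the MTU at runtime and then fixing
the ifcfg files so it sticks (interface names here are examples):

  ip link set dev bond1 mtu 1500
  # plus MTU=1500 in /etc/sysconfig/network-scripts/ifcfg-bond1 and the
  # slave configs so it survives a network restart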

>> I am having certain issues on certain hardware where there's no shutting
>> down the affected NICs.  Trying to do so or unload the igb module hangs
>> the entire box.  But in that case they're throwing AER errors instead of
>> just unit hangs:
>>
>> pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received:
>> id=
>> igb 0000:04:00.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal),
>> type=Transaction Layer, id=0401(Requester ID)
>> igb 0000:04:00.1:   device [8086:10a7] error
>> status/mask=4000/
>> igb 0000:04:00.1:[14] Completion Timeout (First)
>> igb 0000:04:00.1: broadcast error_detected message
>> igb 0000:04:00.1: broadcast slot_reset message
>> igb 0000:04:00.1: broadcast resume message
>> igb 0000:04:00.1: AER: Device recovery successful
> 
> This is interesting. We've never had any problems with the 1Gb NICs, but
> we're only using 10Gb for the storage network. Could it be a common
> problem with either the adapters or the drivers that only shows up
> when running the Xen-enabled kernel?

Since I've never run the 3.18 kernel on a box of this type other than as
a dom0, and since I can't reproduce this kind of issue without a fair
amount of NIC load over a long period of time, it's impossible to test
whether it's tied to Xen.

However, I know this hardware works well under 2.6.32-*.el6 and
3.10.0-*.el7 kernels without stability problems, as it did with
2.6.18-*.el5xen (Xen 3.4.4).

I suspect the above errors are due to something PCIe related, and that a
subset of boxes is being hit by two distinct problems with equivalent
impact, which increases the likelihood that those boxes will die.
Another set of boxes only ever sees the unit hangs, which seem
unrecoverable even by unloading and reloading the driver.  A third set
has only random, recoverable unit hangs.  With so much diversity, it's
even harder to pin specific causes to the problems.

The fact we're both pushing NFS and iSCSI traffic over these links makes
me wonder if there's something about that kind of traffic that increases
the chances of causing these issues.  When I put VM network traffic over
the same NICs, they seem a lot less prone to failures, but also end up
pushing less traffic in general.
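
A crude watcher run from cron would at least give early warning when
either failure signature shows up (a sketch only, not a fix; the tag and
match strings are just what I'd grep for):

  #!/bin/sh
  # flag either the Tx unit hang or the AER completion timeout
  if dmesg | egrep -q 'Detected Tx Unit Hang|PCIe Bus Error'; then
      logger -t nicwatch "NIC failure signature in dmesg on $(hostname)"
  fi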

>> Switching to Broadcom would be a possibility, though it's tricky because
>> two of the NICs are onboard, so we'd need to replace the dual-port 1G
>> card with a quad-port 1G card.  Since you're saying you're all 10G,
>> maybe you don't know, but if you have any specific Broadcom 1G cards
>> you've had good fortune with, I'd be interested in knowing which models.
>> Broadcom cards are rarely labeled as such, which makes finding them a
>> bit more difficult than Intel ones.
> 
> We've purchased a number of servers with Broadcom BCM957810A1008G, sold
> by Dell as QLogic 57810 dual 10Gb Base-T adapters, none of them going up
> & down like a yo-yo so far.
> 
>> So far the one hypervisor with pci=nomsi has been quiet but that doesn't
>> mean it's fixed.  I need to give it 6 weeks or so. :)
> 
> It'd be more like 6-9 months for us, making it terrible to debug :-/

I had a bunch of these under relatively light VM load for 3 months of
"burn in" with no issues, but they've been failing pretty aggressively
since I started putting real loads on them.  Still, it's odd
because some of the boxes with identical hardware and similar VM loads
have not yet blown up after 3 or more weeks, and maybe they won't for
several months.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
_______________________________________________
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt


Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

2017-01-30 Thread Adi Pircalabu

On 31/01/17 10:49, Kevin Stange wrote:
> You said 3.x kernels specifically. The kernel on Xen Made Easy now is a
> 4.4 kernel.  Any chance you have tested with that one?

Not yet, however the future Xen nodes we'll deploy will run CentOS 7 and
Xen with kernel 4.4.

> Did you ever try without MTU=9000 (default 1500 instead)?

Yes, also with all sorts of configuration combinations like LACP rate
slow/fast, "options ixgbe LRO=0,0" and so on. No improvement.
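
For reference, that option went in the usual way, via a modprobe.d
drop-in (the file name here is just an example):

  # /etc/modprobe.d/ixgbe.conf
  options ixgbe LRO=0,0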



> I am having certain issues on certain hardware where there's no shutting
> down the affected NICs.  Trying to do so or unload the igb module hangs
> the entire box.  But in that case they're throwing AER errors instead of
> just unit hangs:
>
> pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=
> igb 0000:04:00.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal),
> type=Transaction Layer, id=0401(Requester ID)
> igb 0000:04:00.1:   device [8086:10a7] error status/mask=4000/
> igb 0000:04:00.1:[14] Completion Timeout (First)
> igb 0000:04:00.1: broadcast error_detected message
> igb 0000:04:00.1: broadcast slot_reset message
> igb 0000:04:00.1: broadcast resume message
> igb 0000:04:00.1: AER: Device recovery successful

This is interesting. We've never had any problems with the 1Gb NICs, but
we're only using 10Gb for the storage network. Could it be a common
problem with either the adapters or the drivers that only shows up when
running the Xen-enabled kernel?

> Switching to Broadcom would be a possibility, though it's tricky because
> two of the NICs are onboard, so we'd need to replace the dual-port 1G
> card with a quad-port 1G card.  Since you're saying you're all 10G,
> maybe you don't know, but if you have any specific Broadcom 1G cards
> you've had good fortune with, I'd be interested in knowing which models.
> Broadcom cards are rarely labeled as such, which makes finding them a
> bit more difficult than Intel ones.

We've purchased a number of servers with Broadcom BCM957810A1008G, sold
by Dell as QLogic 57810 dual 10Gb Base-T adapters, none of them going up
& down like a yo-yo so far.

> So far the one hypervisor with pci=nomsi has been quiet but that doesn't
> mean it's fixed.  I need to give it 6 weeks or so. :)

It'd be more like 6-9 months for us, making it terrible to debug :-/

Adi Pircalabu


Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

2017-01-30 Thread Kevin Stange
On 01/30/2017 04:17 PM, Adi Pircalabu wrote:
> On 28/01/17 05:21, Kevin Stange wrote:
>> On 01/27/2017 06:08 AM, Karel Hendrych wrote:
>>> Have you tried to eliminate all power management features all over?
>>
>> I've been trying to find and disable all power management features but
>> having relatively little luck with that solving the problems.  Stabbing
>> in the dark I've tried different ACPI settings, including completely
>> disabling it, disabling CPU frequency scaling, and setting pcie_aspm=off
>> on the kernel command line.  Are there other kernel options that might
>> be useful to try?
> 
> May I chip in here? In our environment we're randomly seeing:

Welcome.  It's a relief to know someone else has been having a similar
nightmare!  Perhaps that's not encouraging...

> Jan 17 23:40:14 xen01 kernel: ixgbe 0000:04:00.1 eth6: Detected Tx Unit
> Hang
> Jan 17 23:40:14 xen01 kernel:  Tx Queue <0>
> Jan 17 23:40:14 xen01 kernel:  TDH, TDT <9a>, <127>
> Jan 17 23:40:14 xen01 kernel:  next_to_use  <127>
> Jan 17 23:40:14 xen01 kernel:  next_to_clean<98>
> Jan 17 23:40:14 xen01 kernel: ixgbe 0000:04:00.1 eth6:
> tx_buffer_info[next_to_clean]
> Jan 17 23:40:14 xen01 kernel:  time_stamp   <218443db3>
> Jan 17 23:40:14 xen01 kernel:  jiffies  <218445368>
> Jan 17 23:40:14 xen01 kernel: ixgbe 0000:04:00.1 eth6: tx hang 1
> detected on queue 0, resetting adapter
> Jan 17 23:40:14 xen01 kernel: ixgbe 0000:04:00.1 eth6: Reset adapter
> Jan 17 23:40:15 xen01 kernel: ixgbe 0000:04:00.1 eth6: PCIe transaction
> pending bit also did not clear.
> Jan 17 23:40:15 xen01 kernel: ixgbe 0000:04:00.1: master disable timed out
> Jan 17 23:40:15 xen01 kernel: bonding: bond1: link status down for
> interface eth6, disabling it in 200 ms.
> Jan 17 23:40:15 xen01 kernel: bonding: bond1: link status definitely
> down for interface eth6, disabling it
> [...] repeated every second or so.
> 
>>> Are the devices connected to the same network infrastructure?
>>
>> There are two onboard NICs and two NICs on a dual-port card in each
>> server.  All devices connect to a cisco switch pair in VSS and the links
>> are paired in LACP.
> 
> We've been experiencing ixgbe stability issues on CentOS 6.x with various
> 3.x kernels for years with different ixgbe driver versions and, to date,
> the only way to completely get rid of the issue was to switch from Intel
> to Broadcom. Just like in your case, the problem pops up randomly and
> the only reliable temporary fix is to reboot the affected Xen node.
> Another temporary fix that worked several times but not always was to
> migrate / shutdown the domUs, deactivate the volume groups, log out of
> all the iSCSI targets, "ifdown bond1" and "modprobe -r ixgbe" followed
> by "ifup bond1".
> 
> The setup is:
> - Intel Dual 10Gb Ethernet - either X520-T2 or X540-T2
> - Tried Xen kernels from both xen.crc.id.au and CentoS 6 Xen repos
> - LACP bonding to connect to the NFS & iSCSI storage using Brocade
> VDX6740T fabric. MTU=9000
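
For anyone else hitting this, that temporary fix sequence amounts to
roughly the following (volume group and interface names are examples;
the ordering is from Adi's description above):

  # migrate or shut down the domUs first, then:
  vgchange -an vg_storage        # deactivate the volume groups
  iscsiadm -m node -U all        # log out of all iSCSI targets
  ifdown bond1
  modprobe -r ixgbe
  ifup bond1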

You said 3.x kernels specifically. The kernel on Xen Made Easy now is a
4.4 kernel.  Any chance you have tested with that one?

Did you ever try without MTU=9000 (default 1500 instead)?

I am having certain issues on certain hardware where there's no shutting
down the affected NICs.  Trying to do so or unload the igb module hangs
the entire box.  But in that case they're throwing AER errors instead of
just unit hangs:

pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=
igb 0000:04:00.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal),
type=Transaction Layer, id=0401(Requester ID)
igb 0000:04:00.1:   device [8086:10a7] error status/mask=4000/
igb 0000:04:00.1:[14] Completion Timeout (First)
igb 0000:04:00.1: broadcast error_detected message
igb 0000:04:00.1: broadcast slot_reset message
igb 0000:04:00.1: broadcast resume message
igb 0000:04:00.1: AER: Device recovery successful

Spammed continuously.
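
For what it's worth, confirming whether a port is actually using MSI-X
is quick (the bus address here is taken from the log above), which may
matter if anyone wants to experiment with pci=nomsi:

  lspci -vvv -s 04:00.1 | grep -i msi
  grep eth /proc/interrupts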

Switching to Broadcom would be a possibility, though it's tricky because
two of the NICs are onboard, so we'd need to replace the dual-port 1G
card with a quad-port 1G card.  Since you're saying you're all 10G,
maybe you don't know, but if you have any specific Broadcom 1G cards
you've had good fortune with, I'd be interested in knowing which models.
Broadcom cards are rarely labeled as such, which makes finding them a
bit more difficult than Intel ones.

>>> There has to be something common.
>>
>> The NICs having issues are running a native VLAN, a tagged VLAN, iSCSI
>> and NFS traffic, as well as some basic management stuff over SSH, and
>> they are configured with an MTU of 9000 on the native VLAN.  It's a lot
>> of features, but I can't really turn them off and then actually have
>> enough load on the NICs to reproduce the issue.  Several of these
>> servers were installed and being burned in for 3 months without ever
>> having an issue, but suddenly collapsed when I tried to bring 20 or so
>> real-world VMs up on them.

Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

2017-01-30 Thread Kevin Stange
On 01/30/2017 02:15 PM, Johnny Hughes wrote:
> On 01/30/2017 12:59 PM, Kevin Stange wrote:
>> On 01/30/2017 03:18 AM, Jinesh Choksi wrote:
 Are there other kernel options that might be useful to try?
>>>
>>> pci=nomsi
>>>
>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173/comments/13
>>
>> Incidentally, already found that one and I'm trying it currently on one
>> of the boxes.  So far there's been no issues, but it's only been since
>> Friday.
>>
>> Also, I found this:
>>
>> https://xen.crc.id.au/support/guides/install/
>>
>> There's a 4.4 kernel here built for Xen Dom0, which I'm giving a whirl
>> to see how stable it is, also only since Friday.  I'm not using anything
>> else he's packaged from his repo.
>>
>> On a related note, does the SIG have plans to replace the 3.18 kernel,
>> whose projected EOL is January 2017
>> (https://www.kernel.org/category/releases.html)?
>>
> 
> I am currently working on a 4.4 kernel as a replacement for the 3.18
> kernel.  I have it working well on el7, but not yet working well on el6.
> I hope to have something to release in the first 2 weeks of Feb. for
> testing.

What kind of issues are you having with 4.4?  Since I'm testing that
"Xen Made Easy" build of 4.4, are there any things I should watch out
for?  Might be worth looking at what he did for his builds to see if
that helps get yours working better.

http://au1.mirror.crc.id.au/repo/el6/SRPM/

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net


Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

2017-01-30 Thread Johnny Hughes
On 01/30/2017 12:59 PM, Kevin Stange wrote:
> On 01/30/2017 03:18 AM, Jinesh Choksi wrote:
>>> Are there other kernel options that might be useful to try?
>>
>> pci=nomsi
>>
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173/comments/13
> 
> Incidentally, already found that one and I'm trying it currently on one
> of the boxes.  So far there's been no issues, but it's only been since
> Friday.
> 
> Also, I found this:
> 
> https://xen.crc.id.au/support/guides/install/
> 
> There's a 4.4 kernel here built for Xen Dom0, which I'm giving a whirl
> to see how stable it is, also only since Friday.  I'm not using anything
> else he's packaged from his repo.
> 
> On a related note, does the SIG have plans to replace the 3.18 kernel,
> whose projected EOL is January 2017
> (https://www.kernel.org/category/releases.html)?
> 

I am currently working on a 4.4 kernel as a replacement for the 3.18
kernel.  I have it working well on el7, but not yet working well on el6.
I hope to have something to release in the first 2 weeks of Feb. for
testing.





Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

2017-01-30 Thread Kevin Stange
On 01/30/2017 03:18 AM, Jinesh Choksi wrote:
>> Are there other kernel options that might be useful to try?
> 
> pci=nomsi
> 
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173/comments/13

Incidentally, already found that one and I'm trying it currently on one
of the boxes.  So far there's been no issues, but it's only been since
Friday.
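
On these el6 dom0s that just means appending it to the dom0 kernel's
module line in /boot/grub/grub.conf, roughly like this (the version and
other arguments are illustrative, not an exact config):

  title CentOS 6 Xen
  root (hd0,0)
  kernel /xen.gz dom0_mem=2048M,max:2048M
  module /vmlinuz-3.18.44-20.el6.x86_64 ro root=/dev/mapper/vg-root pci=nomsi
  module /initramfs-3.18.44-20.el6.x86_64.img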

Also, I found this:

https://xen.crc.id.au/support/guides/install/

There's a 4.4 kernel here built for Xen Dom0, which I'm giving a whirl
to see how stable it is, also only since Friday.  I'm not using anything
else he's packaged from his repo.

On a related note, does the SIG have plans to replace the 3.18 kernel
which is marked as projected EOL of January 2017
(https://www.kernel.org/category/releases.html)?

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net


Re: [CentOS-virt] Selinux Problem

2017-01-30 Thread George Dunlap
On Thu, Jan 26, 2017 at 8:08 PM, Günther J. Niederwimmer wrote:
> Hello,
>
> On Thursday, 26 January 2017 at 10:54:20 CET, Johnny Hughes wrote:
>> On 01/26/2017 10:06 AM, Günther J. Niederwimmer wrote:
>> > Hello,
>> >
>> > CentOS 7.(3) Xen 4.4,
>> >
>> > Is there any documentation for SELinux with Xen? I've run into many
>> > problems with SELinux on the dom0.
>> >
>> > Or do I have to disable SELinux when I install Xen?
>> >
>> > Thanks for an answer.
>>
>> We have not tried to make xen work with selinux on Dom0 .. in fact our
>> documentation:
>>
>> https://wiki.centos.org/Manuals/ReleaseNotes/Xen4-01
>>
>>  says:
>>
>> SELinux support is disabled, and you might need to disable SELinux on
>> the dom0 for some operations; primarily when using qemu-xen and blktap
>> backed storage.
>
> This is not the best situation, but if there is no other way I'll have
> to disable selinux :-(.

I think that comment may be a little old.  I do try to support SELinux
-- the smoke tests I use before pushing changes have it enabled by
default, and they use both qemu-xen and blktap.

But it's difficult to help debug problems when you haven't even said
what problem(s) you're having. :-)

Please be sure to include the output of `dmesg`, `xl dmesg`, your
xl.cfg, and /var/log/audit/audit.log.
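
If SELinux is what's biting you, the audit log will normally contain AVC
denials; something like this pulls out the recent ones and summarizes
them (treat any audit2allow output as a diagnostic hint, not as policy
to install blindly):

  ausearch -m avc -ts recent
  ausearch -m avc -ts recent | audit2allow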

Thanks,
 -George


Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

2017-01-30 Thread Jinesh Choksi
> Are there other kernel options that might be useful to try?

pci=nomsi

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173/comments/13



On 27 January 2017 at 18:21, Kevin Stange wrote:

> On 01/27/2017 06:08 AM, Karel Hendrych wrote:
> > Have you tried to eliminate all power management features all over?
>
> I've been trying to find and disable all power management features but
> having relatively little luck with that solving the problems.  Stabbing
> in the dark I've tried different ACPI settings, including completely
> disabling it, disabling CPU frequency scaling, and setting pcie_aspm=off
> on the kernel command line.  Are there other kernel options that might
> be useful to try?
>
> > Are the devices connected to the same network infrastructure?
>
> There are two onboard NICs and two NICs on a dual-port card in each
> server.  All devices connect to a cisco switch pair in VSS and the links
> are paired in LACP.
>
> > There has to be something common.
>
> The NICs having issues are running a native VLAN, a tagged VLAN, iSCSI
> and NFS traffic, as well as some basic management stuff over SSH, and
> they are configured with an MTU of 9000 on the native VLAN.  It's a lot
> of features, but I can't really turn them off and then actually have
> enough load on the NICs to reproduce the issue.  Several of these
> servers were installed and being burned in for 3 months without ever
> having an issue, but suddenly collapsed when I tried to bring 20 or so
> real-world VMs up on them.
>
> The other NICs in the system that are connected don't exhibit issues and
> run only VM network interfaces.  They are also in LACP and running VLAN
> tags, but normal 1500 MTU.
>
> So far it seems to correlate with NICs on the expansion cards, but it's
> a coincidence that these cards are the ones with the storage and
> management traffic.  I'm trying to swap some of this load to the onboard
> NICs to see if the issues migrate over with it, or if they stay with the
> expansion cards.
>
> If the issue exists on both NIC types, then it rules out the specific
> NIC chipset as the culprit.  It could point to the driver, but upgrading
> it to a newer version did not help and actually appeared to make
> everything worse.  This issue might actually be more to do with the PCIe
> bridge than the NICs, but these are still different motherboards with
> different PCIe bridges (5520 vs C600) experiencing the same issues.
>
> > I've been using Intel NICs with Xen/CentOS for ages with no issues.
>
> I figured that must be so.  Everyone uses Intel NICs.  If this was a
> common issue, it would probably be causing a lot of people a lot of
> trouble.
>
> --
> Kevin Stange
> Chief Technology Officer
> Steadfast | Managed Infrastructure, Datacenter and Cloud Services
> 800 S Wells, Suite 190 | Chicago, IL 60607
> 312.602.2689 X203 | Fax: 312.602.2688
> ke...@steadfast.net | www.steadfast.net