[CentOS-virt] XSA-384

2021-09-22 Thread Kevin Stange
I was looking to see whether XSA-384 is in testing for CentOS Virt, and so
far it doesn't look like it is.  From the patch, it looks like it touches
x86 code.  Can anyone push a build that includes this fix?
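
(For anyone else checking: a quick way to see whether an installed build
already carries a given advisory, assuming the packager notes XSAs in the
RPM changelog, is to grep for it:

rpm -q xen-hypervisor
rpm -q --changelog xen-hypervisor | grep -i xsa-384

No output from the grep means the fix isn't mentioned in the changelog.)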

 

Thanks.

 

Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
231 S LaSalle St, Suite 2100 | Chicago, IL 60604
312.602.2689 x203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net 

 



Re: [CentOS-virt] Xen Version update policy

2019-12-12 Thread Kevin Stange
On 12/12/19 8:25 AM, George Dunlap wrote:
> On Mon, Dec 2, 2019 at 5:08 PM Kevin Stange  wrote:
>> I don't really think we should drop a release before its security
>> support ends, unless we have *really clear* communication to repo users
>> as to the life cycles of these builds in advance.
> 
> Indeed, the purpose of this email is in fact to make such a clear
> communication.  Citrix (i.e., Anthony & I) have committed to providing
> up-to-date packages for one version at a time; this is meant to give
> people input into which version that is.  The Virt SIG cannot as a
> whole commit to supporting releases until security support ends unless
> others step up and make commitments to do so.

We should probably provide a matrix of which Xen versions are offered by
the SIG, who is maintaining them, and when they will last be supported
(at least roughly, where upstream scheduling makes it less than 100% clear).

There's a bunch of legacy Xen4CentOS and other confusing Xen docs in the
CentOS documentation that need to be cleaned up, removed, or unified.

I'm happy to continue working on and testing releases for whichever
branch I'm currently on (4.8 right now, though I'm moving on since
upstream is done with security support).  However, I can't make
commitments to support or test versions I am not actively running in
production, nor provide specific life cycle guarantees.  I made that
mistake with 4.4, though Meltdown was an extenuating circumstance; I
simply couldn't handle that kind of backport myself.

That said, I *may* offer continued support and testing for 4.12 when
4.14's release pushes it out of maintenance by the Virt SIG.  That's
something I'm going to have to play by ear, I guess.

I don't want to burden Steven Haigh any further, but I wonder if there's a
way we could combine some of our efforts to make both "Xen made easy!" and
the Virt SIG Xen easier to manage.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net


Re: [CentOS-virt] Xen Version update policy

2019-12-02 Thread Kevin Stange
On 12/2/19 11:08 AM, Kevin Stange wrote:
> On 11/28/19 12:12 PM, George Dunlap wrote:
>> Hey all,
>>
>> This mail has been a long time in coming, but with the upcoming
>> expiration of security support for Xen 4.8, it's time to start thinking
>> about what our update policy will be for the Xen packages in general.
>>
>> Citrix is committed to officially supporting one Xen version at a time
>> through the CentOS Virt SIG.  (Others in the community are welcome to
>> support others.)  But we'd like input as to which version the community
>> would like to be supported at any one time.
>>
>> Please express your opinion on each option by replying as follows:
>> -2: This is a bad idea, and I would argue against this
>> -1: I'm not happy with this, but I wouldn't argue against this.
>> 0: No opinion.
>> 1: I'm happy with this, but I wouldn't argue for it.
>> 2: This is a great idea, and I'd argue for it.
>>
>> There are several possible options:
>>
>> 1. Always support the newest option.  This means we get all the newest
>> features from Xen in the Virt SIG by default; but also means we get all
>> the newest bugs.
>>
>> 1a. Always support the newest option once it has at least one point
>> release.  This balances the newness with a bit of extra testing.
>>
>> 1b. Always support the second-to-newest version (e.g., when 4.13 comes
>> out, switch to 4.12.x)
>>
>> 2. Always support the oldest security-supported version.  This means we
>> get the most stable version of Xen; but it does mean it is several years
>> behind as far as features go.  It also means that further bugfixes do
>> not happen automatically, and further bugs found will need to be
>>
>> 3. Always support the oldest fully-supported version.  Reasonably
>> stable, reasonably old, still gets bugfixes.
>>
>> 4. Support a version until it's out of security support, then jump to
>> the newest version.  This minimizes the number of upgrades required
>> (although may make each upgrade more painful).
>>
>> 4a.  Support a version until it's out of full support, then jump to the
>> newest version.
>>
>> Any other options?
>>
>> For my part, I think 1a, 1b, and 3 are all reasonable options.
> 
> By supporting only even-numbered releases, as is the case now, it has not
> been possible to do hot-migration-based upgrades, which means that we have
> to do full reboots of our entire environment every so often.  Right now
> we're running on Xen 4.8 and transitioning to 4.12 directly.  We skipped
> 4.10 because we felt that 4.12 had been out and stable for long enough.
> Ideally, if every major build of Xen were provided, we would transition by
> hot migrations up to the next release periodically and stay on a
> security-supported release each time one heads toward EOL.
> 
> Personally I would love to see at the very least transitional packages
> for each Xen version available to allow for easier hot migrations to the
> latest release, under the assumption that such migrations are considered
> "supported" upstream.  I believe you said this was to be expected in a
> previous conversation we had on IRC.
> 
> I don't really think we should drop a release before its security
> support ends, unless we have *really clear* communication to repo users
> as to the life cycles of these builds in advance.
> 
> I get why providing updates for 5 major releases concurrently is
> prohibitive for the entire security support period, though if it were
> more automated, maybe it would be easier to manage.
> 
> I think the keys are making sure that the lifecycles are clearly
> communicated in advance and that there's a fairly reliable path for
> people to move up to the latest version that is suitable for production
> use.  So I wouldn't say no to a 1 + 1a + 1b configuration, with the idea
> that 1 is effectively a "testing" tier that becomes stable at 1a, while 1b
> is always provided alongside.  That would, by my interpretation, mean there
> are always 2 or 3 supported versions.  Right
> now, 4.12 "stable" and 4.11 "legacy" would be supported.  When 4.13
> comes out, 4.13 would be "testing" but would be fully maintained with
> security and point release updates.  When 4.13.1 is released it would
> become "stable," 4.11 would be deprecated and 4.12 would become "legacy."
> 
> However, during the transitional period maybe we need to commit to
> supporting 4.10 until its security support ends.
> 

I realized I didn't actually rate the options as requested.
I don't really support any configuration that doesn't

Re: [CentOS-virt] Xen Version update policy

2019-12-02 Thread Kevin Stange
On 11/28/19 12:12 PM, George Dunlap wrote:
> Hey all,
> 
> This mail has been a long time in coming, but with the upcoming
> expiration of security support for Xen 4.8, it's time to start thinking
> about what our update policy will be for the Xen packages in general.
> 
> Citrix is committed to officially supporting one Xen version at a time
> through the CentOS Virt SIG.  (Others in the community are welcome to
> support others.)  But we'd like input as to which version the community
> would like to be supported at any one time.
> 
> Please express your opinion on each option by replying as follows:
> -2: This is a bad idea, and I would argue against this
> -1: I'm not happy with this, but I wouldn't argue against this.
> 0: No opinion.
> 1: I'm happy with this, but I wouldn't argue for it.
> 2: This is a great idea, and I'd argue for it.
> 
> There are several possible options:
> 
> 1. Always support the newest option.  This means we get all the newest
> features from Xen in the Virt SIG by default; but also means we get all
> the newest bugs.
> 
> 1a. Always support the newest option once it has at least one point
> release.  This balances the newness with a bit of extra testing.
> 
> 1b. Always support the second-to-newest version (e.g., when 4.13 comes
> out, switch to 4.12.x)
> 
> 2. Always support the oldest security-supported version.  This means we
> get the most stable version of Xen; but it does mean it is several years
> behind as far as features go.  It also means that further bugfixes do
> not happen automatically, and further bugs found will need to be
> 
> 3. Always support the oldest fully-supported version.  Reasonably
> stable, reasonably old, still gets bugfixes.
> 
> 4. Support a version until it's out of security support, then jump to
> the newest version.  This minimizes the number of upgrades required
> (although may make each upgrade more painful).
> 
> 4a.  Support a version until it's out of full support, then jump to the
> newest version.
> 
> Any other options?
> 
> For my part, I think 1a, 1b, and 3 are all reasonable options.

By supporting only even-numbered releases, as is the case now, it has not
been possible to do hot-migration-based upgrades, which means that we have
to do full reboots of our entire environment every so often.  Right now
we're running on Xen 4.8 and transitioning to 4.12 directly.  We skipped
4.10 because we felt that 4.12 had been out and stable for long enough.
Ideally, if every major build of Xen were provided, we would transition by
hot migrations up to the next release periodically and stay on a
security-supported release each time one heads toward EOL.
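
For reference, the per-host step in such a rolling upgrade is just a live
migration onto a dom0 that already runs the newer Xen.  With the xl
toolstack that looks something like this, where the domain and host names
are placeholders and both hosts need compatible toolstacks and shared or
mirrored storage:

xl migrate guest1 newhost.example.com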

Personally I would love to see at the very least transitional packages
for each Xen version available to allow for easier hot migrations to the
latest release, under the assumption that such migrations are considered
"supported" upstream.  I believe you said this was to be expected in a
previous conversation we had on IRC.

I don't really think we should drop a release before its security
support ends, unless we have *really clear* communication to repo users
as to the life cycles of these builds in advance.

I get why providing updates for 5 major releases concurrently is
prohibitive for the entire security support period, though if it were
more automated, maybe it would be easier to manage.

I think the keys are making sure that the lifecycles are clearly
communicated in advance and that there's a fairly reliable path for
people to move up to the latest version that is suitable for production
use.  So I wouldn't say no to a 1 + 1a + 1b configuration, with the idea
that 1 is effectively a "testing" tier that becomes stable at 1a, while 1b
is always provided alongside.  That would, by my interpretation, mean there
are always 2 or 3 supported versions.  Right
now, 4.12 "stable" and 4.11 "legacy" would be supported.  When 4.13
comes out, 4.13 would be "testing" but would be fully maintained with
security and point release updates.  When 4.13.1 is released it would
become "stable," 4.11 would be deprecated and 4.12 would become "legacy."

However, during the transitional period maybe we need to commit to
supporting 4.10 until its security support ends.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net


Re: [CentOS-virt] Are XSA-289, XSA-274/CVE-2018-14678 fixed ?

2019-06-28 Thread Kevin Stange
Looks like this never got a response from anyone.

On 6/25/19 10:15 AM, Yuriy Kohut wrote:
> Hello,
> 
> Are XSA-289 and XSA-274/CVE-2018-14678 fixed in the recent Xen 4.8 and 4.10
> packages and the 4.9.177 kernel packages?

XSA-289 is a tricky subject.  In the end, it was effectively decided that
the patches were not recommended until they could be reviewed again, so
XSA-289 has no official list of flaws or fixes.  The main suggested
mitigation is to disable SMT on the CPU if possible.
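
If you'd rather do that from the hypervisor side than in the firmware,
newer Xen releases also accept an smt= option on the hypervisor command
line.  A sketch for a grub2-based dom0, assuming your Xen version supports
the option ("..." stands for whatever Xen options you already use):

# in /etc/default/grub
GRUB_CMDLINE_XEN_DEFAULT="... smt=0"

# then regenerate the boot config
grub2-mkconfig -o /boot/grub2/grub.cfg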

XSA-274 was patched into Linux 4.9 almost a year ago:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=987156381c5f875d75ef1f7cc29994d82f646dad

That's 4.9.124, so yes, 4.9.177 has it.
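
If you want to verify that in a linux-stable checkout rather than take my
word for it, git can list the stable tags that contain the commit:

git tag --contains 987156381c5f875d75ef1f7cc29994d82f646dad | grep v4.9.124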

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net


Re: [CentOS-virt] Speculative attack mitigations

2019-06-12 Thread Kevin Stange
s not quite the same as running natively on
your hardware.  Your dom0 is a type of guest, so the mitigations described
in the /sys/ location only reflect the state of the dom0 itself, not of Xen
or of other guests.

The Meltdown section will always show as vulnerable because the Linux
kernel does not recognize the Xen-based mitigation, even though that
mitigation is effective for dom0.  Likewise, PV domains running under a Xen
release patched for Meltdown are protected despite what the guest kernel's
indicator says.
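
To see what Xen itself has applied, check the hypervisor's boot log rather
than dom0's /sys entries.  Recent Xen versions print a summary there,
though the exact wording varies by version:

xl dmesg | grep -i -A4 "speculative mitigation"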

> When I mount the debugfs, the flag for ibrs is missing - preventing me
> from enabling it as a mitigation for ssbd (due to lack of the relevant cpu
> flag):
> 
> # mount -t debugfs none /sys/kernel/debug/
> 
> 
> # ls -lahtr /sys/kernel/debug/x86/ibrs_enabled
> ls: cannot access /sys/kernel/debug/x86/ibrs_enabled: No such file or
> directory
> 
> I have other R620's with the same CPUs running stock el6 kernels that show
> as fully patched against these issues.  Could I please get some feedback
> from whoever builds the kernels as to whether the patches for the various
> speculative issues are in place in the 4.9.177-35 kernel?
> 
> Specifically; CVE-2018-3639, CVE-2018-3640, CVE-2018-3646, CVE-2018-12126, 
> CVE-2018-12130, CVE-2018-12127
> and CVE-2019-11091.

Just to reiterate, these fixes are in 4.9.177 but the mitigations
generally require cooperation from Xen, and Xen 4.6 builds from CentOS
do not contain fixes for disclosed issues from May 2018 onward.

For proper mitigation, you need to upgrade to Xen 4.8 or newer, but I would
suggest considering 4.10 if possible (or 4.12 if you upgrade to CentOS 7)
to give yourself more headroom before EOL, as well as a number of
performance improvements, both related to the mitigations and otherwise.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net


Re: [CentOS-virt] Xen-kernel: Update to 4.14 or 4.19?

2019-03-07 Thread Kevin Stange
On 3/7/19 12:55 PM, Karl Johnson wrote:
> 
> 
> On Thu, Mar 7, 2019 at 1:42 PM Sarah Newman <s...@prgmr.com> wrote:
> 
> On 3/7/19 10:30 AM, Akemi Yagi wrote:
> > On Thu, Mar 7, 2019 at 9:42 AM George Dunlap <dunl...@umich.edu> wrote:
> >>
> >> Hey all,
> >>
> >> We've been on 4.9 for some time now, and while it's still supported, I
> >> think it's time to start thinking about upgrading, and I'd like input
> >> from the community about which version to move up to.
> >>
> >> 4.19 has been out for almost 5 months now.  It will include PVH domU
> >> support, and PVH dom0 support in what _is believed_ to be the final
> >> form; so when the Virt SIG moves to a version of Xen that supports PVH
> >> dom0, the kernel will already be in place with no need to upgrade.
> >>
> >> The other option would be to move to 4.14: Probably more stable (as
> >> it's been out for over a year now), but doesn't have either PVH domU
> >> or PVH dom0 support.
> >>
> >> I'd suggest 4.19. Any other opinions?
> >>
> >>  -George
> >
> > You may also want to consider each version's EOL:
> >
> > 4.9    Jan, 2023
> > 4.14   Jan, 2020
> > 4.19   Dec, 2020
> 
> Regardless of EOL date, I think it's worth trying to upgrade when
> Xen has stable PVH dom0 support.
> 
> I am pretty sure historically that there have been difficulties
> backporting some of the side channel mitigations, as they can be quite
> invasive. That may be another reason to upgrade sooner rather than later.
> 
> --Sarah
> 
> 
> +1 for 4.19. However, this version requires a recent GCC, so it won't
> build, at least for el6, on the CBS. We would have to build it with a
> recent GCC from devtoolset, as I do in my pull request (gcc 7.3.1).
> 
> Karl

I am +1 for 4.19 as well, and I agree with Sarah's reasoning that we'll
want stable PVH dom0 support as soon as it's reasonable.  However, I had
serious stability issues with 3.18 in the past, and I would want to keep a
major kernel bump in the testing repo for 3-6 months before moving it to
release.  I will do as much testing as I can during that time to establish
stability on my side.

It might make sense to bump only EL7 to 4.19, avoiding the complications
related to devtoolset on EL6.  4.9 lasts the entire remaining lifetime of
EL6 but will come up slightly short of EL7's.  However, that means
periodically bumping two divergent kernels, one for each set of repos.

Based on recent history (4.4, 4.9) we can probably expect both 4.14 and
4.19 to become six-year kernels extending to Jan 2024 and Dec 2024,
respectively, though GKH seems to wait until close to the original EOL to
announce these decisions.  We can also likely expect a kernel like 5.3 to
become longterm around the end of 2019.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net


Re: [CentOS-virt] xen 4.11

2019-01-07 Thread Kevin Stange
On 1/4/19 10:24 AM, Christoph wrote:
> Hi
> 
> is there a reason why xen 4.11 isn't released for centos7 (I can't find it
> on http://mirror.centos.org/centos/7/virt/)?

The CentOS Virt SIG has a policy of releasing only even-numbered Xen
releases, so the next one to be published will be Xen 4.12.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net


Re: [CentOS-virt] how to update ucode with xen

2018-09-19 Thread Kevin Stange
On 9/19/18 1:27 PM, Christoph wrote:
> it is working, thanks a lot...
> 
> but it has included the GenuineIntel.bin only in the currently used
> kernel's initramfs... do I need to reinstall microcode_ctl every time I
> update the kernel?  And second question: I guess I still need ucode=scan
> as a Xen parameter, right?

You won't need to reinstall microcode_ctl.  Once you create the file at
/etc/microcode_ctl/ucode_with_caveats/force, any time you update the kernel
or microcode_ctl the microcode will be put into the initramfs automatically.

You do need to keep ucode=scan on the Xen command line, because that is how
Xen knows to scan the initramfs for the microcode.
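
For reference, on a grub2 host that just means leaving it in the Xen entry
of your grub defaults and regenerating grub.cfg afterward; something like
this, where "..." stands for your existing options:

# in /etc/default/grub
GRUB_CMDLINE_XEN_DEFAULT="... ucode=scan"

grub2-mkconfig -o /boot/grub2/grub.cfg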

> 
> On 2018-09-19 20:08, Kevin Stange wrote:
>> On 9/19/18 1:55 AM, Christoph wrote:
>>>
>>> Hi
>>>
>>> can someone tell me how to update the µcode of the CPU with Xen?
>>>
>>> I have added the ucode=scan parameter to xen but it does not seem to
>>> work...
>>>
>>> the µcode version of my xeon is really old :/
>>>
>>> model name    : Intel(R) Xeon(R) CPU E3-1271 v3 @ 3.60GHz
>>> microcode    : 0x10
>>>
>>
>> There is a "caveat" in the current version of microcode_ctl which means
>> it doesn't automatically install the microcode into the initramfs if the
>> kernel isn't "known good" because of various issues with Linux kernel
>> patches being needed for certain microcode features.  There is a quick
>> way to get it to force the microcode into the initramfs of any kernel:
>>
>> mkdir -p /etc/microcode_ctl/ucode_with_caveats/
>> touch /etc/microcode_ctl/ucode_with_caveats/force
>>
>> This only works with the most recent version of microcode_ctl
>> (2.1-29.16.el7_5.x86_64).  If you do this, you can then run 'yum
>> reinstall microcode_ctl' and you should get the microcode in the
>> initramfs after it finishes.  Note that this will cause it to trust ALL
>> kernels and all microcode versions which might not always be a good
>> thing.  See this file for info:
>>
>> /usr/share/doc/microcode_ctl/README.caveats
>>
>> You can test that the initramfs has the microcode by running cpio:
>>
>> cpio -t < /boot/initramfs-4.9.112-32.el7.x86_64.img
>>
>> If there is a GenuineIntel.bin you should be good.  If you get spammed
>> with errors, then it isn't included.
> 


-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net


Re: [CentOS-virt] how to update ucode with xen

2018-09-19 Thread Kevin Stange
On 9/19/18 1:55 AM, Christoph wrote:
> 
> Hi
> 
> can someone tell me how to update the µcode of the CPU with Xen?
> 
> I have added the ucode=scan parameter to xen but it does not seem to
> work...
> 
> the µcode version of my xeon is really old :/
> 
> model name    : Intel(R) Xeon(R) CPU E3-1271 v3 @ 3.60GHz
> microcode    : 0x10
> 

There is a "caveat" in the current version of microcode_ctl which means
it doesn't automatically install the microcode into the initramfs if the
kernel isn't "known good" because of various issues with Linux kernel
patches being needed for certain microcode features.  There is a quick
way to get it to force the microcode into the initramfs of any kernel:

mkdir -p /etc/microcode_ctl/ucode_with_caveats/
touch /etc/microcode_ctl/ucode_with_caveats/force

This only works with the most recent version of microcode_ctl
(2.1-29.16.el7_5.x86_64).  If you do this, you can then run 'yum
reinstall microcode_ctl' and you should get the microcode in the
initramfs after it finishes.  Note that this will cause it to trust ALL
kernels and all microcode versions which might not always be a good
thing.  See this file for info:

/usr/share/doc/microcode_ctl/README.caveats

You can test that the initramfs has the microcode by running cpio:

cpio -t < /boot/initramfs-4.9.112-32.el7.x86_64.img

If there is a GenuineIntel.bin you should be good.  If you get spammed
with errors, then it isn't included.
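
A variant that targets the running kernel and hides the noise from the
trailing compressed archive:

cpio -t < /boot/initramfs-$(uname -r).img 2>/dev/null | grep GenuineIntel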

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net


Re: [CentOS-virt] libvirt and libvirt-daemon-xen: failing dependencies

2018-05-17 Thread Kevin Stange
On 05/17/2018 01:57 PM, Frank Sauerburger wrote:
> Hi all,
> 
> I'm trying to install libvirt for xen on a brand new, minimal
> installation of CentOS 7.5.1804. After installing the OS, I did a 'yum
> update' and followed the basic how-tos at
> 
> https://wiki.centos.org/HowTos/Xen/Xen4QuickStart
> 
> and
> 
> https://wiki.centos.org/HowTos/Xen/Xen4QuickStart/Xen4Libvirt
> 
> From previous experience, I know that the above steps worked fine.
> However, now on CentOS 7.5, I am seeing dependency resolution fail.
> 
> $ yum install libvirt libvirt-daemon-xen
> [ ... ]
> Error: Package: libvirt-daemon-xen-3.2.1-402.el7.x86_64 (centos-virt-xen-46)
>        Requires: libvirt-daemon-driver-nwfilter = 3.2.1-402.el7
>        Available: libvirt-daemon-driver-nwfilter-3.2.1-402.el7.x86_64 (centos-virt-xen-46)
>            libvirt-daemon-driver-nwfilter = 3.2.1-402.el7
>        Available: libvirt-daemon-driver-nwfilter-3.9.0-14.el7.x86_64 (base)
>            libvirt-daemon-driver-nwfilter = 3.9.0-14.el7
>        Installing: libvirt-daemon-driver-nwfilter-3.9.0-14.el7_5.2.x86_64 (updates)
>            libvirt-daemon-driver-nwfilter = 3.9.0-14.el7_5.2
> Error: Package: libvirt-daemon-driver-libxl-3.2.1-402.el7.x86_64 (centos-virt-xen-46)
>        Requires: libvirt-daemon = 3.2.1-402.el7
>        Available: libvirt-daemon-3.2.1-402.el7.x86_64 (centos-virt-xen-46)
>            libvirt-daemon = 3.2.1-402.el7
>        Available: libvirt-daemon-3.9.0-14.el7.x86_64 (base)
>            libvirt-daemon = 3.9.0-14.el7
>        Installing: libvirt-daemon-3.9.0-14.el7_5.2.x86_64 (updates)
>            libvirt-daemon = 3.9.0-14.el7_5.2
> 
> [ ... similar errors ... ]
>  You could try using --skip-broken to work around the problem
>  You could try running: rpm -Va --nofiles --nodigest
> 
> I've never seen an error like this before. As far as I read it,
> libvirt-daemon-xen wants version 3.2.1 of its dependencies. I've checked
> the repositories and it seems that in 'base' and 'updates' we have
> version 3.9.0 and in the 'virt' repo we have 3.2.1 of libvirt. Can I
> solve this issue on my end? Should I force yum to install the packages
> from the 'virt' repo?

Red Hat upgraded the libvirt provided in 7.5 from 3.2.0 to 3.9.0, but
they don't provide a Xen driver, which is the reason you are seeing the
dependency issue.

A 4.1.0 release of libvirt is currently in testing, along with other
updates to deal with CentOS 7.5 changes.  You also need to exclude the
seabios updates because HVM guests cannot boot using the new Red Hat
version.

The best workaround for now is probably either to set up
yum-plugin-priorities and set a high priority on the centos-virt-xen*
repo, or to add exclude=libvirt* seabios* to your base and updates repos.
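
A minimal sketch of the exclude approach; the section names below match the
stock CentOS-Base.repo, trimmed here to the relevant lines:

# in /etc/yum.repos.d/CentOS-Base.repo
[base]
...
exclude=libvirt* seabios*

[updates]
...
exclude=libvirt* seabios*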

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net


Re: [CentOS-virt] Xen 4.6.6-9 (with XPTI meltdown mitigation) packages making their way to centos-virt-xen-testing

2018-01-23 Thread Kevin Stange
On 01/23/2018 05:57 PM, Karl Johnson wrote:
> 
> 
> On Tue, Jan 23, 2018 at 4:50 PM, Nathan March <nat...@gt.net> wrote:
> 
> Hi,
> 
> > Hmm.. isn't this the ldisc bug that was discussed a few months ago on
> > this list, and a patch was applied to the virt-sig kernel as well?
> >
> > Call trace looks similar..
> 
> Good memory! I'd forgotten about that despite being the one who ran
> into it.
> 
> Looks like that patch was just removed in 4.9.75-30, which I just upgraded
> this system to: http://cbs.centos.org/koji/buildinfo?buildID=21122
> Previously I was on 4.9.63-29, which does not have this problem and does
> have the ldisc patch. So I guess the question for Johnny is: why was it
> removed?
> 
> In the meantime, I'll revert the kernel and follow up if I see any further
> problems.
> 
> 
> IIRC the patch has been removed from the spec file because it has been
> merged upstream in 4.9.71.

The IRC discussion I found in my log indicates that it was removed because
it didn't apply cleanly against the changes in the 4.9.75 update, but I
don't think anyone independently validated that those changes are
equivalent to the patch that was removed.  I was never able to reproduce
this issue, so I didn't investigate it myself.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net


Re: [CentOS-virt] Xen 4.4 Immediate EOL

2018-01-19 Thread Kevin Stange
On 01/19/2018 06:17 AM, Pasi Kärkkäinen wrote:
> On Thu, Jan 18, 2018 at 11:48:35AM -0600, Kevin Stange wrote:
>> Hi,
>>
> 
> Hi,
>  
>> I am very sorry to do this on short notice, but obviously Meltdown and
>> Spectre are a lot more than anyone was really expecting to come down the
>> pipeline.  Xen 4.4 has been EOL upstream for about a year now and I have
>> personally been reviewing and backporting patches based on the 4.5
>> versions made available upstream.
>>
>> Given that 4.5 is now also reaching EOL, backporting to 4.4 will become
>> harder and I've already taken steps to vacate 4.4 in my own environment
>> ASAP.  Spectre and Meltdown patches most likely will only officially
>> reach 4.6 and are very complicated.  Ultimately, I don't think this is a
>> constructive use of my time.  Therefore, I will NOT be continuing to
>> provide updated Xen 4.4 builds any longer through CentOS Virt SIG.  If
>> someone else would like to take on the job, you're welcome to try.  Pop
>> by #centos-virt on Freenode to talk to us there if you're interested.
>>
>> For short term mitigation of the Meltdown issue on 4.4 with PV domains,
>> your best bet is probably to use the "Vixen" shim solution, which George
>> has put into the xen-44 package repository per his email from two days
>> ago. Vixen allows you to run PV domains inside HVM guest containers.  It
>> does not protect the guest from itself, but protects the domains from
>> each other.  Long term, your best bet is to try to get up to a new
>> version of Xen that is under upstream security support, probably 4.8.
> 
> The Oracle VM 3.4 product is based on Xen 4.4, and they seem to have
> backported the fixes already..
> 
> It looks like those src.rpms have {CVE-2017-5753} {CVE-2017-5715} 
> {CVE-2017-5754} fixes included.
> 
> https://oss.oracle.com/pipermail/oraclevm-errata/2018-January/thread.html
> https://oss.oracle.com/pipermail/oraclevm-errata/2018-January/000816.html
> https://oss.oracle.com/pipermail/oraclevm-errata/2018-January/000817.html
> 
> http://oss.oracle.com/oraclevm/server/3.4/SRPMS-updates/xen-4.4.4-155.0.12.el6.src.rpm
> http://oss.oracle.com/oraclevm/server/3.4/SRPMS-updates/xen-4.4.4-105.0.30.el6.src.rpm

That's impressive but dubious, as Xen has not yet released fixes for
CVE-2017-5753 or CVE-2017-5715 even for 4.10.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net


Re: [CentOS-virt] Xen 4.4 Immediate EOL

2018-01-18 Thread Kevin Stange
On 01/18/2018 11:48 AM, Kevin Stange wrote:
> Hi,
> 
> I am very sorry to do this on short notice, but obviously Meltdown and
> Spectre are a lot more than anyone was really expecting to come down the
> pipeline.  Xen 4.4 has been EOL upstream for about a year now and I have
> personally been reviewing and backporting patches based on the 4.5
> versions made available upstream.
> 
> Given that 4.5 is now also reaching EOL, backporting to 4.4 will become
> harder and I've already taken steps to vacate 4.4 in my own environment
> ASAP.  Spectre and Meltdown patches most likely will only officially
> reach 4.6 and are very complicated.  Ultimately, I don't think this is a
> constructive use of my time.  Therefore, I will NOT be continuing to
> provide updated Xen 4.4 builds any longer through CentOS Virt SIG.  If
> someone else would like to take on the job, you're welcome to try.  Pop
> by #centos-virt on Freenode to talk to us there if you're interested.
> 
> For short term mitigation of the Meltdown issue on 4.4 with PV domains,
> your best bet is probably to use the "Vixen" shim solution, which George
> has put into the xen-44 package repository per his email from two days
> ago. Vixen allows you to run PV domains inside HVM guest containers.  It
> does not protect the guest from itself, but protects the domains from
> each other.  Long term, your best bet is to try to get up to a new
> version of Xen that is under upstream security support, probably 4.8.

Apparently I failed to do proper due diligence before making this
recommendation.  The Xen 4.4 repo does not have a Vixen build because of a
dependency on grub2, which isn't available under CentOS 6.  Your best bet
would still be to use Vixen for PV domains, so if that's something you want
to do, we need some volunteers to help with packaging and testing.
Otherwise, use HVM domains or upgrade to a newer version of Xen.  Sorry for
this error on my part.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net


[CentOS-virt] Xen 4.4 Immediate EOL

2018-01-18 Thread Kevin Stange
Hi,

I am very sorry to do this on short notice, but obviously Meltdown and
Spectre are a lot more than anyone was really expecting to come down the
pipeline.  Xen 4.4 has been EOL upstream for about a year now and I have
personally been reviewing and backporting patches based on the 4.5
versions made available upstream.

Given that 4.5 is now also reaching EOL, backporting to 4.4 will become
harder and I've already taken steps to vacate 4.4 in my own environment
ASAP.  Spectre and Meltdown patches most likely will only officially
reach 4.6 and are very complicated.  Ultimately, I don't think this is a
constructive use of my time.  Therefore, I will NOT be continuing to
provide updated Xen 4.4 builds any longer through CentOS Virt SIG.  If
someone else would like to take on the job, you're welcome to try.  Pop
by #centos-virt on Freenode to talk to us there if you're interested.

For short term mitigation of the Meltdown issue on 4.4 with PV domains,
your best bet is probably to use the "Vixen" shim solution, which George
has put into the xen-44 package repository per his email from two days
ago. Vixen allows you to run PV domains inside HVM guest containers.  It
does not protect the guest from itself, but protects the domains from
each other.  Long term, your best bet is to try to get up to a new
version of Xen that is under upstream security support, probably 4.8.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net


[CentOS-virt] Xen-44 Package Updates for XSAs up to XSA-235

2017-09-14 Thread Kevin Stange
Hi all,

Sorry for running a bit behind on security patch releases for the Xen-44
branch.  As of yesterday, package version 4.4.4-28 was released for
testing, which includes all relevant XSA patches through XSA-235 here:

https://buildlogs.centos.org/centos/6/virt/x86_64/xen-44/

Please test and provide feedback if possible so we can get this package
moved to release fairly soon.
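
If you don't have the testing repo configured yet, a minimal repo file
pointing at that URL is enough.  This is illustrative only: the repo id is
arbitrary, and the buildlogs packages are unsigned, hence gpgcheck=0.

# /etc/yum.repos.d/virt-xen-44-testing.repo
[virt-xen-44-testing]
name=CentOS Virt SIG Xen 4.4 testing
baseurl=https://buildlogs.centos.org/centos/6/virt/x86_64/xen-44/
enabled=1
gpgcheck=0

After that, a plain "yum update xen" should offer 4.4.4-28.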

The release repo currently has 4.4.4-27, pushed last week, which contains
all relevant patches through XSA-230.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net


Re: [CentOS-virt] Xen CentOS 7.3 server + CentOS 7.3 VM fails to boot after CR updates (applied to VM)!

2017-09-07 Thread Kevin Stange
On 09/06/2017 05:21 PM, Kevin Stange wrote:
> On 09/06/2017 08:40 AM, Johnny Hughes wrote:
>> On 09/05/2017 02:26 PM, Kevin Stange wrote:
>>> On 09/04/2017 05:27 PM, Johnny Hughes wrote:
>>>> On 09/04/2017 03:59 PM, Kevin Stange wrote:
>>>>> On 09/02/2017 08:11 AM, Johnny Hughes wrote:
>>>>>> On 09/01/2017 02:41 PM, Kevin Stange wrote:
>>>>>>> On 08/31/2017 07:50 AM, PJ Welsh wrote:
>>>>>>>> A recently created and fully functional CentOS 7.3 VM fails to boot
>>>>>>>> after applying CR updates:
>>>>>>> 
>>>>>>>> Server OS is CentOS 7.3 using Xen (no CR updates):
>>>>>>>> rpm -qa xen\*
>>>>>>>> xen-hypervisor-4.6.3-15.el7.x86_64
>>>>>>>> xen-4.6.3-15.el7.x86_64
>>>>>>>> xen-licenses-4.6.3-15.el7.x86_64
>>>>>>>> xen-libs-4.6.3-15.el7.x86_64
>>>>>>>> xen-runtime-4.6.3-15.el7.x86_64
>>>>>>>>
>>>>>>>> uname -a
>>>>>>>> Linux tsxen2.xx.com 4.9.39-29.el7.x86_64 #1 SMP
>>>>>>>> Fri Jul 21 15:09:00 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>>>
>>>>>>>> Sadly, the other issue is that the grub menu will not display for me to
>>>>>>>> select another kernel to see if it is just a kernel issue.
>>>>>>>>
>>>>>>>> The dracut prompt does not show any /dev/disk folder either.
>>>>>>>>
>>>>>>>
>>>>>>> I'm seeing this as well.  My host is 4.9.44-29 and Xen 4.4.4-26 from
>>>>>>> testing repo, my guest is 3.10.0-693.1.1.  Guest boots fine with
>>>>>>> 514.26.2.  The kernel messages that appear to kick off the failure for
>>>>>>> me start with a page allocation failure.  It eventually reaches dracut
>>>>>>> failures due to systemd/udev not setting up properly, but I think the
>>>>>>> root is this:
>>>>>>>
> 
>>>>>>
>>>>>> Do any of you guys have access to RHEL to try the RHEL 7.4 Kernel?
>>>>>
>>>>> I think I may.  I haven't tried yet, but I'll see if I can get my hands
>>>>> on one and test it tomorrow when I'm back at the office tomorrow.
>>>>>
>>>>> RH closed my bug as "WONTFIX" so far, saying Red Hat Quality Engineering
>>>>> Management declined the request.  I started to look at the Red Hat
>>>>> source browser to see the list of patches from 693 to 514, but getting
>>>>> the full list seems impossible because the change log only goes back to
>>>>> 644 and there doesn't seem to be a way to obtain full builds of
>>>>> unreleased kernels.  Unless I'm mistaken.
>>>>>
>>>>> I will also do some digging via RH support if I can.
>>>>>
>>>> I would think that RH would want AWS support for RHEL 7.4 and I thought
>>>> AWS was run on Xen // Note:  I could be wrong about that.
>>>>
>>>> In any event, at the very least, we can make a kernel that boots PV for
>>>> 7.4 at some point.
>>>
>>> AWS does run on Xen, but the modifications they make to Xen are not
>>> known to me nor which version of Xen they use.  They may also run the
>>> domains as HVM, which seems to mitigate the issue here.
>>>
>>> I just verified this kernel issue exists on a RHEL 7.3 system image
>>> under the same conditions, when it's updated to RHEL 7.4 and kernel
>>> 3.10.0-693.2.1.el7.x86_64.
>>>
>>
>> One other option is to run the DomU's as PVHVM:
>> https://wiki.xen.org/wiki/Xen_Linux_PV_on_HVM_drivers
>>
>> That should be much better performance than HVM and may be a workable
>> solution for people who don't want to modify their VM kernel.
>>
>> Here is more info on PVHVM:
>> https://wiki.xen.org/wiki/PV_on_HVM
>>
>> 
>> Also heard from someone to try this Config file change to the base
>> kernel and rebuild:
>>
>> CONFIG_RANDOMIZE_BASE=n
> 
> This suggestion was mirrored in the RH bugzilla as well; it worked, but
> the same issue does not exist in newer kernels that have the option
> enabled.  I've posted updated findings in the CentOS bug, including a
> patch that I found which seems to fix the issue:
> 
> https://bugs.centos.org/view.php?id=13763#c30014

With many thanks to hughesjr and toracat, I was able to find a patch
that seems to resolve this issue and get it into CentOS Plus
3.10.0-693.2.1.  I've asked Red Hat to apply it to some future kernel
update, but that is only a dream for now.

In the meantime, if anyone who has been experiencing the issue with PV
domains can try out the CentOS Plus kernel here and provide feedback,
I'd appreciate it!

https://buildlogs.centos.org/c7-plus/kernel-plus/20170907163005/3.10.0-693.2.1.el7.centos.plus.x86_64/
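
On CentOS 7, yum can install straight from a URL in that directory.  The
exact filename below is illustrative; take the real name from the directory
listing:

# filename illustrative - check the directory listing for the real one
yum install https://buildlogs.centos.org/c7-plus/kernel-plus/20170907163005/3.10.0-693.2.1.el7.centos.plus.x86_64/kernel-plus-3.10.0-693.2.1.el7.centos.plus.x86_64.rpm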

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net


Re: [CentOS-virt] Xen CentOS 7.3 server + CentOS 7.3 VM fails to boot after CR updates (applied to VM)!

2017-09-06 Thread Kevin Stange
On 09/06/2017 08:40 AM, Johnny Hughes wrote:
> On 09/05/2017 02:26 PM, Kevin Stange wrote:
>> On 09/04/2017 05:27 PM, Johnny Hughes wrote:
>>> On 09/04/2017 03:59 PM, Kevin Stange wrote:
>>>> On 09/02/2017 08:11 AM, Johnny Hughes wrote:
>>>>> On 09/01/2017 02:41 PM, Kevin Stange wrote:
>>>>>> On 08/31/2017 07:50 AM, PJ Welsh wrote:
>>>>>>> A recently created and fully functional CentOS 7.3 VM fails to boot
>>>>>>> after applying CR updates:
>>>>>> 
>>>>>>> Server OS is CentOS 7.3 using Xen (no CR updates):
>>>>>>> rpm -qa xen\*
>>>>>>> xen-hypervisor-4.6.3-15.el7.x86_64
>>>>>>> xen-4.6.3-15.el7.x86_64
>>>>>>> xen-licenses-4.6.3-15.el7.x86_64
>>>>>>> xen-libs-4.6.3-15.el7.x86_64
>>>>>>> xen-runtime-4.6.3-15.el7.x86_64
>>>>>>>
>>>>>>> uname -a
>>>>>>> Linux tsxen2.xx.com 4.9.39-29.el7.x86_64 #1 SMP
>>>>>>> Fri Jul 21 15:09:00 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>>
>>>>>>> Sadly, the other issue is that the grub menu will not display for me to
>>>>>>> select another kernel to see if it is just a kernel issue.
>>>>>>>
>>>>>>> The dracut prompt does not show any /dev/disk folder either.
>>>>>>>
>>>>>>
>>>>>> I'm seeing this as well.  My host is 4.9.44-29 and Xen 4.4.4-26 from
>>>>>> testing repo, my guest is 3.10.0-693.1.1.  Guest boots fine with
>>>>>> 514.26.2.  The kernel messages that appear to kick off the failure for
>>>>>> me start with a page allocation failure.  It eventually reaches dracut
>>>>>> failures due to systemd/udev not setting up properly, but I think the
>>>>>> root is this:
>>>>>>

>>>>>
>>>>> Do any of you guys have access to RHEL to try the RHEL 7.4 Kernel?
>>>>
>>>> I think I may.  I haven't tried yet, but I'll see if I can get my hands
>>>> on one and test it tomorrow when I'm back at the office tomorrow.
>>>>
>>>> RH closed my bug as "WONTFIX" so far, saying Red Hat Quality Engineering
>>>> Management declined the request.  I started to look at the Red Hat
>>>> source browser to see the list of patches from 693 to 514, but getting
>>>> the full list seems impossible because the change log only goes back to
>>>> 644 and there doesn't seem to be a way to obtain full builds of
>>>> unreleased kernels.  Unless I'm mistaken.
>>>>
>>>> I will also do some digging via RH support if I can.
>>>>
>>> I would think that RH would want AWS support for RHEL 7.4 and I thought
>>> AWS was run on Xen // Note:  I could be wrong about that.
>>>
>>> In any event, at the very least, we can make a kernel that boots PV for
>>> 7.4 at some point.
>>
>> AWS does run on Xen, but the modifications they make to Xen are not
>> known to me nor which version of Xen they use.  They may also run the
>> domains as HVM, which seems to mitigate the issue here.
>>
>> I just verified this kernel issue exists on a RHEL 7.3 system image
>> under the same conditions, when it's updated to RHEL 7.4 and kernel
>> 3.10.0-693.2.1.el7.x86_64.
>>
> 
> One other option is to run the DomU's as PVHVM:
> https://wiki.xen.org/wiki/Xen_Linux_PV_on_HVM_drivers
> 
> That should be much better performance than HVM and may be a workable
> solution for people who don't want to modify their VM kernel.
> 
> Here is more info on PVHVM:
> https://wiki.xen.org/wiki/PV_on_HVM
> 
> 
> Also heard from someone to try this Config file change to the base
> kernel and rebuild:
> 
> CONFIG_RANDOMIZE_BASE=n

This suggestion was mirrored in the RH bugzilla as well; it worked, but
the same issue does not exist in newer kernels that have the option
enabled.  I've posted updated findings in the CentOS bug, including a
patch that I found which seems to fix the issue:

https://bugs.centos.org/view.php?id=13763#c30014

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net


Re: [CentOS-virt] Xen CentOS 7.3 server + CentOS 7.3 VM fails to boot after CR updates (applied to VM)!

2017-09-05 Thread Kevin Stange
On 09/04/2017 05:27 PM, Johnny Hughes wrote:
> On 09/04/2017 03:59 PM, Kevin Stange wrote:
>> On 09/02/2017 08:11 AM, Johnny Hughes wrote:
>>> On 09/01/2017 02:41 PM, Kevin Stange wrote:
>>>> On 08/31/2017 07:50 AM, PJ Welsh wrote:
>>>>> A recently created and fully functional CentOS 7.3 VM fails to boot
>>>>> after applying CR updates:
>>>> 
>>>>> Server OS is CentOS 7.3 using Xen (no CR updates):
>>>>> rpm -qa xen\*
>>>>> xen-hypervisor-4.6.3-15.el7.x86_64
>>>>> xen-4.6.3-15.el7.x86_64
>>>>> xen-licenses-4.6.3-15.el7.x86_64
>>>>> xen-libs-4.6.3-15.el7.x86_64
>>>>> xen-runtime-4.6.3-15.el7.x86_64
>>>>>
>>>>> uname -a
>>>>> Linux tsxen2.xx.com 4.9.39-29.el7.x86_64 #1 SMP
>>>>> Fri Jul 21 15:09:00 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
>>>>>
>>>>> Sadly, the other issue is that the grub menu will not display for me to
>>>>> select another kernel to see if it is just a kernel issue.
>>>>>
>>>>> The dracut prompt does not show any /dev/disk folder either.
>>>>>
>>>>
>>>> I'm seeing this as well.  My host is 4.9.44-29 and Xen 4.4.4-26 from
>>>> testing repo, my guest is 3.10.0-693.1.1.  Guest boots fine with
>>>> 514.26.2.  The kernel messages that appear to kick off the failure for
>>>> me start with a page allocation failure.  It eventually reaches dracut
>>>> failures due to systemd/udev not setting up properly, but I think the
>>>> root is this:
>>>>
>>>> [1.970630] [ cut here ]
>>>> [1.970651] WARNING: CPU: 2 PID: 225 at mm/vmalloc.c:131
>>>> vmap_page_range_noflush+0x2c1/0x350
>>>> [1.970660] Modules linked in:
>>>> [1.970668] CPU: 2 PID: 225 Comm: systemd-udevd Not tainted
>>>> 3.10.0-693.1.1.el7.x86_64 #1
>>>> [1.970677]   8cddc75d 8803e8587bd8
>>>> 816a3d91
>>>> [1.970688]  8803e8587c18 810879c8 0083811c14e8
>>>> 8800066eb000
>>>> [1.970698]  0001 8803e86d6940 c000
>>>> 
>>>> [1.970708] Call Trace:
>>>> [1.970725]  [] dump_stack+0x19/0x1b
>>>> [1.970736]  [] __warn+0xd8/0x100
>>>> [1.970742]  [] warn_slowpath_null+0x1d/0x20
>>>> [1.970748]  [] vmap_page_range_noflush+0x2c1/0x350
>>>> [1.970758]  [] map_vm_area+0x2e/0x40
>>>> [1.970765]  [] __vmalloc_node_range+0x170/0x270
>>>> [1.970774]  [] ? module_alloc_update_bounds+0x14/0x70
>>>> [1.970781]  [] ? module_alloc_update_bounds+0x14/0x70
>>>> [1.970792]  [] module_alloc+0x73/0xd0
>>>> [1.970798]  [] ? module_alloc_update_bounds+0x14/0x70
>>>> [1.970804]  [] module_alloc_update_bounds+0x14/0x70
>>>> [1.970811]  [] load_module+0xb02/0x29e0
>>>> [1.970817]  [] ? vmap_page_range_noflush+0x257/0x350
>>>> [1.970823]  [] ? map_vm_area+0x2e/0x40
>>>> [1.970829]  [] ? __vmalloc_node_range+0x170/0x270
>>>> [1.970838]  [] ? SyS_init_module+0x99/0x110
>>>> [1.970846]  [] SyS_init_module+0xc5/0x110
>>>> [1.970856]  [] system_call_fastpath+0x16/0x1b
>>>> [1.970862] ---[ end trace 2117480876ed90d2 ]---
>>>> [1.970869] vmalloc: allocation failure, allocated 24576 of 28672 bytes
>>>> [1.970874] systemd-udevd: page allocation failure: order:0, mode:0xd2
>>>> [1.970883] CPU: 2 PID: 225 Comm: systemd-udevd Tainted: GW
>>>>   3.10.0-693.1.1.el7.x86_64 #1
>>>> [1.970894]  00d2 8cddc75d 8803e8587c48
>>>> 816a3d91
>>>> [1.970910]  8803e8587cd8 81188810 8190ea38
>>>> 8803e8587c68
>>>> [1.970923]  0018 8803e8587ce8 8803e8587c88
>>>> 8cddc75d
>>>> [1.970939] Call Trace:
>>>> [1.970946]  [] dump_stack+0x19/0x1b
>>>> [1.970961]  [] warn_alloc_failed+0x110/0x180
>>>> [1.970971]  [] __vmalloc_node_range+0x234/0x270
>>>> [1.970981]  [] ? module_alloc_update_bounds+0x14/0x70
>>>> [1.970989]  [] ? module_alloc_update_bounds+0x14/0x70
>>>> [1.970999]  [] module_all

Re: [CentOS-virt] Xen CentOS 7.3 server + CentOS 7.3 VM fails to boot after CR updates (applied to VM)!

2017-09-04 Thread Kevin Stange
On 09/02/2017 08:11 AM, Johnny Hughes wrote:
> On 09/01/2017 02:41 PM, Kevin Stange wrote:
>> On 08/31/2017 07:50 AM, PJ Welsh wrote:
>>> A recently created and fully functional CentOS 7.3 VM fails to boot
>>> after applying CR updates:
>> 
>>> Server OS is CentOS 7.3 using Xen (no CR updates):
>>> rpm -qa xen\*
>>> xen-hypervisor-4.6.3-15.el7.x86_64
>>> xen-4.6.3-15.el7.x86_64
>>> xen-licenses-4.6.3-15.el7.x86_64
>>> xen-libs-4.6.3-15.el7.x86_64
>>> xen-runtime-4.6.3-15.el7.x86_64
>>>
>>> uname -a
>>> Linux tsxen2.xx.com 4.9.39-29.el7.x86_64 #1 SMP
>>> Fri Jul 21 15:09:00 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> Sadly, the other issue is that the grub menu will not display for me to
>>> select another kernel to see if it is just a kernel issue.
>>>
>>> The dracut prompt does not show any /dev/disk folder either.
>>>
>>
>> I'm seeing this as well.  My host is 4.9.44-29 and Xen 4.4.4-26 from
>> testing repo, my guest is 3.10.0-693.1.1.  Guest boots fine with
>> 514.26.2.  The kernel messages that appear to kick off the failure for
>> me start with a page allocation failure.  It eventually reaches dracut
>> failures due to systemd/udev not setting up properly, but I think the
>> root is this:
>>
>> [1.970630] [ cut here ]
>> [1.970651] WARNING: CPU: 2 PID: 225 at mm/vmalloc.c:131
>> vmap_page_range_noflush+0x2c1/0x350
>> [1.970660] Modules linked in:
>> [1.970668] CPU: 2 PID: 225 Comm: systemd-udevd Not tainted
>> 3.10.0-693.1.1.el7.x86_64 #1
>> [1.970677]   8cddc75d 8803e8587bd8
>> 816a3d91
>> [1.970688]  8803e8587c18 810879c8 0083811c14e8
>> 8800066eb000
>> [1.970698]  0001 8803e86d6940 c000
>> 
>> [1.970708] Call Trace:
>> [1.970725]  [] dump_stack+0x19/0x1b
>> [1.970736]  [] __warn+0xd8/0x100
>> [1.970742]  [] warn_slowpath_null+0x1d/0x20
>> [1.970748]  [] vmap_page_range_noflush+0x2c1/0x350
>> [1.970758]  [] map_vm_area+0x2e/0x40
>> [1.970765]  [] __vmalloc_node_range+0x170/0x270
>> [1.970774]  [] ? module_alloc_update_bounds+0x14/0x70
>> [1.970781]  [] ? module_alloc_update_bounds+0x14/0x70
>> [1.970792]  [] module_alloc+0x73/0xd0
>> [1.970798]  [] ? module_alloc_update_bounds+0x14/0x70
>> [1.970804]  [] module_alloc_update_bounds+0x14/0x70
>> [1.970811]  [] load_module+0xb02/0x29e0
>> [1.970817]  [] ? vmap_page_range_noflush+0x257/0x350
>> [1.970823]  [] ? map_vm_area+0x2e/0x40
>> [1.970829]  [] ? __vmalloc_node_range+0x170/0x270
>> [1.970838]  [] ? SyS_init_module+0x99/0x110
>> [1.970846]  [] SyS_init_module+0xc5/0x110
>> [1.970856]  [] system_call_fastpath+0x16/0x1b
>> [1.970862] ---[ end trace 2117480876ed90d2 ]---
>> [1.970869] vmalloc: allocation failure, allocated 24576 of 28672 bytes
>> [1.970874] systemd-udevd: page allocation failure: order:0, mode:0xd2
>> [1.970883] CPU: 2 PID: 225 Comm: systemd-udevd Tainted: GW
>>   3.10.0-693.1.1.el7.x86_64 #1
>> [1.970894]  00d2 8cddc75d 8803e8587c48
>> 816a3d91
>> [1.970910]  8803e8587cd8 81188810 8190ea38
>> 8803e8587c68
>> [1.970923]  0018 8803e8587ce8 8803e8587c88
>> 8cddc75d
>> [1.970939] Call Trace:
>> [1.970946]  [] dump_stack+0x19/0x1b
>> [1.970961]  [] warn_alloc_failed+0x110/0x180
>> [1.970971]  [] __vmalloc_node_range+0x234/0x270
>> [1.970981]  [] ? module_alloc_update_bounds+0x14/0x70
>> [1.970989]  [] ? module_alloc_update_bounds+0x14/0x70
>> [1.970999]  [] module_alloc+0x73/0xd0
>> [1.971031]  [] ? module_alloc_update_bounds+0x14/0x70
>> [1.971038]  [] module_alloc_update_bounds+0x14/0x70
>> [1.971046]  [] load_module+0xb02/0x29e0
>> [1.971052]  [] ? vmap_page_range_noflush+0x257/0x350
>> [1.971061]  [] ? map_vm_area+0x2e/0x40
>> [1.971067]  [] ? __vmalloc_node_range+0x170/0x270
>> [1.971075]  [] ? SyS_init_module+0x99/0x110
>> [1.971081]  [] SyS_init_module+0xc5/0x110
>> [1.971088]  [] system_call_fastpath+0x16/0x1b
>> [1.971094] Mem-Info:
>> [1.971103] active_anon:875 inactive_anon:2049 isolated_anon:0
>> [1.971103]  active_file:791 inactive_file:8841 isolated_file:0
>> [  

Re: [CentOS-virt] Xen CentOS 7.3 server + CentOS 7.3 VM fails to boot after CR updates (applied to VM)!

2017-09-01 Thread Kevin Stange
On 09/01/2017 02:41 PM, Kevin Stange wrote:
> On 08/31/2017 07:50 AM, PJ Welsh wrote:
>> A recently created and fully functional CentOS 7.3 VM fails to boot
>> after applying CR updates:
> 
>> Server OS is CentOS 7.3 using Xen (no CR updates):
>> rpm -qa xen\*
>> xen-hypervisor-4.6.3-15.el7.x86_64
>> xen-4.6.3-15.el7.x86_64
>> xen-licenses-4.6.3-15.el7.x86_64
>> xen-libs-4.6.3-15.el7.x86_64
>> xen-runtime-4.6.3-15.el7.x86_64
>>
>> uname -a
>> Linux tsxen2.xx.com 4.9.39-29.el7.x86_64 #1 SMP
>> Fri Jul 21 15:09:00 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
>>
>> Sadly, the other issue is that the grub menu will not display for me to
>> select another kernel to see if it is just a kernel issue.
>>
>> The dracut prompt does not show any /dev/disk folder either.
>>
> 
> I'm seeing this as well.  My host is 4.9.44-29 and Xen 4.4.4-26 from
> testing repo, my guest is 3.10.0-693.1.1.  Guest boots fine with
> 514.26.2.  The kernel messages that appear to kick off the failure for
> me start with a page allocation failure.  It eventually reaches dracut
> failures due to systemd/udev not setting up properly, but I think the
> root is this:
> 

I created bugs for this issue (at least for the form of it that I'm able
to reproduce):

https://bugs.centos.org/view.php?id=0013763
https://bugzilla.redhat.com/show_bug.cgi?id=1487754

Please add any extra information you might have to hopefully increase
the chance the problem gets fixed.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net


Re: [CentOS-virt] Xen CentOS 7.3 server + CentOS 7.3 VM fails to boot after CR updates (applied to VM)!

2017-09-01 Thread Kevin Stange
e_file:0kB
inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:4177920kB managed:4162956kB mlocked:0kB dirty:0kB writeback:0kB
mapped:4kB shmem:1928kB slab_reclaimable:240kB slab_unreclaimable:504kB
kernel_stack:32kB pagetables:592kB unstable:0kB bounce:0kB
free_pcp:1760kB local_pcp:288kB free_cma:0kB writeback_tmp:0kB
pages_scanned:0 all_unreclaimable? no
[1.971264] lowmem_reserve[]: 0 0 11964 11964
[1.971273] Node 0 Normal free:12091564kB min:12088kB low:15108kB
high:18132kB active_anon:2352kB inactive_anon:6272kB active_file:3164kB
inactive_file:35364kB unevictable:0kB isolated(anon):0kB
isolated(file):0kB present:12591104kB managed:12251788kB mlocked:0kB
dirty:0kB writeback:0kB mapped:5852kB shmem:6284kB
slab_reclaimable:6688kB slab_unreclaimable:6012kB kernel_stack:880kB
pagetables:1328kB unstable:0kB bounce:0kB free_pcp:1196kB
local_pcp:152kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? no
[1.971309] lowmem_reserve[]: 0 0 0 0
[1.971316] Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 1*32kB (U) 2*64kB (U)
1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) =
15912kB
[1.971343] Node 0 DMA32: 7*4kB (M) 18*8kB (UM) 7*16kB (EM) 3*32kB
(EM) 1*64kB (E) 2*128kB (UM) 1*256kB (E) 4*512kB (UM) 4*1024kB (UEM)
4*2048kB (EM) 1011*4096kB (M) = 4156348kB
[1.971377] Node 0 Normal: 64*4kB (UEM) 10*8kB (UEM) 6*16kB (EM)
3*32kB (EM) 3*64kB (UE) 3*128kB (UEM) 1*256kB (E) 2*512kB (UE) 0*1024kB
1*2048kB (M) 2951*4096kB (M) = 12091728kB
[1.971413] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB
[1.971425] 11685 total pagecache pages
[1.971430] 0 pages in swap cache
[1.971437] Swap cache stats: add 0, delete 0, find 0/0
[1.971444] Free swap  = 0kB
[1.971451] Total swap = 0kB
[1.971456] 4196255 pages RAM
[1.971462] 0 pages HighMem/MovableOnly
[1.971467] 88591 pages reserved

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
___
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt


Re: [CentOS-virt] Status of reverted Linux patch "tty: Fix ldisc crash on reopened tty", Linux 4.9 kernel frequent crashes

2017-08-30 Thread Kevin Stange
On 08/30/2017 03:10 PM, Pasi Kärkkäinen wrote:
> Hello everyone,
> 
> Recently Nathan March reported on centos-virt list he's getting frequent 
> Linux kernel crashes with Linux 4.9 LTS kernel because of the missing patch 
> "tty: Fix ldisc crash on reopened tty".
> 
> The patch was already merged upstream here:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=71472fa9c52b1da27663c275d416d8654b905f05
> 
> but then reverted here:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=896d81fefe5d1919537db2c2150ab6384e4a6610
> 
> Nathan confirmed if he applies the patch from 
> 71472fa9c52b1da27663c275d416d8654b905f05 to his Linux 4.9 LTS kernel the 
> bug/problem goes away, so the patch (or similar fix) is still needed, at 
> least for 4.9 LTS kernel.
> 
> 
> Mikulas reported he's able to trigger the same crash on Linux 4.10:
> https://www.spinics.net/lists/kernel/msg2440637.html
> https://lists.gt.net/linux/kernel/2664604?search_string=ldisc%20reopened;#2664604
> 
> Michael Neuling reported he's able to trigger the bug on PowerPC:
> https://lkml.org/lkml/2017/3/10/1582
> 
> 
> So now the question is: is anyone currently working on getting this patch 
> fixed and applied upstream? I think one of the problems earlier was being 
> able to reliably reproduce the crash. Nathan says he's able to reproduce it 
> many times per week in his environment on x86_64.

I looked briefly at the patch and related discussion on the kernel
mailing lists and it seemed to be reverted not due to any problems it
caused with kernel behavior but rather due to concerns about
insufficient review before it was committed and possible merge conflicts.

The issue is that the problem doesn't appear to have been discussed any
further on the kernel mailing lists since April, and I'm not sure why.
My inclination would be to restart the discussion upstream and try to
get clarification as to why the patch has remained reverted and why
there's been no effort to bring it back into the kernel, rather than
assume the patch is safe to use.  I doubt anyone but the people
experiencing the issue have it on their radar.

The 4.9 virt kernel does currently carry some patches that haven't (yet)
been accepted upstream, so carrying this one is definitely an option
here.  As far as I know, nothing really gets pushed back upstream; most
often the patches are just plucked from upstream mailing lists before
they get merged into an official upstream release.
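
As an aside, for anyone who wants to check a particular tree: git can
answer which releases contain the fix and which contain the revert.  A
minimal sketch, run inside a kernel.org git checkout:

    git tag --contains 71472fa9c52b1da27663c275d416d8654b905f05  # the fix
    git tag --contains 896d81fefe5d1919537db2c2150ab6384e4a6610  # the revert

If a tag shows up in both lists, that release carries the revert and
thus still lacks the fix.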

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
___
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt


[CentOS-virt] 4.4.4-26 with XSA-226, 227, 230 in centos-virt-testing

2017-08-23 Thread Kevin Stange
Xen 4.4.4 along with kernel 4.9.44 containing patches for XSAs (226 -
230) from August 15th are now available in centos-virt-testing.  If
possible, please test and provide feedback here so we can move these to
release soon.

XSA-228 did not affect Xen 4.4
XSA-229 only applies to the kernel

XSA-235 disclosed today only affects ARM and isn't going to be added to
these packages.
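
If you want to pull these onto a test box, something along these lines
should do it (the exact repo id depends on your CentOS-Xen.repo, so
treat this as a sketch rather than gospel):

    yum --enablerepo=centos-virt-xen-44-testing update xen kernel
    rpm -q xen kernel   # expect xen-4.4.4-26 and kernel-4.9.44 before rebooting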

Thanks.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
___
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt


Re: [CentOS-virt] kernel-4.9.37-29.el7 (and el6)

2017-07-24 Thread Kevin Stange
On 07/20/2017 03:14 PM, Piotr Gackiewicz wrote:
> On Thu, 20 Jul 2017, Kevin Stange wrote:
> 
>> On 07/20/2017 05:31 AM, Piotr Gackiewicz wrote:
>>> On Wed, 19 Jul 2017, Johnny Hughes wrote:
>>>
>>>> On 07/19/2017 09:23 AM, Johnny Hughes wrote:
>>>>> On 07/19/2017 04:27 AM, Piotr Gackiewicz wrote:
>>>>>> On Mon, 17 Jul 2017, Johnny Hughes wrote:
>>>>>>
>>>>>>> Are the testing kernels (kernel-4.9.37-29.el7 and
>>>>>>> kernel-4.9.37-29.el6,
>>>>>>> with the one config file change) working for everyone:
>>>>>>>
>>>>>>> (turn off: CONFIG_IO_STRICT_DEVMEM)
>>>>>>
>>>>>> Hello.
>>>>>> Maybe it's not the most appropriate thread or time, but I have been
>>>>>> signalling it before:
>>>>>>
>>>>>> 4.9.* kernels do not work well for me any more (and for other people
>>>>>> neither, as I know). Last stable kernel was 4.9.13-22.
>>>
>>> I think I have nailed down the faulty combo.
>>> My tests showed that the SLUB allocator does not work well in a Xen
>>> dom0, on top of the Xen hypervisor.
>>> It does not work on at least one of my testing servers (old AMD K8,
>>> 1 proc, 1 core, only 1 paravirt guest).
>>> If a kernel with SLUB is booted bare (w/o the Xen hypervisor), it
>>> works well.  If booted as the dom0 kernel under the Xen hypervisor,
>>> it almost instantly gets a page allocation failure.
>>>
>>>
>>> SLAB => SLUB was changed in the kernel config starting from 4.9.25.
>>> Then problems started to explode in my production environment, and
>>> on the testing server mentioned above.
>>>
>>> After recompiling recent 4.9.34 with SLAB, everything works well on
>>> that testing machine.
>>> I will try to test 4.9.38 with the same config on my production
>>> servers.
>>
>> I was having page allocation failures on 4.9.25 with SLUB, but these
>> problems seem to be gone with 4.9.34 (still with SLUB).   Have you
>> checked this build?  It was moved to the stable repo on July 4th.
> 
> Yes, 4.9.34 was failing too. And this was actually the worst case, with
> I/O errors on the guest:

I did find one server running 4.9.34 that was still throwing SLUB page
allocation errors, but oddly, the only servers ever to have this issue
for me are spares that are running no domains.  I've just tried booting
that box up on 4.9.39, but I may not know if the switch back to SLAB
fixes anything for several weeks.

Otherwise, the other server I'm running 4.9.39 on for the past 72 hours
has been stable with running domains.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
___
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt


Re: [CentOS-virt] kernel-4.9.37-29.el7 (and el6)

2017-07-20 Thread Kevin Stange
On 07/20/2017 05:31 AM, Piotr Gackiewicz wrote:
> On Wed, 19 Jul 2017, Johnny Hughes wrote:
> 
>> On 07/19/2017 09:23 AM, Johnny Hughes wrote:
>>> On 07/19/2017 04:27 AM, Piotr Gackiewicz wrote:
>>>> On Mon, 17 Jul 2017, Johnny Hughes wrote:
>>>>
>>>>> Are the testing kernels (kernel-4.9.37-29.el7 and
>>>>> kernel-4.9.37-29.el6,
>>>>> with the one config file change) working for everyone:
>>>>>
>>>>> (turn off: CONFIG_IO_STRICT_DEVMEM)
>>>>
>>>> Hello.
>>>> Maybe it's not the most appropriate thread or time, but I have been
>>>> signalling it before:
>>>>
>>>> 4.9.* kernels do not work well for me any more (and for other people
>>>> neither, as I know). Last stable kernel was 4.9.13-22.
> 
> I think I have nailed down the faulty combo.
> My tests showed that the SLUB allocator does not work well in a Xen
> dom0, on top of the Xen hypervisor.
> It does not work on at least one of my testing servers (old AMD K8,
> 1 proc, 1 core, only 1 paravirt guest).
> If a kernel with SLUB is booted bare (w/o the Xen hypervisor), it
> works well.  If booted as the dom0 kernel under the Xen hypervisor,
> it almost instantly gets a page allocation failure.
> 
> 
> SLAB => SLUB was changed in the kernel config starting from 4.9.25.
> Then problems started to explode in my production environment, and
> on the testing server mentioned above.
> 
> After recompiling recent 4.9.34 with SLAB, everything works well on
> that testing machine.
> I will try to test 4.9.38 with the same config on my production
> servers.

I was having page allocation failures on 4.9.25 with SLUB, but these
problems seem to be gone with 4.9.34 (still with SLUB).   Have you
checked this build?  It was moved to the stable repo on July 4th.

config-4.9.25-27.el6.x86_64:CONFIG_SLUB=y
config-4.9.34-29.el6.x86_64:CONFIG_SLUB=y
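
As a runtime cross-check on a booted kernel: only SLUB populates
/sys/kernel/slab, so a one-liner like this tells you which allocator is
live without digging up the config file:

    test -d /sys/kernel/slab && echo SLUB || echo "SLAB (or SLOB)"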

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
___
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt


Re: [CentOS-virt] kernel-4.9.37-29.el7 (and el6)

2017-07-17 Thread Kevin Stange
On 07/17/2017 09:47 AM, Kristián Feldsam wrote:
> Hello, is this kernel usable also for KVM, or is it only for Xen?

This kernel is intended for the Xen repos.  None of us are testing it
with KVM to my knowledge, but it may work.  The KVM-related virt SIG
repos don't include a custom kernel.

This kernel tracks an upstream LTS kernel and is built for Xen-specific
functionality.  Personally, I would stick with the base CentOS kernels,
as they're intended to run KVM and are maintained longer than upstream
LTS kernels.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
___
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt


Re: [CentOS-virt] Xen 4.6.3-15 packages, including XSAs 216-219, 221-225 on their way through the build system

2017-06-26 Thread Kevin Stange
On 06/26/2017 06:59 AM, Giuseppe Tanzilli - Serverplan wrote:
> Hi,
> will that kernel fix be released in the 6.x repo also?
> I see it only in the 7.x repo: kernel-4.9.31-27.el7.x86_64.rpm

kernel-4.9.34-28 will fix XSA-216 and CVE-2017-1000364.  It's in the
testing repo right now.

https://buildlogs.centos.org/centos/7/virt/x86_64/xen-46/
https://buildlogs.centos.org/centos/6/virt/x86_64/xen-44/
https://buildlogs.centos.org/centos/6/virt/x86_64/xen-46/

If you have an opportunity to test it and check for issues, it would be
appreciated.
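
A minimal way to grab it from testing and confirm what you got (repo id
assumed from the usual CentOS-Xen.repo naming; adjust for your setup):

    yum --enablerepo=centos-virt-xen-testing update kernel
    rpm -q kernel       # expect kernel-4.9.34-28 before rebooting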

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
___
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt


Re: [CentOS-virt] Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.

2017-04-21 Thread Kevin Stange
c3 0f 0b <0f> 0b 0f b7 8b d4 00 00 00 89 c2 44 89 e6 48 c7 c7 90
> d3 ca 81
> [7.024976] RIP  []
> identify_secondary_cpu+0x57/0x80
> [7.031528]  RSP 
> [7.035032] ---[ end trace f2a8d75941398d9f ]---
> [7.039658] Kernel panic - not syncing: Attempted to kill the
> idle task!
> 
> So...other than my work around...that still works...not sure what
> else I can provide in the way of feedback/testing. But if you want
> anything else gathered, let me know.
> 
> Thanks,
> -Dave
> 
> --
> Dave Anderson
> 
> 
> > On Apr 19, 2017, at 10:33 AM, Johnny Hughes <joh...@centos.org> wrote:
> >
> > On 04/19/2017 12:18 PM, PJ Welsh wrote:
> >>
> >> On Wed, Apr 19, 2017 at 5:40 AM, Johnny Hughes <joh...@centos.org> wrote:
> >>
> >>On 04/18/2017 12:39 PM, PJ Welsh wrote:
> >>> Here is something interesting... I went through the BIOS options and
> >>> found that one R710 that *is* functioning only differed in that
> "Logical
> >>> Processor"/Hyperthreading was *enabled* while the one that is *not*
> >>> functioning had HT *disabled*. Enabled Logical Processor and the
> system
> >>> starts without issue! I've rebooted 3 times now without issue.
> >>> Dell R710 BIOS version 6.4.0
> >>> 2x Intel(R) Xeon(R) CPU L5639  @ 2.13GHz
> >>> 4.9.20-26.el7.x86_64 #1 SMP Tue Apr 4 11:19:26 CDT 2017 x86_64
> x86_64
> >>> x86_64 GNU/Linux
> >>>
> >>
> >>Outstanding .. I have now released a 4.9.23-26.el6 and .el7 to the
> >>system as normal updates.  It should be available later today.
> >>
> >>
> >>
> >>
> >> I've verified with a second Dell R710 that disabling
> >> Hyperthreading/Logical Processor causes the primary xen booting
> kernel
> >> to fail and reboot. Consequently, enabling allows for the system to
> >> start as expected and without any issue:
> >> Current tested kernel was: 4.9.13-22.el7.x86_64 #1 SMP Sun Feb 26
> >> 22:15:59 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
> >>
> >> I just attempted an update and the 4.9.23-26 is not yet up. Does this
> >> update address the Hyperthreading issue in any way?
> >>
> >
> > I don't think so .. at least I did not specifically add anything
> to do so.
> >
> > You can get it here for testing:
> >
> > https://buildlogs.centos.org/centos/7/virt/x86_64/xen/
> >
> > (or from /6/ as well for CentOS-6)
> >
> > Not sure why it did not go out on the signing run .. will check
> that server.
> >
> >
> >
> > ___
> > CentOS-virt mailing list
> > CentOS-virt@centos.org
> > https://lists.centos.org/mailman/listinfo/centos-virt
> 
> ___
> CentOS-virt mailing list
> CentOS-virt@centos.org
> https://lists.centos.org/mailman/listinfo/centos-virt
> 
> 
> 
> 
> ___
> CentOS-virt mailing list
> CentOS-virt@centos.org
> https://lists.centos.org/mailman/listinfo/centos-virt
> 


-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
___
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt


Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

2017-03-27 Thread Kevin Stange
On 03/27/2017 04:03 PM, Kevin Stange wrote:
> On 03/25/2017 02:35 PM, Sarah Newman wrote:
>> On 03/16/2017 04:22 PM, Kevin Stange wrote:
>>
>>>> I still can't rest assured the NIC issue is fixed, but no 4.4 or 4.9
>>>> server has yet had a NIC issue, with some being up almost a full month.
>>>> It looks promising! (I'm knocking on all the wood everywhere, though.)
>>>
>>> I'm ready to call this conclusive.  The problems I was having across the
>>> board seemed to be caused by something seriously broken in 3.18.  Most
>>> of my servers are now on 4.9.13 or newer and everything has been working
>>> very well.
>>>
>>> I'm not going to post any further updates unless something breaks.
>>> Thanks to everyone that provided tips and suggestions along the way.
>>>
>>
>> Do you mind sharing what hardware you have been running the 4.9 kernel
>> on, other than "Supermicro X9DRT, Dual Xeon E5-2650, 2x I350, 2x 82571EB"
>> and "Supermicro X9DRD-iF/LF, Dual Xeon E5-2630, 2x I350, 2x 82575EB", if
>> any?  Are you using any SATA/SAS controllers?
> 
> We have no expansion cards installed except for the dual-port gigabit
> NICs.  We're using the onboard SATA controller for only the local Dom0
> OS, and iSCSI and NFS for managing storage for VMs and images.
> 

We've got some other motherboards as well; I think this list is exhaustive:

Supermicro X8DT3
Supermicro X8DT6
Supermicro X9DRD-iF/LF
Supermicro X9DRT
Supermicro X9SCL/X9SCM

These are -F variants, which means they include a BMC chip with a
separate NIC.  A few of the X8DT3 boards are the LN4 variant, which has
4 onboard NICs, so in those we did not use an expansion NIC.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
___
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt


Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

2017-03-27 Thread Kevin Stange
On 03/25/2017 02:35 PM, Sarah Newman wrote:
> On 03/16/2017 04:22 PM, Kevin Stange wrote:
> 
>>> I still can't rest assured the NIC issue is fixed, but no 4.4 or 4.9
>>> server has yet had a NIC issue, with some being up almost a full month.
>>> It looks promising! (I'm knocking on all the wood everywhere, though.)
>>
>> I'm ready to call this conclusive.  The problems I was having across the
>> board seemed to be caused by something seriously broken in 3.18.  Most
>> of my servers are now on 4.9.13 or newer and everything has been working
>> very well.
>>
>> I'm not going to post any further updates unless something breaks.
>> Thanks to everyone that provided tips and suggestions along the way.
>>
> 
> Do you mind sharing what hardware you have been running the 4.9 kernel
> on, other than "Supermicro X9DRT, Dual Xeon E5-2650, 2x I350, 2x 82571EB"
> and "Supermicro X9DRD-iF/LF, Dual Xeon E5-2630, 2x I350, 2x 82575EB", if
> any?  Are you using any SATA/SAS controllers?

We have no expansion cards installed except for the dual-port gigabit
NICs.  We're using the onboard SATA controller for only the local Dom0
OS, and iSCSI and NFS for managing storage for VMs and images.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
___
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt


Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

2017-03-16 Thread Kevin Stange
On 02/24/2017 11:51 AM, Kevin Stange wrote:
> On 02/21/2017 05:32 PM, Kevin Stange wrote:
>> On 02/21/2017 11:50 AM, Johnny Hughes wrote:
>>> On 02/21/2017 11:47 AM, Johnny Hughes wrote:
>>>>
>>>>
>>>> Kevin,
>>>>
>>>> Please try the 4.9.11-22 kernel that I just released for CentOS-6 (along
>>>> with the newer linux-firmare packages and xfsprogs).
>>>>
>>>> If you enable the xen-testing repository in your CentOS-Xen.repo file
>>>> (assuming it is pointing to xen-44 and not xen-46) then a 'yum upgrade'
>>>> should replace all the needed packages.
>>>>
>>>> The actual path is here for the packages:
>>>>
>>>> https://buildlogs.centos.org/centos/6/virt/x86_64/xen-44/
>>>>
>>>> Hopefully this helps.
>>>>
>>>
>>>
>>> I should have said .. 'just released for testing' :)
>>>
>>> I have been using this for 4 or 5 days with no issues in production, but
>>> it needs testing before final release :)
>>
>> Currently I've moved most of my servers onto the 4.4 kernel from Xen
>> Made Easy and they've been stable.  I have some indications of an issue
>> with one of my 3.18 servers right now which required it to be rebooted,
>> so I'm going to bring the 4.9 kernel up on that server to see how it
>> does.  It may take a few weeks or more to draw any conclusions.
> 
> Currently running 4.9.11 on a few servers and they've been working fine.
>  No new issues have come up so far, anyway.
> 
> I still can't rest assured the NIC issue is fixed, but no 4.4 or 4.9
> server has yet had a NIC issue, with some being up almost a full month.
> It looks promising! (I'm knocking on all the wood everywhere, though.)

I'm ready to call this conclusive.  The problems I was having across the
board seemed to be caused by something seriously broken in 3.18.  Most
of my servers are now on 4.9.13 or newer and everything has been working
very well.

I'm not going to post any further updates unless something breaks.
Thanks to everyone that provided tips and suggestions along the way.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
___
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt


Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

2017-02-24 Thread Kevin Stange
On 02/21/2017 05:32 PM, Kevin Stange wrote:
> On 02/21/2017 11:50 AM, Johnny Hughes wrote:
>> On 02/21/2017 11:47 AM, Johnny Hughes wrote:
>>>
>>>
>>> Kevin,
>>>
>>> Please try the 4.9.11-22 kernel that I just released for CentOS-6 (along
>>> with the newer linux-firmare packages and xfsprogs).
>>>
>>> If you enable the xen-testing repository in your CentOS-Xen.repo file
>>> (assuming it is pointing to xen-44 and not xen-46) then a 'yum upgrade'
>>> should replace all the needed packages.
>>>
>>> The actual path is here for the packages:
>>>
>>> https://buildlogs.centos.org/centos/6/virt/x86_64/xen-44/
>>>
>>> Hopefully this helps.
>>>
>>
>>
>> I should have said .. 'just released for testing' :)
>>
>> I have been using this for 4 or 5 days with no issues in production, but
>> it needs testing before final release :)
> 
> Currently I've moved most of my servers onto the 4.4 kernel from Xen
> Made Easy and they've been stable.  I have some indications of an issue
> with one of my 3.18 servers right now which required it to be rebooted,
> so I'm going to bring the 4.9 kernel up on that server to see how it
> does.  It may take a few weeks or more to draw any conclusions.

Currently running 4.9.11 on a few servers and they've been working fine.
 No new issues have come up so far, anyway.

I still can't rest assured the NIC issue is fixed, but no 4.4 or 4.9
server has yet had a NIC issue, with some being up almost a full month.
It looks promising! (I'm knocking on all the wood everywhere, though.)

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
___
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt


Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

2017-02-21 Thread Kevin Stange
On 02/21/2017 11:50 AM, Johnny Hughes wrote:
> On 02/21/2017 11:47 AM, Johnny Hughes wrote:
>>
>>
>> Kevin,
>>
>> Please try the 4.9.11-22 kernel that I just released for CentOS-6 (along
>> with the newer linux-firmare packages and xfsprogs).
>>
>> If you enable the xen-testing repository in your CentOS-Xen.repo file
>> (assuming it is pointing to xen-44 and not xen-46) then a 'yum upgrade'
>> should replace all the needed packages.
>>
>> The actual path is here for the packages:
>>
>> https://buildlogs.centos.org/centos/6/virt/x86_64/xen-44/
>>
>> Hopefully this helps.
>>
> 
> 
> I should have said .. 'just released for testing' :)
> 
> I have been using this for 4 or 5 days with no issues in production, but
> it needs testing before final release :)

Currently I've moved most of my servers onto the 4.4 kernel from Xen
Made Easy and they've been stable.  I have some indications of an issue
with one of my 3.18 servers right now which required it to be rebooted,
so I'm going to bring the 4.9 kernel up on that server to see how it
does.  It may take a few weeks or more to draw any conclusions.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
___
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt


Re: [CentOS-virt] Xen updates in the Testing Repo for XSA-207 and XSA-208

2017-02-17 Thread Kevin Stange
Given the circumstances, might it make sense to offer formal advisories
of some type for these, to indicate whether the packages going live are
for security or other reasons?

On 02/17/2017 09:51 AM, Johnny Hughes wrote:
> These updates have now been pushed to mirror.centos.org and you can get
> them from the main repos.
> 
> On 02/15/2017 08:27 AM, Johnny Hughes wrote:
>> There are xen rpms in the testing repos for XSA 207 and 208 in the
>> testing repos (xen-4.4.4-18.el6,  xen-4.6.3-7.el6, xen-4.6.3-7.el7).
>>
>> You can enable the applicable centos-virt-xen-testing repo in your
>> /etc/yum.repos.d/CentOS-Xen.repo file.
>>
>> Please report positive and negative tests to this list so we can promote
>> the updates to the main repos.
>>
>> Thanks,
>> Johnny Hughes
>>
>>
>>
>>
>>
>> ___
>> CentOS-virt mailing list
>> CentOS-virt@centos.org
>> https://lists.centos.org/mailman/listinfo/centos-virt
>>
> 
> 
> 
> 
> _______
> CentOS-virt mailing list
> CentOS-virt@centos.org
> https://lists.centos.org/mailman/listinfo/centos-virt
> 


-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
___
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt


Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

2017-02-13 Thread Kevin Stange
On 02/12/2017 05:07 PM, Adi Pircalabu wrote:
> On 11/02/17 06:29, Kevin Stange wrote:
>> On 01/30/2017 06:41 PM, Kevin Stange wrote:
>>> On 01/30/2017 06:12 PM, Adi Pircalabu wrote:
>>>> On 31/01/17 10:49, Kevin Stange wrote:
>>>>> You said 3.x kernels specifically. The kernel on Xen Made Easy now
>>>>> is a
>>>>> 4.4 kernel.  Any chance you have tested with that one?
>>>>
>>>> Not yet, however the future Xen nodes we'll deploy will run CentOS 7
>>>> and
>>>> Xen with kernel 4.4.
>>>
>>> I'll keep you (and others here) posted on my own experiences with that
>>> 4.4 build over the next few weeks to report on any issues.  I'm hoping
>>> something happened between 3.18 and 4.4 that fixed underlying problems.
>>>
>>>>> Did you ever try without MTU=9000 (default 1500 instead)?
>>>>
>>>> Yes, also with all sorts of configuration combinations like LACP rate
>>>> slow/fast, "options ixgbe LRO=0,0" and so on. No improvement.
>>>
>>> Alright, I'll assume that probably won't help then.  I tried it on one
>>> box which hasn't had the issue again yet, but that doesn't guarantee
>>> anything.
>>
>> I was able to discover something new, which might not conclusively prove
>> anything, but it at least seems to rule out the pci=nomsi kernel option
>> from being effective.
>>
>> I had one server booted with that option as well as MTU 1500.  It was
>> stable for quite a long time, so I decided to try turning the MTU back
>> to 9000 and within 12 hours, the interface on the expansion NIC with the
>> jumbo MTU failed.
>>
>> The other NIC in the LACP bundle is onboard and didn't fail.  The other
>> NIC on the dual-port expansion card also didn't fail.  This leads me to
>> believe that ONE of the bugs I'm experiencing is related to 82575EB +
>> jumbo frames.
>>
>> I still think I'm also having a PCI-e issue that is separate and
>> additional on top of that, and which has not reared its head recently,
>> making it difficult for me to gather any new data.
>>
>> One of the things I've done that seemed to help a lot with stability was
>> balance the LACP so that one NIC from onboard and one NIC from expansion
>> card is in each LAG.  Previously we just had the first LAG onboard and
>> the second on the expansion card.  This way, at least, given the
>> expansion NIC's propensity toward failing first, I don't have to crash
>> the server and all running VMs to recover.
>>
>> I've seen absolutely no issues yet with the 4.4 kernel either, but I am
>> not willing to call that a win because of the quiet from even the
>> servers on which no tweaks have been applied yet.
> 
> Thanks for the heads-up Kevin, appreciated. One thing I need to clarify,
> though: what kernel was this machine running at the time?

Kernel running at the time was the Virt SIG's 3.18.44-20 kernel.

As a further note, within an additional 24 hours, the onboard Intel
82576 that was switched to enable jumbo frames also failed and we had to
reboot the server.  The expansion and onboard ports without jumbo frames
did not fail.  Since reboot, it's on the 4.4.47 kernel from Xen Made
Easy now with jumbo frames and has not exhibited issues since Friday.
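
For anyone running the same A/B test: the MTU can be flipped on a live
bond without editing ifcfg files.  A sketch (interface names vary, and
the slave NICs may need the same change):

    ip link set dev bond1 mtu 9000
    # and to revert:
    ip link set dev bond1 mtu 1500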

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
___
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt


Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

2017-02-10 Thread Kevin Stange
On 01/30/2017 06:41 PM, Kevin Stange wrote:
> On 01/30/2017 06:12 PM, Adi Pircalabu wrote:
>> On 31/01/17 10:49, Kevin Stange wrote:
>>> You said 3.x kernels specifically. The kernel on Xen Made Easy now is a
>>> 4.4 kernel.  Any chance you have tested with that one?
>>
>> Not yet, however the future Xen nodes we'll deploy will run CentOS 7 and
>> Xen with kernel 4.4.
> 
> I'll keep you (and others here) posted on my own experiences with that
> 4.4 build over the next few weeks to report on any issues.  I'm hoping
> something happened between 3.18 and 4.4 that fixed underlying problems.
> 
>>> Did you ever try without MTU=9000 (default 1500 instead)?
>>
>> Yes, also with all sorts of configuration combinations like LACP rate
>> slow/fast, "options ixgbe LRO=0,0" and so on. No improvement.
> 
> Alright, I'll assume that probably won't help then.  I tried it on one
> box which hasn't had the issue again yet, but that doesn't guarantee
> anything.

I was able to discover something new, which might not conclusively prove
anything, but it at least seems to rule out the pci=nomsi kernel option
from being effective.

I had one server booted with that option as well as MTU 1500.  It was
stable for quite a long time, so I decided to try turning the MTU back
to 9000 and within 12 hours, the interface on the expansion NIC with the
jumbo MTU failed.

The other NIC in the LACP bundle is onboard and didn't fail.  The other
NIC on the dual-port expansion card also didn't fail.  This leads me to
believe that ONE of the bugs I'm experiencing is related to 82575EB +
jumbo frames.

I still think I'm also having a PCI-e issue that is separate and
additional on top of that, and which has not reared its head recently,
making it difficult for me to gather any new data.

One of the things I've done that seemed to help a lot with stability was
balance the LACP so that one NIC from onboard and one NIC from expansion
card is in each LAG.  Previously we just had the first LAG onboard and
the second on the expansion card.  This way, at least, given the
expansion NIC's propensity toward failing first, I don't have to crash
the server and all running VMs to recover.

I've seen absolutely no issues yet with the 4.4 kernel either, but I am
not willing to call that a win because of the quiet from even the
servers on which no tweaks have been applied yet.

I will continue the story as I have more material! :)

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
___
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt


Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

2017-01-30 Thread Kevin Stange
On 01/30/2017 06:12 PM, Adi Pircalabu wrote:
> On 31/01/17 10:49, Kevin Stange wrote:
>> You said 3.x kernels specifically. The kernel on Xen Made Easy now is a
>> 4.4 kernel.  Any chance you have tested with that one?
> 
> Not yet, however the future Xen nodes we'll deploy will run CentOS 7 and
> Xen with kernel 4.4.

I'll keep you (and others here) posted on my own experiences with that
4.4 build over the next few weeks to report on any issues.  I'm hoping
something happened between 3.18 and 4.4 that fixed underlying problems.

>> Did you ever try without MTU=9000 (default 1500 instead)?
> 
> Yes, also with all sorts of configuration combinations like LACP rate
> slow/fast, "options ixgbe LRO=0,0" and so on. No improvement.

Alright, I'll assume that probably won't help then.  I tried it on one
box which hasn't had the issue again yet, but that doesn't guarantee
anything.

>> I am having certain issues on certain hardware where there's no shutting
>> down the affected NICs.  Trying to do so or unload the igb module hangs
>> the entire box.  But in that case they're throwing AER errors instead of
>> just unit hangs:
>>
>> pcieport :00:03.0: AER: Uncorrected (Non-Fatal) error received:
>> id=
>> igb :04:00.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal),
>> type=Transaction Layer, id=0401(Requester ID)
>> igb :04:00.1:   device [8086:10a7] error
>> status/mask=4000/
>> igb :04:00.1:[14] Completion Timeout (First)
>> igb :04:00.1: broadcast error_detected message
>> igb :04:00.1: broadcast slot_reset message
>> igb :04:00.1: broadcast resume message
>> igb :04:00.1: AER: Device recovery successful
> 
> This is interesting. We've never had any problems with the 1Gb NICs, but
> we're only using 10Gb for the storage network. Could it be a common
> problem with either the adapters or the drivers, one which only shows up
> when running the Xen-enabled kernel?

Since I've never run the 3.18 kernel on a box of this type without
running in a dom0 and since I can't reproduce this kind of issue without
a fair amount of NIC load over a tremendous period of time, it's
impossible to test if it's tied to Xen.

However, I know this hardware works well under 2.6.32-*.el6 and
3.10.0-*.el7 kernels without stability problems, as it did with
2.6.18-*.el5xen (Xen 3.4.4).

I suspect the above errors are actually due to something PCIe related,
and I have a subset of boxes which are actually being impacted by two
distinct problems with equivalent impact, which increases the likelihood
that the boxes will die.  Another set of boxes only ever sees the unit
hangs which seem unrecoverable even unloading/reloading the driver.  A
third set has random recoverable unit hangs only.  With so much
diversity, it's even harder to pin any specific causes to the problems.

The fact we're both pushing NFS and iSCSI traffic over these links makes
me wonder if there's something about that kind of traffic that increases
the chances of causing these issues.  When I put VM network traffic over
the same NICs, they seem a lot less prone to failures, but also end up
pushing less traffic in general.

>> Switching to Broadcom would be a possibility, though it's tricky because
>> two of the NICs are onboard, so we'd need to replace the dual-port 1G
>> card with a quad-port 1G card.  Since you're saying you're all 10G,
>> maybe you don't know, but if you have any specific Broadcom 1G cards
>> you've had good fortune with, I'd be interested in knowing which models.
>>   Broadcom cards are rarely labeled as such which makes finding them a
>> bit more difficult than Intel ones.
> 
> We've purchased a number of servers with Broadcom BCM957810A1008G, sold
> by Dell as QLogic 57810 dual 10Gb Base-T adapters, none of them going up
> & down like a yo-yo so far.
> 
>> So far the one hypervisor with pci=nomsi has been quiet but that doesn't
>> mean it's fixed.  I need to give it 6 weeks or so. :)
> 
> It'd be more like 6-9 months for us, making it terrible to debug it :-/

I had a bunch of these on relatively light VM load for 3 months for
"burn in" with no issues but they've been pretty aggressively failing
since I started to try to put real loads on them.  Still, it's odd
because some of the boxes with identical hardware and similar VM loads
have not yet blown up after 3 or more weeks, and maybe they won't for
several months.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
___
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt


Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

2017-01-30 Thread Kevin Stange
On 01/30/2017 04:17 PM, Adi Pircalabu wrote:
> On 28/01/17 05:21, Kevin Stange wrote:
>> On 01/27/2017 06:08 AM, Karel Hendrych wrote:
>>> Have you tried to eliminate all power management features all over?
>>
>> I've been trying to find and disable all power management features but
>> having relatively little luck with that solving the problems.  Stabbing
>> in the dark, I've tried different ACPI settings, including completely
>> disabling it, disabling CPU frequency scaling, and setting pcie_aspm=off
>> on the kernel command line.  Are there other kernel options that might
>> be useful to try?
> 
> May I chip in here? In our environment we're randomly seeing:

Welcome.  It's a relief to know someone else has been having a similar
nightmare!  Perhaps that's not encouraging...

> Jan 17 23:40:14 xen01 kernel: ixgbe :04:00.1 eth6: Detected Tx Unit
> Hang
> Jan 17 23:40:14 xen01 kernel:  Tx Queue <0>
> Jan 17 23:40:14 xen01 kernel:  TDH, TDT <9a>, <127>
> Jan 17 23:40:14 xen01 kernel:  next_to_use  <127>
> Jan 17 23:40:14 xen01 kernel:  next_to_clean<98>
> Jan 17 23:40:14 xen01 kernel: ixgbe :04:00.1 eth6:
> tx_buffer_info[next_to_clean]
> Jan 17 23:40:14 xen01 kernel:  time_stamp   <218443db3>
> Jan 17 23:40:14 xen01 kernel:  jiffies  <218445368>
> Jan 17 23:40:14 xen01 kernel: ixgbe :04:00.1 eth6: tx hang 1
> detected on queue 0, resetting adapter
> Jan 17 23:40:14 xen01 kernel: ixgbe :04:00.1 eth6: Reset adapter
> Jan 17 23:40:15 xen01 kernel: ixgbe :04:00.1 eth6: PCIe transaction
> pending bit also did not clear.
> Jan 17 23:40:15 xen01 kernel: ixgbe :04:00.1: master disable timed out
> Jan 17 23:40:15 xen01 kernel: bonding: bond1: link status down for
> interface eth6, disabling it in 200 ms.
> Jan 17 23:40:15 xen01 kernel: bonding: bond1: link status definitely
> down for interface eth6, disabling it
> [...] repeated every second or so.
> 
>>> Are the devices connected to the same network infrastructure?
>>
>> There are two onboard NICs and two NICs on a dual-port card in each
>> server.  All devices connect to a cisco switch pair in VSS and the links
>> are paired in LACP.
> 
> We've been experiencing ixgbe stability issues on CentOS 6.x with various
> 3.x kernels for years, with different ixgbe driver versions, and to date
> the only way to completely get rid of the issue has been to switch from
> Intel to Broadcom. Just like in your case, the problem pops up randomly and
> the only reliable temporary fix is to reboot the affected Xen node.
> Another temporary fix that worked several times but not always was to
> migrate / shutdown the domUs, deactivate the volume groups, log out of
> all the iSCSI targets, "ifdown bond1" and "modprobe -r ixgbe" followed
> by "ifup bond1".
> 
> The set up is:
> - Intel Dual 10Gb Ethernet - either X520-T2 or X540-T2
> - Tried Xen kernels from both xen.crc.id.au and CentoS 6 Xen repos
> - LACP bonding to connect to the NFS & iSCSI storage using Brocade
> VDX6740T fabric. MTU=9000

You said 3.x kernels specifically. The kernel on Xen Made Easy now is a
4.4 kernel.  Any chance you have tested with that one?

Did you ever try without MTU=9000 (default 1500 instead)?

I am having certain issues on certain hardware where there's no shutting
down the affected NICs.  Trying to do so or unload the igb module hangs
the entire box.  But in that case they're throwing AER errors instead of
just unit hangs:

pcieport :00:03.0: AER: Uncorrected (Non-Fatal) error received: id=
igb :04:00.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal),
type=Transaction Layer, id=0401(Requester ID)
igb :04:00.1:   device [8086:10a7] error status/mask=4000/
igb :04:00.1:[14] Completion Timeout (First)
igb :04:00.1: broadcast error_detected message
igb :04:00.1: broadcast slot_reset message
igb :04:00.1: broadcast resume message
igb :04:00.1: AER: Device recovery successful

Spammed continuously.

Switching to Broadcom would be a possibility, though it's tricky because
two of the NICs are onboard, so we'd need to replace the dual-port 1G
card with a quad-port 1G card.  Since you're saying you're all 10G,
maybe you don't know, but if you have any specific Broadcom 1G cards
you've had good fortune with, I'd be interested in knowing which models.
 Broadcom cards are rarely labeled as such which makes finding them a
bit more difficult than Intel ones.

>>> There has to be something common.
>>
>> The NICs having issues are running a native VLAN, a tagged VLAN, iSCSI
>> and NFS traffic, as well as some basic management stuff over SSH, and
>> they are configured with an MTU of 9000 on the native VLAN.

Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

2017-01-30 Thread Kevin Stange
On 01/30/2017 02:15 PM, Johnny Hughes wrote:
> On 01/30/2017 12:59 PM, Kevin Stange wrote:
>> On 01/30/2017 03:18 AM, Jinesh Choksi wrote:
>>>> Are there other kernel options that might be useful to try?
>>>
>>> pci=nomsi
>>>
>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173/comments/13
>>
>> Incidentally, I already found that one and I'm currently trying it on
>> one of the boxes.  So far there have been no issues, but it's only been
>> since Friday.
>>
>> Also, I found this:
>>
>> https://xen.crc.id.au/support/guides/install/
>>
>> There's a 4.4 kernel here built for Xen Dom0, which I'm giving a whirl
>> to see how stable it is, also only since Friday.  I'm not using anything
>> else he's packaged from his repo.
>>
>> On a related note, does the SIG have plans to replace the 3.18 kernel,
>> which is marked with a projected EOL of January 2017
>> (https://www.kernel.org/category/releases.html)?
>>
> 
> I am currently working on a 4.4 kernel as a replacement for the 3.18
> kernel.  I have it working well on el7, but not yet working well on el6.
>  I hope to have something to release in the first 2 weeks of Feb. for
> testing.

What kind of issues are you having with 4.4?  Since I'm testing that
"Xen Made Easy" build of 4.4, are there any things I should watch out
for?  Might be worth looking at what he did for his builds to see if
that helps get yours working better.

http://au1.mirror.crc.id.au/repo/el6/SRPM/

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
___
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt


Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

2017-01-30 Thread Kevin Stange
On 01/30/2017 03:18 AM, Jinesh Choksi wrote:
>>Are there other kernel options that might be useful to try?
> 
> pci=nomsi
> 
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173/comments/13

Incidentally, I already found that one and I'm currently trying it on one
of the boxes.  So far there have been no issues, but it's only been since
Friday.
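
For reference, on a CentOS 6 dom0 that option belongs on the dom0
kernel's "module" line in /boot/grub/grub.conf, not on the xen.gz
"kernel" line.  A sketch with placeholders for the site-specific bits:

    kernel /xen.gz [existing hypervisor options]
    module /vmlinuz-3.18.44-20.el6.x86_64 ro root=[your root] pci=nomsi
    module /initramfs-3.18.44-20.el6.x86_64.img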

Also, I found this:

https://xen.crc.id.au/support/guides/install/

There's a 4.4 kernel here built for Xen Dom0, which I'm giving a whirl
to see how stable it is, also only since Friday.  I'm not using anything
else he's packaged from his repo.

On a related note, does the SIG have plans to replace the 3.18 kernel,
which is marked with a projected EOL of January 2017
(https://www.kernel.org/category/releases.html)?

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
___
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt


Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

2017-01-27 Thread Kevin Stange
On 01/27/2017 06:08 AM, Karel Hendrych wrote:
> Have you tried to eliminate all power management features all over?

I've been trying to find and disable all power management features but
having relatively little luck with that solving the problems.  Stabbing
in the dark, I've tried different ACPI settings, including completely
disabling it, disabling CPU frequency scaling, and setting pcie_aspm=off
on the kernel command line.  Are there other kernel options that might
be useful to try?
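
For the record, the other power-management knobs I'm aware of, none of
them confirmed fixes for this, just candidates to rule in or out:

    intel_idle.max_cstate=0 processor.max_cstate=1 idle=poll

idle=poll in particular burns a lot of power, so it's only sensible as
a short-lived diagnostic.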

> Are the devices connected to the same network infrastructure?

There are two onboard NICs and two NICs on a dual-port card in each
server.  All devices connect to a Cisco switch pair in VSS and the links
are paired in LACP.

> There has to be something common.

The NICs having issues are running a native VLAN, a tagged VLAN, iSCSI
and NFS traffic, as well as some basic management stuff over SSH, and
they are configured with an MTU of 9000 on the native VLAN.  It's a lot
of features, but I can't really turn them off and then actually have
enough load on the NICs to reproduce the issue.  Several of these
servers were installed and being burned in for 3 months without ever
having an issue, but suddenly collapsed when I tried to bring 20 or so
real-world VMs up on them.

The other NICs in the system that are connected don't exhibit issues and
run only VM network interfaces.  They are also in LACP and running VLAN
tags, but normal 1500 MTU.

So far it seems to correlate with NICs on the expansion cards, but it
may just be coincidence, since these cards also happen to be the ones
carrying the storage and management traffic.
NICs to see if the issues migrate over with it, or if they stay with the
expansion cards.

If the issue exists on both NIC types, then it rules out the specific
NIC chipset as the culprit.  It could point to the driver, but upgrading
it to a newer version did not help and actually appeared to make
everything worse.  This issue might actually be more to do with the PCIe
bridge than the NICs, but these are still different motherboards with
different PCIe bridges (5520 vs C600) experiencing the same issues.

> I've been using Intel NICs with Xen/CentOS for ages with no issues.

I figured that must be so.  Everyone uses Intel NICs.  If this was a
common issue, it would probably be causing a lot of people a lot of trouble.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
___
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt


Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

2017-01-26 Thread Kevin Stange
On 01/26/2017 02:08 PM, Kevin Stange wrote:
> On 01/26/2017 09:35 AM, Johnny Hughes wrote:
>> On 01/26/2017 09:32 AM, Johnny Hughes wrote:
>>> On 01/25/2017 11:49 AM, Kevin Stange wrote:
>>>> On 01/24/2017 11:16 AM, Kevin Stange wrote:
>>>>> On 01/24/2017 09:10 AM, Konrad Rzeszutek Wilk wrote:
>>>>>> On Tue, Jan 24, 2017 at 09:29:39PM +0800, -=X.L.O.R.D=- wrote:
>>>>>>> Kevin Stange,
>>>>>>> It can be either kernel or update the NIC driver or firmware of the NIC
>>>>>>> card. Hope that helps!
>>>>>>>
>>>>>>> Xlord
>>>>>>> -Original Message-
>>>>>>> From: CentOS-virt [mailto:centos-virt-boun...@centos.org] On Behalf Of 
>>>>>>> Kevin
>>>>>>> Stange
>>>>>>> Sent: Tuesday, January 24, 2017 1:04 AM
>>>>>>> To: centos-virt@centos.org
>>>>>>> Subject: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 /
>>>>>>> Linux 3.18
>>>>>>>
>>>>> 
>>>>>>>
>>>>>>> Has anyone experienced similar issues with this configuration, and if 
>>>>>>> so,
>>>>>>> does anyone have tips on how to resolve the issues?
>>>>>>
>>>>>> Honestly I would email Intel and see if they can help. This looks like
>>>>>> the NIC decides something is wrong, throws a PCIe error and
>>>>>> then resets itself.
>>>>>
>>>>> This happens for several different NICs.  Is there a good contact at
>>>>> Intel for this kind of thing, or should I just try to reach them through
>>>>> their web site?
>>>>>
>>>>>> It could also be an error in the Linux stack which would "eat" an
>>>>>> interrupt when migrating interrupts (which was fixed
>>>>>> upstream, see below). Are you running irqbalance? Could you try
>>>>>> turning it off?
>>>>>
>>>>> irqbalance is enabled on these servers.  I'll try disabling it.
>>>>
>>>> I had stopped irqbalance yesterday afternoon, but had a hypervisor's
>>>> NICs fail anyway early this morning, so I'm pretty sure this
>>>> is not the right tree to bark up.
>>>>
>>>
>>> Here is a set of drivers/fireware from Intel for those NICs:
>>>
>>> https://downloadcenter.intel.com/download/15817/Intel-Network-Adapter-Driver-for-PCI-E-Gigabit-Network-Connections-under-Linux-
>>>
>>> I will see if I can get a CentOS-6 build of the latest version of that
>>> from our older SRPM:
>>>
>>> http://vault.centos.org/6.7/xen4/Source/SPackages/e1000e-2.5.4-3.10.68.2.el6.centos.alt.src.rpm
>>>
>>> I am currently very busy with several c5, c6, c7 updates and the i686
>>> altarch c7 tree .. but I have this on my list.  In the meantime, maybe
>>> someone else could also see if those drivers help you (or you could try
>>> to compile / install it).
>>>
>>> Do you have another machine that you can use to see if you can duplicate
>>> the issue NOT running the xen.gz hypervisor boot, but just the straight
>>> kernel?
> 
> I can't actually reproduce this problem reliably.  It happens randomly
> when the servers are up and running anywhere between a few hours and a
> month or more, and I haven't been able to isolate any specific way to
> cause it to happen.  As a result I can't really test different solutions
> on different servers to see what helps.  I was hoping other people were
> seeing it so that I could get some direction.  If I can reproduce it, it
> won't take me very long to identify what the cause is.  Right now if I
> do upgrade the drivers on the systems I won't really know if it's fixed
> until I don't see another issue for several months.
> 
>> Actually .. I think this is the driver for you:
>>
>> https://downloadcenter.intel.com/download/13663
>>
>> And this explains how to make it work:
>>
>> http://www.intel.com/content/www/us/en/support/network-and-i-o/ethernet-products/05767.html
> 
> The different combinations of NICs overlap both the e1000e and igb
> drivers, but the most egregious issues have been with the igb ones.
> I'll try to give this a shot and report back if I still see issues with
> a server after doing so, but it might be a week or two before I find out.

So the NICs giving issues were, in most cases, using the igb driver.  I've tried
replacing the drivers on some HVs with the version you suggested, but it
doesn't seem to have helped with stability.  Any other ideas?
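
In case it helps with comparing notes, this is how I confirm which
driver build an interface is actually using (eth4 here is just an
example name):

    ethtool -i eth4                    # driver / version / firmware-version
    modinfo igb | grep -m1 '^version:'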

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
___
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt


Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

2017-01-26 Thread Kevin Stange
On 01/26/2017 09:35 AM, Johnny Hughes wrote:
> On 01/26/2017 09:32 AM, Johnny Hughes wrote:
>> On 01/25/2017 11:49 AM, Kevin Stange wrote:
>>> On 01/24/2017 11:16 AM, Kevin Stange wrote:
>>>> On 01/24/2017 09:10 AM, Konrad Rzeszutek Wilk wrote:
>>>>> On Tue, Jan 24, 2017 at 09:29:39PM +0800, -=X.L.O.R.D=- wrote:
>>>>>> Kevin Stange,
>>>>>> It can be either kernel or update the NIC driver or firmware of the NIC
>>>>>> card. Hope that helps!
>>>>>>
>>>>>> Xlord
>>>>>> -Original Message-
>>>>>> From: CentOS-virt [mailto:centos-virt-boun...@centos.org] On Behalf Of 
>>>>>> Kevin
>>>>>> Stange
>>>>>> Sent: Tuesday, January 24, 2017 1:04 AM
>>>>>> To: centos-virt@centos.org
>>>>>> Subject: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 /
>>>>>> Linux 3.18
>>>>>>
>>>> 
>>>>>>
>>>>>> Has anyone experienced similar issues with this configuration, and if so,
>>>>>> does anyone have tips on how to resolve the issues?
>>>>>
>>>>> Honestly I would email Intel and see if they can help. This looks like
>>>>> the NIC decides something is wrong, throws a PCIe error and
>>>>> then resets itself.
>>>>
>>>> This happens for several different NICs.  Is there a good contact at
>>>> Intel for this kind of thing, or should I just try to reach them through
>>>> their web site?
>>>>
>>>>> It could also be an error in the Linux stack which would "eat" an
>>>>> interrupt when migrating interrupts (which was fixed
>>>>> upstream, see below). Are you running irqbalance? Could you try
>>>>> turning it off?
>>>>
>>>> irqbalance is enabled on these servers.  I'll try disabling it.
>>>
>>> I had stopped irqbalance yesterday afternoon, but had a hypervisor's
>>> NICs fail anyway early this morning, so I'm pretty sure this
>>> is not the right tree to bark up.
>>>
>>
>> Here is a set of drivers/fireware from Intel for those NICs:
>>
>> https://downloadcenter.intel.com/download/15817/Intel-Network-Adapter-Driver-for-PCI-E-Gigabit-Network-Connections-under-Linux-
>>
>> I will see if I can get a CentOS-6 build of the latest version of that
>> from our older SRPM:
>>
>> http://vault.centos.org/6.7/xen4/Source/SPackages/e1000e-2.5.4-3.10.68.2.el6.centos.alt.src.rpm
>>
>> I am currently very busy with several c5, c6, c7 updates and the i686
>> altarch c7 tree .. but I have this on my list.  In the meantime, maybe
>> someone else could also see if those drivers help you (or you could try
>> to compile / install it).
>>
>> Do you have another machine that you can use to see if you can duplicate
>> the issue NOT running the xen.gz hypervisor boot, but just the straight
>> kernel?

I can't actually reproduce this problem reliably.  It happens randomly
when the servers are up and running anywhere between a few hours and a
month or more, and I haven't been able to isolate any specific way to
cause it to happen.  As a result I can't really test different solutions
on different servers to see what helps.  I was hoping other people were
seeing it so that I could get some direction.  If I can reproduce it, it
won't take me very long to identify what the cause is.  Right now if I
do upgrade the drivers on the systems I won't really know if it's fixed
until I don't see another issue for several months.

> Actually .. I think this is the driver for you:
> 
> https://downloadcenter.intel.com/download/13663
> 
> And this explains how to make it work:
> 
> http://www.intel.com/content/www/us/en/support/network-and-i-o/ethernet-products/05767.html

The different combinations of NICs overlap both the e1000e and igb
drivers, but the most egregious issues have been with the igb ones.
I'll try to give this a shot and report back if I still see issues with
a server after doing so, but it might be a week or two before I find out.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
___
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt


Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

2017-01-25 Thread Kevin Stange
On 01/24/2017 11:16 AM, Kevin Stange wrote:
> On 01/24/2017 09:10 AM, Konrad Rzeszutek Wilk wrote:
>> On Tue, Jan 24, 2017 at 09:29:39PM +0800, -=X.L.O.R.D=- wrote:
>>> Kevin Stange,
>>> It could be the kernel; alternatively, try updating the driver or
>>> firmware of the NIC card.  Hope that helps!
>>>
>>> Xlord
>>> -Original Message-
>>> From: CentOS-virt [mailto:centos-virt-boun...@centos.org] On Behalf Of Kevin
>>> Stange
>>> Sent: Tuesday, January 24, 2017 1:04 AM
>>> To: centos-virt@centos.org
>>> Subject: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 /
>>> Linux 3.18
>>>
> 
>>>
>>> Has anyone experienced similar issues with this configuration, and if so,
>>> does anyone have tips on how to resolve the issues?
>>
>> Honestly I would email Intel and see if they can help. This looks like
>> the NIC decides something is wrong, throws off a PCIe error and
>> then resets itself.
> 
> This happens for several different NICs.  Is there a good contact at
> Intel for this kind of thing, or should I just try to reach them through
> their web site?
> 
>> It could also be an error in the Linux stack which would "eat" an
>> interrupt when migrating interrupts (which was fixed
>> upstream, see below). Are you running irqbalance? Could you try
>> turning it off?
> 
> irqbalance is enabled on these servers.  I'll try disabling it.

I had stopped irqbalance yesterday afternoon, but a hypervisor's NICs
failed anyway early this morning, so I'm pretty sure this is not the
right tree to bark up.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
___
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt


Re: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

2017-01-24 Thread Kevin Stange
On 01/24/2017 09:10 AM, Konrad Rzeszutek Wilk wrote:
> On Tue, Jan 24, 2017 at 09:29:39PM +0800, -=X.L.O.R.D=- wrote:
>> Kevin Stange,
>> It could be the kernel; alternatively, try updating the driver or
>> firmware of the NIC card.  Hope that helps!
>>
>> Xlord
>> -Original Message-
>> From: CentOS-virt [mailto:centos-virt-boun...@centos.org] On Behalf Of Kevin
>> Stange
>> Sent: Tuesday, January 24, 2017 1:04 AM
>> To: centos-virt@centos.org
>> Subject: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 /
>> Linux 3.18
>>

>>
>> Has anyone experienced similar issues with this configuration, and if so,
>> does anyone have tips on how to resolve the issues?
> 
> Honestly I would email Intel and see if they can help. This looks like
> the NIC decides something is wrong, throws off a PCIe error and
> then resets itself.

This happens for several different NICs.  Is there a good contact at
Intel for this kind of thing, or should I just try to reach them through
their web site?

> It could also be an error in the Linux stack which would "eat" an
> interrupt when migrating interrupts (which was fixed
> upstream, see below). Are you running irqbalance? Could you try
> turning it off?

irqbalance is enabled on these servers.  I'll try disabling it.
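
For reference, on these CentOS 6 dom0s that amounts to:

  service irqbalance stop      # stop the daemon now
  chkconfig irqbalance off     # keep it from starting at boot
  # With irqbalance off, IRQs can be pinned by hand if needed, e.g. to
  # pin IRQ 58 (a made-up number; see /proc/interrupts) to CPU0:
  echo 1 > /proc/irq/58/smp_affinity

The affinity value is a hex CPU bitmask, so 1 = CPU0, 2 = CPU1,
f = CPUs 0-3, and so on.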

> Did you have these issues with an earlier kernel?

The last kernel these boxes ran was 2.6.18-412.el5xen under CentOS 5,
and they were very stable; however, the differences between 2.6.18 and
3.18 are immense, especially around features like ASPM and other power
management code.  We've run into ASPM issues before when moving systems
from CentOS 5 to the CentOS 6 kernel (2.6.32), but not on this
particular hardware, which is why my first thought was to look at ASPM.

They've all been upgraded to CentOS 6 and are running the Virt SIG
kernel, kernel-3.18.44-20.el6.x86_64.  I haven't run any earlier 3.18
versions or tried any other kernels.

It surprises me that we would have all these issues without it being a
more widespread problem, considering the hardware is fairly mainstream
and covers a lot of NIC chips.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
___
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt


[CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

2017-01-23 Thread Kevin Stange
I have three different hardware types of CentOS 6, Xen 4.4-based
hypervisors that are experiencing stability issues I haven't been able
to track down.  All three types seem to be having issues with the NIC
and/or PCIe bus.  In most cases, the issues are unrecoverable and
require a hard boot to resolve.  All have Intel NICs.

Often the systems will remain stable for days or weeks, then suddenly
encounter one of these issues.  I have yet to tie the error to any
specific action on the systems and can't reproduce it reliably.

- Supermicro X8DT3, Dual Xeon E5620, 2x 82575EB NICs, 2x 82576 NICs

Kernel messages upon failure:

pcieport 0000:00:03.0: AER: Multiple Corrected error received: id=0018
pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, id=0018(Receiver ID)
pcieport 0000:00:03.0:   device [8086:340a] error status/mask=00002000/00001001
pcieport 0000:00:03.0:    [13] Advisory Non-Fatal
pcieport 0000:00:03.0:   Error of this Agent(0018) is reported first
igb 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0400(Receiver ID)
igb 0000:04:00.0:   device [8086:10a7] error status/mask=00002001/00002000
igb 0000:04:00.0:    [ 0] Receiver Error (First)
igb 0000:04:00.1: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0401(Receiver ID)
igb 0000:04:00.1:   device [8086:10a7] error status/mask=00002001/00002000
igb 0000:04:00.1:    [ 0] Receiver Error (First)

This spams the console continuously until the server is hard booted.

- Supermicro X9DRD-iF/LF, Dual Xeon E5-2630, 2x I350, 2x 82575EB

igb 0000:82:00.0: Detected Tx Unit Hang
  Tx Queue             <1>
  TDH                  <43>
  TDT                  <50>
  next_to_use          <50>
  next_to_clean        <43>
buffer_info[next_to_clean]
  time_stamp           <12e6bc0b6>
  next_to_watch
  jiffies              <12e6bc8dc>
  desc.status          <1c8210>

This spams the console continuously until the server is hard booted.

- Supermicro X9DRT, Dual Xeon E5-2650, 2x I350, 2x 82571EB

e1000e 0000:04:00.0 eth2: Detected Hardware Unit Hang:
  TDH
  TDT                  <33>
  next_to_use          <33>
  next_to_clean
buffer_info[next_to_clean]:
  time_stamp           <138230862>
  next_to_watch
  jiffies              <138231ac0>
  next_to_watch.status <0>
MAC Status             <80383>
PHY Status             <792d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>

On this type of system, the NIC automatically recovers and I don't need
to reboot.

So far I've tried booting with pcie_aspm=off to see if that would help,
but it appears the 3.18 kernel already turns off ASPM by default on
these machines based on what it probes from the BIOS.  The stability
issues were not resolved by the change.
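
To double-check that, rather than trusting the boot flag, I looked at
what the kernel and the devices actually report; something like:

  dmesg | grep -i aspm         # the kernel's boot-time ASPM decision
  lspci -vvv | grep -i aspm    # per-device LnkCap/LnkCtl ASPM state (run as root)

Both were consistent with ASPM already being disabled on these boards,
which is why I don't think the flag changed anything.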

On the latter system type I also turned off all offloading settings.
Stability appears to have improved slightly, but it didn't fully
resolve the problem.  I haven't adjusted offload settings on the first
two server types yet.
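
Concretely, on the e1000e box that was along these lines (eth2 is the
interface from the hang message above; the exact feature names vary by
driver and ethtool version):

  ethtool -k eth2              # show current offload settings (lowercase -k)
  ethtool -K eth2 rx off tx off sg off tso off gso off gro off

Those settings don't survive a reboot, so they need to be reapplied
from an init script or /sbin/ifup-local.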

I suspect this problem is related to the 3.18 kernel used by the Virt
SIG: these machines ran Xen on CentOS 5's kernel with no issues for
years, and systems of the same types used elsewhere in our facility are
stable under CentOS 6's standard kernel.  The problem affects more than
one server of each type, so I don't believe it's an isolated hardware
failure; if it's hardware at all, it would have to be a design flaw.

Has anyone experienced similar issues with this configuration, and if
so, does anyone have tips on how to resolve the issues?

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
ke...@steadfast.net | www.steadfast.net
___
CentOS-virt mailing list
CentOS-virt@centos.org
https://lists.centos.org/mailman/listinfo/centos-virt