Bug#1006346: cloud.debian.org: bullseye AMIs don't boot on Amazon EC2 Xen instances with Enhanced Networking

2022-03-19 Thread Noah Meyerhans
On Sat, Mar 19, 2022 at 10:41:39AM +0100, Salvatore Bonaccorso wrote:
> > From the upstream discussion on the linux-pci mailing list [*]:
> > 
> > > Yes. My understanding is that the issue is because AWS is using older
> > > versions of Xen. They are in the process of updating their fleet to a
> > > newer version of Xen so the change introduced with Stefan's commit
> > > isn't an issue any longer.
> > > 
> > > I think the changes are scheduled to be completed in the next 10-12
> > > weeks. For now we are carrying a revert in the Fedora Kernel.
> > > 
> > > You can follow this Fedora CoreOS issue if you'd like to know more
> > > about when the change lands in their backend. We work closely with one
> > > of their partner engineers and he keeps us updated.
> > > https://github.com/coreos/fedora-coreos-tracker/issues/1066
> > 
> > Ideally we can revert the upstream commit from the stable kernels, since
> > otherwise Debian users on AWS Xen instance types may be stuck using
> > older, unsafe kernels.  Especially if we have time to include the change
> > in the upcoming bullseye and buster point releases.  If the kernel
> > updates for those stable updates have already been built, though, it
> > might be too late to matter.  By the time we publish our next kernel
> > builds, the AWS Xen update may be complete.
> 
> Where can one track the update status for their Xen version directly,
> or is following the above the only reference?

It's just for reference; the deployment timeline isn't published.  As
far as I know, it's also subject to change in the event that unexpected
issues arise or it's preempted by some high severity issue.

> How frequent is this particular combination of hardware/software? We
> have the change already applied for a while in bullseye; buster would
> be newly impacted since the last update done for security fixes.

The impacted instance types aren't the most common, as they're not the
latest generation.  So I expect that the majority of the impact is felt
by people or organizations that haven't yet been able to make time to
switch to newer instance types.  The implication here, of course, is
that many of these deployments may be production environments where
stability is prioritized over migration to the new thing.

We get a little bit of data about what instance types are used with
Debian on AWS, but it's incomplete as it only reflects usage by AWS
customers who access Debian via the AWS Marketplace.  Consider it
something like popcon data; it's essentially opt-in.  If the data we get
from the Marketplace covering the past 3 days' worth of activity is
representative of the Debian usage in general, then it looks like
roughly 1% of Debian users on AWS are trying to use the impacted
instance types.

> Are there workarounds for the affected users of this combination? I
> see some options listed in 
> https://wiki.debian.org/Cloud/AmazonEC2Image/Bullseye 

People can use newer generation instance types, which are not impacted.
Depending on the use case, that could be a trivial change, but it could
also be disruptive.  Newer instance types aren't based on Xen at all and
expose a different hardware device model to the instance.  Debian
supports the newer instance types, but the end user workload may still
need additional nontrivial qualification.

> If we revert the commit it reverts a fix for a bug with Marvell NVME
> devices.
> 
> But we cannot just revert the commit for the cloud images.

Understood.

> If we know something about the release schedule from Amazon to update
> their Xen instances (which is the way to move forward, since upstream
> won't revert the commit) then we should leave the status as it is for
> bullseye (and now for buster). For bullseye there are the
> CVE-2022-0847 fixes they would need to pick up.

Yes, the problem will go away when the Xen fleet is updated.  It sounds
like we're looking at roughly a 3 month timeline, after which point the
patch won't be a problem.  However, until then, people who need to use
Xen instances will be stuck either running an unsafe kernel or building
their own.

noah



Bug#1006346: cloud.debian.org: bullseye AMIs don't boot on Amazon EC2 Xen instances with Enhanced Networking

2022-03-19 Thread Salvatore Bonaccorso
Hi Noah,

On Thu, Mar 17, 2022 at 09:54:30AM -0700, Noah Meyerhans wrote:
> From the upstream discussion on the linux-pci mailing list [*]:
> 
> > Yes. My understanding is that the issue is because AWS is using older
> > versions of Xen. They are in the process of updating their fleet to a
> > newer version of Xen so the change introduced with Stefan's commit
> > isn't an issue any longer.
> > 
> > I think the changes are scheduled to be completed in the next 10-12
> > weeks. For now we are carrying a revert in the Fedora Kernel.
> > 
> > You can follow this Fedora CoreOS issue if you'd like to know more
> > about when the change lands in their backend. We work closely with one
> > of their partner engineers and he keeps us updated.
> > https://github.com/coreos/fedora-coreos-tracker/issues/1066
> 
> Ideally we can revert the upstream commit from the stable kernels, since
> otherwise Debian users on AWS Xen instance types may be stuck using
> older, unsafe kernels.  Especially if we have time to include the change
> in the upcoming bullseye and buster point releases.  If the kernel
> updates for those stable updates have already been built, though, it
> might be too late to matter.  By the time we publish our next kernel
> builds, the AWS Xen update may be complete.

Where can one track the update status for their Xen version directly,
or is following the above the only reference?

How frequent is this particular combination of hardware/software? We
have the change already applied for a while in bullseye; buster would
be newly impacted since the last update done for security fixes.

Are there workarounds for the affected users of this combination? I
see some options listed in 
https://wiki.debian.org/Cloud/AmazonEC2Image/Bullseye 

If we revert the commit it reverts a fix for a bug with Marvell NVME
devices.

But we cannot just revert the commit for the cloud images.

If we know something about the release schedule from Amazon to update
their Xen instances (which is the way to move forward, since upstream
won't revert the commit) then we should leave the status as it is for
bullseye (and now for buster). For bullseye there are the
CVE-2022-0847 fixes they would need to pick up.

Regards,
Salvatore



Bug#1006346: cloud.debian.org: bullseye AMIs don't boot on Amazon EC2 Xen instances with Enhanced Networking

2022-03-17 Thread Noah Meyerhans
From the upstream discussion on the linux-pci mailing list [*]:

> Yes. My understanding is that the issue is because AWS is using older
> versions of Xen. They are in the process of updating their fleet to a
> newer version of Xen so the change introduced with Stefan's commit
> isn't an issue any longer.
> 
> I think the changes are scheduled to be completed in the next 10-12
> weeks. For now we are carrying a revert in the Fedora Kernel.
> 
> You can follow this Fedora CoreOS issue if you'd like to know more
> about when the change lands in their backend. We work closely with one
> of their partner engineers and he keeps us updated.
> https://github.com/coreos/fedora-coreos-tracker/issues/1066

Ideally we can revert the upstream commit from the stable kernels, since
otherwise Debian users on AWS Xen instance types may be stuck using
older, unsafe kernels.  Especially if we have time to include the change
in the upcoming bullseye and buster point releases.  If the kernel
updates for those stable updates have already been built, though, it
might be too late to matter.  By the time we publish our next kernel
builds, the AWS Xen update may be complete.

noah

[*] https://lore.kernel.org/linux-pci/c4a65b9a-d1e2-bf0d-2519-aac718593...@redhat.com/



Bug#1006346: cloud.debian.org: bullseye AMIs don't boot on Amazon EC2 Xen instances with Enhanced Networking

2022-02-25 Thread Noah Meyerhans
Control: reassign -1 src:linux
Control: tags -1 + upstream

> Amazon EC2 instance types with Enhanced Networking use the ixgbevf.ko
> driver.  The current AMIs successfully probe the ixgbevf driver and spawn
> dhclient as expected, but dhclient appears to never receive a lease.  Older
> AMIs do work on this class of instance.

Upstream commit 83dbf898a2d4 "PCI/MSI: Mask MSI-X vectors only on
success" seems to introduce a regression that breaks the "Enhanced
Networking" feature used on Amazon EC2 instances, which use PCI
passthrough access to Intel ethernet devices using the ixgbevf.ko
driver.  Systems using this hardware seem to probe their network
hardware as usual, and don't log any errors to the console, but are
never able to communicate over the NIC.

Device details:

00:03.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
Physical Slot: 3
Flags: bus master, fast devsel, latency 64
Memory at f300 (64-bit, prefetchable) [size=16K]
Memory at f3004000 (64-bit, prefetchable) [size=16K]
Capabilities: 
Kernel driver in use: ixgbevf
Kernel modules: ixgbevf

The issue is present in Debian kernels in sid and experimental.

The patch has been backported to stable branches including those used in
our stable releases:

The 5.10.x backport (released with v5.10.88) is e5949933f313.  Since bullseye is
currently using v5.10.92, it is impacted.

The 4.19.x backport (released with v4.19.222) is 12ae8cd1c7e9.  Since
buster is still on v4.19.208, it is not yet impacted, but likely would
be with the next kernel update.
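
The version reasoning above can be sketched as a small check.  This is
only an illustration, assuming plain upstream version strings such as
"v5.10.92" (not full Debian package versions); the names FIRST_AFFECTED
and is_affected are made up for this example, and the table only covers
the two stable branches mentioned in this message.

```python
def parse_version(v):
    """Turn 'v5.10.92' or '5.10.92' into the tuple (5, 10, 92)."""
    return tuple(int(part) for part in v.lstrip("v").split("."))

# First stable release carrying the backported patch, as stated above.
# Other branches are deliberately not listed.
FIRST_AFFECTED = {
    (5, 10): "v5.10.88",   # backport e5949933f313
    (4, 19): "v4.19.222",  # backport 12ae8cd1c7e9
}

def is_affected(kernel_version):
    """Return True/False for a known branch, None if the branch is not listed."""
    ver = parse_version(kernel_version)
    first = FIRST_AFFECTED.get(ver[:2])
    if first is None:
        return None
    # Tuple comparison orders versions numerically, field by field.
    return ver >= parse_version(first)

if __name__ == "__main__":
    for v in ("v5.10.92", "v4.19.208"):
        print(v, is_affected(v))
```

With the versions quoted above, bullseye's v5.10.92 is at or past
v5.10.88 and so is affected, while buster's v4.19.208 predates
v4.19.222 and is not.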

This issue has been reported elsewhere as well, for example Fedora
CoreOS at https://github.com/coreos/fedora-coreos-tracker/issues/1066

I have confirmed that reverting e5949933f313 from 5.10.x results in a
build that functions properly with this hardware on bullseye, but this
is probably not a reasonable thing to do generally.

noah



Bug#1006346: cloud.debian.org: bullseye AMIs don't boot on Amazon EC2 Xen instances with Enhanced Networking

2022-02-23 Thread Noah Meyerhans
Package: cloud.debian.org
Severity: important

(I suspect this is actually a kernel issue, but I'm starting with
cloud.debian.org as that's where I've observed the issue and I want to rule
out cloud configuration issues.)

Amazon EC2 instance types with Enhanced Networking use the ixgbevf.ko
driver.  The current AMIs successfully probe the ixgbevf driver and spawn
dhclient as expected, but dhclient appears to never receive a lease.  Older
AMIs do work on this class of instance.
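
A rough way to spot this symptom in a captured boot log is to count
DHCPDISCOVER attempts and check whether a lease was ever logged.  The
sketch below is an assumption-laden illustration, not a tested tool: the
function name dhcp_summary is made up, and it assumes dhclient reports a
successful lease with "DHCPACK" or "bound to" lines, which do not appear
in the failing log.

```python
import re

# Matches dhclient discover attempts such as those in the boot log.
DISCOVER = re.compile(r"dhclient\[\d+\]: DHCPDISCOVER on (\S+)")
# Assumed success markers; not present anywhere in the failing boot log.
LEASE = re.compile(r"dhclient\[\d+\]: (?:DHCPACK|bound to)\b")

def dhcp_summary(lines):
    """Return (number of DHCPDISCOVER attempts, whether a lease was seen)."""
    discovers = 0
    got_lease = False
    for line in lines:
        if DISCOVER.search(line):
            discovers += 1
        elif LEASE.search(line):
            got_lease = True
    return discovers, got_lease

if __name__ == "__main__":
    import sys
    attempts, leased = dhcp_summary(sys.stdin)
    print(f"DHCPDISCOVER attempts: {attempts}, lease obtained: {leased}")
```

On an affected instance one would expect several attempts and no lease.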

Working AMI:

{
    "Images": [
        {
            "Architecture": "x86_64",
            "CreationDate": "2021-12-21T01:39:18.000Z",
            "ImageId": "ami-055b24b622e97e043",
            "ImageLocation": "136693071363/debian-11-amd64-20211220-862",
            "ImageType": "machine",
            "Public": true,
            "OwnerId": "136693071363",
            "PlatformDetails": "Linux/UNIX",
            "UsageOperation": "RunInstances",
            "State": "available",
            "BlockDeviceMappings": [
                {
                    "DeviceName": "/dev/xvda",
                    "Ebs": {
                        "DeleteOnTermination": true,
                        "SnapshotId": "snap-0b1ec9931c8475322",
                        "VolumeSize": 8,
                        "VolumeType": "gp2",
                        "Encrypted": false
                    }
                }
            ],
            "Description": "Debian 11 (20211220-862)",
            "EnaSupport": true,
            "Hypervisor": "xen",
            "Name": "debian-11-amd64-20211220-862",
            "RootDeviceName": "/dev/xvda",
            "RootDeviceType": "ebs",
            "SriovNetSupport": "simple",
            "VirtualizationType": "hvm"
        }
    ]
}

Not working AMI:

{
    "Images": [
        {
            "Architecture": "x86_64",
            "CreationDate": "2022-01-21T15:02:04.000Z",
            "ImageId": "ami-0d0d8694ba492c02b",
            "ImageLocation": "136693071363/debian-11-amd64-20220121-894",
            "ImageType": "machine",
            "Public": true,
            "OwnerId": "136693071363",
            "PlatformDetails": "Linux/UNIX",
            "UsageOperation": "RunInstances",
            "State": "available",
            "BlockDeviceMappings": [
                {
                    "DeviceName": "/dev/xvda",
                    "Ebs": {
                        "DeleteOnTermination": true,
                        "SnapshotId": "snap-0f15c399cd68cf47c",
                        "VolumeSize": 8,
                        "VolumeType": "gp2",
                        "Encrypted": false
                    }
                }
            ],
            "Description": "Debian 11 (20220121-894)",
            "EnaSupport": true,
            "Hypervisor": "xen",
            "Name": "debian-11-amd64-20220121-894",
            "RootDeviceName": "/dev/xvda",
            "RootDeviceType": "ebs",
            "SriovNetSupport": "simple",
            "VirtualizationType": "hvm"
        }
    ]
}

Relevant logs from the instance boot:

Feb 23 22:04:09 debian kernel: [1.763348] ixgbevf :00:03.0 ens3: renamed from eth0
...
Feb 23 22:04:09 debian kernel: [5.283750] ixgbevf :00:03.0: NIC Link is Up 10 Gbps
Feb 23 22:04:09 debian kernel: [5.287656] IPv6: ADDRCONF(NETDEV_CHANGE): ens3: link becomes ready
...
Feb 23 22:04:09 debian systemd-udevd[239]: Using default interface naming scheme 'v247'.
Feb 23 22:04:09 debian systemd-udevd[234]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 23 22:04:09 debian systemd-udevd[239]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
...
Feb 23 22:04:09 debian cloud-ifupdown-helper: Generated configuration for ens3
...
Feb 23 22:04:09 debian systemd[1]: Found device 82599 Ethernet Controller Virtual Function.
...
Feb 23 22:04:09 debian dhclient[345]: Internet Systems Consortium DHCP Client 4.4.1
Feb 23 22:04:09 debian dhclient[345]: Copyright 2004-2018 Internet Systems Consortium.
Feb 23 22:04:09 debian dhclient[345]: All rights reserved.
Feb 23 22:04:09 debian dhclient[345]: For info, please visit https://www.isc.org/software/dhcp/
Feb 23 22:04:09 debian dhclient[345]: 
Feb 23 22:04:09 debian dhclient[345]: Listening on LPF/ens3/02:e0:5c:07:ed:e7
Feb 23 22:04:09 debian dhclient[345]: Sending on   LPF/ens3/02:e0:5c:07:ed:e7
Feb 23 22:04:09 debian dhclient[345]: Sending on   Socket/fallback
Feb 23 22:04:09 debian dhclient[345]: DHCPDISCOVER on ens3 to 255.255.255.255 port 67 interval 5
Feb 23 22:04:09 debian dhclient[345]: DHCPDISCOVER on ens3 to 255.255.255.255 port 67 interval 13
Feb 23 22:04:09 debian dhclient[345]: DHCPDISCOVER on ens3 to 255.255.255.255 port 67 interval 8
Feb 23 22:04:09 debian dhclient[345]: DHCPDISCOVER on ens3 to 255.255.255.255 port 67 interval 13
Feb 23 22:04:09 debian dhclient[345]: DHCPDISCOVER on ens3 to 255.255.255.255 port 67 interval 10
Feb 23 22:04:09 debia