Bug#1006346: cloud.debian.org: bullseye AMIs don't boot on Amazon EC2 Xen instances with Enhanced Networking
On Sat, Mar 19, 2022 at 10:41:39AM +0100, Salvatore Bonaccorso wrote:
> > From the upstream discussion on the linux-pci mailing list [*]:
> >
> > > Yes. My understanding is that the issue is because AWS is using older
> > > versions of Xen. They are in the process of updating their fleet to a
> > > newer version of Xen so the change introduced with Stefan's commit
> > > isn't an issue any longer.
> > >
> > > I think the changes are scheduled to be completed in the next 10-12
> > > weeks. For now we are carrying a revert in the Fedora Kernel.
> > >
> > > You can follow this Fedora CoreOS issue if you'd like to know more
> > > about when the change lands in their backend. We work closely with one
> > > of their partner engineers and he keeps us updated.
> > > https://github.com/coreos/fedora-coreos-tracker/issues/1066
> >
> > Ideally we can revert the upstream commit from the stable kernels, since
> > otherwise Debian users on AWS Xen instance types may be stuck using
> > older, unsafe kernels. Especially if we have time to include the change
> > in the upcoming bullseye and buster point releases. If the kernel
> > updates for those stable updates have already been built, though, it
> > might be too late to matter. By the time we publish our next kernel
> > builds, the AWS Xen update may be complete.
>
> Where can one track the update status for their Xen version directly,
> or is following the above the only reference?

It's just for reference; the deployment timeline isn't published. As
far as I know, it's also subject to change in the event that unexpected
issues arise or it's preempted by some high severity issue.

> How frequent is this particular combination of hardware/software? We
> have had the change applied for a while in bullseye; buster would
> be newly impacted since the last update done for security fixes.

The impacted instance types aren't the most common, as they're not the
latest generation. So I expect that the majority of the impact is felt
by people or organizations that haven't yet been able to make time to
switch to newer instance types. The implication here, of course, is
that many of these deployments may be production environments where
stability is prioritized over migration to the new thing.

We get a little bit of data about what instance types are used with
Debian on AWS, but it's incomplete as it only reflects usage by AWS
customers who access Debian via the AWS Marketplace. Consider it
something like popcon data; it's essentially opt-in. If the data we get
from the Marketplace covering the past 3 days' worth of activity is
representative of Debian usage in general, then it looks like roughly
1% of Debian users on AWS are trying to use the impacted instance types.

> Are there workarounds for the affected users of this combination? I
> see some options listed in
> https://wiki.debian.org/Cloud/AmazonEC2Image/Bullseye

People can use newer generation instance types, which are not impacted.
Depending on the use case, that could be a trivial change, but it could
also be disruptive. Newer instance types aren't based on Xen at all and
expose a different hardware device model to the instance. Debian
supports the newer instance types, but the end user workload may still
need additional nontrivial qualification.

> If we revert the commit, it reverts a fix for a bug with Marvell NVME
> devices.
>
> But we cannot just revert the commit for the cloud images.

Understood.

> If we know something about the release schedule from Amazon to update
> their Xen instances (which is the way to move forward, since upstream
> won't revert the commit) then we should leave the status as it is for
> bullseye (and now for buster). For bullseye there are
> CVE-2022-0847 fixes they would need to pick up.

Yes, the problem will go away when the Xen fleet is updated. It sounds
like we're looking at roughly a 3 month timeline, after which point the
patch won't be a problem. However, until then, people who need to use
Xen instances will be stuck either running an unsafe kernel or building
their own.

noah
Bug#1006346: cloud.debian.org: bullseye AMIs don't boot on Amazon EC2 Xen instances with Enhanced Networking
Hi Noah,

On Thu, Mar 17, 2022 at 09:54:30AM -0700, Noah Meyerhans wrote:
> From the upstream discussion on the linux-pci mailing list [*]:
>
> > Yes. My understanding is that the issue is because AWS is using older
> > versions of Xen. They are in the process of updating their fleet to a
> > newer version of Xen so the change introduced with Stefan's commit
> > isn't an issue any longer.
> >
> > I think the changes are scheduled to be completed in the next 10-12
> > weeks. For now we are carrying a revert in the Fedora Kernel.
> >
> > You can follow this Fedora CoreOS issue if you'd like to know more
> > about when the change lands in their backend. We work closely with one
> > of their partner engineers and he keeps us updated.
> > https://github.com/coreos/fedora-coreos-tracker/issues/1066
>
> Ideally we can revert the upstream commit from the stable kernels, since
> otherwise Debian users on AWS Xen instance types may be stuck using
> older, unsafe kernels. Especially if we have time to include the change
> in the upcoming bullseye and buster point releases. If the kernel
> updates for those stable updates have already been built, though, it
> might be too late to matter. By the time we publish our next kernel
> builds, the AWS Xen update may be complete.

Where can one track the update status for their Xen version directly,
or is following the above the only reference?

How frequent is this particular combination of hardware/software? We
have had the change applied for a while in bullseye; buster would be
newly impacted since the last update done for security fixes.

Are there workarounds for the affected users of this combination? I
see some options listed in
https://wiki.debian.org/Cloud/AmazonEC2Image/Bullseye

If we revert the commit, it reverts a fix for a bug with Marvell NVME
devices.

But we cannot just revert the commit for the cloud images.

If we know something about the release schedule from Amazon to update
their Xen instances (which is the way to move forward, since upstream
won't revert the commit) then we should leave the status as it is for
bullseye (and now for buster). For bullseye there are CVE-2022-0847
fixes they would need to pick up.

Regards,
Salvatore
Bug#1006346: cloud.debian.org: bullseye AMIs don't boot on Amazon EC2 Xen instances with Enhanced Networking
From the upstream discussion on the linux-pci mailing list [*]:

> Yes. My understanding is that the issue is because AWS is using older
> versions of Xen. They are in the process of updating their fleet to a
> newer version of Xen so the change introduced with Stefan's commit
> isn't an issue any longer.
>
> I think the changes are scheduled to be completed in the next 10-12
> weeks. For now we are carrying a revert in the Fedora Kernel.
>
> You can follow this Fedora CoreOS issue if you'd like to know more
> about when the change lands in their backend. We work closely with one
> of their partner engineers and he keeps us updated.
> https://github.com/coreos/fedora-coreos-tracker/issues/1066

Ideally we can revert the upstream commit from the stable kernels, since
otherwise Debian users on AWS Xen instance types may be stuck using
older, unsafe kernels. Especially if we have time to include the change
in the upcoming bullseye and buster point releases. If the kernel
updates for those stable updates have already been built, though, it
might be too late to matter. By the time we publish our next kernel
builds, the AWS Xen update may be complete.

noah

* https://lore.kernel.org/linux-pci/c4a65b9a-d1e2-bf0d-2519-aac718593...@redhat.com/
Bug#1006346: cloud.debian.org: bullseye AMIs don't boot on Amazon EC2 Xen instances with Enhanced Networking
Control: reassign -1 src:linux
Control: tags -1 + upstream

> Amazon EC2 instance types with Enhanced Networking use the ixgbevf.ko
> driver. The current AMIs successfully probe the ixgbevf driver and spawn
> dhclient as expected, but dhclient appears to never receive a lease. Older
> AMIs do work on this class of instance.

Upstream commit 83dbf898a2d4 "PCI/MSI: Mask MSI-X vectors only on
success" seems to introduce a regression that breaks the "Enhanced
Networking" feature used on Amazon EC2 instances, which use PCI
passthrough access to Intel ethernet devices using the ixgbevf.ko
driver. Systems using this hardware seem to probe their network
hardware as usual, and don't log any errors to the console, but are
never able to communicate over the NIC.

Device details:

00:03.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
        Physical Slot: 3
        Flags: bus master, fast devsel, latency 64
        Memory at f300 (64-bit, prefetchable) [size=16K]
        Memory at f3004000 (64-bit, prefetchable) [size=16K]
        Capabilities:
        Kernel driver in use: ixgbevf
        Kernel modules: ixgbevf

The issue is present in Debian kernels in sid and experimental. The
patch has been backported to stable branches, including those used in
our stable releases:

On the 5.10.x branch, the backport (released with v5.10.88) is
e5949933f313. Since bullseye is currently using v5.10.92, it is
impacted.

On the 4.19.x branch, the backport (released with v4.19.222) is
12ae8cd1c7e9. Since buster is still on v4.19.208, it is not yet
impacted, but likely would be with the next kernel update.

This issue has been reported elsewhere as well, for example Fedora
CoreOS at https://github.com/coreos/fedora-coreos-tracker/issues/1066

I have confirmed that reverting e5949933f313 from 5.10.x results in a
build that functions properly with this hardware on bullseye, but this
is probably not a reasonable thing to do generally.

noah
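For anyone who wants to check a given kernel against the version boundaries stated in the message above, here is a minimal sketch. It only encodes the two first-affected stable releases named there (v5.10.88 and v4.19.222); the function names and the lookup table are illustrative, not part of any Debian tooling. It expects the upstream kernel version (for example from the package version), not the ABI name printed by `uname -r`, and it knows nothing about distribution-specific reverts.

```python
import re

# First stable releases carrying the backport, as stated in the report above.
# Other branches are deliberately left out: the report does not give their versions.
FIRST_AFFECTED = {
    (5, 10): (5, 10, 88),
    (4, 19): (4, 19, 222),
}


def parse_version(release):
    """Parse the leading 'X.Y.Z' of an upstream kernel version into a tuple of ints."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel version: {release!r}")
    return tuple(int(part) for part in match.groups())


def carries_backport(release):
    """Return True or False for the two known branches, None for any other branch."""
    version = parse_version(release)
    first = FIRST_AFFECTED.get(version[:2])
    if first is None:
        return None  # branch not covered by the report
    return version >= first


if __name__ == "__main__":
    for candidate in ("v5.10.92", "v4.19.208"):
        print(candidate, carries_backport(candidate))
```

A `None` result means only that the branch is not one of the two discussed here; it says nothing about whether that branch contains the change.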
Bug#1006346: cloud.debian.org: bullseye AMIs don't boot on Amazon EC2 Xen instances with Enhanced Networking
Package: cloud.debian.org
Severity: important

(I suspect this is actually a kernel issue, but I'm starting with
cloud.debian.org as that's where I've observed the issue and I want to
rule out cloud configuration issues.)

Amazon EC2 instance types with Enhanced Networking use the ixgbevf.ko
driver. The current AMIs successfully probe the ixgbevf driver and spawn
dhclient as expected, but dhclient appears to never receive a lease. Older
AMIs do work on this class of instance.

Working AMI:
{
  "Images": [
    {
      "Architecture": "x86_64",
      "CreationDate": "2021-12-21T01:39:18.000Z",
      "ImageId": "ami-055b24b622e97e043",
      "ImageLocation": "136693071363/debian-11-amd64-20211220-862",
      "ImageType": "machine",
      "Public": true,
      "OwnerId": "136693071363",
      "PlatformDetails": "Linux/UNIX",
      "UsageOperation": "RunInstances",
      "State": "available",
      "BlockDeviceMappings": [
        {
          "DeviceName": "/dev/xvda",
          "Ebs": {
            "DeleteOnTermination": true,
            "SnapshotId": "snap-0b1ec9931c8475322",
            "VolumeSize": 8,
            "VolumeType": "gp2",
            "Encrypted": false
          }
        }
      ],
      "Description": "Debian 11 (20211220-862)",
      "EnaSupport": true,
      "Hypervisor": "xen",
      "Name": "debian-11-amd64-20211220-862",
      "RootDeviceName": "/dev/xvda",
      "RootDeviceType": "ebs",
      "SriovNetSupport": "simple",
      "VirtualizationType": "hvm"
    }
  ]
}

Not working AMI:
{
  "Images": [
    {
      "Architecture": "x86_64",
      "CreationDate": "2022-01-21T15:02:04.000Z",
      "ImageId": "ami-0d0d8694ba492c02b",
      "ImageLocation": "136693071363/debian-11-amd64-20220121-894",
      "ImageType": "machine",
      "Public": true,
      "OwnerId": "136693071363",
      "PlatformDetails": "Linux/UNIX",
      "UsageOperation": "RunInstances",
      "State": "available",
      "BlockDeviceMappings": [
        {
          "DeviceName": "/dev/xvda",
          "Ebs": {
            "DeleteOnTermination": true,
            "SnapshotId": "snap-0f15c399cd68cf47c",
            "VolumeSize": 8,
            "VolumeType": "gp2",
            "Encrypted": false
          }
        }
      ],
      "Description": "Debian 11 (20220121-894)",
      "EnaSupport": true,
      "Hypervisor": "xen",
      "Name": "debian-11-amd64-20220121-894",
      "RootDeviceName": "/dev/xvda",
      "RootDeviceType": "ebs",
      "SriovNetSupport": "simple",
      "VirtualizationType": "hvm"
    }
  ]
}

Relevant logs from the instance boot:

Feb 23 22:04:09 debian kernel: [1.763348] ixgbevf :00:03.0 ens3: renamed from eth0
...
Feb 23 22:04:09 debian kernel: [5.283750] ixgbevf :00:03.0: NIC Link is Up 10 Gbps
Feb 23 22:04:09 debian kernel: [5.287656] IPv6: ADDRCONF(NETDEV_CHANGE): ens3: link becomes ready
...
Feb 23 22:04:09 debian systemd-udevd[239]: Using default interface naming scheme 'v247'.
Feb 23 22:04:09 debian systemd-udevd[234]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 23 22:04:09 debian systemd-udevd[239]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
...
Feb 23 22:04:09 debian cloud-ifupdown-helper: Generated configuration for ens3
...
Feb 23 22:04:09 debian systemd[1]: Found device 82599 Ethernet Controller Virtual Function.
...
Feb 23 22:04:09 debian dhclient[345]: Internet Systems Consortium DHCP Client 4.4.1
Feb 23 22:04:09 debian dhclient[345]: Copyright 2004-2018 Internet Systems Consortium.
Feb 23 22:04:09 debian dhclient[345]: All rights reserved.
Feb 23 22:04:09 debian dhclient[345]: For info, please visit https://www.isc.org/software/dhcp/
Feb 23 22:04:09 debian dhclient[345]:
Feb 23 22:04:09 debian dhclient[345]: Listening on LPF/ens3/02:e0:5c:07:ed:e7
Feb 23 22:04:09 debian dhclient[345]: Sending on LPF/ens3/02:e0:5c:07:ed:e7
Feb 23 22:04:09 debian dhclient[345]: Sending on Socket/fallback
Feb 23 22:04:09 debian dhclient[345]: DHCPDISCOVER on ens3 to 255.255.255.255 port 67 interval 5
Feb 23 22:04:09 debian dhclient[345]: DHCPDISCOVER on ens3 to 255.255.255.255 port 67 interval 13
Feb 23 22:04:09 debian dhclient[345]: DHCPDISCOVER on ens3 to 255.255.255.255 port 67 interval 8
Feb 23 22:04:09 debian dhclient[345]: DHCPDISCOVER on ens3 to 255.255.255.255 port 67 interval 13
Feb 23 22:04:09 debian dhclient[345]: DHCPDISCOVER on ens3 to 255.255.255.255 port 67 interval 10
Feb 23 22:04:09 debia