On Tue, 28 Sept 2021 at 19:10, Noah Meyerhans <[email protected]> wrote: > > On Tue, Sep 28, 2021 at 09:48:15AM -0700, Ross Vandegrift wrote: > > > I'm regularly reinstantiating many AWS instances that run Debian > > > (bullseye). > > > I see some strange behavior, in that instance come up very > > > consistently in about 6 seconds (according to > > > /var/log/cloud-init-output.log). > > > But every now and then it takes much longer, in the order of 40 seconds. > > [snip] > > > > > > I guess this is not a problem with the Debian image, and I can > > > probably configure our deployment to cope with this delay, but I'm > > > curious as to why this happens. > > > > When it happens, take a look at /var/log/cloud-init.log - that'll > > contain details of cloud-init's actions, and timestamps. Hopefully the > > 40s issue will stand out.
The issue I have is that the very first line of /var/log/cloud-init.log already mentions the long Up time: 2021-09-28 16:13:50,130 - util.py[DEBUG]: Cloud-init v. 20.4.1 running 'init-local' at Tue, 28 Sep 2021 16:13:50 +0000. Up 42.21 seconds. vs the first line of that file on another host that didn't have the delay: 2021-09-28 15:07:27,909 - util.py[DEBUG]: Cloud-init v. 20.4.1 running 'init-local' at Tue, 28 Sep 2021 15:07:27 +0000. Up 5.52 seconds. So whatever happened, seems to have happened before cloud-init even started. The first lines of some other files in /var/log start 4 seconds further: root@ip-172-16-209-4:/var/log# head -1 *.log ==> auth.log <== Sep 28 16:13:54 ip-172-16-209-4 useradd[641]: new group: name=admin, GID=1000 ==> cloud-init-output.log <== Cloud-init v. 20.4.1 running 'init-local' at Tue, 28 Sep 2021 16:13:50 +0000. Up 42.21 seconds. ==> cloud-init.log <== 2021-09-28 16:13:50,130 - util.py[DEBUG]: Cloud-init v. 20.4.1 running 'init-local' at Tue, 28 Sep 2021 16:13:50 +0000. Up 42.21 seconds. ==> daemon.log <== Sep 28 16:13:54 ip-172-16-209-4 systemd[1]: [email protected]: Succeeded. ==> kern.log <== Sep 28 16:13:54 ip-172-16-209-4 kernel: [ 0.000000] Linux version 5.10.0-8-cloud-amd64 ([email protected]) (gcc-10 (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP Debian 5.10.46-4 (2021-08-03) ==> user.log <== Sep 28 16:13:55 ip-172-16-209-4 ec2: > `cloud-init analyze blame` is also helpful for analyzing cloud-init's > logs. No smoking guns in there either (i think): root@ip-172-16-209-4:~# cloud-init analyze blame -- Boot Record 01 -- 00.54200s (init-network/config-ssh) 00.32700s (init-local/search-Ec2Local) 00.20400s (modules-config/config-grub-dpkg) 00.16000s (init-network/check-cache) 00.09300s (init-network/config-users-groups) 00.08300s (modules-final/config-ssh-authkey-fingerprints) 00.07200s (init-network/config-resizefs) 00.06900s (modules-final/config-keys-to-console) 00.04800s (modules-config/config-apt-configure) 00.04200s (init-network/config-growpart) 00.02400s (modules-config/config-locale) 00.01300s (init-network/consume-user-data) 00.01000s (init-network/config-update_etc_hosts) 00.00400s (modules-final/config-final-message) 00.00200s (init-network/config-update_hostname) 00.00200s (init-network/config-mounts) 00.00200s (init-network/config-ca-certs) 00.00100s (modules-final/config-scripts-vendor) 00.00100s (modules-final/config-rightscale_userdata) 00.00100s (modules-final/config-phone-home) 00.00100s (modules-final/config-package-update-upgrade-install) 00.00100s (modules-final/config-chef) 00.00100s (modules-config/config-timezone) 00.00100s (modules-config/config-ssh-import-id) 00.00100s (modules-config/config-set-passwords) 00.00100s (modules-config/config-runcmd) 00.00100s (modules-config/config-ntp) 00.00100s (modules-config/config-byobu) 00.00100s (modules-config/config-apt-pipelining) 00.00100s (init-network/setup-datasource) 00.00100s (init-network/config-write-files) 00.00100s (init-network/config-set_hostname) 00.00100s (init-network/config-seed_random) 00.00100s (init-network/config-rsyslog) 00.00100s (init-network/config-disk_setup) 00.00100s (init-network/activate-datasource) 00.00100s (init-local/check-cache) 00.00000s (modules-final/config-scripts-user) 00.00000s (modules-final/config-scripts-per-once) 00.00000s (modules-final/config-scripts-per-instance) 00.00000s (modules-final/config-scripts-per-boot) 00.00000s (modules-final/config-salt-minion) 00.00000s (modules-final/config-puppet) 00.00000s (modules-final/config-power-state-change) 00.00000s (modules-final/config-mcollective) 00.00000s (modules-final/config-fan) 00.00000s (modules-config/config-emit_upstart) 00.00000s (modules-config/config-disable-ec2-metadata) 00.00000s (init-network/consume-vendor-data) 00.00000s (init-network/config-migrator) 00.00000s (init-network/config-bootcmd) 1 boot records analyzed > How often does this happen? What instance types are you using when you > experience this issue? Very roughly one out of ten VMs do this. I think I've found some clue, in the instance's system log: [ 2.845436] ena 0000:00:05.0: LLQ is not supported Fallback to host mode policy. [ 2.856881] ena 0000:00:05.0: Elastic Network Adapter (ENA) found at mem febf4000, mac addr 02:3b:60:14:f2:ea [ 2.866980] ena 0000:00:05.0 ens5: renamed from eth0 Begin: Loading essential drivers ... done. Begin: Running /scripts/init-premount ... done. Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done. Begin: Running /scripts/local-premount ... done. Begin: Will now check root file system ... fsck from util-linux 2.36.1 [/sbin/fsck.ext4 (1) -- /dev/nvme0n1p1] fsck.ext4 -a -C0 /dev/nvme0n1p1 /dev/nvme0n1p1: clean, 35442/516096 files, 242371/2064379 blocks done. [ 9.042449] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null) done. Begin: Running /scripts/local-bottom ... [ 28.809447] nvme0n1: p1 p14 p15 GROWROOT: CHANGED: partition=1 start=262144 old:[ 28.819359] nvme0n1: p1 p14 p15 size=16515039 end=16777183 new: size=41680863 end=41943007 [ 40.228970] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null) done. Begin: Running /scripts/init-bottom ... done. [ 40.258874] Not activating Mandatory Access Control as /sbin/tomoyo-init does not exist. [ 40.359770] systemd[1]: Inserted module 'autofs4' And this is from a non-problematic instance: [ 2.490628] ena 0000:00:05.0: LLQ is not supported Fallback to host mode policy. [ 2.502639] ena 0000:00:05.0: Elastic Network Adapter (ENA) found at mem febf4000, mac addr 02:43:23:8c:27:0e [ 2.513028] ena 0000:00:05.0 ens5: renamed from eth0 Begin: Loading essential drivers ... done. Begin: Running /scripts/init-premount ... done. Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done. Begin: Running /scripts/local-premount ... done. Begin: Will now check root file system ... fsck from util-linux 2.36.1 [/sbin/fsck.ext4 (1) -- /dev/nvme0n1p1] fsck.ext4 -a -C0 /dev/nvme0n1p1 /dev/nvme0n1p1: clean, 35442/516096 files, 242371/2064379 blocks done. [ 2.648924] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null) done. Begin: Running /scripts/local-bottom ... [ 2.685130] input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/inp ut0 [ 3.137992] nvme0n1: p1 p14 p15 GROWROOT: CHANGED: partition=1 start=262144 old: size=16515039 end=16777183 new: size=41680863 end=41943007 [ 3.184491] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null) done. Begin: Running /scripts/init-bottom ... done. [ 3.210828] Not activating Mandatory Access Control as /sbin/tomoyo-init does not exist. [ 3.368117] systemd[1]: Inserted module 'autofs4' These instances with the delay are t3a.medium or t3.medium. Their filesystems are all configured to be 20G. Dick -- Dick Visser Trust & Identity Service Operations Manager GÉANT
