Looking at both traces, this consistently seems to happen inside run_timer_softirq(), and from the offset I would guess we are in the inlined __run_timers(). Another noteworthy detail is the value of RAX: dead000000200200 is LIST_POISON2, which is used to mark an invalid pointer in a (struct hlist_node *)->pprev. So my guess is that something modified the list of pending timers (these exist per CPU) while softirq processing was working on it. The problem is saying what. I am not sure a dump will help, because in races like this the clues often disappear right after the problem is caused. I would suspect the xen-netfront area, given that, as far as I can tell, this has not happened on bare-metal servers and, from the description, it seems to affect high-traffic instances. Would it be possible to volunteer one affected instance and try mainline kernels (https://wiki.ubuntu.com/Kernel/MainlineBuilds) between 3.19 and 4.2 (4.0, 4.1) and/or after 4.2 (4.3, maybe 4.4)? Testing 4.0 and 4.1 would give a smaller delta to look at for what broke things, while testing 4.3 or 4.4 would show whether it was fixed upstream without being identified as a stable patch.
--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1534345

Title:
Ubuntu 15.10 Crashing Frequently on EC2 Instances w/ Enhanced Networking

Status in linux package in Ubuntu:
Triaged

Bug description:
Lots of details and history of the problem here: https://askubuntu.com/questions/710747/after-upgrading-to-15-10-from-15-04-ec2-webservers-have-become-very-unstable

10 of my webservers started crashing immediately after the 15.10 upgrade. As for what exactly defines a "crash": Instance Status Checks fail, and I can no longer SSH to the machine. Background daemons running on the system stop responding, and nothing is written to the logs.

After weeks of working with the AWS team, I fixed a netconsole issue via "echo 7 > /proc/sys/kernel/printk", got netconsole working properly, and finally have a trace:

[21410.260077] general protection fault: 0000 [#1] SMP
[21410.261976] Modules linked in: isofs xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp bridge stp llc iptable_filter ip_tables x_tables ppdev intel_rapl iosf_mbi xen_fbfront fb_sys_fops input_leds serio_raw i2c_piix4 parport_pc 8250_fintek parport mac_hid netconsole configfs autofs4 crct10dif_pclmul crc32_pclmul cirrus syscopyarea sysfillrect sysimgblt aesni_intel ttm aes_x86_64 drm_kms_helper lrw gf128mul glue_helper ablk_helper cryptd psmouse drm ixgbevf pata_acpi floppy
[21410.264054] CPU: 0 PID: 26957 Comm: apache2 Not tainted 4.2.0-23-generic #28-Ubuntu
[21410.264054] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/07/2015
[21410.264054] task: ffff8803f9809b80 ti: ffff8803f999c000 task.ti: ffff8803f999c000
[21410.264054] RIP: 0010:[<ffffffff810e5c36>] [<ffffffff810e5c36>] run_timer_softirq+0x116/0x2d0
[21410.264054] RSP: 0000:ffff8803ff203e98 EFLAGS: 00010086
[21410.264054] RAX: dead000000200200 RBX: ffff8803ff20e9c0 RCX: ffff8803ff203ec8
[21410.264054] RDX: ffff8803ff203ec8 RSI: 0000000000011fc0 RDI: ffff8803ff20e9c0
[21410.264054] RBP: ffff8803ff203f08 R08: 000000000000a77a R09: 0000000000000000
[21410.264054] R10: 0000000000000020 R11: 0000000000000004 R12: 000000000000007c
[21410.264054] R13: ffffffff8172aaf0 R14: 0000000000000000 R15: ffff8803af955be0
[21410.264054] FS: 00007fb0ce6e8780(0000) GS:ffff8803ff200000(0000) knlGS:0000000000000000
[21410.264054] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[21410.264054] CR2: 00007fb0ce51e130 CR3: 00000003fb233000 CR4: 00000000001406f0
[21410.264054] Stack:
[21410.264054] ffff8803ff203eb8 ffff8803ff20f5f8 ffff8803ff20f3f8 ffff8803ff20f1f8
[21410.264054] ffff8803ff20e9f8 ffff8803af955b58 dead000000200200 00000000f60fabc0
[21410.264054] 0000000000011fc0 0000000000000001 ffffffff81c0b0c8 0000000000000001
[21410.264054] Call Trace:
[21410.264054] <IRQ>
[21410.264054] [<ffffffff8107f846>] __do_softirq+0xf6/0x250
[21410.264054] [<ffffffff8107fb13>] irq_exit+0xa3/0xb0
[21410.264054] [<ffffffff814a4499>] xen_evtchn_do_upcall+0x39/0x50
[21410.264054] [<ffffffff817f1f6b>] xen_hvm_callback_vector+0x6b/0x70
[21410.264054] <EOI>
[21410.264054] Code: 81 e6 00 00 20 00 48 85 d2 48 89 45 b8 0f 85 30 01 00 00 4c 89 7b 08 0f 1f 44 00 00 49 8b 07 49 8b 57 08 48 85 c0 48 89 02 74 04 <48> 89 50 08 41 f6 47 2a 10 48 b8 00 02 20 00 00 00 ad de 49 c7
[21410.264054] RIP [<ffffffff810e5c36>] run_timer_softirq+0x116/0x2d0
[21410.264054] RSP <ffff8803ff203e98>

I don't have a vmcore at the moment, but I'm trying to get one from AWS and should have one in the next couple of days. This has been happening frequently and repeatedly since I first upgraded to 15.10 in early December.

ubuntu@xxx-web-xx:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 15.10
Release: 15.10
Codename: wily
ubuntu@xxx-web-xx:~$ uname -a
Linux xxx-web-xx 4.2.0-23-generic #28-Ubuntu SMP Sun Dec 27 17:47:31 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
ubuntu@xxx-web-xx:~$

ProblemType: Bug
DistroRelease: Ubuntu 15.10
Package: linux-image-4.2.0-23-generic 4.2.0-23.28
ProcVersionSignature: User Name 4.2.0-23.28-generic 4.2.6
Uname: Linux 4.2.0-23-generic x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jan 14 15:42 seq
 crw-rw---- 1 root audio 116, 33 Jan 14 15:42 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.19.1-0ubuntu5
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Thu Jan 14 21:31:14 2016
Ec2AMI: ami-d5e7adbf
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: us-east-1d
Ec2InstanceType: m4.xlarge
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb: Error: command ['lsusb'] failed with exit code 1: unable to initialize libusb: -99
MachineType: Xen HVM domU
PciMultimedia:
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB:
 0 cirrusdrmfb
 1 xen
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.2.0-23-generic root=UUID=9bd55602-81dd-4868-8cfc-b7d63f8f8d7e ro console=tty1 console=ttyS0 crashkernel=256M@0M
RelatedPackageVersions:
 linux-restricted-modules-4.2.0-23-generic N/A
 linux-backports-modules-4.2.0-23-generic N/A
 linux-firmware 1.149.3
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UdevLog: Error: [Errno 2] No such file or directory: '/var/log/udev'
UpgradeStatus: Upgraded to wily on 2015-12-15 (29 days ago)
dmi.bios.date: 12/07/2015
dmi.bios.vendor: Xen
dmi.bios.version: 4.2.amazon
dmi.chassis.type: 1
dmi.chassis.vendor: Xen
dmi.modalias: dmi:bvnXen:bvr4.2.amazon:bd12/07/2015:svnXen:pnHVMdomU:pvr4.2.amazon:cvnXen:ct1:cvr:
dmi.product.name: HVM domU
dmi.product.version: 4.2.amazon
dmi.sys.vendor: Xen