Bug#883413: linux-image-4.14.0-1-amd64: WARN_ON_ONCE in page_counter_cancel() in mm/page_counter.c

2018-01-04 Thread Chris Boot
On 30/12/17 23:24, Chris Boot wrote:
> What makes me suspicious that these are related is that neither happens
> with a 4.13 kernel, but I get both of these cgroup-related problems with
> 4.14.
> 
> I wouldn't mind trying to bisect this, but I haven't done that for many
> years. Is there a nice way to do this with the Debian packaging or am I
> better off seeing if I can reproduce with vanilla upstream kernels and
> bisecting that? Or shall I give 4.15~rc5 from experimental a whirl instead?

I tried with linux-image-4.15.0-rc5-amd64_4.15~rc5-1~exp1 and my cgroup
issues no longer happen, so I think this is likely fixed in 4.15.

Unfortunately I'm now running into a KVM instability that feels like
#885166, so I'm going to go back to 4.13 shortly.

Cheers,
Chris

-- 
Chris Boot
bo...@debian.org

GPG: 8467 53CB 1921 3142 C56D  C918 F5C8 3C05 D9CE 




2017-12-30 Thread Chris Boot
On 25/12/17 23:09, Ben Hutchings wrote:
> On Sat, 2017-12-23 at 12:42 +, Chris Boot wrote:
>> Severity: serious
>> Justification: kernel panic
>>
>> I experimented a little and disabled cgroupv2 on that server. Because I 
>> had some issues during boot I attempted to enable 
>> NetworkManager-wait-online.service using systemd, but that instantly 
>> resulted in the following kernel panic:
> [...]
>> I don't know that this is the same bug at all, but I'm keeping it on
>> this report for now as it seems at least related somehow.
> 
> The log messages don't look even slightly related, so please move this 
> to a separate bug report.

I'm still not so certain - both sets of stack dumps fall somewhere
within cgroup space, and disabling systemd's cgroup accounting (not
enabled by default) avoids these conditions.
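
One way to compare the two stack dumps is to pull the symbol names out of the call-trace lines and look at them side by side. The following is a minimal sketch; the helper name `trace_symbols` and the regular expression are illustrative assumptions, and the sample input is only the first four frames of each trace from this report, so an empty intersection here says nothing conclusive about whether the bugs are related.

```python
import re

# Matches lines such as "[   69.707166]  css_task_iter_next+0x74/0x80".
# A leading "? " marks a frame the unwinder considers unreliable.
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?P<guess>\?\s+)?"
    r"(?P<sym>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$"
)


def trace_symbols(log_text, include_unreliable=False):
    """Return the function names found in call-trace lines of a kernel log."""
    symbols = []
    for line in log_text.splitlines():
        match = FRAME_RE.match(line)
        if match is None:
            continue  # not a stack frame (header, register dump, ...)
        if match.group("guess") and not include_unreliable:
            continue  # skip "? symbol" frames unless asked for
        symbols.append(match.group("sym"))
    return symbols


# First four frames of each trace, copied from the logs in this report.
PANIC_TRACE = """\
[   69.707166]  css_task_iter_next+0x74/0x80
[   69.711195]  cgroup_procs_next+0x16/0x20
[   69.715130]  cgroup_seqfile_next+0x1a/0x20
[   69.719239]  kernfs_seq_next+0x27/0x60
"""
WARNING_TRACE = """\
[ 2457.473562]  page_counter_uncharge+0x22/0x40
[ 2457.473563]  drain_stock.isra.37+0x38/0xa0
[ 2457.473564]  refill_stock+0x47/0x80
[ 2457.473565]  mem_cgroup_uncharge_skmem+0x27/0x40
"""

panic_syms = trace_symbols(PANIC_TRACE)
warning_syms = trace_symbols(WARNING_TRACE)
print("panic:  ", panic_syms)
print("warning:", warning_syms)
print("shared: ", sorted(set(panic_syms) & set(warning_syms)))
```

Feeding the complete logs to the same function would give a fuller picture than these excerpts.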

I like to run this system with the following all enabled in
/etc/systemd/system.conf:

DefaultCPUAccounting=yes
DefaultIOAccounting=yes
DefaultBlockIOAccounting=yes
DefaultMemoryAccounting=yes

These are useful for tools such as systemd-cgtop.

With cgroupv2, I can avoid the error by disabling
DefaultMemoryAccounting. I was running for nearly 48 hours with this
configuration before rebooting to try without cgroupv2.

Without cgroupv2, it's DefaultCPUAccounting I need to disable to avoid
the panics when I run 'systemctl daemon-reload'. I have yet to run into
the warning or OOM killer with memory accounting enabled but I'll let
you know if it happens.
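
When testing combinations like these, it can help to confirm which accounting defaults are actually switched on in the file. The following is a rough sketch using Python's configparser; the function name `enabled_accounting` is made up for illustration, the `[Manager]` section header is the section systemd reads these keys from (it is not shown in the message above), and drop-in files under system.conf.d are not considered.

```python
import configparser

# The four settings named in this message.
ACCOUNTING_KEYS = (
    "DefaultCPUAccounting",
    "DefaultIOAccounting",
    "DefaultBlockIOAccounting",
    "DefaultMemoryAccounting",
)
TRUTHY = {"1", "yes", "true", "on"}


def enabled_accounting(conf_text):
    """Return the accounting keys set to a true value in [Manager]."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep key names case-sensitive
    parser.read_string(conf_text)
    if not parser.has_section("Manager"):
        return []
    manager = parser["Manager"]
    return [
        key
        for key in ACCOUNTING_KEYS
        if manager.get(key, fallback="").strip().lower() in TRUTHY
    ]


# Hypothetical file contents: memory accounting is commented out,
# matching the cgroupv2 workaround described above.
SAMPLE = """\
[Manager]
DefaultCPUAccounting=yes
DefaultIOAccounting=yes
DefaultBlockIOAccounting=yes
#DefaultMemoryAccounting=yes
"""

print(enabled_accounting(SAMPLE))
```

On a real system the text would come from reading /etc/systemd/system.conf rather than from the SAMPLE string.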

What makes me suspicious that these are related is that neither happens
with a 4.13 kernel, but I get both of these cgroup-related problems with
4.14.

I wouldn't mind trying to bisect this, but I haven't done that for many
years. Is there a nice way to do this with the Debian packaging or am I
better off seeing if I can reproduce with vanilla upstream kernels and
bisecting that? Or shall I give 4.15~rc5 from experimental a whirl instead?
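
For anyone who, like me, has not bisected in a while: the idea is a binary search over the commit history between a known-good and a known-bad kernel, which is what `git bisect` automates. The sketch below only simulates that search. The commit labels, the position of the regression and the `first_bad` helper are all hypothetical, and it assumes a single good-to-bad transition.

```python
def first_bad(commits, is_bad):
    """Return (first bad commit, number of commits tested).

    Assumes commits[0] is good, commits[-1] is bad, and every commit
    after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo is known good, hi is known bad
    tested = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tested += 1  # each test stands for one kernel build and boot
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi], tested


# Hypothetical history of 1001 commits, with the regression at "c637".
commits = [f"c{i}" for i in range(1001)]
culprit, builds = first_bad(commits, lambda c: int(c[1:]) >= 637)
print(culprit, builds)
```

The number of builds grows with the logarithm of the range, which is why bisecting between two releases stays practical even though each step means building and booting a kernel.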

Thanks,
Chris

-- 
Chris Boot
bo...@debian.org




2017-12-25 Thread Ben Hutchings
On Sat, 2017-12-23 at 12:42 +, Chris Boot wrote:
> Severity: serious
> Justification: kernel panic
> 
> I experimented a little and disabled cgroupv2 on that server. Because I 
> had some issues during boot I attempted to enable 
> NetworkManager-wait-online.service using systemd, but that instantly 
> resulted in the following kernel panic:
[...]
> I don't know that this is the same bug at all, but I'm keeping it on
> this report for now as it seems at least related somehow.

The log messages don't look even slightly related, so please move this 
to a separate bug report.

Ben.

-- 
Ben Hutchings
The world is coming to an end.  Please log off.






2017-12-23 Thread Chris Boot
Severity: serious
Justification: kernel panic

I experimented a little and disabled cgroupv2 on that server. Because I 
had some issues during boot I attempted to enable 
NetworkManager-wait-online.service using systemd, but that instantly 
resulted in the following kernel panic:

[   69.485816] ------------[ cut here ]------------
[   69.490485] WARNING: CPU: 1 PID: 1 at 
/build/linux-NHzxYj/linux-4.14.7/kernel/fork.c:419 __put_task_struct+0xf0/0x150
[   69.501108] Modules linked in: binfmt_misc vhost_net vhost tap tun 
xt_multiport devlink iptable_filter bridge 8021q garp mrp stp llc fuse i915 
nls_ascii nls_cp437 vfat fat intel_rapl x86_pkg_temp_thermal intel_powerclamp 
coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ast ttm hci_uart 
btqca ghash_clmulni_intel drm_kms_helper btintel intel_cstate intel_uncore 
bluetooth efi_pstore intel_rapl_perf sg mei_me pcspkr drbg efivars iTCO_wdt 
joydev ansi_cprng evdev cdc_acm iTCO_vendor_support drm mei shpchp 
intel_pch_thermal ie31200_edac battery ecdh_generic intel_lpss_acpi rfkill 
intel_lpss mfd_core video button acpi_als acpi_power_meter acpi_pad kfifo_buf 
industrialio ipmi_si ipmi_devintf ipmi_msghandler nfsd auth_rpcgss nfs_acl 
lockd grace sunrpc efivarfs ip_tables x_tables autofs4 ext4
[   69.571558]  crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod ses 
enclosure sd_mod scsi_transport_sas hid_generic usbhid xhci_pci crc32c_intel 
xhci_hcd igb ixgbe i2c_algo_bit aesni_intel ahci dca aes_x86_64 libahci usbcore 
ptp crypto_simd libata cryptd megaraid_sas glue_helper usb_common i2c_i801 
pps_core mdio scsi_mod fan thermal i2c_hid hid
[   69.602324] CPU: 1 PID: 1 Comm: systemd Not tainted 4.14.0-2-amd64 #1 Debian 
4.14.7-1
[   69.610168] Hardware name: Supermicro Super Server/X11SSH-F, BIOS 2.0b 
07/27/2017
[   69.617665] task: a0552173a040 task.stack: b46243168000
[   69.623601] RIP: 0010:__put_task_struct+0xf0/0x150
[   69.628403] RSP: 0018:b4624316bda0 EFLAGS: 00010246
[   69.633644] RAX:  RBX: a054db731410 RCX: 0001
[   69.640795] RDX: b4624316be40 RSI: a054db731410 RDI: a054db731410
[   69.647944] RBP: b4624316bdb0 R08: 1000 R09: 000c
[   69.655095] R10: 0020 R11: a054b84a500b R12: b4624316bf18
[   69.662244] R13: a0551e7e7a00 R14: a054db731410 R15: a0551d8bcc00
[   69.669395] FS:  7f18193e4980() GS:a0554504() 
knlGS:
[   69.677498] CS:  0010 DS:  ES:  CR0: 80050033
[   69.683260] CR2: 5582ad204068 CR3: 0008603ae003 CR4: 003626e0
[   69.690410] DR0:  DR1:  DR2: 
[   69.697561] DR3:  DR6: fffe0ff0 DR7: 0400
[   69.704709] Call Trace:
[   69.707166]  css_task_iter_next+0x74/0x80
[   69.711195]  cgroup_procs_next+0x16/0x20
[   69.715130]  cgroup_seqfile_next+0x1a/0x20
[   69.719239]  kernfs_seq_next+0x27/0x60
[   69.722999]  seq_read+0x2ce/0x3f0
[   69.726327]  kernfs_fop_read+0x134/0x180
[   69.730263]  ? security_file_permission+0x9b/0xc0
[   69.734975]  __vfs_read+0x18/0x40
[   69.738295]  vfs_read+0x8e/0x130
[   69.741527]  SyS_read+0x55/0xc0
[   69.744675]  system_call_fast_compare_end+0xc/0x97
[   69.749483] RIP: 0033:0x7f1818d0076d
[   69.753061] RSP: 002b:7ffc19d1a880 EFLAGS: 0293 ORIG_RAX: 

[   69.760644] RAX: ffda RBX: 5582ad1d9440 RCX: 7f1818d0076d
[   69.767794] RDX: 1000 RSI: 5582ad1f60e0 RDI: 001d
[   69.774945] RBP: 7f1818fbc440 R08: 7f1818fc0188 R09: 1010
[   69.782094] R10: 0020 R11: 0293 R12: 
[   69.789669] R13:  R14: 001d R15: 5582ad0a15c0
[   69.797296] Code: 49 8b 94 24 d8 03 00 00 48 85 d2 74 06 f0 ff 4a 5c 74 2c 
48 8b 3d 29 42 e5 00 4c 89 e6 e8 c9 a7 19 00 eb a2 0f ff e9 4a ff ff ff <0f> ff 
8b 43 48 85 c0 0f 84 2b ff ff ff 0f ff e9 24 ff ff ff 48 
[   69.817117] ---[ end trace 29e4513e3e583259 ]---
[   69.822245] BUG: unable to handle kernel NULL pointer dereference at 
00b0
[   69.830959] IP: pids_free+0x15/0x40
[   69.834908] PGD 0 P4D 0 
[   69.837873] Oops:  [#1] SMP
[   69.841488] Modules linked in: binfmt_misc vhost_net vhost tap tun 
xt_multiport devlink iptable_filter bridge 8021q garp mrp stp llc fuse i915 
nls_ascii nls_cp437 vfat fat intel_rapl x86_pkg_temp_thermal intel_powerclamp 
coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ast ttm hci_uart 
btqca ghash_clmulni_intel drm_kms_helper btintel intel_cstate intel_uncore 
bluetooth efi_pstore intel_rapl_perf sg mei_me pcspkr drbg efivars iTCO_wdt 
joydev ansi_cprng evdev cdc_acm iTCO_vendor_support drm mei shpchp 
intel_pch_thermal ie31200_edac battery ecdh_generic intel_lpss_acpi rfkill 
intel_lpss mfd_core video button acpi_als acpi_power_meter acpi_pad kfifo_buf 
industrialio ipmi_si 


2017-12-23 Thread Chris Boot
Package: src:linux
Version: 4.14.7-1
Followup-For: Bug #883413

Dear kernel maintainers,

This problem is still occurring with the latest 4.14 upload. Once this
warning has happened, prolonged operation leads to spurious OOM kills of
system processes which makes the system unusable.

Best regards,
Chris

-- Package-specific info:
** Version:
Linux version 4.14.0-2-amd64 (debian-kernel@lists.debian.org) (gcc version 
7.2.0 (Debian 7.2.0-18)) #1 SMP Debian 4.14.7-1 (2017-12-22)

** Command line:
BOOT_IMAGE=/boot/vmlinuz-4.14.0-2-amd64 root=/dev/mapper/vg_tarquin-rootfs ro 
intel_iommu=on vsyscall=emulate scsi_mod.use_blk_mq=Y dm_mod.use_blk_mq=Y 
intel_pstate=passive i915.disable_display=true apparmor=0 
systemd.unified_cgroup_hierarchy=1 quiet

** Tainted: W (512)
 * Taint on warning.
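
The value 512 in the taint line is a bitmask: bit 9 (1 << 9) is the "W" flag, set once the kernel has issued a warning. The following is a small sketch for decoding such a value. The table lists the flag letters as I understand them for kernels of this era, with short descriptions in my own words, and the function name `decode_taint` is made up for illustration.

```python
# Bit position -> (letter, short description). Descriptions are
# paraphrased; later kernels define further bits not listed here.
TAINT_FLAGS = {
    0: ("P", "proprietary module loaded"),
    1: ("F", "module force-loaded"),
    2: ("S", "hardware out of specification"),
    3: ("R", "module force-unloaded"),
    4: ("M", "machine check exception"),
    5: ("B", "bad page reference"),
    6: ("U", "taint requested by userspace"),
    7: ("D", "kernel oops or BUG"),
    8: ("A", "ACPI table overridden"),
    9: ("W", "kernel warning"),
    10: ("C", "staging driver loaded"),
    11: ("I", "firmware bug workaround"),
    12: ("O", "out-of-tree module loaded"),
    13: ("E", "unsigned module loaded"),
    14: ("L", "soft lockup"),
    15: ("K", "kernel live-patched"),
}


def decode_taint(value):
    """Return the flag letters set in a kernel taint bitmask."""
    letters = []
    bit = 0
    while value >> bit:
        if value & (1 << bit):
            # Unknown bits are reported by number instead of a letter.
            letters.append(TAINT_FLAGS.get(bit, (f"bit{bit}", ""))[0])
        bit += 1
    return letters


print(decode_taint(512))
```

The current value on a running system can be read from /proc/sys/kernel/tainted.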

** Kernel log:
[ 2457.473503] ------------[ cut here ]------------
[ 2457.473507] WARNING: CPU: 6 PID: 19171 at 
/build/linux-NHzxYj/linux-4.14.7/mm/page_counter.c:27 
page_counter_cancel+0x1b/0x20
[ 2457.473508] Modules linked in: binfmt_misc vhost_net vhost tap tun 
xt_multiport iptable_filter devlink bridge 8021q garp mrp stp llc fuse 
nls_ascii nls_cp437 vfat fat intel_rapl x86_pkg_temp_thermal intel_powerclamp 
coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul hci_uart 
ghash_clmulni_intel ast btqca intel_cstate btintel intel_uncore ttm efi_pstore 
evdev joydev bluetooth drm_kms_helper intel_rapl_perf cdc_acm sg pcspkr efivars 
iTCO_wdt iTCO_vendor_support drm mei_me intel_pch_thermal shpchp mei 
ie31200_edac drbg ansi_cprng ecdh_generic rfkill battery intel_lpss_acpi 
intel_lpss mfd_core video acpi_als kfifo_buf acpi_power_meter acpi_pad 
industrialio button nfsd nfs_acl lockd ipmi_si auth_rpcgss grace ipmi_devintf 
ipmi_msghandler sunrpc efivarfs ip_tables x_tables autofs4 ext4 crc16
[ 2457.473536]  mbcache jbd2 crc32c_generic fscrypto ecb dm_mod ses enclosure 
sd_mod scsi_transport_sas hid_generic usbhid crc32c_intel aesni_intel 
aes_x86_64 crypto_simd cryptd glue_helper ahci libahci i2c_i801 xhci_pci 
xhci_hcd igb ixgbe i2c_algo_bit libata dca usbcore megaraid_sas ptp usb_common 
pps_core mdio scsi_mod fan thermal i2c_hid hid
[ 2457.473550] CPU: 6 PID: 19171 Comm: check_ups Not tainted 4.14.0-2-amd64 #1 
Debian 4.14.7-1
[ 2457.473551] Hardware name: Supermicro Super Server/X11SSH-F, BIOS 2.0b 
07/27/2017
[ 2457.473551] task: 96625ad34000 task.stack: a50a43ad8000
[ 2457.473553] RIP: 0010:page_counter_cancel+0x1b/0x20
[ 2457.473553] RSP: 0018:a50a43adbb90 EFLAGS: 00010097
[ 2457.473554] RAX:  RBX: 9662ed1948c0 RCX: 
[ 2457.473554] RDX: 2ea6fa404338 RSI: 0001 RDI: 9662ed1948c0
[ 2457.473555] RBP: a50a43adbb90 R08: 966345012200 R09: 
[ 2457.473555] R10: a50a43adbcb0 R11: 0100 R12: 0001
[ 2457.473556] R13: 9662ed194800 R14: 96631d201000 R15: 9662570fe600
[ 2457.473556] FS:  7fcb1ac90480() GS:96634518() 
knlGS:
[ 2457.473557] CS:  0010 DS:  ES:  CR0: 80050033
[ 2457.473557] CR2: 7ffeba7085e8 CR3: 0007b6aba005 CR4: 003626e0
[ 2457.473558] DR0:  DR1:  DR2: 
[ 2457.473558] DR3:  DR6: fffe0ff0 DR7: 0400
[ 2457.473559] Call Trace:
[ 2457.473562]  page_counter_uncharge+0x22/0x40
[ 2457.473563]  drain_stock.isra.37+0x38/0xa0
[ 2457.473564]  refill_stock+0x47/0x80
[ 2457.473565]  mem_cgroup_uncharge_skmem+0x27/0x40
[ 2457.473567]  __sk_mem_reduce_allocated+0x7a/0xe0
[ 2457.473568]  __sk_mem_reclaim+0x1e/0x20
[ 2457.473570]  tcp_v4_destroy_sock+0x213/0x230
[ 2457.473572]  tcp_v6_destroy_sock+0x12/0x20
[ 2457.473573]  inet_csk_destroy_sock+0x4b/0x110
[ 2457.473574]  tcp_done+0x8d/0x90
[ 2457.473575]  tcp_rcv_state_process+0x9d3/0xe80
[ 2457.473577]  ? sk_reset_timer+0x18/0x30
[ 2457.473577]  ? tcp_schedule_loss_probe+0x12f/0x170
[ 2457.473579]  tcp_v6_do_rcv+0x1c4/0x410
[ 2457.473580]  ? tcp_v6_do_rcv+0x1c4/0x410
[ 2457.473581]  __release_sock+0x83/0xd0
[ 2457.473582]  release_sock+0x30/0xa0
[ 2457.473583]  tcp_close+0x16d/0x3f0
[ 2457.473585]  inet_release+0x3c/0x60
[ 2457.473586]  inet6_release+0x30/0x40
[ 2457.473587]  sock_release+0x1f/0x80
[ 2457.473588]  sock_close+0x12/0x20
[ 2457.473589]  __fput+0xe7/0x220
[ 2457.473590]  fput+0xe/0x10
[ 2457.473592]  task_work_run+0x97/0xc0
[ 2457.473593]  exit_to_usermode_loop+0xc0/0xd0
[ 2457.473594]  syscall_return_slowpath+0x8d/0x90
[ 2457.473596]  system_call_fast_compare_end+0x95/0x97
[ 2457.473597] RIP: 0033:0x7fcb1a446390
[ 2457.473597] RSP: 002b:7ffeba709f28 EFLAGS: 0246 ORIG_RAX: 
0003
[ 2457.473598] RAX:  RBX:  RCX: 7fcb1a446390
[ 2457.473598] RDX: 1fff RSI: 7ffeba709f70 RDI: 
[ 2457.473599] RBP: 0006 R08:  R09: 
[ 2457.473599] R10:  R11: 


2017-12-03 Thread Chris Boot
Package: src:linux
Version: 4.14.2-1
Severity: important
Tags: upstream

Hi kernel maintainers,

I've just switched to the 4.14 kernel and pretty quickly hit a strange
(to me) sequence of WARN_ON_ONCE() followed by NUT's upsd getting killed
by the OOM killer despite it not being in a restricted cgroup.

Probably the most relevant tweak for this issue is my use of cgroupv2
(systemd.unified_cgroup_hierarchy=1).

I have not yet tried to reboot my system since running into this (aside
from upsd getting repeatedly killed, it seems to work).

Please let me know if there is anything I can do to help debug this.

Thanks,
Chris

-- Package-specific info:
** Version:
Linux version 4.14.0-1-amd64 (debian-kernel@lists.debian.org) (gcc version 
7.2.0 (Debian 7.2.0-16)) #1 SMP Debian 4.14.2-1 (2017-11-30)

** Command line:
BOOT_IMAGE=/boot/vmlinuz-4.14.0-1-amd64 root=/dev/mapper/vg_tarquin-rootfs ro 
intel_iommu=on vsyscall=emulate scsi_mod.use_blk_mq=Y dm_mod.use_blk_mq=Y 
intel_pstate=passive systemd.unified_cgroup_hierarchy=1 quiet

** Tainted: W (512)
 * Taint on warning.

** Kernel log:
[ 2420.733243] ------------[ cut here ]------------
[ 2420.733247] WARNING: CPU: 5 PID: 20290 at 
/build/linux-ZSFHrj/linux-4.14.2/mm/page_counter.c:27 
page_counter_cancel+0x1b/0x20
[ 2420.733248] Modules linked in: dm_crypt loop algif_skcipher af_alg 
kyber_iosched bfq binfmt_misc vhost_net vhost tap tun xt_multiport 
iptable_filter bridge devlink 8021q garp mrp stp llc fuse intel_rapl nls_ascii 
nls_cp437 x86_pkg_temp_thermal intel_powerclamp coretemp vfat fat kvm_intel kvm 
hci_uart btqca irqbypass btintel bluetooth crct10dif_pclmul crc32_pclmul i915 
ghash_clmulni_intel ast intel_cstate efi_pstore
joydev evdev intel_uncore ttm sg drbg intel_rapl_perf cdc_acm pcspkr 
drm_kms_helper ansi_cprng efivars drm iTCO_wdt shpchp mei_me 
iTCO_vendor_support mei intel_pch_thermal ie31200_edac battery ecdh_generic 
intel_lpss_acpi intel_lpss rfkill mfd_core video acpi_als acpi_power_meter 
kfifo_buf industrialio acpi_pad button ipmi_si ipmi_devintf ipmi_msghandler 
nfsd nfs_acl lockd auth_rpcgss
[ 2420.733274]  grace sunrpc efivarfs ip_tables x_tables autofs4 ext4 crc16 
mbcache jbd2 crc32c_generic fscrypto ecb dm_mod ses enclosure 
scsi_transport_sas sd_mod hid_generic usbhid crc32c_intel aesni_intel 
aes_x86_64 crypto_simd cryptd glue_helper i2c_i801 ahci xhci_pci igb ixgbe 
libahci xhci_hcd i2c_algo_bit dca libata usbcore megaraid_sas ptp usb_common 
pps_core mdio scsi_mod fan thermal i2c_hid hid
[ 2420.733289] CPU: 5 PID: 20290 Comm: check_ups Not tainted 4.14.0-1-amd64 #1 
Debian 4.14.2-1
[ 2420.733290] Hardware name: Supermicro Super Server/X11SSH-F, BIOS 2.0b 
07/27/2017
[ 2420.733290] task: 9f6d3af50040 task.stack: bb78c6948000
[ 2420.733292] RIP: 0010:page_counter_cancel+0x1b/0x20
[ 2420.733292] RSP: 0018:bb78c694bb90 EFLAGS: 00010097
[ 2420.733293] RAX:  RBX: 9f6d29fcdcc0 RCX: 
[ 2420.733293] RDX: 3c0b3a402678 RSI: 0001 RDI: 9f6d29fcdcc0
[ 2420.733294] RBP: bb78c694bb90 R08: 9f6d850120f0 R09: 
[ 2420.733294] R10: bb78c694bcb0 R11: 0100 R12: 0001
[ 2420.733295] R13: 9f6d29fcdc00 R14: 9f6d4e353000 R15: 9f6d43bc3780
[ 2420.733295] FS:  7f1043d50480() GS:9f6d8514() 
knlGS:
[ 2420.733296] CS:  0010 DS:  ES:  CR0: 80050033
[ 2420.733296] CR2: 7ffe7c8c8628 CR3: 00078c404003 CR4: 003626e0
[ 2420.733297] DR0:  DR1:  DR2: 
[ 2420.733297] DR3:  DR6: fffe0ff0 DR7: 0400
[ 2420.733298] Call Trace:
[ 2420.733300]  page_counter_uncharge+0x22/0x40
[ 2420.733301]  drain_stock.isra.37+0x38/0xa0
[ 2420.733302]  refill_stock+0x47/0x80
[ 2420.733303]  mem_cgroup_uncharge_skmem+0x27/0x40
[ 2420.733305]  __sk_mem_reduce_allocated+0x7a/0xe0
[ 2420.733306]  __sk_mem_reclaim+0x1e/0x20
[ 2420.733308]  tcp_v4_destroy_sock+0x213/0x230
[ 2420.733310]  tcp_v6_destroy_sock+0x12/0x20
[ 2420.733311]  inet_csk_destroy_sock+0x4b/0x100
[ 2420.733312]  tcp_done+0x8d/0x90
[ 2420.733313]  tcp_rcv_state_process+0x9d3/0xe80
[ 2420.733314]  ? sk_reset_timer+0x18/0x30
[ 2420.733315]  ? tcp_schedule_loss_probe+0x11e/0x160
[ 2420.733316]  tcp_v6_do_rcv+0x1c4/0x410
[ 2420.733317]  ? tcp_v6_do_rcv+0x1c4/0x410
[ 2420.733318]  __release_sock+0x83/0xd0
[ 2420.733319]  release_sock+0x30/0xa0
[ 2420.733320]  tcp_close+0x167/0x3f0
[ 2420.733322]  inet_release+0x3c/0x60
[ 2420.733323]  inet6_release+0x30/0x40
[ 2420.733325]  sock_release+0x1f/0x80
[ 2420.733326]  sock_close+0x12/0x20
[ 2420.733327]  __fput+0xe7/0x220
[ 2420.733328]  fput+0xe/0x10
[ 2420.70]  task_work_run+0x97/0xc0
[ 2420.71]  exit_to_usermode_loop+0xc0/0xd0
[ 2420.72]  syscall_return_slowpath+0x8d/0x90
[ 2420.73]  system_call_fast_compare_end+0x95/0x97
[ 2420.74] RIP: 0033:0x7f1043506390
[ 2420.75] RSP: