Hi,

We are currently running kernel version 4.13.0-1011-azure #14-Ubuntu in production, and we are about to update to kernel version 4.13.0-1016-azure #14-Ubuntu.
Is there a workaround for this issue, or any update on when the updated kernel version will be released? The issue affects our production environment, causing machines to freeze and fail every couple of days.

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/1772264

Title: watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [java:5783]

Status in linux-azure package in Ubuntu: New

Bug description:

Hello Team,

I have a customer who is experiencing this issue once every 2 days. Here are the details of the bug:

May 14 05:24:21 localhost kernel: [6006808.160001] watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [java:5783]
May 14 05:24:21 localhost kernel: [6006808.160055] Modules linked in: ufs msdos xfs ip6table_filter ip6_tables iptable_filter nf_conntrack_ipv4 nf_defrag_ipv4 xt_owner xt_conntrack nf_conntrack iptable_security ip_tables x_tables udf crc_itu_t i2c_piix4 hv_balloon joydev i2c_core serio_raw ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper hid_hyperv cryptd hyperv_fb pata_acpi hv_utils cfbfillrect cfbimgblt ptp cfbcopyarea hid hyperv_keyboard hv_netvsc pps_core
May 14 05:24:21 localhost kernel: [6006808.160055] CPU: 5 PID: 5783 Comm: java Not tainted 4.13.0-1011-azure #14-Ubuntu
May 14 05:24:21 localhost kernel: [6006808.160055] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 06/02/2017
May 14 05:24:21 localhost kernel: [6006808.160055] task: ffff8b91a48fc5c0 task.stack: ffffb5c4cd014000
May 14 05:24:21 localhost kernel: [6006808.160055] RIP: 0010:fsnotify+0x1f9/0x4f0
May 14 05:24:21 localhost kernel: [6006808.160055] RSP: 0018:ffffb5c4cd017e08 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff0c
May 14 05:24:21 localhost kernel: [6006808.160055] RAX: 0000000000000001 RBX: ffff8ba0f6246020 RCX: 00000000ffffffff
May 14 05:24:21 localhost kernel: [6006808.160055] RDX: ffff8ba0f6246048 RSI: 0000000000000000 RDI: ffffffff9bc57020
May 14 05:24:21 localhost kernel: [6006808.160055] RBP: ffffb5c4cd017ea8 R08: 0000000000000000 R09: 0000000000000000
May 14 05:24:21 localhost kernel: [6006808.160055] R10: ffffe93042d21080 R11: 0000000000000000 R12: 0000000000000000
May 14 05:24:21 localhost kernel: [6006808.160055] R13: ffff8ba0f6246048 R14: 0000000000000000 R15: 0000000000000000
May 14 05:24:21 localhost kernel: [6006808.160055] FS: 00007f154838a700(0000) GS:ffff8ba0fd740000(0000) knlGS:0000000000000000
May 14 05:24:21 localhost kernel: [6006808.160055] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 14 05:24:21 localhost kernel: [6006808.160055] CR2: 00007f3fcc254000 CR3: 0000000165c38000 CR4: 00000000001406e0
May 14 05:24:21 localhost kernel: [6006808.160055] Call Trace:
May 14 05:24:21 localhost kernel: [6006808.160055] ? new_sync_write+0xe5/0x140
May 14 05:24:21 localhost kernel: [6006808.160055] vfs_write+0x15a/0x1b0
May 14 05:24:21 localhost kernel: [6006808.160055] ? syscall_trace_enter+0xcd/0x2f0
May 14 05:24:21 localhost kernel: [6006808.160055] SyS_write+0x55/0xc0
May 14 05:24:21 localhost kernel: [6006808.160055] do_syscall_64+0x61/0xd0
May 14 05:24:21 localhost kernel: [6006808.160055] entry_SYSCALL64_slow_path+0x25/0x25
May 14 05:24:21 localhost kernel: [6006808.160055] RIP: 0033:0x7f489076a2dd.
#“This indicates Softlockup error message stored in the RIP Register”
May 14 05:24:21 localhost kernel: [6006808.160055] RSP: 002b:00007f15483872a0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
May 14 05:24:21 localhost kernel: [6006808.160055] RAX: ffffffffffffffda RBX: 00007f1548389380 RCX: 00007f489076a2dd
May 14 05:24:21 localhost kernel: [6006808.160055] RDX: 0000000000001740 RSI: 00007f1548387310 RDI: 000000000000063c
May 14 05:24:21 localhost kernel: [6006808.160055] RBP: 00007f15483872d0 R08: 00007f1548387310 R09: 00007f418a55b0b8
May 14 05:24:21 localhost kernel: [6006808.160055] R10: 00000000005b31ee R11: 0000000000000293 R12: 0000000000001740
May 14 05:24:21 localhost kernel: [6006808.160055] R13: 00007f1548387310 R14: 000000000000063c R15: 00007f40000051e0

The customer is using Elasticsearch and therefore submitted an issue on the Elasticsearch GitHub repository, where the responders indicated that this is a kernel issue and not an Elasticsearch issue. GitHub issue for reference:
https://github.com/elastic/elasticsearch/issues/30667

For now, I have asked him to increase the kernel.watchdog_thresh parameter from 10 to 20 to relax the situation. The customer wants to know for sure whether this is a kernel bug. I also asked him to perform a kernel update. However, he is willing to do so on all 65 servers if it is confirmed that this is a bug in the current kernel.

The customer also submitted a bug to the team responsible for the Java process, which seems to be causing the issue. Their reply was that it is a kernel issue, and they pointed to the following Launchpad bug, although I personally do not think it applies here. I may be wrong, however:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717

Performance information for the Java process on the customer's machine:

Avg. Load: Avg=3, max=9
CPU: Avg=29, max=73
MEM: Avg=18, max=23

CGROUP:

ubuntu@prod-elasticsearch-data-008:~$ cat /proc/1399/cgroup
12:rdma:/
11:devices:/system.slice/elasticsearch.service
10:pids:/system.slice/elasticsearch.service
9:cpuset:/
8:blkio:/system.slice/elasticsearch.service
7:memory:/system.slice/elasticsearch.service
6:perf_event:/
5:cpu,cpuacct:/system.slice/elasticsearch.service
4:net_cls,net_prio:/
3:freezer:/
2:hugetlb:/
1:name=systemd:/system.slice/elasticsearch.service

/etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"

Kernel Version : 4.13.0-1011-azure #14-Ubuntu

Please let me know your thoughts given the above information. If any extra information is required, I will be happy to gather and provide it.

Regards,
Sriharsha B S,
Microsoft Azure Linux Team
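
On the kernel.watchdog_thresh change mentioned in the bug description: per the kernel documentation, the soft lockup threshold is twice watchdog_thresh, so the value 10 corresponds to 20 seconds (consistent with the "stuck for 23s" report) and 20 corresponds to 40 seconds. The following Python is only a rough sketch of that arithmetic applied to the log line above. The function names are hypothetical, and the regular expression assumes the message format shown in this report. The reported duration is the time at detection, so the actual stall may have lasted longer, and raising the threshold hides shorter stalls rather than fixing them.

```python
import re

# Assumed message format, taken from the log line quoted in this report.
SOFT_LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockup(line):
    """Return the fields of a soft lockup message, or None if the line has none."""
    m = SOFT_LOCKUP_RE.search(line)
    if m is None:
        return None
    return {
        "cpu": int(m.group("cpu")),
        "seconds": int(m.group("secs")),
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
    }


def soft_lockup_threshold(watchdog_thresh):
    """Soft lockup threshold in seconds: 2 * kernel.watchdog_thresh."""
    return 2 * watchdog_thresh


def would_report(stuck_seconds, watchdog_thresh):
    """True if a stall of stuck_seconds reaches the soft lockup threshold."""
    return stuck_seconds >= soft_lockup_threshold(watchdog_thresh)


if __name__ == "__main__":
    line = ("May 14 05:24:21 localhost kernel: [6006808.160001] "
            "watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [java:5783]")
    event = parse_soft_lockup(line)
    for thresh in (10, 20):
        print(thresh, soft_lockup_threshold(thresh),
              would_report(event["seconds"], thresh))
```

With watchdog_thresh at 20, a stall that ends before 40 seconds would no longer produce this message, which is why the setting only relaxes the symptom.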

