Hi,

We are currently running kernel version 4.13.0-1011-azure #14-Ubuntu in
production. We are about to update to kernel version
4.13.0-1016-azure #14-Ubuntu.

Is there a workaround for this issue, or any update on when the updated
kernel version will be released?

The issue affects our production environment: the machines freeze and
fail every couple of days.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/1772264

Title:
  watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [java:5783]

Status in linux-azure package in Ubuntu:
  New

Bug description:
  Hello Team,

  I have a customer who is experiencing this issue once every 2 days.
  Here are the details of the bug:

  May 14 05:24:21 localhost kernel: [6006808.160001] watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [java:5783]
  May 14 05:24:21 localhost kernel: [6006808.160055] Modules linked in: ufs msdos xfs ip6table_filter ip6_tables iptable_filter nf_conntrack_ipv4 nf_defrag_ipv4 xt_owner xt_conntrack nf_conntrack iptable_security ip_tables x_tables udf crc_itu_t i2c_piix4 hv_balloon joydev i2c_core serio_raw ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper hid_hyperv cryptd hyperv_fb pata_acpi hv_utils cfbfillrect cfbimgblt ptp cfbcopyarea hid hyperv_keyboard hv_netvsc pps_core
  May 14 05:24:21 localhost kernel: [6006808.160055] CPU: 5 PID: 5783 Comm: java Not tainted 4.13.0-1011-azure #14-Ubuntu
  May 14 05:24:21 localhost kernel: [6006808.160055] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007  06/02/2017
  May 14 05:24:21 localhost kernel: [6006808.160055] task: ffff8b91a48fc5c0 task.stack: ffffb5c4cd014000
  May 14 05:24:21 localhost kernel: [6006808.160055] RIP: 0010:fsnotify+0x1f9/0x4f0
  May 14 05:24:21 localhost kernel: [6006808.160055] RSP: 0018:ffffb5c4cd017e08 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff0c
  May 14 05:24:21 localhost kernel: [6006808.160055] RAX: 0000000000000001 RBX: ffff8ba0f6246020 RCX: 00000000ffffffff
  May 14 05:24:21 localhost kernel: [6006808.160055] RDX: ffff8ba0f6246048 RSI: 0000000000000000 RDI: ffffffff9bc57020
  May 14 05:24:21 localhost kernel: [6006808.160055] RBP: ffffb5c4cd017ea8 R08: 0000000000000000 R09: 0000000000000000
  May 14 05:24:21 localhost kernel: [6006808.160055] R10: ffffe93042d21080 R11: 0000000000000000 R12: 0000000000000000
  May 14 05:24:21 localhost kernel: [6006808.160055] R13: ffff8ba0f6246048 R14: 0000000000000000 R15: 0000000000000000
  May 14 05:24:21 localhost kernel: [6006808.160055] FS:  00007f154838a700(0000) GS:ffff8ba0fd740000(0000) knlGS:0000000000000000
  May 14 05:24:21 localhost kernel: [6006808.160055] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  May 14 05:24:21 localhost kernel: [6006808.160055] CR2: 00007f3fcc254000 CR3: 0000000165c38000 CR4: 00000000001406e0
  May 14 05:24:21 localhost kernel: [6006808.160055] Call Trace:
  May 14 05:24:21 localhost kernel: [6006808.160055]  ? new_sync_write+0xe5/0x140
  May 14 05:24:21 localhost kernel: [6006808.160055]  vfs_write+0x15a/0x1b0
  May 14 05:24:21 localhost kernel: [6006808.160055]  ? syscall_trace_enter+0xcd/0x2f0
  May 14 05:24:21 localhost kernel: [6006808.160055]  SyS_write+0x55/0xc0
  May 14 05:24:21 localhost kernel: [6006808.160055]  do_syscall_64+0x61/0xd0
  May 14 05:24:21 localhost kernel: [6006808.160055]  entry_SYSCALL64_slow_path+0x25/0x25
  May 14 05:24:21 localhost kernel: [6006808.160055] RIP: 0033:0x7f489076a2dd. #“This indicates Softlockup error message stored in the RIP Register”
  May 14 05:24:21 localhost kernel: [6006808.160055] RSP: 002b:00007f15483872a0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
  May 14 05:24:21 localhost kernel: [6006808.160055] RAX: ffffffffffffffda RBX: 00007f1548389380 RCX: 00007f489076a2dd
  May 14 05:24:21 localhost kernel: [6006808.160055] RDX: 0000000000001740 RSI: 00007f1548387310 RDI: 000000000000063c
  May 14 05:24:21 localhost kernel: [6006808.160055] RBP: 00007f15483872d0 R08: 00007f1548387310 R09: 00007f418a55b0b8
  May 14 05:24:21 localhost kernel: [6006808.160055] R10: 00000000005b31ee R11: 0000000000000293 R12: 0000000000001740
  May 14 05:24:21 localhost kernel: [6006808.160055] R13: 00007f1548387310 R14: 000000000000063c R15: 00007f40000051e0
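
  To count how often this happens across the 65 servers, the watchdog
  header line can be pulled out of the syslog mechanically. The following
  is only a minimal sketch: the regular expression is derived from the
  single message shown above, and the names SOFT_LOCKUP_RE, SoftLockup
  and parse_soft_lockup are hypothetical helpers, not part of any
  existing tool.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the one log line in this report (an assumption,
# not an exhaustive description of the kernel's message format).
SOFT_LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)


class SoftLockup(NamedTuple):
    cpu: int      # CPU number reported by the watchdog
    seconds: int  # how long the CPU was reported stuck
    comm: str     # command name of the task that was running
    pid: int      # PID of that task


def parse_soft_lockup(line: str) -> Optional[SoftLockup]:
    """Return the fields of a soft lockup message, or None if absent."""
    match = SOFT_LOCKUP_RE.search(line)
    if match is None:
        return None
    return SoftLockup(
        cpu=int(match["cpu"]),
        seconds=int(match["secs"]),
        comm=match["comm"],
        pid=int(match["pid"]),
    )


if __name__ == "__main__":
    sample = (
        "May 14 05:24:21 localhost kernel: [6006808.160001] watchdog: "
        "BUG: soft lockup - CPU#5 stuck for 23s! [java:5783]"
    )
    print(parse_soft_lockup(sample))
```

  Run over each host's kernel log, this would give a per-server list of
  lockup events (CPU, duration, task) to compare before and after the
  kernel update.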

  The customer is using Elasticsearch and therefore opened an issue on
  the Elasticsearch GitHub repository, where the maintainers indicated
  that this is a kernel issue and not an Elasticsearch issue.

  GitHub issue for reference:
  https://github.com/elastic/elasticsearch/issues/30667

  For now, I have asked him to increase the kernel.watchdog_thresh
  parameter from 10 to 20 as a temporary mitigation.
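
  For context, the kernel documents the soft lockup threshold as
  2 * kernel.watchdog_thresh, so the default of 10 corresponds to the
  20-second limit that the "stuck for 23s" message exceeded, and a value
  of 20 would raise that limit to 40 seconds. A minimal sketch of that
  arithmetic follows; the helper names are hypothetical, and reading the
  live value assumes a Linux host that exposes the sysctl under /proc.

```python
from pathlib import Path
from typing import Optional

# Standard procfs location of the sysctl on Linux.
WATCHDOG_THRESH_PATH = Path("/proc/sys/kernel/watchdog_thresh")


def soft_lockup_threshold(watchdog_thresh: int) -> int:
    """Seconds a CPU may be stuck before a soft lockup is reported."""
    return 2 * watchdog_thresh


def read_watchdog_thresh(path: Path = WATCHDOG_THRESH_PATH) -> Optional[int]:
    """Read the current setting, or None if it is unavailable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    current = read_watchdog_thresh()
    if current is None:
        print("kernel.watchdog_thresh is not readable on this system")
    else:
        print(f"watchdog_thresh={current}, "
              f"soft lockup reported after {soft_lockup_threshold(current)}s")
```

  The setting itself can be changed at runtime with
  "sysctl -w kernel.watchdog_thresh=20" (as root). This only delays the
  watchdog report; it does not address whatever keeps the CPU busy.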

  The customer wants to know for sure whether this is a kernel bug. I
  also asked him to perform a kernel update. If it is confirmed that
  this is a bug in the current kernel, he is willing to do so on all
  65 servers.

  The customer also submitted a bug to the team behind the Java process
  that seems to be causing the issue. Their reply was that it is a kernel
  issue, and they pointed to the following Launchpad bug, although I
  personally think it does not really apply here. However, I may be
  wrong:

  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717

  This is the performance information for the Java process on the
  customer's machine:

  Avg. Load: Avg=3, max=9
  CPU: Avg=29, max=73
  MEM: Avg=18, max=23

  CGROUP:
  ubuntu@prod-elasticsearch-data-008:~$ cat /proc/1399/cgroup
  12:rdma:/
  11:devices:/system.slice/elasticsearch.service
  10:pids:/system.slice/elasticsearch.service
  9:cpuset:/
  8:blkio:/system.slice/elasticsearch.service
  7:memory:/system.slice/elasticsearch.service
  6:perf_event:/
  5:cpu,cpuacct:/system.slice/elasticsearch.service
  4:net_cls,net_prio:/
  3:freezer:/
  2:hugetlb:/
  1:name=systemd:/system.slice/elasticsearch.service

  /etc/lsb-release
  DISTRIB_ID=Ubuntu
  DISTRIB_RELEASE=16.04
  DISTRIB_CODENAME=xenial
  DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"

  Kernel Version : 4.13.0-1011-azure #14-Ubuntu

  Please let me know your thoughts given the above information. If extra
  information is required, I will be happy to gather and provide it.
  Regards,
  Sriharsha B S,
  Microsoft Azure Linux Team

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1772264/+subscriptions
