Revisiting this bug: is this issue still occurring?

** Changed in: linux (Ubuntu)
       Status: New => Incomplete

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1633223

Title:
  rcu_sched detected stalls with kernel 3.19.0-58, NVIDIA driver, and
  docker

Status in linux package in Ubuntu:
  Incomplete

Bug description:
  ---Problem Description---
  Seeing occasional rcu_sched detected stalls on 14.04 LTS with kernel
  3.19.0-58. The system is running docker containers, and has the NVIDIA GPU
  driver loaded. We've seen about 4 stalls in the last month, all with the
  3.19.0-58 kernel, and with the NVIDIA 352.93 and 361.49 drivers.
    
  ---uname output---
  Linux dldev1 3.19.0-58-generic #64~14.04.1-Ubuntu SMP Fri Mar 18 19:05:01 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux
   
  ---Additional Hardware Info---
  2 x NVIDIA K80 GPU adapters:
  $ lspci | grep NV
  0002:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
  0002:04:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
  0006:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
  0006:04:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1) 

   
  Machine Type = 8247-42L 
   
  ---System Hang---
  The usual symptom is that the system is unresponsive, except perhaps for
  ping and for writing the stall-detection messages to the console.
  Login/getty is available neither via ssh nor on the console. The system
  must be power cycled to recover.
   
  Attached is the kernel log from a stall detection on May 18th. The
  detection first occurs at: May 18 15:17:55.

  The system is later rebooted and those messages indicate the kernel
  (3.19.0-58) and NVIDIA driver version (352.93) that were active at the
  time.

  We've suffered 3 or 4 stalls since, all with the same kernel, but some
  with a newer NVIDIA driver (361.49).

  Unfortunately, information about the newer stalls wasn't preserved in
  the various log files (and we're not capturing the console
  constantly), so we don't have detailed data for those.

  We'd welcome any suggestions for how to collect additional data for
  these occurrences.

  I can't say for sure that we haven't seen the stalls on other systems,
  but they're occurring fairly frequently on this system, and it's
  unusual in that it's running both Docker and the NVIDIA GPU driver. So
  maybe aufs or the NVIDIA driver is somehow involved.

  From the kern.log, the call trace points to some kind of deadlock in aufs:

  May 18 15:17:55 dldev1 kernel: [713670.798624] Task dump for CPU 3:
  May 18 15:17:55 dldev1 kernel: [713670.798628] cc1             R  running task        0 99183  99173 0x00040004
  May 18 15:17:55 dldev1 kernel: [713670.798633] Call Trace:
  May 18 15:17:55 dldev1 kernel: [713670.798643] [c000000fa64673a0] [c0000000000cf004] wake_up_worker+0x44/0x60 (unreliable)
  May 18 15:17:55 dldev1 kernel: [713670.798671] [c000000fa6467570] [c000000fa64675d0] 0xc000000fa64675d0
  May 18 15:17:55 dldev1 kernel: [713670.798676] [c000000fa64675d0] [c000000000a1b050] __schedule+0x370/0x900
  May 18 15:17:55 dldev1 kernel: [713670.798679] [c000000fa64677f0] [c000000fa6467850] 0xc000000fa6467850
  May 18 15:17:55 dldev1 kernel: [713670.798682] Task dump for CPU 75:
  May 18 15:17:55 dldev1 kernel: [713670.798684] cc1             D 00001000005d9410     0 99427  99405 0x00040004
  May 18 15:17:55 dldev1 kernel: [713670.798688] Call Trace:
  May 18 15:17:55 dldev1 kernel: [713670.798691] [c0000017efdd3460] [c0000017efdd34a0] 0xc0000017efdd34a0 (unreliable)
  May 18 15:17:55 dldev1 kernel: [713670.798695] [c0000017efdd3630] [c0000017efdd3690] 0xc0000017efdd3690
  May 18 15:17:55 dldev1 kernel: [713670.798698] [c0000017efdd3690] [c000000000a1b050] __schedule+0x370/0x900
  May 18 15:17:55 dldev1 kernel: [713670.798702] [c0000017efdd38b0] [c000000000a1f128] rwsem_down_write_failed+0x288/0x400
  May 18 15:17:55 dldev1 kernel: [713670.798706] [c0000017efdd3940] [c000000000a1e538] down_write+0x88/0x90
  May 18 15:17:55 dldev1 kernel: [713670.798716] [c0000017efdd3970] [d00000001ead562c] do_ii_write_lock+0x8c/0xd0 [aufs]
  May 18 15:17:55 dldev1 kernel: [713670.798724] [c0000017efdd39a0] [d00000001eac0e98] aufs_read_lock+0xb8/0xd0 [aufs]
  May 18 15:17:55 dldev1 kernel: [713670.798733] [c0000017efdd39e0] [d00000001ead8208] aufs_d_revalidate+0x98/0x7a0 [aufs]
  May 18 15:17:55 dldev1 kernel: [713670.798737] [c0000017efdd3aa0] [c0000000002c88f8] lookup_fast+0x368/0x3b0
  May 18 15:17:55 dldev1 kernel: [713670.798740] [c0000017efdd3b10] [c0000000002cb620] path_lookupat+0x180/0x970
  May 18 15:17:55 dldev1 kernel: [713670.798743] [c0000017efdd3be0] [c0000000002cbe68] filename_lookup+0x58/0x140
  May 18 15:17:55 dldev1 kernel: [713670.798746] [c0000017efdd3c30] [c0000000002cde04] user_path_at_empty+0x84/0xe0
  May 18 15:17:55 dldev1 kernel: [713670.798749] [c0000017efdd3d20] [c0000000002be744] vfs_fstatat+0x84/0x140
  May 18 15:17:55 dldev1 kernel: [713670.798753] [c0000017efdd3d80] [c0000000002bee14] SyS_newlstat+0x34/0x60
  May 18 15:17:55 dldev1 kernel: [713670.798757] [c0000017efdd3e30] [c000000000009258] system_call+0x38/0xd0

  Other info from the log:

  May 18 16:00:00 dldev1 kernel: [   11.802963] nvidia 0002:03:00.0: enabling device (0140 -> 0142)
  May 18 16:00:00 dldev1 kernel: [   11.803114] nvidia 0002:04:00.0: enabling device (0140 -> 0142)
  May 18 16:00:00 dldev1 kernel: [   11.803221] nvidia 0006:03:00.0: enabling device (0140 -> 0142)
  May 18 16:00:00 dldev1 kernel: [   11.803349] nvidia 0006:04:00.0: enabling device (0140 -> 0142)
  May 18 16:00:00 dldev1 kernel: [   11.803598] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0002:03:00.0 on minor 0
  May 18 16:00:00 dldev1 kernel: [   11.803654] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0002:04:00.0 on minor 1
  May 18 16:00:00 dldev1 kernel: [   11.803700] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0006:03:00.0 on minor 2
  May 18 16:00:00 dldev1 kernel: [   11.803742] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0006:04:00.0 on minor 3
  May 18 16:00:00 dldev1 kernel: [   11.803749] NVRM: loading NVIDIA UNIX ppc64le Kernel Module  352.93  Tue Apr  5 17:31:42 PDT 2016
  ....

  May 18 16:00:01 dldev1 kernel: [  278.416678] aufs 3.x-rcN-20150105
  May 18 16:00:01 dldev1 kernel: [  278.952335] bridge: automatic filtering via arp/ip/ip6tables has been deprecated. Update your scripts to load br_netfilter if you need this.
  May 18 16:00:01 dldev1 kernel: [  278.953351] Bridge firewalling registered
  May 18 16:00:01 dldev1 kernel: [  278.957164] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
  May 18 16:00:01 dldev1 kernel: [  278.997765] ip_tables: (C) 2000-2006 Netfilter Core Team
  May 18 16:00:01 dldev1 kernel: [  279.013814] nvidia 0002:03:00.0: Using 64-bit DMA iommu bypass
  May 18 16:00:03 dldev1 kernel: [  280.480596] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
  May 18 16:00:04 dldev1 kernel: [  282.035907] nvidia 0002:04:00.0: Using 64-bit DMA iommu bypass
  May 18 16:00:07 dldev1 kernel: [  285.065144] nvidia 0006:03:00.0: Using 64-bit DMA iommu bypass
  May 18 16:00:10 dldev1 kernel: [  288.150339] nvidia 0006:04:00.0: Using 64-bit DMA iommu bypass
  ...
  May 18 16:01:55 dldev1 kernel: [  393.265785] aufs au_opts_verify:1612:docker[3444]: dirperm1 breaks the protection by the permission bits on the lower branch
  May 18 16:01:56 dldev1 kernel: [  393.328552] aufs au_opts_verify:1612:docker[3444]: dirperm1 breaks the protection by the permission bits on the lower branch
  May 18 16:01:56 dldev1 kernel: [  393.366023] device veth0beb24c entered promiscuous mode
  May 18 16:01:56 dldev1 kernel: [  393.367228] IPv6: ADDRCONF(NETDEV_UP): veth0beb24c: link is not ready
  May 18 16:01:56 dldev1 kernel: [  393.367235] docker0: port 1(veth0beb24c) entered forwarding state
  May 18 16:01:56 dldev1 kernel: [  393.367314] docker0: port 1(veth0beb24c) entered forwarding state
  May 18 16:01:56 dldev1 kernel: [  393.367825] docker0: port 1(veth0beb24c) entered disabled state
  May 18 16:01:56 dldev1 kernel: [  393.526038] docker0: port 1(veth0beb24c) entered disabled state
  May 18 16:01:56 dldev1 kernel: [  393.530143] device veth0beb24c left promiscuous mode
  May 18 16:01:56 dldev1 kernel: [  393.530148] docker0: port 1(veth0beb24c) entered disabled state

  > Hi Breno,
  > 
  > As per Ubuntu support, all new fixes will target 16.04 and Canonical will
  > not automatically fix 14.04. If one needs a fix in 14.04, it has to go
  > through as a special request to Canonical. So should we check if this
  > problem happens on Ubuntu 16.04?

  That is correct. Ubuntu 14.04 fixes will be SRUs, i.e., they need to be
  fixed in 16.10 first, and then backported to 16.04 and 14.04.

  Kernel 4.4 for Ubuntu 16.04.5 is already in proposed. Does it contain
  the bug also? I understand that kernel 3.19 is already EOL.

  The requested docker info; we're using the docker packages provided by
  Canonical:

  $ dpkg -l | egrep 'docker|aufs'
  ii  aufs-tools    1:3.2+20130722-1.1           ppc64el    Tools to manage aufs filesystems
  ii  docker.io     1.9.1~git20151210-18bfacb    ppc64el    Linux container runtime

  $ docker version
  Client:
   Version:      1.9.1
   API version:  1.21
   Go version:   go1.4.2 gccgo (GCC) 5.2.1 20151016 (Advance-Toolchain-) [ibm/gcc-5-branch, revision: 229493 merged from gcc-5-branch, revision 228917]
   Git commit:   18bfacb
   Built:        Thu Dec 17 19:19:02 UTC 2015
   OS/Arch:      linux/ppc64le

  Server:
   Version:      1.9.1
   API version:  1.21
   Go version:   go1.4.2 gccgo (GCC) 5.2.1 20151016 (Advance-Toolchain-) [ibm/gcc-5-branch, revision: 229493 merged from gcc-5-branch, revision 228917]
   Git commit:   18bfacb
   Built:        Thu Dec 17 19:19:02 UTC 2015
   OS/Arch:      linux/ppc64le

  
  $ docker info
  Containers: 8
  Images: 27
  Server Version: 1.9.1
  Storage Driver: aufs
   Root Dir: /var/lib/docker/aufs
   Backing Filesystem: extfs
   Dirs: 43
   Dirperm1 Supported: true
  Execution Driver: native-0.2
  Logging Driver: json-file
  Kernel Version: 3.19.0-58-generic
  Operating System: Ubuntu 14.04.4 LTS
  CPUs: 192
  Total Memory: 127.5 GiB
  Name: dldev1
  ID: NKEF:PLNM:HHXN:3DAD:3CLU:Z2GP:AZLR:MZRO:ADYA:5GOV:YZRN:DWOS
  WARNING: No swap limit support

  I had said: "we're using the docker packages provided by Canonical".

  That's not true.  The 'docker.io' package we're using is from:

  deb http://ftp.unicamp.br/pub/ppc64el/ubuntu/14_04/docker-ppc64el/ trusty main

  The aufs.ko does come from Canonical:

  $ dpkg -S aufs.ko
  linux-image-extra-3.19.0-58-generic: /lib/modules/3.19.0-58-generic/kernel/ubuntu/aufs/aufs.ko

  Another occurrence this afternoon, again while building a software
  project in a Docker container.

  Yesterday's build was 'Caffe', which is mostly C or C++. Today's was
  'TensorFlow', which includes a lot of Java.

  Still seeing this hang on Ubuntu 16.04 with 4.4.0-38 kernel.

  Similar scenario: building TensorFlow in a container often causes the
  problem. I assume this is because the build creates lots of filesystem
  artifacts.

  It seems that running 'sync' every few seconds during the build makes
  the problem less likely to strike.
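
  The periodic 'sync' workaround could be scripted roughly as follows.
  This is only a sketch: the helper name run_with_sync, the 5-second
  interval, and the docker command line in the usage note are
  illustrative assumptions, not the exact commands used on the affected
  system.

```shell
#!/bin/sh
# run_with_sync INTERVAL CMD [ARGS...]
# Hypothetical helper: runs CMD while a background loop calls sync(1)
# every INTERVAL seconds, then stops the loop and returns CMD's status.
run_with_sync() {
    interval=$1
    shift
    status=0

    # Background loop; its output is discarded so it never holds the
    # caller's stdout/stderr open after it has been killed.
    ( while :; do sync; sleep "$interval"; done ) >/dev/null 2>&1 &
    sync_pid=$!

    # Run the real workload (for example, a build) in the foreground
    # and remember its exit status.
    "$@" || status=$?

    # Stop the sync loop and reap it.
    kill "$sync_pid" 2>/dev/null || true
    wait "$sync_pid" 2>/dev/null || true
    return "$status"
}
```

  For example: run_with_sync 5 docker build -t tf-build .
  (the image tag is a placeholder). As observed above, this only makes
  the hang less likely; it is not a fix.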

  Will attach kernel log from an instance on 16.04 / 4.4.0-38 from Oct
  7th.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1633223/+subscriptions

