Just to re-summarize this:

The problem here is that when a net namespace is being cleaned up, the
thread cleaning it up gets hung inside the netdev_wait_allrefs()
function in net/core/dev.c.  This function is called every time
anything calls rtnl_unlock(), which happens a lot, including during net
namespace cleanup.  It waits for all netdev device references to be
released.

The thing(s) holding dev references here are dst objects from sockets
(at least, that's what is failing to release the dev refs in my
debugging).  When a net namespace is cleaned up, all its sockets call
dst_dev_put() for their dsts, which moves the netdev refs from the
actual net namespace interfaces to the netns loopback device.  (This is
why the message is typically 'waiting for lo to become free', but
sometimes it waits for a non-lo device to become free if dst_dev_put()
hasn't been called for a dst.)

The dsts are then put, which should reduce their ref count to 0 and free
them, which then also reduces the lo device's ref count.  In correct
operation, this reduces the lo device's ref count to 0, which allows
netdev_wait_allrefs() to complete.

This is where the problem arrives:

1. The first possibility is a kernel socket that hasn't closed and still
has dst(s).  This is possible because while normal (user) sockets take a
reference to their net namespace, and the net namespace does not begin
cleanup until all its references are released (meaning no user sockets
are still open), kernel sockets do not take a reference to their net
namespace.  So net namespace cleanup begins while kernel sockets are
still open.  While the net namespace is cleaning up, it ifdowns all its
interfaces, which causes the kernel sockets to close and free their
dsts, which releases all netdev interface refs and allows the net
namespace to close.  If, however, one of the kernel sockets doesn't pay
attention to the net namespace cleaning up, it may hang around, failing
to free its dsts and failing to release the netdev interface
reference(s).  That causes this bug.  In some cases (maybe most or all,
I'm not sure) the socket eventually times out and does release its dsts,
allowing the net namespace to finish cleaning up and exit.

My recent patch 4ee806d51176ba7b8ff1efd81f271d7252e03a1d fixes *one*
case of this happening - a TCP kernel socket remaining open, trying in
vain to complete its TCP FIN sequence (which will never complete, since
all net namespace interfaces are down and will never come back up while
the net namespace is cleaning up).  The TCP socket would eventually time
out the failed FIN sequence and close, allowing the net namespace to
finish cleaning up, but it takes ~2 minutes or so.  There very well
could be more cases of this happening with other kernel sockets.

2. Alternately, even if all kernel sockets behave correctly and clean up
immediately during net namespace cleanup, it's possible for a dst object
to leak a reference.  This has happened before, there are likely
existing dst leaks, and there will likely be dst leaks accidentally
introduced in the future.  When a dst leaks, its refcount never reaches
0, so the dst never frees itself and never releases its reference on its
interface (which, remember, is moved to the net namespace loopback
interface when its socket is closing).  Thus, netdev_wait_allrefs() will
hang forever.

Actually fixing either cause above is the correct thing to do, of
course, but that's a long-term effort: even if all causes currently in
the kernel are fixed, more leaks will probably be accidentally
introduced in the future.  And while memory leaks are not good, the more
serious problem here is that the netdev_wait_allrefs() hang happens
while the thread is holding the global net mutex - that prevents the
creation of any new net namespaces and the cleanup of any existing net
namespaces (it also blocks registering/unregistering any pernet
subsystem or pernet device).  This can render the entire system
unusable, and requires a reboot to correct.

Starting from the assumption that it's never safe to ignore the
interface's refcount and just free it anyway after some time period, I
have to allow that when a dst leaks, its net namespace will also have to
leak, since the dst will always hold a reference to the net namespace
loopback (or other) interface.  Given that, what we need to do is move
the hanging netdev_wait_allrefs() call outside of the net mutex locked
section.  I'm working on a patch to do just that currently.

You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.

  unregister_netdevice: waiting for lo to become free

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Trusty:
  In Progress
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Zesty:
  Won't Fix
Status in linux source package in Artful:
  In Progress
Status in linux source package in Bionic:
  In Progress

Bug description:
  This is a "continuation" of bug 1403152, as that bug has been marked
  "fix released" and recent reports of failure may (or may not) be a new
  bug.  Any further reports of the problem should please be reported
  here instead of that bug.



  When shutting down and starting containers the container network
  namespace may experience a dst reference counting leak which results
  in this message repeated in the logs:

      unregister_netdevice: waiting for lo to become free. Usage count =

  This can cause issues when trying to create a new network namespace,
  thus blocking a user from creating new containers.

  [Test Case]

  See comment 16, reproducer provided at https://github.com/fho/docker-
