Upon updating some internal kernels from 4.14.18 to 4.14.49, our hyper-v guests 
failed to bring up network interfaces on boot, logging "A link change request 
failed with some changes committed already. Interface eth0 may have been left 
with an inconsistent configuration, please check."  Running 'ifconfig eth0 up' 
appears to fix the problem temporarily so I went about bisecting which landed 
on:

commit be9c798d0d13ae609a91177323ac816545c39d28
Author: Stephen Hemminger <step...@networkplumber.org>
Date:   Mon May 14 15:32:18 2018 -0700

    hv_netvsc: common detach logic
    
    [ Commit 7b2ee50c0cd513a176a26a71f2989facdd75bfea upstream. ]
    
    Make common function for detaching internals of device
    during changes to MTU and RSS. Make sure no more packets
    are transmitted and all packets have been received before
    doing device teardown.

    Change the wait logic to be common and use usleep_range().

    Changes transmit enabling logic so that transmit queues are disabled
    during the period when lower device is being changed. And enabled
    only after sub channels are setup. This avoids issue where it could
    be that a packet was being sent while subchannel was not initialized.

    Fixes: 8195b1396ec8 ("hv_netvsc: fix deadlock on hotplug")
    Signed-off-by: Stephen Hemminger <sthem...@microsoft.com>
    Signed-off-by: David S. Miller <da...@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>


The removal of which does indeed fix the problem.  That led me to:

commit 52acf73b6e9a6962045feb2ba5a8921da2201915
Author: Dexuan Cui <de...@microsoft.com>
Date:   Wed Jun 6 21:32:51 2018 +0000

    hv_netvsc: Fix a network regression after ifdown/ifup
    
    Recently people reported the NIC stops working after
    "ifdown eth0; ifup eth0". It turns out in this case the TX queues are not
    enabled, after the refactoring of the common detach logic: when the NIC
    has sub-channels, usually we enable all the TX queues after all
    sub-channels are set up: see rndis_set_subchannel() ->
    netif_device_attach(), but in the case of "ifdown eth0; ifup eth0" where
    the number of channels doesn't change, we also must make sure the TX queues
    are enabled. The patch fixes the regression.
    
    Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
    Signed-off-by: Dexuan Cui <de...@microsoft.com>
    Cc: Stephen Hemminger <sthem...@microsoft.com>
    Cc: K. Y. Srinivasan <k...@microsoft.com>
    Cc: Haiyang Zhang <haiya...@microsoft.com>
    Signed-off-by: David S. Miller <da...@davemloft.net>

Which sounded very promising, but does not seem to fully fix the issue.  Doing 
some more digging, I was able to determine that the message coincides with 'ip 
link set dev eth0 mtu 1300 up' very shortly (>~1 second) after the hv_netvsc 
driver loads.  If I delay the mtu change until well after the driver loads, 
everything works fine.  If I unload hv_netvsc and then reload it and apply the 
mtu change immediately, the failure re-occurs.  So something is racy here, and 
the above doesn't entirely address it.

I'm happy to test out any suggested patches and/or do additional debugging if 
anyone has any suggestions.

(oh, and I did also try 4.18-rc1 and the problem still persists)
_______________________________________________
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

Reply via email to