Hi Alex,

  I apologize for my mistake and for being unclear. I meant "ndo_stop".

  You are right that they should be called with the RTNL lock held. And we were 
doing that. What is probably wrong is that we were not
calling
        clear_bit(__LINK_STATE_START, &dev->state);
This only happens if we call dev_close.
If this bit is not cleared, netif_running will return true in the reset task. 
Then napi_disable will be called again. So it's not about the RTNL lock
but about the __LINK_STATE_START bit.

  I agree that calling ndo_stop directly is probably wrong. And so far this is 
the only way we can make the driver hang.

 I still think that reset and open could race. As I said I can't reproduce 
this. But the second call to netif_running and napi_enable in the reset task
is not protected by __I40EVF_IN_CRITICAL_TASK. And the call in i40evf_open to 
i40evf_up_complete is not protected either.

  Best wishes,
    Codrut 

-----Original Message-----
From: Alexander Duyck [mailto:alexander.du...@gmail.com] 
Sent: Thursday, August 03, 2017 4:08 AM
To: Codrut Grosu <cogr...@ixiacom.com>
Cc: e1000-devel@lists.sourceforge.net
Subject: Re: [E1000-devel] Possible bug in the i40evf driver

On Tue, Aug 1, 2017 at 8:29 AM, Codrut Grosu <cogr...@ixiacom.com> wrote:
>    Hi,
>
>   We believe there might be a bug in the i40evf driver.
>
>   We think that there is race issue between i40evf_reset_task and i40evf_down 
> / i40evf_open.
> The reason is that the functions napi_enable / napi_disable must be 
> called in pairs in order not to loop indefinitely (or crash).
>
> Consider the following:
>
> ifconfig eth1 mtu 1000 & ifconfig eth1 up
>
> What happens now is that the change of mtu schedules a reset. Suppose 
> the reset task starts, and the first call to netif_running (after 
> continue_reset) returns false. Before the thread reaches the second 
> call to netif_running, i40evf_open starts in another thread. Then the second 
> netif_running in reset will return true, and we will have 2 consecutive calls 
> of napi_enable.
>
> We could not reproduce this particular situation in practice (probably due to 
> the short timing).
> However, we did hang the driver using a call to ndo_close() followed 
> quickly by "ethtool -G eth1 rx 4096". In this case netif_running will 
> return true always (as we bypassed the call to dev_close), the reset 
> will be scheduled before the interface finishes going down, and 2 calls to 
> napi_disable will happen.

So this last statement isn't exactly clear. There isn't an ndo_close() in the 
kernel last I knew. There is an ndo_stop() and an i40evf_close(), but there 
isn't an ndo_close(). Can you clarify which one of these you were calling?

I ask because ndo_stop() and i40evf_close() should only be called with the RTNL 
lock held. It sounds like you may not be doing that and that could cause a 
collsion with something like an "ethtool -G" command because the ethtool ioctl 
is taking the RTNL lock to avoid such collisions and if your call that is 
getting into i40evf_close() isn't holding the RTNL then that might explain the 
issue.

- Alex
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel&#174; Ethernet, visit 
http://communities.intel.com/community/wired

Reply via email to