On Mon, Apr 09, 2018 at 03:26:41PM +0200, Nélio Laranjeiro wrote: > On Mon, Apr 09, 2018 at 12:28:04PM +0000, Shahaf Shuler wrote: > > Monday, April 9, 2018 11:28 AM, Nélio Laranjeiro: > > > Subject: Re: [PATCH] net/mlx5: fix link status initialization > > > > > > On Sun, Apr 08, 2018 at 01:09:27PM +0000, Shahaf Shuler wrote: > > > > Thursday, April 5, 2018 9:51 AM, Nélio Laranjeiro: > > > > > Subject: Re: [PATCH] net/mlx5: fix link status initialization > > > > > > > > > > On Thu, Apr 05, 2018 at 05:35:57AM +0000, Shahaf Shuler wrote: > > > > > > Wednesday, April 4, 2018 3:11 PM, Nélio Laranjeiro: > > > > > > > Subject: Re: [PATCH] net/mlx5: fix link status initialization > > > > > > > > > > > > > > On Wed, Apr 04, 2018 at 09:58:33AM +0000, Shahaf Shuler wrote: > > > > > > > > Wednesday, April 4, 2018 10:30 AM, Nélio Laranjeiro: > > > > > > > > > Subject: Re: [PATCH] net/mlx5: fix link status > > > > > > > > > initialization > > > > > > > > > > > > > > > > > > On Tue, Apr 03, 2018 at 07:48:17AM +0300, Shahaf Shuler wrote: > > > > > > > > [..] > > > > > > > > > > > > > > > > > > According to your analysis, this is only necessary when the LCS > > > > > > > is configured in the device. Why not adding this call to > > > > > > > mlx5_dev_interrupt_handler_install() which is responsible to > > > > > > > install the LCS callback. > > > > > > > > > > > > I think it is good practice whether or not LSC is set. > > > > > > The link status should be initialized to the correct value after > > > > > > the probe. > > > > > > > > > > There is no guarantee the link will be accurate, at probe time the > > > > > link may be up so internal information has a status up with a speed > > > > > with > > > this patch. > > > > > The application probes a second port, in the mean time the link of > > > > > the first port goes down, the interrupt is still not installed and > > > > > the internal status becomes wrong (still up whereas the port is down). > > > > > > > > > > Finally at start, the device installs the handler, but the link is > > > > > still down whereas internally it is up, the application will call > > > > > rte_eth_link_get_nowait() which will directly copy the wrong > > > > > internal status to the application. > > > > > > > > This is not correct. > > > > Using Verbs, the async_fd on which link status interrupts are reported > > > > is > > > created on probe. > > > > Even if the interrupt handler is not installed, interrupts still > > > > trigger on this fd. They will be processed when the interrupt handler > > > > will be installed as part of the port start. > > > > So in fact you have the whole trace on the link status changes waiting > > > > to be processed upon port start. > > > > > > Right, but in such case, this patch still does not solves the issue. > > > Until the dev_start() is called, the link may be inconsistent with the > > > real > > > status. > > > > > > example: > > > > > > pci_probe --> link is up. > > > leaving pci probe, the link goes downs --> internally the PMD has a link > > > up. > > > > > > Until the dev_start() is called any call to rte_eth_link_get_nowait() > > > will copy > > > the internal PMD status which is not accurate. > > > > > > From this point, the issue seems to fully come from > > > rte_eth_link_get_nowait() which should not make any copy until the port is > > > not started, until then the link may be inconsistent between the driver > > > and > > > the device. > > > > Right. However fixing it only on the ethdev layer is not enough. > > > > Assume the link was up from the beginning. > > For the case were the call for rte_eth_link_get_nowait happens only > > after the port start, the link will still be wrong as it will be > > reported as down (and no interrupt to fix it). > > > > So to summarize I plan to have this patch (with modification of the > > wait_to_complete) along with another fix for ethdev. > > Do we have an agreement? > > Sorry but no, as MLX5 is not the only PMD impacted by this issue and a > solution can be done for all of them (ixgbe, virtio, MLX5 I did not > check the others). > > Your proposition does not solves the following case neither, when the > link goes down and the application stops the port due to that, if the > application waits using the same API to restart the port, it will never > occur as the interrupts handlers are no more installed and thus the > internal link will never be updated remaining indefinitely down. This > also impact the three PMD above. > > Commit b77d21cc2364 ("ethdev: add link status get/set helper functions") > has been integrated a little too fast introducing this bug. > > All those issues can be fixed by adding another statement in the if of > the function rte_eth_link_get_nowait() > > - if (dev->data->dev_conf.intr_conf.lsc) > + if (dev->data->dev_conf.intr_conf.lsc && dev->data->dev_started) > rte_eth_linkstatus_get(dev, eth_link); > else { > RTE_FUNC_PTR_OR_RET(*dev->dev_ops->link_update); > (*dev->dev_ops->link_update)(dev, 1); > *eth_link = dev->data->dev_link; > } > > Doing this fixes all PMD having the same bug, and remove your specific > patch for MLX5 which becomes useless.
Making our private little discussion public. Indeed with my proposal MLX5 still needs to make a first link update to have the internal value set in case the link is consistent until the dev_start() (i.e. up before the probing), otherwise, the returned value from rte_eth_link_get_nowait() will be down when the device is started. Regards, -- Nélio Laranjeiro 6WIND