Hi,

On Wed, Jun 6, 2018 at 6:03 PM, Tobias Hommel <netdev-l...@genoetigt.de> wrote:
> Sorry no progress until now, I currently do not get time to have a deeper look
> into that. We're back to 4.1.6 right now.

Thanks for letting me know. In the project I am currently involved in,
we unfortunately don't have the option of reverting the kernel, so we
are finding ways to live with the error. We have been looking into the
error a bit more, and have made the following observations:

* First of all, as discussed earlier in the thread, the error is
triggered by dst_orig being NULL. Our current work-around is just to
return from xfrm_lookup if dst_orig is NULL and this seems to work
fine, the error doesn't happen that often (in our use-cases at least).
* The machine we use for testing (and where we first saw the error) is
used as initiator.
* When we compare the logs from Strongswan with the ones from the
kernel, it seems that the error is typically triggered when a tunnels
is teared down/about to come up. We need quite a lot of tunnels for
the error to trigger, usually around 30+. I guess this might point to
some race or some condition not being met when packets are
sent/received.
* We see the error much more frequently when hardware encryption is enabled.
* Yesterday, we upgraded the kernel from 4.14.34 to 4.14.48, and the
error happens much less frequently. I see that 4.14.48 includes
several IPsec fixes (for example the previously mentioned ("xfrm: Fix
a race in the xdst pcpu cache.")).

BR,
Kristian

Reply via email to