Hi Ilya,

On 6/14/2022 4:03 PM, Ilya Maximets wrote:
On 6/14/22 10:27, Oz Shlomo via dev wrote:
On 6/8/2022 3:16 AM, Frode Nordahl wrote:
On Tue, Jun 7, 2022 at 12:16 AM Ilya Maximets <[email protected]> wrote:
On 5/31/22 23:48, Ilya Maximets wrote:
On 5/31/22 21:15, Frode Nordahl wrote:
On Mon, May 30, 2022 at 5:25 PM Frode Nordahl
<snip>

I've pushed the first part of the fix here:
https://mail.openvswitch.org/pipermail/ovs-dev/2022-May/394450.html

Thanks!  I saw that and I tend to think that it is correct.  I'll try to test it and apply in the next couple of days.

One question about the test above: which entity actually adds the ct_state to the packet, and at which moment does that happen?  I see it, but I'm not sure I fully understand it.  Looks like I'm missing something obvious.

I found what is going on there - the kernel simply tracks everything unless told not to, so ICMP packets create the ct entry and subsequent packets re-use it, which is why ICMP replies have +trk set while entering OVS.

Great, my hunch was that something along these lines was happening as well.  I have to admit the test case was found by locating something closest to the real-life use case, and it proved to work as a good test for this condition.

----

Let's have some summary of the issues discovered here so far, including a few new issues:

1. ct states set externally are not tracked correctly by OVS.
   Fix: https://mail.openvswitch.org/pipermail/ovs-dev/2022-May/394450.html
   Status: LGTM, will apply soon.

   This fixes the problem originally reported by Liam, IIUC.  Right?

Yes.

2. Kernel ct() actions are trying to re-use the cached connection after the tuple changes.  This ends up being the OVN hairpin issue as reported here:
   https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/1967856
   Proposed Fix: https://lore.kernel.org/netdev/[email protected]/T/#u
   Status: Needs review.

I can confirm that the proposed fix resolves the OVN hairpin issue.  It also looks simple enough to be backportable all the way to where we would need it (kernel 5.4.0).
I'll have a look at giving this wider exposure in an internal CI environment as a canary for any unintended consequences, if that would be helpful.

Sorry for jumping in late on this, as this patch was already accepted to the kernel.  However, I have a concern that this patch would break the tc datapath when ovs hw offload is enabled.  IIUC, this patch adds an implicit ovs_ct_clear call for 5-tuple modify actions.  However, this implicit action will not apply to flows that use the tc datapath.  Frode, can you verify whether this fix indeed breaks the OVN hairpin use case when hw offload is enabled?

Hi, Oz.  I don't think that this kernel fix breaks the TC datapath, as packets processed by the openvswitch kernel module will not go back to TC for further processing, IIUC.  Also, it's not a full ct_clear, because we're not clearing the flow key.
A flow datapath is either in tc or in ovs.  If hardware offload is enabled, then ovs will create a tc flower entry.  Therefore, packets for that flow will be processed by tc and not by openvswitch.  Note that hardware offload may be enabled even if there is no supporting hardware; the TC software datapath is designed to be functionally equivalent to ovs.
tc is processed before openvswitch in the kernel pipeline.  Therefore, if a packet is matched by tc, it will not continue to openvswitch.  Hence my concern that the openvswitch change will not apply if ovs hardware offload is enabled.
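To make the datapath split concrete, here is a rough sketch of how offload is toggled and inspected (the port name 'eth0' is a placeholder, and the service name in the restart command is distribution-specific):

```shell
# Enable OVS hardware offload; takes effect for newly installed flows.
# This works even without offload-capable NICs - flows then run in the
# tc software datapath instead of the openvswitch kernel module.
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true

# Restart the daemon so the setting is picked up (Debian/Ubuntu name).
systemctl restart openvswitch-switch

# Show the datapath flows that were offloaded.
ovs-appctl dpctl/dump-flows type=offloaded

# The same rules, seen from the tc side as flower filters
# ('eth0' stands in for an OVS port's netdev).
tc filter show dev eth0 ingress
```

These are configuration/diagnostic commands only; the point is that once the flower entry exists, packets for that flow never reach the openvswitch module.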
But I agree that the original bug exists in TC as well, since TC just copied the ct() recirculation optimization from openvswitch.  So, if there are subsequent ct actions with pedit in between, TC will have the same problem with misclassification that OVS had before kernel fix 2061ecfdf235.
Right.
So, a similar fix should be implemented for TC as well.  However,
I'm not sure how to actually do that, because ct and pedit are
not really connected in the kernel.  The issue might be fixed as a
side effect of fixes for issue #5 in the list here, I guess,
but that's not really a correct fix.  The reason why it should be
fixed in the kernel is that the user doesn't really know that the TC
or openvswitch module cached the connection; the user didn't ask for
it to be cached and re-used, they only wanted to populate the
current flow key with the ct_{state,mark,label} or commit some
changes.  The TC/openvswitch kernel module decided to cache the nfct,
so it should handle a possible mismatch if the packet got changed.
Does that make sense?
Indeed, changing the tc pedit action is not a possibility.  We did copy the caching optimization from ovs when implementing tc act_ct.  I wonder if we could remove the optimization.  According to the comment in the code, the caching mechanism was designed to optimize the ct(commit) execution, as ovs connections have to be explicitly committed.  Perhaps we can also consider the other approach that you suggested: verifying that the cached 5-tuple was not changed.
However, I do remember that the OVN pod-to-external pipeline actually relies on this optimization when executing ct(nat) -> recirc -> ct for identical zones.  Without the optimization, the second ct would miss, because the NATed entry was never committed to the ct table.
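For illustration, the pattern described in the previous paragraph looks roughly like the following OpenFlow flows (the bridge name, table numbers, and zone are made up for this sketch).  The second ct() looks up in the same zone after NAT; without the cached connection, it would search for the NATed tuple, which was never committed:

```shell
# Hypothetical flows sketching ct(nat) -> recirc -> ct in one zone.
# ct(...,table=N) recirculates the packet into table N after conntrack.
ovs-ofctl add-flow br-int "table=0,priority=10,ip,actions=ct(zone=5,nat,table=1)"
ovs-ofctl add-flow br-int "table=1,priority=10,ip,ct_state=+trk,actions=ct(zone=5,table=2)"
ovs-ofctl add-flow br-int "table=2,priority=10,ip,actions=NORMAL"
```

This is a configuration fragment only, meant to show why simply removing the caching optimization could regress this pipeline.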
3. ct_clear is not a reversible action, because it requires passing the packet through conntrack again in order to restore the conntrack state.

   Fix: Move CT_CLEAR to the group of irreversible actions in reversible_actions() in ofproto/ofproto-dpif-xlate.c.
   Status: An easy fix that does nothing useful without fixes for the later issues, because we clear the ct_state before the patch_port_output() processing and optimize out the CT_CLEAR action.

4. (new!) The reversible_actions() optimization is logically incorrect.  The reason is that reversible_actions() doesn't look further than the list of actions of the first OF rule in the second bridge.  If the list of actions contains a resubmit, there can still be irreversible actions, and the packet will be irreversibly modified by the patch port processing without cloning.

Simple reproducer:

diff --git a/tests/ofproto-dpif.at b/tests/ofproto-dpif.at
index dbb3b6dda..be30ad5bf 100644
--- a/tests/ofproto-dpif.at
+++ b/tests/ofproto-dpif.at
@@ -8736,7 +8736,8 @@ OVS_VSWITCHD_START(
 AT_CHECK([ovs-ofctl -O OpenFlow13 add-meter br1 'meter=1 pktps stats bands=type=drop rate=2'])
 AT_CHECK([ovs-ofctl del-flows br0])
 AT_CHECK([ovs-ofctl -O OpenFlow13 add-flow br0 in_port=local,ip,actions=2,1])
-AT_CHECK([ovs-ofctl -O OpenFlow13 add-flow br1 in_port=1,ip,actions=meter:1,3])
+AT_CHECK([ovs-ofctl -O OpenFlow13 add-flow br1 "in_port=1,ip,actions=resubmit(,7)"])
+AT_CHECK([ovs-ofctl -O OpenFlow13 add-flow br1 table=7,in_port=1,ip,actions=meter:1,3])
 AT_CHECK([ovs-appctl ofproto/trace ovs-dummy 'in_port(100),eth(src=f8:bc:12:44:34:b6,dst=f8:bc:12:46:58:e0),eth_type(0x0800),ipv4(src=10.1.1.22,dst=10.0.0.3,proto=6,tos=0,ttl=64,frag=no),tcp(src=53295,dst=8080)'], [0], [stdout])
 AT_CHECK([tail -1 stdout], [0],
---

   Status: unclear.  One idea is to reverse the logic and check the datapath actions for being reversible when going up in the recursion, instead of trying to predict all the future actions while going down.  Ideas are welcome.
   Hammerhead approach: mark 'resubmit' as an irreversible action.  But that would effectively mean that we're always cloning on patch port output, which is not great; see issue #8.

5. Packets after the ct() in a non-forked pipeline must be untracked.

   Status: unclear.  The fix for issue #2 will cover the cases for already-tracked packets, but if the packet is supposed to be untracked and it is tracked, then we must emit the ct_clear action from the userspace.  The most viable approach, I guess, is the previously proposed 'pending_ct_clear' fix.  The alternative is to always emit ct_clear right after the ct() action and not lose our minds trying to track the pending action in all the weird cases.

6. Conntrack state must not be preserved while going through the patch port.

   Status: unclear.  The right solution would be to emit the ct_clear datapath action after cloning for the patch port.  A few problems here:
   - the current code in ofproto/ofproto-dpif-xlate.c will not actually allow us to do that; we need to fix issue #4 first to be able to inject new actions post factum.
   - ct_clear is non-reversible (issue #3), meaning that if we're adding ct_clear, we have to clone even if the patch port processing doesn't have any ct() actions or recirculations.
   Excessive ct_clears and clones can be avoided by fixing issue #4 and examining the list of datapath actions created by the patch port processing afterwards, to decide if we need to clone or insert ct_clear.

7. (new!) Userspace conntrack implementation fails to perform a correct lookup if the cached connection was cleared during the tuple change.
   The issue can be seen with the OVN hairpin issue reproducer:
   https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/1967856/comments/6
   by running 'make check-system-userspace'.  I'm not 100% sure what is going on there, but the test fails because packets are marked as invalid due to a conn_not_found() -> valid_new() failure for the ICMP reply.

8. Not really a bug, but all of the above implies adding more clone()'s, and they do not play well with HW offloading, i.e. they simply break it.  So, it makes sense to think about how we can minimize the impact.

Thank you for breaking down and detailing all of the issues we have discovered as part of this discussion (and some more!).  HW offloading is important to us as well, and OVS/OVN plays an important part in developing the technology into something that will soon be part of everyone's daily lives.  It's hard work to retain the full flexibility/programmability of OVS in this world, and I'm convinced it will be worth it.  I'll have a look and contribute any thoughts.
I also wonder if any of the hardware guys here have opinions and want to chime in on how we best solve these challenges?

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
