On 1/11/23 16:09, Ales Musil wrote: > > > On Wed, Jan 11, 2023 at 3:52 PM Ilya Maximets <[email protected] > <mailto:[email protected]>> wrote: > > On 1/11/23 14:09, Ales Musil wrote: > > > > > > On Wed, Jan 11, 2023 at 1:34 PM Ilya Maximets <[email protected] > <mailto:[email protected]> <mailto:[email protected] > <mailto:[email protected]>>> wrote: > > > > On 1/11/23 07:53, Ales Musil wrote: > > > > > > > > > On Tue, Jan 10, 2023 at 11:08 PM Ilya Maximets > <[email protected] <mailto:[email protected]> <mailto:[email protected] > <mailto:[email protected]>> <mailto:[email protected] > <mailto:[email protected]> <mailto:[email protected] > <mailto:[email protected]>>>> wrote: > > > > > > On 12/16/22 13:30, Ales Musil wrote: > > > > Currently, the CT can be flushed by dpctl only be specifying > > > > > > *by > > > > > > > the whole 5-tuple. This is not very convenient when there > are > > > > only some fields known to the user of CT flush. Add new > struct > > > > ofputil_ct_match which represents the generic filtering > that can > > > > be done for CT flush. The match is done only on fields that > are > > > > non-zero with exception to the icmp fields. > > > > > > > > This allows the filtering just within dpctl, however > > > > it is a preparation for OpenFlow extension. > > > > > > > > Reported-at: https://bugzilla.redhat.com/2120546 > <https://bugzilla.redhat.com/2120546> <https://bugzilla.redhat.com/2120546 > <https://bugzilla.redhat.com/2120546>> <https://bugzilla.redhat.com/2120546 > <https://bugzilla.redhat.com/2120546> <https://bugzilla.redhat.com/2120546 > <https://bugzilla.redhat.com/2120546>>> > > > > Signed-off-by: Ales Musil <[email protected] > <mailto:[email protected]> <mailto:[email protected] > <mailto:[email protected]>> <mailto:[email protected] > <mailto:[email protected]> <mailto:[email protected] > <mailto:[email protected]>>>> > > > > Acked-by: Paolo Valerio <[email protected] > <mailto:[email protected]> <mailto:[email protected] > <mailto:[email protected]>> <mailto:[email protected] > <mailto:[email protected]> <mailto:[email protected] > <mailto:[email protected]>>>> > > > > --- > > > > v5: Add ack from Paolo. > > > > v4: Fix a flush all scenario. > > > > v3: Rebase on top of master. > > > > Address the C99 comment and missing dpif_close call. > > > > v2: Rebase on top of master. > > > > Address comments from Paolo. > > > > --- > > > > > > Hi, Ales. Thanks for working on this change. > > > Sorry for review taking so long. I initially though there > > > will be mostly a few cosmetic/style changes needed. But > > > longer I look at this code, more questions I have. See below. > > > > > > Dumitru, maybe you can also help answer some of them? > > > > > > Best regards, Ilya Maximets. > > > > > > > > > Hi Ilya, > > > > > > thank you for the review. Before proceeding I tried to answer > your concerns/questions. > > > Let me know what you think. > > > > > > Thanks, > > > Ales > > > > > > > > > > > > > NEWS | 2 + > > > > include/openvswitch/ofp-util.h | 28 +++ > > > > lib/automake.mk <http://automake.mk> <http://automake.mk > <http://automake.mk>> <http://automake.mk <http://automake.mk> > <http://automake.mk <http://automake.mk>>> | 2 + > > > > lib/ct-dpif.c | 208 +++++++++------------ > > > > lib/ct-dpif.h | 4 +- > > > > lib/dpctl.c | 42 ++--- > > > > lib/dpctl.man | 3 +- > > > > lib/ofp-ct-util.c | 326 > +++++++++++++++++++++++++++++++++ > > > > lib/ofp-ct-util.h | 36 ++++ > > > > tests/system-traffic.at <http://system-traffic.at> > <http://system-traffic.at <http://system-traffic.at>> > <http://system-traffic.at <http://system-traffic.at> > <http://system-traffic.at <http://system-traffic.at>>> | 82 ++++++++- > > > > 10 files changed, 586 insertions(+), 147 deletions(-) > > > > create mode 100644 lib/ofp-ct-util.c > > > > create mode 100644 lib/ofp-ct-util.h > > > > > > > > diff --git a/NEWS b/NEWS > > > > index 265375e1c..ff8904b02 100644 > > > > --- a/NEWS > > > > +++ b/NEWS > > > > @@ -14,6 +14,8 @@ Post-v3.0.0 > > > > 10 Gbps link speed by default in case the actual link > speed cannot be > > > > determined. Previously it was 10 Mbps. Values can > still be overridden > > > > by specifying 'max-rate' or '[r]stp-path-cost' > accordingly. > > > > + - ovs-dpctl and related ovs-appctl commands: > > > > > > "ovs-dpctl and 'ovs-appctl dpctl/'" might be a better header. > > > > > > > + * "flush-conntrack" is capable of handling partial > 5-tuple. > > > > > > > > > > > > v3.0.0 - 15 Aug 2022 > > > > diff --git a/include/openvswitch/ofp-util.h > b/include/openvswitch/ofp-util.h > > > > index 091a09cad..84937ae26 100644 > > > > --- a/include/openvswitch/ofp-util.h > > > > +++ b/include/openvswitch/ofp-util.h > > > > @@ -19,6 +19,9 @@ > > > > > > > > #include <stdbool.h> > > > > #include <stdint.h> > > > > +#include <sys/types.h> > > > > +#include <netinet/in.h> > > > > + > > > > #include "openvswitch/ofp-protocol.h" > > > > > > > > struct ofp_header; > > > > @@ -27,6 +30,31 @@ struct ofp_header; > > > > extern "C" { > > > > #endif > > > > > > > > +struct ofputil_ct_tuple { > > > > + struct in6_addr src; > > > > + struct in6_addr dst; > > > > + > > > > + union { > > > > + ovs_be16 src_port; > > > > + ovs_be16 icmp_id; > > > > + }; > > > > + union { > > > > + ovs_be16 dst_port; > > > > + struct { > > > > + uint8_t icmp_code; > > > > + uint8_t icmp_type; > > > > + }; > > > > + }; > > > > +}; > > > > + > > > > +struct ofputil_ct_match { > > > > + uint8_t ip_proto; > > > > + uint16_t l3_type; > > > > + > > > > + struct ofputil_ct_tuple tuple_orig; > > > > + struct ofputil_ct_tuple tuple_reply; > > > > +}; > > > > > > These structures are very similar to existing ct_dpif_tuple. > > > We can keep them for now, but I think, we'll need to replace > > > ct_dpif_tuple with the new ofputil_ct_tuple in the future. > > > We may need some changes in ofputil_ct_tuple for that though. > > > > > > > > > I was thinking about replacing them, however it would require a > much > > > bigger change, which was out of scope of this RFE. Also I was not > entirely > > > sure if there wasn't any specific reason for the ct_dpif_tuple > specifically > > > the address part. It seems like optimization to me. > > > > > > Anyway it should be possible to unify those if needed. > > > > > > > > > > > > > > > OTOH, I'm not sure I understand why we need both tuples here. > > > Can we have a tuple and a flag that indicates a direction > > > instead? > > > > > > > > > That is also a possibility. However that wouldn't fit too that > nicely in the > > > whole match struct, we could probably duplicate the proto field > and get rid of the > > > l3_type. That way we could supply just two tuples instead, > getting rid of the match completely. > > > > The point was to have only one tuple. See below. > > > > > > > > > > > > > > There are several issues with having both original and reply > > > tuples, IIUC: > > > > > > 1. None of the command line interfaces support specifying > > > reply tuple. > > > > > > > > > That was the intention of the original command AFAICT, we could > modify the commands > > > to support that. However I'm not sure if there is any benefit in > that. > > > > ovs-ofctl supposed to mimic the OpenFlow request, so it should > > support everything that OpenFlow request can have. > > > > > > I see, sure that can definitely be done, however we should probably > think how this can > > be done in a more universal way that could be used also for dpctl. > > > > > > > > > > > > > > > > 2. If the reply tuple will be specified via OpenFlow, it > > > will be ignored if the original direction tuple is > > > fully specified. > > > > > > > > > That is ok IMO, because a fully specified original should be > enough to determine the one > > > entry that needs to be erased. So the reply direction in that > would be an extra. > > > > That supports the point that we don't need two tuples in the > request. > > > > > > Right but only if we are able to supply the full original direction. > See below. > > > > > > > > > > > > > > > > > > 3. Netfilter kernel interface allows specifying only one > > > tuple, not both, AFAICT. > > > > > > > > > Right, that's the reason why, when we match, we use the original > tuple to erase that > > > entry. > > > > > > > > > > > > I'm not a conntrack expert, so I have a question: Is it > possible > > > that addresses and ports in original and reply directions are > > > not mirrored copies? Or do we care about these cases? > > > > > > > > > Not sure if it's possible, but even if it would be, the current > implementation > > > would account for that, with the exception of specifying fully > the original > > > direction. > > > > > > > > > > > > If we want to flush all entries related to the same IP:PORT, > > > we will not be able to do that in a single OpenFlow request, > > > because we'll need to supply the same IP and PORT as src > > > and dst, or as src in both original and reply directions. > > > None of these can fit into a single request. > > > > > > > > > Is that really a problem? You could do that in two, one for the > original > > > one for reply. This is also very similar to the interface > supported by conntrack-tools, > > > with those you couldn't remove all entries in a single call as > well. We could modify > > > the interface that would allow matching on tuple without > direction specification, > > > that would translate to both directions, however there is still > the question if that's needed. > > > > > > It's not a problem, but another supporting point for the interface > > being redundant. We don't seem to need two tuples in the request, > > because users will have to send two different requests anyway. > > > > So, the request should have: > > > > 1. zone > > 2. nw_proto > > 3. tuple > > 4. tuple type ('original' or 'reply'), in case we want to allow > > users to flush based on reply tuple. But this seems a bit > > redundant as well. > > > > Something like: > > > > struct ofputil_ct_tuple { > > } > > > > struct ofputil_ct_match { > > uint8_t ip_proto; > > uint16_t l3_type; > > // enum DIRECTION direction; > > struct ofputil_ct_tuple tuple; > > }; > > > > > > And the OpenFlow part: > > > > /* CT flush available TLVs. */ > > enum nx_ct_flush_tlv_type { > > NXT_CT_ZONE_ID, /* u8: CT zone id. */ > > NXT_CT_TUPLE, /* Nested NXT_CT_TUPLE_* attributes. > */ > > }; > > > > enum nx_ct_flush_tuple_type { > > NXT_CT_TUPLE_SRC, /* be128: source IPv6 or mapped IPv4 > address. */ > > NXT_CT_TUPLE_DST, /* be128: destination IPv6 or mapped > IPv4 address. */ > > NXT_CT_TUPLE_SRC_PORT, /* be16: source port. */ > > NXT_CT_TUPLE_DST_PORT, /* be16: destination port. */ > > NXT_CT_TUPLE_ICMP_ID, /* be16: ICMP id. */ > > NXT_CT_TUPLE_ICMP_TYPE, /* u8: ICMP type. */ > > NXT_CT_TUPLE_ICMP_CODE, /* u8: ICMP code. */ > > // NXT_CT_TUPLE_DIRECTION, /* enum > nx_ct_flush_tuple_direction: CT tuple direction. */ > > }; > > > > //enum nx_ct_flush_tuple_direction { > > // NXT_CT_TUPLE_DIRECTION_ORIG = 1, > > // NXT_CT_TUPLE_DIRECTION_REPLY = 2, > > //} > > > > struct nx_ct_flush { > > uint8_t nw_proto; /* IP protocol. */ > > uint8_t pad[7]; /* Align to 64-bit (must be zero). */ > > /* Followed by ... */ > > }; > > > > > > Such interface will also allow us in to have both direction > specified > > if we would want that in the future. For now we should check that > > each TLV type is specified once, i.e. only one tuple is in the > request. > > > > We may or may not add NXT_CT_TUPLE_DIRECTION support now. Can add > it > > in later releases if necessary. Tuples without that TLV can be > treated > > as original direction. > > > > > > The original idea was something similar but as stated below that > wouldn't really help > > us in the OVN case. And there is probably no significant benefit in > having partial match > > with only single direction at time. I'm not sure if I'm not missing > something obvious. > > > > > > > > > > > > > > > > At this point is there actually any value in flushing basing > > > on reply direction? Shouldn't we be able to flush the same > > > entries by flushing by dst in original direction, if we will > > > need to issue multiple requests anyway? > > > > > > > > > That assumption would be fine when you can specify a full tuple > for the original direction. > > > The problem is that for example OVN we know only the original > DST_IP:DST_PORT (VIP) > > > and the reply SRC_IP:SRC_PORT (backend). > > > > Isn't original DST_IP:DST_PORT equal to reply SRC_IP:SRC_PORT ? > > > > > > Unfortunately no, it's DNATed so for example > > 'ovn-nbctl lb-add lb1 30.0.0.1 "172.16.1.2,172.16.1.3,172.16.1.4"' > > will have CT entries as follows: > > > > > tcp,orig=(src=192.168.1.2,dst=30.0.0.1,sport=1,dport=80),reply=(src=172.16.1.2,dst=192.168.1.2,sport=80,dport=1),zone=0,mark=2 > > > tcp,orig=(src=192.168.1.2,dst=30.0.0.1,sport=2,dport=80),reply=(src=172.16.1.3,dst=192.168.1.2,sport=80,dport=2),zone=0,mark=2 > > > tcp,orig=(src=192.168.1.2,dst=30.0.0.1,sport=3,dport=80),reply=(src=172.16.1.4,dst=192.168.1.2,sport=80,dport=3),zone=0,mark=2 > > > > We cannot simply filter on the original direction, as we cannot infer > the SRC part (without filtering all entries etc.). > > OK. This is what I was asking above and you said > 'Not sure if it's possible'. So, it is possible, > I see. > > > Yeah my bad I did not connect the dots right away. > > > > > > > > > > > > > > That means we cannot fully specify the original direction > > > without doing the filtering ourselves. It would be possible with > extra steps from the user. If this interface > > > is too generic/complicated for that purpose we could say that the > flush extension supports only full orig tuple > > > and leave the filtering up to the user. That would definitely > work, it wouldn't be that elegant for the user however > > > from the code change perspective much simpler. > > > > I'm not sure I understand that part. I'm not arguing aginst partial > > matches. Or do you mean something else? > > > > > > > > I can see a certain benefit of keeping it in OvS: > > > > > > 1) The filtering complexity is hidden from the user. > > > 2) The data transfer between OvS and the user is much smaller > (even if you need to send two matches). > > > 3) Once implemented it's just ready to be used without the need > of additional thoughts going into it. > > > > > > > > > > > > > > > > > > Did you start working on the OVN side of things? Which > > > commands OVN will actually use? > > > > > > > > > Yes, I did. The plan is that OVN will use the extension. > > > > I understand that. The question was how the actual request will > > look like. > > > > > > The testing version used something like this: > > > > memset(match, 0, sizeof *match); > > > > struct ofputil_ct_tuple *orig = &match->tuple_orig; > > struct ofputil_ct_tuple *reply = &match->tuple_reply; > > > > if (IN6_IS_ADDR_V4MAPPED(&tuple->vip_ip)) { > > match->l3_type = AF_INET; > > } else { > > match->l3_type = AF_INET6; > > } > > > > match->ip_proto = tuple->proto; > > > > orig->dst = tuple->vip_ip; > > orig->dst_port = htons(tuple->vip_port); > > > > reply->src = tuple->backend_ip; > > reply->src_port = htons(tuple->backend_port); > > > > tuple is the struct containing lb info, so VIP, VIP_PORT, BACKEND, > BACKEND_PORT, PROTO. > > match is the "struct ofputil_ct_match". > > > > This is with older version of this RFE but it should illustrate the > point. > > Assuming the current v5 implementation of the extension, will the end > result of the request be different if we remove the assignment of > 'orig=>dst' and 'orig->dst_port' from the code above? > > > It might delete more than we actually want to. Without specifying the LBs > 5-tuple > we might always remove more entries than we should. As we can have multiple > backends > for the same VIP (that's the whole point of LB), however nothing prevents us > from two VIPs > pointing to the same backend. Which would cause an issue if we will go only > with the reply part.
OK. That makes sense. Let's keep both tuples then. But we need to allow ovs-ofctl to accept both tuples. And we should use a "shortcut" only if reply tuple is not specified, because if I'm asking to delete: orig=(src=192.168.1.2,dst=30.0.0.1,sport=1,dport=80),reply=(src=172.16.1.3,dst=192.168.1.2,sport=80,dport=1) I'm not expecting the following entry to be deleted: orig=(src=192.168.1.2,dst=30.0.0.1,sport=1,dport=80),reply=(src=172.16.1.2,dst=192.168.1.2,sport=80,dport=1) Even though these entries probably cannot exist at the same time, we should not delete the entry that we wasn't asked to delete. Best regards, Ilya Maximets. _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
