On Wed, Jan 11, 2023 at 3:52 PM Ilya Maximets <[email protected]> wrote:
> On 1/11/23 14:09, Ales Musil wrote: > > > > > > On Wed, Jan 11, 2023 at 1:34 PM Ilya Maximets <[email protected] > <mailto:[email protected]>> wrote: > > > > On 1/11/23 07:53, Ales Musil wrote: > > > > > > > > > On Tue, Jan 10, 2023 at 11:08 PM Ilya Maximets <[email protected] > <mailto:[email protected]> <mailto:[email protected] <mailto: > [email protected]>>> wrote: > > > > > > On 12/16/22 13:30, Ales Musil wrote: > > > > Currently, the CT can be flushed by dpctl only be specifying > > > > > > *by > > > > > > > the whole 5-tuple. This is not very convenient when there are > > > > only some fields known to the user of CT flush. Add new > struct > > > > ofputil_ct_match which represents the generic filtering that > can > > > > be done for CT flush. The match is done only on fields that > are > > > > non-zero with exception to the icmp fields. > > > > > > > > This allows the filtering just within dpctl, however > > > > it is a preparation for OpenFlow extension. > > > > > > > > Reported-at: https://bugzilla.redhat.com/2120546 < > https://bugzilla.redhat.com/2120546> <https://bugzilla.redhat.com/2120546 > <https://bugzilla.redhat.com/2120546>> > > > > Signed-off-by: Ales Musil <[email protected] <mailto: > [email protected]> <mailto:[email protected] <mailto:[email protected]>>> > > > > Acked-by: Paolo Valerio <[email protected] <mailto: > [email protected]> <mailto:[email protected] <mailto: > [email protected]>>> > > > > --- > > > > v5: Add ack from Paolo. > > > > v4: Fix a flush all scenario. > > > > v3: Rebase on top of master. > > > > Address the C99 comment and missing dpif_close call. > > > > v2: Rebase on top of master. > > > > Address comments from Paolo. > > > > --- > > > > > > Hi, Ales. Thanks for working on this change. > > > Sorry for review taking so long. I initially though there > > > will be mostly a few cosmetic/style changes needed. But > > > longer I look at this code, more questions I have. See below. > > > > > > Dumitru, maybe you can also help answer some of them? > > > > > > Best regards, Ilya Maximets. > > > > > > > > > Hi Ilya, > > > > > > thank you for the review. Before proceeding I tried to answer your > concerns/questions. > > > Let me know what you think. > > > > > > Thanks, > > > Ales > > > > > > > > > > > > > NEWS | 2 + > > > > include/openvswitch/ofp-util.h | 28 +++ > > > > lib/automake.mk <http://automake.mk> <http://automake.mk < > http://automake.mk>> | 2 + > > > > lib/ct-dpif.c | 208 +++++++++------------ > > > > lib/ct-dpif.h | 4 +- > > > > lib/dpctl.c | 42 ++--- > > > > lib/dpctl.man | 3 +- > > > > lib/ofp-ct-util.c | 326 > +++++++++++++++++++++++++++++++++ > > > > lib/ofp-ct-util.h | 36 ++++ > > > > tests/system-traffic.at <http://system-traffic.at> < > http://system-traffic.at <http://system-traffic.at>> | 82 > ++++++++- > > > > 10 files changed, 586 insertions(+), 147 deletions(-) > > > > create mode 100644 lib/ofp-ct-util.c > > > > create mode 100644 lib/ofp-ct-util.h > > > > > > > > diff --git a/NEWS b/NEWS > > > > index 265375e1c..ff8904b02 100644 > > > > --- a/NEWS > > > > +++ b/NEWS > > > > @@ -14,6 +14,8 @@ Post-v3.0.0 > > > > 10 Gbps link speed by default in case the actual link > speed cannot be > > > > determined. Previously it was 10 Mbps. Values can > still be overridden > > > > by specifying 'max-rate' or '[r]stp-path-cost' > accordingly. > > > > + - ovs-dpctl and related ovs-appctl commands: > > > > > > "ovs-dpctl and 'ovs-appctl dpctl/'" might be a better header. > > > > > > > + * "flush-conntrack" is capable of handling partial > 5-tuple. > > > > > > > > > > > > v3.0.0 - 15 Aug 2022 > > > > diff --git a/include/openvswitch/ofp-util.h > b/include/openvswitch/ofp-util.h > > > > index 091a09cad..84937ae26 100644 > > > > --- a/include/openvswitch/ofp-util.h > > > > +++ b/include/openvswitch/ofp-util.h > > > > @@ -19,6 +19,9 @@ > > > > > > > > #include <stdbool.h> > > > > #include <stdint.h> > > > > +#include <sys/types.h> > > > > +#include <netinet/in.h> > > > > + > > > > #include "openvswitch/ofp-protocol.h" > > > > > > > > struct ofp_header; > > > > @@ -27,6 +30,31 @@ struct ofp_header; > > > > extern "C" { > > > > #endif > > > > > > > > +struct ofputil_ct_tuple { > > > > + struct in6_addr src; > > > > + struct in6_addr dst; > > > > + > > > > + union { > > > > + ovs_be16 src_port; > > > > + ovs_be16 icmp_id; > > > > + }; > > > > + union { > > > > + ovs_be16 dst_port; > > > > + struct { > > > > + uint8_t icmp_code; > > > > + uint8_t icmp_type; > > > > + }; > > > > + }; > > > > +}; > > > > + > > > > +struct ofputil_ct_match { > > > > + uint8_t ip_proto; > > > > + uint16_t l3_type; > > > > + > > > > + struct ofputil_ct_tuple tuple_orig; > > > > + struct ofputil_ct_tuple tuple_reply; > > > > +}; > > > > > > These structures are very similar to existing ct_dpif_tuple. > > > We can keep them for now, but I think, we'll need to replace > > > ct_dpif_tuple with the new ofputil_ct_tuple in the future. > > > We may need some changes in ofputil_ct_tuple for that though. > > > > > > > > > I was thinking about replacing them, however it would require a > much > > > bigger change, which was out of scope of this RFE. Also I was not > entirely > > > sure if there wasn't any specific reason for the ct_dpif_tuple > specifically > > > the address part. It seems like optimization to me. > > > > > > Anyway it should be possible to unify those if needed. > > > > > > > > > > > > > > > OTOH, I'm not sure I understand why we need both tuples here. > > > Can we have a tuple and a flag that indicates a direction > > > instead? > > > > > > > > > That is also a possibility. However that wouldn't fit too that > nicely in the > > > whole match struct, we could probably duplicate the proto field > and get rid of the > > > l3_type. That way we could supply just two tuples instead, getting > rid of the match completely. > > > > The point was to have only one tuple. See below. > > > > > > > > > > > > > > There are several issues with having both original and reply > > > tuples, IIUC: > > > > > > 1. None of the command line interfaces support specifying > > > reply tuple. > > > > > > > > > That was the intention of the original command AFAICT, we could > modify the commands > > > to support that. However I'm not sure if there is any benefit in > that. > > > > ovs-ofctl supposed to mimic the OpenFlow request, so it should > > support everything that OpenFlow request can have. > > > > > > I see, sure that can definitely be done, however we should probably > think how this can > > be done in a more universal way that could be used also for dpctl. > > > > > > > > > > > > > > > > 2. If the reply tuple will be specified via OpenFlow, it > > > will be ignored if the original direction tuple is > > > fully specified. > > > > > > > > > That is ok IMO, because a fully specified original should be > enough to determine the one > > > entry that needs to be erased. So the reply direction in that > would be an extra. > > > > That supports the point that we don't need two tuples in the request. > > > > > > Right but only if we are able to supply the full original direction. See > below. > > > > > > > > > > > > > > > > > > 3. Netfilter kernel interface allows specifying only one > > > tuple, not both, AFAICT. > > > > > > > > > Right, that's the reason why, when we match, we use the original > tuple to erase that > > > entry. > > > > > > > > > > > > I'm not a conntrack expert, so I have a question: Is it > possible > > > that addresses and ports in original and reply directions are > > > not mirrored copies? Or do we care about these cases? > > > > > > > > > Not sure if it's possible, but even if it would be, the current > implementation > > > would account for that, with the exception of specifying fully the > original > > > direction. > > > > > > > > > > > > If we want to flush all entries related to the same IP:PORT, > > > we will not be able to do that in a single OpenFlow request, > > > because we'll need to supply the same IP and PORT as src > > > and dst, or as src in both original and reply directions. > > > None of these can fit into a single request. > > > > > > > > > Is that really a problem? You could do that in two, one for the > original > > > one for reply. This is also very similar to the interface > supported by conntrack-tools, > > > with those you couldn't remove all entries in a single call as > well. We could modify > > > the interface that would allow matching on tuple without direction > specification, > > > that would translate to both directions, however there is still > the question if that's needed. > > > > > > It's not a problem, but another supporting point for the interface > > being redundant. We don't seem to need two tuples in the request, > > because users will have to send two different requests anyway. > > > > So, the request should have: > > > > 1. zone > > 2. nw_proto > > 3. tuple > > 4. tuple type ('original' or 'reply'), in case we want to allow > > users to flush based on reply tuple. But this seems a bit > > redundant as well. > > > > Something like: > > > > struct ofputil_ct_tuple { > > } > > > > struct ofputil_ct_match { > > uint8_t ip_proto; > > uint16_t l3_type; > > // enum DIRECTION direction; > > struct ofputil_ct_tuple tuple; > > }; > > > > > > And the OpenFlow part: > > > > /* CT flush available TLVs. */ > > enum nx_ct_flush_tlv_type { > > NXT_CT_ZONE_ID, /* u8: CT zone id. */ > > NXT_CT_TUPLE, /* Nested NXT_CT_TUPLE_* attributes. */ > > }; > > > > enum nx_ct_flush_tuple_type { > > NXT_CT_TUPLE_SRC, /* be128: source IPv6 or mapped IPv4 > address. */ > > NXT_CT_TUPLE_DST, /* be128: destination IPv6 or mapped > IPv4 address. */ > > NXT_CT_TUPLE_SRC_PORT, /* be16: source port. */ > > NXT_CT_TUPLE_DST_PORT, /* be16: destination port. */ > > NXT_CT_TUPLE_ICMP_ID, /* be16: ICMP id. */ > > NXT_CT_TUPLE_ICMP_TYPE, /* u8: ICMP type. */ > > NXT_CT_TUPLE_ICMP_CODE, /* u8: ICMP code. */ > > // NXT_CT_TUPLE_DIRECTION, /* enum > nx_ct_flush_tuple_direction: CT tuple direction. */ > > }; > > > > //enum nx_ct_flush_tuple_direction { > > // NXT_CT_TUPLE_DIRECTION_ORIG = 1, > > // NXT_CT_TUPLE_DIRECTION_REPLY = 2, > > //} > > > > struct nx_ct_flush { > > uint8_t nw_proto; /* IP protocol. */ > > uint8_t pad[7]; /* Align to 64-bit (must be zero). */ > > /* Followed by ... */ > > }; > > > > > > Such interface will also allow us in to have both direction specified > > if we would want that in the future. For now we should check that > > each TLV type is specified once, i.e. only one tuple is in the > request. > > > > We may or may not add NXT_CT_TUPLE_DIRECTION support now. Can add it > > in later releases if necessary. Tuples without that TLV can be > treated > > as original direction. > > > > > > The original idea was something similar but as stated below that > wouldn't really help > > us in the OVN case. And there is probably no significant benefit in > having partial match > > with only single direction at time. I'm not sure if I'm not missing > something obvious. > > > > > > > > > > > > > > > > At this point is there actually any value in flushing basing > > > on reply direction? Shouldn't we be able to flush the same > > > entries by flushing by dst in original direction, if we will > > > need to issue multiple requests anyway? > > > > > > > > > That assumption would be fine when you can specify a full tuple > for the original direction. > > > The problem is that for example OVN we know only the original > DST_IP:DST_PORT (VIP) > > > and the reply SRC_IP:SRC_PORT (backend). > > > > Isn't original DST_IP:DST_PORT equal to reply SRC_IP:SRC_PORT ? > > > > > > Unfortunately no, it's DNATed so for example > > 'ovn-nbctl lb-add lb1 30.0.0.1 "172.16.1.2,172.16.1.3,172.16.1.4"' > > will have CT entries as follows: > > > > > tcp,orig=(src=192.168.1.2,dst=30.0.0.1,sport=1,dport=80),reply=(src=172.16.1.2,dst=192.168.1.2,sport=80,dport=1),zone=0,mark=2 > > > tcp,orig=(src=192.168.1.2,dst=30.0.0.1,sport=2,dport=80),reply=(src=172.16.1.3,dst=192.168.1.2,sport=80,dport=2),zone=0,mark=2 > > > tcp,orig=(src=192.168.1.2,dst=30.0.0.1,sport=3,dport=80),reply=(src=172.16.1.4,dst=192.168.1.2,sport=80,dport=3),zone=0,mark=2 > > > > We cannot simply filter on the original direction, as we cannot infer > the SRC part (without filtering all entries etc.). > > OK. This is what I was asking above and you said > 'Not sure if it's possible'. So, it is possible, > I see. > Yeah my bad I did not connect the dots right away. > > > > > > > > > > > > That means we cannot fully specify the original direction > > > without doing the filtering ourselves. It would be possible with > extra steps from the user. If this interface > > > is too generic/complicated for that purpose we could say that the > flush extension supports only full orig tuple > > > and leave the filtering up to the user. That would definitely > work, it wouldn't be that elegant for the user however > > > from the code change perspective much simpler. > > > > I'm not sure I understand that part. I'm not arguing aginst partial > > matches. Or do you mean something else? > > > > > > > > I can see a certain benefit of keeping it in OvS: > > > > > > 1) The filtering complexity is hidden from the user. > > > 2) The data transfer between OvS and the user is much smaller > (even if you need to send two matches). > > > 3) Once implemented it's just ready to be used without the need of > additional thoughts going into it. > > > > > > > > > > > > > > > > > > Did you start working on the OVN side of things? Which > > > commands OVN will actually use? > > > > > > > > > Yes, I did. The plan is that OVN will use the extension. > > > > I understand that. The question was how the actual request will > > look like. > > > > > > The testing version used something like this: > > > > memset(match, 0, sizeof *match); > > > > struct ofputil_ct_tuple *orig = &match->tuple_orig; > > struct ofputil_ct_tuple *reply = &match->tuple_reply; > > > > if (IN6_IS_ADDR_V4MAPPED(&tuple->vip_ip)) { > > match->l3_type = AF_INET; > > } else { > > match->l3_type = AF_INET6; > > } > > > > match->ip_proto = tuple->proto; > > > > orig->dst = tuple->vip_ip; > > orig->dst_port = htons(tuple->vip_port); > > > > reply->src = tuple->backend_ip; > > reply->src_port = htons(tuple->backend_port); > > > > tuple is the struct containing lb info, so VIP, VIP_PORT, BACKEND, > BACKEND_PORT, PROTO. > > match is the "struct ofputil_ct_match". > > > > This is with older version of this RFE but it should illustrate the > point. > > Assuming the current v5 implementation of the extension, will the end > result of the request be different if we remove the assignment of > 'orig=>dst' and 'orig->dst_port' from the code above? > > It might delete more than we actually want to. Without specifying the LBs 5-tuple we might always remove more entries than we should. As we can have multiple backends for the same VIP (that's the whole point of LB), however nothing prevents us from two VIPs pointing to the same backend. Which would cause an issue if we will go only with the reply part. -- Ales Musil Senior Software Engineer - OVN Core Red Hat EMEA <https://www.redhat.com> [email protected] IM: amusil <https://red.ht/sig> _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
