Thanks for the detailed review, Justin.

I found a couple of other issues as well, such as using ‘<=’ instead of ‘<’
for the minimum fragment size check.
If I missed responding to any of your comments, please let me know.
I’ll send out responses to the other patches; I’m just in the middle of some
other tasks at the moment.

On 6/5/18, 11:38 PM, "ovs-dev-boun...@openvswitch.org on behalf of Justin Pettit" <ovs-dev-boun...@openvswitch.org on behalf of jpet...@ovn.org> wrote:

    
    
    > On Apr 8, 2018, at 7:53 PM, Darrell Ball <dlu...@gmail.com> wrote:
    > 
    > diff --git a/lib/conntrack.c b/lib/conntrack.c
    > index 2b20e93..987c034 100644
    > --- a/lib/conntrack.c
    > +++ b/lib/conntrack.c
    > ...
    > @@ -1308,6 +1311,8 @@ conntrack_execute(struct conntrack *ct, struct dp_packet_batch *pkt_batch,
    >                   const struct nat_action_info_t *nat_action_info,
    >                   long long now)
    > {
    
    I think it would be worth mentioning in the description of
    conntrack_execute() that it also performs fragmentation reassembly.

sure
    
    The comment states that "All the packets should have the same 'dl_type'";
    however, I believe this code requires that.  Can you change that "should"
    to a "must"?

Sure.
It is a day-one issue and externally imposed, so I’ll splice it out as a
separate patch series.

    
    > diff --git a/lib/ipf.c b/lib/ipf.c
    > new file mode 100644
    > index 0000000..3837c60
    > --- /dev/null
    > +++ b/lib/ipf.c
    > ...
    > 
    > +enum {
    > +    IPF_INVALID_IDX = -1,
    > +    IPF_V4_FRAG_SIZE_LBOUND = 400,
    > +    IPF_V4_FRAG_SIZE_MIN_DEF = 1200,
    > +    IPF_V6_FRAG_SIZE_LBOUND = 1280,
    > +    IPF_V6_FRAG_SIZE_MIN_DEF = 1280,
    > +    IPF_MAX_FRAGS_DEFAULT = 1000,
    > +    IPF_NFRAG_UBOUND = 5000,
    > +};
    
    Is there a reason you're using enums instead of #defines?  I don't see
    anything wrong--it's just not the typical style I've seen in the OVS code
    base.

I prefer enums for type safety and visibility
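For example (illustrative only, reusing a constant from the patch), the enum
constant is type checked and survives as a symbol the debugger can see,
unlike a #define:

    enum {
        IPF_NFRAG_UBOUND = 5000,  /* Typed; visible as a debugger symbol. */
    };
    /* vs. "#define IPF_NFRAG_UBOUND 5000", which is textual substitution. */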

    > +struct ipf_list {
    > +    struct hmap_node node;
    > +    struct ovs_list exp_node;
    > +    struct ovs_list complete_node;
    
    Can a packet be on both the expiry and completed list at the same time?  If
    a single list node could be used, it would be a bit smaller, and then some
    of the functions could be simplified: ipf_list_remove() could be
    simplified, and ipf_expiry_list_remove() and ipf_completed_list_remove()
    could be removed.

Both list nodes were needed for a special case that I previously removed;
they are no longer both needed.
However, I ended up adding more APIs and also kept ipf_expiry_list_remove() to
hide implementation details.
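ipf_expiry_list_remove() stays a thin wrapper, roughly like this (a sketch,
using the names from the patch):

    static void
    ipf_expiry_list_remove(struct ipf_list *ipf_list)
        OVS_REQUIRES(ipf_lock)
    {
        ovs_list_remove(&ipf_list->exp_node);
    }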
    
    > +    struct ipf_frag *frag_list;
    > +    struct ipf_list_key key;
    > +    struct dp_packet *reass_execute_ctx;
    > +    long long expiration;
    > +    int last_sent_idx;
    > +    int last_inuse_idx;
    > +    int size;
    > +    uint8_t state;
    > +};
    
    Can you define at least a couple of the less obvious members?  For example,
    what 'size' means, point 'state' to the appropriate enum, and state that
    'expiration' is in milliseconds.

Sure, pointing to the enum is good to do.
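Something along these lines (a sketch; the comments are illustrative, and I am
assuming the states enum ends up named ipf_list_state):

    long long expiration;   /* In milliseconds. */
    int size;               /* Size of the fragment list. */
    uint8_t state;          /* One of enum ipf_list_state. */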
    
    > +static struct hmap frag_lists OVS_GUARDED;
    > +static struct ovs_list frag_exp_list OVS_GUARDED;
    > +static struct ovs_list frag_complete_list OVS_GUARDED;
    > +static struct ovs_list reassembled_pkt_list OVS_GUARDED;
    
    Is it worth using OVS_GUARDED_BY(), since it's always guarded by the same
    lock?

In this case, as you mentioned, it is not that important to use
OVS_GUARDED_BY, but I think it is still better, so I’ll switch over to it.
Thanks
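i.e. (sketch):

    static struct hmap frag_lists OVS_GUARDED_BY(ipf_lock);
    static struct ovs_list frag_exp_list OVS_GUARDED_BY(ipf_lock);
    static struct ovs_list frag_complete_list OVS_GUARDED_BY(ipf_lock);
    static struct ovs_list reassembled_pkt_list OVS_GUARDED_BY(ipf_lock);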

    
    > +static void
    > +ipf_count(bool v4, enum ipf_counter_type cntr)
    > +{
    > +    if (v4) {
    > +        switch (cntr) {
    > +        case IPF_COUNTER_NFRAGS_ACCEPTED:
    > +            atomic_count_inc(&n4frag_accepted);
    > +            break;
    > +        case IPF_COUNTER_NFRAGS_COMPL_SENT:
    > +            atomic_count_inc(&n4frag_completed_sent);
    > +            break;
    > +        case IPF_COUNTER_NFRAGS_EXPD_SENT:
    > +            atomic_count_inc(&n4frag_expired_sent);
    > +            break;
    > +        case IPF_COUNTER_NFRAGS_TOO_SMALL:
    > +            atomic_count_inc(&n4frag_too_small);
    > +            break;
    > +        case IPF_COUNTER_NFRAGS_OVERLAP:
    > +            atomic_count_inc(&n4frag_overlap);
    > +            break;
    > +        case IPF_COUNTER_NFRAGS:
    > +        default:
    > +            OVS_NOT_REACHED();
    > +        }
    > +    } else {
    > +        switch (cntr) {
    > +        case IPF_COUNTER_NFRAGS_ACCEPTED:
    > +            atomic_count_inc(&n6frag_accepted);
    > +            break;
    > +        case IPF_COUNTER_NFRAGS_COMPL_SENT:
    > +            atomic_count_inc(&n6frag_completed_sent);
    > +            break;
    > +        case IPF_COUNTER_NFRAGS_EXPD_SENT:
    > +            atomic_count_inc(&n6frag_expired_sent);
    > +            break;
    > +        case IPF_COUNTER_NFRAGS_TOO_SMALL:
    > +            atomic_count_inc(&n6frag_too_small);
    > +            break;
    > +        case IPF_COUNTER_NFRAGS_OVERLAP:
    > +            atomic_count_inc(&n6frag_overlap);
    > +            break;
    > +        case IPF_COUNTER_NFRAGS:
    > +        default:
    > +            OVS_NOT_REACHED();
    > +        }
    > +    }
    > +}
    
    There's a fair amount of redundancy in this function.  What about something
    a little more compact like the following?

The ternary operator version is preferable here: it is more compact and has
fewer “break”s.
I just wanted to make sure the code was correct before compacting it.

    -=-=-=-=-=-=-
    static void
    ipf_count(bool v4, enum ipf_counter_type cntr)
    {
        switch (cntr) {
        case IPF_COUNTER_NFRAGS_ACCEPTED:
            v4 ? atomic_count_inc(&n4frag_accepted)
               : atomic_count_inc(&n6frag_accepted);
            break;
        case IPF_COUNTER_NFRAGS_COMPL_SENT:
            v4 ? atomic_count_inc(&n4frag_completed_sent)
               : atomic_count_inc(&n6frag_completed_sent);
            break;
        case IPF_COUNTER_NFRAGS_EXPD_SENT:
            v4 ? atomic_count_inc(&n4frag_expired_sent)
               : atomic_count_inc(&n6frag_expired_sent);
            break;
        case IPF_COUNTER_NFRAGS_TOO_SMALL:
            v4 ? atomic_count_inc(&n4frag_too_small)
               : atomic_count_inc(&n6frag_too_small);
            break;
        case IPF_COUNTER_NFRAGS_OVERLAP:
            v4 ? atomic_count_inc(&n4frag_overlap)
               : atomic_count_inc(&n6frag_overlap);
            break;
        case IPF_COUNTER_NFRAGS:
        default:
            OVS_NOT_REACHED();
        }   
    }   
    -=-=-=-=-=-=-
    
    > +static bool
    > +ipf_get_enabled(void)
    > +{
    > +    bool ifp_v4_enabled_;
    > +    bool ifp_v6_enabled_;
    > +    atomic_read_relaxed(&ifp_v4_enabled, &ifp_v4_enabled_);
    > +    atomic_read_relaxed(&ifp_v6_enabled, &ifp_v6_enabled_);
    > +    return ifp_v4_enabled_ || ifp_v6_enabled_;
    > +}
    > +
    > +static bool
    > +ipf_get_v4_enabled(void)
    > +{
    > +    bool ifp_v4_enabled_;
    > +    atomic_read_relaxed(&ifp_v4_enabled, &ifp_v4_enabled_);
    > +    return ifp_v4_enabled_;
    > +}
    > +
    > +static bool
    > +ipf_get_v6_enabled(void)
    > +{
    > +    bool ifp_v6_enabled_;
    > +    atomic_read_relaxed(&ifp_v6_enabled, &ifp_v6_enabled_);
    > +    return ifp_v6_enabled_;
    > +}
    
    This is minor, but what about simplifying ipf_get_enabled() to something
    like the following:

Yep; it was not consolidated yet.
    
    -=-=-=-=-=-=-
    static bool
    ipf_get_enabled(void)
    {
        return ipf_get_v4_enabled() || ipf_get_v6_enabled();
    }
    -=-=-=-=-=-=-
    
    > +static void
    > +ipf_expiry_list_add(struct ipf_list *ipf_list, long long now)
    > +    OVS_REQUIRES(ipf_lock)
    > +{
    > +    enum {
    > +        IPF_FRAG_LIST_TIMEOUT_DEFAULT = 15000,
    > +    };
    
    Is there a reason you defined this as an enum in the function?  

I want the narrowest possible scope, i.e. where it is used.

    Usually these sorts of definitions are defined together so that they can be
    easily modified.  Also, I usually think of a default as something that can
    be modified, but this doesn't seem configurable.

I don’t want this to be modifiable, at least for now.
The “_DEFAULT” suffix can be removed for clarity; I’ll do that.
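i.e. (sketch):

    enum {
        IPF_FRAG_LIST_TIMEOUT = 15000,   /* In ms. */
    };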

    
    > +/* Symmetric */
    > +static uint32_t
    > +ipf_list_key_hash(const struct ipf_list_key *key, uint32_t basis)
    > +{
    > +    uint32_t hsrc, hdst, hash;
    > +    hsrc = hdst = basis;
    > +    hsrc = ipf_addr_hash_add(hsrc, &key->src_addr);
    > +    hdst = ipf_addr_hash_add(hdst, &key->dst_addr);
    > +    hash = hsrc ^ hdst;
    
    Is there a reason for the assignment into 'hsrc' and 'hdst'?  Can't 'basis'
    just be put directly in the two ipf_addr_hash_add() arguments?  You could
    also remove an extra line by moving the declaration onto the same line as
    their assignment:

This is a bit more code in C form, but it is clearer, since we are doing a
*_hash_add() to the running hash, which just happens to start at “basis” here.
I usually write:

uint32_t hsrc, hdst;
hsrc = hdst = basis;
hsrc = ipf_addr_hash_add(hsrc, &key->src_addr);
hdst = ipf_addr_hash_add(hdst, &key->dst_addr);
uint32_t hash = hsrc ^ hdst;

    
    -=-=-=-=-=-=-
        uint32_t hsrc = ipf_addr_hash_add(basis, &key->src_addr);
        uint32_t hdst = ipf_addr_hash_add(basis, &key->dst_addr);
        uint32_t hash = hsrc ^ hdst;
    -=-=-=-=-=-=-
    
    > +static bool
    > +ipf_list_complete(const struct ipf_list *ipf_list)
    > +    OVS_REQUIRES(ipf_lock)
    > +{
    > +    for (int i = 0; i < ipf_list->last_inuse_idx; i++) {
    > +        if (ipf_list->frag_list[i].end_data_byte + 1
    > +            != ipf_list->frag_list[i + 1].start_data_byte) {
    > +            return false;
    > +        }
    > +    }
    > +    return true;
    > +}
    
    I think it would be good if this code either asserted that
    'ipf_list->state' is 'IPF_LIST_STATE_FIRST_LAST_SEEN' or made it clear in
    the comment.

There is only one caller of this function, and it already guarantees that state:

    if (next_state == IPF_LIST_STATE_FIRST_LAST_SEEN) {
        ipf_sort(ipf_list->frag_list, ipf_list->last_inuse_idx);
        if (ipf_list_complete(ipf_list)) {



    > +/* Called on a sorted complete list of fragments. */
    > +static struct dp_packet *
    > +ipf_reassemble_v4_frags(struct ipf_list *ipf_list)
    > +    OVS_REQUIRES(ipf_lock)
    > +{
    > +    struct ipf_frag *frag_list = ipf_list->frag_list;
    > +    struct dp_packet *pkt = dp_packet_clone(frag_list[0].pkt);
    > +    struct ip_header *l3 = dp_packet_l3(pkt);
    > +    int len = ntohs(l3->ip_tot_len);
    > +    size_t add_len;
    > +    size_t ip_hdr_len = IP_IHL(l3->ip_ihl_ver) * 4;
    > +
    > +    for (int i = 1; i <= ipf_list->last_inuse_idx; i++) {
    > +        add_len = frag_list[i].end_data_byte -
    > +                         frag_list[i].start_data_byte + 1;
    > +        len += add_len;
    > +        if (len > IPV4_PACKET_MAX_SIZE) {
    > +            dp_packet_delete(pkt);
    > +            return NULL;
    
    Should this be incrementing some counters and/or kicking out all the 
fragments with the invalid bit set?

It will fall into a common code path where the fragments will expire and be
sent with the invalid flag set, and this will be accounted for there.
I intended to add a warn log here, but there were too many distractions; since
it is per packet, it will be rate limited as well.
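Roughly what I have in mind (a sketch; 'rl' would be a file-level vlog rate
limiter, e.g. VLOG_RATE_LIMIT_INIT(5, 5), and the message text is
illustrative):

    if (len > IPV4_PACKET_MAX_SIZE) {
        VLOG_WARN_RL(&rl, "Unexpected reassembled v4 packet size");
        dp_packet_delete(pkt);
        return NULL;
    }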


    > +/* Called on a sorted complete list of fragments. */
    > +static struct dp_packet *
    > +ipf_reassemble_v6_frags(struct ipf_list *ipf_list)
    > +    OVS_REQUIRES(ipf_lock)
    > +{
    >     struct ipf_frag *frag_list = ipf_list->frag_list;
    >     struct dp_packet *pkt = dp_packet_clone(frag_list[0].pkt);
    >     struct  ovs_16aligned_ip6_hdr *l3 = dp_packet_l3(pkt);
    >     int pl = ntohs(l3->ip6_plen) - sizeof(struct ovs_16aligned_ip6_frag);
    
    Does the fragment reassembly properly handle the case where there are
    multiple IPv6 extension headers in addition to the fragment one?
    Can you add a test case for that?

The existing code should be agnostic to that.
But I added a test case anyway; the new test also covers out-of-order
fragments.
    
    > ...
    > +    size_t l3_size = tail - (char *)l3 -pad;
    > +    size_t l4_size = tail - (char *)l4 -pad;
    
    Please put a space before the 'pad's.

sure
    
    > +    size_t l3_hlen = l3_size - l4_size;
    > +    size_t add_len;
    > +
    > +    for (int i = 1; i <= ipf_list->last_inuse_idx; i++) {
    > +        add_len = frag_list[i].end_data_byte -
    > +                          frag_list[i].start_data_byte + 1;
    > +        pl += add_len;
    > +        if (pl > IPV6_PACKET_MAX_DATA) {
    > +            dp_packet_delete(pkt);
    > +            return NULL;
    
    Same question about incrementing counters and/or kicking out all the 
fragments with the invalid bit set?

It will fall into a common code path where the fragments will expire and be
sent with the invalid flag set, and this will be accounted for there.
I intended to add a warn log here, but there were too many distractions; since
it is per packet, it will be rate limited as well.
    
    > +        }
    > +        l3 = dp_packet_l3(frag_list[i].pkt);
    > +        dp_packet_put(pkt, (char *)l3 + l3_hlen, add_len);
    > +    }
    > +    l3 = dp_packet_l3(pkt);
    > +    l4 = dp_packet_l4(pkt);
    > +    tail = dp_packet_tail(pkt);
    > +    pad = dp_packet_l2_pad_size(pkt);
    > +    l3_size = tail - (char *)l3 -pad;
    > +
    > +    uint8_t nw_proto = l3->ip6_nxt;
    > +    uint8_t nw_frag = 0;
    > +    const void *data = l3 + 1;
    > +    size_t datasize = l3_size - sizeof *l3;
    > +
    > +    const struct ovs_16aligned_ip6_frag *frag_hdr = NULL;
    > +    if (!parse_ipv6_ext_hdrs(&data, &datasize, &nw_proto, &nw_frag, &frag_hdr)
    > +        || !nw_frag || !frag_hdr) {
    > +        return NULL;
    
    Doesn't this need to free 'pkt'?

This is a bug – thank you.
It should do exactly what the error case a few lines above does:
            dp_packet_delete(pkt);
            return NULL;
I intended to add a warn log here as well, since it is likely impossible to
hit this case.
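In context, the fix would look roughly like this (a sketch; 'rl' again being a
file-level vlog rate limiter, and the message text illustrative):

    if (!parse_ipv6_ext_hdrs(&data, &datasize, &nw_proto, &nw_frag, &frag_hdr)
        || !nw_frag || !frag_hdr) {
        VLOG_WARN_RL(&rl, "Invalid reassembled v6 packet; dropping");
        dp_packet_delete(pkt);
        return NULL;
    }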

    
    > ...
    > +    struct ovs_16aligned_ip6_frag *fh =
    > +        CONST_CAST(struct ovs_16aligned_ip6_frag *, frag_hdr);
    > +    fh->ip6f_offlg = 0;;
    
    There's an extra semicolon.

ok
    
    > +    l3->ip6_plen = htons(pl);
    > +    l3->ip6_ctlun.ip6_un1.ip6_un1_nxt = nw_proto;
    > +    return pkt;
    
    Are IPv6 stacks supposed to handle fragment extension headers even if it's
    in a single packet?

I don’t think it is defined, per se.
However, it is likely a subterfuge for an externally received packet.
    
    > +/* Called when a valid fragment is added. */
    > +static void
    > +ipf_list_state_transition(struct ipf_list *ipf_list, bool ff, bool lf,
    > +                          bool v4)
    > +    OVS_REQUIRES(ipf_lock)
    > +{
    > ...
    > +    case IPF_LIST_STATE_FIRST_SEEN:
    > +        if (ff) {
    > +            next_state = IPF_LIST_STATE_FIRST_SEEN;
    
    This transition isn't possible, right?  Should we just assert?

If we had previously received a first packet and now receive a packet that is 
“not first” and “not last”,
we will be here.
    
    > ...
    > +    case IPF_LIST_STATE_LAST_SEEN:
    > +        if (ff) {
    > +            next_state = IPF_LIST_STATE_FIRST_LAST_SEEN;
    > +        } else if (lf) {
    > +            next_state = IPF_LIST_STATE_LAST_SEEN;
    
    This transition also isn't possible, right?  Should we just assert?

If we had previously received a last packet and now receive a packet that is 
“not last” and “not first”,
we will be here.

    
    > ...
    > +    case IPF_LIST_STATE_COMPLETED:
    > +        next_state = curr_state;
    > +        break;
    
    Can we get into this state without a dupe?


Since dupes are already filtered out earlier in the updated code flow, this is
impossible to hit.
It should just fall through to the other such cases (for
documentation/maintainability reasons):
    case IPF_LIST_STATE_REASS_FAIL:
    case IPF_LIST_STATE_NUM:
    default:
        OVS_NOT_REACHED();

    
    > +static bool
    > +ipf_v4_key_extract(const struct dp_packet *pkt, ovs_be16 dl_type,
    > +                   uint16_t zone, struct ipf_list_key *key,
    > +                   uint16_t *start_data_byte, uint16_t *end_data_byte,
    > +                   bool *ff, bool *lf)
    > +{
    > ...
    > +    const struct eth_header *l2 = dp_packet_eth(pkt);
    > +    const struct ip_header *l3 = dp_packet_l3(pkt);
    > +
    > +    if (!l2 || !l3) {
    > +        return false;
    > +    }
    > +
    > +    const char *tail = dp_packet_tail(pkt);
    > +    uint8_t pad = dp_packet_l2_pad_size(pkt);
    > +    size_t size = tail - (char *)l3 -pad;
    
    Can you add a space so it doesn't look like a negative 'pad'?

sure
    
    > ...
    > +    if (!dp_packet_ip_checksum_valid(pkt) && csum(l3, ip_hdr_len) != 0) {
    > +        return false;
    > +    }
    
    Haven't we essentially already done the first half of this check in the
    earlier call to dp_packet_ip_checksum_bad()?


This is intentionally super conservative in using the external APIs, just in
case there are some “gaps” there now or in the future.

    
    > +static bool
    > +ipf_v6_key_extract(const struct dp_packet *pkt, ovs_be16 dl_type,
    > +                uint16_t zone, struct ipf_list_key *key,
    > +                uint16_t *start_data_byte, uint16_t *end_data_byte,
    > +                bool *ff, bool *lf)
    > +{
    > ...
    > +    const char *tail = dp_packet_tail(pkt);
    > +    uint8_t pad = dp_packet_l2_pad_size(pkt);
    > +    size_t l3_size = tail - (char *)l3 -pad;
    > +    size_t l4_size = tail - (char *)l4 -pad;
    
    Can you also add a space before these 'pad's?

sure

    
    > ...
    > +    /* We are not supporting parsing of the routing header header
    > +     * to use as the dst address part of the key. */
    
    "header" is used twice in this comment.

Got it

    
    > +    key->dst_addr.ipv6 = l3->ip6_dst;
    > +    /* Not used for key for V6. */
    > +    key->nw_proto = 0;
    
    Can you put the comment on the same line as 'key->nw_proto' so that it's
    clear that it's only this case that isn't used for IPv6?

sure
    
    > +static bool
    > +ipf_handle_frag(struct dp_packet *pkt, ovs_be16 dl_type, uint16_t zone,
    > +                long long now, uint32_t hash_basis)
    > +    OVS_REQUIRES(ipf_lock)
    > +{
    > ...
    > 
    > +    struct ipf_list *ipf_list =
    > +        ipf_list_key_lookup(&key, hash);
    
    This will fit on the same line.

Got it
    
    > +    enum {
    > +        IPF_FRAG_LIST_MIN_INCREMENT = 4,
    > +        IPF_UNBOUNDED_FRAG_LIST_SIZE = 65535,
    > +    };
    > +
    > +    int max_frag_list_size;
    > +    if (v4) {
    > +        max_frag_list_size = max_v4_frag_list_size;
    > +    } else {
    > +        max_frag_list_size = IPF_UNBOUNDED_FRAG_LIST_SIZE;
    > +    }
    
    Why is v6 unbounded?

Because of the variability of extension headers, I don’t set a calculated hard
limit up front; the code bounds the list through the min fragment size and max
packet size checks.
I added a comment to that effect and changed “UNBOUNDED” to “MAX” to avoid the
“practically unbounded” concept.
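i.e., something like (a sketch; the new constant name is illustrative):

    enum {
        IPF_FRAG_LIST_MIN_INCREMENT = 4,
        /* For V6, no hard limit is calculated up front because of the
         * variable extension headers; the min frag size and max packet
         * size checks bound the list in practice. */
        IPF_MAX_FRAG_LIST_SIZE = 65535,
    };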
    
    > +/* Handles V4 fragments right now. */
    > +static void
    > +ipf_extract_frags_from_batch(struct dp_packet_batch *pb, ovs_be16 dl_type,
    > +                             uint16_t zone, long long now, uint32_t hash_basis)
    > +{
    
    I believe this code also handles IPv6.

Oops, that comment has been outdated for a while.
    
    > +    const size_t pb_cnt = dp_packet_batch_size(pb);
    > +    int pb_idx; /* Index in a packet batch. */
    > +    struct dp_packet *pkt;
    > +
    > +    DP_PACKET_BATCH_REFILL_FOR_EACH (pb_idx, pb_cnt, pkt, pb) {
    > +        ipf_lock_lock(&ipf_lock);
    > +
    > +        if (!ipf_handle_frag(pkt, dl_type, zone, now, hash_basis)) {
    > +            dp_packet_batch_refill(pb, pkt, pb_idx);
    
    If fragment handling is disabled or the packet being processed isn't a
    fragment, it will cause this batch refill to always be triggered.  I don't
    know the implications, but is that correct?

For the disabled case, we have protection in place here:

    if (ipf_get_enabled()) {
        ipf_extract_frags_from_batch(pb, dl_type, zone, now, hash_basis);
    }

The work involved in checking each packet for fragments exceeds the amortized
batch manipulation cost (assigning pointers and incrementing indices).

I had done a bit of optimizing for the existing handling of invalid packets
feeding into conntrack.

    
    > +/* This would be used in a rare case where a list cannot be sent. The only
    > + * reason known right now is a mempool source check,which exists due to DPDK
    > + * support, where packets are no longer being received on any port with a
    > + * source matching the fragment.
    > + * Returns true if the list was purged. */
    > +static bool
    > +ipf_purge_list_check(struct ipf_list *ipf_list, long long now)
    
    There should be a space in "check,which".

Got it
    
    > +    OVS_REQUIRES(ipf_lock)
    > +{
    > +    enum {
    > +        /* 10 minutes. */
    > +        IPF_FRAG_LIST_TIMEOUT_PURGE = 600000,
    > +    };
    > +
    > +    if (now < ipf_list->expiration + IPF_FRAG_LIST_TIMEOUT_PURGE) {
    > +        return false;
    > +    }
    
    Ten minutes seems kind of long.  Is that standard?

This is a pathological DPDK corner case; no standard applies here.

    
    > +static void
    > +ipf_send_completed_frags(struct dp_packet_batch *pb, long long now, bool v4)
    > +{
    > +    if (ovs_list_is_empty(&frag_complete_list)) {
    > +        return;
    > +    }
    > ...
    > +static void
    > +ipf_send_expired_frags(struct dp_packet_batch *pb, long long now, bool v4)
    > +{
    > ...
    > +    if (ovs_list_is_empty(&frag_exp_list)) {
    > +        return;
    > +    }
    > ...
    > +static void
    > +ipf_post_execute_reass_pkts(struct dp_packet_batch *pb, bool v4)
    > +{
    > +    if (ovs_list_is_empty(&reassembled_pkt_list)) {
    > +        return;
    > +    }
    
    These functions all call ovs_list_is_empty() and then take the lock if it's
    not empty.  I don't think that call is atomic, but I suppose the
    consequences of getting the wrong answer aren't that serious, since we'll
    be called later.  As I mention later, we could side-step all of this by
    just taking the lock in the calling function, and then call each of these
    functions without the lock/unlock portion in each.

It is intentional not to take the lock up front, for performance in the common
case where no fragment processing is needed.
    
    > +
    > +    ipf_lock_lock(&ipf_lock);
    > +    struct reassembled_pkt *rp, *next;
    > +
    > +    LIST_FOR_EACH_SAFE (rp, next, rp_list_node, &reassembled_pkt_list) {
    > +        const size_t pb_cnt = dp_packet_batch_size(pb);
    > +        int pb_idx;
    > +        struct dp_packet *pkt;
    > +        /* Inner batch loop is constant time since batch size is <=
    > +         * NETDEV_MAX_BURST. */
    > +        DP_PACKET_BATCH_REFILL_FOR_EACH (pb_idx, pb_cnt, pkt, pb) {
    > +            if (pkt == rp->list->reass_execute_ctx) {
    > +                for (int i = 0; i <= rp->list->last_inuse_idx; i++) {
    > +                    rp->list->frag_list[i].pkt->md.ct_label = pkt->md.ct_label;
    > +                    rp->list->frag_list[i].pkt->md.ct_mark = pkt->md.ct_mark;
    > +                    rp->list->frag_list[i].pkt->md.ct_state = pkt->md.ct_state;
    > +                    rp->list->frag_list[i].pkt->md.ct_zone = pkt->md.ct_zone;
    > +                    rp->list->frag_list[i].pkt->md.ct_orig_tuple_ipv6 =
    > +                        pkt->md.ct_orig_tuple_ipv6;
    > +                    if (pkt->md.ct_orig_tuple_ipv6) {
    > +                        rp->list->frag_list[i].pkt->md.ct_orig_tuple.ipv6 =
    > +                            pkt->md.ct_orig_tuple.ipv6;
    
    It looks like 'rp->list->frag_list[i].pkt->md.ct_orig_tuple_ipv6' is
    getting set twice.


Those are different fields (ct_orig_tuple_ipv6 vs ct_orig_tuple.ipv6), so it's
ok.

    > ...
    > +                if (v4) {
    > ...
    > +                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
    > +                                                 frag_ip, reass_ip);
    
    This should be indented in a bit more.

ok
    
    > +                    l3_frag->ip_src = l3_reass->ip_src;
    > +
    > +                    reass_ip = get_16aligned_be32(&l3_reass->ip_dst);
    > +                    frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
    > +                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
    > +                                                     frag_ip, reass_ip);
    > +                    l3_frag->ip_dst = l3_reass->ip_dst;
    > +                } else {
    > +                    struct  ovs_16aligned_ip6_hdr *l3_frag =
    > +                        dp_packet_l3(rp->list->frag_list[0].pkt);
    > +                    struct  ovs_16aligned_ip6_hdr *l3_reass =
    > +                        dp_packet_l3(pkt);
    > +                    l3_frag->ip6_src = l3_reass->ip6_src;
    > +                    l3_frag->ip6_dst = l3_reass->ip6_dst;
    > +                }
    
    Why do the IP addresses need to be updated here?

NAT: conntrack may have translated the addresses of the reassembled packet, so
they need to be copied back to the individual fragments before they are sent.

    
    > void
    > ipf_preprocess_conntrack(struct dp_packet_batch *pb, long long now,
    >                          ovs_be16 dl_type, uint16_t zone, uint32_t hash_basis)
    > {                        
    >     if (ipf_get_enabled()) {
    >         ipf_extract_frags_from_batch(pb, dl_type, zone, now, hash_basis);
    >     }   
    >     
    >     if (ipf_get_enabled() || atomic_count_get(&nfrag)) {
    >         ipf_execute_reass_pkts(pb);
    >     }   
    > }  
    > 
    > +void
    > +ipf_postprocess_conntrack(struct dp_packet_batch *pb, long long now,
    > +                          ovs_be16 dl_type)
    > +{
    > +    if (ipf_get_enabled() || atomic_count_get(&nfrag)) {
    > +        ipf_post_execute_reass_pkts(pb, dl_type == htons(ETH_TYPE_IP));
    > +        ipf_send_completed_frags(pb, dl_type == htons(ETH_TYPE_IP), now);
    > +        ipf_send_expired_frags(pb, now, dl_type == htons(ETH_TYPE_IP));
    
    I think it would be a bit more clear if you defined a 'v4' bool and just
    used that as the argument to these functions.

sure
    
    This is very minor, but the last two functions have the same arguments,
    but in a different order.  It would be more consistent if they were the
    same.


Yeah, I remember wanting to come back to this; I like the parameters in the 
same order as well.
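i.e., something like (a sketch; the final argument order is still to be
decided, shown consistent here):

    bool v4 = dl_type == htons(ETH_TYPE_IP);
    ipf_post_execute_reass_pkts(pb, v4);
    ipf_send_completed_frags(pb, v4, now);
    ipf_send_expired_frags(pb, v4, now);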

    
    Also, all these callers essentially need 'ipf_lock'.  What about just
    taking it here and then calling all these functions with an annotation that
    they require the lock?


It is intentional not to take the lock here, for performance in the common
case where no fragment processing is needed.

    
    > +    }
    > +}
    
    I think it would be worth mentioning what ipf_preprocess_conntrack() and
    ipf_postprocess_conntrack() each do and that they expect the same ethertype
    for all the packets in the batch.

sure

    
    > +void
    > +ipf_destroy(void)
    > +{
    > +    ipf_lock_lock(&ipf_lock);
    > +
    > +    struct ipf_list *ipf_list;
    > +    HMAP_FOR_EACH_POP (ipf_list, node, &frag_lists) {
    > +        struct dp_packet *pkt;
    > +        while (ipf_list->last_sent_idx < ipf_list->last_inuse_idx) {
    > +            pkt = ipf_list->frag_list[ipf_list->last_sent_idx + 1].pkt;
    > +            dp_packet_delete(pkt);
    > +            atomic_count_dec(&nfrag);
    
    Since there's a count, do you think it's worth asserting that 'nfrag' is
    zero at the end of this function?

I prefer not to fail the system just because a packet somehow got counted
twice.
But we can certainly log a warning with the final nfrag count reported.
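For example (a sketch):

    unsigned int n = atomic_count_get(&nfrag);
    if (n) {
        VLOG_WARN("ipf_destroy: %u fragments still outstanding", n);
    }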
    
    > diff --git a/lib/ipf.h b/lib/ipf.h
    > new file mode 100644
    > index 0000000..5861e96
    > --- /dev/null
    > +++ b/lib/ipf.h
    > ...
    > +/* Collects and reassembles fragments which are to be sent through
    > + * conntrack, if fragment processing is enabled or fragments are
    > + * in flight. */
    > +void
    > +ipf_preprocess_conntrack(struct dp_packet_batch *pb, long long now,
    > +                         ovs_be16 dl_type, uint16_t zone, uint32_t hash_basis);
    > +
    > +/* Updates the state of fragments associated with reassembled packets and
    > + * sends out fragments that are either associated with completed
    > + * packets or expired, if fragment processing is enabled or fragments are
    > + * in flight. */
    > +void
    > +ipf_postprocess_conntrack(struct dp_packet_batch *pb, long long now,
    > +                          ovs_be16 dl_type);
    > +
    > +void
    > +ipf_init(void);
    > +
    > +void
    > +ipf_destroy(void);
    
    This is minor, but for prototypes, the style guide puts the return type and
    function name on the same line.

sure
    
    > diff --git a/tests/system-traffic.at b/tests/system-traffic.at
    > index 33a89c8..115bca9 100644
    > --- a/tests/system-traffic.at
    > +++ b/tests/system-traffic.at
    > @@ -1758,7 +1758,6 @@ AT_CLEANUP
    > 
    > AT_SETUP([conntrack - IPv4 fragmentation])
    > CHECK_CONNTRACK()
    > -CHECK_CONNTRACK_FRAG()
    > OVS_TRAFFIC_VSWITCHD_START()
    
    This patch pulls out all the callers of CHECK_CONNTRACK_FRAG(), but leaves
    the definition.  Patch 10 removes the definition.  Should we just remove it
    here?

I had some potential use in mind, but we can just remove it and add it back
if/when needed.
 
    You later on introduce more tests, including checks for overlapping
    fragments.  What about introducing those before this patch?  I think
    they'll still get exercised by the Linux kernel version, and then this code
    will also be exercised a bit more.

I have been thinking of a separate series, which makes it easier to manage.
I also have some other tests related to a regression that I wanted to submit.

    Related, did you run the fragment tests under valgrind?

Yep, I checked a few things.
That reminds me of one of the patches in my queue for enabling Valgrind for
the userspace datapath.

    
    Thanks,
    
    --Justin
    
    