On 12.07.2019 13:49, Ilya Maximets wrote:
> On 11.07.2019 17:40, Van Haaren, Harry wrote:
>>> -----Original Message-----
>>> From: Ilya Maximets [mailto:[email protected]]
>>> Sent: Thursday, July 11, 2019 3:14 PM
>>> To: Van Haaren, Harry <[email protected]>; [email protected]
>>> Cc: [email protected]; Stokes, Ian <[email protected]>; Michal Orsák
>>> <[email protected]>
>>> Subject: Re: [PATCH v10 0/5] dpcls func ptrs & optimizations
>>>
>>> On 09.07.2019 15:34, Harry van Haaren wrote:
>>>> Hey All,
>>>>
>>>>
>>>> Here is v10 of the DPCLS Function Pointer patchset, as presented at
>>>> OVS Conf in Nov '18 and discussed on the ML since then.
>>>> I'm aware of the soft-freeze for 2.12, but I feel this patchset has had
>>>> enough reviews/versions/testing to be merged in 2.12.
>>>>
>>>> Thanks Ilya and Ian for the review comments on v9; they should all be
>>>> addressed in this v10.
>>>>
>>>> Thanks to Malvika Gupta for testing (Tested-by tags added to the patches)
>>>> and for reporting ARM performance gains; see here for details:
>>>> https://mail.openvswitch.org/pipermail/ovs-dev/2019-June/360088.html
>>>>
>>>>
>>>> Regards, -Harry
>>>
>>> Hi, Harry.
>>> Thanks for working on this.
>>
>> My pleasure - it’s a nice part of OVS. And there's lots more to do :)
>>
>>
>>> I performed some tests with this version in my usual PVP with bonded PHY
>>> setup, and here are some observations:
>>>
>>> * The bug that redirected packets to the wrong rules is gone. At least I
>>>   can't catch it in my testing anymore, so I assume it's fixed now.
>>>
>>> * The dpcls performance boost for 512B packets is around 12% compared
>>>   with current master.
>>
>> Ah great! Glad to hear it's giving you a performance gain.
>>
>>
>>> A few remarks about the test scenario:
>>> Most packets go through the NORMAL action with VLAN push/pop.
>>> Packets that go from the VM to the balanced-tcp bonded PHY go through
>>> recirculation. The datapath flows for them look like this:
>>>
>>> Before recirculation:
>>> recirc_id=0,eth,ip,vlan_tci=0x0000/0x1fff,dl_src=aa:16:3e:24:30:dd,dl_dst=aa:bb:cc:dd:ee:11,nw_frag=no
>>>
>>> After recirculation:
>>> recirc_id=0x1,dp_hash=0xf5/0xff,eth,ip,dl_vlan=42,dl_vlan_pcp=0,nw_frag=no
>>>
>>> I have 256 flows in datapath for different 'dp_hash'es.
>>>
>>> So, even if the number of IPv4 flows is as high as 256K, I have only
>>> ~270 datapath flows in dpcls. (This gives a huge advantage to dpcls
>>> over EMC and SMC.)
>>
>> Right - I'm a big fan of the consistent performance characteristic of DPCLS,
>> which is due to its wildcarding capabilities and lack of caching concepts.
>>
>>
>>> All the flows fit into the 5+1 case, i.e. the optimized function
>>> dpcls_subtable_lookup_mf_u0w5_u1w1 (5 bits set in miniflow unit 0 and
>>> 1 in unit 1) is used.
>>>
>>> Most interesting observation:
>>>
>>> * The new version of the dpcls lookup outperforms SMC in this setup even
>>>   with a relatively small number of flows: dpcls is faster than SMC by
>>>   1.5% with 8K flows and by 5.7% with 256K flows.
>>>   Of course, SMC is 10% faster than dpcls with 8 flows, but that is not
>>>   very interesting because no one can beat EMC in this area.
>>>
>>> I'd like to read the code more carefully tomorrow and probably give some
>>> more feedback.
>>>
>>> Best regards, Ilya Maximets.
>>
>> Thanks for your comments - please do prioritize feedback ASAP, because as
>> you know the 2.12 soft-freeze is already in effect.
>>
>> I'll work on Ian's comments on v10, but hold off sending v11 until there
>> is some feedback from you too :)
>
> A few comments:
>
> 1. I don't really like that the new dpcls depends on the internal structure
>    of 'struct flowmap'. At the very least, we need to add a build assert to
>    notify anyone who tries to change it.
>
> 2. There is no need to expose so much internal stuff to every module that
>    includes 'dpif-netdev.h'. Can we create 'dpif-netdev-private.h' instead
>    and move all the subtable-related stuff there? (See the sketch after
>    this list.)
>
> 3. IMHO, it's better to apply some preprocessor macros to make it easier
>    to add new optimized functions and to decrease code duplication.
>
> 4. Please add a space between 'ULLONG_FOR_EACH_1' and '('.
>
> 5. A lot of comments need to be re-formatted according to the coding style,
>    i.e. first letter capitalized + a period at the end.
>
> 6. "generic optimized" sounds weird.
One more thing. Could you please rename the patches like:
    dpif-netdev: implement function pointers/subtable
to
    dpif-netdev: Implement function pointers/subtable.
This is not a strict requirement, but it is the common format for OVS
(capital letter + period).
>
> Here is an incremental I'm suggesting which covers comments 1, 3 and 4:
>
> diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup-generic.c
> index 12406f4c1..2d210bf93 100644
> --- a/lib/dpif-netdev-lookup-generic.c
> +++ b/lib/dpif-netdev-lookup-generic.c
> @@ -31,6 +31,9 @@
> 
>  VLOG_DEFINE_THIS_MODULE(dpif_lookup_generic);
> 
> +/* Lookup functions below depend on the internal structure of a flowmap. */
> +BUILD_ASSERT_DECL(FLOWMAP_UNITS == 2);
> +
>  /* netdev_flow_key_flatten_unit:
>   * Given a packet, table and mf_masks, this function iterates over each bit
>   * set in the subtable, and calculates the appropriate metadata to store in the
> @@ -129,8 +132,8 @@ netdev_rule_matches_key(const struct dpcls_rule *rule,
>  {
>      const uint64_t *keyp = miniflow_get_values(&rule->flow.mf);
>      const uint64_t *maskp = miniflow_get_values(&rule->mask->mf);
> -
>      uint64_t not_match = 0;
> +
>      for (int i = 0; i < mf_bits_total; i++) {
>          not_match |= (blocks_scratch[i] & maskp[i]) != keyp[i];
>      }
> @@ -163,7 +166,7 @@ lookup_generic_impl(struct dpcls_subtable *subtable,
>      int i;
> 
>      /* Flatten the packet metadata into the blocks_scratch[] using subtable */
> -    ULLONG_FOR_EACH_1(i, keys_map) {
> +    ULLONG_FOR_EACH_1 (i, keys_map) {
>          netdev_flow_key_flatten(keys[i],
>                                  &subtable->mask,
>                                  mf_masks,
> @@ -173,9 +176,10 @@ lookup_generic_impl(struct dpcls_subtable *subtable,
>      }
> 
>      /* Hash the now linearized blocks of packet metadata */
> -    ULLONG_FOR_EACH_1(i, keys_map) {
> +    ULLONG_FOR_EACH_1 (i, keys_map) {
>          uint32_t hash = 0;
>          uint32_t i_off = i * bit_count_total;
> +
>          for (int h = 0; h < bit_count_total; h++) {
>              hash = hash_add64(hash, blocks_scratch[i_off + h]);
>          }
> @@ -188,12 +192,13 @@ lookup_generic_impl(struct dpcls_subtable *subtable,
>       */
>      uint32_t found_map;
>      const struct cmap_node *nodes[NETDEV_MAX_BURST];
> +
>      found_map = cmap_find_batch(&subtable->rules, keys_map, hashes, nodes);
> 
>      /* Verify that packet actually matched rule. If not found, a hash
>       * collision has taken place, so continue searching with the next node.
>       */
> -    ULLONG_FOR_EACH_1(i, found_map) {
> +    ULLONG_FOR_EACH_1 (i, found_map) {
>          struct dpcls_rule *rule;
> 
>          CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) {
> @@ -236,41 +241,25 @@ dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable,
>                                    subtable->mf_bits_set_unit1);
>  }
> 
> -static uint32_t
> -dpcls_subtable_lookup_mf_u0w5_u1w1(struct dpcls_subtable *subtable,
> -                                   uint64_t *blocks_scratch,
> -                                   uint32_t keys_map,
> -                                   const struct netdev_flow_key *keys[],
> -                                   struct dpcls_rule **rules)
> -{
> -    /* hard coded bit counts - enables compile time loop unrolling, and
> -     * generating of optimized code-sequences due to loop unrolled code.
> -     */
> -    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
> -                               5, 1);
> -}
> +#define DECLARE_OPTIMIZED_LOOKUP_FUNCTION(U0, U1)                            \
> +    static uint32_t                                                          \
> +    dpcls_subtable_lookup_mf_u0w##U0##_u1w##U1(                              \
> +                                        struct dpcls_subtable *subtable,     \
> +                                        uint64_t *blocks_scratch,            \
> +                                        uint32_t keys_map,                   \
> +                                        const struct netdev_flow_key *keys[],\
> +                                        struct dpcls_rule **rules)           \
> +    {                                                                        \
> +        /* Hard coded bit counts - enables compile time loop unrolling, and  \
> +         * generating of optimized code-sequences due to loop unrolled code. \
> +         */                                                                  \
> +        return lookup_generic_impl(subtable, blocks_scratch, keys_map,       \
> +                                   keys, rules, U0, U1);                     \
> +    }
> 
> -static uint32_t
> -dpcls_subtable_lookup_mf_u0w4_u1w1(struct dpcls_subtable *subtable,
> -                                   uint64_t *blocks_scratch,
> -                                   uint32_t keys_map,
> -                                   const struct netdev_flow_key *keys[],
> -                                   struct dpcls_rule **rules)
> -{
> -    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
> -                               4, 1);
> -}
> -
> -static uint32_t
> -dpcls_subtable_lookup_mf_u0w4_u1w0(struct dpcls_subtable *subtable,
> -                                   uint64_t *blocks_scratch,
> -                                   uint32_t keys_map,
> -                                   const struct netdev_flow_key *keys[],
> -                                   struct dpcls_rule **rules)
> -{
> -    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
> -                               4, 0);
> -}
> +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5, 1);
> +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 1);
> +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 0);
> 
>  /* Probe function to lookup an available specialized function.
>   * If capable to run the requested miniflow fingerprint, this function returns
> @@ -283,14 +272,15 @@ dpcls_subtable_generic_probe(uint32_t u0_bits, uint32_t u1_bits)
>  {
>      dpcls_subtable_lookup_func f = NULL;
> 
> -    if (u0_bits == 5 && u1_bits == 1) {
> -        f = dpcls_subtable_lookup_mf_u0w5_u1w1;
> -    } else if (u0_bits == 4 && u1_bits == 1) {
> -        f = dpcls_subtable_lookup_mf_u0w4_u1w1;
> -    } else if (u0_bits == 4 && u1_bits == 0) {
> -        f = dpcls_subtable_lookup_mf_u0w4_u1w0;
> +#define CHECK_LOOKUP_FUNCTION(U0, U1)                   \
> +    if (!f && u0_bits == U0 && u1_bits == U1) {         \
> +        f = dpcls_subtable_lookup_mf_u0w##U0##_u1w##U1; \
>      }
> 
> +    CHECK_LOOKUP_FUNCTION(5, 1);
> +    CHECK_LOOKUP_FUNCTION(4, 1);
> +    CHECK_LOOKUP_FUNCTION(4, 0);
> +
>      if (f) {
>          VLOG_INFO("Subtable using Generic Optimized for u0 %d, u1 %d\n",
>                    u0_bits, u1_bits);
> ---
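> 
> With these two macros, adding a new specialized lookup becomes a two-line
> change. For example, a hypothetical 6+1 subtable (just an illustration,
> not part of this incremental) would only need:
> 
>     DECLARE_OPTIMIZED_LOOKUP_FUNCTION(6, 1);
> 
> and one more line inside dpcls_subtable_generic_probe():
> 
>     CHECK_LOOKUP_FUNCTION(6, 1);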
>
> Best regards, Ilya Maximets.
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev