Hi Yipeng,

Thanks a lot for your feedback. Please find some responses below.

Regards, Jan


From: Wang, Yipeng1 [mailto:[email protected]]
Sent: Sunday, 17 December, 2017 19:49
To: Jan Scheurich <[email protected]>; [email protected]
Cc: Gobriel, Sameh <[email protected]>; Tai, Charlie 
<[email protected]>
Subject: RE: [PATCH] dpif-netdev: Refactor datapath flow cache

Hi, Jan

We went through the code and did some performance comparisons. We noticed that 
the patch contains two optimizations: EMC simplification/resizing, and an 
additional layer of cache in front of the megaflow cache. The new cache idea goes 
in the same direction as the cuckoo distributor (CD) proposal we posted back in 
April (https://mail.openvswitch.org/pipermail/ovs-dev/2017-April/330570.html) and 
presented at the OvS 2017 conference. Comparing the two, we see pros and cons for 
both CD and DFC, and we are currently looking for a way to combine the benefits 
of both patches. We are also looking at other ways to further simplify the 
current datapath, to arrive at a solution that is both scalable and simple. Below 
are some detailed comments:

For the EMC part, we wonder if you enabled transparent huge pages (THP) during 
the test. In our test case, the new EMC only gives a small speedup when THP is 
enabled, since with huge pages, reducing the EMC size does not bring much 
benefit. Also, reducing the EMC from 2-way to 1-way associativity actually hurts 
some of the traffic patterns we tested. I guess the benefit of this optimization 
will largely depend on the traffic pattern. Another question: it seems that when 
the EMC lookup calls "netdev_flow_key_equal_mf", the key length is not 
initialized yet, so the key comparison is not done correctly. Could you please 
double-check?

[Jan] Yes, THP is enabled on my test setup, but I doubt that it significantly 
boosts DFC/EMC performance by transparently allocating that memory on huge pages. 
Do you have a way to check that on a running OVS?

My primary goal when I chose to change the EMC implementation from 2-way to 
1-way associativity was to simplify the code. In my tests I have not seen any 
benefit of having two possible locations for an EMC entry. As far as I can see 
there is no theoretical reason why we should expect systematic collisions of 
pairs of flows that would justify such a design. There may well be specific 
traffic patterns that benefit from 2-way EMC, but then there are obviously 
others for which 1-way performs better. When in doubt, I believe we should choose 
the simpler design.
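
To make the design difference concrete, here is a rough sketch of the two lookup 
schemes (paraphrased, not the actual patch code; the function names 
emc_lookup_2way/emc_lookup_1way are made up for illustration, while the helpers 
and constants such as emc_entry_alive(), netdev_flow_key_equal_mf(), 
EM_FLOW_HASH_MASK and EM_FLOW_HASH_SHIFT are the existing ones from 
dpif-netdev.c, simplified here):

    /* Baseline: 2-way associative EMC. Two candidate slots are derived from
     * two different portions of the RSS hash, so a flow that loses its first
     * slot can still live in the second one. */
    static inline struct dp_netdev_flow *
    emc_lookup_2way(struct emc_cache *cache, const struct netdev_flow_key *key)
    {
        for (int i = 0; i < 2; i++) {
            uint32_t idx = (key->hash >> (i * EM_FLOW_HASH_SHIFT))
                           & EM_FLOW_HASH_MASK;
            struct emc_entry *entry = &cache->entries[idx];

            if (entry->key.hash == key->hash
                && emc_entry_alive(entry)
                && netdev_flow_key_equal_mf(&entry->key, &key->mf)) {
                return entry->flow;
            }
        }
        return NULL;
    }

    /* 1-way variant: a single candidate slot, i.e. one probe per lookup. */
    static inline struct dp_netdev_flow *
    emc_lookup_1way(struct emc_cache *cache, const struct netdev_flow_key *key)
    {
        struct emc_entry *entry = &cache->entries[key->hash & EM_FLOW_HASH_MASK];

        if (entry->key.hash == key->hash
            && emc_entry_alive(entry)
            && netdev_flow_key_equal_mf(&entry->key, &key->mf)) {
            return entry->flow;
        }
        return NULL;
    }

The 1-way variant saves one probe and a few branches on a miss; the price is that 
two flows whose hashes map to the same slot keep evicting each other, which is 
exactly the kind of traffic pattern where the 2-way scheme can win.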

Regarding "netdev_flow_key_equal_mf", there is no difference to the baseline. 
The key length is taken from the stored EMC entry, not the packet's flow key.
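
To spell that out, this is the baseline comparison as I read it in dpif-netdev.c 
(paraphrased from memory, so please double-check against the tree):

    /* 'key' is the netdev_flow_key stored in the EMC entry; its 'len' field
     * was filled in when the entry was inserted. 'mf' is the miniflow
     * extracted from the packet. The packet-side key's 'len' is therefore
     * not needed for the lookup. */
    static inline bool
    netdev_flow_key_equal_mf(const struct netdev_flow_key *key,
                             const struct miniflow *mf)
    {
        return !memcmp(&key->mf, mf, key->len);
    }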

For the DFC cache part, we compared it with the CD patch we presented at the OvS 
conference. We saw CD begin to perform better than DFC with 2 or more subtables, 
and ~60% higher throughput with 20 subtables, especially when the flow count is 
large (around 100k or more). We found CD to be a more scalable implementation 
w.r.t. subtable count and flow count. Part of the reason is that the 1-hash hash 
table of DFC does not perform consistently across traffic patterns and easily 
causes conflict misses. DFC's advantage shows up when all flows hit DFC or when 
only 1 subtable exists. We are currently thinking about combining both 
approaches, for example in a hybrid model.

[Jan] A DFC hit will always be faster than a CD hit because the latter involves a 
DPCLS subtable lookup. In my tests the DFC miss rate goes up from ~4% at 5000 
parallel flows to ~25% at 150K flows, so even for a large number of flows most 
packets still hit the DFC. The cost of a traditional DPCLS lookup (i.e. the 
subtable search) would have to be very high to cause a big degradation.
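
As a back-of-the-envelope illustration (the cycle numbers are purely 
hypothetical, not measurements): with hit rate p and per-packet costs h for a DFC 
hit and m for a miss that falls through to the DPCLS subtable search, the average 
lookup cost is roughly p*h + (1-p)*m. Even at p = 0.75, taking e.g. h = 50 and 
m = 300 cycles gives an average of ~112 cycles, still well below the cost of 
searching the subtables for every packet; m would have to be an order of 
magnitude larger than h before the misses dominate.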

Can you specify your test setup in more detail? What DFC miss rates do you 
measure as a function of the number of flows? Can you publish your measurement 
results?

Do you have an updated version of your CD patch that works with the membership 
library in DPDK 17.11 now that OVS master builds against 17.11?

We would love to hear your opinion on this. In the best case we can find a way to 
harmonize both patches and arrive at a datapath refactoring that is both scalable 
and efficient.

[Jan] I would be interested to see your ideas on how to combine DFC and CD in a 
good way.

In principle I think CD acceleration of the DPCLS lookup could complement DFC, 
but I am a bit concerned about the combined memory and cache footprint of EMC, 
DFC and CD. Even for EMC+DFC I have some doubts here. This should be evaluated in 
multi-core OVS setups and with real VNFs/applications in the guests that exert 
significant L3 cache contention.

Also, as all three are keyed on the same RSS hash, isn't there a risk that hash 
collisions hit all three in the same way? I am thinking of packets that have 
little or no entropy in the outer headers (e.g. GRE tunnels).

Thanks
Yipeng

