Hi Yipeng,

Thanks a lot for your feedback. Please find some responses below.
Regards,
Jan

From: Wang, Yipeng1 [mailto:[email protected]]
Sent: Sunday, 17 December, 2017 19:49
To: Jan Scheurich <[email protected]>; [email protected]
Cc: Gobriel, Sameh <[email protected]>; Tai, Charlie <[email protected]>
Subject: RE: [PATCH] dpif-netdev: Refactor datapath flow cache

Hi, Jan

We went through the code and did some performance comparisons. We notice that the patch contains two optimizations: simplifying/resizing the EMC, and adding another layer of cache before the megaflow cache. The new cache idea goes in the same direction as the cuckoo distributor (CD) proposal we posted back in April (https://mail.openvswitch.org/pipermail/ovs-dev/2017-April/330570.html) and presented at the OvS 2017 conference. Comparing the two patches, we see pros and cons for both CD and DFC, and we are currently seeking a way to combine the benefits of both. We are also looking for other ways to further simplify the current datapath, to arrive at a solution that is scalable yet simple. Below are some detailed comments:

For the EMC part, we wonder whether you enabled transparent huge pages (THP) during the test. For our test case, the new EMC only gives a small speedup with THP enabled, since with huge pages, shrinking the EMC entries does not help much. Also, reducing 2-hash to 1-hash actually harms certain traffic patterns we tested. I guess the benefit of the optimization will largely depend on the traffic pattern.

Another question: it seems that when the EMC lookup calls "netdev_flow_key_equal_mf", the key length is not initialized yet, so the key comparison is not done correctly. Could you please double-check?

[Jan] Yes, THP is enabled on my test setup, but I have doubts that it significantly boosts the performance of the DFC/EMC by transparently allocating that memory on a hugepage. Do you have a means to check that on a running OVS?

[Jan] My primary goal when I chose to change the EMC implementation from 2-way to 1-way associativity was to simplify the code. In my tests I have not seen any benefit from having two possible locations for an EMC entry. As far as I can see, there is no theoretical reason why we should expect systematic collisions between pairs of flows that would justify such a design. There may well be specific traffic patterns that benefit from a 2-way EMC, but then there are obviously others for which 1-way performs better. When in doubt, I believe we should choose the simpler design.

[Jan] Regarding "netdev_flow_key_equal_mf", there is no difference to the baseline: the key length is taken from the stored EMC entry, not from the packet's flow key, so it is always valid at that point. The first sketch below illustrates this.

For the DFC cache part, we compared it with the CD patch we presented at the OvS conference. We saw that CD begins to perform better than DFC with 2 or more subtables, and gives ~60% higher throughput with 20 subtables, especially when the flow count is large (around 100k or more). We found CD to be a more scalable implementation w.r.t. subtable count and flow count. Part of the reason is that the 1-hash hash table of DFC does not perform consistently across traffic patterns and easily causes conflict misses. DFC's advantage shows up when all flows hit the DFC or when only one subtable exists. We are currently thinking about combining both approaches, for example in a hybrid model.

[Jan] A DFC hit will always be faster than a CD hit, because the latter involves a DPCLS subtable lookup; the second sketch below contrasts the two hit paths. In my tests the DFC miss rate goes up from ~4% at 5000 parallel flows to ~25% at 150K flows, so even for large numbers of flows most packets still hit the DFC. The cost of the traditional DPCLS lookup (i.e. the subtable search) must be very high to cause a big degradation.
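[Jan] For reference, here is a rough sketch of the 1-way EMC comparison path. It is heavily simplified: the struct layouts are abbreviated stand-ins (in dpif-netdev.c the miniflow is variable-length and inlined) and the table size is an assumed constant, but it shows how the memcmp length always comes from the stored entry:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define EMC_ENTRIES 8192               /* assumed power-of-two size */
    #define EMC_MASK (EMC_ENTRIES - 1)

    struct dp_netdev_flow;                 /* the megaflow entry */

    /* Abbreviated stand-in: in OVS the miniflow is variable-length. */
    struct miniflow {
        uint64_t map;
        uint64_t buf[8];                   /* packed flow values */
    };

    struct netdev_flow_key {
        uint32_t hash;                     /* RSS or computed flow hash */
        uint32_t len;                      /* bytes of miniflow data,
                                            * <= sizeof(struct miniflow) */
        struct miniflow mf;
    };

    struct emc_entry {
        struct dp_netdev_flow *flow;       /* NULL if the slot is empty */
        struct netdev_flow_key key;
    };

    struct emc_cache {
        struct emc_entry entries[EMC_ENTRIES];
    };

    /* 'key' is the *stored* entry's key, so key->len is always valid
     * here, even if the incoming packet's key->len is not set yet. */
    static inline bool
    netdev_flow_key_equal_mf(const struct netdev_flow_key *key,
                             const struct miniflow *mf)
    {
        return !memcmp(&key->mf, mf, key->len);
    }

    /* 1-way EMC lookup: exactly one candidate slot per hash. */
    static inline struct dp_netdev_flow *
    emc_lookup(struct emc_cache *emc, const struct netdev_flow_key *key)
    {
        struct emc_entry *e = &emc->entries[key->hash & EMC_MASK];

        if (e->flow && e->key.hash == key->hash
            && netdev_flow_key_equal_mf(&e->key, &key->mf)) {
            return e->flow;
        }
        return NULL;
    }

Note how emc_lookup passes &e->key, the stored key, as the first argument; that is why the packet key's len never enters the comparison.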
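[Jan] And here is how I picture the difference between the two hit paths. Again a simplified sketch, not actual patch code: the entry layouts and the helper dpcls_subtable_lookup() are illustrative assumptions, while dpcls_rule_matches_key() mirrors the existing rule check in dpif-netdev.c:

    #include <stdbool.h>

    struct netdev_flow_key;            /* the packet's flow key */
    struct dp_netdev_flow;             /* the megaflow entry */
    struct dpcls_rule;                 /* match/mask of one megaflow */
    struct dpcls_subtable;             /* one mask-specific hash table */

    /* Full subtable lookup: mask the key, hash the masked key, probe
     * the subtable's hash table, compare candidates. */
    struct dp_netdev_flow *
    dpcls_subtable_lookup(struct dpcls_subtable *st,
                          const struct netdev_flow_key *key);

    /* Single rule check: compare the key against one already
     * identified rule's mask and match, without any probing. */
    bool
    dpcls_rule_matches_key(const struct dpcls_rule *rule,
                           const struct netdev_flow_key *key);

    /* DFC entry: hash -> megaflow directly. */
    struct dfc_entry {
        struct dp_netdev_flow *flow;
        struct dpcls_rule *rule;
    };

    /* CD entry: hash -> subtable only. */
    struct cd_entry {
        struct dpcls_subtable *subtable;
    };

    static inline struct dp_netdev_flow *
    dfc_hit(const struct dfc_entry *e, const struct netdev_flow_key *key)
    {
        /* One verification of a known rule. */
        return dpcls_rule_matches_key(e->rule, key) ? e->flow : NULL;
    }

    static inline struct dp_netdev_flow *
    cd_hit(const struct cd_entry *e, const struct netdev_flow_key *key)
    {
        /* A hit still pays for the masked subtable lookup. */
        return dpcls_subtable_lookup(e->subtable, key);
    }

The point is only the relative cost: a DFC hit resolves with one pointer dereference plus one rule comparison, while a CD hit has saved the subtable *search* but still executes the full masked lookup within that subtable.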
[Jan] Can you specify your test setup in more detail? What kind of DFC miss rates do you measure depending on the flow rate? Can you publish your measurement results? Do you have an updated version of your CD patch that works with the membership library in DPDK 17.11, now that OVS master builds against 17.11?

We would love to hear your opinion on this. We think the best case is that we find a way to harmonize both patches and arrive at a datapath refactoring that is both scalable and efficient.

[Jan] I would be interested to see your ideas on how to combine DFC and CD in a good way. In principle I think CD acceleration of the DPCLS lookup could complement DFC, but I am a bit concerned about the combined memory and cache footprint of EMC, DFC and CD. Even for EMC+DFC I have some doubts here. This should be evaluated in multi-core OVS setups and with real VNFs/applications in the guests that exert significant L3 cache contention.

[Jan] Also, as all three are based on the same RSS hash as key, isn't there a likelihood of hash collisions hitting all three in the same way? I am thinking of packets that have little/no entropy in the outer headers (e.g. GRE tunnels).

Thanks
Yipeng

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
