Hi, Jan,

Please see my comments inline below.

Thanks
Yipeng

>-----Original Message-----
>From: Jan Scheurich [mailto:[email protected]]
>Sent: Monday, December 18, 2017 9:01 AM
>To: Wang, Yipeng1 <[email protected]>; [email protected]
>Cc: Gobriel, Sameh <[email protected]>; Tai, Charlie
><[email protected]>
>Subject: RE: [PATCH] dpif-netdev: Refactor datapath flow cache
>
>Hi Yipeng,
>
>Thanks a lot for your feedback. Please find some responses below.
>
>Regards, Jan
>
>
>From: Wang, Yipeng1 [mailto:[email protected]]
>Sent: Sunday, 17 December, 2017 19:49
>To: Jan Scheurich <[email protected]>; [email protected]
>Cc: Gobriel, Sameh <[email protected]>; Tai, Charlie
><[email protected]>
>Subject: RE: [PATCH] dpif-netdev: Refactor datapath flow cache
>
>Hi, Jan
>
>We went through the code and did some performance comparisons. We
>notice that the patch contains two optimizations: EMC
>simplification/resizing, and another layer of cache added before the
>megaflow cache. The new cache idea goes in the same direction as the
>cuckoo distributor (CD) proposal we posted back in April
>(https://mail.openvswitch.org/pipermail/ovs-dev/2017-April/330570.html)
>and presented at the OvS 2017 conference. Comparing the two patches, we
>see pros and cons for both CD and DFC, and we are currently seeking a way
>to combine the benefits of both. We are also looking for other ways to
>further simplify the current datapath, to arrive at a solution that is
>both scalable and simple. Below are some detailed comments:
>
>For the EMC part, we wonder if you enabled transparent huge pages (THP)
>during the test. For our test case, the new EMC only gives a small speedup
>if THP is enabled, since with huge pages, reducing the EMC entry size does
>not help much. Also, reducing 2-hash to 1-hash actually hurts certain
>traffic patterns we tested. I guess the optimization will largely depend on
>the traffic pattern. Another question: it seems that when the EMC lookup
>calls "netdev_flow_key_equal_mf", the key length is not initialized yet.
>Thus, the key comparison is not done correctly. Could you please double
>check?
>
>[Jan] Yes, THP is enabled on my test setup, but I have doubts that it
>significantly boosts the performance of the DFC/EMC by transparently
>allocating that memory on a hugepage. Do you have a means to check that on
>a running OVS?

[Wang, Yipeng] In my test I compared the proposed EMC with the current EMC,
both sized for 16k entries. With THP turned off, the current EMC causes many
TLB misses because of its larger entry size, which I profiled with VTune.
Once I turned THP on, with no other changes, the current EMC's throughput
increases a lot and becomes comparable with the newly proposed EMC. According
to VTune, the EMC lookup TLB misses drop from 100 million to 0 over the
30-second profiling window. So with THP enabled, reducing the EMC entry size
may not give much benefit compared to the current EMC. It is worth mentioning
that both use a similar amount of CPU cache, since only the miniflow struct
is touched by the CPU, so the TLB should be the main concern.
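
To your question about checking THP on a running OVS: we look at the
AnonHugePages fields in /proc/<pid>/smaps of the ovs-vswitchd process and sum
them. As a rough sketch (standard /proc parsing, nothing OVS-specific; it
reports the whole process, not the EMC/DFC allocation specifically):

/* Rough sketch: report how much of a process's anonymous memory is backed
 * by transparent huge pages, by summing the AnonHugePages fields of
 * /proc/<pid>/smaps. */
#include <stdio.h>

int
main(int argc, char *argv[])
{
    char path[64], line[256];
    long total_kb = 0;
    FILE *f;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid-of-ovs-vswitchd>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof path, "/proc/%s/smaps", argv[1]);
    f = fopen(path, "r");
    if (!f) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof line, f)) {
        long kb;

        if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1) {
            total_kb += kb;
        }
    }
    fclose(f);
    printf("AnonHugePages: %ld kB total\n", total_kb);
    return 0;
}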

>My primary goal when I chose to change the EMC implementation from 2-way
>to 1-way associativity was to simplify the code. In my tests I have not seen 
>any
>benefit of having two possible locations for an EMC entry. As far as I can see
>there is no theoretical reason why we should expect systematic collisions of
>pairs of flows that would justify such a design. There may well be specific
>traffic patterns that benefit from 2-way EMC, but then there are obviously
>others for which 1-way performs better. In doubt I believe we should choose
>the simpler design.
>
[Wang, Yipeng] Yes, there are no systematic collisions. However, in general a
1-hash table tends to incur many more misses than a 2-hash table. On code
simplicity, I agree that 1-hash is simpler and much easier to understand. On
performance: if the flows fit in the 1-hash table, they would also stay in
the primary location of the 2-hash table, so the two should have similar
lookup speed. For large numbers of flows, however, traffic will see a higher
miss ratio with a 1-hash table than with a 2-hash table. In one of our tests
with 10k flows and 3 subtables (test cases described later), with the EMC
sized for 16k entries, the 2-hash EMC shows about a 14% miss ratio, while the
1-hash EMC shows 47%.
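
To make the 1-hash vs. 2-hash difference concrete, here is a simplified
sketch of the two lookup schemes (the types, sizes and helper below are
illustrative stand-ins, not the dpif-netdev code). The second candidate slot,
indexed by the next segment of the hash, is what keeps the miss ratio lower
once the table starts to fill up:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for the dpif-netdev types. */
struct flow_key {
    uint32_t hash;        /* RSS or recomputed packet hash. */
    uint32_t len;         /* Valid bytes in data[] (<= sizeof data). */
    uint8_t data[64];     /* Stand-in for the packed miniflow values. */
};

struct emc_slot {
    struct flow_key key;  /* Stored key; len == 0 means the slot is empty. */
    const void *flow;     /* Cached megaflow pointer. */
};

#define EMC_SHIFT 13                      /* 8k buckets in this sketch. */
#define EMC_MASK  ((1u << EMC_SHIFT) - 1)

static bool
slot_matches(const struct emc_slot *s, const struct flow_key *key)
{
    return s->key.len == key->len
           && s->key.hash == key->hash
           && !memcmp(s->key.data, key->data, key->len);
}

/* 1-way: a single candidate slot per hash. */
static const void *
emc_lookup_1way(const struct emc_slot cache[], const struct flow_key *key)
{
    const struct emc_slot *s = &cache[key->hash & EMC_MASK];

    return slot_matches(s, key) ? s->flow : NULL;
}

/* 2-way: a second candidate indexed by the next segment of the hash, so two
 * flows whose first segments collide do not have to evict each other. */
static const void *
emc_lookup_2way(const struct emc_slot cache[], const struct flow_key *key)
{
    const struct emc_slot *s1 = &cache[key->hash & EMC_MASK];
    const struct emc_slot *s2 = &cache[(key->hash >> EMC_SHIFT) & EMC_MASK];

    if (slot_matches(s1, key)) {
        return s1->flow;
    }
    return slot_matches(s2, key) ? s2->flow : NULL;
}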

>Regarding "netdev_flow_key_equal_mf", there is no difference to the
>baseline. The key length is taken from the stored EMC entry, not the packet's
>flow key.
>
[Wang, Yipeng] Sorry, I did not explain it correctly. It seems the key length
may not be initialized yet when emc_probabilistic_insert is called. If that
is right, the EMC entry does not contain the correct key length, and the EMC
lookup then does not compare with the correct length either. This is because
the EMC insert now happens during dfc_lookup, whereas it originally happened
after fast_path_processing. Do I understand this correctly?
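
To illustrate what we mean (simplified stand-in types again, not the patch's
code): the stored length is copied at insert time, so whatever value 'len'
holds at that point is what every later lookup will compare with:

#include <stdint.h>

struct flow_key {
    uint32_t hash;        /* Packet hash. */
    uint32_t len;         /* Valid bytes in data[]; must be set before use. */
    uint8_t data[64];     /* Stand-in for the packed miniflow values. */
};

struct emc_slot {
    struct flow_key key;  /* Stored key. */
    const void *flow;     /* Cached megaflow pointer. */
};

static void
emc_slot_insert(struct emc_slot *slot, const struct flow_key *key,
                const void *flow)
{
    /* Copies key->len along with the data: if 'len' has not been filled in
     * yet (because the insert moved from after the fast path into the
     * dfc_lookup step), every later lookup will compare the wrong number of
     * bytes against this entry. */
    slot->key = *key;
    slot->flow = flow;
}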

>For the DFC cache part, we compare with our CD patch we presented in the
>OvS conference. We saw CD begins to perform better than DFC with 2 or
>more subtables, and ~60% higher throughput with 20 subtables, especially
>when flow count is large (around 100k or more). We found CD is a more
>scalable implementation w.r.t subtable count and flow count. Part of the
>reason is that the 1-hash hash table of DFC does not perform consistently with
>various traffic patterns, and easily causes conflict misses. DFC's advantage
>shows up when all flows hit DFC or only 1 subtable exists. We are currently
>thinking about combining both approaches for example a hybrid model.
>
>[Jan] A DFC hit will always be faster than a CD hit because the latter involves
>a DPCLS subtable lookup. In my tests the DFC miss rate goes up from ~4% at
>5000 parallel flows to ~25% at 150K flows, so even for large numbers of flows
>most still hit DFC. The cost of traditional DPCLS lookup (i.e. subtable search)
>must be very high to cause a big degradation.
>
[Wang, Yipeng] We agree that a DFC hit performs better than a CD hit, but CD
usually has a higher hit rate for large numbers of flows, as the data below
shows.

>Can you specify your test setup in more detail? What kind of DFC miss rates do
>you measure depending on the flow rate? Can you publish your
>measurement results?
>
[Wang, Yipeng] We use the tests/rules we posted with our CD patch. Basically
we vary src_IP to hit different subtables and vary dst_IP to create different
numbers of flows. We use Spirent to generate src_IP from 1.0.0.0 to 20.0.0.0
depending on the subtable count, and dst_IP from 0.0.0.0 up to a value
depending on the flow count. It is similar to your traffic pattern with
varying UDP port numbers. We use your proposed EMC design for both schemes.
Here is the throughput ratio we collected:

Throughput ratio, CD to DFC (both with 1M entries; CD costs 4 MB while DFC
costs 8 MB; THP on):

flow cnt \ subtable cnt    1       3       5       10      20
10k                        1.00    1.00    0.85    1.00    1.00
100k                       0.81    1.15    1.17    1.35    1.55
1M                         0.80    1.12    1.31    1.37    1.63

>Do you have an updated version of your CD patch that works with the
>membership library in DPDK 17.11 now that OVS master builds against 17.11?
>
[Wang, Yipeng] We have the code and can send it to you for testing if you
would like. But since we now think it is better to combine the benefits of
DFC and CD, it would be better to post a more mature patch on the mailing
list later.

>We would love to hear your opinion on this. We think the best case would be
>to find a way to harmonize both patches and arrive at a way to refactor the
>datapath that is both scalable and efficient.
>
>I would be interested to see your ideas how to combine DFC and CD in a good
>way.
>
[Wang, Yipeng] We are thinking of using the indirect table to store either a
pointer to the megaflow (like DFC) or a pointer to the subtable (like CD).
The heuristic would depend on the number of active megaflows and the locality
of the accesses. This way we could keep the smaller size and higher hit rate
of CD, thanks to the indirect table, while avoiding the dpcls subtable access
the way DFC does.
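
Just to sketch what such a hybrid entry could look like (purely illustrative;
names and layout are not implemented anywhere yet): each slot carries a tag
saying whether its index points at a cached megaflow (DFC-style, no subtable
lookup at all) or at a dpcls subtable (CD-style, one subtable search instead
of many):

#include <stdint.h>

/* Illustrative layout for a hybrid DFC/CD slot: 4 bytes per entry, the same
 * footprint as a CD entry.  The 15-bit index points into an indirect table
 * that holds either megaflow pointers (DFC-style) or subtable pointers
 * (CD-style). */

#define HYBRID_KIND_SUBTABLE 0u  /* Index selects a dpcls subtable. */
#define HYBRID_KIND_MEGAFLOW 1u  /* Index selects a cached megaflow. */

struct hybrid_entry {
    uint16_t sig;                /* Hash signature to filter most misses. */
    uint16_t tagged_index;       /* Top bit = kind, low 15 bits = index. */
};

static inline unsigned int
hybrid_kind(const struct hybrid_entry *e)
{
    return e->tagged_index >> 15;
}

static inline unsigned int
hybrid_index(const struct hybrid_entry *e)
{
    return e->tagged_index & 0x7fff;
}

The heuristic that decides which tag to store, e.g. when to promote an entry
from subtable to megaflow, is exactly the part we still need to work out.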

>In principle I think CD acceleration of DPCLS lookup could complement DFC but
>I am a bit concerned about the combined memory and cache footprint of EMC,
>DFC and CD. Even for EMC+DFC I have some doubts here. This should be
>evaluated in multi-core setups of OVS and with real VNFs/applications in the
>guests that create significant L3 cache contention.
>
[Wang, Yipeng] The cache footprint was our concern too when we worked on CD,
which is why we store a subtable index instead of a pointer to the megaflow.
With the hybrid model mentioned above, I think we can get a solution that is
both memory efficient and scalable.
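
Just to put numbers on the footprint point (this matches the 4 MB vs. 8 MB
figures quoted with the table above; the entry layouts here are assumptions
for illustration, not the actual structs in either patch):

#include <stdint.h>
#include <stdio.h>

struct dfc_entry_sketch {       /* DFC-style: full pointer to the megaflow. */
    void *megaflow;             /* 8 B on 64-bit. */
};

struct cd_entry_sketch {        /* CD-style: signature + subtable index. */
    uint16_t sig;               /* 2 B. */
    uint16_t subtable_idx;      /* 2 B. */
};

int
main(void)
{
    const size_t n = 1u << 20;  /* ~1M entries, as in the table above. */

    printf("DFC: %zu MB\n", n * sizeof(struct dfc_entry_sketch) >> 20);
    printf("CD:  %zu MB\n", n * sizeof(struct cd_entry_sketch) >> 20);
    return 0;
}

That is 4 MB vs. 8 MB of data competing for L3 and TLB entries, which is why
keeping the per-entry size small matters to us in the hybrid model as well.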

>Also, as all three are based on the same RSS hash as key, isn't there a
>likelihood of hash collisions hitting all three in the same way? I am thinking
>about packets that have little/no entropy in the outer headers (e.g. GRE
>tunnels).
>
[Wang, Yipeng] If there is no entropy within the range considered by RSS,
they will collide. But this is another case where the 2-hash EMC may help
compared to 1-hash, right? Keys with the same signature have an alternative
location to go to.

>Thanks
>Yipeng
>