On 29.11.2017 00:40, Bodireddy, Bhanuprakash wrote:
>> On 27.11.2017 20:02, Bodireddy, Bhanuprakash wrote:
>>>> I agree with Ilya here. Adding these cache line markers and re-grouping variables to minimize gaps in cache lines is creating a maintenance burden without any tangible benefit. I have had to go through the pain of refactoring my PMD Performance Metrics patch to the new dp_netdev_pmd_thread struct and spent a lot of time to analyze the actual memory layout with GDB and play Tetris with the variables.
>>>
>>> Analyzing the memory layout with gdb for large structures is time consuming and not usually recommended.
>>> I would suggest using Poke-a-hole (pahole), which helps to understand and fix the structures in no time.
>>> With pahole it's going to be a lot easier to work with large structures especially.
>>
>> Interesting tool, but it seems it doesn't work perfectly. I see duplicated unions and zero length arrays in the output and I still need to check sizes by hand. And it fails trying to run on my Ubuntu 16.04 LTS on x86.
>> IMHO, the code should be simple enough to not require external utilities when you need to check a single structure.
>
> Pahole has been there for a while and is available with most distributions and works reliably on RHEL distros.
> I am on Fedora and built pahole from sources and it displays all the sizes and cacheline boundaries.
>
>>
>>>> There will never be more than a handful of PMDs, so minimizing the gaps does not matter from a memory perspective. And whether the individual members occupy 4 or 5 cache lines does not matter either compared to the many hundred cache lines touched for EMC and DPCLS lookups of an Rx batch. And any optimization done for x86 is not necessarily optimal for other architectures.
>>>
>>> I agree that optimization targeted for x86 doesn't necessarily suit ARM due to its different cache line size.
>>>
>>>> Finally, even for x86 there is not even a performance improvement. I re-ran our standard L3VPN over VXLAN performance PVP test on master and with Ilya's revert patch:
>>>>
>>>> Flows     master   reverted
>>>> 8,        4.46     4.48
>>>> 100,      4.27     4.29
>>>> 1000,     4.07     4.07
>>>> 2000,     3.68     3.68
>>>> 5000,     3.03     3.03
>>>> 10000,    2.76     2.77
>>>> 20000,    2.64     2.65
>>>> 50000,    2.60     2.61
>>>> 100000,   2.60     2.61
>>>> 500000,   2.60     2.61
>>>
>>> What are the CFLAGS in this case, as they seem to make a difference. I have added my findings here for a different patch targeted at performance:
>>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341270.html
>>
>> Do you have any performance results that show a significant performance difference between the above cases? Please describe your test scenario and environment so we'll be able to see that padding/alignment is really needed here. I saw no such results yet.
>>
>> BTW, at one place you're saying that the patch was not about performance, while at the same time you're trying to show that it has some positive performance impact. I'm a bit confused by that.
>
> Giving a bit more context here, I have been experimenting with *prefetching* in OvS, as prefetching isn't used except in 2 instances (emc_processing & cmaps). This work is aimed at checking the performance benefits of prefetching not just on Haswell but also on newer ranges of processors.
>
> The best way to prefetch a part of a structure is to mark the portion of it.
> This isn't possible unless we have some kind of cache line marking. This is what my patch initially did, and then we can prefetch portions of it based on the cacheline markers. You can find an example in the pkt_metadata struct.
>
> My point is that on x86, with cache line marking and with the xzalloc_cacheline() API, one shouldn't see a drop in performance, if not an improvement. But the real improvements would be seen when the prefetching is done in the right places, and that's WIP.
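As a concrete illustration of the marker-based prefetch pattern described above (modeled loosely on how struct pkt_metadata is laid out), here is a minimal sketch. The structure and function names are hypothetical; only PADDED_MEMBERS_CACHELINE_MARKER, CACHE_LINE_SIZE and OVS_PREFETCH are assumed to come from lib/util.h:

    #include <stdint.h>
    #include "util.h"   /* PADDED_MEMBERS_CACHELINE_MARKER, CACHE_LINE_SIZE,
                         * OVS_PREFETCH. */

    /* Hypothetical structure: the hot, per-packet members sit in the first
     * marked cache line so a caller can prefetch just that line. */
    struct example_hot_cold {
        PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
            uint64_t hits;                    /* Hot: touched per packet. */
            uint64_t misses;                  /* Hot: touched per packet. */
        );
        PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
            long long int next_optimization;  /* Cold: touched occasionally. */
        );
    };

    static inline void
    example_hot_cold_prefetch(const struct example_hot_cold *ec)
    {
        /* Pull in only the hot cache line, not the whole structure. */
        OVS_PREFETCH(&ec->cacheline0);
    }

Whether this actually helps depends on the access pattern; as noted above, the markers by themselves only make such experiments possible.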
Are there any preliminary results for this work? Is there any significant performance improvement from prefetching the 'dp_netdev_pmd_thread' structure? Otherwise I see no reason to keep these paddings, which are needed only for experiments and complicate normal development.

>
> Bhanuprakash.
>
>>
>>>
>>> Patches to consider when testing your use case:
>>> xzalloc_cacheline: https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341231.html
>>> (If using output batching) https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341230.html
>>>
>>> - Bhanuprakash.
>>>
>>>> All in all, I support reverting this change.
>>>>
>>>> Regards, Jan
>>>>
>>>> Acked-by: Jan Scheurich <[email protected]>
>>>>
>>>>> -----Original Message-----
>>>>> From: [email protected] [mailto:[email protected]] On Behalf Of Bodireddy, Bhanuprakash
>>>>> Sent: Friday, 24 November, 2017 17:09
>>>>> To: Ilya Maximets <[email protected]>; ovs-[email protected]; Ben Pfaff <[email protected]>
>>>>> Cc: Heetae Ahn <[email protected]>
>>>>> Subject: Re: [ovs-dev] [PATCH] Revert "dpif_netdev: Refactor dp_netdev_pmd_thread structure."
>>>>>
>>>>>> On 22.11.2017 20:14, Bodireddy, Bhanuprakash wrote:
>>>>>>>> This reverts commit a807c15796ddc43ba1ffb2a6b0bd2ad4e2b73941.
>>>>>>>>
>>>>>>>> Padding and aligning of dp_netdev_pmd_thread structure members is useless, broken in several ways and only greatly degrades maintainability and extensibility of the structure.
>>>>>>>
>>>>>>> The idea of my earlier patch was to mark the cache lines and reduce the holes while still maintaining the grouping of related members in this structure.
>>>>>>
>>>>>> Some of the grouping aspects look strange. For example, it looks illogical that 'exit_latch' is grouped with 'flow_table' but not with 'reload_seq' and other reload related stuff. It looks strange that statistics and counters are spread across different groups. So, IMHO, it's not well grouped.
>>>>>
>>>>> I had to strike a fine balance and some members may be placed in a different group due to their sizes and importance. Let me think if I can make it better.
>>>>>
>>>>>>
>>>>>>> Also cache line marking is a good practice to make someone extra cautious when extending or editing important structures.
>>>>>>> Most importantly I was experimenting with prefetching on this structure and needed cache line markers for it.
>>>>>>>
>>>>>>> I see that you are on ARM (I don't have HW to test) and want to know if this commit has any negative effect; any numbers would be appreciated.
>>>>>>
>>>>>> Basic VM-VM testing shows a stable 0.5% performance improvement with the revert applied.
>>>>>
>>>>> I did P2P, PVP and PVVP with IXIA and haven't noticed any drop on x86.
>>>>>
>>>>>> Padding adds 560 additional bytes of holes.
>>>>> As the cache line on ARM is 128 bytes, it created holes; I can find a workaround to handle this.
>>>>>
>>>>>>
>>>>>>> More comments inline.
>>>>>>>
>>>>>>>>
>>>>>>>> Issues:
>>>>>>>>
>>>>>>>> 1. It's not working because all the instances of struct dp_netdev_pmd_thread are allocated only by the usual malloc. All the memory is not aligned to cachelines -> the structure almost never starts at an aligned memory address. This means that any further paddings and alignments inside the structure are completely useless.
>>>>>>>> For example:
>>>>>>>>
>>>>>>>> Breakpoint 1, pmd_thread_main
>>>>>>>> (gdb) p pmd
>>>>>>>> $49 = (struct dp_netdev_pmd_thread *) 0x1b1af20
>>>>>>>> (gdb) p &pmd->cacheline1
>>>>>>>> $51 = (OVS_CACHE_LINE_MARKER *) 0x1b1af60
>>>>>>>> (gdb) p &pmd->cacheline0
>>>>>>>> $52 = (OVS_CACHE_LINE_MARKER *) 0x1b1af20
>>>>>>>> (gdb) p &pmd->flow_cache
>>>>>>>> $53 = (struct emc_cache *) 0x1b1afe0
>>>>>>>>
>>>>>>>> All of the above addresses are shifted from the cacheline start by 32B.
>>>>>>>
>>>>>>> If you see below, all the addresses are 64 byte aligned.
>>>>>>>
>>>>>>> (gdb) p pmd
>>>>>>> $1 = (struct dp_netdev_pmd_thread *) 0x7fc1e9b1a040
>>>>>>> (gdb) p &pmd->cacheline0
>>>>>>> $2 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a040
>>>>>>> (gdb) p &pmd->cacheline1
>>>>>>> $3 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a080
>>>>>>> (gdb) p &pmd->flow_cache
>>>>>>> $4 = (struct emc_cache *) 0x7fc1e9b1a0c0
>>>>>>> (gdb) p &pmd->flow_table
>>>>>>> $5 = (struct cmap *) 0x7fc1e9fba100
>>>>>>> (gdb) p &pmd->stats
>>>>>>> $6 = (struct dp_netdev_pmd_stats *) 0x7fc1e9fba140
>>>>>>> (gdb) p &pmd->port_mutex
>>>>>>> $7 = (struct ovs_mutex *) 0x7fc1e9fba180
>>>>>>> (gdb) p &pmd->poll_list
>>>>>>> $8 = (struct hmap *) 0x7fc1e9fba1c0
>>>>>>> (gdb) p &pmd->tnl_port_cache
>>>>>>> $9 = (struct hmap *) 0x7fc1e9fba200
>>>>>>> (gdb) p &pmd->stats_zero
>>>>>>> $10 = (unsigned long long (*)[5]) 0x7fc1e9fba240
>>>>>>>
>>>>>>> I tried using xzalloc_cacheline instead of the default xzalloc() here. I tried tens of times and always found that the address is 64 byte aligned, so it should start at the beginning of a cache line on x86.
>>>>>>> Not sure why the comment "(The memory returned will not be at the start of a cache line, though, so don't assume such alignment.)" says otherwise?
>>>>>>
>>>>>> Yes, you will always get aligned addresses on your x86 Linux system that supports the posix_memalign() call. The comment says what it says because the function falls back to some memory allocation tricks in case posix_memalign() is not available (Windows, some MacOS, maybe some Linux systems (not sure)) and the address will not be aligned in this case.
>>>>>
>>>>> I also verified the other case, when posix_memalign isn't available, and even in that case it returns an address aligned on a CACHE_LINE_SIZE boundary. I will send out a patch to use xzalloc_cacheline for allocating the memory.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Can we fix it properly? NO.
>>>>>>>> OVS currently doesn't have an appropriate API to allocate aligned memory. The best candidate is 'xmalloc_cacheline()' but it clearly states that "The memory returned will not be at the start of a cache line, though, so don't assume such alignment". And also, this function will never return aligned memory on Windows or MacOS.
>>>>>>>>
>>>>>>>> 2. CACHE_LINE_SIZE is not constant. Different architectures have different cache line sizes, but the code assumes that CACHE_LINE_SIZE is always equal to 64 bytes. All the structure members are grouped by 64 bytes and padded to CACHE_LINE_SIZE. This leads to huge holes in the structures if CACHE_LINE_SIZE differs from 64. This is the opposite of portability. If I want good performance of cmap I need to have CACHE_LINE_SIZE equal to the real cache line size, but I will have huge holes in the structures.
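Purely for illustration of the alignment point discussed above: a minimal, hypothetical sketch of a cache-line-aligned zeroed allocation on systems that provide posix_memalign(). This is not OVS's actual xzalloc_cacheline() implementation (which also has to cover platforms without posix_memalign()), and the helper name is made up:

    #include <stdlib.h>
    #include <string.h>
    #include "util.h"                       /* CACHE_LINE_SIZE. */

    /* Hypothetical helper: returns zeroed memory starting on a cache line
     * boundary, or NULL on failure.  It relies on posix_memalign(), so it
     * does not cover the Windows/MacOS fallbacks mentioned above. */
    static void *
    example_zalloc_cacheline(size_t size)
    {
        void *p = NULL;

        if (posix_memalign(&p, CACHE_LINE_SIZE, size)) {
            return NULL;
        }
        memset(p, 0, size);
        return p;
    }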
>>>>>>>> If you'll take a look at struct rte_mbuf from DPDK you'll see that it uses 2 defines, RTE_CACHE_LINE_SIZE and RTE_CACHE_LINE_MIN_SIZE, to avoid holes in the mbuf structure.
>>>>>>>
>>>>>>> I understand that ARM and a few other processors (like OCTEON) have 128 byte cache lines.
>>>>>>> But again, I am curious about the performance impact in your case with this new alignment.
>>>>>>>
>>>>>>>>
>>>>>>>> 3. Sizes of system/libc defined types are not constant for all the systems. For example, sizeof(pthread_mutex_t) == 48 on my ARMv8 machine, but only 40 on x86. The difference could be much bigger on Windows or MacOS systems. But the code assumes that sizeof(struct ovs_mutex) is always 48 bytes. This may lead to broken alignment/big holes in case of padding/wrong comments about the amount of free pad bytes.
>>>>>>>
>>>>>>> This isn't an issue, as you have already mentioned, and is more about the comment that reports the pad bytes.
>>>>>>> In the case of ARM it would be just 8 pad bytes instead of 16 on x86.
>>>>>>>
>>>>>>> union {
>>>>>>>     struct {
>>>>>>>         struct ovs_mutex port_mutex;    /* 4849984    48 */
>>>>>>>     };                                  /*            48 */
>>>>>>>     uint8_t pad13[64];                  /*            64 */
>>>>>>> };
>>>>>>
>>>>>> It's not only about 'port_mutex'. If you'll take a look at 'flow_mutex', you will see that it is not even padded. So, increasing the size of 'flow_mutex' leads to shifting of all the following padded blocks and no other blocks will be properly aligned, even if the structure is allocated on aligned memory.
>>>>>>>>
>>>>>>>> 4. Sizes of many fields in the structure depend on defines like DP_N_STATS, PMD_N_CYCLES, EM_FLOW_HASH_ENTRIES and so on. Any change in these defines, or any change in any structure contained by the thread, leads to a not so simple refactoring of the whole dp_netdev_pmd_thread structure. This greatly reduces maintainability and complicates development of new features.
>>>>>>>
>>>>>>> I don't think it complicates development; instead I feel the commit gives a clear indication to the developer that the members are grouped and aligned and marked with cacheline markers.
>>>>>>> This makes the developer extra cautious when adding new members so that holes can be avoided.
>>>>>>
>>>>>> Starting the rebase of the output batching patch-set I figured out that I need to remove 'unsigned long long last_cycles' and add 'struct dp_netdev_pmd_thread_ctx ctx', which is 8 bytes larger. Could you, please, suggest where I should place that new structure member and what to do with the hole from 'last_cycles'?
>>>>>>
>>>>>> This is not a trivial question, because the already poor grouping will become worse almost any way.
>>>>>
>>>>> Aah, I realized now that the batching series doesn't cleanly apply on master.
>>>>> Let me check this and will send across the changes that should fix this.
>>>>>
>>>>> - Bhanuprakash
>>>>>>>
>>>>>>> Cacheline marking the structure is a good practice and I am sure this structure is significant and should be carefully extended in the future.
>>>>>>
>>>>>> Not so sure about that.
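To make the rte_mbuf comparison above concrete, here is a hypothetical sketch of the two-define idea: members are grouped at a fixed minimum granularity while CACHE_LINE_SIZE keeps the real line size, so a 128-byte cache line machine does not get a 64-byte pad hole after every group. CACHE_LINE_MIN_SIZE and the structure name are made up; only PADDED_MEMBERS comes from lib/util.h:

    #include <stdint.h>
    #include "util.h"               /* PADDED_MEMBERS. */

    /* Made-up define mirroring DPDK's RTE_CACHE_LINE_MIN_SIZE: group members
     * at 64-byte granularity even if CACHE_LINE_SIZE is 128. */
    #define CACHE_LINE_MIN_SIZE 64

    struct example_grouped {
        PADDED_MEMBERS(CACHE_LINE_MIN_SIZE,
            uint64_t rx_packets;    /* First 64-byte group. */
            uint64_t tx_packets;
        );
        PADDED_MEMBERS(CACHE_LINE_MIN_SIZE,
            uint64_t rx_bytes;      /* Second 64-byte group: on a 128-byte
                                     * cache line it shares the hardware line
                                     * with the group above instead of leaving
                                     * a 64-byte pad hole. */
            uint64_t tx_bytes;
        );
    };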
>>>>>>>
>>>>>>>> 5. There is no reason to align the flow_cache member because it's too big and we usually access random entries by a single thread only.
>>>>>>>>
>>>>>>>
>>>>>>> I see your point. This patch wasn't done for performance; it was more about having some order in this ever growing structure. During testing I found that for some test cases aligning the flow_cache was giving me 100k+ performance consistently, and so it was added.
>>>>>>
>>>>>> This was a random performance boost. You achieved it without aligned memory allocation. Just luck with your system environment. Using xzalloc_cacheline will likely eliminate this performance difference or even degrade the performance.
>>>>>>>
>>>>>>>> So, the padding/alignment only creates some appearance of performance optimization but does nothing useful in reality. It only complicates maintenance and adds huge holes for non-x86 architectures and non-Linux systems. The performance improvement stated in the original commit message is likely random and not valuable. I see no performance difference.
>>>>>>>
>>>>>>> I understand that this is causing issues with architectures having different cache line sizes, but unfortunately the majority of them have 64 byte cache lines, so this change makes sense.
>>>>>>
>>>>>> I understand that 64 byte cache lines are spread a lot wider. I also have x86 as a target arch, but still, IMHO, OVS is a cross-platform application and it should not have platform dependent stuff which makes one architecture/platform better and worsens others.
>>>>>>
>>>>>>>
>>>>>>> If you have performance data to prove that this causes severe perf degradation I can think of workarounds for ARM.
>>>>>>>
>>>>>>> - Bhanuprakash.
>>>>>>
>>>>>>
>>>>>> P.S.: If you want to test with CACHE_LINE_SIZE=128 you will have to apply the following patch to avoid a build time assert (I'll send it formally later):
>>>>>>
>>>>>> ---------------------------------------------------------------------------
>>>>>> diff --git a/lib/cmap.c b/lib/cmap.c
>>>>>> index 35decea..5b15ecd 100644
>>>>>> --- a/lib/cmap.c
>>>>>> +++ b/lib/cmap.c
>>>>>> @@ -123,12 +123,11 @@ COVERAGE_DEFINE(cmap_shrink);
>>>>>>  /* Number of entries per bucket: 7 on 32-bit, 5 on 64-bit. */
>>>>>>  #define CMAP_K ((CACHE_LINE_SIZE - 4) / CMAP_ENTRY_SIZE)
>>>>>>
>>>>>> -/* Pad to make a bucket a full cache line in size: 4 on 32-bit, 0 on 64-bit. */
>>>>>> -#define CMAP_PADDING ((CACHE_LINE_SIZE - 4) - (CMAP_K * CMAP_ENTRY_SIZE))
>>>>>> -
>>>>>>  /* A cuckoo hash bucket.  Designed to be cache-aligned and exactly one cache
>>>>>>   * line long. */
>>>>>>  struct cmap_bucket {
>>>>>> +    /* Padding to make cmap_bucket exactly one cache line long. */
>>>>>> +    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>      /* Allows readers to track in-progress changes.  Initially zero, each
>>>>>>       * writer increments this value just before and just after each change (see
>>>>>>       * cmap_set_bucket()).  Thus, a reader can ensure that it gets a consistent
>>>>>> @@ -145,11 +144,7 @@ struct cmap_bucket {
>>>>>>       * slots. */
>>>>>>      uint32_t hashes[CMAP_K];
>>>>>>      struct cmap_node nodes[CMAP_K];
>>>>>> -
>>>>>> -    /* Padding to make cmap_bucket exactly one cache line long. */
>>>>>> -#if CMAP_PADDING > 0
>>>>>> -    uint8_t pad[CMAP_PADDING];
>>>>>> -#endif
>>>>>> +    );
>>>>>>  };
>>>>>>  BUILD_ASSERT_DECL(sizeof(struct cmap_bucket) == CACHE_LINE_SIZE);
>>>>>>
>>>>>> diff --git a/lib/util.h b/lib/util.h
>>>>>> index 3c43c2c..514fdaa 100644
>>>>>> --- a/lib/util.h
>>>>>> +++ b/lib/util.h
>>>>>> @@ -61,7 +61,7 @@ struct Bad_arg_to_ARRAY_SIZE {
>>>>>>
>>>>>>  /* This system's cache line size, in bytes.
>>>>>>   * Being wrong hurts performance but not correctness. */
>>>>>> -#define CACHE_LINE_SIZE 64
>>>>>> +#define CACHE_LINE_SIZE 128
>>>>>>  BUILD_ASSERT_DECL(IS_POW2(CACHE_LINE_SIZE));
>>>>>>
>>>>>>  /* Cacheline marking is typically done using zero-sized array.
>>>>>> ---------------------------------------------------------------------------
>>>>>>
>>>>>> Best regards, Ilya Maximets.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Most of the above issues are also true for some other padded/aligned structures like 'struct netdev_dpdk'. They will be treated separately.
>>>>>>>>
>>>>>>>> CC: Bhanuprakash Bodireddy <[email protected]>
>>>>>>>> CC: Ben Pfaff <[email protected]>
>>>>>>>> Signed-off-by: Ilya Maximets <[email protected]>
>>>>>>>> ---
>>>>>>>>  lib/dpif-netdev.c | 160 +++++++++++++++++++++++-------------------------------
>>>>>>>>  1 file changed, 69 insertions(+), 91 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>>>>>>> index 0a62630..6784269 100644
>>>>>>>> --- a/lib/dpif-netdev.c
>>>>>>>> +++ b/lib/dpif-netdev.c
>>>>>>>> @@ -547,31 +547,18 @@ struct tx_port {
>>>>>>>>   * actions in either case.
>>>>>>>>   * */
>>>>>>>>  struct dp_netdev_pmd_thread {
>>>>>>>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
>>>>>>>> -        struct dp_netdev *dp;
>>>>>>>> -        struct cmap_node node;          /* In 'dp->poll_threads'. */
>>>>>>>> -        pthread_cond_t cond;            /* For synchronizing pmd thread
>>>>>>>> -                                           reload. */
>>>>>>>> -    );
>>>>>>>> -
>>>>>>>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
>>>>>>>> -        struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
>>>>>>>> -        pthread_t thread;
>>>>>>>> -        unsigned core_id;               /* CPU core id of this pmd thread. */
>>>>>>>> -        int numa_id;                    /* numa node id of this pmd thread. */
>>>>>>>> -    );
>>>>>>>> +    struct dp_netdev *dp;
>>>>>>>> +    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed. */
>>>>>>>> +    struct cmap_node node;          /* In 'dp->poll_threads'. */
>>>>>>>> +
>>>>>>>> +    pthread_cond_t cond;            /* For synchronizing pmd thread reload. */
>>>>>>>> +    struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
>>>>>>>>
>>>>>>>>      /* Per thread exact-match cache.  Note, the instance for cpu core
>>>>>>>>       * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly
>>>>>>>>       * need to be protected by 'non_pmd_mutex'.  Every other instance
>>>>>>>>       * will only be accessed by its own pmd thread. */
>>>>>>>> -    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct emc_cache flow_cache;
>>>>>>>> -    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed. */
>>>>>>>> -
>>>>>>>> -    /* Queue id used by this pmd thread to send packets on all netdevs if
>>>>>>>> -     * XPS disabled for this netdev. All static_tx_qid's are unique and less
>>>>>>>> -     * than 'cmap_count(dp->poll_threads)'. */
>>>>>>>> -    uint32_t static_tx_qid;
>>>>>>>> +    struct emc_cache flow_cache;
>>>>>>>>
>>>>>>>>      /* Flow-Table and classifiers
>>>>>>>>       *
>>>>>>>> @@ -580,77 +567,68 @@ struct dp_netdev_pmd_thread {
>>>>>>>>       * 'flow_mutex'.
>>>>>>>>       */
>>>>>>>>      struct ovs_mutex flow_mutex;
>>>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>>> -        struct cmap flow_table OVS_GUARDED; /* Flow table. */
>>>>>>>> -
>>>>>>>> -        /* One classifier per in_port polled by the pmd */
>>>>>>>> -        struct cmap classifiers;
>>>>>>>> -        /* Periodically sort subtable vectors according to hit frequencies */
>>>>>>>> -        long long int next_optimization;
>>>>>>>> -        /* End of the next time interval for which processing cycles
>>>>>>>> -           are stored for each polled rxq. */
>>>>>>>> -        long long int rxq_next_cycle_store;
>>>>>>>> -
>>>>>>>> -        /* Cycles counters */
>>>>>>>> -        struct dp_netdev_pmd_cycles cycles;
>>>>>>>> -
>>>>>>>> -        /* Used to count cycles. See 'cycles_counter_end()'. */
>>>>>>>> -        unsigned long long last_cycles;
>>>>>>>> -        struct latch exit_latch;        /* For terminating the pmd thread. */
>>>>>>>> -    );
>>>>>>>> -
>>>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>>> -        /* Statistics. */
>>>>>>>> -        struct dp_netdev_pmd_stats stats;
>>>>>>>> -
>>>>>>>> -        struct seq *reload_seq;
>>>>>>>> -        uint64_t last_reload_seq;
>>>>>>>> -        atomic_bool reload;             /* Do we need to reload ports? */
>>>>>>>> -        bool isolated;
>>>>>>>> -
>>>>>>>> -        /* Set to true if the pmd thread needs to be reloaded. */
>>>>>>>> -        bool need_reload;
>>>>>>>> -        /* 5 pad bytes. */
>>>>>>>> -    );
>>>>>>>> -
>>>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>>> -        struct ovs_mutex port_mutex;    /* Mutex for 'poll_list'
>>>>>>>> -                                           and 'tx_ports'. */
>>>>>>>> -        /* 16 pad bytes. */
>>>>>>>> -    );
>>>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>>> -        /* List of rx queues to poll. */
>>>>>>>> -        struct hmap poll_list OVS_GUARDED;
>>>>>>>> -        /* Map of 'tx_port's used for transmission.  Written by the main
>>>>>>>> -         * thread, read by the pmd thread. */
>>>>>>>> -        struct hmap tx_ports OVS_GUARDED;
>>>>>>>> -    );
>>>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>>> -        /* These are thread-local copies of 'tx_ports'. One contains only
>>>>>>>> -         * tunnel ports (that support push_tunnel/pop_tunnel), the other
>>>>>>>> -         * contains ports with at least one txq (that support send).
>>>>>>>> -         * A port can be in both.
>>>>>>>> -         *
>>>>>>>> -         * There are two separate maps to make sure that we don't try to
>>>>>>>> -         * execute OUTPUT on a device which has 0 txqs or PUSH/POP on a
>>>>>>>> -         * non-tunnel device.
>>>>>>>> -         *
>>>>>>>> -         * The instances for cpu core NON_PMD_CORE_ID can be accessed by
>>>>>>>> -         * multiple threads and thusly need to be protected by 'non_pmd_mutex'.
>>>>>>>> -         * Every other instance will only be accessed by its own pmd thread. */
>>>>>>>> -        struct hmap tnl_port_cache;
>>>>>>>> -        struct hmap send_port_cache;
>>>>>>>> -    );
>>>>>>>> -
>>>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>>> -        /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>>>>>>>> -         * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>>>>>>>> -         * values and subtracts them from 'stats' and 'cycles' before
>>>>>>>> -         * reporting to the user */
>>>>>>>> -        unsigned long long stats_zero[DP_N_STATS];
>>>>>>>> -        uint64_t cycles_zero[PMD_N_CYCLES];
>>>>>>>> -        /* 8 pad bytes. */
>>>>>>>> -    );
>>>>>>>> +    struct cmap flow_table OVS_GUARDED; /* Flow table. */
>>>>>>>> +
>>>>>>>> +    /* One classifier per in_port polled by the pmd */
>>>>>>>> +    struct cmap classifiers;
>>>>>>>> +    /* Periodically sort subtable vectors according to hit frequencies */
>>>>>>>> +    long long int next_optimization;
>>>>>>>> +    /* End of the next time interval for which processing cycles
>>>>>>>> +       are stored for each polled rxq. */
>>>>>>>> +    long long int rxq_next_cycle_store;
>>>>>>>> +
>>>>>>>> +    /* Statistics. */
>>>>>>>> +    struct dp_netdev_pmd_stats stats;
>>>>>>>> +
>>>>>>>> +    /* Cycles counters */
>>>>>>>> +    struct dp_netdev_pmd_cycles cycles;
>>>>>>>> +
>>>>>>>> +    /* Used to count cicles. See 'cycles_counter_end()' */
>>>>>>>> +    unsigned long long last_cycles;
>>>>>>>> +
>>>>>>>> +    struct latch exit_latch;        /* For terminating the pmd thread. */
>>>>>>>> +    struct seq *reload_seq;
>>>>>>>> +    uint64_t last_reload_seq;
>>>>>>>> +    atomic_bool reload;             /* Do we need to reload ports? */
>>>>>>>> +    pthread_t thread;
>>>>>>>> +    unsigned core_id;               /* CPU core id of this pmd thread. */
>>>>>>>> +    int numa_id;                    /* numa node id of this pmd thread. */
>>>>>>>> +    bool isolated;
>>>>>>>> +
>>>>>>>> +    /* Queue id used by this pmd thread to send packets on all netdevs if
>>>>>>>> +     * XPS disabled for this netdev. All static_tx_qid's are unique and less
>>>>>>>> +     * than 'cmap_count(dp->poll_threads)'. */
>>>>>>>> +    uint32_t static_tx_qid;
>>>>>>>> +
>>>>>>>> +    struct ovs_mutex port_mutex;    /* Mutex for 'poll_list' and 'tx_ports'. */
>>>>>>>> +    /* List of rx queues to poll. */
>>>>>>>> +    struct hmap poll_list OVS_GUARDED;
>>>>>>>> +    /* Map of 'tx_port's used for transmission.  Written by the main thread,
>>>>>>>> +     * read by the pmd thread. */
>>>>>>>> +    struct hmap tx_ports OVS_GUARDED;
>>>>>>>> +
>>>>>>>> +    /* These are thread-local copies of 'tx_ports'. One contains only tunnel
>>>>>>>> +     * ports (that support push_tunnel/pop_tunnel), the other contains ports
>>>>>>>> +     * with at least one txq (that support send). A port can be in both.
>>>>>>>> +     *
>>>>>>>> +     * There are two separate maps to make sure that we don't try to execute
>>>>>>>> +     * OUTPUT on a device which has 0 txqs or PUSH/POP on a non-tunnel device.
>>>>>>>> +     *
>>>>>>>> +     * The instances for cpu core NON_PMD_CORE_ID can be accessed by multiple
>>>>>>>> +     * threads, and thusly need to be protected by 'non_pmd_mutex'.  Every
>>>>>>>> +     * other instance will only be accessed by its own pmd thread. */
>>>>>>>> +    struct hmap tnl_port_cache;
>>>>>>>> +    struct hmap send_port_cache;
>>>>>>>> +
>>>>>>>> +    /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>>>>>>>> +     * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>>>>>>>> +     * values and subtracts them from 'stats' and 'cycles' before
>>>>>>>> +     * reporting to the user */
>>>>>>>> +    unsigned long long stats_zero[DP_N_STATS];
>>>>>>>> +    uint64_t cycles_zero[PMD_N_CYCLES];
>>>>>>>> +
>>>>>>>> +    /* Set to true if the pmd thread needs to be reloaded. */
>>>>>>>> +    bool need_reload;
>>>>>>>>  };
>>>>>>>>
>>>>>>>>  /* Interface to netdev-based datapath. */
>>>>>>>> --
>>>>>>>> 2.7.4
