On 29.11.2017 00:40, Bodireddy, Bhanuprakash wrote:
>> On 27.11.2017 20:02, Bodireddy, Bhanuprakash wrote:
>>>> I agree with Ilya here. Adding these cache line markers and
>>>> re-grouping variables to minimize gaps in cache lines is creating a
>>>> maintenance burden without any tangible benefit. I have had to go
>>>> through the pain of refactoring my PMD Performance Metrics patch to
>>>> the new dp_netdev_pmd_thread struct and spent a lot of time
>>>> analyzing the actual memory layout with GDB and playing Tetris with
>>>> the variables.
>>>
>>> Analyzing the memory layout with gdb for large structures is time
>>> consuming and not usually recommended. I would suggest using
>>> Poke-a-hole (pahole), which helps to understand and fix the
>>> structures in no time. With pahole it's going to be a lot easier to
>>> work with large structures especially.
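>>>
>>> For example, running something like 'pahole -C dp_netdev_pmd_thread
>>> vswitchd/ovs-vswitchd' on a binary built with debug info prints the
>>> member offsets, holes and cacheline boundaries of just that structure
>>> (the binary path here is only illustrative).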
>>
>> Interesting tool, but it doesn't seem to work perfectly. I see
>> duplicated unions and zero-length arrays in the output and I still
>> need to check sizes by hand. And it fails when I try to run it on my
>> Ubuntu 16.04 LTS on x86.
>> IMHO, the code should be simple enough that you don't need external
>> utilities to check a single structure.
> 
> Pahole has been around for a while, is available in most distributions 
> and works reliably on RHEL distros.
> I am on Fedora, built pahole from source, and it displays all the sizes 
> and cacheline boundaries.
> 
>>
>>>>
>>>> There will never be more than a handful of PMDs, so minimizing the
>>>> gaps does not matter from a memory perspective. And whether the
>>>> individual members occupy 4 or 5 cache lines does not matter either
>>>> compared to the many hundred cache lines touched for EMC and DPCLS
>>>> lookups of an Rx batch. And any optimization done for x86 is not
>>>> necessarily optimal for other architectures.
>>>
>>> I agree that an optimization targeted at x86 doesn't necessarily suit
>>> ARM due to its different cache line size.
>>>
>>>>
>>>> Finally, even for x86 there is not even a performance improvement. I
>>>> re-ran our standard L3VPN over VXLAN performance PVP test on master
>>>> and with Ilya's revert patch:
>>>>
>>>> Flows   master  reverted
>>>> 8,      4.46    4.48
>>>> 100,    4.27    4.29
>>>> 1000,   4.07    4.07
>>>> 2000,   3.68    3.68
>>>> 5000,   3.03    3.03
>>>> 10000,  2.76    2.77
>>>> 20000,  2.64    2.65
>>>> 50000,  2.60    2.61
>>>> 100000, 2.60    2.61
>>>> 500000, 2.60    2.61
>>>
>>> What are the CFLAGS in this case, as they seem to make a difference?
>>> I have added my findings here for a different patch targeted at
>>> performance:
>>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341270.html
>>
>> Do you have any performance results that show a significant performance
>> difference between the above cases? Please describe your test scenario
>> and environment so we'll be able to see that padding/alignment is really
>> needed here. I have seen no such results yet.
>>
>> BTW, in one place you're saying that the patch was not about performance,
>> while at the same time you're trying to show that it has some positive
>> performance impact. I'm a bit confused by that.
> 
> Giving a bit more context here, I have been experimenting with
> *prefetching* in OvS, as prefetching isn't used except in two places
> (emc_processing and cmaps). This work is aimed at checking the
> performance benefits of prefetching not just on Haswell but also on
> newer ranges of processors.
> 
> The best way to prefetch a part of a structure is to mark that portion
> of it, which isn't possible unless we have some kind of cache line
> marking. This is what my patch initially did; we can then prefetch a
> portion of the structure based on the cacheline markers. You can find
> an example in the pkt_metadata struct, as sketched below.
> 
> My point is that on x86, with cache line marking and with the
> xzalloc_cacheline() API, one shouldn't see a drop in performance, if
> not an improvement. But the real improvements would be seen when the
> prefetching is done at the right places, and that's WIP.
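> 
> A minimal sketch of that idea, modeled on pkt_metadata and
> pkt_metadata_prefetch_init(); the struct, its members and the helper
> below are purely illustrative, only the macros/defines from lib/util.h
> (PADDED_MEMBERS_CACHELINE_MARKER, OVS_PREFETCH, CACHE_LINE_SIZE) are
> the existing ones:
> 
>     #include <stdint.h>
>     #include "util.h"
> 
>     struct example_pmd {
>         /* Hot members, grouped into the first cache line and named by
>          * the 'cacheline0' marker so they can be prefetched together. */
>         PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
>             uint64_t rx_packets;        /* Touched on every batch. */
>             void *current_rxq;
>         );
>         /* Colder members pushed to the next cache line. */
>         PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
>             uint64_t stats[16];         /* Read only when reporting. */
>         );
>     };
> 
>     /* Warm only the first cache line before the hot path touches it,
>      * the same way pkt_metadata_prefetch_init() prefetches
>      * pkt_metadata->cacheline0. */
>     static inline void
>     example_pmd_prefetch(struct example_pmd *p)
>     {
>         OVS_PREFETCH(p->cacheline0);
>     }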

Are there any preliminary results for this work? Any significant performance
improvement from prefetching the 'dp_netdev_pmd_thread' structure?
Otherwise I see no reason to keep this padding, which is needed only for
experiments and complicates normal development.

> 
> Bhanuprakash.
> 
>>
>>>
>>> Patches to consider when testing your use case:
>>>      Xzalloc_cacheline: https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341231.html
>>>      (If using output batching): https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341230.html
>>>
>>> - Bhanuprakash.
>>>
>>>>
>>>> All in all, I support reverting this change.
>>>>
>>>> Regards, Jan
>>>>
>>>> Acked-by: Jan Scheurich <[email protected]>
>>>>
>>>>> -----Original Message-----
>>>>> From: [email protected]
>>>>> [mailto:[email protected]] On Behalf Of Bodireddy,
>>>>> Bhanuprakash
>>>>> Sent: Friday, 24 November, 2017 17:09
>>>>> To: Ilya Maximets <[email protected]>; [email protected];
>>>>> Ben Pfaff <[email protected]>
>>>>> Cc: Heetae Ahn <[email protected]>
>>>>> Subject: Re: [ovs-dev] [PATCH] Revert "dpif_netdev: Refactor
>>>>> dp_netdev_pmd_thread structure."
>>>>>
>>>>>> On 22.11.2017 20:14, Bodireddy, Bhanuprakash wrote:
>>>>>>>> This reverts commit a807c15796ddc43ba1ffb2a6b0bd2ad4e2b73941.
>>>>>>>>
>>>>>>>> Padding and aligning of dp_netdev_pmd_thread structure members is
>>>>>>>> useless, broken in several ways, and only greatly degrades
>>>>>>>> maintainability and extensibility of the structure.
>>>>>>>
>>>>>>> The idea of my earlier patch was to mark the cache lines and
>>>>>>> reduce the holes while still maintaining the grouping of related
>>>>>>> members in this structure.
>>>>>>
>>>>>> Some of the grouping aspects look strange. For example, it looks
>>>>>> illogical that 'exit_latch' is grouped with 'flow_table' but not
>>>>>> with 'reload_seq' and other reload-related stuff. It looks strange
>>>>>> that statistics and counters are spread across different groups.
>>>>>> So, IMHO, it's not well grouped.
>>>>>
>>>>> I had to strike a fine balance, and some members may be placed in a
>>>>> different group due to their sizes and importance. Let me think if I
>>>>> can make it better.
>>>>>
>>>>>>
>>>>>>> Also, cache line marking is a good practice to make someone extra
>>>>>>> cautious when extending or editing important structures.
>>>>>>> Most importantly, I was experimenting with prefetching on this
>>>>>>> structure and needed cache line markers for it.
>>>>>>>
>>>>>>> I see that you are on ARM (I don't have HW to test) and want to
>>>>>>> know if this commit has any negative effect; any numbers would be
>>>>>>> appreciated.
>>>>>>
>>>>>> Basic VM-VM testing shows a stable 0.5% performance improvement
>>>>>> with the revert applied.
>>>>>
>>>>> I did P2P, PVP and PVVP with IXIA and haven't noticed any drop on X86.
>>>>>
>>>>>> Padding adds 560 additional bytes of holes.
>>>>> As the cache line on ARM is 128 bytes, it created holes; I can find
>>>>> a workaround to handle this.
>>>>>
>>>>>>
>>>>>>> More comments inline.
>>>>>>>
>>>>>>>>
>>>>>>>> Issues:
>>>>>>>>
>>>>>>>>    1. It's not working because all the instances of struct
>>>>>>>>       dp_netdev_pmd_thread are allocated only by plain malloc. The
>>>>>>>>       memory is not aligned to cachelines -> the structure almost never
>>>>>>>>       starts at an aligned memory address. This means that any further
>>>>>>>>       paddings and alignments inside the structure are completely
>>>>>>>>       useless. For example:
>>>>>>>>
>>>>>>>>       Breakpoint 1, pmd_thread_main
>>>>>>>>       (gdb) p pmd
>>>>>>>>       $49 = (struct dp_netdev_pmd_thread *) 0x1b1af20
>>>>>>>>       (gdb) p &pmd->cacheline1
>>>>>>>>       $51 = (OVS_CACHE_LINE_MARKER *) 0x1b1af60
>>>>>>>>       (gdb) p &pmd->cacheline0
>>>>>>>>       $52 = (OVS_CACHE_LINE_MARKER *) 0x1b1af20
>>>>>>>>       (gdb) p &pmd->flow_cache
>>>>>>>>       $53 = (struct emc_cache *) 0x1b1afe0
>>>>>>>>
>>>>>>>>       All of the above addresses are shifted from the cacheline start by 32B.
>>>>>>>
>>>>>>> As you can see below, all the addresses are 64-byte aligned.
>>>>>>>
>>>>>>> (gdb) p pmd
>>>>>>> $1 = (struct dp_netdev_pmd_thread *) 0x7fc1e9b1a040
>>>>>>> (gdb) p &pmd->cacheline0
>>>>>>> $2 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a040
>>>>>>> (gdb) p &pmd->cacheline1
>>>>>>> $3 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a080
>>>>>>> (gdb) p &pmd->flow_cache
>>>>>>> $4 = (struct emc_cache *) 0x7fc1e9b1a0c0
>>>>>>> (gdb) p &pmd->flow_table
>>>>>>> $5 = (struct cmap *) 0x7fc1e9fba100
>>>>>>> (gdb) p &pmd->stats
>>>>>>> $6 = (struct dp_netdev_pmd_stats *) 0x7fc1e9fba140
>>>>>>> (gdb) p &pmd->port_mutex
>>>>>>> $7 = (struct ovs_mutex *) 0x7fc1e9fba180
>>>>>>> (gdb) p &pmd->poll_list
>>>>>>> $8 = (struct hmap *) 0x7fc1e9fba1c0
>>>>>>> (gdb) p &pmd->tnl_port_cache
>>>>>>> $9 = (struct hmap *) 0x7fc1e9fba200
>>>>>>> (gdb) p &pmd->stats_zero
>>>>>>> $10 = (unsigned long long (*)[5]) 0x7fc1e9fba240
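>>>>>>>
>>>>>>> The same check can also be done in code instead of gdb; a sketch
>>>>>>> only (the helper name and its placement are illustrative;
>>>>>>> ovs_assert() and CACHE_LINE_SIZE come from lib/util.h):
>>>>>>>
>>>>>>>     #include <stdint.h>
>>>>>>>     #include "util.h"
>>>>>>>
>>>>>>>     static void
>>>>>>>     check_pmd_alignment(const struct dp_netdev_pmd_thread *pmd)
>>>>>>>     {
>>>>>>>         /* Abort if the structure does not start on a cache line
>>>>>>>          * boundary, i.e. the in-struct padding cannot line up with
>>>>>>>          * hardware cache lines. */
>>>>>>>         ovs_assert((uintptr_t) pmd % CACHE_LINE_SIZE == 0);
>>>>>>>     }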
>>>>>>>
>>>>>>> I tried using xzalloc_cacheline instead of the default xzalloc() here.
>>>>>>> I tried tens of times and always found that the address is
>>>>>>> 64-byte aligned, so it should start at the beginning of a cache line
>>>>>>> on x86. Not sure why the comment "(The memory returned will not be at
>>>>>>> the start of a cache line, though, so don't assume such alignment.)"
>>>>>>> says otherwise?
>>>>>>
>>>>>> Yes, you will always get aligned addresses on your x86 Linux
>>>>>> system that supports the posix_memalign() call. The comment says
>>>>>> what it says because the function falls back to memory allocation
>>>>>> tricks in case posix_memalign() is not available (Windows, some
>>>>>> MacOS, maybe some Linux systems (not sure)), and the address will
>>>>>> not be aligned in this case.
>>>>>
>>>>> I also verified the other case, when posix_memalign isn't available,
>>>>> and even then it returns an address aligned on a CACHE_LINE_SIZE
>>>>> boundary. I will send out a patch to use xzalloc_cacheline for
>>>>> allocating the memory, roughly along the lines sketched below.
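>>>>>
>>>>> (A sketch only, not the actual diff; xzalloc_cacheline() and
>>>>> free_cacheline() are the existing helpers from lib/util.h.)
>>>>>
>>>>>     /* Allocate the pmd structure starting on a cache line boundary
>>>>>      * (where posix_memalign() is available), so that the cacheline
>>>>>      * markers and padding line up with hardware cache lines. */
>>>>>     struct dp_netdev_pmd_thread *pmd = xzalloc_cacheline(sizeof *pmd);
>>>>>
>>>>>     /* ... configure and use the pmd thread ... */
>>>>>
>>>>>     /* Memory from xzalloc_cacheline() must be released with
>>>>>      * free_cacheline(), not plain free(). */
>>>>>     free_cacheline(pmd);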
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>       Can we fix it properly? NO.
>>>>>>>>       OVS currently doesn't have an appropriate API to allocate aligned
>>>>>>>>       memory. The best candidate is 'xmalloc_cacheline()' but it
>>>>>>>>       clearly states that "The memory returned will not be at the
>>>>>>>>       start of a cache line, though, so don't assume such alignment".
>>>>>>>>       And also, this function will never return aligned memory on
>>>>>>>>       Windows or MacOS.
>>>>>>>>
>>>>>>>>    2. CACHE_LINE_SIZE is not constant. Different architectures have
>>>>>>>>       different cache line sizes, but the code assumes that
>>>>>>>>       CACHE_LINE_SIZE is always equal to 64 bytes. All the structure
>>>>>>>>       members are grouped by 64 bytes and padded to CACHE_LINE_SIZE.
>>>>>>>>       This leads to huge holes in the structures if CACHE_LINE_SIZE
>>>>>>>>       differs from 64. This is the opposite of portability. If I want
>>>>>>>>       good performance of cmap I need to have CACHE_LINE_SIZE equal
>>>>>>>>       to the real cache line size, but I will have huge holes in the
>>>>>>>>       structures. If you take a look at struct rte_mbuf from DPDK
>>>>>>>>       you'll see that it uses 2 defines: RTE_CACHE_LINE_SIZE and
>>>>>>>>       RTE_CACHE_LINE_MIN_SIZE to avoid holes in the mbuf structure.
>>>>>>>
>>>>>>> I understand that ARM and a few other processors (like OCTEON)
>>>>>>> have 128-byte cache lines.
>>>>>>> But again, I'm curious about the performance impact in your case
>>>>>>> with this new alignment.
>>>>>>>
>>>>>>>>
>>>>>>>>    3. Sizes of system/libc defined types are not constant for all the
>>>>>>>>       systems. For example, sizeof(pthread_mutex_t) == 48 on my
>>>>>>>>       ARMv8 machine, but only 40 on x86. The difference could be
>>>>>>>>       much bigger on Windows or MacOS systems. But the code assumes
>>>>>>>>       that sizeof(struct ovs_mutex) is always 48 bytes. This may lead
>>>>>>>>       to broken alignment/big holes in case of padding/wrong comments
>>>>>>>>       about the amount of free pad bytes.
>>>>>>>
>>>>>>> This isn't really an issue, as you yourself mention; it's more an
>>>>>>> issue with the comment that states the number of pad bytes.
>>>>>>> In the case of ARM it would be just 8 pad bytes instead of 16 on x86.
>>>>>>>         union {
>>>>>>>                 struct {
>>>>>>>                         struct ovs_mutex port_mutex;     /* 4849984    48 */
>>>>>>>                 };                                       /*          48 */
>>>>>>>                 uint8_t            pad13[64];            /*          64 */
>>>>>>>         };
>>>>>>>
>>>>>>
>>>>>> It's not only about 'port_mutex'. If you take a look at
>>>>>> 'flow_mutex', you will see that it is not even padded. So,
>>>>>> increasing the size of 'flow_mutex' shifts all the following
>>>>>> padded blocks, and no other block will be properly aligned even if
>>>>>> the structure is allocated on aligned memory.
>>>>>>
>>>>>>>>
>>>>>>>>    4. Sizes of many fields in the structure depend on defines like
>>>>>>>>       DP_N_STATS, PMD_N_CYCLES, EM_FLOW_HASH_ENTRIES and so on.
>>>>>>>>       Any change in these defines, or any change in any structure
>>>>>>>>       contained by the thread structure, leads to a not-so-simple
>>>>>>>>       refactoring of the whole dp_netdev_pmd_thread structure. This
>>>>>>>>       greatly reduces maintainability and complicates development of
>>>>>>>>       new features.
>>>>>>>
>>>>>>> I don't think it complicates development; instead I feel the
>>>>>>> commit gives a clear indication to the developer that the members
>>>>>>> are grouped, aligned and marked with cacheline markers.
>>>>>>> This makes the developer extra cautious when adding new members so
>>>>>>> that holes can be avoided.
>>>>>>
>>>>>> While rebasing the output batching patch-set I figured out that
>>>>>> I need to remove 'unsigned long long last_cycles' and add 'struct
>>>>>> dp_netdev_pmd_thread_ctx ctx', which is 8 bytes larger. Could you,
>>>>>> please, suggest where I should place that new structure member and
>>>>>> what to do with the hole from 'last_cycles'?
>>>>>>
>>>>>> This is not a trivial question, because the already poor grouping
>>>>>> will almost certainly become worse either way.
>>>>>
>>>>> Aah, I realized now that the batching series doesn't apply cleanly on
>>>>> master. Let me check this and I will send across the changes that
>>>>> should fix this.
>>>>>
>>>>> - Bhanuprakash
>>>>>
>>>>>>>
>>>>>>> Cacheline-marking the structure is a good practice, and I am sure
>>>>>>> this structure is significant and should be carefully extended in
>>>>>>> the future.
>>>>>>
>>>>>> Not so sure about that.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>    5. There is no reason to align the flow_cache member because it's
>>>>>>>>       too big and we usually access random entries from a single
>>>>>>>>       thread only.
>>>>>>>>
>>>>>>>
>>>>>>> I see your point. This patch wasn't done for performance but
>>>>>>> rather to bring some order to this ever-growing structure.
>>>>>>> During testing I found that for some test cases aligning the
>>>>>>> flow_cache was consistently giving me a 100k+ improvement, and so
>>>>>>> it was added.
>>>>>>
>>>>>> This was a random performance boost. You achieved it without
>>>>>> aligned memory allocation; just luck with your system environment.
>>>>>> Using xzalloc_cacheline will likely eliminate this performance
>>>>>> difference or even degrade the performance.
>>>>>>
>>>>>>>
>>>>>>>> So, the padding/alignment only creates the appearance of a
>>>>>>>> performance optimization but does nothing useful in reality. It
>>>>>>>> only complicates maintenance and adds huge holes for non-x86
>>>>>>>> architectures and non-Linux systems. The performance improvement
>>>>>>>> stated in the original commit message is likely random and not
>>>>>>>> meaningful. I see no performance difference.
>>>>>>>
>>>>>>> I understand that this is causing issues on architectures with
>>>>>>> different cache line sizes, but unfortunately the majority of them
>>>>>>> have 64-byte cache lines, so this change makes sense.
>>>>>>
>>>>>> I understand that 64-byte cache lines are much more widespread. I
>>>>>> also have x86 as a target arch, but still, IMHO, OVS is a
>>>>>> cross-platform application and it should not have platform-dependent
>>>>>> stuff which favors one architecture/platform and worsens others.
>>>>>>
>>>>>>>
>>>>>>> If you have performance data to prove that this causes severe perf
>>>>>>> degradation, I can think of workarounds for ARM.
>>>>>>>
>>>>>>> - Bhanuprakash.
>>>>>>
>>>>>>
>>>>>> P.S.: If you want to test with CACHE_LINE_SIZE=128 you will have
>>>>>> to apply the following patch to avoid a build-time assert (I'll
>>>>>> send it formally later):
>>>>>>
>>>>>> -----------------------------------------------------------------------------
>>>>>> diff --git a/lib/cmap.c b/lib/cmap.c
>>>>>> index 35decea..5b15ecd 100644
>>>>>> --- a/lib/cmap.c
>>>>>> +++ b/lib/cmap.c
>>>>>> @@ -123,12 +123,11 @@ COVERAGE_DEFINE(cmap_shrink);
>>>>>> /* Number of entries per bucket: 7 on 32-bit, 5 on 64-bit. */
>>>>>> #define CMAP_K ((CACHE_LINE_SIZE - 4) / CMAP_ENTRY_SIZE)
>>>>>>
>>>>>> -/* Pad to make a bucket a full cache line in size: 4 on 32-bit, 0 on 64-bit. */
>>>>>> -#define CMAP_PADDING ((CACHE_LINE_SIZE - 4) - (CMAP_K * CMAP_ENTRY_SIZE))
>>>>>> -
>>>>>> /* A cuckoo hash bucket.  Designed to be cache-aligned and exactly one cache
>>>>>>  * line long. */
>>>>>> struct cmap_bucket {
>>>>>> +    /* Padding to make cmap_bucket exactly one cache line long. */
>>>>>> +    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>     /* Allows readers to track in-progress changes.  Initially zero, each
>>>>>>      * writer increments this value just before and just after each change (see
>>>>>>      * cmap_set_bucket()).  Thus, a reader can ensure that it gets a consistent
>>>>>> @@ -145,11 +144,7 @@ struct cmap_bucket {
>>>>>>      * slots. */
>>>>>>     uint32_t hashes[CMAP_K];
>>>>>>     struct cmap_node nodes[CMAP_K];
>>>>>> -
>>>>>> -    /* Padding to make cmap_bucket exactly one cache line long. */
>>>>>> -#if CMAP_PADDING > 0
>>>>>> -    uint8_t pad[CMAP_PADDING];
>>>>>> -#endif
>>>>>> +    );
>>>>>> };
>>>>>> BUILD_ASSERT_DECL(sizeof(struct cmap_bucket) == CACHE_LINE_SIZE);
>>>>>>
>>>>>> diff --git a/lib/util.h b/lib/util.h
>>>>>> index 3c43c2c..514fdaa 100644
>>>>>> --- a/lib/util.h
>>>>>> +++ b/lib/util.h
>>>>>> @@ -61,7 +61,7 @@ struct Bad_arg_to_ARRAY_SIZE {
>>>>>>
>>>>>> /* This system's cache line size, in bytes.
>>>>>>  * Being wrong hurts performance but not correctness. */
>>>>>> -#define CACHE_LINE_SIZE 64
>>>>>> +#define CACHE_LINE_SIZE 128
>>>>>> BUILD_ASSERT_DECL(IS_POW2(CACHE_LINE_SIZE));
>>>>>>
>>>>>> /* Cacheline marking is typically done using zero-sized array.
>>>>>> -----------------------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>> Best regards, Ilya Maximets.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Most of the above issues are also true for some other
>>>>>>>> padded/aligned structures like 'struct netdev_dpdk'. They will be
>>>>>>>> treated separately.
>>>>>>>>
>>>>>>>> CC: Bhanuprakash Bodireddy <[email protected]>
>>>>>>>> CC: Ben Pfaff <[email protected]>
>>>>>>>> Signed-off-by: Ilya Maximets <[email protected]>
>>>>>>>> ---
>>>>>>>> lib/dpif-netdev.c | 160 +++++++++++++++++++++++-------------------------------
>>>>>>>> 1 file changed, 69 insertions(+), 91 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>>>>>>> index 0a62630..6784269 100644
>>>>>>>> --- a/lib/dpif-netdev.c
>>>>>>>> +++ b/lib/dpif-netdev.c
>>>>>>>> @@ -547,31 +547,18 @@ struct tx_port {
>>>>>>>>  * actions in either case.
>>>>>>>>  * */
>>>>>>>> struct dp_netdev_pmd_thread {
>>>>>>>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
>>>>>>>> -        struct dp_netdev *dp;
>>>>>>>> -        struct cmap_node node;          /* In 'dp->poll_threads'. */
>>>>>>>> -        pthread_cond_t cond;            /* For synchronizing pmd thread
>>>>>>>> -                                           reload. */
>>>>>>>> -    );
>>>>>>>> -
>>>>>>>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
>>>>>>>> -        struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
>>>>>>>> -        pthread_t thread;
>>>>>>>> -        unsigned core_id;               /* CPU core id of this pmd thread. */
>>>>>>>> -        int numa_id;                    /* numa node id of this pmd thread. */
>>>>>>>> -    );
>>>>>>>> +    struct dp_netdev *dp;
>>>>>>>> +    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed. */
>>>>>>>> +    struct cmap_node node;          /* In 'dp->poll_threads'. */
>>>>>>>> +
>>>>>>>> +    pthread_cond_t cond;            /* For synchronizing pmd thread reload. */
>>>>>>>> +    struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
>>>>>>>>
>>>>>>>>     /* Per thread exact-match cache.  Note, the instance for cpu core
>>>>>>>>      * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly
>>>>>>>>      * need to be protected by 'non_pmd_mutex'.  Every other instance
>>>>>>>>      * will only be accessed by its own pmd thread. */
>>>>>>>> -    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct emc_cache flow_cache;
>>>>>>>> -    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed. */
>>>>>>>> -
>>>>>>>> -    /* Queue id used by this pmd thread to send packets on all netdevs if
>>>>>>>> -     * XPS disabled for this netdev. All static_tx_qid's are unique and less
>>>>>>>> -     * than 'cmap_count(dp->poll_threads)'. */
>>>>>>>> -    uint32_t static_tx_qid;
>>>>>>>> +    struct emc_cache flow_cache;
>>>>>>>>
>>>>>>>>     /* Flow-Table and classifiers
>>>>>>>>      *
>>>>>>>> @@ -580,77 +567,68 @@ struct dp_netdev_pmd_thread {
>>>>>>>>      * 'flow_mutex'.
>>>>>>>>      */
>>>>>>>>     struct ovs_mutex flow_mutex;
>>>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>>> -        struct cmap flow_table OVS_GUARDED; /* Flow table. */
>>>>>>>> -
>>>>>>>> -        /* One classifier per in_port polled by the pmd */
>>>>>>>> -        struct cmap classifiers;
>>>>>>>> -        /* Periodically sort subtable vectors according to hit frequencies */
>>>>>>>> -        long long int next_optimization;
>>>>>>>> -        /* End of the next time interval for which processing cycles
>>>>>>>> -           are stored for each polled rxq. */
>>>>>>>> -        long long int rxq_next_cycle_store;
>>>>>>>> -
>>>>>>>> -        /* Cycles counters */
>>>>>>>> -        struct dp_netdev_pmd_cycles cycles;
>>>>>>>> -
>>>>>>>> -        /* Used to count cycles. See 'cycles_counter_end()'. */
>>>>>>>> -        unsigned long long last_cycles;
>>>>>>>> -        struct latch exit_latch;        /* For terminating the pmd thread. */
>>>>>>>> -    );
>>>>>>>> -
>>>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>>> -        /* Statistics. */
>>>>>>>> -        struct dp_netdev_pmd_stats stats;
>>>>>>>> -
>>>>>>>> -        struct seq *reload_seq;
>>>>>>>> -        uint64_t last_reload_seq;
>>>>>>>> -        atomic_bool reload;             /* Do we need to reload ports? */
>>>>>>>> -        bool isolated;
>>>>>>>> -
>>>>>>>> -        /* Set to true if the pmd thread needs to be reloaded. */
>>>>>>>> -        bool need_reload;
>>>>>>>> -        /* 5 pad bytes. */
>>>>>>>> -    );
>>>>>>>> -
>>>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>>> -        struct ovs_mutex port_mutex;    /* Mutex for 'poll_list'
>>>>>>>> -                                           and 'tx_ports'. */
>>>>>>>> -        /* 16 pad bytes. */
>>>>>>>> -    );
>>>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>>> -        /* List of rx queues to poll. */
>>>>>>>> -        struct hmap poll_list OVS_GUARDED;
>>>>>>>> -        /* Map of 'tx_port's used for transmission.  Written by the main
>>>>>>>> -         * thread, read by the pmd thread. */
>>>>>>>> -        struct hmap tx_ports OVS_GUARDED;
>>>>>>>> -    );
>>>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>>> -        /* These are thread-local copies of 'tx_ports'.  One contains only
>>>>>>>> -         * tunnel ports (that support push_tunnel/pop_tunnel), the other
>>>>>>>> -         * contains ports with at least one txq (that support send).
>>>>>>>> -         * A port can be in both.
>>>>>>>> -         *
>>>>>>>> -         * There are two separate maps to make sure that we don't try to
>>>>>>>> -         * execute OUTPUT on a device which has 0 txqs or PUSH/POP on a
>>>>>>>> -         * non-tunnel device.
>>>>>>>> -         *
>>>>>>>> -         * The instances for cpu core NON_PMD_CORE_ID can be accessed by
>>>>>>>> -         * multiple threads and thusly need to be protected by 'non_pmd_mutex'.
>>>>>>>> -         * Every other instance will only be accessed by its own pmd thread. */
>>>>>>>> -        struct hmap tnl_port_cache;
>>>>>>>> -        struct hmap send_port_cache;
>>>>>>>> -    );
>>>>>>>> -
>>>>>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>>>>>> -        /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>>>>>>>> -         * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>>>>>>>> -         * values and subtracts them from 'stats' and 'cycles' before
>>>>>>>> -         * reporting to the user */
>>>>>>>> -        unsigned long long stats_zero[DP_N_STATS];
>>>>>>>> -        uint64_t cycles_zero[PMD_N_CYCLES];
>>>>>>>> -        /* 8 pad bytes. */
>>>>>>>> -    );
>>>>>>>> +    struct cmap flow_table OVS_GUARDED; /* Flow table. */
>>>>>>>> +
>>>>>>>> +    /* One classifier per in_port polled by the pmd */
>>>>>>>> +    struct cmap classifiers;
>>>>>>>> +    /* Periodically sort subtable vectors according to hit frequencies */
>>>>>>>> +    long long int next_optimization;
>>>>>>>> +    /* End of the next time interval for which processing cycles
>>>>>>>> +       are stored for each polled rxq. */
>>>>>>>> +    long long int rxq_next_cycle_store;
>>>>>>>> +
>>>>>>>> +    /* Statistics. */
>>>>>>>> +    struct dp_netdev_pmd_stats stats;
>>>>>>>> +
>>>>>>>> +    /* Cycles counters */
>>>>>>>> +    struct dp_netdev_pmd_cycles cycles;
>>>>>>>> +
>>>>>>>> +    /* Used to count cicles. See 'cycles_counter_end()' */
>>>>>>>> +    unsigned long long last_cycles;
>>>>>>>> +
>>>>>>>> +    struct latch exit_latch;        /* For terminating the pmd thread. */
>>>>>>>> +    struct seq *reload_seq;
>>>>>>>> +    uint64_t last_reload_seq;
>>>>>>>> +    atomic_bool reload;             /* Do we need to reload ports? */
>>>>>>>> +    pthread_t thread;
>>>>>>>> +    unsigned core_id;               /* CPU core id of this pmd thread. */
>>>>>>>> +    int numa_id;                    /* numa node id of this pmd thread. */
>>>>>>>> +    bool isolated;
>>>>>>>> +
>>>>>>>> +    /* Queue id used by this pmd thread to send packets on all netdevs if
>>>>>>>> +     * XPS disabled for this netdev. All static_tx_qid's are unique and less
>>>>>>>> +     * than 'cmap_count(dp->poll_threads)'. */
>>>>>>>> +    uint32_t static_tx_qid;
>>>>>>>> +
>>>>>>>> +    struct ovs_mutex port_mutex;    /* Mutex for 'poll_list' and 'tx_ports'. */
>>>>>>>> +    /* List of rx queues to poll. */
>>>>>>>> +    struct hmap poll_list OVS_GUARDED;
>>>>>>>> +    /* Map of 'tx_port's used for transmission.  Written by the main thread,
>>>>>>>> +     * read by the pmd thread. */
>>>>>>>> +    struct hmap tx_ports OVS_GUARDED;
>>>>>>>> +
>>>>>>>> +    /* These are thread-local copies of 'tx_ports'.  One contains only tunnel
>>>>>>>> +     * ports (that support push_tunnel/pop_tunnel), the other contains ports
>>>>>>>> +     * with at least one txq (that support send).  A port can be in both.
>>>>>>>> +     *
>>>>>>>> +     * There are two separate maps to make sure that we don't try to execute
>>>>>>>> +     * OUTPUT on a device which has 0 txqs or PUSH/POP on a non-tunnel device.
>>>>>>>> +     *
>>>>>>>> +     * The instances for cpu core NON_PMD_CORE_ID can be accessed by multiple
>>>>>>>> +     * threads, and thusly need to be protected by 'non_pmd_mutex'.  Every
>>>>>>>> +     * other instance will only be accessed by its own pmd thread. */
>>>>>>>> +    struct hmap tnl_port_cache;
>>>>>>>> +    struct hmap send_port_cache;
>>>>>>>> +
>>>>>>>> +    /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>>>>>>>> +     * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>>>>>>>> +     * values and subtracts them from 'stats' and 'cycles' before
>>>>>>>> +     * reporting to the user */
>>>>>>>> +    unsigned long long stats_zero[DP_N_STATS];
>>>>>>>> +    uint64_t cycles_zero[PMD_N_CYCLES];
>>>>>>>> +
>>>>>>>> +    /* Set to true if the pmd thread needs to be reloaded. */
>>>>>>>> +    bool need_reload;
>>>>>>>> };
>>>>>>>>
>>>>>>>> /* Interface to netdev-based datapath. */
>>>>>>>> --
>>>>>>>> 2.7.4
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>
>>>
>>>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
