On 22.11.2017 20:14, Bodireddy, Bhanuprakash wrote:
>> This reverts commit a807c15796ddc43ba1ffb2a6b0bd2ad4e2b73941.
>>
>> Padding and aligning of dp_netdev_pmd_thread structure members is
>> useless, broken in several ways and only greatly degrades the maintainability
>> and extensibility of the structure.
>
> The idea of my earlier patch was to mark the cache lines and reduce the holes
> while still maintaining the grouping of related members in this structure.
Some of the grouping looks strange. For example, it seems illogical that
'exit_latch' is grouped with 'flow_table' rather than with 'reload_seq' and
the other reload-related members. It also looks strange that statistics and
counters are spread across different groups. So, IMHO, it's not well grouped.
> Also, cache line marking is a good practice to make someone extra cautious
> when extending or editing important structures.
> Most importantly, I was experimenting with prefetching on this structure and
> needed cache line markers for it.
>
> I see that you are on ARM (I don't have HW to test) and want to know if this
> commit has any negative effect; any numbers would be appreciated.
Basic VM-VM testing shows a stable 0.5% performance improvement with the
revert applied.
The padding adds 560 additional bytes of holes.
> More comments inline.
>
>>
>> Issues:
>>
>> 1. It's not working because all the instances of struct
>> dp_netdev_pmd_thread are allocated only by the usual malloc. None of
>> that memory is aligned to cache lines, so the structure almost never
>> starts at an aligned memory address. This means that any further
>> padding and alignment inside the structure is completely
>> useless. For example:
>>
>> Breakpoint 1, pmd_thread_main
>> (gdb) p pmd
>> $49 = (struct dp_netdev_pmd_thread *) 0x1b1af20
>> (gdb) p &pmd->cacheline1
>> $51 = (OVS_CACHE_LINE_MARKER *) 0x1b1af60
>> (gdb) p &pmd->cacheline0
>> $52 = (OVS_CACHE_LINE_MARKER *) 0x1b1af20
>> (gdb) p &pmd->flow_cache
>> $53 = (struct emc_cache *) 0x1b1afe0
>>
>> All of the above addresses are shifted from the cache-line start by 32B.
>
> If you see below all the addresses are 64 byte aligned.
>
> (gdb) p pmd
> $1 = (struct dp_netdev_pmd_thread *) 0x7fc1e9b1a040
> (gdb) p &pmd->cacheline0
> $2 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a040
> (gdb) p &pmd->cacheline1
> $3 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a080
> (gdb) p &pmd->flow_cache
> $4 = (struct emc_cache *) 0x7fc1e9b1a0c0
> (gdb) p &pmd->flow_table
> $5 = (struct cmap *) 0x7fc1e9fba100
> (gdb) p &pmd->stats
> $6 = (struct dp_netdev_pmd_stats *) 0x7fc1e9fba140
> (gdb) p &pmd->port_mutex
> $7 = (struct ovs_mutex *) 0x7fc1e9fba180
> (gdb) p &pmd->poll_list
> $8 = (struct hmap *) 0x7fc1e9fba1c0
> (gdb) p &pmd->tnl_port_cache
> $9 = (struct hmap *) 0x7fc1e9fba200
> (gdb) p &pmd->stats_zero
> $10 = (unsigned long long (*)[5]) 0x7fc1e9fba240
>
> I tried using xzalloc_cacheline() instead of the default xzalloc() here. I
> tried tens of times and always found that the address is 64-byte aligned,
> so it should start at the beginning of a cache line on x86.
> Not sure why the comment "(The memory returned will not be at the start of
> a cache line, though, so don't assume such alignment.)" says otherwise?
Yes, you will always get aligned addresses on your x86 Linux system because
it supports the posix_memalign() call. The comment says what it says because
the allocator falls back to memory allocation tricks when posix_memalign()
is not available (Windows, some macOS, maybe some Linux systems (not sure)),
and in that case the returned address will not be aligned.
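
For illustration only, a minimal sketch of the kind of fallback such an
allocator has to use when posix_memalign() is unavailable (simplified names,
not the exact OVS code): over-allocate, round the pointer up to the next
cache-line boundary, stash the original pointer just before the payload so
it can be freed later, and return an address that therefore sits
sizeof(void *) bytes past the boundary:

    #include <stdint.h>
    #include <stdlib.h>

    #define CACHE_LINE_SIZE 64
    #define ROUND_UP(x, y) ((((x) + (y) - 1) / (y)) * (y))

    /* Returns memory that doesn't share cache lines with neighbouring
     * blocks, but the returned pointer itself is NOT cache-line aligned. */
    static void *
    malloc_cacheline_fallback(size_t size)
    {
        void *base = malloc((CACHE_LINE_SIZE - 1)
                            + ROUND_UP(sizeof(void *) + size,
                                       CACHE_LINE_SIZE));
        if (!base) {
            return NULL;
        }
        /* First cache-line boundary inside the allocation. */
        void **payload = (void **) ROUND_UP((uintptr_t) base,
                                            CACHE_LINE_SIZE);
        *payload = base;                    /* Remember 'base' for free(). */
        return (char *) payload + sizeof(void *);
    }

    static void
    free_cacheline_fallback(void *p)
    {
        if (p) {
            free(*(void **) ((char *) p - sizeof(void *)));
        }
    }

So any padding inside dp_netdev_pmd_thread that assumes the structure itself
starts on a cache-line boundary is defeated on such systems.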
>
>>
>> Can we fix it properly? NO.
>> OVS currently doesn't have an appropriate API to allocate aligned
>> memory. The best candidate is 'xmalloc_cacheline()' but it
>> clearly states that "The memory returned will not be at the
>> start of a cache line, though, so don't assume such alignment".
>> And also, this function will never return aligned memory on
>> Windows or MacOS.
>>
>> 2. CACHE_LINE_SIZE is not constant. Different architectures have
>> different cache line sizes, but the code assumes that
>> CACHE_LINE_SIZE is always equal to 64 bytes. All the structure
>> members are grouped by 64 bytes and padded to CACHE_LINE_SIZE.
>> This leads to huge holes in the structures if CACHE_LINE_SIZE
>> differs from 64. This is the opposite of portability. If I want
>> good performance of cmap I need CACHE_LINE_SIZE to equal
>> the real cache line size, but then I get huge holes in the
>> structures. If you take a look at struct rte_mbuf from DPDK
>> you'll see that it uses 2 defines: RTE_CACHE_LINE_SIZE and
>> RTE_CACHE_LINE_MIN_SIZE to avoid holes in the mbuf structure.
>
> I understand that ARM and a few other processors (like OCTEON) have
> 128-byte cache lines.
> But again, I am curious about the performance impact in your case with this
> new alignment.
>
>>
>> 3. Sizes of system/libc defined types are not constant across
>> systems. For example, sizeof(pthread_mutex_t) == 48 on my
>> ARMv8 machine, but only 40 on x86. The difference could be
>> much bigger on Windows or MacOS systems. But the code assumes
>> that sizeof(struct ovs_mutex) is always 48 bytes. This may lead
>> to broken alignment or big holes in case of padding, and wrong
>> comments about the amount of free pad bytes.
>
> This isn't really an issue, as you already mentioned; it is more an issue
> with the comment that states the number of pad bytes.
> In case of ARM it would be just 8 pad bytes instead of 16 on x86.
>
> union {
>     struct {
>         struct ovs_mutex port_mutex;   /* 4849984    48 */
>     };                                 /*            48 */
>     uint8_t pad13[64];                 /*            64 */
> };
>
It's not only about 'port_mutex'. If you take a look at 'flow_mutex', you
will see that it is not even padded. So, increasing the size of 'flow_mutex'
shifts all the following padded blocks, and none of them will be properly
aligned even if the structure is allocated on aligned memory.
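
As a toy example (made-up member names, not the real structure), a single
unpadded member whose size varies across platforms pushes every later
"cache-line sized" block off its boundary:

    #include <pthread.h>
    #include <stddef.h>
    #include <stdio.h>

    #define CACHE_LINE_SIZE 64

    /* 'mutex' is not padded to a cache-line multiple, so everything after
     * it, including blocks carefully padded to CACHE_LINE_SIZE, is shifted
     * by sizeof(pthread_mutex_t): 40 bytes on x86 glibc, 48 on my ARMv8
     * machine. */
    struct toy_pmd {
        pthread_mutex_t mutex;              /* Size differs per platform. */
        char padded_block[CACHE_LINE_SIZE]; /* Meant to fill one cache line. */
    };

    int
    main(void)
    {
        size_t off = offsetof(struct toy_pmd, padded_block);
        printf("padded_block offset: %zu, cache-line aligned: %s\n",
               off, off % CACHE_LINE_SIZE ? "no" : "yes");
        return 0;
    }

On a typical x86_64 glibc build this prints offset 40, so even a perfectly
aligned allocation would not bring 'padded_block' back onto a cache-line
boundary.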
>>
>> 4. The sizes of many fields in the structure depend on defines like
>> DP_N_STATS, PMD_N_CYCLES, EM_FLOW_HASH_ENTRIES and so on.
>> Any change in these defines, or any change in any structure
>> contained by the thread, leads to a not-so-simple
>> refactoring of the whole dp_netdev_pmd_thread structure. This
>> greatly reduces maintainability and complicates development of
>> new features.
>
> I don't think it complicates development; instead, I feel the commit gives
> a clear indication to the developer that the members are grouped, aligned
> and marked with cacheline markers.
> This makes the developer extra cautious when adding new members so that
> holes can be avoided.
While starting the rebase of the output batching patch-set I figured out
that I need to remove 'unsigned long long last_cycles' and add 'struct
dp_netdev_pmd_thread_ctx ctx', which is 8 bytes larger. Could you please
suggest where I should place that new structure member and what to do with
the hole left by 'last_cycles'?
This is not a trivial question, because the already poor grouping will get
worse almost no matter what.
>
> Cacheline marking the structure is a good practice and I am sure this
> structure is significant and should be carefully extended in the future.
Not so sure about that.
>
>>
>> 5. There is no reason to align the flow_cache member because it's
>> too big and we usually access random entries from a single
>> thread only.
>>
>
> I see your point. This patch wasn't done for performance but more to have
> some order in this ever-growing structure. During testing I found that for
> some test cases aligning the flow_cache consistently gave me 100k+
> performance, so it was added.
That was a random performance boost. You achieved it without aligned memory
allocation, so it was just luck with your system environment. Using
xzalloc_cacheline() will likely eliminate this performance difference or
even degrade the performance.
>
>> So, the padding/alignment only creates an appearance of performance
>> optimization but does nothing useful in reality. It only complicates
>> maintenance and adds huge holes for non-x86 architectures and non-Linux
>> systems. The performance improvement stated in the original commit
>> message is likely random and not meaningful. I see no performance
>> difference.
>
> I understand that this is causing issues with architectures having
> different cache line sizes, but unfortunately the majority of them have
> 64-byte cache lines, so this change makes sense.
I understand that 64-byte cache lines are much more widespread. I also have
x86 as a target arch, but still, IMHO, OVS is a cross-platform application
and it should not have platform-dependent code that favours one
architecture/platform and penalizes others.
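
If grouping by cache lines is really wanted, the rte_mbuf approach mentioned
in point 2 above looks roughly like this (a sketch with illustrative names
and values, not a proposal for the actual layout):

    /* DPDK-style two-define approach: members are grouped by the smallest
     * cache-line size among supported CPUs, while the whole structure is
     * aligned to the real cache-line size of the target.  On a CPU with
     * 128-byte lines the two 64-byte groups then share one line instead of
     * each leaving a 64-byte hole. */
    #define MY_CACHE_LINE_MIN_SIZE 64
    #define MY_CACHE_LINE_SIZE     128

    struct example {
        char group0[MY_CACHE_LINE_MIN_SIZE];   /* First group of members. */
        char group1[MY_CACHE_LINE_MIN_SIZE];   /* Second group of members. */
    } __attribute__((aligned(MY_CACHE_LINE_SIZE)));

    _Static_assert(sizeof(struct example) == MY_CACHE_LINE_SIZE,
                   "both groups fit in a single 128-byte line, no holes");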
>
> If you have performance data to prove that this causes severe perf
> degradation, I can think of workarounds for ARM.
>
> - Bhanuprakash.
P.S.: If you want to test with CACHE_LINE_SIZE=128 you will have to apply
the following patch to avoid a build-time assert (I'll send it formally
later):
-----------------------------------------------------------------------------
diff --git a/lib/cmap.c b/lib/cmap.c
index 35decea..5b15ecd 100644
--- a/lib/cmap.c
+++ b/lib/cmap.c
@@ -123,12 +123,11 @@ COVERAGE_DEFINE(cmap_shrink);
/* Number of entries per bucket: 7 on 32-bit, 5 on 64-bit. */
#define CMAP_K ((CACHE_LINE_SIZE - 4) / CMAP_ENTRY_SIZE)
-/* Pad to make a bucket a full cache line in size: 4 on 32-bit, 0 on 64-bit. */
-#define CMAP_PADDING ((CACHE_LINE_SIZE - 4) - (CMAP_K * CMAP_ENTRY_SIZE))
-
/* A cuckoo hash bucket. Designed to be cache-aligned and exactly one cache
* line long. */
struct cmap_bucket {
+ /* Padding to make cmap_bucket exactly one cache line long. */
+ PADDED_MEMBERS(CACHE_LINE_SIZE,
/* Allows readers to track in-progress changes. Initially zero, each
* writer increments this value just before and just after each change (see
* cmap_set_bucket()). Thus, a reader can ensure that it gets a consistent
@@ -145,11 +144,7 @@ struct cmap_bucket {
* slots. */
uint32_t hashes[CMAP_K];
struct cmap_node nodes[CMAP_K];
-
- /* Padding to make cmap_bucket exactly one cache line long. */
-#if CMAP_PADDING > 0
- uint8_t pad[CMAP_PADDING];
-#endif
+ );
};
BUILD_ASSERT_DECL(sizeof(struct cmap_bucket) == CACHE_LINE_SIZE);
diff --git a/lib/util.h b/lib/util.h
index 3c43c2c..514fdaa 100644
--- a/lib/util.h
+++ b/lib/util.h
@@ -61,7 +61,7 @@ struct Bad_arg_to_ARRAY_SIZE {
/* This system's cache line size, in bytes.
* Being wrong hurts performance but not correctness. */
-#define CACHE_LINE_SIZE 64
+#define CACHE_LINE_SIZE 128
BUILD_ASSERT_DECL(IS_POW2(CACHE_LINE_SIZE));
/* Cacheline marking is typically done using zero-sized array.
-----------------------------------------------------------------------------
Best regards, Ilya Maximets.
>
>>
>> Most of the above issues are also true for some other padded/aligned
>> structures like 'struct netdev_dpdk'. They will be treated separately.
>>
>> CC: Bhanuprakash Bodireddy <[email protected]>
>> CC: Ben Pfaff <[email protected]>
>> Signed-off-by: Ilya Maximets <[email protected]>
>> ---
>> lib/dpif-netdev.c | 160 +++++++++++++++++++++++-----------------------------
>> --
>> 1 file changed, 69 insertions(+), 91 deletions(-)
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 0a62630..6784269
>> 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -547,31 +547,18 @@ struct tx_port {
>> * actions in either case.
>> * */
>> struct dp_netdev_pmd_thread {
>> - PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
>> - struct dp_netdev *dp;
>> - struct cmap_node node; /* In 'dp->poll_threads'. */
>> - pthread_cond_t cond; /* For synchronizing pmd thread
>> - reload. */
>> - );
>> -
>> - PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
>> - struct ovs_mutex cond_mutex; /* Mutex for condition variable. */
>> - pthread_t thread;
>> - unsigned core_id; /* CPU core id of this pmd thread.
>> */
>> - int numa_id; /* numa node id of this pmd thread.
>> */
>> - );
>> + struct dp_netdev *dp;
>> + struct ovs_refcount ref_cnt; /* Every reference must be refcount'ed.
>> */
>> + struct cmap_node node; /* In 'dp->poll_threads'. */
>> +
>> + pthread_cond_t cond; /* For synchronizing pmd thread reload.
>> */
>> + struct ovs_mutex cond_mutex; /* Mutex for condition variable. */
>>
>> /* Per thread exact-match cache. Note, the instance for cpu core
>> * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly
>> * need to be protected by 'non_pmd_mutex'. Every other instance
>> * will only be accessed by its own pmd thread. */
>> - OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct emc_cache flow_cache;
>> - struct ovs_refcount ref_cnt; /* Every reference must be refcount'ed.
>> */
>> -
>> - /* Queue id used by this pmd thread to send packets on all netdevs if
>> - * XPS disabled for this netdev. All static_tx_qid's are unique and less
>> - * than 'cmap_count(dp->poll_threads)'. */
>> - uint32_t static_tx_qid;
>> + struct emc_cache flow_cache;
>>
>> /* Flow-Table and classifiers
>> *
>> @@ -580,77 +567,68 @@ struct dp_netdev_pmd_thread {
>> * 'flow_mutex'.
>> */
>> struct ovs_mutex flow_mutex;
>> - PADDED_MEMBERS(CACHE_LINE_SIZE,
>> - struct cmap flow_table OVS_GUARDED; /* Flow table. */
>> -
>> - /* One classifier per in_port polled by the pmd */
>> - struct cmap classifiers;
>> - /* Periodically sort subtable vectors according to hit frequencies
>> */
>> - long long int next_optimization;
>> - /* End of the next time interval for which processing cycles
>> - are stored for each polled rxq. */
>> - long long int rxq_next_cycle_store;
>> -
>> - /* Cycles counters */
>> - struct dp_netdev_pmd_cycles cycles;
>> -
>> - /* Used to count cycles. See 'cycles_counter_end()'. */
>> - unsigned long long last_cycles;
>> - struct latch exit_latch; /* For terminating the pmd thread.
>> */
>> - );
>> -
>> - PADDED_MEMBERS(CACHE_LINE_SIZE,
>> - /* Statistics. */
>> - struct dp_netdev_pmd_stats stats;
>> -
>> - struct seq *reload_seq;
>> - uint64_t last_reload_seq;
>> - atomic_bool reload; /* Do we need to reload ports? */
>> - bool isolated;
>> -
>> - /* Set to true if the pmd thread needs to be reloaded. */
>> - bool need_reload;
>> - /* 5 pad bytes. */
>> - );
>> -
>> - PADDED_MEMBERS(CACHE_LINE_SIZE,
>> - struct ovs_mutex port_mutex; /* Mutex for 'poll_list'
>> - and 'tx_ports'. */
>> - /* 16 pad bytes. */
>> - );
>> - PADDED_MEMBERS(CACHE_LINE_SIZE,
>> - /* List of rx queues to poll. */
>> - struct hmap poll_list OVS_GUARDED;
>> - /* Map of 'tx_port's used for transmission. Written by the main
>> - * thread, read by the pmd thread. */
>> - struct hmap tx_ports OVS_GUARDED;
>> - );
>> - PADDED_MEMBERS(CACHE_LINE_SIZE,
>> - /* These are thread-local copies of 'tx_ports'. One contains only
>> - * tunnel ports (that support push_tunnel/pop_tunnel), the other
>> - * contains ports with at least one txq (that support send).
>> - * A port can be in both.
>> - *
>> - * There are two separate maps to make sure that we don't try to
>> - * execute OUTPUT on a device which has 0 txqs or PUSH/POP on a
>> - * non-tunnel device.
>> - *
>> - * The instances for cpu core NON_PMD_CORE_ID can be accessed by
>> - * multiple threads and thusly need to be protected by
>> 'non_pmd_mutex'.
>> - * Every other instance will only be accessed by its own pmd
>> thread. */
>> - struct hmap tnl_port_cache;
>> - struct hmap send_port_cache;
>> - );
>> -
>> - PADDED_MEMBERS(CACHE_LINE_SIZE,
>> - /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>> - * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>> - * values and subtracts them from 'stats' and 'cycles' before
>> - * reporting to the user */
>> - unsigned long long stats_zero[DP_N_STATS];
>> - uint64_t cycles_zero[PMD_N_CYCLES];
>> - /* 8 pad bytes. */
>> - );
>> + struct cmap flow_table OVS_GUARDED; /* Flow table. */
>> +
>> + /* One classifier per in_port polled by the pmd */
>> + struct cmap classifiers;
>> + /* Periodically sort subtable vectors according to hit frequencies */
>> + long long int next_optimization;
>> + /* End of the next time interval for which processing cycles
>> + are stored for each polled rxq. */
>> + long long int rxq_next_cycle_store;
>> +
>> + /* Statistics. */
>> + struct dp_netdev_pmd_stats stats;
>> +
>> + /* Cycles counters */
>> + struct dp_netdev_pmd_cycles cycles;
>> +
>> + /* Used to count cicles. See 'cycles_counter_end()' */
>> + unsigned long long last_cycles;
>> +
>> + struct latch exit_latch; /* For terminating the pmd thread. */
>> + struct seq *reload_seq;
>> + uint64_t last_reload_seq;
>> + atomic_bool reload; /* Do we need to reload ports? */
>> + pthread_t thread;
>> + unsigned core_id; /* CPU core id of this pmd thread. */
>> + int numa_id; /* numa node id of this pmd thread. */
>> + bool isolated;
>> +
>> + /* Queue id used by this pmd thread to send packets on all netdevs if
>> + * XPS disabled for this netdev. All static_tx_qid's are unique and less
>> + * than 'cmap_count(dp->poll_threads)'. */
>> + uint32_t static_tx_qid;
>> +
>> + struct ovs_mutex port_mutex; /* Mutex for 'poll_list' and
>> 'tx_ports'. */
>> + /* List of rx queues to poll. */
>> + struct hmap poll_list OVS_GUARDED;
>> + /* Map of 'tx_port's used for transmission. Written by the main thread,
>> + * read by the pmd thread. */
>> + struct hmap tx_ports OVS_GUARDED;
>> +
>> + /* These are thread-local copies of 'tx_ports'. One contains only
>> tunnel
>> + * ports (that support push_tunnel/pop_tunnel), the other contains ports
>> + * with at least one txq (that support send). A port can be in both.
>> + *
>> + * There are two separate maps to make sure that we don't try to execute
>> + * OUTPUT on a device which has 0 txqs or PUSH/POP on a non-tunnel
>> device.
>> + *
>> + * The instances for cpu core NON_PMD_CORE_ID can be accessed by
>> multiple
>> + * threads, and thusly need to be protected by 'non_pmd_mutex'. Every
>> + * other instance will only be accessed by its own pmd thread. */
>> + struct hmap tnl_port_cache;
>> + struct hmap send_port_cache;
>> +
>> + /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>> + * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>> + * values and subtracts them from 'stats' and 'cycles' before
>> + * reporting to the user */
>> + unsigned long long stats_zero[DP_N_STATS];
>> + uint64_t cycles_zero[PMD_N_CYCLES];
>> +
>> + /* Set to true if the pmd thread needs to be reloaded. */
>> + bool need_reload;
>> };
>>
>> /* Interface to netdev-based datapath. */
>> --
>> 2.7.4
>