Re: [dpdk-dev] [PATCH v8 1/3] doc: add optimizations using C11 atomic built-ins

Honnappa Nagarahalli Thu, 16 Jul 2020 11:23:24 -0700

<snip>

> Subject: [PATCH v8 1/3] doc: add optimizations using C11 atomic built-ins
> 
> Add information about possible optimizations using C11 atomic built-ins.
> 
> Signed-off-by: Phil Yang <phil.y...@arm.com>
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>


Thanks for the changes; they look good now.

David wanted 'built-ins' changed to 'builtins'. Otherwise:
Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>

> ---
>  doc/guides/prog_guide/writing_efficient_code.rst | 59 +++++++++++++++++++++++-
>  1 file changed, 58 insertions(+), 1 deletion(-)
> 
> diff --git a/doc/guides/prog_guide/writing_efficient_code.rst b/doc/guides/prog_guide/writing_efficient_code.rst
> index 849f63e..53a1ca1 100644
> --- a/doc/guides/prog_guide/writing_efficient_code.rst
> +++ b/doc/guides/prog_guide/writing_efficient_code.rst
> @@ -167,7 +167,13 @@ but with the added cost of lower throughput.
>  Locks and Atomic Operations
>  ---------------------------
> 
> -Atomic operations imply a lock prefix before the instruction,
> +This section describes some key considerations when using locks and
> +atomic operations in the DPDK environment.
> +
> +Locks
> +~~~~~
> +
> +On x86, atomic operations imply a lock prefix before the instruction,
>  causing the processor's LOCK# signal to be asserted during execution of the following instruction.
>  This has a big impact on performance in a multicore environment.
> 
> @@ -176,6 +182,57 @@ It can often be replaced by other solutions like per-lcore variables.
>  Also, some locking techniques are more efficient than others.
>  For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.
> 
> +Atomic Operations: Use C11 Atomic Built-ins
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +DPDK generic rte_atomic operations are implemented by __sync built-ins.
> +These __sync built-ins result in full barriers on aarch64, which are
> +unnecessary in many use cases. They can be replaced by __atomic
> +built-ins that conform to the C11 memory model and provide finer memory order control.
> +
> +So replacing the rte_atomic operations with __atomic built-ins might
> +improve performance for aarch64 machines.
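
To make the difference concrete, here is a minimal sketch, not part of the patch, of the same add operation written with both families of GCC/Clang built-ins. The helper names are hypothetical. The __atomic form takes an explicit memory order, which is what lets the cases listed further down pick something weaker than a full barrier.

```c
#include <stdint.h>

/* Hypothetical helper names, for illustration only. */

/* Legacy style: the __sync built-in has no memory-order argument and
 * always behaves as a full barrier. */
static inline uint32_t
sketch_add_sync(uint32_t *cnt, uint32_t val)
{
    return __sync_add_and_fetch(cnt, val);
}

/* C11 style: the memory order is an explicit argument. SEQ_CST is the
 * strongest order; weaker orders can be used where the use case allows. */
static inline uint32_t
sketch_add_atomic(uint32_t *cnt, uint32_t val)
{
    return __atomic_add_fetch(cnt, val, __ATOMIC_SEQ_CST);
}
```
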
> +
> +Some typical optimization cases are listed below:
> +
> +Atomicity
> +^^^^^^^^^
> +
> +Some use cases require atomicity alone; the ordering of the memory
> +operations does not matter. For example, the packet statistics counters
> +need to be incremented atomically but do not need any particular memory ordering.
> +So, RELAXED memory ordering is sufficient.
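
For illustration only, a statistics counter along these lines could look like the sketch below. The struct and function names are hypothetical and do not come from DPDK. It assumes a 64-bit target where the 64-bit __atomic built-ins are available natively.

```c
#include <stdint.h>

/* Hypothetical statistics block; not a DPDK structure. */
struct sketch_pkt_stats {
    uint64_t rx_pkts;
};

/* Increment atomically. No ordering with respect to other memory
 * operations is needed, so RELAXED is enough. */
static inline void
sketch_stats_rx_add(struct sketch_pkt_stats *s, uint64_t n)
{
    __atomic_fetch_add(&s->rx_pkts, n, __ATOMIC_RELAXED);
}

/* Read the counter atomically, again without any ordering. */
static inline uint64_t
sketch_stats_rx_read(const struct sketch_pkt_stats *s)
{
    return __atomic_load_n(&s->rx_pkts, __ATOMIC_RELAXED);
}
```

The increment still cannot be lost when several lcores update the counter; only the ordering against neighbouring loads and stores is given up.
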
> +
> +One-way Barrier
> +^^^^^^^^^^^^^^^
> +
> +Some use cases allow for memory reordering in one way while requiring
> +memory ordering in the other direction.
> +
> +For example, the memory operations before the spinlock lock are allowed
> +to move into the critical section, but the memory operations in the
> +critical section are not allowed to move above the lock. In this case,
> +the full memory barrier in the compare-and-swap operation can be replaced with ACQUIRE memory order.
> +On the other hand, the memory operations after the spinlock unlock are
> +allowed to move into the critical section, but the memory operations in
> +the critical section are not allowed to move below the unlock. So the
> +full barrier in the store operation can be replaced with RELEASE memory order.
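
A rough sketch of a spinlock built this way is shown below. It is an illustration under my own naming (sketch_spinlock_*), not the rte_spinlock implementation.

```c
#include <stdbool.h>

/* Hypothetical lock type; 0 means unlocked, 1 means locked. */
typedef struct {
    int locked;
} sketch_spinlock_t;

/* Try to take the lock once. ACQUIRE on success keeps the critical
 * section from moving above the lock; RELAXED is enough on failure. */
static inline bool
sketch_spinlock_trylock(sketch_spinlock_t *sl)
{
    int expected = 0;

    return __atomic_compare_exchange_n(&sl->locked, &expected, 1, false,
                                       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

/* Spin until the lock is taken. */
static inline void
sketch_spinlock_lock(sketch_spinlock_t *sl)
{
    while (!sketch_spinlock_trylock(sl)) {
        /* Wait with plain relaxed loads before retrying the CAS. */
        while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
            ;
    }
}

/* RELEASE keeps the critical section from moving below the unlock. */
static inline void
sketch_spinlock_unlock(sketch_spinlock_t *sl)
{
    __atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
}
```
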
> +
> +Reader-Writer Concurrency
> +^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Lock-free reader-writer concurrency is one of the common use cases in DPDK.
> +
> +The payload, or the data that the writer wants to communicate to the
> +reader, can be written with RELAXED memory order. However, the guard
> +variable should be written with RELEASE memory order. This ensures that
> +the store to the guard variable is observable only after the store to the payload is observable.
> +
> +Correspondingly, on the reader side, the guard variable should be read
> +with ACQUIRE memory order. The payload, or the data the writer
> +communicated, can be read with RELAXED memory order. This ensures that,
> +if the store to the guard variable is observable, the store to the payload is also observable.
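
As a sketch of the guard/payload pattern described here, assuming a single writer and a hypothetical one-slot mailbox (the names are mine, not from the patch):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical one-slot mailbox: 'ready' is the guard variable. */
struct sketch_msg_slot {
    uint32_t payload;
    uint32_t ready;
};

/* Writer: payload first with RELAXED, then the guard with RELEASE so
 * that the payload store cannot be observed after the guard store. */
static inline void
sketch_slot_publish(struct sketch_msg_slot *m, uint32_t value)
{
    __atomic_store_n(&m->payload, value, __ATOMIC_RELAXED);
    __atomic_store_n(&m->ready, 1, __ATOMIC_RELEASE);
}

/* Reader: guard first with ACQUIRE; if it is set, the payload store is
 * guaranteed to be visible and can be read with RELAXED. */
static inline bool
sketch_slot_try_consume(struct sketch_msg_slot *m, uint32_t *out)
{
    if (__atomic_load_n(&m->ready, __ATOMIC_ACQUIRE) == 0)
        return false;

    *out = __atomic_load_n(&m->payload, __ATOMIC_RELAXED);
    return true;
}
```

The RELEASE store and the ACQUIRE load have to pair on the same guard variable; using RELAXED on either side would lose the guarantee.
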
> +
>  Coding Considerations
>  ---------------------
> 
> --
> 2.7.4
