On 2024-09-13 13:23, Jerin Jacob wrote:
On Fri, Sep 13, 2024 at 12:17 PM Mattias Rönnblom <hof...@lysator.liu.se> wrote:

On 2024-09-12 17:11, Jerin Jacob wrote:
On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom <hof...@lysator.liu.se> wrote:

On 2024-09-12 15:09, Jerin Jacob wrote:
On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
<mattias.ronnb...@ericsson.com> wrote:

Add basic micro benchmark for lcore variables, in an attempt to assure
that the overhead isn't significantly greater than alternative
approaches, in scenarios where the benefits aren't expected to show up
(i.e., when plenty of cache is available compared to the working set
size of the per-lcore data).

Signed-off-by: Mattias Rönnblom <mattias.ronnb...@ericsson.com>
---
    app/test/meson.build           |   1 +
    app/test/test_lcore_var_perf.c | 160 +++++++++++++++++++++++++++++++++
    2 files changed, 161 insertions(+)
    create mode 100644 app/test/test_lcore_var_perf.c


+static double
+benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
+{
+       uint64_t i;
+       uint64_t start;
+       uint64_t end;
+       double latency;
+
+       init_fun();
+
+       start = rte_get_timer_cycles();
+
+       for (i = 0; i < ITERATIONS; i++)
+               update_fun();
+
+       end = rte_get_timer_cycles();

Use the precise variant, rte_rdtsc_precise() or similar, to be accurate.

With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.

I was thinking of it another way: with 1e7 iterations, the additional
barrier in the precise variant will be amortized, and we get more
_deterministic_ behavior, especially if we print cycles and need to
catch regressions.

If you time a section of code which spends ~40000000 cycles, it doesn't
matter if you add or remove a few cycles at the beginning and the end.

The rte_rdtsc_precise() is both better (more precise in the sense of
more serialization), and worse (because it's more costly, and thus more
intrusive).

We can calibrate the overhead to remove the cost.

What you are interested in is primarily the impact on (instruction) throughput, not the latency of the sequence of instructions that must be retired in order to load the lcore variable values, when you switch from
(say) lcore id-indexed static arrays to lcore variables in your module.

Usually, there is no reason to make a distinction between latency and throughput in this context, but as you zoom into very short snippets of code being executed, the difference becomes relevant. For example, adding a div instruction won't necessarily add 12 cc to your program's execution time on a Zen 4, even though that is its latency. Rather, the effects may, depending on data dependencies and what other instructions are executed in parallel, be much smaller.

So, one could argue the ILP you get with the loop is a feature, not a bug.

With or without per-iteration latency measurements, these benchmarks are not very useful at best, and misleading at worst. I will rework them to include more than a single module/lcore variable, which I think would be somewhat of an improvement.

Even better would be to have some real domain logic, instead of just a dummy multiplication.


You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It
doesn't matter.

Yes, in this setup, but it is pretty inaccurate PER iteration. Please
refer to the patches below to see the difference.

Patch 1: Convert nanoseconds to cycles per iteration
------------------------------------------------------------------

diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
index ea1d7ba90b52..b8d25400f593 100644
--- a/app/test/test_lcore_var_perf.c
+++ b/app/test/test_lcore_var_perf.c
@@ -110,7 +110,7 @@ benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))

         end = rte_get_timer_cycles();

-       latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
+       latency = ((end - start)) / ITERATIONS;

         return latency;
  }
@@ -137,8 +137,7 @@ test_lcore_var_access(void)

-       printf("Latencies [ns/update]\n");
+       printf("Latencies [cycles/update]\n");
         printf("Thread-local storage  Static array  Lcore variables\n");
-       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
-              sarray_latency * 1e9, lvar_latency * 1e9);
+       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
+              lvar_latency);

         return TEST_SUCCESS;
  }


Patch 2: Change to precise with calibration
-----------------------------------------------------------

diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
index ea1d7ba90b52..8142ecd56241 100644
--- a/app/test/test_lcore_var_perf.c
+++ b/app/test/test_lcore_var_perf.c
@@ -96,23 +96,28 @@ lvar_update(void)
  static double
  benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
  {
-       uint64_t i;
+       double tsc_latency;
+       double latency;
         uint64_t start;
         uint64_t end;
-       double latency;
+       uint64_t i;

-       init_fun();
+       /* calculate rte_rdtsc_precise overhead */
+       start = rte_rdtsc_precise();
+       end = rte_rdtsc_precise();
+       tsc_latency = (end - start);

-       start = rte_get_timer_cycles();
+       init_fun();

-       for (i = 0; i < ITERATIONS; i++)
+       latency = 0;
+       for (i = 0; i < ITERATIONS; i++) {
+               start = rte_rdtsc_precise();
                 update_fun();
+               end = rte_rdtsc_precise();
+               latency += (end - start) - tsc_latency;
+       }

-       end = rte_get_timer_cycles();
-
-       latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
-
-       return latency;
+       return latency / (double)ITERATIONS;
  }

  static int
@@ -135,10 +140,9 @@ test_lcore_var_access(void)
         sarray_latency = benchmark_access_method(sarray_init, sarray_update);
         lvar_latency = benchmark_access_method(lvar_init, lvar_update);

-       printf("Latencies [ns/update]\n");
+       printf("Latencies [cycles/update]\n");
         printf("Thread-local storage  Static array  Lcore variables\n");
-       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
-              sarray_latency * 1e9, lvar_latency * 1e9);
+       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
+              lvar_latency);

         return TEST_SUCCESS;
  }

ARM N2 core with patch 1 (aka current scheme)
-----------------------------------

  + ------------------------------------------------------- +
  + Test Suite : lcore variable perf autotest
  + ------------------------------------------------------- +
Latencies [cycles/update]
Thread-local storage  Static array  Lcore variables
                  7.0           7.0              7.0


ARM N2 core with patch 2
-----------------------------------

  + ------------------------------------------------------- +
  + Test Suite : lcore variable perf autotest
  + ------------------------------------------------------- +
Latencies [cycles/update]
Thread-local storage  Static array  Lcore variables
                 11.4          15.5             15.5

x86 i9 core with patch 1 (aka current scheme)
------------------------------------------------------------

  + ------------------------------------------------------- +
  + Test Suite : lcore variable perf autotest
  + ------------------------------------------------------- +
Latencies [ns/update]
Thread-local storage  Static array  Lcore variables
                  5.0           6.0              6.0

x86 i9 core with patch 2
--------------------------------
  + ------------------------------------------------------- +
  + Test Suite : lcore variable perf autotest
  + ------------------------------------------------------- +
Latencies [cycles/update]
Thread-local storage  Static array  Lcore variables
                  5.3          10.6             11.7






Furthermore, you may consider replacing rte_rand() in the fast path with
a running number or similar, if it makes the cycle computation
non-deterministic.

rte_rand() is not used in the fast path. I don't understand what you
mean by "running number".

I missed that. Ignore this comment.
