Hi,

I'm having a look at the code for the `instrument-*' family of opcodes.

For example, in vm-engine.c:
--8<---------------cut here---------------start------------->8---
  VM_DEFINE_OP (1, instrument_entry, "instrument-entry", OP2 (X32, N32))
    {
#if ENABLE_JIT
      if (!VP->disable_mcode)
        {
          struct scm_jit_function_data *data;

          int32_t data_offset = ip[1];
          data = (struct scm_jit_function_data *) (ip + data_offset);

          if (data->mcode)
            {
              SYNC_IP ();
              scm_jit_enter_mcode (thread, data->mcode);
              CACHE_REGISTER ();
              NEXT (0);
            }

          if (data->counter >= scm_jit_counter_threshold)
            {
              const uint8_t *mcode;

              SYNC_IP ();
              mcode = scm_jit_compute_mcode (thread, data);

              if (mcode)
                {
                  scm_jit_enter_mcode (thread, mcode);
                  CACHE_REGISTER ();
                  NEXT (0);
                }
            }
          else
            data->counter += SCM_JIT_COUNTER_ENTRY_INCREMENT;
        }
#endif
--8<---------------cut here---------------end--------------->8---

It seems to me that something is not being done correctly here.  The
load and the increment of `data->counter' should be made with atomic
accesses.  I assume that these counters are part of the procedure's
bytecode, not local to the VM, and can therefore be accessed
concurrently by multiple threads.

So it should really look like this:
--8<---------------cut here---------------start------------->8---
  if (__atomic_load_n (&data->counter, __ATOMIC_RELAXED)
      >= scm_jit_counter_threshold)
    {
      ...
    }
  else
    __atomic_fetch_add (&data->counter, SCM_JIT_COUNTER_ENTRY_INCREMENT,
                        __ATOMIC_RELAXED);
--8<---------------cut here---------------end--------------->8---

Now, since these are simple heuristic counters, I guess that not using
atomic operations is not a problem per se, and will result in the
following scenarios in a multi-threaded environment:

  1. The JIT has already happened and the counter is not read anymore,
     yay!
  2. There is load/store tearing and the counters get corrupted
     => the JIT does not happen, or happens later than expected.
  3. There is load/store fusing and some increments of the counters
     simply disappear => the JIT happens later than expected.

Using atomics as proposed above would solve 2 and 3.

However, I see another problem, and this is the main discussion I want
to have here.  Basically, in a multi-threaded environment, these
heuristic counters bounce cache lines between cores, which can
drastically decrease performance until the JIT happens, _if_ it
happens at all.

One way to solve this is to use split-counters.  On Linux, it is
possible to have per-CPU split-counters and to use restartable
sequences (rseq(2)) to increment the local CPU's counter without
touching the cache lines of the other CPUs.  The main downsides of
this approach are the memory overhead involved and the need to sum all
the per-CPU counters to obtain the global value.
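
To illustrate, here is a minimal sketch of such a flat split-counter.
This is not Guile or librseq code: sched_getcpu(3) and a relaxed
atomic add stand in for the rseq(2) fast path, and the names
(`split_counter', `NCPUS', ...) are made up for the example:
--8<---------------cut here---------------start------------->8---
#define _GNU_SOURCE
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

#define NCPUS 64          /* assumed upper bound on the CPU count */

struct split_counter
{
  /* One slot per CPU, each padded to its own (assumed 64-byte) cache
     line, so that increments on one CPU do not bounce the lines of
     the others.  */
  struct { _Atomic uint64_t v; char pad[56]; } percpu[NCPUS];
};

static void
split_counter_inc (struct split_counter *c)
{
  /* With rseq(2) this would be a plain, non-atomic increment of the
     current CPU's slot, restarted if the thread migrates.  A relaxed
     atomic add on the slot of sched_getcpu(3) is a portable stand-in
     for that fast path.  */
  int cpu = sched_getcpu ();
  if (cpu < 0)
    cpu = 0;              /* fallback if sched_getcpu(3) fails */
  atomic_fetch_add_explicit (&c->percpu[cpu % NCPUS].v, 1,
                             memory_order_relaxed);
}

static uint64_t
split_counter_sum (struct split_counter *c)
{
  /* The downsides show up here: a read has to sum every slot, and
     each counter costs NCPUS * 64 bytes of memory.  */
  uint64_t sum = 0;
  int cpu;

  for (cpu = 0; cpu < NCPUS; cpu++)
    sum += atomic_load_explicit (&c->percpu[cpu].v, memory_order_relaxed);
  return sum;
}
--8<---------------cut here---------------end--------------->8---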

However, there is ongoing work on a percpu-tree structure:

  Level 0:  0    1    2    3    4    5    6    7
            |   /     |   /     |   /     |   /
            |  /      |  /      |  /      |  /
            | /       | /       | /       | /
  Level 1:  0         1         2         3
            |       /           |       /
            |    /              |    /
            | /                 | /
  Level 2:  0                   1
            |               /
            |         /
            |   /
  Level 3:  0

Counters at level 0 are per-CPU counters that get incremented very
fast with rseq(2).  The intermediate levels absorb the overflow (or a
configurable batch size) of the level below them, and are incremented
like regular atomic counters.  For example, with a batch size of 128,
counter 0 of level 0 has to be incremented 128 times before counter 0
of level 1 is incremented once.
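
Continuing the sketch above, here is how the carry propagation could
look for the 8-CPU tree of the diagram, with a batch size of 256 so
that the wrap-around of a one-byte counter is the carry condition.
Again, `tree_counter' and friends are hypothetical names, not the
actual implementation:
--8<---------------cut here---------------start------------->8---
#include <stdatomic.h>
#include <stdint.h>

#define NCPUS 8  /* the 8 level-0 counters of the diagram above */

struct tree_counter
{
  _Atomic uint8_t  level0[NCPUS];      /* per-CPU, the rseq fast path */
  _Atomic uint8_t  level1[NCPUS / 2];  /* absorb level-0 overflows */
  _Atomic uint8_t  level2[NCPUS / 4];  /* absorb level-1 overflows */
  _Atomic uint64_t level3;             /* 64-bit top-level counter */
};

static void
tree_counter_inc (struct tree_counter *c, int cpu)
{
  /* Fast path: bump this CPU's one-byte counter.  With rseq(2) this
     would be a plain non-atomic increment; a relaxed atomic add is a
     portable stand-in.  */
  if (atomic_fetch_add_explicit (&c->level0[cpu], 1,
                                 memory_order_relaxed) != UINT8_MAX)
    return;

  /* Slow path: the byte wrapped around, meaning 256 increments have
     accumulated, so carry one unit up along the edges of the diagram.  */
  if (atomic_fetch_add_explicit (&c->level1[cpu / 2], 1,
                                 memory_order_relaxed) != UINT8_MAX)
    return;
  if (atomic_fetch_add_explicit (&c->level2[cpu / 4], 1,
                                 memory_order_relaxed) != UINT8_MAX)
    return;
  atomic_fetch_add_explicit (&c->level3, 1, memory_order_relaxed);
}
--8<---------------cut here---------------end--------------->8---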

This seems to allocate even more memory, but it is in fact possible to
use a single byte for every counter except the top one.  In the above
example, the percpu-tree split-counter would use 8 + 4 + 2 + 8 = 22
bytes.  Furthermore, there is the librseq(3) library in user-space,
which has a specialized per-CPU memory pool allocator, so it is
possible to have a solution that is very compact in terms of memory.
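
Reading such a counter means walking the tree once and weighting each
level by its batch size.  With the same hypothetical layout as above
(batch size 256), that could look like:
--8<---------------cut here---------------start------------->8---
static uint64_t
tree_counter_sum (struct tree_counter *c)
{
  /* One level-1 unit is worth 256 level-0 increments, one level-2
     unit 256^2, and one level-3 unit 256^3.  The result is
     approximate while increments are in flight, which is fine for a
     JIT heuristic.  */
  uint64_t sum = atomic_load_explicit (&c->level3, memory_order_relaxed)
    * (256ULL * 256 * 256);
  int i;

  for (i = 0; i < NCPUS / 4; i++)
    sum += atomic_load_explicit (&c->level2[i], memory_order_relaxed)
      * (256ULL * 256);
  for (i = 0; i < NCPUS / 2; i++)
    sum += atomic_load_explicit (&c->level1[i], memory_order_relaxed)
      * 256ULL;
  for (i = 0; i < NCPUS; i++)
    sum += atomic_load_explicit (&c->level0[i], memory_order_relaxed);
  return sum;
}
--8<---------------cut here---------------end--------------->8---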

Combining all of this, it is possible to have percpu-tree
split-counters for instrumenting function entries (and possibly any
basic block) at a very low memory cost, while being much faster than
the current increments of a single shared counter.

Thoughts?

Thanks,
Olivier

-- 
Olivier Dion
EfficiOS Inc.
https://www.efficios.com

