Hi, I'm having a look at the code for the `instrument-*' family of opcodes.
For example in vm-engine.c:

--8<---------------cut here---------------start------------->8---
VM_DEFINE_OP (1, instrument_entry, "instrument-entry", OP2 (X32, N32))
{
#if ENABLE_JIT
  if (!VP->disable_mcode)
    {
      struct scm_jit_function_data *data;
      int32_t data_offset = ip[1];
      data = (struct scm_jit_function_data *) (ip + data_offset);

      if (data->mcode)
        {
          SYNC_IP ();
          scm_jit_enter_mcode (thread, data->mcode);
          CACHE_REGISTER ();
          NEXT (0);
        }

      if (data->counter >= scm_jit_counter_threshold)
        {
          const uint8_t *mcode;

          SYNC_IP ();
          mcode = scm_jit_compute_mcode (thread, data);

          if (mcode)
            {
              scm_jit_enter_mcode (thread, mcode);
              CACHE_REGISTER ();
              NEXT (0);
            }
        }
      else
        data->counter += SCM_JIT_COUNTER_ENTRY_INCREMENT;
    }
#endif
--8<---------------cut here---------------end--------------->8---

It seems to me that something is not being done correctly here.  The
load and the increment of `data->counter' should be made with atomic
accesses.  I assume that these counters are part of the procedure
bytecode and not some counter local to the VM.  So really it should
look like so:

--8<---------------cut here---------------start------------->8---
if (__atomic_load_n (&data->counter, __ATOMIC_RELAXED)
    >= scm_jit_counter_threshold)
  {
    ...
  }
else
  __atomic_fetch_add (&data->counter, SCM_JIT_COUNTER_ENTRY_INCREMENT,
                      __ATOMIC_RELAXED);
--8<---------------cut here---------------end--------------->8---

Now, since these counters are simple heuristic counters, I guess that
not doing atomic operations is not a problem per se and will result in
the following scenarios in a multi-threaded environment:

  1. JIT happened, the counter is not read anymore, yeah!

  2. There are load-store tearings and the counters get messed up
     => JIT does not happen, or happens later than expected.

  3. There are load-store fusings and some increments of the counters
     simply disappear => JIT happens later than expected.

Using atomics as proposed above would solve 2. and 3.

However, I see another problem, and this is the main discussion I want
to have here.  Basically, in a multi-threaded environment, these
heuristic counters bounce cache lines between cores, and this can
drastically decrease performance until the JIT has happened, _if_ it
happens at all.

One way to solve this is the use of split-counters.  On Linux, it is
possible to have a per-CPU split-counter and use a restartable sequence
to increment the local CPU's counter without touching the cache lines
of other CPUs.  The main downside of this approach is the memory
overhead involved and the need to sum all the counters to get a global
value.  However, there is ongoing work on a percpu-tree structure:

  Level 0:  0  1  2  3  4  5  6  7
            | /   | /   | /   | /
  Level 1:  0     1     2     3
            |    /      |    /
  Level 2:  0           1
            |          /
  Level 3:  0

Counters at level 0 are per-CPU counters that get incremented really
fast with rseq(2).  Intermediate levels take the overflow, in batches
of a configurable size, from the level below and are incremented like
regular atomic counters.  For example, with a batch size of 128, you
would need to increment counter 0 of level 0 128 times before counter 0
of level 1 is incremented once.

This seems to allocate even more memory, but it's in fact possible to
use a single byte for every counter except the top one.  In the above
example, the percpu-tree split-counter would use 8 + 4 + 2 one-byte
counters plus an 8-byte top counter, i.e. 8 + 4 + 2 + 8 = 22 bytes.
Furthermore, there is the librseq(3) library in user-space with a
specialized per-CPU memory pool allocator, so it's possible to have a
solution that is very compact in terms of memory footprint.
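To make the idea more concrete, here is a minimal two-level sketch in
plain C11.  It is only an illustration under some assumptions: a fixed
NCPUS, sched_getcpu(3) plus relaxed atomics standing in for a real
rseq(2) critical section, and made-up names (split_counter, BATCH_SIZE).
It shows the batch-flushing logic, not the restartable-sequence fast
path itself:

--8<---------------cut here---------------start------------->8---
#define _GNU_SOURCE             /* for sched_getcpu(3) */
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

#define NCPUS      8            /* fixed CPU count, for the sketch only */
#define BATCH_SIZE 128          /* level-0 overflow batch */

struct split_counter {
  _Atomic uint8_t level0[NCPUS];   /* hot per-CPU one-byte counters */
  _Atomic uint64_t top;            /* batches flushed from level 0 */
};

static void
split_counter_inc (struct split_counter *c)
{
  /* A real implementation would increment the level-0 byte inside an
     rseq(2) critical section so it never leaves the local CPU's cache
     line; sched_getcpu(3) plus a relaxed atomic only approximates it.  */
  unsigned cpu = (unsigned) sched_getcpu () % NCPUS;
  uint8_t old = atomic_fetch_add_explicit (&c->level0[cpu], 1,
                                           memory_order_relaxed);

  /* Every BATCH_SIZE increments of a level-0 byte, flush one whole
     batch to the shared top-level counter.  */
  if (old % BATCH_SIZE == BATCH_SIZE - 1)
    atomic_fetch_add_explicit (&c->top, BATCH_SIZE,
                               memory_order_relaxed);
}

static uint64_t
split_counter_read (struct split_counter *c)
{
  /* Approximate global value: flushed batches plus the residue still
     sitting in the per-CPU bytes.  Good enough for a heuristic.  */
  uint64_t sum = atomic_load_explicit (&c->top, memory_order_relaxed);
  for (unsigned i = 0; i < NCPUS; i++)
    sum += atomic_load_explicit (&c->level0[i], memory_order_relaxed)
           % BATCH_SIZE;
  return sum;
}
--8<---------------cut here---------------end--------------->8---

With rseq(2), the common case is a handful of instructions on the local
CPU's byte, and a shared cache line is only touched once every
BATCH_SIZE increments.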
Combining all of this together, it's possible to have percpu-tree
split-counters for instrumenting function entries (and possibly any
basic block) at a very low memory cost, and yet much faster than the
current increment of a single shared counter.

Thoughts?

Thanks,
Olivier

-- 
Olivier Dion
EfficiOS Inc.
https://www.efficios.com