It's not obsolete: what your link describes is leveraging instruction-level
parallelism (ILP). This is useful when there is a difference between the cycle
latency of the instructions (when there are data dependencies) or their cycle
throughput (when there are none). There is no need to use assembly for ILP; a
simple `a += 1; goto foo;` in C will make use of it.
If we take the LuaJIT 2 example:

```
-- 3. Dispatch next instruction
mov   eax, [esi]        -- load next instruction into eax
movzx ecx, ah           -- predecode A into ecx
movzx ebp, al           -- zero-extend OP into ebp
add   esi, 4            -- increment program counter
shr   eax, 16           -- predecode D
jmp   [ebx + ebp * 4]   -- jump to next instruction via dispatch table
```
`add` and `shr` can be interleaved with the `jmp`, as they don't touch `ebx`
or `ebp`, the registers the indirect jump depends on. Checking [Agner Fog's
instruction tables](https://www.agner.org/optimize/instruction_tables.pdf)
with the Haswell arch as a baseline:
* `jmp [ebx + ebp * 4]` has a throughput of 2 cycles + branch misprediction
cost.
* `add esi, 4` has a throughput of 0.25 cycles.
* `shr eax, 16` has a throughput of 1 cycle.
This means there is still an overhead of 1 cycle + the branch misprediction
cost. Context threading is still relevant here to improve branch prediction.