It's not obsolete; what your link describes is leveraging 
instruction-level parallelism (ILP). This helps whenever there is slack 
between instructions' cycle latency (when data dependencies exist) or 
cycle throughput (when they don't). There is no need to use assembly to 
get ILP: a simple a += 1; goto foo; in C will make use of it.

If we take the LuaJIT 2 example:

    -- 3. Dispatch next instruction
    mov eax, [esi]       -- load next instruction into eax
    movzx ecx, ah        -- predecode operand A into ecx
    movzx ebp, al        -- zero-extend opcode OP into ebp
    add esi, 4           -- increment the program counter
    shr eax, 16          -- predecode operand D
    jmp [ebx + ebp * 4]  -- jump to next handler via the dispatch table

The add and shr can be interleaved with the jmp, since the jump depends 
only on the ebx and ebp registers, which neither of them touches. 
Checking the [Haswell arch as a 
baseline](https://www.agner.org/optimize/instruction_tables.pdf):

  * jmp [ebx + ebp * 4] has a throughput of 2 cycles, plus the branch 
misprediction cost.
  * add esi, 4 has a throughput of 0.25 cycles.
  * shr eax, 16 has a throughput of 1 cycle.



This means there is still an overhead of 1 cycle plus the branch 
misprediction cost per dispatched bytecode. Context threading is still 
relevant here to improve branch prediction.
