This is really incredible. I've removed all of the D code, and I can still
reproduce the behaviour. If you uncomment the jz line, it won't happen.
The 'int 3' line is just a breakpoint, to prove that the branch is never taken.

void main()
{
    int ctr; // also works with __gshared int ctr;
    asm {
        mov EAX, 2;
        and EAX, 0xFF;
        mov ctr, EAX;
Lxx:                      // loop head; label position assumed
//        jz was_zero;
        dec int ptr ctr;
        jnz Lxx;
        jmp done;
was_zero:                 // label position assumed
        int 3;
done:   ;
    }
}

Wild speculation: there's a bug in CPUID 2: it's not clearing the loopback
buffer. The loop is executed as if 'ctr' were still zero. This means that it
loops 2^^32 times. This is long enough that Windows does a task switch.
On Core 2, the loopback buffer was between the predecoders and the decoders, but
on Core i7, they moved it after the decoders.
I tried to confirm this by extending the size of the loop, by padding with nops.
When the loop is 63 bytes of code (56 nops), it fails. Once I add a 57th nop,
it stops failing.
These aren't the numbers I expected -- the loopback buffer is 256 bytes on the
Core i7. However, I have a Core i3, so perhaps it's different, or it may be a
decoding bug. Regardless, this looks very much like a CPU erratum.
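
For reference, the 2^^32 figure is just 32-bit wraparound: if the loop really
does start with the counter reading zero, the first dec wraps it to 0xFFFFFFFF
and jnz keeps looping until it comes back round to zero. A minimal sketch of
that arithmetic in plain D, with no asm (iterationsUntilZero is a made-up
helper for illustration, not part of the test case):

```d
// Hypothetical helper: how many times a 'dec ctr; jnz loop' pair executes
// before ctr reaches zero, for a 32-bit counter starting at 'start'.
ulong iterationsUntilZero(uint start)
{
    // The body always runs at least once. Unsigned subtraction wraps, so
    // start == 0 gives 0xFFFF_FFFF remaining decrements after the first.
    return cast(ulong)(start - 1) + 1;
}

unittest
{
    assert(iterationsUntilZero(2) == 2);        // the intended case
    assert(iterationsUntilZero(0) == 2UL ^^ 32); // counter seen as zero
}
```

So a counter of 2 should give two iterations, while a counter seen as 0 gives
the full 2^^32.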

My guess is that it's affecting the loop predictor, which isn't the branch

