I can't find the source right now, and I might be mistaken, but IIRC Mike Pall (of LuaJIT fame) managed to make this call/ret trickery obsolete in the LuaJIT 2 interpreter by pipelining the dispatch, so that the decoding of the next instruction overlaps with the fetching of the next address, along the lines described in [https://nominolo.blogspot.com/2012/07/implementing-fast-interpreters.html](https://nominolo.blogspot.com/2012/07/implementing-fast-interpreters.html).
In my opinion, if you have to resort to machine code anyway (which the context threading solution does), then you might as well spend a little more effort and overlap the operations; either approach will be obsolete within 10 years thanks to architecture differences. Their solution does port more easily to more architectures, I'll give them that, but the question is "how fast can you go", not "how fast can you go in only 400 lines of implementation".
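
To make the overlap idea a bit more concrete, here is a rough sketch in C. This is not LuaJIT's code (its interpreter is hand-written assembly), and everything in it is made up for illustration: the opcodes, the 8/8/8/8 instruction encoding, and the `run` function. It relies on the GCC/Clang "labels as values" extension. The point is only the shape of `DISPATCH()`: the handler address is loaded first, and the operand decoding sits between that load and the indirect jump, so an out-of-order CPU can do both at once.

```c
#include <stdint.h>

/* Hypothetical opcodes for a toy register VM (not LuaJIT's bytecode). */
enum { OP_LOADI, OP_ADD, OP_MUL, OP_HALT };

/* Assumed encoding: opcode in bits 0-7, operands A, B, C in the next three bytes. */
#define INSN(op, a, b, c) \
    ((uint32_t)(op) | (uint32_t)(a) << 8 | (uint32_t)(b) << 16 | (uint32_t)(c) << 24)

/* Runs until OP_HALT and returns regs[A] of the halt instruction.
   No validation: opcodes must be in range and the program must end with OP_HALT. */
int64_t run(const uint32_t *pc, int64_t *regs)
{
    /* Jump table of handler addresses (GCC/Clang extension). */
    static void *const handlers[] = { &&do_loadi, &&do_add, &&do_mul, &&do_halt };
    uint32_t insn;
    unsigned a, b, c;

/* Fetch the next instruction, start loading its handler address, then decode
   the operands while that load is in flight, and only then jump. */
#define DISPATCH()                                  \
    do {                                            \
        insn = *pc++;                               \
        void *target = handlers[insn & 0xff];       \
        a = (insn >> 8) & 0xff;                     \
        b = (insn >> 16) & 0xff;                    \
        c = (insn >> 24) & 0xff;                    \
        goto *target;                               \
    } while (0)

    DISPATCH();

do_loadi:                       /* regs[A] = 16-bit immediate built from B and C */
    regs[a] = (int64_t)(b | (c << 8));
    DISPATCH();
do_add:                         /* regs[A] = regs[B] + regs[C] */
    regs[a] = regs[b] + regs[c];
    DISPATCH();
do_mul:                         /* regs[A] = regs[B] * regs[C] */
    regs[a] = regs[b] * regs[c];
    DISPATCH();
do_halt:
    return regs[a];

#undef DISPATCH
}
```

A compiler is free to reorder these statements, so in C this is a hint at best; getting the ordering you actually want is a large part of why you end up writing the interpreter in assembly in the first place.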
