On Fri, Dec 21, 2001 at 12:03:51AM +0000, Tom Hughes wrote: > In message <[EMAIL PROTECTED]> > Dan Sugalski <[EMAIL PROTECTED]> wrote: > > > To run a program with the JIT, pass test_parrot the -j flag and watch it > > scream. Well, scream if you're on x86 Linux or BSD (I get a speedup on > > mops.pbc of 35x) but it's a darned good place to start. > > It does seem to be quite impressively fast. Faster even than the > compiled version of mops on my machine... > > It looks like it is going to need some work before it can work for > other instruction sets though, at least for RISC systems where the > operands are typically encoded with the opcode as part of a single > word and the range of immediate constants is often restricted. > > I'm thinking it will need some way of indicating field widths and > shifts for the operands and opcode so they can be merged into an > instruction word and also some way of handling a constant pool so > that arbitrary addresses can be loaded using PC relative loads.
I've been thinking about this, and worrying somewhat if I'm in danger of trying to make a something run before it can walk, but in trying to think how to make an ARM jit. I don't know how typical ARM is of RISC CPUs, but it's probably enough food for thought. I may have made some minor technical errors in this: All our JIT code would be running in user mode on ARM, so we only get to play with the user mode bank of 16 "general purpose" registers, named r0-r15 "general purpose" in quotes because r15 is the program counter, one (now always r13) is the stack pointer and the ABI specifies uses for some others. I assume that we're not generating shared library code (if we are, the ABI claims r9 from us) The ABI needs 6 registers to call another function, a called function has the first 4 integer arguments passed in using r0-r3, and returns a result in r0. However in the body of a function we can arrange to use 2 of those reserved 6, so we have 12 registers for our use, of which r4-r9 will be preserved across function calls. Memory loads and stores have to be done relative to a base register (which can be r15, the PC), +- a constant (in the range 0-4095) or another register (optionally shifted by a constant). There is no direct way to load a register with an arbitrary 32 bit constant. Registers can be loaded with an immediate constant stored in 12 bits as <8 bit value> rotate right 2 * <4 bit value> (So you can load 0xFF in one instruction (4 bytes), for something like 0xFFF it's best to express it in 2 instructions (8 bytes) as load 0xFF, add 0xF00, and for complex values stuff it in a data word within 4092 bytes of the PC and load it PC relative, which is also 8 bytes, but possibly slower) The upshot of this is that loading integer register values by substituting addresses into template code is not an easy way to go. Much better would be to load up r4 to r7 with the base addresses of the integer, floating point, string and PMC registers and do a "load integer register 16" into r0 as ldr r0, [r4, #60] @ parrot registers start at 1, don't they? [maybe I am already in danger of trying to run before I can walk, as I suspect that the address substitution can be shoe-horned as ldr r0, [pc, #-4] @ oops - may have got the pipeling wrong here b .L1 .word XXXX @ substitute your address here ..L1 ldr r0, [r0] but that's something like 5 CPU cycles rather than 2.] I'm in danger of wanting to run before I can walk. And maybe I should shut up for now as I'm describing a possible next generation JIT, rather than this one: The reason I'd feel I'd naturally want to define Parrot_set_i_i not as 1 code snippet, but as load parrot integer register (from memory) into CPU register n # do nothing here store CPU register n to parrot integer register (to memory) is because even with a more complex JIT syntax that lets me translate set I2, I4 as ldr r0, [r4, #12] str r0, [r4, #4] with r4-r7 loaded with those base addresses, I'm wasting most of the ARM CPU: I've got r1-r3, r8, r9, r12 and r14 that I'm not even using. I've got to be careful in what I'm suggesting here: I'm *not* suggesting that I (or anyone else) immediately writes a JIT that maps CPU registers to parrot registers and attempts to keep values in the CPU where possible. I *am* suggesting that if Parrot_set_i_i is defined as input: load parrot integer register (from memory) into CPU register n body: output: store CPU register n to parrot integer register (to memory) where the JIT provides that load and store code (rather than each op) then the work that goes into writing body code for the ops is still useful and usable with a second (third?) generation JIT that does know how to combine 2 or more adjacent Parrot ops. Possibly also useful thoughts: For things like generating constants for ARM, it would be useful to be able to specify alternative code generation snippets, with some sort of constraint on when they should be used. So the first to be tried would be mov r0, #XXX with the constraint that XXX must be expressible as <8 bit value> rotate right 2 * <4 bit value> then mvn r0, #XXX where XXX must be logical not of <8 bit value> rotate right 2 * <4 bit value> and then a third that copes with arbitrary 32 bit values. Note that currently it's the assembler at assemble time that works out how to decompose XXX into the 8 bit constant and 4 bit shift. Our JIT code generator is going to have to be able to do this. The constraints might get quite complicated - it may be useful if a snippet of parrot assembler could be used to express them. The code generation itself could be quite complex - it might be useful if as an alternative to the existing fast fixed system one could also define for any parrot op a snippet of parrot assembler to call to try to generate code. I presume that we can JIT the code for the JIT (running the JITting code as interpreted code while building the JIT to avoid infinite recursion), and that if that's still not fast enough we can write custom parrot ops in C for the JIT. (not sure what API would be needed, sorry, but I suspect it's going to be important to distinguish between an empty string result (0 bytes of code will do the job) and an undef result [constraint failure, try the next alternative]. And it may be useful to also return a "quality" result to allow a -O2 type optimiser to try different alternatives and then use the best) In the few times I've written inline assembler for gcc, the three things I've missed most are 1: being able to ask for temporary registers (that are neither input nor output, and don't need to be explicitly initialised to zero or used in a dummy way to avoid compiler warnings) 2: being able to specify constraints about which input registers can't be the same as which output registers. 3: being able to specify alternatives, when I knew I could code the same assembler in 2 ways depending on whether (say) the compiler's optimiser wanted input and output in the same CPU register, or whether it would benefit from them being different, and so getting a "free" copy into another register. Nicholas Clark