Here's a very minimal ARM JIT framework. It does work (at least as far as
passing all 10 t/op/basic.t subtests, and running mops.pbc).
As you can see from the patch, all it does is implement the end and noop ops.
Everything else is being called. Interestingly, JITing like this is slower
than computed goto:

computed goto:

  $ ./parrot examples/assembly/mops.pbc
  Iterations:    100000000
  Estimated ops: 200000000
  Elapsed time:  37.209835
  M op/s:        5.374923

no computed goto:

  $ ./parrot -g examples/assembly/mops.pbc
  Iterations:    100000000
  Estimated ops: 200000000
  Elapsed time:  71.245085
  M op/s:        2.807211

JIT:

  $ ./parrot -j examples/assembly/mops.pbc
  Iterations:    100000000
  Estimated ops: 200000000
  Elapsed time:  53.474880
  M op/s:        3.740074

JIT with ARM_K_BUG, to generate code that doesn't tickle the page faulting
related bug in the K StrongARM:

  $ ./parrot -j examples/assembly/mops.pbc
  Iterations:    100000000
  Estimated ops: 200000000
  Elapsed time:  56.142425
  M op/s:        3.562368

I doubt in its current form this is quite ready to go in. Points I'd like to
raise:

0: I've only implemented generator code fully for 1 class of instructions
   (load/store multiple registers), partially for a second (load/store single
   registers), and hard coded the minimal set of other things I needed.
   I'll replace these with fully featured versions, now that I'm happy that
   the concept works.

1: The most optimal code I could think of to call external functions sets
   everything up by loading arguments into registers and function address
   into PC in a single load multiple instruction (plus setting the return
   address in the link register, by using the link register as the base
   register for the load). All that in 1 instruction, plus a second to prime
   LR for the load. (This is why I like it.)
   However, this is the form of instruction that can trigger bugs on the
   (very early) K version StrongARMs (if it page faults midway). Probably the
   rest of the world doesn't have these (unless they have machines dating
   from 1996 or so) but I do have one, so it is an important itch for me.
   ARM_K_BUG is a symbol to define to generate code that cannot cause the
   bug.

2: This code probably is the ARM assembler version of a JAPH, in that I've
   not actually found the need (yet) to use any branch instructions. They do
   exist! It's just that I find I can do it all so far with loads. :-)

3: The code as is issues casting warnings and 3 warnings about unprototyped
   functions (which I think can be static).

4: I'd really like the type of the pointer for the native code to be machine
   chosen. char* isn't the most appropriate type for ARM code - all
   instructions are word sized (32 bits) and must all be word aligned, so I'd
   really like to be fabricating them in ints, and writing to an int* in one
   blat.

5: The symbol TESTING was so that I could #include "jit_emit.h" in a test C
   program to check my generator (by spitting a buffer out into a $file, and
   then disassembling it with objdump -b binary -m arm -D $file).

6: ARMs with separate I and D caches need to sync them before running code
   (else it all goes SEGV shaped with really really weird backtraces).
   I don't think there's any official Linux function wrapper round the ARM
   Linux syscall necessary to do this, hence the function with the inline
   assembler. I'm not sure if there is a better way to do this.
   [optional .s file in the architecture's jit directory, which the jit
   installer can copy if it finds?]

7: Debian define the archname on their perl as "arm", whereas building from
   the source tree gets me armv4l (from uname), hence the substitution for
   armv[34]l? down to arm.
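
Point 4 can be illustrated with a minimal sketch of what fabricating whole instruction words might look like. This is not code from the patch: the names arm_mov, arm_stmdb_wb, emit_word and emit_entry are hypothetical, and the sketch assumes the code buffer is addressed through a uint32_t pointer rather than the char* the patch uses.

```c
#include <stdint.h>

/* Hypothetical helpers, not part of the patch. Each one builds a complete
   32-bit ARM instruction, so the emitter can store it with a single write. */
#define ARM_COND_AL 0xE0000000u

/* mov rd, rm  (data processing, register operand, no shift) */
static uint32_t arm_mov(unsigned rd, unsigned rm)
{
    return ARM_COND_AL | 0x01A00000u | (rd << 12) | rm;
}

/* stmdb rn!, {regmask}  (the same instruction as stmfd rn!, {...}) */
static uint32_t arm_stmdb_wb(unsigned rn, unsigned regmask)
{
    return ARM_COND_AL | 0x09200000u | (rn << 16) | (regmask & 0xFFFFu);
}

/* Store one instruction through a word pointer and advance it. */
static uint32_t *emit_word(uint32_t *pc, uint32_t insn)
{
    *pc = insn;
    return pc + 1;
}

/* The first two instructions of the entry sequence in Parrot_jit_begin:
       mov   ip, sp
       stmfd sp!, {r4, fp, ip, lr, pc}
   Returns the pointer just past the emitted code. */
static uint32_t *emit_entry(uint32_t *pc)
{
    pc = emit_word(pc, arm_mov(12, 13));
    pc = emit_word(pc, arm_stmdb_wb(13, (1u << 4) | (1u << 11) | (1u << 12)
                                        | (1u << 14) | (1u << 15)));
    return pc;
}
```

With these helpers, mov ip, sp comes out as the word 0xE1A0C00D, which on a little-endian machine is the same four bytes that the emit_mov macro in the patch writes one at a time. Whether a word pointer fits the rest of the JIT interface is a separate question that the sketch does not try to answer.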

I do have a machine with an ARM3 here (which I think would be armv2) but it
is 14 years old, and doesn't currently have Linux on it (or a compiler for
RISC OS, and I'm not feeling up to attempting a RISC OS port for parrot just
to experiment with JITs). It's probably quite feasible to make the JIT work
on everything back to the ARM2. (ARM1 was the prototype and I believe was
never used in any hardware available outside Acorn, and IIRC all ARM1
doesn't have is the multiply instruction, so it could be done.)

Apart from all of that, the JIT version 2 looks much more flexible than JIT
version 1 - thanks Daniel.

I'll start writing some real JIT ops over the next few days, although
possibly only for the ops mops and life use :-)
[although I strongly suspect that JITting the ops the regexps compile down to
would be the first real world JIT priority. How fast would perl6 regexps be
with that?]

Oh, and prepare an acceptable version of this patch once people decide what
is acceptable.

Nicholas Clark
-- 
Even better than the real thing: http://nms-cgi.sourceforge.net/

--- /dev/null  Mon Jul 16 22:57:44 2001
+++ jit/arm/core.jit  Mon Jul 29 00:14:30 2002
@@ -0,0 +1,26 @@
+;
+; arm/core.jit
+;
+; $Id: core.jit,v 1.4 2002/05/20 05:32:58 grunblatt Exp $
+;
+
+Parrot_noop {
+    emit_nop(jit_info->native_ptr);
+}
+
+; ldmea fp, {r4, r5, r6, r7, fp, sp, pc}
+; but K bug Grr if I load pc direct.
+
+Parrot_end {
+    jit_info->native_ptr = emit_ldmstm (jit_info->native_ptr,
+                                        cond_AL, is_load, dir_EA, 0, 0,
+                                        REG11_fp,
+                                        reg2mask(4) | reg2mask(REG11_fp)
+                                        | reg2mask(REG13_sp)
+    #ifndef ARM_K_BUG
+                                        | reg2mask(REG15_pc));
+    #else
+                                        | reg2mask(REG14_lr));
+    emit_mov(jit_info->native_ptr, REG15_pc, REG14_lr);
+    #endif
+}
--- /dev/null  Mon Jul 16 22:57:44 2001
+++ jit/arm/jit_emit.h  Mon Jul 29 22:23:37 2002
@@ -0,0 +1,293 @@
+/*
+** jit_emit.h
+**
+** ARM (v3 and later - maybe this can easily be unified to v1)
+**
+** $Id: jit_emit.h,v 1.3 2002/07/04 21:32:12 mrjoltcola Exp $
+**/
+
+/* I'll use mov r0, r0 as my NOP for now. */
+
+typedef enum {
+    cond_EQ = 0x00,
+    cond_NE = 0x10,
+    cond_CS = 0x20,
+    cond_CC = 0x30,
+    cond_MI = 0x40,
+    cond_PL = 0x50,
+    cond_VS = 0x60,
+    cond_VC = 0x70,
+    cond_HI = 0x80,
+    cond_LS = 0x90,
+    cond_GE = 0xA0,
+    cond_LT = 0xB0,
+    cond_GT = 0xC0,
+    cond_LE = 0xD0,
+    cond_AL = 0xE0,
+/*  cond_NV = 0xF0, */
+    cond_HS = 0x20,
+    cond_LO = 0x30
+} cont_t;
+
+typedef enum {
+    REG10_sl = 10,
+    REG11_fp = 11,
+    REG12_ip = 12,
+    REG13_sp = 13,
+    REG14_lr = 14,
+    REG15_pc = 15
+} arm_register_t;
+
+#define emit_nop(pc) emit_mov (pc, 0, 0)
+
+#define emit_mov(pc, dest, src) { \
+    *(pc++) = 0x00 | src; \
+    *(pc++) = dest << 4; \
+    *(pc++) = 0xA0; \
+    *(pc++) = cond_AL | 1; }
+
+#define emit_sub4(pc, dest, src) { \
+    *(pc++) = 0x04; \
+    *(pc++) = dest << 4; \
+    *(pc++) = 0x40 | src; \
+    *(pc++) = cond_AL | 2; }
+
+#define emit_add4(pc, dest, src) { \
+    *(pc++) = 0x04; \
+    *(pc++) = dest << 4; \
+    *(pc++) = 0x80 | src; \
+    *(pc++) = cond_AL | 2; }
+
+#define emit_dcd(pc, word) { \
+    *((int *)pc) = word; \
+    pc+=4; }
+
+#define reg2mask(reg) (1<<(reg))
+
+#define is_store     0x00
+#define is_load      0x10
+#define is_writeback 0x20
+#define is_caret     0x40 /* assembler syntax is ^ - load sets status flags in
+                             USR mode, or load/store use user bank registers
+                             in other mode. IIRC. */
+#define is_byte      0x40
+#define is_pre       0x01 /* pre index addressing. */
+#define is_post      0x00 /* post indexed addressing. ie arithmetic for free */
+
+/* multiple register transfer direction.
+   D = decrease, I = increase
+   A = after, B = before
+   or the stack notation
+   FD = full descending (the usual)
+   ED = empty descending
+   FA = full ascending
+   EA = empty ascending
+   values for stack notation are 0x10 | (ldm type) << 2 | (stm type)
+*/
+typedef enum {
+    dir_DA = 0,
+    dir_IA = 1,
+    dir_DB = 2,
+    dir_IB = 3,
+    dir_FD = 0x10 | (1 << 2) | 2,
+    dir_FA = 0x10 | (0 << 2) | 3,
+    dir_ED = 0x10 | (3 << 2) | 0,
+    dir_EA = 0x10 | (2 << 2) | 1
+} ldm_stm_dir_t;
+
+typedef enum {
+    dir_Up = 0x80,
+    dir_Down = 0x00
+} ldr_str_dir_t;
+
+char *
+emit_ldmstm(char *pc,
+            int cond,
+            int l_s,
+            ldm_stm_dir_t direction,
+            int caret,
+            int writeback,
+            int base,
+            int regmask) {
+    if ((l_s == is_load) && (direction & 0x10))
+        direction >>= 2;
+
+    *(pc++) = regmask;
+    *(pc++) = regmask >> 8;
+    /* bottom bit of direction is the up/down flag. */
+    *(pc++) = ((direction & 1) << 7) | caret | writeback | l_s | base;
+    /* binary 100x is code for stm/ldm. */
+    /* Top bit of direction is pre/post increment flag. */
+    *(pc++) = cond | 0x8 | ((direction >> 1) & 1);
+    return pc;
+}
+
+char *
+emit_ldrstr(char *pc,
+            int cond,
+            int l_s,
+            ldr_str_dir_t direction,
+            int pre,
+            int writeback,
+            int byte,
+            int dest,
+            int base,
+            int offset_type,
+            unsigned int offset) {
+
+    *(pc++) = offset;
+    *(pc++) = ((offset >> 8) & 0xF) | (dest << 4);
+    *(pc++) = direction | byte | writeback | l_s | base;
+    *(pc++) = cond | 0x4 | offset_type | pre;
+    return pc;
+}
+
+char *
+emit_ldrstr_offset (char *pc,
+                    int cond,
+                    int l_s,
+                    int pre,
+                    int writeback,
+                    int byte,
+                    int dest,
+                    int base,
+                    int offset) {
+    ldr_str_dir_t direction = dir_Up;
+#ifndef TESTING
+    if (offset > 4095 || offset < -4095) {
+        internal_exception(JIT_ERROR,
+                           "Unable to generate offsets > 4095\n" );
+    }
+#endif
+    if (offset < 0) {
+        direction = dir_Down;
+        offset = -offset;
+    }
+    return emit_ldrstr(pc, cond, l_s, direction, pre, writeback, byte, dest,
+                       base, 0, offset);
+}
+
+void Parrot_jit_dofixup(Parrot_jit_info *jit_info,
+                        struct Parrot_Interp * interpreter)
+{
+    /* Todo. */
+}
+/* My entry code is to create a stack frame:
+        mov     ip, sp
+        stmfd   sp!, {r4, fp, ip, lr, pc}
+        sub     fp, ip, #4
+   Then store the first parameter (pointer to the interpreter) in r4.
+        mov     r4, r0
+*/
+
+void
+Parrot_jit_begin(Parrot_jit_info *jit_info,
+                 struct Parrot_Interp * interpreter)
+{
+    emit_mov (jit_info->native_ptr, REG12_ip, REG13_sp);
+    jit_info->native_ptr = emit_ldmstm (jit_info->native_ptr,
+                                        cond_AL, is_store, dir_FD, 0,
+                                        is_writeback,
+                                        REG13_sp,
+                                        reg2mask(4) | reg2mask(REG11_fp)
+                                        | reg2mask(REG12_ip)
+                                        | reg2mask(REG14_lr)
+                                        | reg2mask(REG15_pc));
+    emit_sub4 (jit_info->native_ptr, REG11_fp, REG12_ip);
+    emit_mov (jit_info->native_ptr, 4, 0);
+}
+
+/* I'm going to load registers to call functions in general like this:
+        adr     r14, .L1
+        ldmia   r14!, {r0, r1, r2, pc}  ; register list built by jit
+        .L1:    r0 data
+                r1 data
+                r2 data
+                <where ever>    ; address of function.
+        .L2:                    ; next instruction - return point from func.
+
+   # here I'm going to do
+
+        mov     r1, r4          ; current interpreter is arg 1
+        adr     r14, .L1
+        ldmia   r14!, {r0, pc}
+        .L1:    address of current opcode
+                <where ever>    ; address of function for op
+        .L2:                    ; next instruction - return point from func.
+*/
+
+/*
+XXX no.
+need to adr beyond:
+
+        mov     r1, r4          ; current interpreter is arg 1
+        adr     r14, .L1
+        ldmda   r14!, {r0, ip}
+        mov     pc, ip
+        .L1     address of current opcode
+        dcd     <where ever>    ; address of function for op
+        .L2:                    ; next instruction - return point from func.
+*/
+void
+Parrot_jit_normal_op(Parrot_jit_info *jit_info,
+                     struct Parrot_Interp * interpreter)
+{
+    emit_mov (jit_info->native_ptr, 1, 4);
+#ifndef ARM_K_BUG
+    emit_mov (jit_info->native_ptr, REG14_lr, REG15_pc);
+#else
+    emit_add4 (jit_info->native_ptr, REG14_lr, REG15_pc);
+#endif
+    jit_info->native_ptr = emit_ldmstm (jit_info->native_ptr,
+                                        cond_AL, is_load, dir_IA, 0,
+                                        is_writeback,
+                                        REG14_lr,
+                                        reg2mask(0)
+#ifndef ARM_K_BUG
+                                        | reg2mask(REG15_pc)
+#else
+                                        | reg2mask(REG12_ip)
+#endif
+                                        );
+#ifdef ARM_K_BUG
+    emit_mov (jit_info->native_ptr, REG15_pc, REG12_ip);
+#endif
+    emit_dcd (jit_info->native_ptr, (int) jit_info->cur_op);
+    emit_dcd (jit_info->native_ptr,
+              (int) interpreter->op_func_table[*(jit_info->cur_op)]);
+}
+
+/* We get back address of opcode in bytecode.
+   We want address of equivalent bit of jit code, which is stored as an
+   address at the same offset in a jit table. */
+void Parrot_jit_cpcf_op(Parrot_jit_info *jit_info,
+                        struct Parrot_Interp * interpreter)
+{
+    Parrot_jit_normal_op(jit_info, interpreter);
+
+    /* This is effectively the pseudo-opcode ldr - ie load relative to PC.
+       So offset includes pipeline. */
+    jit_info->native_ptr = emit_ldrstr_offset (jit_info->native_ptr, cond_AL,
+                                               is_load, is_pre, 0, 0,
+                                               REG14_lr, REG15_pc, 0);
+    /* ldr pc, [r14, r0] */
+    /* lazy. this is offset type 0, 0x000 which is r0 with zero shift */
+    jit_info->native_ptr = emit_ldrstr (jit_info->native_ptr, cond_AL,
+                                        is_load, dir_Up, is_pre, 0, 0,
+                                        REG15_pc, REG14_lr, 2, 0);
+    /* and this "instruction" is never reached, so we can use it to store
+       the constant that we load into r14 */
+    emit_dcd (jit_info->native_ptr,
+              ((long) jit_info->op_map) -
+              ((long) interpreter->code->byte_code));
+}
+
+/*
+ * Local variables:
+ * c-indentation-style: bsd
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ *
+ * vim: expandtab shiftwidth=4:
+ */
--- jit.c~  Tue Jul 23 19:18:41 2002
+++ jit.c  Mon Jul 29 21:46:44 2002
@@ -128,6 +128,63 @@ optimize_jit(struct Parrot_Interp *inter
     return optimizer;
 }
 
+#ifdef ARM
+static void
+arm_sync_d_i_cache (void *start, void *end) {
+/* Strictly this is only needed for StrongARM and later (not sure about ARM8)
+   because earlier cores don't have separate D and I caches.
+   However there aren't that many ARM7 or earlier devices around that we'll be
+   running on. */
+#ifdef __linux
+#ifdef __GNUC__
+    int result;
+    /* swi call based on code snippet from Russell King. Description
+       verbatim: */
+    /*
+     * Flush a region from virtual address 'r0' to virtual address 'r1'
+     * _inclusive_. There is no alignment requirement on either address;
+     * user space does not need to know the hardware cache layout.
+     *
+     * r2 contains flags. It should ALWAYS be passed as ZERO until it
+     * is defined to be something else. For now we ignore it, but may
+     * the fires of hell burn in your belly if you break this rule. ;)
+     *
+     * (at a later date, we may want to allow this call to not flush
+     * various aspects of the cache. Passing '0' will guarantee that
+     * everything necessary gets flushed to maintain consistency in
+     * the specified region).
+     */
+
+    /* The value of the SWI is actually available in
+       __ARM_NR_cacheflush defined in <asm/unistd.h>, but quite how to
+       get that to interpolate as a number into the ASM string is beyond
+       me. */
+    /* I'm actually passing in exclusive end address, so subtract 1 from
+       it inside the assembler. */
+    __asm__ __volatile__ (
+        "mov     r0, %1\n"
+        "sub     r1, %2, #1\n"
+        "mov     r2, #0\n"
+        "swi     0x9f0002\n"
+        "mov     %0, r0\n"
+        : "=r" (result)
+        : "r" ((long)start), "r" ((long)end)
+        : "r0","r1","r2");
+
+    if (result < 0) {
+        internal_exception(JIT_ERROR,
+                           "Synchronising I and D caches failed with errno=%d\n",
+                           -result);
+    }
+#else
+#error "ARM needs to sync D and I caches, and I don't know how to embed assembler on this C compiler"
+#endif
+#else
+/* Not strictly true - on RISC OS it's OS_SynchroniseCodeAreas */
+#error "ARM needs to sync D and I caches, and I don't know how to on this OS"
+#endif
+}
+#endif
 
 /*
 ** build_asm()
@@ -214,6 +271,9 @@ build_asm(struct Parrot_Interp *interpre
         }
     }
 
+#ifdef ARM
+    arm_sync_d_i_cache (jit_info.arena_start, jit_info.native_ptr);
+#endif
     return (jit_f)jit_info.arena_start;
 }
 
--- config/auto/jit.pl.orig  Sat Jul 13 22:39:40 2002
+++ config/auto/jit.pl  Mon Jul 29 00:08:22 2002
@@ -42,11 +42,14 @@ sub runstep {
     $cpuarch = 'i386';
   }
 
+  $cpuarch =~ s/armv[34]l?/arm/i;
+
   Configure::Data->set(
     archname => $archname,
     cpuarch => $cpuarch,
     osname => $osname,
   );
 
+
   my $jitarchname = "$cpuarch-$osname";
   $jitarchname =~ s/i[456]86/i386/i;
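
On point 6 (is there a better way to sync the caches than inline assembler?), one possible alternative is sketched below. It assumes a toolchain that provides the GCC/Clang builtin __builtin___clear_cache, which is not something the patch relies on. The wrapper name sync_d_i_cache is hypothetical. The builtin takes an inclusive start and an exclusive end, which matches how build_asm() calls arm_sync_d_i_cache in the patch.

```c
#include <stddef.h>

/* Hypothetical stand-in for arm_sync_d_i_cache, assuming the compiler
   provides __builtin___clear_cache. 'start' is inclusive, 'end' is
   exclusive, so no "subtract 1" step is needed here. */
static void sync_d_i_cache(void *start, void *end)
{
#if defined(__GNUC__)
    __builtin___clear_cache((char *)start, (char *)end);
#else
#error "don't know how to sync D and I caches with this compiler"
#endif
}
```

On targets where no flush is needed the builtin may expand to nothing, so the call is harmless to leave in unconditionally. It gives no error return, unlike the swi version in the patch, so the errno report would be lost.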