ARM Jit v2

Nicholas Clark Mon, 29 Jul 2002 14:44:34 -0700

Here's a very minimal ARM JIT framework. It does work (at least as far as
passing all 10 t/op/basic.t subtests, and running mops.pbc).


As you can see from the patch, all it does is implement the end and noop ops
natively. Every other op is handled by calling its existing op function.
Interestingly, JITing like this is slower than computed goto:

computed goto:

$ ./parrot  examples/assembly/mops.pbc 
Iterations:    100000000
Estimated ops: 200000000
Elapsed time:  37.209835
M op/s:        5.374923

no computed goto:

$ ./parrot -g examples/assembly/mops.pbc 
Iterations:    100000000
Estimated ops: 200000000
Elapsed time:  71.245085
M op/s:        2.807211

JIT:

$ ./parrot -j examples/assembly/mops.pbc 
Iterations:    100000000
Estimated ops: 200000000
Elapsed time:  53.474880
M op/s:        3.740074

JIT with ARM_K_BUG, to generate code that doesn't tickle the page faulting
related bug in the K StrongARM:

$ ./parrot -j examples/assembly/mops.pbc 
Iterations:    100000000
Estimated ops: 200000000
Elapsed time:  56.142425
M op/s:        3.562368

I doubt this is quite ready to go in in its current form. Points I'd like to
raise:

0: I've only implemented generator code fully for 1 class of instructions
   (load/store multiple registers), partially for a second (load/store
   single registers), and hard coded the minimal set of other things I
   needed. I'll replace these with fully featured versions, now that I'm
   happy that the concept works.

1: The most optimal code I could think of to call external functions sets
   everything up by loading the arguments into registers and the function
   address into PC with a single load multiple instruction (plus setting
   the return address in the link register, by using the link register as
   the base register for the load). All that in 1 instruction, plus a
   second to prime LR for the load. (This is why I like it.)

   However, this is the form of instruction that can trigger bugs on the
   (very early) K version StrongARMs. (if it page faults midway) Probably
   the rest of the world doesn't have these (unless they have machines
   dating from 1996 or so) but I do have one, so it is an important itch for
   me. ARM_K_BUG is a symbol to define to generate code that cannot cause
   the bug.

2: This code probably is the ARM assembler version of a JAPH, in that I've
   not actually found the need (yet) to use any branch instructions. They
   do exist! It's just that I find I can do it all so far with loads. :-)

3: The code as is issues casting warnings and 3 warnings about unprototyped
   functions. (which I think can be static)

4: I'd really like the type of the pointer for the native code to be
   machine chosen. char* isn't the most appropriate type for ARM code -
   all instructions are word sized (32 bits) and must all be word aligned,
   so I'd really like to be fabricating them in ints, and writing to an int*
   in one blat.

5: The symbol TESTING was so that I could #include "jit_emit.h" in a test C
   program to check my generator (by spitting a buffer out into a $file, and
   then disassembling it with objdump -b binary -m arm -D $file).

6: ARMs with separate I and D caches need to sync them before running code.
   (else it all goes SEGV shaped with really really weird backtraces)
   I don't think there's any official Linux function wrapper round the
   ARM Linux syscall necessary to do this, hence the function with the inline
   assembler. I'm not sure if there is a better way to do this.
   [optional .s file in the architecture's jit directory, which the jit
   installer can copy if it finds?]

7: Debian define the archname on their perl as "arm", whereas building from
   the source tree gets me armv4l (from uname) hence the substitution for
   armv[34]l? down to arm. I do have a machine with an ARM3 here (which I
   think would be armv2) but it is 14 years old, and doesn't currently have
   Linux on it (or a compiler for RISC OS, and I'm not feeling up to
   attempting a RISC OS port for parrot just to experiment with JITs)
   It's probably quite feasible to make the JIT work on everything back to
   the ARM2 (ARM1 was the prototype and I believe was never used in any
   hardware available outside Acorn, and IIRC all ARM1 doesn't have is the
   multiply instruction, so it could be done)
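
To make points 1 and 4 concrete, here is a rough sketch (not code from the
patch) of building the call sequence as whole 32-bit words rather than
bytes. The names encode_mov, encode_ldmstm and emit_call are made up for
illustration, and the sketch assumes a 32-bit target where addresses fit in
a uint32_t. It relies on PC reading as the current instruction's address
plus 8, so that mov lr, pc leaves LR pointing at the first data word after
the ldmia.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names throughout; this is a sketch, not the patch's API. */
enum { REG_LR = 14, REG_PC = 15 };
#define COND_AL 0xEu            /* "always" condition, bits 31..28 */

/* mov rd, rm (register operand, no shift). */
static uint32_t encode_mov(unsigned rd, unsigned rm)
{
    assert(rd < 16 && rm < 16);
    return (COND_AL << 28) | 0x01A00000u | (rd << 12) | rm;
}

/* ldm/stm: P = pre-index, U = up, W = writeback, L = load.
   The S ("caret") bit is left clear. */
static uint32_t encode_ldmstm(int load, int pre, int up, int writeback,
                              unsigned base, unsigned regmask)
{
    assert(base < 16 && regmask != 0 && regmask <= 0xFFFFu);
    return (COND_AL << 28) | (0x4u << 25)
         | ((uint32_t)(pre != 0) << 24)
         | ((uint32_t)(up != 0) << 23)
         | ((uint32_t)(writeback != 0) << 21)
         | ((uint32_t)(load != 0) << 20)
         | (base << 16) | regmask;
}

/* Emit the non-K-bug call sequence, one word per store:
       mov   r1, r4            ; interpreter is the second argument
       mov   lr, pc            ; lr -> first data word below
       ldmia lr!, {r0, pc}     ; load argument, jump; lr -> return point
       <arg0>
       <function address>
   Returns the address of the next free word (the return point). */
static uint32_t *emit_call(uint32_t *out, uint32_t arg0, uint32_t func_addr)
{
    *out++ = encode_mov(1, 4);
    *out++ = encode_mov(REG_LR, REG_PC);
    *out++ = encode_ldmstm(1, 0, 1, 1, REG_LR, (1u << 0) | (1u << REG_PC));
    *out++ = arg0;              /* lowest register loads from lowest address */
    *out++ = func_addr;         /* pc is the highest register, so it is last */
    return out;
}
```

The writeback on LR means that after the load it points just past the two
data words, which is the return point. Dumping such a buffer to a file and
running objdump over it as in point 5 is one way to check the words.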

Apart from all of that, the JIT version 2 looks much more flexible than
JIT version 1 - thanks Daniel.

I'll start writing some real JIT ops over the next few days, although
possibly only for the ops mops and life use :-)
[although I strongly suspect that JITting the ops the regexps compile down
to would be the first real world JIT priority. How fast would perl6 regexps
be with that?]

Oh, and I'll prepare an acceptable version of this patch once people decide
what is acceptable.
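
On the "better way" question in point 6: later GCC releases (and Clang)
document a builtin, __builtin___clear_cache, intended for exactly this job.
Which compiler versions provide it, and what it expands to on a given ARM
target, would need checking, so treat the following as a sketch under that
assumption rather than a replacement for the inline assembler in the patch.
sync_code_region is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: make freshly written code in [start, start + len)
   visible to instruction fetch. Assumes a compiler that provides
   __builtin___clear_cache (documented by GCC and Clang). The end address
   passed to the builtin is exclusive. */
static void sync_code_region(void *start, size_t len)
{
    char *begin = (char *)start;

    assert(start != NULL);
    __builtin___clear_cache(begin, begin + len);
}
```

With something like this, build_asm would call the helper on the generated
arena just before returning the function pointer, with no per-OS #ifdef at
the call site.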

Nicholas Clark
-- 
Even better than the real thing:        http://nms-cgi.sourceforge.net/

--- /dev/null   Mon Jul 16 22:57:44 2001
+++ jit/arm/core.jit    Mon Jul 29 00:14:30 2002
@@ -0,0 +1,26 @@
+;
+; arm/core.jit
+;
+; $Id: core.jit,v 1.4 2002/05/20 05:32:58 grunblatt Exp $
+;
+
+Parrot_noop {
+    emit_nop(jit_info->native_ptr);
+}
+
+; ldmea        fp, {r4, r5, r6, r7, fp, sp, pc}
+; but K bug Grr if I load pc direct.
+
+Parrot_end {
+    jit_info->native_ptr = emit_ldmstm (jit_info->native_ptr,
+                                        cond_AL, is_load, dir_EA, 0, 0,
+                                        REG11_fp,
+                                        reg2mask(4) | reg2mask(REG11_fp)
+                                        | reg2mask(REG13_sp)
+ #ifndef ARM_K_BUG
+                                        | reg2mask(REG15_pc));
+ #else
+                                        | reg2mask(REG14_lr));
+    emit_mov(jit_info->native_ptr, REG15_pc, REG14_lr);
+ #endif
+}
--- /dev/null   Mon Jul 16 22:57:44 2001
+++ jit/arm/jit_emit.h  Mon Jul 29 22:23:37 2002
@@ -0,0 +1,293 @@
+/*
+** jit_emit.h
+** 
+** ARM (v3 and later - maybe this can easily be unified to v1)
+**
+** $Id: jit_emit.h,v 1.3 2002/07/04 21:32:12 mrjoltcola Exp $
+**/
+
+/* I'll use mov r0, r0 as my NOP for now.  */
+
+typedef enum {
+    cond_EQ = 0x00,
+    cond_NE = 0x10,
+    cond_CS = 0x20,
+    cond_CC = 0x30,
+    cond_MI = 0x40,
+    cond_PL = 0x50,
+    cond_VS = 0x60,
+    cond_VC = 0x70,
+    cond_HI = 0x80,
+    cond_LS = 0x90,
+    cond_GE = 0xA0,
+    cond_LT = 0xB0,
+    cond_GT = 0xC0,
+    cond_LE = 0xD0,
+    cond_AL = 0xE0,
+/*    cond_NV = 0xF0, */
+    cond_HS = 0x20,
+    cond_LO = 0x30
+} cont_t;
+
+typedef enum {
+    REG10_sl = 10,
+    REG11_fp = 11,
+    REG12_ip = 12,
+    REG13_sp = 13,
+    REG14_lr = 14,
+    REG15_pc = 15
+} arm_register_t;
+
+#define emit_nop(pc) emit_mov (pc, 0, 0)
+
+#define emit_mov(pc, dest, src)  { \
+    *(pc++) = 0x00 | src; \
+    *(pc++) = dest << 4; \
+    *(pc++) = 0xA0; \
+    *(pc++) = cond_AL | 1; }
+
+#define emit_sub4(pc, dest, src)  { \
+    *(pc++) = 0x04; \
+    *(pc++) = dest << 4; \
+    *(pc++) = 0x40 | src; \
+    *(pc++) = cond_AL | 2; }
+
+#define emit_add4(pc, dest, src)  { \
+    *(pc++) = 0x04; \
+    *(pc++) = dest << 4; \
+    *(pc++) = 0x80 | src; \
+    *(pc++) = cond_AL | 2; }
+
+#define emit_dcd(pc, word)  { \
+    *((int *)pc) = word; \
+    pc+=4; }
+
+#define reg2mask(reg) (1<<(reg))
+
+#define is_store 0x00
+#define is_load      0x10
+#define is_writeback 0x20
+#define is_caret     0x40 /* assembler syntax is ^ - load sets status flags in
+                             USR mode, or load/store use user bank registers
+                             in other mode. IIRC.  */
+#define is_byte      0x40
+#define is_pre       0x01 /* pre index addressing.  */
+#define is_post      0x00 /* post indexed addressing. ie arithmetic for free  */
+
+/* multiple register transfer direction.
+   D = decrease, I = increase
+   A = after, B = before
+   or the stack notation
+   FD = full descending (the usual)
+   ED = empty descending
+   FA = full ascending
+   EA = empty ascending
+   values for stack notation are 0x10 | (ldm type) << 2 | (stm type)
+*/
+typedef enum {
+    dir_DA = 0,
+    dir_IA = 1,
+    dir_DB = 2,
+    dir_IB = 3,
+    dir_FD = 0x10 | (1 << 2) | 2,
+    dir_FA = 0x10 | (0 << 2) | 3,
+    dir_ED = 0x10 | (3 << 2) | 0,
+    dir_EA = 0x10 | (2 << 2) | 1
+} ldm_stm_dir_t;
+
+typedef enum {
+    dir_Up = 0x80,
+    dir_Down = 0x00
+} ldr_str_dir_t;
+
+char *
+emit_ldmstm(char *pc,
+            int cond,
+            int l_s,
+            ldm_stm_dir_t direction,
+            int caret,
+            int writeback,
+            int base,
+            int regmask) {
+    if ((l_s == is_load) && (direction & 0x10))
+        direction >>= 2;
+
+    *(pc++) = regmask;
+    *(pc++) = regmask >> 8;
+    /* bottom bit of direction is the up/down flag.  */
+    *(pc++) = ((direction & 1) << 7) | caret | writeback | l_s | base;
+    /* binary 100x is code for stm/ldm.  */
+    /* Top bit of direction is pre/post increment flag.  */
+    *(pc++) = cond | 0x8 | ((direction >> 1) & 1);
+    return pc;
+}
+
+char *
+emit_ldrstr(char *pc,
+            int cond,
+            int l_s,
+            ldr_str_dir_t direction,
+            int pre,
+            int writeback,
+            int byte, 
+            int dest,
+            int base,
+            int offset_type,
+            unsigned int offset) {
+
+    *(pc++) = offset;
+    *(pc++) = ((offset >> 8) & 0xF) | (dest << 4);
+    *(pc++) = direction | byte | writeback | l_s | base;
+    *(pc++) = cond | 0x4 | offset_type | pre;
+    return pc;
+}
+
+char *
+emit_ldrstr_offset (char *pc,
+                    int cond,
+                    int l_s,
+                    int pre,
+                    int writeback,
+                    int byte,
+                    int dest,
+                    int base,
+                    int offset) {
+    ldr_str_dir_t direction = dir_Up;
+#ifndef TESTING
+    if (offset > 4095 || offset < -4095) {
+        internal_exception(JIT_ERROR,
+                           "Unable to generate offsets > 4095\n" );
+    }
+#endif
+    if (offset < 0) {
+        direction = dir_Down;
+        offset = -offset;
+    }
+    return emit_ldrstr(pc, cond, l_s, direction, pre, writeback, byte, dest,
+                       base, 0, offset);
+}
+
+void Parrot_jit_dofixup(Parrot_jit_info *jit_info,
+                        struct Parrot_Interp * interpreter)
+{
+    /* Todo.  */
+}
+/* My entry code creates a stack frame:
+       mov     ip, sp
+       stmfd   sp!, {r4, fp, ip, lr, pc}
+       sub     fp, ip, #4
+   Then store the first parameter (pointer to the interpreter) in r4.
+       mov     r4, r0
+*/
+
+void
+Parrot_jit_begin(Parrot_jit_info *jit_info,
+                 struct Parrot_Interp * interpreter)
+{
+    emit_mov (jit_info->native_ptr, REG12_ip, REG13_sp);
+    jit_info->native_ptr = emit_ldmstm (jit_info->native_ptr,
+                                        cond_AL, is_store, dir_FD, 0,
+                                        is_writeback,
+                                        REG13_sp,
+                                        reg2mask(4) | reg2mask(REG11_fp)
+                                        | reg2mask(REG12_ip)
+                                        | reg2mask(REG14_lr)
+                                        | reg2mask(REG15_pc));
+    emit_sub4 (jit_info->native_ptr, REG11_fp, REG12_ip);
+    emit_mov (jit_info->native_ptr, 4, 0);
+}
+
+/* I'm going to load registers to call functions in general like this:
+    adr     r14,  .L1
+    ldmia   r14!,  {r0, r1, r2, pc} ; register list built by jit
+    .L1:    r0 data
+            r1 data
+            r2 data
+           <where ever>        ; address of function.
+    .L2:                      ; next instruction - return point from func.
+
+    # here I'm going to do 
+
+    mov            r1, r4      ; current interpreter is arg 1
+    adr     r14,  .L1
+    ldmia   r14!,  {r0, pc}
+    .L1:    address of current opcode
+           <where ever>        ; address of function for op
+    .L2:                      ; next instruction - return point from func.
+*/
+
+/*
+XXX no.
+need to adr beyond:
+
+    mov            r1, r4      ; current interpreter is arg 1
+    adr     r14,  .L1
+    ldmda   r14!,  {r0, ip}
+    mov     pc, ip
+    .L1     address of current opcode
+    dcd     <where ever>      ; address of function for op
+    .L2:                      ; next instruction - return point from func.
+*/
+void
+Parrot_jit_normal_op(Parrot_jit_info *jit_info,
+                     struct Parrot_Interp * interpreter)
+{
+    emit_mov (jit_info->native_ptr, 1, 4);
+#ifndef ARM_K_BUG
+    emit_mov (jit_info->native_ptr, REG14_lr, REG15_pc);
+#else
+    emit_add4 (jit_info->native_ptr, REG14_lr, REG15_pc);
+#endif
+    jit_info->native_ptr = emit_ldmstm (jit_info->native_ptr,
+                                        cond_AL, is_load, dir_IA, 0,
+                                        is_writeback,
+                                        REG14_lr,
+                                        reg2mask(0)
+#ifndef ARM_K_BUG
+                                        | reg2mask(REG15_pc)
+#else
+                                        | reg2mask(REG12_ip)
+#endif
+        );
+#ifdef ARM_K_BUG
+    emit_mov (jit_info->native_ptr, REG15_pc, REG12_ip);
+#endif
+    emit_dcd (jit_info->native_ptr, (int) jit_info->cur_op);
+    emit_dcd (jit_info->native_ptr,
+              (int) interpreter->op_func_table[*(jit_info->cur_op)]);
+}
+
+/* We get back address of opcode in bytecode.
+   We want address of equivalent bit of jit code, which is stored as an
+   address at the same offset in a jit table. */
+void Parrot_jit_cpcf_op(Parrot_jit_info *jit_info,
+                        struct Parrot_Interp * interpreter)
+{
+    Parrot_jit_normal_op(jit_info, interpreter);
+
+    /* This is effectively the pseudo-opcode ldr - ie load relative to PC.
+       So offset includes pipeline.  */
+    jit_info->native_ptr = emit_ldrstr_offset (jit_info->native_ptr, cond_AL,
+                                               is_load, is_pre, 0, 0,
+                                               REG14_lr, REG15_pc, 0);
+    /* ldr pc, [r14, r0]  */
+    /* lazy. this is offset type 0, 0x000 which is r0 with zero shift  */
+    jit_info->native_ptr = emit_ldrstr (jit_info->native_ptr, cond_AL,
+                                        is_load, dir_Up, is_pre, 0, 0,
+                                        REG15_pc, REG14_lr, 2, 0);
+    /* and this "instruction" is never reached, so we can use it to store
+       the constant that we load into r14  */
+    emit_dcd (jit_info->native_ptr,
+              ((long) jit_info->op_map) -
+              ((long) interpreter->code->byte_code));
+}
+
+/*
+ * Local variables:
+ * c-indentation-style: bsd
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil 
+ * End:
+ *
+ * vim: expandtab shiftwidth=4:
+ */
--- jit.c~      Tue Jul 23 19:18:41 2002
+++ jit.c       Mon Jul 29 21:46:44 2002
@@ -128,6 +128,63 @@ optimize_jit(struct Parrot_Interp *inter
     return optimizer;
 }
 
+#ifdef ARM
+static void
+arm_sync_d_i_cache (void *start, void *end) {
+/* Strictly this is only needed for StrongARM and later (not sure about ARM8)
+   because earlier cores don't have separate D and I caches.
+   However there aren't that many ARM7 or earlier devices around that we'll be
+   running on.  */
+#ifdef __linux
+#ifdef __GNUC__
+    int result;
+    /* swi call based on code snippet from Russell King.  Description
+       verbatim:  */
+    /*
+     * Flush a region from virtual address 'r0' to virtual address 'r1'
+     * _inclusive_.  There is no alignment requirement on either address;   
+     * user space does not need to know the hardware cache layout.
+     *
+     * r2 contains flags.  It should ALWAYS be passed as ZERO until it
+     * is defined to be something else.  For now we ignore it, but may
+     * the fires of hell burn in your belly if you break this rule. ;)
+     *
+     * (at a later date, we may want to allow this call to not flush
+     * various aspects of the cache.  Passing '0' will guarantee that
+     * everything necessary gets flushed to maintain consistency in
+     * the specified region).
+     */
+
+    /* The value of the SWI is actually available as
+       __ARM_NR_cacheflush defined in <asm/unistd.h>, but quite how to
+       get that to interpolate as a number into the ASM string is beyond
+       me.  */
+    /* I'm actually passing in exclusive end address, so subtract 1 from
+       it inside the assembler.  */
+    __asm__ __volatile__ (
+        "mov     r0, %1\n"
+        "sub     r1, %2, #1\n"
+        "mov     r2, #0\n"
+        "swi     0x9f0002\n"
+        "mov     %0, r0\n"
+        : "=r" (result)
+        : "r" ((long)start), "r" ((long)end)
+        : "r0","r1","r2");
+
+    if (result < 0) {
+        internal_exception(JIT_ERROR,
+                           "Synchronising I and D caches failed with errno=%d\n",
+                           -result);
+    }
+#else
+#error "ARM needs to sync D and I caches, and I don't know how to embed assembler on this C compiler"
+#endif
+#else
+/* Not strictly true - on RISC OS it's OS_SynchroniseCodeAreas  */
+#error "ARM needs to sync D and I caches, and I don't know how to on this OS"
+#endif
+}
+#endif
 
 /*
 ** build_asm()
@@ -214,6 +271,9 @@ build_asm(struct Parrot_Interp *interpre
         }
     }
 
+#ifdef ARM
+    arm_sync_d_i_cache (jit_info.arena_start, jit_info.native_ptr);
+#endif
     return (jit_f)jit_info.arena_start;
 }
 
--- config/auto/jit.pl.orig     Sat Jul 13 22:39:40 2002
+++ config/auto/jit.pl  Mon Jul 29 00:08:22 2002
@@ -42,11 +42,14 @@ sub runstep {
     $cpuarch = 'i386';
   }
 
+  $cpuarch               =~ s/armv[34]l?/arm/i;
+
   Configure::Data->set(
     archname    => $archname,
     cpuarch     => $cpuarch,
     osname      => $osname,
   );
+
 
   my $jitarchname              =  "$cpuarch-$osname";
   $jitarchname                 =~ s/i[456]86/i386/i;
