Re: ARM Jit v2

Nicholas Clark Wed, 31 Jul 2002 15:46:53 -0700

On Tue, Jul 30, 2002 at 10:20:51PM +0100, Nicholas Clark wrote:
> On Mon, Jul 29, 2002 at 10:34:00PM -0300, Daniel Grunblatt wrote:
> > On Mon, 29 Jul 2002, Nicholas Clark wrote:
> > 
> > > Here's a very minimal ARM jit framework. It does work (at least as far as
> > > passing all 10 t/op/basic.t subtests, and running mops.pbc)
> > 
> > Cool, I have also been playing with ARM but your approach is in better
> > shape. (I'll send you a copy of what I got here anyway because it's bit
> > more documented and you might want to merge it).
> 
> It's very documented, and I did merge it. Thanks. Expect to recognise large
> chunks of it in a day or two when I get to a suitable point to submit a
> better patch.


Here goes. This *isn't* functional - it's the least amount of work I could
get away with (before midnight) that gets the inner loop of mops.pasm JITted.

including this judicious bit of cheating:

--- ./examples/assembly/mops.pasm~      Wed Jan  2 13:48:40 2002
+++ ./examples/assembly/mops.pasm       Wed Jul 31 23:29:46 2002
@@ -36,7 +36,8 @@ DONE:   time   N5
         print  N2
         print  "\n"
 
-        if     I4, BUG
+        set    N1, I4
+        if     N1, BUG
 
         set    N1, I5
         div    N1, N1, N2

because I need if_i_ic JITted but mops only needs backwards jumps (which are
easy, and don't require me to write the fixup routine yet)

$ ./parrot  examples/assembly/mops.pbc 
Iterations:    100000000
Estimated ops: 200000000
Elapsed time:  37.217333
M op/s:        5.373840

$ ./parrot -j examples/assembly/mops.pbc 
Iterations:    100000000
Estimated ops: 200000000
Elapsed time:  4.963865
M op/s:        40.291184
$ ./parrot -j examples/assembly/mops.pbc 
Iterations:    100000000
Estimated ops: 200000000
Elapsed time:  5.018550
M op/s:        39.852148
$ ./parrot -j examples/assembly/mops.pbc 
Iterations:    100000000
Estimated ops: 200000000
Elapsed time:  5.002693
M op/s:        39.978467

This is about right. 202MHz chip, 4 instructions, 5 cycles for the sub_i_i_i,
3 instructions, 5 cycles for the if_i_ic (including the pipeline stall for
the branch)

TODO, from memory

1: fixups, to allow forward branches
2: code to emit MUL and MLA instructions
3: load integer constants
   (which in turn means writing a routine to work out what values can be
    expressed as (8 bit val) rotated by (2 * n), which others can be
    represented by the logical not of that, and what to do with the rest.
    This isn't that hard)
4: remove the compiler warnings

Nicholas Clark
-- 
Even better than the real thing:        http://nms-cgi.sourceforge.net/

--- /dev/null   Mon Jul 16 22:57:44 2001
+++ jit/arm/core.jit    Wed Jul 31 23:35:52 2002
@@ -0,0 +1,71 @@
+;
+; arm/core.jit
+;
+; $Id: core.jit,v 1.4 2002/05/20 05:32:58 grunblatt Exp $
+;
+
+Parrot_noop {
+    jit_info->native_ptr = emit_nop(jit_info->native_ptr);
+}
+
+; ldmea        fp, {r4, r5, r6, r7, fp, sp, pc
+; but K bug Grr if I load pc direct.
+
+Parrot_end {
+ #ifndef ARM_K_BUG
+    jit_info->native_ptr = emit_ldmstm (jit_info->native_ptr,
+                                        cond_AL, is_load, dir_EA, no_writeback,
+                                        REG11_fp,
+                                        reg2mask(4) | reg2mask(REG11_fp)
+                                        | reg2mask(REG13_sp)
+                                        | reg2mask(REG15_pc));
+ #else
+    jit_info->native_ptr = emit_ldmstm (jit_info->native_ptr,
+                                        cond_AL, is_load, dir_EA, no_writeback,
+                                        REG11_fp,
+                                        reg2mask(4) | reg2mask(REG11_fp)
+                                        | reg2mask(REG13_sp)
+                                        | reg2mask(REG14_lr));
+    jit_info->native_ptr = emit_mov(jit_info->native_ptr, REG15_pc, REG14_lr);
+ #endif
+}
+
+; This shows why it would be nice in the future to have a way to have ops
+;  broken into 1 to 3 of:
+;
+; -1) get values from parrot registers into CPU registers
+;  0) do stuff
+; +1) write values back to parrot registers
+;
+; that way, a JIT optimiser could punt -1 and +1 outside loops leaving
+; intermediate values in CPU registers. It could collate -1 and +1 [leaving
+; nothing :-)] and choose how to maximise use of as many real CPU registers as
+; possible.
+
+Parrot_set_i_i {
+    Parrot_jit_int_load(jit_info, interpreter, 2, r0);
+    Parrot_jit_int_store(jit_info, interpreter, 1, r0);
+}
+
+Parrot_add_i_i_i {
+    Parrot_jit_int_load(jit_info, interpreter, 2, r0);
+    Parrot_jit_int_load(jit_info, interpreter, 3, r1);
+    jit_info->native_ptr = emit_arith_reg (jit_info->native_ptr, cond_AL,
+                                                 ADD, 0, r2, r0, r1);
+    Parrot_jit_int_store(jit_info, interpreter, 1, r2);
+}
+
+Parrot_sub_i_i_i {
+    Parrot_jit_int_load(jit_info, interpreter, 2, r0);
+    Parrot_jit_int_load(jit_info, interpreter, 3, r1);
+    jit_info->native_ptr = emit_arith_reg (jit_info->native_ptr, cond_AL,
+                                                 SUB, 0, r2, r0, r1);
+    Parrot_jit_int_store(jit_info, interpreter, 1, r2);
+}
+
+Parrot_if_i_ic {
+    Parrot_jit_int_load(jit_info, interpreter, 1, r0);
+    jit_info->native_ptr = emit_arith_immediate (jit_info->native_ptr, cond_AL,
+                                                 CMP, 0, 0, r0, 0, 0);
+    emit_jump_to_op (jit_info, cond_NE, 0, *INT_CONST[2]);
+}
--- /dev/null   Mon Jul 16 22:57:44 2001
+++ jit/arm/jit_emit.h  Wed Jul 31 23:42:52 2002
@@ -0,0 +1,653 @@
+/*
+ * jit_emit.h
+ * 
+ * ARM (I think this is all ARM2 or later, although it is APCS-32)
+ *
+ * $Id: $
+ */
+
+/*  Registers
+ *
+ *  r0  Argument/result/scratch register 0.
+ *  r1  Argument/result/scratch register 1.
+ *  r2  Argument/result/scratch register 2.
+ *  r3  Argument/result/scratch register 3.
+ *  r4  Variable register 1.
+ *  r5  Variable register 2.
+ *  r6  Variable register 3.
+ *  r7  Variable register 4.
+ *  r8  Variable register 5.
+ *  r9  ARM State variable register 6. Static Base in PID, re-entrant
+ *      shared-library variants.
+ *  r10 ARM State variable register 7. Stack limit pointer in stack-checked 
+ *      variants.
+ *  r11 ARM State variable register 8. ARM state frame pointer.
+ *  r12 The Intra-Procedure call scratch register.
+ *  r13 The Stack Pointer.
+ *  r14 The Link Register.
+ *  r15 The Program Counter.
+ *
+ * r0-r3 are used to pass in first 4 arguments, and are not preserved by a
+ * function. Results (that would fit) are returned in r0
+ * Other registers are preserved across calls, although (by implication) r14
+ * and r15 are used by the call process. I don't think that it is mandated
+ * that r14 on return must hold the link address.
+ * r12 (ip) is only used on subroutine entry for stack frame calculations -
+ * after then it is a useful scratch register. If you push r14 you get
+ * another scratch register quickly.
+ *
+ * Most things nowadays are StrongARM or later. StrongARM is v4 of the
+ * architecture. ARM6 and ARM7 cores are v3, which introduced the 32 bit
+ * address bus. Earlier cores (which you won't encounter) used a 26 bit address
+ * bus, with program counter and status register combined in r15
+ */
+
+typedef enum {
+    r0,
+    r1,
+    r2,
+    r3,
+    r4,
+    r5,
+    r6,
+    r7,
+    r8,
+    r9,
+    r10,
+    r11,
+    r12,
+    r13,
+    r14,
+    r15,
+    REG10_sl = 10, REG11_fp = 11,
+    REG12_ip = 12,
+    REG13_sp = 13,
+    REG14_lr = 14,
+    REG15_pc = 15
+
+} arm_register_t;
+
+typedef enum {
+    cond_EQ = 0x00,
+    cond_NE = 0x10,
+    cond_CS = 0x20,
+    cond_CC = 0x30,
+    cond_MI = 0x40,
+    cond_PL = 0x50,
+    cond_VS = 0x60,
+    cond_VC = 0x70,
+    cond_HI = 0x80,
+    cond_LS = 0x90,
+    cond_GE = 0xA0,
+    cond_LT = 0xB0,
+    cond_GT = 0xC0,
+    cond_LE = 0xD0,
+    cond_AL = 0xE0,
+/*    cond_NV = 0xF0, */
+    /* synonyms for CS and CC:  */
+    cond_HS = 0x20,
+    cond_LO = 0x30
+} arm_cond_t;
+
+/* I've deliberately shifted these right by 1 bit so that I can forcibly
+   set the status flag on ops such as CMP. It's easy to forget (the assembler
+   doesn't mandate you explicitly write CMPS, it just sets the bit for you).
+   I got an illegal instruction trap on a StrongARM for a CMP without S, but
+   I think some of the other comparison operators have legal weird effects
+   with no S flag.  */
+typedef enum {
+    AND = 0x00,
+    EOR = 0x02,
+    SUB = 0x04,  /* Subtract         rd = rn - op2   */
+    RSB = 0x06,  /* Reverse SUbtract rd = op2 - rn  ; op2 is more flexible.   */
+    ADD = 0x08,
+    ADC = 0x0A,  /* ADd with Carry.  */
+    SBC = 0x0C,  /* SuBtract with Carry.  */
+    RSC = 0x0E,  /* Reverse Subtract with Carry.  */
+    TST = 0x11,  /* TeST                  rn AND op2    (sets flags).  */
+    TEQ = 0x13,  /* Test EQuivalence      rn XOR op2    (won't set V flag). */
+    CMP = 0x15,  /* CoMPare              rn  -  op2    */
+    CMN = 0x17,  /* CoMpare Negated       rn  +  op2    */
+    ORR = 0x18,
+    MOV = 0x1A,  /* MOVe                  rd =   op2 */
+    BIC = 0x1C,  /* BIt Clear             rd = rn AND NOT op2 */
+    MVN = 0x1E   /* MoV Not               rd = NOT op2 */
+} alu_op_t;
+
+/* note MVN is move    NOT      (ie logical NOT, 1s complement), whereas
+        CMN is compare NEGATIVE (ie arithmetic NEGATION, 2s complement)  */
+
+#define arith_sets_S 0x10
+
+#define INTERP_STRUCT_ADDR_REG r4
+
+/* B / BL
+ *
+ *  +--------------------------------------------------------------------+
+ *  | cond | 1 0 1 | L |                signed_immed_24                  |
+ *  +--------------------------------------------------------------------+
+ *   31  28 27   25 24  23                                              0
+ *
+ *
+ * The L bit
+ * 
+ * If L == 1 the instruction will store a return address in the link
+ * register (R14). Otherwise L == 0, the instruction will simply branch without
+ * storing a return address.
+ *
+ * The target address
+ *
+ * Specifies the address to branch to. The branch target address is calculated
+ * by:
+ *
+ *  - Sign-extending the 24-bit signed (two's complement) immediate to 32 bits.
+ *
+ *  - Shifting the result left two bits.
+ *
+ *  - Adding this to the contents of the PC, which contains the address of the
+ *    branch instruction plus 8.
+ *
+ * The instruction can therefore specify a branch of approximately ą32MB.
+ *
+ * [Not the full 32 bit address range of the v3 and later cores.]
+ */
+
+/* IIRC bx is branch into thumb mode, so don't name this back to bx  */
+
+char *emit_branch(char *pc,
+                  arm_cond_t cond,
+                  int L,
+                  int imm) {
+    *(pc++) = imm;
+    *(pc++) = ((imm) >> 8);
+    *(pc++) = ((imm) >> 16);
+    *(pc++) = cond | 0xA | L;
+    return pc;
+}
+
+#define emit_b(pc, cond, imm) \
+  emit_branch(pc, cond, 0, imm)
+
+#define emit_bl(pc, cond, imm) \
+  emit_branch(pc, cond, 1, imm)
+
+
+#define reg2mask(reg) (1<<(reg))
+
+typedef enum {
+    is_store      = 0x00,
+    is_load       = 0x10,
+    is_writeback  = 0x20,
+    no_writeback  = 0,
+    is_caret     = 0x40, /* assembler syntax is ^ - load sets status flags in
+                             USR mode, or load/store use user bank registers
+                             in other mode. IIRC.  */
+    no_caret      = 0,
+    is_byte      = 0x40,
+    no_byte      = 0,    /* It's a B suffix for a byte load, no suffix for
+                             word load, so this is more natural than is_word  */
+    is_pre       = 0x01, /* pre index addressing.  */
+    is_post       = 0x00  /* post indexed addressing. ie arithmetic for free  */
+} transfer_flags;
+
+/* multiple register transfer direction.
+   D = decrease, I = increase
+   A = after, B = before
+   or the stack notation
+   FD = full descending (the usual)
+   ED = empty descending
+   FA = full ascending
+   FD = full descending
+   values for stack notation are 0x10 | (ldm type) << 2 | (stm type)
+*/
+typedef enum {
+    dir_DA = 0,
+    dir_IA = 1,
+    dir_DB = 2,
+    dir_IB = 3,
+    dir_FD = 0x10 | (1 << 2) | 2,
+    dir_FA = 0x10 | (0 << 2) | 3,
+    dir_ED = 0x10 | (3 << 2) | 0,
+    dir_EA = 0x10 | (2 << 2) | 1
+} ldm_stm_dir_t;
+
+typedef enum {
+    dir_Up = 0x80,
+    dir_Down = 0x00
+} ldr_str_dir_t;
+
+char *
+emit_ldmstm_x(char *pc,
+             arm_cond_t cond,
+             int l_s,
+             ldm_stm_dir_t direction,
+             int caret,
+             int writeback,
+             int base,
+             int regmask) {
+    if ((l_s == is_load) && (direction & 0x10))
+        direction >>= 2;
+
+    *(pc++) = regmask;
+    *(pc++) = regmask >> 8;
+    /* bottom bit of direction is the up/down flag.  */
+    *(pc++) = ((direction & 1) << 7) | caret | writeback | l_s | base;
+    /* binary 100x is code for stm/ldm.  */
+    /* Top bit of direction is pre/post increment flag.  */
+    *(pc++) = cond | 0x8 | ((direction >> 1) & 1);
+    return pc;
+}
+
+/* Is is going to be rare to non existent that anyone needs to use the ^
+   syntax on LDM or STM, so make it easy to generate the normal form:  */
+#define emit_ldmstm(pc, cond, l_s, direction, writeback, base, regmask) \
+       emit_ldmstm_x(pc, cond, l_s, direction, 0, writeback, base, regmask)
+
+/* Load / Store
+ *
+ *  +--------------------------------------------------------------------+
+ *  | cond | 0 1 | I | P | U | B | W | L |  Rn  |  Rd  |      offset     |
+ *  +--------------------------------------------------------------------+
+ *   31  28 27 26 25  24  23  22  21  20  19  16 15  12 11              0
+ *
+ *
+ * The P bit
+ *
+ * P == 0 indicates the use of post-indexed addressing. The base register value
+ *        is used for the memory address, and the offset is then applied to the
+ *        base register and written back to the base register.
+ *
+ * P == 1 indicates the use of offset addressing or pre-indexed addressing (the
+ *        W bit determines which). The memory address is generated by applying
+ *        the offset to the base register value.
+ *
+ * The U bit
+ *
+ * Indicates whether the offset is added to the base (U == 1) or is subtracted
+ * from the base (U == 0).
+ *
+ * The B bit
+ * 
+ * Distinguishes between an unsigned byte (B == 1) and a word (B == 0) access.
+ *
+ * The W bit
+ *
+ * P == 0 If W == 0, the instruction is LDR, LDRB, STR or STRB and a normal
+ *        memory access is performed. If W == 1, the instruction is LDRBT,
+ *        LDRT, STRBT or STRT and an unprivileged (User mode) memory access is
+ *        performed.
+ *
+ * P == 1 If W == 0, the base register is not updated (offset addressing). If
+ *        W == 1, the calculated memory address is written back to the base 
+ *        register (pre-indexed addressing).
+ *
+ * The L bit
+ *
+ * Distinguishes between a Load (L == 1) and a Store (L == 0).
+ *
+ *  <Rd> is the destination register.
+ *  <Rn> is the base register.
+ *
+ * XXX need to detail addr mode, for I = 0 and I = 1
+ *
+ * Note that you can take advantage of post indexed addressing to get a free
+ * add onto the base register if you need it for some other purpose.
+ *
+ * Note that on StrongARM [and later? but not XScale :-(] if you don't use Rd
+ * next instruction then a load doesn't stall if it is from the cache (ie
+ * 1 cycle loads). You will want to re-order things where possible to take
+ * advantage of this.
+ *
+ * ARM1 had register shift register as possibilities for the offset (as the
+ * ALU ops still do. These took 1 more cycle, and were taken out as virtually
+ * no use was found for them. However, the bit patterns they represent
+ * certainly didn't used to fault as an illegal instruction on ARM2s, and
+ * probably later. So beware of generating illegal bit pattern offsets, as
+ * you'll get silent undefined behaviour.
+ */
+
+char *
+emit_ldrstr(char *pc,
+            arm_cond_t cond,
+            int l_s,
+            ldr_str_dir_t direction,
+            int pre,
+            int writeback,
+            int byte, 
+            int dest,
+            int base,
+            int offset_type,
+            unsigned int offset) {
+
+    *(pc++) = offset;
+    *(pc++) = ((offset >> 8) & 0xF) | (dest << 4);
+    *(pc++) = direction | byte | writeback | l_s | base;
+    *(pc++) = cond | 0x4 | offset_type | pre;
+    return pc;
+}
+
+char *
+emit_ldrstr_offset (char *pc,
+                    arm_cond_t cond,
+                    int l_s,
+                    int pre,
+                    int writeback,
+                    int byte,
+                    int dest,
+                    int base,
+                    int offset) {
+    ldr_str_dir_t direction = dir_Up;
+    if (offset > 4095 || offset < -4095) {
+        internal_exception(JIT_ERROR,
+                           "Unable to generate offset %d, larger than 4095\n",
+                           offset);
+    }
+    if (offset < 0) {
+        direction = dir_Down;
+        offset = -offset;
+    }
+    return emit_ldrstr(pc, cond, l_s, direction, pre, writeback, byte, dest,
+                       base, 0, offset);
+}
+
+/* Arithmetic
+ *
+ *  +--------------------------------------------------------------------+
+ *  | cond | 0 0 | I |  ALU  Opcode  | S |  Rn  |  Rd  | shifted operand |
+ *  +--------------------------------------------------------------------+
+ *   31  28 27 26 25  24  23  22  21  20  19  16 15  12 11              0
+ *
+ *
+ * The S bit
+ * 
+ * Indicates if the CPSR will be updated (S == 1) or not (S == 0).
+ *
+ * Two types of CPSR updates can occur:
+ *
+ *  - If <Rd> is not R15, the N and Z flags are set according to the result of
+ *    of the addition, and C and V flags are set according to whether the
+ *    addition generated a carry (unsigned overflow) and a signed overflow,
+ *    respectively. The rest of the CPSR is unchanged.
+ *
+ * XXX shifted immediate values for the second operand can also set the C
+ * flag (and therefore presumably also clear it) when the S flag is set for
+ * certain ALU ops. (I think just the logical ops) This is obscure, but
+ * sometimes useful. No idea where this is documented.
+ *
+ *  - If <Rd> is R15, the SPSR of the current mode is copied to the CPSR. This
+ *    form of the instruction is UNPREDICTABLE if executed in User mode or
+ *    System mode, because these do not have an SPSR.
+ *
+ *
+ */
+
+typedef enum {
+    shift_LSL = 0x00,
+    shift_LSR = 0x20,
+    shift_ASR = 0x40,
+    shift_ROR = 0x60,
+    shift_ASL = 0x00   /* Synonym - no sign extension (or not) on <<  */
+} barrel_shift_t;
+/* RRX (rotate right with extend - a 1 position 33 bit rotate including the
+   carry flag) is encoded as ROR by 0.  */
+
+char *
+emit_arith(char *pc,
+           arm_cond_t cond,
+           alu_op_t op,
+           int status,
+           arm_register_t rd,
+           arm_register_t rn,
+           int operand2_type,
+           int operand2) {
+    *(pc++) = operand2;
+    *(pc++) = rd << 4 | ((operand2 >> 8) & 0xF);
+    *(pc++) = op << 4 | status | rn;
+    *(pc++) = cond | 0 | operand2_type | op >> 4;
+    return pc;
+}
+
+/* eg add r0, r3, r7  */
+#define emit_arith_reg(pc, cond, op, status, rd, rn, rm) \
+       emit_arith (pc, cond, op, status, rd, rn, 0, rm)
+
+/* eg sub r0, r3, r7 lsr #3 */
+#define emit_arith_reg_shift_const(pc, cond, op, status, rd, rn, rm, shift, by) \
+       emit_arith (pc, cond, op, status, rd, rn, 0, ((by) << 7) | shift | 0 | (rm))
+
+/* eg orrs r1, r2, r1 rrx */
+#define emit_arith_reg_rrx(pc, cond, op, status, rd, rn, rm) \
+       emit_arith (pc, cond, op, status, rd, rn, 0, shift_ROR | 0 | (rm))
+
+/* I believe these take 2 cycles (due to having to access a 4th register.  */
+#define emit_arith_reg_shift_reg(pc, cond, op, status, rd, rn, rm, shift, rs) \
+       emit_arith (pc, cond, op, status, rd, rn, 0, ((rs) << 8) | shift | 0x10 | (rm))
+
+#define emit_arith_immediate(pc, cond, op, status, rd, rn, val, rotate) \
+       emit_arith (pc, cond, op, status, rd, rn, 2, ((rotate) << 8) | (val))
+
+/* I'll use mov r0, r0 as my NOP for now.  */
+#define emit_nop(pc) emit_mov (pc, r0, r0)
+
+/* MOV ignores rn  */
+#define emit_mov(pc, dest, src) emit_arith_reg (pc, cond_AL, MOV, 0, dest, 0, src)
+
+#define emit_dcd(pc, word)  { \
+    *((int *)pc) = word; \
+    pc+=4; }
+
+static void Parrot_jit_int_load(Parrot_jit_info *jit_info,
+                             struct Parrot_Interp *interpreter,
+                             int param,
+                             int hwreg)
+{
+    opcode_t op_type
+        = interpreter->op_info_table[*jit_info->cur_op].types[param];
+    int val = jit_info->cur_op[param];
+    int offset;
+
+    switch(op_type){
+        case PARROT_ARG_I:
+            offset = ((char *)&interpreter->ctx.int_reg.registers[val])
+                - (char *)interpreter;
+            if (offset > 4095) {
+                internal_exception(JIT_ERROR,
+                                   "integer load register %d generates offset %d, 
+larger than 4095\n",
+                           val, offset);
+            }
+            jit_info->native_ptr = emit_ldrstr_offset (jit_info->native_ptr,
+                                                       cond_AL,
+                                                       is_load,
+                                                       is_pre,
+                                                       0, 0,
+                                                       hwreg,
+                                                       INTERP_STRUCT_ADDR_REG,
+                                                       offset);
+            break;
+
+        case PARROT_ARG_IC:
+        default:
+            internal_exception(JIT_ERROR,
+                               "Unsupported op parameter type %d in jit_int_load\n",
+                               op_type);
+    }
+}
+
+static void Parrot_jit_int_store(Parrot_jit_info *jit_info,
+                             struct Parrot_Interp *interpreter,
+                             int param,
+                             int hwreg)
+{
+    opcode_t op_type
+         = interpreter->op_info_table[*jit_info->cur_op].types[param];
+    int val = jit_info->cur_op[param];
+    int offset;
+
+    switch(op_type){
+        case PARROT_ARG_I:
+            offset = ((char *)&interpreter->ctx.int_reg.registers[val])
+                - (char *)interpreter;
+            if (offset > 4095) {
+                internal_exception(JIT_ERROR,
+                                   "integer store register %d generates offset %d, 
+larger than 4095\n",
+                           val, offset);
+            }
+            jit_info->native_ptr = emit_ldrstr_offset (jit_info->native_ptr,
+                                                       cond_AL,
+                                                       is_store,
+                                                       is_pre,
+                                                       0, 0,
+                                                       hwreg,
+                                                       INTERP_STRUCT_ADDR_REG,
+                                                       offset);
+            break;
+
+        case PARROT_ARG_N:
+        default:
+            internal_exception(JIT_ERROR,
+                            "Unsupported op parameter type %d in jit_int_store\n",
+                               op_type);
+    }
+}
+
+static void emit_jump_to_op(Parrot_jit_info *jit_info, arm_cond_t cond,
+                            int L, opcode_t disp) {
+    opcode_t opcode = jit_info->op_i + disp;
+
+    if(opcode <= jit_info->op_i) {
+        int offset = jit_info->op_map[opcode].offset -
+            (jit_info->native_ptr - jit_info->arena_start);
+
+        jit_info->native_ptr
+            = emit_branch(jit_info->native_ptr, cond, L, (offset >> 2) - 2);
+        return;
+    }
+    internal_exception(JIT_ERROR, "Can't go forward yet\n");
+    
+}
+
+void Parrot_jit_dofixup(Parrot_jit_info *jit_info,
+                        struct Parrot_Interp * interpreter)
+{
+    /* Todo.  */
+}
+/* My entry code is create a stack frame:
+       mov     ip, sp
+       stmfd   sp!, {r4, fp, ip, lr, pc}
+       sub     fp, ip, #4
+   Then store the first parameter (pointer to the interpreter) in r4.
+       mov     r4, r0
+*/
+
+void
+Parrot_jit_begin(Parrot_jit_info *jit_info,
+                 struct Parrot_Interp * interpreter)
+{
+    jit_info->native_ptr = emit_mov (jit_info->native_ptr, REG12_ip, REG13_sp);
+    jit_info->native_ptr = emit_ldmstm (jit_info->native_ptr,
+                                        cond_AL, is_store, dir_FD,
+                                        is_writeback,
+                                        REG13_sp,
+                                        reg2mask(4) | reg2mask(REG11_fp)
+                                        | reg2mask(REG12_ip)
+                                        | reg2mask(REG14_lr)
+                                        | reg2mask(REG15_pc));
+    jit_info->native_ptr = emit_arith_immediate (jit_info->native_ptr, cond_AL,
+                                                 SUB, 0, REG11_fp, REG12_ip,
+                                                 4, 0);
+    jit_info->native_ptr = emit_mov (jit_info->native_ptr, 4, 0);
+}
+
+/* I'm going to load registers to call functions in general like this:
+    adr     r14,  .L1
+    ldmia   r14!,  {r0, r1, r2, pc} ; register list built by jit
+    .L1:    r0 data
+            r1 data
+            r2 data
+           <where ever>        ; address of function.
+    .L2:                      ; next instruction - return point from func.
+
+    # here I'm going to do 
+
+    mov            r1, r4      ; current interpreter is arg 1
+    adr     r14,  .L1
+    ldmia   r14!,  {r0, pc}
+    .L1:    address of current opcode
+           <where ever>        ; address of function for op
+    .L2:                      ; next instruction - return point from func.
+*/
+
+/*
+XXX no.
+need to adr beyond:
+
+    mov            r1, r4      ; current interpreter is arg 1
+    adr     r14,  .L1
+    ldmda   r14!,  {r0, ip}
+    mov     pc, ip
+    .L1     address of current opcode
+    dcd     <where ever>      ; address of function for op
+    .L2:                      ; next instruction - return point from func.
+*/
+void
+Parrot_jit_normal_op(Parrot_jit_info *jit_info,
+                     struct Parrot_Interp * interpreter)
+{
+    jit_info->native_ptr = emit_mov (jit_info->native_ptr, r1, r4);
+#ifndef ARM_K_BUG
+    jit_info->native_ptr = emit_mov (jit_info->native_ptr, REG14_lr, REG15_pc);
+    jit_info->native_ptr = emit_ldmstm (jit_info->native_ptr,
+                                        cond_AL, is_load, dir_IA,
+                                        is_writeback,
+                                        REG14_lr,
+                                        reg2mask(0) | reg2mask(REG15_pc));
+#else
+    jit_info->native_ptr = emit_arith_immediate (jit_info->native_ptr, cond_AL,
+                                                 ADD, 0, REG14_lr, REG15_pc,
+                                                 4, 0);
+    jit_info->native_ptr = emit_ldmstm (jit_info->native_ptr,
+                                        cond_AL, is_load, dir_IA,
+                                        is_writeback,
+                                        REG14_lr,
+                                        reg2mask(0) | reg2mask(REG12_ip));
+    jit_info->native_ptr = emit_mov (jit_info->native_ptr, REG15_pc, REG12_ip);
+#endif
+    emit_dcd (jit_info->native_ptr, (int) jit_info->cur_op);
+    emit_dcd (jit_info->native_ptr,
+              (int) interpreter->op_func_table[*(jit_info->cur_op)]);
+}
+
+/* We get back address of opcode in bytecode.
+   We want address of equivalent bit of jit code, which is stored as an
+   address at the same offset in a jit table. */
+void Parrot_jit_cpcf_op(Parrot_jit_info *jit_info,
+                        struct Parrot_Interp * interpreter)
+{
+    Parrot_jit_normal_op(jit_info, interpreter);
+
+    /* This is effectively the pseudo-opcode ldr - ie load relative to PC.
+       So offset includes pipeline.  */
+    jit_info->native_ptr = emit_ldrstr_offset (jit_info->native_ptr, cond_AL,
+                                               is_load, is_pre, 0, 0,
+                                               REG14_lr, REG15_pc, 0);
+    /* ldr pc, [r14, r0]  */
+    /* lazy. this is offset type 0, 0x000 which is r0 with zero shift  */
+    jit_info->native_ptr = emit_ldrstr (jit_info->native_ptr, cond_AL,
+                                        is_load, dir_Up, is_pre, 0, 0,
+                                        REG15_pc, REG14_lr, 2, 0);
+    /* and this "instruction" is never reached, so we can use it to store
+       the constant that we load into r14  */
+    emit_dcd (jit_info->native_ptr,
+              ((long) jit_info->op_map) -
+              ((long) interpreter->code->byte_code));
+}
+
+/*
+ * Local variables:
+ * c-indentation-style: bsd
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil 
+ * End:
+ *
+ * vim: expandtab shiftwidth=4:
+ */
--- ../parrot-clean/jit.c       Tue Jul 23 19:18:41 2002
+++ jit.c       Tue Jul 30 23:32:06 2002
@@ -7,6 +7,12 @@
 #include <parrot/parrot.h>
 #include "parrot/jit.h"
 
+#ifdef ARM
+#ifdef __linux
+#include <asm/unistd.h>
+#endif
+#endif
+
 /*
 ** optimize_jit()
 ** XXX Don't pay much attention to this yet.
@@ -128,6 +134,63 @@ optimize_jit(struct Parrot_Interp *inter
     return optimizer;
 }
 
+#ifdef ARM
+static void
+arm_sync_d_i_cache (void *start, void *end) {
+/* Strictly this is only needed for StrongARM and later (not sure about ARM8)
+   because earlier cores don't have separate D and I caches.
+   However there aren't that many ARM7 or earlier devices around that we'll be
+   running on.  */
+#ifdef __linux
+#ifdef __GNUC__
+    int result;
+    /* swi call based on code snippet from Russell King.  Description
+       verbatim:  */
+    /*
+     * Flush a region from virtual address 'r0' to virtual address 'r1'
+     * _inclusive_.  There is no alignment requirement on either address;   
+     * user space does not need to know the hardware cache layout.
+     *
+     * r2 contains flags.  It should ALWAYS be passed as ZERO until it
+     * is defined to be something else.  For now we ignore it, but may
+     * the fires of hell burn in your belly if you break this rule. ;)
+     *
+     * (at a later date, we may want to allow this call to not flush
+     * various aspects of the cache.  Passing '0' will guarantee that
+     * everything necessary gets flushed to maintain consistency in
+     * the specified region).
+     */
+
+    /* The value of the SWI is actually available by in
+       __ARM_NR_cacheflush defined in <asm/unistd.h>, but quite how to
+       get that to interpolate as a number into the ASM string is beyond
+       me.  */
+    /* I'm actually passing in exclusive end address, so subtract 1 from
+       it inside the assembler.  */
+    __asm__ __volatile__ (
+        "mov     r0, %1\n"
+        "sub     r1, %2, #1\n"
+        "mov     r2, #0\n"
+        "swi     " __sys1(__ARM_NR_cacheflush) "\n"
+        "mov     %0, r0\n"
+        : "=r" (result)
+        : "r" ((long)start), "r" ((long)end)
+        : "r0","r1","r2");
+
+    if (result < 0) {
+        internal_exception(JIT_ERROR,
+                           "Synchronising I and D caches failed with errno=%d\n",
+                           -result);
+    }
+#else
+#error "ARM needs to sync D and I caches, and I don't know how to embed assmbler on 
+this C compiler"
+#endif
+#else
+/* Not strictly true - on RISC OS it's OS_SynchroniseCodeAreas  */
+#error "ARM needs to sync D and I caches, and I don't know how to on this OS"
+#endif
+}
+#endif
 
 /*
 ** build_asm()
@@ -214,6 +277,9 @@ build_asm(struct Parrot_Interp *interpre
         }
     }
 
+#ifdef ARM
+    arm_sync_d_i_cache (jit_info.arena_start, jit_info.native_ptr);
+#endif
     return (jit_f)jit_info.arena_start;
 }
 
--- config/auto/jit.pl.orig     Sat Jul 13 22:39:40 2002
+++ config/auto/jit.pl  Mon Jul 29 00:08:22 2002
@@ -42,11 +42,14 @@ sub runstep {
     $cpuarch = 'i386';
   }
 
+  $cpuarch               =~ s/armv[34]l?/arm/i;
+
   Configure::Data->set(
     archname    => $archname,
     cpuarch     => $cpuarch,
     osname      => $osname,
   );
+
 
   my $jitarchname              =  "$cpuarch-$osname";
   $jitarchname                 =~ s/i[456]86/i386/i;

Re: ARM Jit v2

Reply via email to