On 2/14/21 9:58 AM, Philippe Mathieu-Daudé wrote:
> +static bool trans_parallel_compare(DisasContext *ctx, arg_rtype *a,
> +                                   TCGCond cond, unsigned wlen)
> +{
> +    TCGv_i64 c0, c1, ax, bx, t0, t1, t2;
> +
> +    if (a->rd == 0) {
> +        /* nop */
> +        return true;
> +    }
> +
> +    c0 = tcg_const_tl(0);
> +    c1 = tcg_const_tl(0xffffffff);
Cheaper for most hosts to load -1 than a 32-bit value zero-extended
to 64 bits.

That said, you could also use setcond(t0, t0, t1, cond); neg(t0, t0);

> +    for (int i = 0; i < (64 / wlen); i++) {
> +        tcg_gen_sextract_i64(t0, ax, wlen * i, wlen);
> +        tcg_gen_sextract_i64(t1, bx, wlen * i, wlen);
> +        tcg_gen_movcond_i64(cond, t2, t1, t0, c1, c0);
> +        tcg_gen_deposit_i64(cpu_gpr[a->rd], cpu_gpr[a->rd], t2, wlen * i,
> +                            wlen);
> +    }

For an accumulate loop like this, we'll get better results if the length
of the insert is the remaining length of the register.  That way, the
first insert is width 64, which turns into a move, so that the old value
of rd is not used.  Further, we can use extract2 to replace the remaining
length when deposit is not available.

Also, while you will need this compare loop for GT, there's a cheaper way
to compute EQ, which we use in several places in QEMU:

void gen_pceq(TCGv_i64 d, TCGv_i64 s, TCGv_i64 t, MemOp esz)
{
    TCGv_i64 one = tcg_constant_i64(dup_const(esz, 1));
    TCGv_i64 x = tcg_temp_new_i64();

    /* Turn s == t into x == 0. */
    tcg_gen_xor_i64(x, s, t);

    /*
     * See hasless(v,1) from
     * https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
     * Shift the msb down, then use muli to replicate
     * the one bit across the vector element.
     */
    tcg_gen_sub_i64(d, x, one);
    tcg_gen_andc_i64(d, d, x);
    tcg_gen_shri_i64(d, d, (8 << esz) - 1);
    tcg_gen_and_i64(d, d, one);
    tcg_gen_muli_i64(d, d, MAKE_64BIT_MASK(0, 8 << esz));

    tcg_temp_free_i64(x);
}

In both cases, I think you should pull out helper functions and then use
trans_parallel_logic.


r~
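
A minimal sketch of the setcond/neg alternative mentioned above, reusing
the names from the quoted loop (ax, bx, t0, t1, t2, wlen, cond and
cpu_gpr are all assumed from the patch under review).  setcond produces
0/1, and negating that gives the 0/-1 mask, so no constants need to be
loaded at all:

for (int i = 0; i < 64 / wlen; i++) {
    tcg_gen_sextract_i64(t0, ax, wlen * i, wlen);
    tcg_gen_sextract_i64(t1, bx, wlen * i, wlen);
    /* t0 = (t1 cond t0) ? 1 : 0, matching the movcond operand order. */
    tcg_gen_setcond_i64(cond, t0, t1, t0);
    /* 0 -> 0, 1 -> -1 (all ones within the deposited field). */
    tcg_gen_neg_i64(t0, t0);
    tcg_gen_deposit_i64(cpu_gpr[a->rd], cpu_gpr[a->rd], t0, wlen * i, wlen);
}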
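
And a sketch of the accumulate shape described above, again with the
names from the quoted patch assumed: making the deposit length the
remaining length of the register means the first iteration inserts a
full 64-bit field, which folds to a plain move, so the stale value of
rd never feeds into the result; each later iteration simply overwrites
the upper part again.  The explicit extract2 fallback for hosts without
deposit is not shown here.

for (int i = 0; i < 64 / wlen; i++) {
    tcg_gen_sextract_i64(t0, ax, wlen * i, wlen);
    tcg_gen_sextract_i64(t1, bx, wlen * i, wlen);
    tcg_gen_movcond_i64(cond, t2, t1, t0, c1, c0);
    /* Insert the remaining length; i == 0 covers the whole register. */
    tcg_gen_deposit_i64(cpu_gpr[a->rd], cpu_gpr[a->rd], t2,
                        wlen * i, 64 - wlen * i);
}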
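
Finally, a rough illustration of the helper-function split.  The
trans_parallel_logic signature used here (ctx, a, generator callback)
is only a guess for illustration, as are the gen_pceq_w wrapper and the
trans_PCEQW name; adjust to whatever the series actually defines:

/* Wrap gen_pceq for the 32-bit element size. */
static void gen_pceq_w(TCGv_i64 d, TCGv_i64 s, TCGv_i64 t)
{
    gen_pceq(d, s, t, MO_32);
}

static bool trans_PCEQW(DisasContext *ctx, arg_rtype *a)
{
    /* Hypothetical dispatch through the shared parallel-logic helper. */
    return trans_parallel_logic(ctx, a, gen_pceq_w);
}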