Richard Henderson <r...@twiddle.net> writes: > On 08/08/2016 10:57 AM, Richard Henderson wrote: >> On 08/07/2016 11:06 PM, Nikunj A Dadhania wrote: >>> +#define LXV(name, access, swap, type, elems) \ >>> +uint64_t helper_##name(CPUPPCState *env, \ >>> + target_ulong addr) \ >>> +{ \ >>> + type r[elems] = {0}; \ >>> + int i, index, bound, step; \ >>> + if (msr_le) { \ >>> + index = elems - 1; \ >>> + bound = -1; \ >>> + step = -1; \ >>> + } else { \ >>> + index = 0; \ >>> + bound = elems; \ >>> + step = 1; \ >>> + } \ >>> + \ >>> + for (i = index; i != bound; i += step) { \ >>> + if (needs_byteswap(env)) { \ >>> + r[i] = swap(access(env, addr, GETPC())); \ >>> + } else { \ >>> + r[i] = access(env, addr, GETPC()); \ >>> + } \ >>> + addr = addr_add(env, addr, sizeof(type)); \ >>> + } \ >>> + return *((uint64_t *)r); \ >>> +} >> >> This looks more complicated than necessary. >> >> (1) In big-endian mode, surely this simplifies to two 64-bit big-endian >> loads. >> >> (2) In little-endian mode, the overhead of accessing memory surely dominates, >> and therefore we should perform two 64-bit loads and manipulate the data >> after. >> >> AFAICS, this is easiest done by requesting two 64-bit *big-endian* loads, and >> then swapping bytes. E.g. >> >> uint64_t helper_bswap16x4(uint64_t x) >> { >> uint64_t m = 0x00ff00ff00ff00ffull; >> return ((x & m) << 8) | ((x >> 8) & m); >> } >> >> uint64_t helper_bswap32x2(uint64_t x) >> { >> return deposit64(bswap32(x >> 32), 32, 32, bswap32(x)); >> } > > To correct myself, this big-endian load really only makes sense for lxvh8x. > For lxvw4x, a little-endian load with a word swap is fewer operations. I.e. > > tcg_gen_qemu_ld_i64(t0, addr, ctx->mem_idx, MO_LEQ); > tcg_gen_shri_i64(t1, t0, 32); > tcg_gen_deposit_i64(dest, t1, t0, 32, 32);
I have following for lxvw4x: if (ctx->le_mode) { t0 = tcg_temp_new_i64(); t1 = tcg_temp_new_i64(); tcg_gen_qemu_ld_i64(t0, EA, ctx->mem_idx, MO_LEQ); tcg_gen_shri_i64(t1, t0, 32); tcg_gen_deposit_i64(xth, t1, t0, 32, 32); tcg_gen_addi_tl(EA, EA, 8); tcg_gen_qemu_ld_i64(t0, EA, ctx->mem_idx, MO_LEQ); tcg_gen_shri_i64(t1, t0, 32); tcg_gen_deposit_i64(xtl, t1, t0, 32, 32); tcg_temp_free_i64(t0); tcg_temp_free_i64(t1); } else { tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_BEQ); tcg_gen_addi_tl(EA, EA, 8); tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_BEQ); } Here is the test code: __vector uint32_t vrt32; uint32_t rb32[4] = {0x00010203, 0x20212223, 0x30313233, 0x40414243}; asm("lxvw4x %x0, 0, %1 \n\t" \ : "=ws"(vrt32) : "r"(&rb32)); printf("VRT32 = "); vec_put_u32(vrt32); Result On LE: VRT32 = 40414243 30313233 20212223 00010203 Result On BE: VRT32 = 03020100 23222120 33323130 43424140 I had have expected the the BE result to be VRT32 = 00010203 20212223 30313233 40414243 But seems there is something going under the hoods. Am I missing something here? I can fix the BE case using following but not sure if that will be correct ! tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_Q); gen_helper_bswap32x2(xth, xth); tcg_gen_addi_tl(EA, EA, 8); tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_Q); gen_helper_bswap32x2(xtl, xtl); uint64_t helper_bswap32x2(uint64_t x) { return deposit64((x >> 32), 32, 32, (x)); } And then I get the expected result: VRT32 = 00010203 20212223 30313233 40414243 Regards, Nikunj