Re: Another CXL/MMIO tcg tlb corner case

2024-03-15 Thread Jørgen Hansen

> On 15 Mar 2024, at 13.25, Alex Bennée  wrote:
> 
> Jørgen Hansen  writes:
> 
>> Hi,
>> 
>> While doing some testing using numactl-based interleaving of application memory
>> across regular memory and CXL-based memory using QEMU with tcg, I ran into an
>> issue similar to what we saw a while back - link to old issue:
>> https://lore.kernel.org/qemu-devel/CAFEAcA_a_AyQ=epz3_+cheat8crsk9mou894wbnw_fywamk...@mail.gmail.com/#t.
>> 
>> When running:
>> 
>> numactl --interleave 0,1 ./cachebench …
>> 
>> I hit the following:
>> 
>> numactl --interleave 0,1 ./cachebench --json_test_config 
>> ../test_configs/hit_ratio/graph_cache_follower_assocs/config.json
>> qemu: fatal: cpu_io_recompile: could not find TB for pc=0x7fffc3926dd4
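
The fatal message is raised by cpu_io_recompile() when it cannot look up the
currently executing TB from the host return address. Roughly, in QEMU's
accel/tcg code (a sketch, not verbatim; details vary between versions):

    void cpu_io_recompile(CPUState *cpu, uintptr_t retaddr)
    {
        TranslationBlock *tb;

        /* Map the host return address back to the TB that contains it. */
        tb = tcg_tb_lookup(retaddr);
        if (!tb) {
            /* This is the abort reported above. */
            cpu_abort(cpu, "cpu_io_recompile: could not find TB for pc=%p",
                      (void *)retaddr);
        }

        /*
         * Otherwise the TB is regenerated so that the offending access
         * becomes the last instruction of the block and I/O is allowed.
         */
    }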
> 
> OK, so this will be because we (by design) don't cache TBs for MMIO
> regions. But as you say:
>> 
>> Looking at the tb being executed, it looks like it is a single instruction tb,
>> so with my _very_ limited understanding of tcg, it shouldn’t be necessary to
>> do the IO recompile:
>> 
>> (gdb) up 13
>> 
>> #13 0x55ca13ac in cpu_loop_exec_tb (tb_exit=0x749ff6d8, 
>> last_tb=, pc=, tb=0x7fffc3926cc0 
>> , cpu=0x578c19c0) at 
>> ../accel/tcg/cpu-exec.c:904
>> 904 tb = cpu_tb_exec(cpu, tb, tb_exit);
>> (gdb) print *tb
>> $1 = {pc = 0, cs_base = 0, flags = 415285939, cflags = 4278321152, size = 7, 
>> icount = 1, tc = {ptr = 0x7fffc3926d80 , size = 
>> 176}, page_next = {0, 0}, page_addr = {18446744073709551615,
>>18446744073709551615}, jmp_lock = {value = 0}, jmp_reset_offset =
>> {65535, 65535}, jmp_insn_offset = {65535, 65535}, jmp_target_addr =
>> {0, 0}, jmp_list_head = 140736474540928, jmp_list_next = {0, 0},
>> jmp_dest = {0, 0}}
> 
> With an icount of 1 we can, by definition, do I/O. This is done in
> translator_loop():
> 
>    /*
>     * Disassemble one instruction.  The translate_insn hook should
>     * update db->pc_next and db->is_jmp to indicate what should be
>     * done next -- either exiting this loop or locate the start of
>     * the next instruction.
>     */
>    if (db->num_insns == db->max_insns) {
>        /* Accept I/O on the last instruction.  */
>        set_can_do_io(db, true);
>    }
> 
> and we see later on:
> 
>tb->icount = db->num_insns;
> 
> So I guess we must have gone into the translator loop expecting to
> translate more? (Side note: having an int8_t type for the saved bool
> value seems weird.)
> 
> Could you confirm by doing something like:
> 
>    if (tb->icount == 1 && db->max_insns > 1) {
>        fprintf(stderr, "ended up doing one insn (%d, %d)",
>                db->max_insns, db->saved_can_do_io);
>    }
> 
> after we set tb->icount?
> 

Thanks for looking into this - I tried what you suggested above, and it looks
like that case happens quite often (I see hundreds of these just when booting
the VM), so it is hard to tell whether it is directly related to the issue,
e.g.:

ended up doing one insn (512, 0)ended up doing one insn (512, 0)ended up doing 
one insn (512, 0)ended up doing one insn (512, 0)ended up doing one insn (512, 
0)ended up doing one insn (512, 0)ended up doing one insn (512, 0)qemu: fatal: 
cpu_io_recompile: could not find TB for pc=0x7f4264da3d48
RAX= RBX=7f6798e69040 RCX=0205f958 
RDX=0205f8f0
RSI= RDI=7ffd5663c720 RBP=7ffd5663c720 
RSP=7ffd5663c6c0
R8 =001e R9 =0205f920 R10=0004 
R11=7f67984b56b0
R12=7ffd5663c7e0 R13=7f6798e5d0b8 R14=7f6798e5d588 
R15=7ffd5663c700
RIP=7f6798e39ffd RFL=0246 [---Z-P-] CPL=3 II=0 A20=1 SMM=0 HLT=0
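
As an aside, the messages run together because the suggested fprintf() has no
trailing newline. A variant of the check like the sketch below, placed where
tb->icount is assigned in translator_loop (the '\n' and the printed guest pc
are additions to the original suggestion), makes the output easier to
correlate with the crash:

    tb->icount = db->num_insns;

    /* Debug aid (sketch): flag blocks that ended after a single instruction
     * even though the translator expected more. */
    if (tb->icount == 1 && db->max_insns > 1) {
        fprintf(stderr, "ended up doing one insn (%d, %d) at pc=0x%" PRIx64 "\n",
                db->max_insns, db->saved_can_do_io, (uint64_t)db->pc_first);
    }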


>> If the application is run entirely out of MMIO memory, things work fine (the
>> previous patches related to this are in Jonathan's branch), so one thought is
>> that it is related to having the code on a mix of regular and CXL memory.
>> Since we previously had issues with code crossing page boundaries where only
>> the second page is MMIO, I tried out the following change to the fix
>> introduced for that issue, thinking that reverting to the slow path in the
>> middle of the translation might not correctly update can_do_io:
>> 
>> diff --git a/accel/tcg/translator.c b/accel/tcg/translator.c
>> index 38c34009a5..db6ea360e0 100644
>> --- a/accel/tcg/translator.c
>> +++ b/accel/tcg/translator.c
>> @@ -258,6 +258,7 @@ static void *translator_access(CPUArchState *env, DisasContextBase *db,
>>              if (unlikely(new_page1 == -1)) {
>>                  tb_unlock_pages(tb);
>>                  tb_set_page_addr0(tb, -1);
>> +                set_can_do_io(db, true);
>>                  return NULL;
>>              }
>> 
>> With that change, things work fine. Not saying that this is a valid fix for
>> the issue, but just trying to provide as much information as possible :)
> 
> I can see why this works; my worry is whether we address all the early
> exit cases here.
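
Putting the thread together, the suspected sequence of events looks roughly
like the following illustration (pieced together from the messages above, not
verbatim QEMU code):

    /*
     * translator_loop():
     *     db->max_insns = 512;              // a full block is expected
     *     // the "Accept I/O on the last instruction" check never fires,
     *     // since num_insns never reaches max_insns for this block
     *     translate_insn():
     *         translator_access():
     *             // the code crosses into a page backed by MMIO (CXL) memory
     *             tb_unlock_pages(tb);
     *             tb_set_page_addr0(tb, -1);  // TB will not be cached
     *             return NULL;                // fall back to the slow path
     *     // the block ends after this single instruction
     *     tb->icount = db->num_insns;         // == 1, can_do_io still clear
     *
     * At run time a memory access from this TB hits MMIO while can_do_io is
     * clear, so the core calls cpu_io_recompile(); the uncached TB cannot be
     * looked up and QEMU aborts with "could not find TB for pc=...".  Setting
     * can_do_io in translator_access()'s early-exit path (the diff above)
     * covers this exit, though other early exits may need the same handling.
     */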
> 
>> 
>> Thanks,
>> Jorgen
> 
> --
> 
