On Wed, Apr 27, 2011 at 12:35 AM, Igor Kovalenko <igor.v.kovale...@gmail.com> wrote: > On Wed, Apr 27, 2011 at 12:07 AM, Blue Swirl <blauwir...@gmail.com> wrote: >> On Tue, Apr 26, 2011 at 10:07 PM, Igor Kovalenko >> <igor.v.kovale...@gmail.com> wrote: >>> On Tue, Apr 26, 2011 at 10:36 PM, Blue Swirl <blauwir...@gmail.com> wrote: >>>> On Tue, Apr 26, 2011 at 8:02 PM, Artyom Tarasenko <atar4q...@gmail.com> >>>> wrote: >>>>> On Mon, Apr 25, 2011 at 10:29 PM, Aurelien Jarno <aurel...@aurel32.net> >>>>> wrote: >>>>>> On Fri, Apr 22, 2011 at 06:14:06PM +0400, Igor Kovalenko wrote: >>>>>>> On Fri, Apr 22, 2011 at 2:39 AM, Laurent Desnogues >>>>>>> <laurent.desnog...@gmail.com> wrote: >>>>>>> > On Thu, Apr 21, 2011 at 9:45 PM, Igor Kovalenko >>>>>>> > <igor.v.kovale...@gmail.com> wrote: >>>>>>> >> On Thu, Apr 21, 2011 at 7:44 PM, Laurent Desnogues >>>>>>> >> <laurent.desnog...@gmail.com> wrote: >>>>>>> >>> On Thu, Apr 21, 2011 at 4:57 PM, Artyom Tarasenko >>>>>>> >>> <atar4q...@gmail.com> wrote: >>>>>>> >>>> On Tue, Apr 12, 2011 at 4:14 AM, Igor Kovalenko >>>>>>> >>>> <igor.v.kovale...@gmail.com> wrote: >>>>>>> >>>>>>> Do you have public test case? >>>>>>> >>>>>>> It is possible to code this delay slot write test but real >>>>>>> >>>>>>> issue may >>>>>>> >>>>>>> be corruption elsewhere. >>>>>>> >>>> >>>>>>> >>>> The test case is trivial: it's just the two instructions, branch >>>>>>> >>>> and wrpr. >>>>>>> >>>> >>>>>>> >>>>> In theory there could be multiple issues including compiler >>>>>>> >>>>> induced ones. >>>>>>> >>>>> I'd prefer to see some kind of reproducible testcase. >>>>>>> >>>> >>>>>>> >>>> Ok, attached a 40 byte long test (the first 32 bytes are not used >>>>>>> >>>> and >>>>>>> >>>> needed only because the bios entry point is 0x20). >>>>>>> >>>> >>>>>>> >>>> $ git pull && make && sparc64-softmmu/qemu-system-sparc64 -bios >>>>>>> >>>> test-wrpr.bin -nographic >>>>>>> >>>> Already up-to-date. >>>>>>> >>>> make[1]: Nothing to be done for `all'. >>>>>>> >>>> /mnt/terra/projects/vanilla/qemu/tcg/tcg.c:1892: tcg fatal error >>>>>>> >>>> Aborted >>>>>>> >>> >>>>>>> >>> The problem seems to be that wrpr is using a non-local >>>>>>> >>> TCG tmp (cpu_tmp0). >>>>>>> >> >>>>>>> >> Just tried the test case with write to %pil - seems like write >>>>>>> >> itself is OK. >>>>>>> >> The issue appears to be with save_state() call since adding >>>>>>> >> save_state >>>>>>> >> to %pil case provokes the same tcg abort. >>>>>>> > >>>>>>> > The problem is that cpu_tmp0, not being a local tmp, doesn't >>>>>>> > need to be saved across helper calls. This results in the >>>>>>> > TCG "optimizer" getting rid of it even though it's later used. >>>>>>> > Look at the log and you'll see what I mean :-) >>>>>>> >>>>>>> I'm not very comfortable with tcg yet. Would it be possible to teach >>>>>>> optimizer working with delay slots? Or do I look in the wrong place. >>>>>>> >>>>>> >>>>>> The problem is not on the TCG side, but on the target-sparc/translate.c >>>>>> side: >>>>>> >>>>>> | case 0x32: /* wrwim, V9 wrpr */ >>>>>> | { >>>>>> | if (!supervisor(dc)) >>>>>> | goto priv_insn; >>>>>> | tcg_gen_xor_tl(cpu_tmp0, cpu_src1, >>>>>> cpu_src2); >>>>>> | #ifdef TARGET_SPARC64 >>>>>> >>>>>> Here cpu_tmp0 is loaded. cpu_tmp0 is a TCG temp, which means it is not >>>>>> saved across TCG branches. >>>>>> >>>>>> [...] >>>>>> >>>>>> | case 6: // pstate >>>>>> | save_state(dc, cpu_cond); >>>>>> | gen_helper_wrpstate(cpu_tmp0); >>>>>> | dc->npc = DYNAMIC_PC; >>>>>> | break; >>>>>> >>>>>> save_state() calls save_npc(), which in turns might call >>>>>> gen_generic_branch(): >>>>> >>>>> Hmm. This is not the only instruction using save_state() and cpu_tmp0. >>>>> At least ldd is another example. >>>>> >>>>>> | static inline void gen_generic_branch(target_ulong npc1, target_ulong >>>>>> npc2, >>>>>> | TCGv r_cond) >>>>>> | { >>>>>> | int l1, l2; >>>>>> | >>>>>> | l1 = gen_new_label(); >>>>>> | l2 = gen_new_label(); >>>>>> | >>>>>> | tcg_gen_brcondi_tl(TCG_COND_EQ, r_cond, 0, l1); >>>>>> | >>>>>> | tcg_gen_movi_tl(cpu_npc, npc1); >>>>>> | tcg_gen_br(l2); >>>>>> | >>>>>> | gen_set_label(l1); >>>>>> | tcg_gen_movi_tl(cpu_npc, npc2); >>>>>> | gen_set_label(l2); >>>>>> | } >>>>>> >>>>>> And here is the TCG branch, which drop the TCG temp cpu_temp0. >>>>>> >>>>>> The solution is either to rewrite gen_generic_branch() without TCG >>>>>> branches, or to use a TCG temp local instead of a TCG temp. >>>>> >>>>> You mean something like >>>>> >>>>> case 6: // pstate >>>>> { >>>>> TCGv r_temp; >>>>> >>>>> r_temp = tcg_temp_new(); >>>>> tcg_gen_mov_tl(r_temp, cpu_tmp0); >>>>> save_state(dc, cpu_cond); >>>>> gen_helper_wrpstate(r_temp); >>>>> tcg_temp_free(r_temp); >>>>> dc->npc = DYNAMIC_PC; >>>>> } >>>>> break; >>>>> >>>>> ? >>>>> This fails with the same error message. >>>> >>>> Close, but you need to use tcg_temp_local_new(). Does this work? >>>> >>>> diff --git a/target-sparc/translate.c b/target-sparc/translate.c >>>> index 3c958b2..52fa2f1 100644 >>>> --- a/target-sparc/translate.c >>>> +++ b/target-sparc/translate.c >>>> @@ -3505,9 +3505,15 @@ static void disas_sparc_insn(DisasContext * dc) >>>> tcg_gen_mov_tl(cpu_tbr, cpu_tmp0); >>>> break; >>>> case 6: // pstate >>>> - save_state(dc, cpu_cond); >>>> - gen_helper_wrpstate(cpu_tmp0); >>>> - dc->npc = DYNAMIC_PC; >>>> + { >>>> + TCGv r_tmp = tcg_temp_local_new(); >>>> + >>>> + tcg_gen_mov_tl(r_tmp, cpu_tmp0); >>>> + save_state(dc, cpu_cond); >>>> + gen_helper_wrpstate(r_tmp); >>>> + tcg_temp_free_ptr(r_tmp); >>>> + dc->npc = DYNAMIC_PC; >>>> + } >>>> break; >>>> case 7: // tl >>>> save_state(dc, cpu_cond); >>>> >>> >>> This one works (I have tested similar change but tcg_temp_free() would >>> be OK as well) >> >> Thanks for spotting, that was not intentional. >> >>> Could anyone please explain why it still works without being in delay >>> slot? Simply pointing at optimizer does not work :( >> >> save_npc() will call gen_generic_branch() only if dc->npc == JUMP_PC, >> which is used for delay slot handling. Otherwise gen_generic_branch() >> is not called and thus brcond does not get used, so the temporary >> will not be flushed for non-delay slot cases. > > Great that explains it all, thanks! I think audit over translate.c > would be needed, alternatively all main switch temps for sparc64 could > be made local as hinted by Laurent - provided there is no significant > speed impact.
I think there may be a few other cases. Local temps are allocated from stack instead of CPU registers, so there could be a speed impact. It would be better if this case could be detected more reliably instead, for example enhancing the debugging mode for TCG.