On Tue, 31 Aug 2021 at 10:53, Peter Zijlstra <pet...@infradead.org> wrote: > > On Tue, Aug 31, 2021 at 08:05:21AM +0000, Christophe Leroy wrote: > > > +#define ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name) \ > > + asm(".pushsection .text, \"ax\" \n" \ > > + ".align 4 \n" \ > > + ".globl " STATIC_CALL_TRAMP_STR(name) " \n" \ > > + STATIC_CALL_TRAMP_STR(name) ": \n" \ > > + " blr \n" \ > > + " nop \n" \ > > + " nop \n" \ > > + " nop \n" \ > > + ".type " STATIC_CALL_TRAMP_STR(name) ", @function \n" \ > > + ".size " STATIC_CALL_TRAMP_STR(name) ", . - " > > STATIC_CALL_TRAMP_STR(name) " \n" \ > > + ".popsection \n") > > > +static int patch_trampoline_32(u32 *addr, unsigned long target) > > +{ > > + int err; > > + > > + err = patch_instruction(addr++, ppc_inst(PPC_RAW_LIS(_R12, > > PPC_HA(target)))); > > + err |= patch_instruction(addr++, ppc_inst(PPC_RAW_ADDI(_R12, _R12, > > PPC_LO(target)))); > > + err |= patch_instruction(addr++, ppc_inst(PPC_RAW_MTCTR(_R12))); > > + err |= patch_instruction(addr, ppc_inst(PPC_RAW_BCTR())); > > + > > + return err; > > +} > > There can be concurrent execution and modification; the above doesn't > look safe in that regard. What happens if you've say, done the first > two, but not the latter two and execution happens (on a different > CPU or through IRQ context, etc..)? > > > +void arch_static_call_transform(void *site, void *tramp, void *func, bool > > tail) > > +{ > > + int err; > > + unsigned long target = (long)func; > > + > > + if (!tramp) > > + return; > > + > > + mutex_lock(&text_mutex); > > + > > + if (!func) > > + err = patch_instruction(tramp, ppc_inst(PPC_RAW_BLR())); > > + else if (is_offset_in_branch_range((long)target - (long)tramp)) > > + err = patch_branch(tramp, target, 0); > > These two are single instruction modifications and I'm assuming the > hardware is sane enough that execution sees either the old or the new > instruction. So this should work. > > > + else if (IS_ENABLED(CONFIG_PPC32)) > > + err = patch_trampoline_32(tramp, target); > > + else > > + BUILD_BUG(); > > + > > + mutex_unlock(&text_mutex); > > + > > + if (err) > > + panic("%s: patching failed %pS at %pS\n", __func__, func, > > tramp); > > +} > > +EXPORT_SYMBOL_GPL(arch_static_call_transform); > > One possible solution that we explored on ARM64, was having the > trampoline be in 2 slots: > > > b 1f > > 1: blr > nop > nop > nop > > 2: blr > nop > nop > nop > > Where initially the first slot is active (per "b 1f"), then you write > the second slot, and as a final act, re-write the initial branch to > point to slot 2. > > Then you execute synchronize_rcu_tasks() under your text mutex > (careful!) to ensure all users of your slot1 are gone and the next > modification repeats the whole thing, except for using slot1 etc.. > > Eventually I think Ard came up with the latest ARM64 proposal which puts > a literal in a RO section (could be the text section I suppose) and > loads and branches to that. >
Yes. The main reason is simply that anything else is premature optimization: we have a clear use case (CFI) where out-of-line static calls are faster than compiler generated indirect calls, even if the static call sequence is based on a literal load and an indirect branch, but CFI is not upstream [yet]. Once other use cases emerge, we will revisit this. > Anyway, the thing is, you can really only modify a single instruction at > the time and need to ensure concurrent execution is correct.