Tao Chiu <andyb...@gmail.com> writes: > Björn Töpel <bj...@kernel.org> 於 2024年9月11日 週三 下午10:37寫道: > >> >> Andy Chiu <andyb...@gmail.com> writes: >> >> > On Wed, Aug 14, 2024 at 02:57:52PM +0200, Björn Töpel wrote: >> >> Björn Töpel <bj...@kernel.org> writes: >> >> >> >> > Andy Chiu <andy.c...@sifive.com> writes: >> >> > >> >> >> We use an AUIPC+JALR pair to jump into a ftrace trampoline. Since >> >> >> instruction fetch can break down to 4 byte at a time, it is impossible >> >> >> to update two instructions without a race. In order to mitigate it, we >> >> >> initialize the patchable entry to AUIPC + NOP4. Then, the run-time code >> >> >> patching can change NOP4 to JALR to eable/disable ftrcae from a >> >> > enable ftrace >> >> > >> >> >> function. This limits the reach of each ftrace entry to +-2KB >> >> >> displacing >> >> >> from ftrace_caller. >> >> >> >> >> >> Starting from the trampoline, we add a level of indirection for it to >> >> >> reach ftrace caller target. Now, it loads the target address from a >> >> >> memory location, then perform the jump. This enable the kernel to >> >> >> update >> >> >> the target atomically. >> >> > >> >> > The +-2K limit is for direct calls, right? >> >> > >> >> > ...and this I would say breaks DIRECT_CALLS (which should be implemented >> >> > using call_ops later)? >> >> >> >> Thinking a bit more, and re-reading the series. >> >> >> >> This series is good work, and it's a big improvement for DYNAMIC_FTRACE, >> >> but >> >> >> >> +int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr) >> >> +{ >> >> + unsigned long distance, orig_addr; >> >> + >> >> + orig_addr = (unsigned long)&ftrace_caller; >> >> + distance = addr > orig_addr ? addr - orig_addr : orig_addr - addr; >> >> + if (distance > JALR_RANGE) >> >> + return -EINVAL; >> >> + >> >> + return __ftrace_modify_call(rec->ip, addr, false); >> >> +} >> >> + >> >> >> >> breaks WITH_DIRECT_CALLS. The direct trampoline will *never* be within >> >> the JALR_RANGE. >> > >> > >> > Yes, it is hardly possible that a direct trampoline will be within the >> > range. >> > >> > Recently I have been thinking some solutions to address the issue. One >> > solution is replaying AUIPC at function entries. The idea has two sides. >> > First, if we are returning back to the second instruction at trap return, >> > then do sepc -= 4 so it executes the up-to-date AUIPC. The other side is >> > to fire synchronous IPI that does remote fence.i at right timings to >> > prevent concurrent executing on a mix of old and new instructions. >> > >> > Consider replacing instructions at a function's patchable entry with the >> > following sequence: >> > >> > Initial state: >> > -------------- >> > 0: AUIPC >> > 4: JALR >> > >> > Step1: >> > write(0, "J +8") >> > fence w,w >> > send sync local+remote fence.i >> > ------------------------ >> > 0: J +8 >> > 4: JALR >> > >> > Step2: >> > write(4, "JALR'") >> > fence w,w >> > send sync local+remote fence.i >> > ------------------------ >> > 0: J +8 >> > 4: JALR' >> > >> > Step3: >> > write(0, "AUIPC'") >> > fence w,w >> > send sync local+remote fence.i (to activate the call) >> > ----------------------- >> > 0: AUIPC' >> > 4: JALR' >> > >> > The following execution sequences are acceptable: >> > - AUIPC, JALR >> > - J +8, (skipping {JALR | JALR'}) >> > - AUIPC', JALR' >> > >> > And here are sequences that we want to prevent: >> > - AUIPC', JALR >> > - AUIPC, JALR' >> > >> > The local core should never execute the forbidden sequence. >> > >> > By listing all possible combinations of executing sequence on a remote >> > core, we can find that the dangerous seqence is impossible to happen: >> > >> > let f be the fence.i at step 1, 2, 3. And let numbers be the location of >> > code being executed. Mathematically, here are all combinations at a site >> > happening on a remote core: >> > >> > fff04 -- updated seq >> > ff0f4 -- impossible, would be ff0f04, updated seq >> > ff04f -- impossible, would be ff08f, safe seq >> > f0ff4 -- impossible, would be f0ff04, updated seq >> > f0f4f -- impossible, would be f0f08f (safe), or f0f0f04 (updated) >> > f04ff -- impossible, would be f08ff, safe seq >> > 0fff4 -- impossible, would be 0fff04, updated seq >> > 0ff4f -- impossible, would be 0ff08f (safe), or 0ff0f04 (updated) >> > 0f4ff -- impossible, would be 0f08ff (safe), 0f0f08f (safe), 0f0f0f04 >> > (updated) >> > 04fff -- old seq >> > >> > After the 1st 'fence.i', remote cores should observe (J +8, JALR) or (J >> > +8, JALR') >> > After the 2nd 'fence.i', remote cores should observe (J +8, JALR') or >> > (AUIPC', JALR') >> > After the 3rd 'fence.i', remote cores should observe (AUIPC', JALR') >> > >> > Remote cores should never execute (AUIPC',JALR) or (AUIPC,JALR') >> > >> > To correctly implement the solution, the trap return code must match JALR >> > and adjust sepc only for patchable function entries. This is undocumently >> > possible because we use t0 as source and destination registers for JALR >> > at function entries. Compiler never generates JALR that uses the same >> > register pattern. >> > >> > Another solution is inspired by zcmt, and perhaps we can optimize it if >> > the hardware does support zcmt. First, we allocate a page and divide it >> > into two halves. The first half of the page are 255 x 8B destination >> > addresses. Then, starting from offset 2056, the second half of the page >> > is composed by a series of 2 x 4 Byte instructions: >> > >> > 0: ftrace_tramp_1 >> > 8: ftrace_tramp_2 >> > ... >> > 2040: ftrace_tramp_255 >> > 2048: ftrace_tramp_256 (not used when configured with 255 tramps) >> > 2056: >> > ld t1, -2048(t1) >> > jr t1 >> > ld t1, -2048(t1) >> > jr t1 >> > ... >> > 4088: >> > ld t1, -2048(t1) >> > jr t1 >> > 4096: >> > >> > It is possible to expand to 511 trampolines by adding a page >> > below, and making a load+jr sequence from +2040 offset. >> > >> > When the kernel boots, we direct AUIPCs at patchable entries to the page, >> > and disable the call by setting the second instruction to NOP4. Then, we >> > can effectively enable/disable/modify a call by setting only the >> > instruction at JALR. It is possible to utilize most of the current patch >> > set to achieve atomic patching. A missing part is to allocate and manage >> > trampolines for ftrace users. >> >> (I will need to digest above in detail!) >> >> I don't think it's a good idea to try to handle direct calls w/o >> call_ops. What I was trying to say is "add the call_ops patch to your >> series, so that direct calls aren't broken". If direct calls depend on >> call_ops -- sure, no worries. But don't try to get direct calls W/O >> call_ops. That's a whole new bag of worms. >> >> Some more high-level thoughts: ...all this to workaround where we don't >> want the call_ops overhead? Is there really a use-case with a platform >> that doesn't handle the text overhead of call_ops? > > Sorry for making any confusions. I have no strong personal preference > on what we should do. Just want to have a technical discussion on what > is possible if we want to optimize code size.
Understood! I'm -- as you know -- really eager to have a text patching mechanism that is not horrible. ;-) I'd rather wait with the text size optimizations. TL;DR -- your series is fine from my POV, but it's missing call_ops, so that it doesn't break direct calls. I will take your patch series, and Puranjay's call_ops series see how they play together. We can discuss it at LPC next week! Cheers, Björn