On Tue, Mar 10, 2015 at 6:18 AM, Denys Vlasenko <[email protected]> wrote:
> On 03/10/2015 01:51 PM, Ingo Molnar wrote:
>>
>> * Denys Vlasenko <[email protected]> wrote:
>>
>>> PER_CPU(old_rsp) usage is simplified - now it is used only
>>> as temp storage, and userspace stack pointer is immediately stored
>>> in pt_regs->sp on syscall entry, instead of being used later,
>>> on syscall exit.
>>>
>>> Instead of PER_CPU(old_rsp) and task->thread.usersp, C code
>>> uses pt_regs->sp now.
>>>
>>> FIXUP/RESTORE_TOP_OF_STACK are simplified.
>>
>> Just trying to judge the performance impact:
>>
>>> --- a/arch/x86/kernel/entry_64.S
>>> +++ b/arch/x86/kernel/entry_64.S
>>> @@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64)
>>>   * manipulation.
>>>   */
>>>  .macro FIXUP_TOP_OF_STACK tmp offset=0
>>> -	movq PER_CPU_VAR(old_rsp),\tmp
>>> -	movq \tmp,RSP+\offset(%rsp)
>>>  	movq $__USER_DS,SS+\offset(%rsp)
>>>  	movq $__USER_CS,CS+\offset(%rsp)
>>>  	movq RIP+\offset(%rsp),\tmp  /* get rip */
>>> @@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64)
>>>  .endm
>>>
>>>  .macro RESTORE_TOP_OF_STACK tmp offset=0
>>> -	movq RSP+\offset(%rsp),\tmp
>>> -	movq \tmp,PER_CPU_VAR(old_rsp)
>>> +	/* nothing to do */
>>>  .endm
>>>
>>>  /*
>>> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
>>>   */
>>>  	ENABLE_INTERRUPTS(CLBR_NONE)
>>>  	ALLOC_PT_GPREGS_ON_STACK 8	/* +8: space for orig_ax */
>>> +	movq %rcx,RIP(%rsp)
>>> +	movq PER_CPU_VAR(old_rsp),%rcx
>>> +	movq %r11,EFLAGS(%rsp)
>>> +	movq %rcx,RSP(%rsp)
>>> +	movq_cfi rax,ORIG_RAX
>>>  	SAVE_C_REGS_EXCEPT_RAX_RCX_R11
>>>  	movq $-ENOSYS,RAX(%rsp)
>>> -	movq_cfi rax,ORIG_RAX
>>> -	movq %r11,EFLAGS(%rsp)
>>> -	movq %rcx,RIP(%rsp)
>>>  	CFI_REL_OFFSET rip,RIP
>>>  	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
>>>  	jnz tracesys
>>
>> So there are now +2 instructions (5 instead of 3) in the system_call
>> path, but there are -2 instructions in the SYSRETQ path,
>
> Unfortunately, no.
> There is only this change in SYSRETQ path,
> which simply changes where we get RSP from:
>
> @@ -293,7 +289,7 @@ ret_from_sys_call:
> 	CFI_REGISTER rip,rcx
> 	movq EFLAGS(%rsp),%r11
> 	/*CFI_REGISTER rflags,r11*/
> -	movq PER_CPU_VAR(old_rsp), %rsp
> +	movq RSP(%rsp),%rsp
> 	/*
> 	 * 64bit SYSRET restores rip from rcx,
> 	 * rflags from r11 (but RF and VM bits are forced to 0),
>
> Most likely, no change in execution speed here.
> At best, it is one cycle faster somewhere in address generation unit
> because for PER_CPU_VAR() address evaluation, GS base is nonzero.
>
>
> Since this patch does add two extra MOVs,
> I did benchmark these patches. They add exactly one cycle
> to system call code path on my Sandy Bridge CPU.
>
Personally, I'm willing to pay that cycle. It could be a bigger savings
on context switch, and the simplification it enables is pretty good.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC
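[Editor's note: for readers not steeped in the entry code, the commit message's point — C code now reads the user stack pointer from pt_regs->sp instead of PER_CPU(old_rsp) or task->thread.usersp — amounts to a change like the following. This is an illustrative kernel-side sketch, not a hunk from the actual patch; only pt_regs->sp, old_rsp, thread.usersp, and task_pt_regs() are taken from the discussion above.]

```
/* Illustrative sketch only -- not the actual patch.
 *
 * Before: the user RSP had to be fished out of a per-cpu slot or the
 * thread struct, which was only valid at certain points in the
 * syscall's lifetime:
 */
unsigned long usersp = task->thread.usersp;   /* or this_cpu_read(old_rsp) */

/*
 * After: the syscall entry path stores it into the register frame
 * immediately, so any C code can simply read the saved frame:
 */
struct pt_regs *regs = task_pt_regs(task);
unsigned long usersp = regs->sp;
```

The win is that pt_regs is valid for the whole time the task is in the kernel, which is what makes the FIXUP/RESTORE_TOP_OF_STACK simplification possible.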

