On Wed, Mar 18, 2015 at 3:17 PM, Denys Vlasenko <dvlas...@redhat.com> wrote:
> On 03/18/2015 10:55 PM, Andy Lutomirski wrote:
>> On Wed, Mar 18, 2015 at 2:42 PM, Denys Vlasenko <dvlas...@redhat.com> wrote:
>>> On 03/18/2015 10:32 PM, Linus Torvalds wrote:
>>>> On Wed, Mar 18, 2015 at 12:26 PM, Andy Lutomirski <l...@amacapital.net>
>>>> wrote:
>>>>>>
>>>>>> crash> disassemble page_fault
>>>>>> Dump of assembler code for function page_fault:
>>>>>>    0xffffffff816834a0 <+0>:     data32 xchg %ax,%ax
>>>>>>    0xffffffff816834a3 <+3>:     data32 xchg %ax,%ax
>>>>>>    0xffffffff816834a6 <+6>:     data32 xchg %ax,%ax
>>>>>>    0xffffffff816834a9 <+9>:     sub    $0x78,%rsp
>>>>>>    0xffffffff816834ad <+13>:    callq  0xffffffff81683620 <error_entry>
>>>>>
>>>>> The callq was the double-faulting instruction, and it is indeed the
>>>>> first function in here that would have accessed the stack. (The sub
>>>>> *changes* rsp but isn't a memory access.) So, since RSP is bogus, we
>>>>> page fault, and the page fault is promoted to a double fault. The
>>>>> surprising thing is that the page fault itself seems to have been
>>>>> delivered okay, and RSP wasn't on a page boundary.
>>>>
>>>> Not at all surprising, and sure it was on a page boundary..
>>>>
>>>> Look closer.
>>>>
>>>> %rsp is 00007fffa55eafb8.
>>>>
>>>> But that's *after* page_fault has done that
>>>>
>>>>     sub    $0x78,%rsp
>>>>
>>>> so %rsp when the page fault happened was 0x7fffa55eb030. Which is a
>>>> different page.
>>
>> Ah, I forgot to add 0x78. You're right, of course.
>>
>>>>
>>>> And that page happened to be mapped.
>>>>
>>>> So what happened is:
>>>>
>>>> - we somehow entered kernel mode without switching stacks
>>>>
>>>>   (ie presumably syscall)
>>>>
>>>> - the user stack was still fine
>>>>
>>>> - we took a page fault, which once again didn't switch stacks,
>>>> because we were already in kernel mode. And this page fault worked,
>>>> because it just pushed the error code onto the user stack which was
>>>> mapped.
>>>>
>>>> - we now took a second page fault within the page fault handler,
>>>> because now the stack pointer has been decremented and points one user
>>>> page down that is *not* mapped, so now that page fault cannot push the
>>>> error code and return information.
>>>>
>>>> Now, how we took that original page fault is sadly not very clear at
>>>> all. I agree that it's something about system-call (how could we not
>>>> change stacks otherwise), but why it should have started now, I don't
>>>> know. I don't think "system_call" has changed at all.
>>>>
>>>> Maybe there is something wrong with the new "ret_from_sys_call" logic,
>>>> and that "use sysret to return to user mode" thing. Because this code
>>>> sequence:
>>>>
>>>> +       movq (RSP-RIP)(%rsp),%rsp
>>>> +       USERGS_SYSRET64
>>>>
>>>> in 'irq_return_via_sysret' is new to 4.0, and instead of entering the
>>>> kernel with a user stack pointer, maybe we're *exiting* the kernel,
>>>> and have just reloaded the user stack pointer when "USERGS_SYSRET64"
>>>> takes some fault.
>>>
>>> Yes, so far we happily thought that SYSRET never fails...
>>>
>>> This merits adding some code which would at least BUG_ON
>>> if the faulting address is seen to match SYSRET64.
>>
>> sysret64 can only fail with #GP, and we're totally screwed if that
>> happens, although I agree about the BUG_ON in principle. Where would
>> we add it that would help in this case, though? We never even made it
>> to C code.
>>
>> In any event, this was a page fault. sysret64 doesn't access memory.
>
> Let's see.
>
> Faulting SYSRET will still be in CPL0.
> It would drop CPU into the #GP handler
> but %rsp is already loaded with _user_ %rsp (!).
>
> #GP handler will start pushing stuff onto stack,
> happily thinking that it is a kernel stack.
>
> This can cause a page fault.
>
> Most likely, this page fault won't succeed,
> and we'd get a double fault with %rip somewhere in #GP handler.
>
> Yes, this doesn't entirely match what we see...
>
> There is an easy way to test the theory that SYSRET is to blame.
>
> Just replace
>
>         movq RCX(%rsp),%rcx
>         cmpq %rcx,RIP(%rsp)     /* RCX == RIP */
>         jne opportunistic_sysret_failed
>
> this "jne" with "jmp", and try to reproduce.
>
This is a classic root exploit, and it's why we check for non-canonical
RIP. In theory, that's the only way this can happen. Intel screwed up --
AMD never fails SYSRET.

--Andy