Hi - 

The vulnerability on Intel machines, where a sysret/sysexit to a
non-canonical %rip faults while still in kernel mode, has come up a few
times, such as in a027bf8f981e ("x86: Fixes context security").  I
thought about it a little more
today, and it turned into a nice explanation of how the UVPT/VPT
mappings work that some of you might like.

There are at least two ways to make the kernel pop a non-canon address.
- Just hack up a sw_ctx and ask the kernel to pop it.  We enforce that
  the user's RIP is canonical when popping something the user
  controlled.  That's what the commit above fixes.
- Run something that ends up with a sw_ctx whose TF holds a
  non-canonical RIP.

For the latter, the user can't write-mmap the topmost page below the
canonical limit, so they can't *write* a syscall instruction (0x0f 0x05)
as the last two bytes there.  (The usual attack here is to write those
values to 0x00007ffffffffffe, then jmp 0x00007ffffffffffe, such that
your %rip after the SYSCALL becomes 0x0000800000000000, which is
non-canonical.)

But could they trick the kernel into writing 0f 05 to that location?
That memory is the UVPT window.  What does the user see when it looks at
0x00007ffffffffffe?  If it can ever see 0f 05, then we're hosed.

To know if that's a problem, we need to understand the UVPT mapping.


Recall that addresses in the UVPT region are a window into all of the
page tables of the process.  The kernel has a similar window mapped at
VPT.  UVPT is user read-only.  VPT/UVPT is one of the things we
inherited from JOS / the old 'texas' OS lab. 

The page table itself is mapped at 0x00007f8000000000 (which is the
symbol UVPT).  Specifically:

        boot_kpt[PML4(UVPT)] = PADDR(boot_kpt) | PTE_U | PTE_P;

        Which means the PTE in the PML4 page table corresponding to
        UVPT points to the page table itself.


Let's break 0x00007ffffffffffe down into the respective parts for its
four PML walks.

Step :bits: value
--------------------------------------
PGOFF:  12: 111111111110
PML1 :   9: 111111111
PML2 :   9: 111111111
PML3 :   9: 111111111
PML4 :   9: 011111111

Convince yourself the PML4 part selects 0x00007f8000000000 in the
overall address space.  That entry points to the actual page table
(read-only).

The pml3 part now selects a PTE from that page table.  The walk we're
doing is on the pml3 stage, but the table we're looking at consists of
PML4 entries - that's one step back in the page-walk.  The
'one-step-back' magic is how we'll be able to see the page table values
at the end.  

Another way to put it: the pml3-walk selects the PTE from the page table
corresponding to its bits, as if it were a PML4 entry.  Thus, those 9
bits of the pml3-walk find the PTE as if they were the bits of a
pml4-walk.  pml3's bits normally sit at [38..30] of the virtual address.
Because the PML4 entry walked back to itself, dropping one level from
the page-table walk, the pml3 walk is now telling us about the *page
table contents* as if pml3's bits were at [47..39].

All this magic is just about using the page walking mechanisms to find
a page table at the end, instead of a regular page.  That page table is
physical memory.  The hardware page-table walk gives us access to that
page via the magic UVPT/VPT virtual address.

The same shifting goes for PML2 and PML1.  So when we do a memory
reference for 0x00007ffffffffffe, we can just shift the PGADDR
(i.e., ignoring the PGOFF 12 bits) left by 9 (a PML_SHIFT), keeping
only the low 48 bits (the PML4 bits shift out the top), to see which
address range's page table we'll end up looking at.  That lookup should
get 0x0000ffffffe00000 (27 bits of 1, shifted left over 21 0s).

That address has the 48th bit == 1, so the 1 gets sign-extended and the
address is actually 0xffffffffffe00000.

So any UVPT lookups from [0x00007ffffffff000, 0x00007fffffffffff] get
directed to the pml1 page table (physical) for whatever is mapped
at virtual address [0xffffffffffe00000, 0xffffffffffffffff].  That's 21
bits of address space, which corresponds to a PML2_PTE_REACH (1 << 21).
(recall that a PML2 PTE points to a PML1 page table, so the reach of a
PML2 PTE is the reach of a PML1 PT).

As a sanity check, you can draw out the PML walks for each of those
intermediate tables.  Or you can cheat and notice that each
intermediate PTE selects the last PTE in the page table, so you know
we'd access the top-most part of the address space.

So now that we're looking at the PML1 for [0xffffffffffe00000,
0xffffffffffffffff], what is it we actually see when we do the various
page offsets?  e.g. 0xffe?  We're looking at the contents of a PML1 page
table, made up of 2^9 8-byte PTEs.  Those PTEs point to the final
physical pages and carry the various permission bits.

The top 9 bits of 0xffe select a PTE (specifically 0b111111111), and
the lower 3 bits (0b110) index bytes within that PTE.  Those 9 bits are
selecting a chunk from the PML1's reach.  We can append those bits to
0xffffffffffe00000 and see that we're now looking at the PTE for
0xfffffffffffff000.

Given that all the bits are 1, this should be no surprise (topmost
page of the address space).  In fact, revisiting the step where we
ignored the PGOFF earlier: we could have ignored just the lower 3 bits
(the bits that index within a PTE) and kept the 9 bits that index
within the PML1.

Specifically, given a UVPT/VPT address 

        e.g. 0x00007ffffffffffe

We mask the lower 3 bits, (and save them, 0b110):

        0x00007ffffffffff8

Remove the PML4 bits (0x00007f8000000000) (part of the walk):
 
        0x0000007ffffffff8

Shift that left by PML_SHIFT (9):

        0x0000fffffffff000

Sign-extend if necessary:

        0xfffffffffffff000

and now we're looking at the PTE for that address.  (note that this was
a lot easier to do by hand since the bits were all 1s).

Now, what does that PTE look like?  The lower 12 bits are permissions
and flags, and the upper bits are the paddr of the actual physical page
backing 0xfffffffffffff000.  Assuming the page is mapped.  It could be
0, or gibberish if PTE_P == 0.

Normally, at this point, you'd look at the PTE and look up various fun
things.  But in our case, we want to know if we can trick the kernel
into putting 0x0f 0x05 at offset 0b110 within that PTE.  Let's call
that offset 6 from now on, instead of looking at the binary.

That means we need byte 6 to be 0x0f and byte 7 to be 0x05.  So
that'd be a PTE of the form 0x050f------------.  (Imagine reading the
PTE's bytes from right to left; the '-' are don't-cares.)  That would
require the kernel to map 0xfffffffffffff000 to a physical address of
the form 0x050f---------.  We currently map the kernel from virtual
[0xffffffffc0000000, 0xffffffffffffffff] to physical address 0, so
0xfffffffffffff000 is mapped to 0x000000003ffff000 (i.e., vaddr -
KERN_LOAD_ADDR, which is what PADDR() does).  So that mapping does not
point to a paddr of the form 0x050f---------, and we're ok.  For now.

If the user could somehow trick the kernel to remap that address, then
there's all sorts of other things it could do too, and we're hosed
anyways.

Ah, but wait, there's a little more, and it involves jumbo pages!
Let's play around with showmapping, which is a monitor tool built to
look at the page tables.

ROS(Core 0)> showmapping 1 0x00007ffffffffffe
           Virtual            Physical  Ps Dr Ac CD WT U W P EPTE
-----------------------------------------------------------------
0x00007ffffffff000  0x0000000000000000  1  1  1  0  0  0 0 1 0x7fc0000083

Why is this 0?  Note the PS bit is set, and scroll up a bit.  You'll
probably see 

        1 GB Jumbo pages supported

The kernel attempts to use a 1 GB jumbo page for this mapping, so the
VA does map to 0.  Just note that showmapping is silently rounding off
the jumbo offset from the virtual address you asked for.

If we run on a machine without 1 GB pages (and only 2MB jumbos) (or
hack entry64.S to use 2MB pages) we get:

ROS(Core 0)> showmapping 1 0x00007ffffffffffe
           Virtual            Physical  Ps Dr Ac CD WT U W P EPTE
-----------------------------------------------------------------
0x00007ffffffff000  0x000000003fe00000  1  0  0  0  0  0 0 1 0x3fe000e3

That's still a jumbo page, but the paddr is for a 2MB one (21 bits of
0).

So what's happening now, when we access 0x00007ffffffffffe?  Or what's
happening at all, when we do this UVPT-walk into jumbo pages?

Ultimately, the UVPT/VPT mappings use the hardware page-table walk to 
provide a window into a particular physical page.  For the full PML
walks (i.e. down to the contents of a pml1), this is the contents of a
physical page that is a page table.  However, if the walk stops short,
due to a jumbo page, then the hardware walk gives us a window into a
jumbo page - not just a 4K page-table page.  That means a UVPT walk that
hits a jumbo might have read access to something other than its page
tables!

Let's look at two cases: PML2 jumbo (a 2MB page) and PML3 jumbo (1
GB).  Recall that a given PML level n is walked at level n - 1, due
to the UVPT/VPT mapping.  

When we do the UVPT walk, it looks like this:


Walk Stage: PML4 -> PML3  -> PML2  -> PML1  -> 4K Phys page
-------------------------------------------------------
Normally  : UVPT -> PML4  -> PML3  -> PML2  -> PML1        -> 4K Phys Pg
2MB Jumbo : UVPT -> PML4  -> PML3  -> PML2j -> 2MB Phys Pg 
1GB Jumbo : UVPT -> PML4  -> PML3j -> 1 GB Phys Pg

The hardware walk stops at "Phys page", or when it finds a jumbo.  In
the normal case, it stops on a PML1 and thinks it is a physical page.
This is okay, since we want the user to see the PML1.

So for the PML2 jumbo (in the original address space), the actual UVPT
walk sees the PML2 PTE when it is doing the final PML1 walk (meaning,
we're on the last walk step, trying to find an actual page, where we'd
normally be looking at PML1 PTEs).  The hardware thinks it is looking
at a PML1 PTE, but it is actually looking at a PML2 PTE (called PML2j
in the table above).

In this case, PTE_PS (the jumbo bit) actually means PTE_PAT (page
attribute table), which controls various things like caching.  It isn't
an actual jumbo page.  So the j in PML2j is ignored, as far as jumbo
pages go.

However, the contents of the PTE at that point aren't to a PML1, they
are to a 2MB page.  The hardware will let the user access (read) the
contents of the first 4K of that 2MB jumbo page through the UVPT
window.  Yikes!

Let's consider a PML3 jumbo (1GB) in the original address space.

In the 1GB jumbo case, we had a jumbo PML3 entry (PML3j).  The PML3
table holding it is at some arbitrary physical address, some 4KB
page-table page (and for those curious, the page directly above it
(physically) is the corresponding EPT PML3).  However, the hardware
thinks we're at the PML2 stage when it reads PML3j, so the *walk
stops*, and you have a window into 2MB of memory at the address PML3j
points to.

The hardware walked to the physical address in the PML3j and thinks it
is a PML2.  It also thinks it is a 2MB PTE.  So the hardware will let
the user access 2MB of memory, starting at whatever address is in the
PML3j - basically the first 2 MB of the 1 GB page.

Back to our example address: 

        0x00007ffffffffffe

That stops walking at PML2, so let's split it at the jumbo offset mark:

        0x00007fffffe00000 and offset 0x1ffffe

0x00007fffffe00000 will map to the 1 GB jumbo page that has paddr = 0.
So the user has read access to the first two MB of physical memory
through that hole.  Whatever magic happens to be at 0x1ffffe physical
will be readable from 0x00007ffffffffffe.  And this is true for
whatever 1 GB page happens to be mapped into the address space.

Fear not, all hope is not lost.  Let's try it out.  I dropped this
into hello.c:

        hexdump(stdout, (void*)0x00007ffffffffff0, 16);

Thread has unhandled fault: 14, err: 5, aux: 0x7ffffffffff0
[user] HW TRAP frame 0x0000000000615148

Hmm, it page faulted, what gives?

Recall that I said that userspace has *read* access to its page tables
from the UVPT.  Here are the two mappings (k/a/x/pmap64.c):

    /* VPT mapping: recursive PTE inserted at the VPT spot */
    boot_kpt[PML4(VPT)] = PADDR(boot_kpt) | PTE_W | PTE_P;
    /* same for UVPT, accessible by userspace (RO). */
    boot_kpt[PML4(UVPT)] = PADDR(boot_kpt) | PTE_U | PTE_P;

The VPT (kernel's mapping) is PTE_W, the user's is PTE_U.  The way
permissions work on page tables is that to be writable, you must be
writable on every PTE in the walk.  Since the PML4 mapping for UVPT
does not have PTE_W, then the final mapping will not be writable.  Good.

Consider inspecting the *kernel's* address space: when using either the
VPT or UVPT window, we also walk the intermediate PTs, and for the
kernel mappings, those do not have PTE_U set.  So any user walk will
fail.

In the "Walk stage" table above (regardless of jumbo), when the hardware
is on PML3, it is looking at a PML4, and that PML4 is for the kernel's
address space.  That does not have the PTE_U bit set, so the user walk
fails.  Even if there were a 512 GB, PML4 jumbo, the walk would still
fail.

Due to the UVPT mapping, the user cannot access 0x00007ffffffffffe,
even in a read-only manner.  So we're OK from the sysret vulnerability.

However, the kernel still can see the memory beyond the page table -
using either the VPT or the UVPT mapping.  Try this out:

in k/s/init.c:

        uint16_t* kva = KADDR(0x1ffffe);
        *kva = 0xbeef; 

(It's 0 normally, so set it to something we can see).

ROS(Core 0)> kfunc pahexdump 0x1ffffe 2
ffff8000001ffffe: ef be 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

Then get a process spinning and enter the monitor on its core, so that
the kernel is in the right address space (breakpoint will work, while 1
in an SCP, whatever you want).

ROS(Core 1)> kfunc hexdump 0x00007ffffffffffe 2
7ffffffffffe: ef beHW TRAP frame at 0xffff80003fd09b80 on core 1
  rax  0x0000000000000003
  rbx  0x0000000000000002
  rcx  0x00000000000003d4
  rdx  0x0000000000001ca3
  rbp  0xffff80003fd09c80
  rsi  0x00000000000007a0
  rdi  0xffffffffc21ae000
  r8   0x00000000000003f8
  r9   0x0000000000000020
  r10  0x000000000000000d
  r11  0x000000000000000a
  r12  0x000080000000000e
  r13  0x00007ffffffffffe
  r14  0x0000000000000000
  r15  0x0000800000000000
  trap 0x0000000d General Protection
  gsbs 0xffffffffc7da53c0
  fsbs 0x0000000000000000
  err  0x--------00000000
  rip  0xffffffffc200c340
  cs   0x------------0008
  flag 0x0000000000010093
  rsp  0xffff80003fd09c40
  ss   0x------------0000

What happened here?  The kernel was able to read the first two bytes
(see the ef and be), but hexdump tries to keep on reading (it seems to
round up '2'), then it GPF'd since it tried to read a non-canonical
address (note r15, and rip):

ffffffffc200c340:   41 0f b6 37             movzbl (%r15),%esi

So feel free to ignore the GPF.

There is still one issue to consider.  We showed the user can't use
the UVPT mapping to walk any of the kernel's PTEs; the concern there
was that the user could stop short on a jumbo and see extra memory.

However, if we have any jumbo pages mapped PTE_U, (i.e. in the user's
address space, usually where the address we are inspecting via the
window is below ULIM), the user can use the window to walk those.

Say the user has a 1 GB jumbo page.  Then when they inspect that
mapping via UVPT:  (recall the table)

Walk Stage: PML4 -> PML3  -> PML2  -> PML1  -> 4K Phys page
-------------------------------------------------------
1GB Jumbo : UVPT -> PML4  -> PML3j -> 1 GB Phys Pg

They will have a 2MB window into whatever PML3j points to.  But
remember that that mapping is for a *user* page.  The user can use UVPT
to gain a read-only window of 2 MB into its own 1 GB page.  And since
page table permissions require an AND at each level for PTE_U and
PTE_W, they won't be able to see anything they shouldn't already be
seeing.  

The jumbo case is fine, since it's the intended final page of the
walk.  The only concern would be if we could somehow see a 2 MB window
into an *intermediate* 4 KB page table of kernel memory.  That will not
happen, since once we have the jumbo bit (PTE_PS) set, the paddr is
pointing to the actual mapped memory, and it does not point to an
intermediate page table.

So all is well in the world, for now.

Good times!

Barret

-- 
You received this message because you are subscribed to the Google Groups 
"Akaros" group.