On Wed, 1 Aug 2018, Dave Hansen wrote:
>
> From: Dave Hansen <[email protected]>
>
> The kernel image is mapped into two places in the virtual address
> space (addresses without KASLR, of course):
>
>	1. The kernel direct map  (0xffff880000000000)
>	2. The "high kernel map"  (0xffffffff81000000)
>
> We actually execute out of #2.  If we get the address of a kernel
> symbol, it points to #2, but almost all physical-to-virtual
> translations point to #1.
>
> Parts of the "high kernel map" alias are mapped in the userspace
> page tables with the Global bit for performance reasons.  The
> parts that we map to userspace do not (er, should not) have
> secrets.
>
> This is fine, except that some areas in the kernel image that
> are adjacent to the non-secret-containing areas are unused holes.
> We free these holes back into the normal page allocator and
> reuse them as normal kernel memory.  The memory will, of course,
> get *used* via the normal map, but the alias mapping is kept.
>
> This otherwise unused alias mapping of the holes will, by default,
> keep the Global bit, be mapped out to userspace, and be
> vulnerable to Meltdown.
>
> Remove the alias mapping of these pages entirely.  This is likely
> to fracture the 2M page mapping the kernel image near these areas,
> but it should affect only a minority of the area.
>
> This unmapping behavior is currently dependent on PTI being in
> place.  Going forward, we should at least consider doing this for
> all configurations.  Having an extra read-write alias for memory
> is not exactly ideal for debugging things like random memory
> corruption, and it does undercut features like DEBUG_PAGEALLOC
> or future work like eXclusive Page Frame Ownership (XPFO).
>
> Before this patch:
>
> current_kernel:---[ High Kernel Mapping ]---
> current_kernel-0xffffffff80000000-0xffffffff81000000    16M                 pmd
> current_kernel-0xffffffff81000000-0xffffffff81e00000    14M  ro  PSE GLB x  pmd
> current_kernel-0xffffffff81e00000-0xffffffff81e11000    68K  ro      GLB x  pte
> current_kernel-0xffffffff81e11000-0xffffffff82000000  1980K  RW         NX  pte
> current_kernel-0xffffffff82000000-0xffffffff82600000     6M  ro  PSE GLB NX pmd
> current_kernel-0xffffffff82600000-0xffffffff82c00000     6M  RW  PSE     NX pmd
> current_kernel-0xffffffff82c00000-0xffffffff82e00000     2M  RW         NX  pte
> current_kernel-0xffffffff82e00000-0xffffffff83200000     4M  RW  PSE     NX pmd
> current_kernel-0xffffffff83200000-0xffffffffa0000000   462M                 pmd
>
> current_user:---[ High Kernel Mapping ]---
> current_user-0xffffffff80000000-0xffffffff81000000      16M                 pmd
> current_user-0xffffffff81000000-0xffffffff81e00000      14M  ro  PSE GLB x  pmd
> current_user-0xffffffff81e00000-0xffffffff81e11000      68K  ro      GLB x  pte
> current_user-0xffffffff81e11000-0xffffffff82000000    1980K  RW         NX  pte
> current_user-0xffffffff82000000-0xffffffff82600000       6M  ro  PSE GLB NX pmd
> current_user-0xffffffff82600000-0xffffffffa0000000     474M                 pmd
>
> After this patch:
>
> current_kernel:---[ High Kernel Mapping ]---
> current_kernel-0xffffffff80000000-0xffffffff81000000    16M                 pmd
> current_kernel-0xffffffff81000000-0xffffffff81e00000    14M  ro  PSE GLB x  pmd
> current_kernel-0xffffffff81e00000-0xffffffff81e11000    68K  ro      GLB x  pte
> current_kernel-0xffffffff81e11000-0xffffffff82000000  1980K                 pte
> current_kernel-0xffffffff82000000-0xffffffff82400000     4M  ro  PSE GLB NX pmd
> current_kernel-0xffffffff82400000-0xffffffff82488000   544K  ro         NX  pte
> current_kernel-0xffffffff82488000-0xffffffff82600000  1504K                 pte
> current_kernel-0xffffffff82600000-0xffffffff82c00000     6M  RW  PSE     NX pmd
> current_kernel-0xffffffff82c00000-0xffffffff82c0d000    52K  RW         NX  pte
> current_kernel-0xffffffff82c0d000-0xffffffff82dc0000  1740K                 pte
>
> current_user:---[ High Kernel Mapping ]---
> current_user-0xffffffff80000000-0xffffffff81000000      16M                 pmd
> current_user-0xffffffff81000000-0xffffffff81e00000      14M  ro  PSE GLB x  pmd
> current_user-0xffffffff81e00000-0xffffffff81e11000      68K  ro      GLB x  pte
> current_user-0xffffffff81e11000-0xffffffff82000000    1980K                 pte
> current_user-0xffffffff82000000-0xffffffff82400000       4M  ro  PSE GLB NX pmd
> current_user-0xffffffff82400000-0xffffffff82488000     544K  ro         NX  pte
> current_user-0xffffffff82488000-0xffffffff82600000    1504K                 pte
> current_user-0xffffffff82600000-0xffffffffa0000000     474M                 pmd
>
> Signed-off-by: Dave Hansen <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Juergen Gross <[email protected]>
> Cc: Josh Poimboeuf <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Andi Kleen <[email protected]>
> ---
>
>  b/arch/x86/mm/init.c |   22 ++++++++++++++++++++--
>  1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff -puN arch/x86/mm/init.c~x86-unmap-freed-areas-from-kernel-image arch/x86/mm/init.c
> --- a/arch/x86/mm/init.c~x86-unmap-freed-areas-from-kernel-image	2018-07-30 09:53:14.862915689 -0700
> +++ b/arch/x86/mm/init.c	2018-07-30 09:53:14.866915689 -0700
> @@ -778,8 +778,26 @@ void free_init_pages(char *what, unsigne
>   */
>  void free_kernel_image_pages(void *begin, void *end)
>  {
> -	free_init_pages("unused kernel image",
> -			(unsigned long)begin, (unsigned long)end);
> +	unsigned long begin_ul = (unsigned long)begin;
> +	unsigned long end_ul = (unsigned long)end;
> +	unsigned long len_pages = (end_ul - begin_ul) >> PAGE_SHIFT;
> +
> +
> +	free_init_pages("unused kernel image", begin_ul, end_ul);
> +
> +	/*
> +	 * PTI maps some of the kernel into userspace.  For
> +	 * performance, this includes some kernel areas that
> +	 * do not contain secrets.  Those areas might be
> +	 * adjacent to the parts of the kernel image being
> +	 * freed, which may contain secrets.  Remove the
> +	 * "high kernel image mapping" for these freed areas,
> +	 * ensuring they are not even potentially vulnerable
> +	 * to Meltdown regardless of the specific optimizations
> +	 * PTI is currently using.
> +	 */
> +	if (cpu_feature_enabled(X86_FEATURE_PTI))
> +		set_memory_np(begin_ul, len_pages);
>  }
>
>  void __ref free_initmem(void)
> _
Ironically, that set_memory_np() is giving me a problem.  I don't see it
when booting the 8GB laptop normally, but when booting with "mem=1G", I
get a not-present fault when ext4_iget() is trying to do its business in
starting init.  But it boots fine with "mem=1G nopti".

I get the feeling that set_memory_np() is marking those freed pages as
not-present in the direct map too, so they're no longer usable at all.

I can jot down some console messages if you need, but I hope I've said
enough for you to see it immediately, and just say whoops, forget 5/5?

Hugh

