Huge pages braindump

Stefan Kristiansson Wed, 31 Jul 2013 00:50:09 -0700

On Tue, Jul 30, 2013 at 02:35:01PM +0200, Jonas Bonn wrote:
> The arch spec allows PTEs to be located either in a level-1 page
> directory or in a level-2 page table.  A bit "L" in the page
> directory entry (level 1) indicates whether the entry points to a
> page containing a page table or whether it points to a "huge page".
> A huge page has a 24-bit offset and is thus 16MB in size.
> 
> When a page is "huge", the TLB needs to know about it.  That's what
> the PL1 bit is for.  I'd like to see this bit renamed HUGE in order
> to indicate that it's just matching the high 8 bits of the page
> frame when looking for a translation.
> 
> An example user of the "huge page" mechanism would be the Linux
> kernel which maps itself into contiguous physical memory from 0 to
> end_of_kernel.  If we carefully manage the fact that it's not using
> 16MB of physical memory, we could use the "huge page" mechanism to
> prevent a lot of TLB misses when accessing kernel space code and
> data.
> 
> (Of course, 16MB might actually be too large for reasonable huge
> pages... 2MB or 1MB might be better, see end of this mail)
> 
> For this to work, the PL1 bit would need to be implemented... the
> fact that it's not today is a bug in all our implementations as it's
> not an optional feature.
>


Yes, this seems to be a valid explanation of what the arch spec describes
(although the arch spec's way of saying this is a lot less clear).
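
As a rough illustration of the explanation above, here is a minimal sketch
of how a 32-bit virtual address splits up with 8kB pages, and how a TLB
match would differ when the HUGE (PL1) bit is set. The function names and
the idea of storing a full virtual page number in the TLB entry are
assumptions made for the sketch, not something taken from the arch spec.

```python
PAGE_SHIFT = 13   # 8kB pages: 13-bit offset
L2_BITS = 11      # 8kB page table of 4-byte PTEs: 2048 entries
L1_SHIFT = 24     # what remains is an 8-bit level-1 index

def split_va(va):
    """Split a 32-bit virtual address into (L1 index, L2 index, offset)."""
    l1 = va >> L1_SHIFT
    l2 = (va >> PAGE_SHIFT) & ((1 << L2_BITS) - 1)
    offset = va & ((1 << PAGE_SHIFT) - 1)
    return l1, l2, offset

def tlb_match(entry_vpn, entry_huge, va):
    """Compare a TLB entry against a virtual address.

    entry_vpn is assumed to hold the virtual page number (va >> 13).
    For a huge entry only the high 8 bits take part in the compare,
    so one entry covers a whole 16MB region.
    """
    vpn = va >> PAGE_SHIFT
    if entry_huge:
        return (entry_vpn >> L2_BITS) == (vpn >> L2_BITS)
    return entry_vpn == vpn
```

With a huge entry for the region starting at 0x01000000, any address up to
0x01ffffff hits the same TLB entry, whereas a normal entry only covers its
own 8kB page.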

> Some changes along these lines that may be needed in the arch spec are:
> 
> 8.4.1 DMMUCR
> PTBP should be bits 31-13, not 31-10... page frames are always 8kB
> in size and need to be page aligned
> 
> 8.4.2 DMMUPR
> Drop this register altogether (see 8.8 below).  4 bits in each set
> gives 16 combinations, but many of these really don't make sense so
> this flexibility really isn't needed.
> 
> 8.4.3 IMMUCR
> PTBP should be bits 31-13, not 31-10... page frames are always 8kB
> in size and pretty much need to be page aligned
> 
> 8.4.4 IMMUPR
> Drop this register altogether (see 8.8 below)
> 
> Note that this register is overdimensioned... it has 7 sets with 2
> bits each.
> 
> 8.4.6
> Change name of PL1 to HUGE with description:
> 0: normal page, 8kB
> 1: huge page, 16MB (or 2MB, see below)
> 
> Change LRU from "last recently used" to "least recently used" (cosmetic)
> 
> 8.4.9 - 8.4.11
> Drop ATB's altogether.  We can get 16MB pages without them and the
> 32GB pages aren't realistic anyway.
> 
> 
> 8.8 PTE
> 
> Change PPN size to 19 bits (bits 31-13).
> 
> PPI:  Why only 7 sets of protection bits?  Why not 8?  Because value
> 0 is overloaded to mean the entry is invalid, but this prevents the
> field from being used as a sane bitmask.  Change the PPI field to 3
> individual bits indicating Writable, User access, and Executability
> and drop the Protection Registers altogether.
> 
> As per Stefan's earlier mail, make PTE something like this:
> 
> | 31 ... 13 | 12 |  11  |   10  | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
> |    PPN    |OS-specific|Present| L | X | W | U | D | A |WOM|WBC|CI |CC |
> 
> ...and we need a VALID bit in there somewhere.
> 

I agree on close to all of the above, except I think we should keep the xMMUPR
registers, but change the indexing to 0-7 (giving 8 sets of protection bits).
Not so much because I think that the flexibility is needed, but because
using a "lookup table" to translate the X/W/U bits into
SRE/SXE/SWE/URE/UXE/UWE actually makes sense.

e.g. for the DTLB case:

X | W | U
---------
0 | 0 | 0 = SRE0
0 | 0 | 1 = SRE1 | URE1
0 | 1 | 0 = SRE2 | SWE2
0 | 1 | 1 = SRE3 | SWE3 | URE3 | UWE3
1 | 0 | 0 = SRE4
1 | 0 | 1 = SRE5 | URE5
1 | 1 | 0 = SRE6 | SWE6
1 | 1 | 1 = SRE7 | SWE7 | URE7 | UWE7

Software would need a pair of shift and mask operations to pick entries
out of the "lookup table" and hardware can easily do bitfield table lookups
directly from the register.
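
To make the shift and mask concrete, here is a small sketch of the lookup.
It assumes a 32-bit DMMUPR holding 8 sets of 4 bits, with set i in bits
4*i+3..4*i and SRE in the lowest bit of each set, and it takes X/W/U from
bits 8/7/6 of the PTE as in the layout proposed above. The register layout
and all names are assumptions for illustration only.

```python
# Assumed bit positions within one 4-bit set (UWE|URE|SWE|SRE, MSB first)
SRE, SWE, URE, UWE = 0x1, 0x2, 0x4, 0x8

# The DTLB table from above, indexed by (X << 2) | (W << 1) | U
DTLB_SETS = [
    SRE,
    SRE | URE,
    SRE | SWE,
    SRE | SWE | URE | UWE,
    SRE,
    SRE | URE,
    SRE | SWE,
    SRE | SWE | URE | UWE,
]

def pack_dmmupr(sets):
    """Pack eight 4-bit protection sets into one register value."""
    value = 0
    for index, bits in enumerate(sets):
        value |= bits << (4 * index)
    return value

def pte_index(pte):
    """Extract X (bit 8), W (bit 7) and U (bit 6) as a 3-bit index."""
    return (pte >> 6) & 0x7

def lookup(dmmupr, pte):
    """One shift and one mask pick the protection bits for this PTE."""
    return (dmmupr >> (4 * pte_index(pte))) & 0xF
```

A PTE with W and U set (index 3) then yields all four of SRE, SWE, URE and
UWE, while a PTE with only X set (index 4) yields just SRE.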

While we're at it, the bit order in DMMUPR should probably be changed
to match the one in DTLBTR; as it is now, they don't match,
i.e. DMMUPR = UWE|URE|SWE|SRE and DTLBTR = SWE|SRE|UWE|URE
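
As a sketch of what the mismatch costs software today: reading both orders
above as most significant bit first (an assumption), converting one 4-bit
group between the two layouts amounts to swapping its two 2-bit halves. The
function name is made up for the example.

```python
def swap_protection_halves(nibble):
    """Convert UWE|URE|SWE|SRE into SWE|SRE|UWE|URE (or back again)."""
    nibble &= 0xF
    # Low pair (SWE|SRE) moves up, high pair (UWE|URE) moves down.
    return ((nibble & 0x3) << 2) | (nibble >> 2)
```

The same function converts in both directions, since swapping the halves
twice gives back the original value. With matching bit orders this step
would go away entirely.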

Regarding the PTE, isn't PRESENT == VALID?

> ----------------------
> 
> So how do we get 2MB huge pages... here's my suggestion.
> 
> Top-level page directory
>           -------------------------------
> 0x0000... | 8-bit index entry, L=0      |
>           -------------------------------
> 0x0020... | Empty                       |
>  to       ~ ...                         ~
> 0x00e0... | Empty                       |
>           -------------------------------
> 0x0100... | Next 8-bit index entry, L=1 |
>           -------------------------------
> 0x0120... | 2 MB page entry (L=1)       |
>  to       ~ ...                         ~
> 0x01e0... | 2 MB page entry (L=1)       |
>           -------------------------------
> 0x0200... | Next 8-bit index entry, L=0 |
>           -------------------------------
>           ~ ...                         ~
> 
> The top-level page directory is an 8kB page, and its 8-bit indexing
> makes it sparsely populated.  If we find that the L bit (huge page)
> is set on an 8-bit indexed entry, then we could do a second indexing
> on the remaining three bits (11 bit index total) to find the entry
> to the 2MB huge page in the "free space".
> 
> This could get us 2MB huge pages and we could then keep the ATB
> stuff around for the less useful 16MB huge pages.
> 
> This all plays reasonably nicely with the arch spec we've got today.
> What would need clarifying is that these huge pages are 2MB and not
> 16MB, but this is all so vague in the spec as it stands and
> otherwise unimplemented in practice that it ought to be doable.
> 

This is of course bending the meaning of the L bit; perhaps it should
be renamed then?
Because now you always have a two-level structure, but with the difference
that you are pointing back into the page directory in the second level.
Would it *have* to point into the page directory though?
Or could we use the entry fetched via the 8-bit index to get the
table pointer (and this could happen to point back into the pgd on
Linux as a memory saving optimization)?
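
To have something concrete to poke at, here is a rough sketch of the walk
as I understand the suggestion: the 8-bit index is spread over the 8kB
directory (slot = index << 3), and when L is set the full 11-bit index
picks the 2MB entry out of the same page. The PTE bit positions follow the
layout proposed above (Present in bit 10, L in bit 9); modelling the
directory as a list and the page tables as a dict keyed by physical address
is purely an assumption of the sketch.

```python
PAGE_SHIFT = 13              # 8kB normal pages
HUGE_SHIFT = 21              # 2MB huge pages
HUGE_MASK = (1 << HUGE_SHIFT) - 1
PRESENT = 1 << 10            # per the proposed PTE layout
L_BIT = 1 << 9
PPN_MASK = 0xFFFFE000        # bits 31-13

def translate(pgd, tables, va):
    """Return the physical address for va, or None if not mapped.

    pgd:    list of 2048 32-bit entries (one 8kB page)
    tables: dict mapping a table's physical address to its 2048 PTEs
    """
    top = pgd[(va >> 24) << 3]          # 8-bit index, every 8th slot
    if not top & PRESENT:
        return None
    if top & L_BIT:
        # Second indexing into the same page using 11 bits in total.
        entry = pgd[va >> HUGE_SHIFT]
        if not entry & PRESENT:
            return None
        return (entry & ~HUGE_MASK & 0xFFFFFFFF) | (va & HUGE_MASK)
    # Normal case: the entry points at a level-2 page table.
    pte = tables[top & PPN_MASK][(va >> PAGE_SHIFT) & 0x7FF]
    if not pte & PRESENT:
        return None
    return (pte & PPN_MASK) | (va & ((1 << PAGE_SHIFT) - 1))
```

For the variant asked about above, the second lookup would go through a
table pointer taken from the first entry instead of indexing pgd directly,
and that pointer could then simply happen to point back into the pgd.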

This all is of course "breaking" (fixing) the arch spec a bit, but as you
said, there are no (known*) implementations using this and there will never be
any implementations using it if it isn't useful.

* In the unlikely event that there are any unknown implementations
actually using this stuff, this conversation is being held in public,
so they are free to join in and raise their voices ;)

Stefan
_______________________________________________
Linux mailing list
Linux@lists.openrisc.net
http://lists.openrisc.net/listinfo/linux
