There are a couple of big decisions that were made in Solaris/PPC 2.6: 1) big-endian; 2) always-resident kernel address space.
Endian
------

Originally, Solaris/PPC was implemented little-endian.  This was because IBM had some reason they needed it that way, and Sun was not that committed, either way.  Little-endian is a bit more complicated, because data access to the page table is always big-endian, no matter what the mode of the processor (MSR[LE]).

Solaris/PPC 2.11 is big-endian.  Sun had a policy of trying to maintain the code so that it would work in either mode.  But, there were a few places where things needed to be adjusted, once it was really put to the test.

kernel address space
--------------------

Kernel address space is always mapped, and therefore reduces usable address space for all user-land processes.  The kernel reserves the highest two segments (14 and 15).  I bet this is an unpopular decision.  The reasons are partly for performance, partly due to limitations of the PowerPC instruction set architecture, and partly historical.  Sun already had a port to x86 that carved out user address space in this way.  IBM already had Unix code for other processors that worked this way.

Moving data between the kernel and user-land could not be done as efficiently without always-resident kernel mappings.  Sparc has load and store instructions through Address Space Identifiers (ASI), but neither PowerPC nor x86 has any such thing.  I have seen slideware for new PowerPC models that have something like ASIs, but I can't see designing the Solaris HAT code around that.  Not now, anyway.

For 2.11, this was not changed.  A cool project would be to redesign Solaris/PPC to allow user-land to use all 16 segments, and do a fast shuffle of mappings, as needed, to implement copyin() and copyout() and related functions.  That would be a big project, considering the amount of kernel code that "knows" about these two segments, and relies on just having its way with the remaining user-land address space while accessing kernel address space at the same time.  And then, there is the fact that the ABI has baked that in.  But, for some embedded applications, it may be that nobody cares about legacy code.

Aspect: Compiler
----------------

2.6 used a Sun compiler that I know little about.  I do know that I never did like the inlining facilities of Sun compilers, and judging by the amount of use it got, a lot of other people had issues with it, too.  A great deal of code was written in assembly language that would not have been necessary if there were an inlining mechanism that had "arrived" -- that is, if it were easier to use, more standard, more easily understood.

For Polaris, I made heavy use of GCC extensions, especially __inline__ and __asm__().  It is not as if we have to worry about getting along with Sun's current compiler for PowerPC.  I will worry about that problem when the time comes.  It would be a good problem to have, because it would mean that Solaris/PPC made it as a product.  Anyway, all uses of __inline__ and __asm__() are easy to reverse.  The fall-back position, in the worst case, is to revert to calling external functions.  A less drastic reform would be to adhere strictly to a convention for using the preprocessor to make things work either way, at all times.  It is conceptually easy.  I should have done that.  I just didn't get around to it.  Sorry.

Pretty much any single PowerPC instruction that does something that can't be done in C was made into a function of the same name as the opcode.  One exception is that I had to name sync() something else, because that name is taken: sync() is called by common code, and it is the function that syncs up filesystems.  The inline function, ppc_sync(), is the single PowerPC instruction, 'sync'.  In addition, a few two-instruction sequences were added.  See usr/src/uts/ppc/sys/ppc_instr.h.
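To make that concrete, here is a minimal sketch in the style of usr/src/uts/ppc/sys/ppc_instr.h; the definitions shown are illustrative, not necessarily the exact contents of that file.

    #include <sys/types.h>

    /* The single PowerPC 'sync' instruction, renamed to avoid sync(). */
    static __inline__ void
    ppc_sync(void)
    {
            __asm__ __volatile__("sync" : : : "memory");
    }

    /* Read and write the Machine State Register. */
    static __inline__ uint32_t
    mfmsr(void)
    {
            uint32_t msr;

            __asm__ __volatile__("mfmsr %0" : "=r" (msr));
            return (msr);
    }

    static __inline__ void
    mtmsr(uint32_t msr)
    {
            __asm__ __volatile__("mtmsr %0" : : "r" (msr) : "memory");
    }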
I did go overboard.  I tried to implement even non-local flow-of-control functions, such as setjmp() and longjmp(), with GCC extensions.  That was a big mistake.  What can I say?  I was overly enthusiastic, and didn't think it through very well.  I backed off on that.

By the way, the heavy use of GCC extensions is not confined to the HAT layer.  It gets mention here because the HAT code was modified to exploit them first, and still makes the most use of them, for things like segment register and BAT register management, and so on.

Besides getting the hang of __inline__ and __asm__(), you have to have some confidence in the optimizer.  I played around with -xO3, -xO4, and -xO5, and a few other command line options, and examined the resulting code.  It falls short sometimes, but seems good enough.  We make very little use of GCC optimization, so far, for two reasons.  First, because it is just generally a bad idea to be concerned about optimization at that level of abstraction, early on.  Second, because we were trying to make use of a debugger that could not understand how to relate back to source code when any optimization was done, and objdump -S did not work very well on optimized code.  All that I felt was needed at the time was a sample of GCC optimization, to gain sufficient confidence in it and make a decision about how much I could get away with writing in C.  But, actually doing any optimization, now, is not so important.

Aspect: Constructors and accessors
----------------------------------

2.6 made use of bit masks and C bit fields in an ad hoc way, as needed.  I guess that is pretty standard practice.  But, I have always had trouble with bit fields.  They can be different, depending on big-endian vs. little-endian, and even with the same endianness, the C compiler can assign bits in either order.  Also, C bit fields are not well-defined for long or long long integral types.

When I did the Solaris/IA64 HAT, I did everything that might ordinarily be done with C bit fields, shifts, and masks using constructor and accessor functions, instead.  I used a very simple mini-language to describe bit fields, and generated header files from those specifications.  It worked out well.  So, on Solaris/PPC, I did the same thing, and systematically converted all HAT code to do things that way.  See usr/src/uts/ppc/sysgen/*.fd.

There are several little decisions about what notation to use.  My rule is: use whatever notation is in the hardware reference manuals.  For IA64, bits are numbered right to left, so I did that.  The PowerPC reference manuals number bits from left to right (bit 0 is the most significant bit), so I use that notation.  The idea is to make it as easy as possible to transcribe directly and faithfully from the reference manual.

Since header files are generated, naming conventions are enforced.  Extracting a field is always done by calling the accessor function, <OBJECT>_GET_<FIELD>(value).  For example:

    SR_GET_VSID(x)
        extracts the VSID field from a segment register.

    SR_SET_VSID(sr, v)
        deposits the value v in the VSID field of a segment register.
        sr is not modified; the modified value is the return value of
        the SR_SET_VSID function.

    SR_NEW(t, ks, kp, n, vsid)
        constructs a segment register.
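For illustration, generated accessors for the 32-bit PowerPC segment register could look roughly like this (T, Ks, Kp, N in bits 0-3 and VSID in bits 8-31, using the manuals' MSB-0 numbering; the macro bodies here are a sketch, not the actual output of the .fd specifications):

    /* Hypothetical sketch of generated accessors. */
    #define SR_VSID_MASK    0x00ffffffu     /* VSID: bits 8..31, MSB-0 numbering */

    #define SR_GET_VSID(x)          ((uint32_t)(x) & SR_VSID_MASK)

    #define SR_SET_VSID(sr, v)      \
            (((uint32_t)(sr) & ~SR_VSID_MASK) | ((uint32_t)(v) & SR_VSID_MASK))

    #define SR_NEW(t, ks, kp, n, vsid)      \
            (((uint32_t)(t)  << 31) | ((uint32_t)(ks) << 30) |  \
             ((uint32_t)(kp) << 29) | ((uint32_t)(n)  << 28) |  \
             ((uint32_t)(vsid) & SR_VSID_MASK))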
Accessor functions can always be implemented as macros that use C bit fields and/or shifts and masks, simply by changing the definition; but you cannot switch the other way around -- not in C.  Accessor functions can be extended with extra validity checking, or with extra instrumentation.  And, of course, that can be controlled by a preprocessor symbol.

If I had it to do over, I would make even heavier use of constructor and accessor functions.  One thing I would do, for sure, is to use a notation for field specifications that has more redundancy.

This approach is not without cost.  Generated header files make the build process more complex, especially since make, in general, and the ON build in particular, are not very smart about generated header files.  But, there are other generated header files, for example, headers related to RPC and XDR.  And, there are other reasons to move in the direction of even more generated header files.  Once the price is paid for even one generated .h file, and the work is done to come up with a workable, if not ideal, way of living with make and ON makefiles, then the price is paid.  The marginal cost for more generated files is negligible.

Opaque struct hat
-----------------

The HAT interface, as defined in common code, uses a 'struct hat *' as the first argument to a large class of HAT functions.  See usr/src/uts/common/vm/hat.h.  Unless you go out of your way to ensure that 'struct hat' is opaque, this exposes information about the members of a HAT-internal bookkeeping data structure.  The 2.6 HAT code did not take any measures to prevent that.

In Solaris/PPC 2.11, 'struct hat' is declared, but never defined; that is, it remains an incomplete data type.  Some other part of Solaris can pass a 'struct hat *' to HAT interface functions, but non-HAT code cannot refer to any members of a 'struct hat', because "there is no there there".  HAT functions use a different structure, and all interface functions must assign or cast to the internal 'hat_t *', in order to really get at any members.

In Solaris/PPC 2.11, the type used internally is 'hat_t'.  In the paragraphs that follow, I will refer to hat_t, just for brevity.  Except for visibility, 'hat_t' is a synonym for 'struct hat', and I will use 'hat_t' to refer to 'struct hat' even when describing 2.6 code, which never used that type definition.
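The pattern, in outline (the internal structure tag and member shown here are hypothetical; hat_swapin() is just one example of an interface function that takes a 'struct hat *'):

    /* What common code sees: an incomplete type, declared but never defined. */
    struct hat;

    /* Inside the PPC HAT: the real bookkeeping structure (members invented here). */
    typedef struct ppcmmu_hat {
            uint_t  hat_vsidrange;          /* base of this address space's VSID block */
            /* ... */
    } hat_t;

    /* Interface functions cast to get at the members. */
    void
    hat_swapin(struct hat *arg)
    {
            hat_t *hat = (hat_t *)arg;
            /* ... use hat->hat_vsidrange, etc. ... */
    }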
page table allocation
---------------------

The 2.6 HAT code allocated and initialized its own page table.  When it was time to take over translations from the firmware, all existing translations, managed by the firmware, were mapped in the new page table, and then the hardware was switched over to the new page table.  Then, the firmware was notified that, from then on, it was to use callback functions, provided by the HAT, to do any mapping or unmapping on its behalf.

The 2.11 HAT just inherits the page table, in place.  There is no need for the code, the memory, or the time to be used establishing a new page table.

There is a slight risk that we would regret not having the kernel allocate its own page table: we may run into a situation where we run, not under VOF, but under an implementation of "real" Open Firmware by a hardware manufacturer, and something about its page table may not be suitable for our purposes.  It may not be where we want it; it may not be big enough; or perhaps it is bigger than we need or would like.  I believe we do not have to worry, because Open Firmware is dead, for all practical purposes.  If we run under VOF, we control it, so it is not a problem.  If we run with some completely different bootware scheme, then several things need to be changed in startup, anyway.  If we move to a more stand-alone approach, then we are in control.  The worst case is that we re-introduce the code to allocate a second page table and copy translations, or something like it.

PTEs
----

"Don't fight the hardware" is a pretty well established rule for HAT code.  And, in many ways the 2.6 HAT code did an admirable job of doing things the PowerPC way.

You know how they say that some programmers can write BASIC in any programming language.  Advocates of one language or another talk about "the <language> way", and stress that it is not enough to learn how to write programs that run correctly; it is highly desirable to adapt to the style encouraged by the language.  There is something like that with MMU hardware, as well.  The reference books give you all the facts you need to know, but are pretty skimpy on explanations, rationale, and code examples showing "the way".  Maybe it is another case of not needing to state these things explicitly, because they are obvious.  Or it could be part of an overall trend toward confining documentation to sterile, hard facts, and leaving out any "extra" expository writing.  For example, there used to be a Rationale section in the ANSI C specifications -- but no more.

One way in which the 2.6 code seemed to fall short of doing things the PowerPC way is in functions that search for PTEs.  There were several places where the code worked, but was unnecessarily complex -- both slow and non-obvious.  This was not a case of trading performance against added complexity; it was one of those win-win cases, where the simpler, more straightforward code is also faster and more compact.

One observation about the "obvious" is that all the information contained in an 8-byte PTE is partitioned into the two 32-bit words so that everything you need to know about the virtual address is in the first word, and the second word describes properties of the physical page or properties of the translation.  The first word is the key, the second word is the data.  This is no accident.  The hardware has no need to do anything more complex than a simple 32-bit compare on the first word of a PTE.  There are small code examples that show how simple the search logic can be, once things are preconditioned by constructing the right 32-bit key -- for example, the code for TLB miss handling on platforms with no hardware page-table walker.  The 2.6 code had searches of PTEs involving complex logical combinations of conditions to be tested.  But all that logic can be hoisted, and the loop can be reduced to a simple 32-bit compare.  Never compute that which can be precomputed.

In 2.11, many of the PTE searching functions were reformed, and there is more work that can be done along these lines.
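As an illustration of the precomputed-key style -- not the actual Solaris/PPC code; the type and symbol names here are made up -- a PTEG search collapses to:

    #include <sys/types.h>

    #define NPTEPERPTEG     8               /* 8 PTEs per PTE group */

    typedef struct pte {
            uint32_t pte_word0;     /* V | VSID | H | API: the virtual-address key   */
            uint32_t pte_word1;     /* RPN | R | C | WIMG | PP: the translation data */
    } pte_t;

    /*
     * The caller builds 'key' once -- valid bit, VSID, hash-function ID,
     * and abbreviated page index -- and the loop is a bare 32-bit compare.
     */
    static pte_t *
    pteg_search(pte_t *pteg, uint32_t key)
    {
            int i;

            for (i = 0; i < NPTEPERPTEG; i++) {
                    if (pteg[i].pte_word0 == key)
                            return (&pteg[i]);
            }
            return (NULL);
    }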
Software PTEs
-------------

2.6 HAT code defined two flavors of PTE: hardware PTE and software PTE.  Software PTEs (swpte) differed from hardware PTEs in two ways:

  1. swptes were stored in native data layout.  When running in little-endian mode, this mattered, because the PowerPC page tables are always accessed big-endian, regardless of the current processor state.

  2. swptes carried extra bits of information that otherwise would have to be kept in a separate place, such as in an HME.

In 2.11, I got rid of the swpte.  If it were just a matter of big-endian vs. little-endian access, I would feel an obligation to keep swptes, because I think it is important to retain the ability to work in either mode.  I have not continually tested that my changes are truly endian-independent, and so I would bet that some new dependencies on big-endian have crept in.  But, at least I have kept in mind the policy, "do no harm".

However, the swpte data structure was mostly used for a purpose that has gone away.  Swptes were used primarily to maintain a separate linear page table for kernel address space.  But, that applied only to the 603, which had no hardware page-table walker.  That optimization will likely make a comeback, say on an Efika box.  But, when that time comes, the code will have to be redone, because the way of describing the range of kernel addresses for which this optimization applies has changed between 2.6 and 2.11.  So, for now, swptes have just been eliminated.  They will almost certainly not be brought back, as is.  New and better code would be created to do the same kind of optimization.  Also, I have not given up on mapping kernel text and data with BAT registers, at least as an option, at least for some embedded applications.  In that case, a linear page table for kernel address space is utterly useless, even on a platform with no hardware page-table walker.

struct hat
----------

In 2.6, 'struct hat' data structures were pre-allocated.  The capacity planning was based on the maximum number of processes, which was a configurable parameter.

In Solaris/PPC 2.11, hat_t's are allocated dynamically.  They have their own kmem cache, "ppcmmu_hat_cache".  The hat_t for the kernel, 'khat', is the one hat_t that is not dynamically allocated.  It is of storage class 'extern', so it is always at a known location.  Some HAT operations treat the kernel address space differently.  Testing for hat == &khat is a cheap test to discriminate between kernel and non-kernel.  By the time ppcmmu_hat_cache needs to be ready, the kernel memory allocator has been up and running for a long time.
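A sketch of the dynamic allocation (the cache name and the khat test are as described above; the function names, cache arguments, and initialization details are assumptions):

    #include <sys/kmem.h>

    static kmem_cache_t *ppcmmu_hat_cache;

    extern hat_t khat;                      /* the kernel's hat, statically allocated */

    void
    ppcmmu_hat_cache_init(void)
    {
            ppcmmu_hat_cache = kmem_cache_create("ppcmmu_hat_cache",
                sizeof (hat_t), 0, NULL, NULL, NULL, NULL, NULL, 0);
    }

    hat_t *
    ppcmmu_hat_alloc(void)
    {
            hat_t *hat;

            hat = kmem_cache_alloc(ppcmmu_hat_cache, KM_SLEEP);
            /* ... initialize VSID range, counts, etc. ... */
            return (hat);
    }

    /* Kernel vs. non-kernel address space is a one-compare test. */
    #define IS_KHAT(hat)    ((hat) == &khat)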
HMEs
----

Solaris needs to keep more information about translations than is provided for in the hardware PTE.  That is where the HAT Mapping Entry (HME) comes in.

Other aspects of Solaris VM design, particularly pageout, mean that it is necessary to navigate quickly from a physical page to all translations to that page.  Linux does not have that requirement.  But, even if Solaris did away with this navigation overhead, there is always some amount of information that needs to be kept about each translation, in addition to what is supported by hardware.  It has to be kept somewhere, and it has to be quick and easy to get from a PTE to the supplemental information.

Aside from the navigation overhead, the supplemental information per PTE could be just a few bits.  On some MMU architectures, there are unused bits in the PTE.  Of those that have unused bits, some are reserved and others are explicitly made available for use by software.  For example, the IA64, operating in linear page table mode (VHPT Short Format), has 11 bits that are available for use by the HAT layer.  That was sufficient, so that no additional storage was needed for an HME, except for pure navigation overhead.

Note that PTEs and HMEs should not contain information about the virtual address.  Nor should they contain information about the underlying physical page; that sort of information belongs in a page_t, or failing that, some other data structure that keeps data about physical pages.  PTEs and HMEs should contain information ONLY about the _relationship_ between the virtual address and the physical page.

On PowerPC, PTEs are quite full.  This is mostly because the PowerPC MMU architecture uses an inverted page table, and so the bulk of the fully-qualified virtual address must be contained in each PTE.

Since the PowerPC hardware page table is an inverted page table, and there is only one, which is global for all address spaces, Solaris/PPC 2.6 keeps all HMEs in an array, parallel to the hardware page table, with the same number of entries.  This way, fast and simple address arithmetic is all that is needed to move back and forth between a PTE and its corresponding HME.

In 2.6, each HME contains:

    next pointer
    prev pointer
    page pointer
    hat index
    payload

This was common to HAT code for Sparc and for x86, at the time.  The Solaris/PPC 2.6 HME data structure was just copied from srmmu or x86 code, and then modified slightly, as needed.  The 'next' and 'prev' pointers are provided so that a doubly-linked list of HMEs can be maintained for each page_t.  A pointer to the page_t is there for quick navigation, because you may arrive at an HME not by traversing the doubly-linked list of translations for a page, but by way of a PTE lookup, in which case you would not necessarily know in advance what page_t is involved.  Also, there is a way to navigate from an HME to the hat_t for the address space for this translation.  And then there are those extra bits that we could not fit into a PTE, which, for our immediate purposes, we can just lump together and call the 'payload'.  For, in a sense, this is the only "real" information in an HME; the rest is navigation overhead, and could be considered redundant.

Altogether, that adds up to 20 bytes per HME, for a total of 28 bytes per potential translation.  Notice that I said _potential_ translation.  Since PTEs and HMEs are both pre-allocated in an array, HMEs occupy space even when they correspond to unused (invalid) PTEs.
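In outline, the 2.6 arrangement looks something like this; the member names and widths are illustrative, not the actual 2.6 declarations (which, per the accounting above, total 20 bytes), and pte_t is as in the earlier sketch:

    typedef struct hme {
            struct hme  *hme_next;          /* list of mappings to a shared page */
            struct hme  *hme_prev;
            struct page *hme_page;          /* back-pointer to the page_t        */
            uint16_t     hme_hatid;         /* index into the table of hat_t's   */
            uint16_t     hme_payload;       /* the extra per-translation bits    */
    } hme_t;

    extern pte_t pte_table[];               /* the hardware page table            */
    extern hme_t hme_table[];               /* parallel array, same no. of entries */

    /* Parallel arrays make PTE-to-HME navigation pure pointer arithmetic. */
    #define PTE_TO_HME(pte)     (&hme_table[(pte) - pte_table])
    #define HME_TO_PTE(hme)     (&pte_table[(hme) - hme_table])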
In 2.6, they did manage to save a bit of space by storing a 16-bit index into a table of hat_t data structures instead of a 32-bit pointer.  But, in 2.11, we actually would need more space for that, because the scheme for allocating hat_t instances was changed from a fixed-size array to truly dynamic allocation, so now we really would need a pointer to navigate back to the hat_t, and that would mean 32 bits.  On a 64-bit machine, storage requirements for HMEs would, of course, be much worse -- almost double.

In 2.11, many preparations were made to move to a scheme that involves much less overhead per translation.  We still need some room for the 'payload', those extra bits that we could not sneak into unused space in a PTE.  One byte per PTE is sufficient.  It would still be organized as a parallel array of 1-byte HMEs.  OK, but what about the 'next' and 'prev' pointers, and the 'page' pointer, and what about the pointer back to the hat_t?

page pointer
---

The PTE contains the RPN (PowerPC MMU terminology, Real Page Number), the physical page frame number (pfn, in Solaris terminology).  It is not that expensive to navigate from a pfn to a page_t.  There are times when it is necessary, but these are not the most common cases, and it is not _that_ performance critical.  It is unlikely that simply eliminating the HME page pointer will be a performance problem; but if it were, a better investment would be in improving the performance of pfn-to-page_t navigation.  There are plenty of opportunities for that, if it should be needed.

'next' and 'prev'
---

At any time, it is pretty likely that a significant fraction of HMEs are not in use.  In a way, they don't count.  Of the HMEs in use, most will be for translations to a page that is not shared; that is, the doubly-linked list of HMEs would be:

    { next=NULL; prev=NULL; payload }

A page_t contains the head of the linked list of translations to that page (p_mapping), and a share count (p_share).  For pages with only one translation, the p_mapping field can be a pointer to the PTE, directly, and we can dispense with the doubly-linked list altogether.  Now, all we have to worry about is providing HMEs for the small (but important) minority of pages that are shared.  At this stage, even a great deal more overhead per shared page would be acceptable.  But, even that is not necessary.  In fact, there are performance reasons for reducing per-HME overhead even further, by using unrolled linked lists with node sizes of one or two cache lines.

That just leaves navigation from an HME to its hat_t.  If we were writing a HAT for per-address-space page tables, either linear or forward-mapped, then we would not need to navigate back to the hat_t.  It is only because there are situations where we are rummaging around in a global page table that we don't know what address space applies to a random PTE.  But, the PTE contains the VSID.  We are already paying the price of storing the VSID in every PTE.  There is a many-to-one correspondence between VSIDs and hat_t's.  In 2.6, the many-to-one mapping is simple: the mapping of VSID _ranges_ to hat_t's is bijective.  There might be reasons to change to a more flexible way of allocating VSIDs, one at a time.  I see no reason to be in a hurry to change the current scheme of using blocks of 16 VSIDs.  The situations where you need to navigate back to the hat_t are not the most common, and do not require performance at any cost.  The performance of navigating from HME (PTE) to hat_t just has to be "good enough".  An extensible array or hash table to map a VSID range to a hat_t would be good enough, unless its implementation is messed up, somehow.  Also, the kernel VSID range can be made a special case.

PTEG pairs
----------

The PowerPC MMU organizes PTEs into groups of 8.  Each group is called a PTE Group, or PTEG for short.  The hardware hash functions resolve to a group, so within each PTEG it is necessary to search for a matching PTE.  Without having any other information on the side, that pretty much means a linear search of up to 8 8-byte PTEs.  PTEGs are paired up, so that if a PTEG gets full, a secondary hash function is used to navigate to the other PTEG in that PTE-Group pair, or PTEGP for short.

Because a PTEG-pair is an important natural unit for operating on PTEs, Solaris/PPC 2.6 keeps some bookkeeping data for each PTEG-pair.  The per-PTEG-pair data structure contains:

  1. ptegp_mutex: a mutex protecting the PTEG-pair, to regulate concurrent accesses to the page table at the granularity of a PTEG-pair.

  2. ptegp_validmap: one bit per PTE in the PTEG-pair, all in one 16-bit short int, indicating whether the corresponding PTE is known to be invalid (available).  This makes it unnecessary to scan the PTEG-pair for a free slot, except when all entries have been used.  If none of the entries in a PTEG-pair are available, the PTEG-pair is scanned and an attempt is made to unload any mappings that have become "stale" (no longer associated with in-use VSIDs).

  3. ptegp_lockmap: a 2-bit lock count for each PTE.  A PTE with a non-zero lock count is for a locked translation, that is, one not subject to displacement.  Since a PTE can be locked multiple times, we need to maintain a lock count for each PTE.  On PPC, the PTEs of the same address space are not grouped as they are in other architectures (e.g., x86, srmmu), so we need a separate counter for each PTE.  The occurrence of multiple locks on a PTE is not common, so we use a two-level scheme to minimize memory for the PTE lock counters.  We use 2 bits per PTE in the PTEG-pair structure, which keeps a lock count of up to 2; a value of 3 indicates an overflow.  For lock counts greater than 2, we use a small hash table to maintain the true lock count for those PTEs.

This part of Solaris/PPC HAT bookkeeping has not changed, but that is only for lack of time.  The plan is to change the granularity of locking from a PTEG-pair to a single PTEG, so that the PTEG valid map and the PTEG lock map combined all fit in a 32-bit word.  That way, operations on the bookkeeping data for each PTEG are done atomically, and no mutex is required.  The 32-bit word contains all the bookkeeping data and acts as its own lock.  No other functionality of a mutex is required.  No blocking is required.  There is no need to know anything about the owner of the "lock".  A 32-bit word to cover 8 PTEs means that we have a budget of 4 bits per PTE.  That budget can be used for 1 valid bit plus a 3-bit lock count, where the value 7 would indicate overflow.  Also, the size of the hash table for lock counts is fixed, based on a fixed fraction of the number of entries in the page table.  I have this thing about fixed-size allocations.  I would change that, given time.
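A sketch of the planned bookkeeping word, following the 4-bits-per-PTE budget described above (the names and the helper function are made up; atomic_cas_32() is the usual Solaris compare-and-swap primitive):

    #include <sys/types.h>
    #include <sys/atomic.h>

    #define PTEG_PTE_BITS   4               /* 4 bookkeeping bits per PTE        */
    #define PTEG_PTE_VALID  0x8u            /* per-PTE: slot is in use           */
    #define PTEG_PTE_LCKMSK 0x7u            /* per-PTE: lock count, 7 = overflow */

    /*
     * Bump the lock count for PTE 'slot' (0..7) in one PTEG's bookkeeping
     * word.  The word is its own lock: update it with compare-and-swap,
     * retry on collision; no mutex, no blocking.
     */
    static void
    pteg_lock_enter(volatile uint32_t *bkp, uint_t slot)
    {
            uint32_t old, new, count;
            uint_t shift = slot * PTEG_PTE_BITS;

            do {
                    old = *bkp;
                    count = (old >> shift) & PTEG_PTE_LCKMSK;
                    if (count == PTEG_PTE_LCKMSK)
                            return;         /* overflow: true count lives in the hash table */
                    new = old + (1u << shift);
            } while (atomic_cas_32(bkp, old, new) != old);
    }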
VSID ranges
-----------

Solaris/PPC 2.6 HAT had its own allocator for VSID ranges.  You cannot blame them for that.  Vmem was not around when the HAT was designed, nor even when the code was written.  The same is true for use of the kernel memory allocator.  Jeff Bonwick's slab allocator was brand new at the time; its development was concurrent with Solaris/PPC HAT development.  It had not yet "arrived".

In Solaris/PPC 2.11, the scheme for allocating VSIDs was changed to just use the vmem allocator.

The 2.6 design had a scheme for cycling through all VSID ranges and delaying the actual removal of PTEs until a VSID range needed to be reused.  Although this looks to me like a slick idea, and might be revisited sometime, it did not seem to really be useful, because common code is not aware of it; common code calls upon the HAT layer to unload all the mappings of an address space on exit.  So, the whole notion of keeping "stale" PTEs in the page table has to be rethought.  Either there has to be more communication between common code and the HAT layer, or the HAT layer has to be made smarter.  For now, I just use the vmem allocator for VSID ranges, do not bother trying to do lazy unloading of PTEs, and just reuse VSID ranges, rather than trying to cycle through all of them and then do VSID-range garbage collection.

It is almost certain that we can get almost all of the benefit of a lazy PTE-unload scheme just by being smart about batch processing of cross calls for TLB shoot-down.  For a single processor, I don't think the lazy PTE-unload buys much at all; almost certainly not enough to justify its complexity, its profligate use of VSID ranges, the resources needed to keep track of them, and the need to garbage collect "stale" PTEs.
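The vmem usage would be along these lines (a sketch only: the arena name, bounds, and helper functions are assumptions; one arena unit stands for one block of 16 VSIDs, as described earlier):

    #include <sys/vmem.h>

    static vmem_t *vsid_range_arena;

    void
    vsid_range_init(uint_t nranges)
    {
            /*
             * Arena of abstract identifiers: one unit == one VSID range.
             * Base is 1 so that a return value of 0 (NULL) can still mean
             * "allocation failed".
             */
            vsid_range_arena = vmem_create("ppcmmu_vsid_ranges",
                (void *)1, nranges, 1, NULL, NULL, NULL, 0, VM_SLEEP);
    }

    uint_t
    vsid_range_alloc(void)
    {
            return ((uint_t)(uintptr_t)vmem_alloc(vsid_range_arena, 1, VM_SLEEP));
    }

    void
    vsid_range_free(uint_t range)
    {
            vmem_free(vsid_range_arena, (void *)(uintptr_t)range, 1);
    }

Using an arena of abstract identifiers, rather than addresses, is the usual vmem idiom for managing any numbered resource, and it gets the bit maps and linked lists of the old custom allocator for free.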
BATs
----

Solaris/PPC 2.6 had a simple array of information about the BAT registers in use.  Information about both instruction and data BAT registers was kept in a single array.  One of the properties stored about each entry was which type it was: Instruction (IBAT) or Data (DBAT).

Very little use is made of BAT registers, unfortunately.  However, some updates were made, and while I was at it, I made some other changes.  First, there is now room for up to 8 IBATs and 8 DBATs, and configuration variables are used to tell whether there are 4 or 8.  Those configuration variables are set by cpuid(), which dispatches to model-specific code.  The code for the MPC7450 knows how to interrogate and set the appropriate HID0 bits and then set the configuration variables.  Second, there are two separate arrays, one for IBATs and one for DBATs.  Most searches are for a specific purpose, where it is known in advance whether it is for IBATs only or for DBATs only.  Those searches need not look in the array for the other type of BAT.  Also, there is no need to store the type in each BAT entry.  Other optimizations are planned, but not done.  They are not useful, yet.

------------------------------------------------------------------------

Table: Summary of changes
-------------------------

Resource                2.6                     2.11
----------------------  ----------------------  --------------------
Page table              allocated               inherited from VOF
                        translations copied
HMEs                    20 bytes                still 20 bytes
                                                Future: 1 byte
hat_t                   fixed size pool         dynamically allocated
                                                kmem_cache_alloc()
                                                "ppcmmu_hat_cache"
VSID ranges             custom allocator        vmem
                        bit maps, linked lists
PTEGP bookkeeping       protected by mutex      unchanged
                        validmap + lockmap      Future: data modified
                                                atomically; no mutex
BAT registers           4                       4 or 8
data and instr          Unified I and D         Segregated I and D

------------------------------------------------------------------------

-- Guy Shaw