The HAT interface is covered in books, such as "Solaris Internals", in source code, and in Sun courses. There is no need to duplicate any of that.
What follows is a supplement that might be a helpful guide to decomposing the HAT interface along natural boundaries, and then reasoning about more manageable subsets. Often, classification is the first step toward understanding. Also, there are some recurring themes that tend not to get coverage, perhaps because they are obvious, but I will mention a few of them anyway.

It would be nice if we could separate interface from implementation completely, as they pretend to do in the textbooks and in the classroom. But certain interface choices really do make demands on the implementation.

Failure is not an option
------------------------

One thing you might notice, just by looking at the generic HAT interface, is that for the basic operations of mapping, unmapping, and modifying attributes of mappings, there is no return code. These hat_* functions are all of return type 'void', and there is no other mechanism to indicate failure. A few functions have a way of indicating that a feature is not supported, but none of the basic functions do.

The HAT is, after all, just a cache. Caches are expected to just do their job. They have to do whatever it takes to operate correctly and to make forward progress. Returning ENOMEM is not an option; you just have to evict some other entries from the cache to make room, or cause some other part of the system to give up resources. Returning EINVAL is not an option, either. If VM code hands bad arguments to the HAT, that is fatal, and rightfully so: it is an internal error, and you know you are in really serious trouble.

One area where there are significant differences among Unix flavors is the behavior of some system calls when memory is full.
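Backing up to the void-returning interface for a moment: the "do whatever it takes" obligation can be sketched in userland C. All names here, and the round-robin eviction policy, are invented for illustration; the point is only that the load routine returns void and makes room by evicting, rather than reporting ENOMEM to its caller.

```c
#include <assert.h>
#include <stddef.h>

#define NPTE 4                  /* deliberately tiny translation cache */

struct pte { unsigned long va; int valid; };

static struct pte pte_pool[NPTE];
static size_t clock_hand;       /* trivial round-robin victim selection */

/*
 * Hypothetical stand-in for a hat_memload()-style entry point.
 * Note the return type: there is no way to report failure.
 * If the pool is full, evict some other entry and carry on.
 */
void toy_hat_load(unsigned long va)
{
    for (size_t i = 0; i < NPTE; i++) {
        if (!pte_pool[i].valid) {
            pte_pool[i].va = va;
            pte_pool[i].valid = 1;
            return;
        }
    }
    /* No free slot: eviction is mandatory; ENOMEM is not an option. */
    pte_pool[clock_hand].va = va;       /* reuse the victim's slot */
    clock_hand = (clock_hand + 1) % NPTE;
}
```

Every call completes; when the pool overflows, an older translation silently disappears, which is exactly what a cache is allowed to do.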
One Unix flavor might always return success from fork(), even when there is insufficient memory or insufficient mapping resources for the new process. Another might return ENOMEM under the same circumstances, because it has a policy of reserving some resources for a new process, and of failing the fork() if the reservation fails, rather than letting the system thrash once the new process starts competing for memory and mapping resources. But that sort of policy is not decided at the HAT layer. The HAT always does what it takes to complete a request to map memory, even if it is bad policy.

VA-ranges vs pages
------------------

Let us set aside, for now, the interface between bootware and startup code that calls into the HAT to initialize data structures at key points during boot/startup. Once things are up and running, the bulk of the HAT interface is for creating, modifying, and destroying mappings, and for asking questions about mappings. Most of those functions operate on virtual address ranges. But another significant category is functions that operate not on VA-ranges but on underlying pages.

That is a design choice in Solaris common VM code. It does not have to be that way; Linux VM is different in that respect. In Solaris, pageout sweeps through pages of physical memory looking for candidate pages to be written to durable storage. This accesses pages directly, completely independent of any VA-view by any address space. So, effectively, at any time, out of nowhere, a random page can be plucked out of memory, without consulting any address space, and without cooperating with any other VA-range-oriented accounting for translations to that page. This means there must be a way of navigating quickly from any page to all of the translations that map that page. That is the sole reason for having HAT functions that operate on pages, and on some processors it is the sole reason for needing HAT Mapping Entries (HMEs).
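The page-to-all-translations navigation described above can be sketched in userland C. The structure and function names are invented, and the real Solaris HME machinery is considerably more involved; the point is just that each page carries a list of mapping entries, so pageout can tear down every translation starting from the page, without consulting any address space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One mapping entry: which address space, at what VA, maps this page. */
struct toy_hme {
    struct toy_hme *next;
    int             as_id;      /* stand-in for an address-space handle */
    unsigned long   va;
    int             valid;
};

/* A physical page keeps the head of its mapping-entry list. */
struct toy_page {
    struct toy_hme *hme_list;
};

/* VA-range side: record a new translation to page 'pp'. */
void toy_hme_add(struct toy_page *pp, int as_id, unsigned long va)
{
    struct toy_hme *hme = malloc(sizeof (*hme));
    hme->as_id = as_id;
    hme->va = va;
    hme->valid = 1;
    hme->next = pp->hme_list;
    pp->hme_list = hme;
}

/*
 * Pageout side: invalidate every translation to a page, navigating
 * from the page itself.  Returns how many translations were torn down.
 */
int toy_hat_pageunload(struct toy_page *pp)
{
    int n = 0;
    for (struct toy_hme *hme = pp->hme_list; hme != NULL; hme = hme->next) {
        if (hme->valid) {
            hme->valid = 0;     /* a real HAT would also shoot down TLBs */
            n++;
        }
    }
    return n;
}
```

Note that the VA-range side pays for this: every mapping operation must also maintain the per-page list, which is where the space and complexity costs discussed next come from.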
Let's leave DR (dynamic reconfiguration) out of the picture. Linux, for example, navigates through processes and their accounting of the memory they map, in order to cover physical memory looking for victim pages. So Linux has no such thing as an HME. HMEs consume a lot of space; the navigation overhead is often larger than the PTE (or TTE) itself. But the heaviest price to be paid for this whole way of doing things is that code operating on VA-ranges must be prepared at _any_ time to lose a race with pageout. So, code for operating on mappings is always more complex.

Boundaries within boundaries within boundaries
----------------------------------------------

If you look at all the HAT functions that operate on VA ranges, they all have the same structure. A common problem must be solved, independent of what operation is to be performed: respecting natural boundaries while sweeping through the given VA range. The MMU hardware for any processor will have some natural boundaries that must at least be tested for. With either forward-mapped page tables or linear page tables, the span of a page of PTEs is a significant natural boundary. It is more significant for forward-mapped page tables, and there are possibly multiple levels of super-boundaries. There may also be segments, and perhaps a bit of adjustment has to be done when crossing a segment boundary. You might also have to check for a VA hole, or otherwise check for some toxic VA range.

In addition to MMU hardware natural boundaries, even on processors with TLBs that are managed entirely by software, there may be units of software accounting for multiple page sizes, or for clusters of translations, for which there is an advantage to keeping their information grouped. For some operations, like modifying mappings for text, cache line boundaries are significant, because, for example, the I-cache for a set of pages may have to be invalidated.
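A stripped-down sketch of that kind of sweep, with made-up constants: each pass is clamped so it never crosses the span covered by one page of PTEs, which is where a real HAT would re-locate the page table, check for a VA hole, adjust for a segment boundary, and so on.

```c
#include <assert.h>

#define PAGESIZE        4096UL
#define PTES_PER_PAGE   512UL                     /* hypothetical */
#define PTSPAN          (PAGESIZE * PTES_PER_PAGE) /* VA covered by one
                                                      page of PTEs */

static int chunks;      /* how many boundary-clamped passes were made */

/* Sweep [va, va+len), never crossing a page-table-page boundary. */
void toy_sweep(unsigned long va, unsigned long len)
{
    unsigned long end = va + len;

    while (va < end) {
        /* next natural boundary strictly above va */
        unsigned long bound = (va + PTSPAN) & ~(PTSPAN - 1);
        unsigned long stop = (bound < end) ? bound : end;

        /* a real HAT would look up the page table for this span here */
        chunks++;
        for (; va < stop; va += PAGESIZE) {
            /* operate on one PTE; trivial compared to the sweep logic */
        }
    }
}
```

In a real HAT, this one level of clamping is repeated for each level in the hierarchy of boundaries, which is exactly why the sweep logic dwarfs the per-PTE work.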
All of this adds up to a whole mess of logic for sweeping through an address range, descending down a hierarchy of boundaries, watching for pitfalls along the way.

Like bcopy
----------

It is somewhat analogous to an optimized version of block copy. You have to take care of unaligned data, or data that is not of the same phase with respect to word size, cache line size, or page size, or all three. There must be logic to take care of a small portion at the beginning, if any, and there may be a small piece left over at the end. Once you get things set up on a natural boundary, it is smooth sailing for a while. Short blocks are a pretty common special case; you probably want to test for that and do something special, with reduced overhead. The bulk of the logic that makes block copy difficult to understand, difficult to debug, and that takes up all that code space, is in dealing with boundaries. The actual load and store is buried in there somewhere; it is trivial.

And so it is with HAT code that operates on VA-ranges, although to a lesser degree. In several HAT implementations, you will see the same logic repeated for mapping, unmapping, changing protections, and so on. Only the call to the underlying operation on a single PTE is different. So, after a while, you can look at a lot of HAT code and think, "yada yada ... been there, done that ..." and then get down to business and concentrate on the functions that modify a single PTE.

Inversion of control
--------------------

In some HAT implementations the repeated logic is made manageable using macros. In the Solaris/IA64 HAT, there really was just one function to sweep through a VA range, without regard to what was to be done to each PTE. A set of HAT operations was encoded, and the 3-tuple {hat_op, attr, flags} was simply handed down to the bottom layer, without interpretation. Finally, the modify_pte function decided what to do with the PTE.
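That inversion of control can be sketched as follows. The operation names, PTE layout, and function signatures here are invented for illustration, but the shape matches the description above: one generic sweep hands the uninterpreted {hat_op, attr, flags} tuple down, and only modify_pte() decides what it means.

```c
#include <assert.h>

#define PAGESIZE 4096UL

enum hat_op { HAT_OP_LOAD, HAT_OP_UNLOAD, HAT_OP_CHGPROT }; /* invented */

#define PTE_VALID 0x1u
#define PTE_WRITE 0x2u

static unsigned int pte[16];    /* toy linear page table */

/* The only place a PTE is interpreted or modified. */
static void modify_pte(unsigned int *p, enum hat_op op, unsigned int attr,
    unsigned int flags)
{
    (void) flags;               /* unused in this sketch */
    switch (op) {
    case HAT_OP_LOAD:    *p = PTE_VALID | attr;            break;
    case HAT_OP_UNLOAD:  *p = 0;                           break;
    case HAT_OP_CHGPROT: *p = (*p & PTE_VALID) | attr;     break;
    }
}

/* The one generic sweep: the 3-tuple passes through uninterpreted. */
void hat_range_op(unsigned long va, unsigned long len,
    enum hat_op op, unsigned int attr, unsigned int flags)
{
    for (unsigned long v = va; v < va + len; v += PAGESIZE)
        modify_pte(&pte[v / PAGESIZE], op, attr, flags);
}
```

Mapping, unmapping, and changing protections all become thin wrappers around one sweep, so the boundary logic is written and debugged once.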
This worked out well, partly because the IA64 had enough registers that the typical set of arguments could be passed in registers and kept in place to be passed down to the next level, without saving and restoring. On a register-starved architecture, like IA32, it might not work out so well; maybe passing a pointer to the 3-tuple would work better. Or maybe even that would be prohibitive; I have not tried it. An idea for a Solaris/PPC 2.<future> project would be to see how well that works out with the PowerPC stack, general-purpose registers, and calling conventions.

Choke point
-----------

At least during bringup, it is a good idea to make sure there is only one function to modify a PTE, or, if that is impractical, three functions: add, modify, delete. That way, extra constraints and instrumentation, like an MMU flight recorder, can be added in the one strategic place. Later, there may be performance reasons for having optimized variants of that code for specific cases. But that comes at a price in flexibility and maintainability.

-- Guy Shaw