The HAT interface is covered in books, such as "Solaris
Internals", in source code, and in Sun courses.  There is
no need to duplicate any of that.

What follows is a supplement that might be a helpful guide
to decomposing the HAT interface along natural boundaries,
and then reasoning about more manageable subsets.  Often,
classification is the first step toward understanding.
Also, there are some recurring themes that tend not to
get coverage, perhaps because they are obvious; I will
mention a few of them, anyway.

It would be nice if we could just separate interface from
implementation completely, like they pretend to do in the
textbooks and in the classroom.  But, certain interface
choices really do make demands on the implementation.

Failure is not an option
------------------------
One thing you might notice, just by looking at the generic
HAT interface, is that for the basic operations of mapping,
unmapping, and modifying attributes of mappings there
is no return code.  These hat_* functions are all of
return type 'void', and there is no other mechanism to
indicate failure.  A few functions have a way of indicating
that a feature is not supported, but none of the basic
functions do.

The HAT is, after all, just a cache.  Caches are expected
to just do their job.  They have to do whatever it takes
to operate correctly and to make forward progress.
Returning ENOMEM is not an option.  You just have to
evict some other entries from the cache to make room, or
cause some other part of the system to give up resources.
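The "never fail" discipline can be sketched in miniature.  The
following is a hypothetical illustration, not Solaris code: a
fixed-size translation cache whose insert returns void and always
succeeds, evicting an existing entry when the table is full, in the
same spirit as the basic hat_* operations.  All names here are
invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch (not Solaris code): a fixed-size translation
 * cache whose insert never fails.  When the table is full, it evicts
 * a victim to make room, mirroring the HAT rule that basic mapping
 * operations return void and must always make forward progress.
 */
#define TC_SLOTS 4

struct tc_entry {
    unsigned long va;      /* virtual address tag */
    unsigned long pte;     /* translation */
    int           valid;
};

static struct tc_entry tc_table[TC_SLOTS];
static int tc_victim;      /* simple round-robin eviction cursor */

/* Always succeeds: returns void, just like the basic hat_* entry points. */
void tc_insert(unsigned long va, unsigned long pte)
{
    int i;

    for (i = 0; i < TC_SLOTS; i++) {
        if (!tc_table[i].valid) {
            tc_table[i].va = va;
            tc_table[i].pte = pte;
            tc_table[i].valid = 1;
            return;
        }
    }
    /* No free slot: evict somebody rather than report ENOMEM. */
    i = tc_victim;
    tc_victim = (tc_victim + 1) % TC_SLOTS;
    tc_table[i].va = va;
    tc_table[i].pte = pte;
    tc_table[i].valid = 1;
}

/* Lookup may fail (a cache miss is normal); insert may not. */
int tc_lookup(unsigned long va, unsigned long *pte)
{
    for (int i = 0; i < TC_SLOTS; i++) {
        if (tc_table[i].valid && tc_table[i].va == va) {
            *pte = tc_table[i].pte;
            return 1;
        }
    }
    return 0;
}
```

Note the asymmetry: lookup reports a miss, because a miss is a normal
event for a cache, but insert has no failure path at all.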

Returning EINVAL is also not an option.  If VM code hands
bad arguments to the HAT, that is fatal, and rightfully
so, since it is an internal error, and you know you are
in really serious trouble.

One area where there are significant differences among
Unix flavors is in behavior of some system calls when
memory is full.  One Unix flavor might always return
success for a fork() system call, even though there would
be insufficient memory or mapping resources for the new
process.  Another might return ENOMEM under the same
circumstances, because it has a policy of reserving some
resources for a new process and returning ENOMEM if the
reservation fails, rather than letting the system thrash
once the new process competes for memory and mapping
resources.  But that sort of policy is not decided by
the HAT layer.  The HAT always does what it takes to
complete a request to map memory, even if it is bad policy.

VA-ranges vs pages
------------------
Let us set aside, for now, the interface between bootware
and startup code that calls into the HAT to initialize data
structures at key points during boot/startup.  Once things
are up and running, the bulk of the HAT interface is
for creating, modifying, and destroying mappings and
asking questions about mappings.  And out of those, most
are for operating on virtual address ranges.  But then,
another significant category is functions that operate,
not on VA-ranges but on underlying pages.

That is a design choice in Solaris common VM code.
It does not have to be that way; Linux VM, for example,
is organized differently.

In Solaris, pageout sweeps through pages of physical
memory looking for candidate pages to be written to
durable storage.  This accesses pages directly, completely
independent of any VA-view by any address space.  And so,
effectively, at any time, out of nowhere, a random page
can just be plucked out from memory, without consulting
any address space, or cooperating with any other VA-range
oriented accounting for translations to that page.

This means that there must be a way of navigating quickly
from any page to all of the translations that map that
page.  That is the sole reason for having HAT functions
on pages, and on some processors it is the sole reason
for needing HAT Mapping Entries (HME).
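The page-to-translations navigation can be sketched as follows.
This is a hypothetical, simplified illustration in the spirit of
HMEs, not actual Solaris data structures: each physical page carries
a chain of mapping entries, so pageout can tear down every
translation to the page without consulting any address space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical sketch of page -> translation navigation.  Each
 * physical page keeps a list of mapping entries (in the spirit of
 * Solaris HMEs), so pageout can reach every PTE that maps the page
 * without walking any address space.  Names are illustrative.
 */
struct hme {                    /* one mapping of a page */
    struct hme *next;           /* next mapping of the same page */
    unsigned long *ptep;        /* the PTE that maps the page */
};

struct page {
    struct hme *mappings;       /* head of the reverse-mapping chain */
};

/* Record that *ptep maps pp. */
void page_add_mapping(struct page *pp, unsigned long *ptep)
{
    struct hme *h = malloc(sizeof (*h));

    h->ptep = ptep;
    h->next = pp->mappings;
    pp->mappings = h;
}

/*
 * The pageout side: invalidate every translation to this page,
 * starting from the page itself, with no VA-range bookkeeping
 * consulted at all.  Returns how many mappings were torn down.
 */
int page_unmap_all(struct page *pp)
{
    int n = 0;
    struct hme *h;

    while ((h = pp->mappings) != NULL) {
        pp->mappings = h->next;
        *h->ptep = 0;           /* clear the PTE (invalidate) */
        free(h);
        n++;
    }
    return n;
}
```

This is exactly the structure whose space and complexity costs are
discussed below: every mapping pays for the extra per-page chain.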

Let's leave DR (dynamic reconfiguration) out of the
picture.

Linux, for example, navigates through processes and their
accounting of the memory they map, in order to cover
physical memory, looking for victim pages.  And so, Linux
has no such thing as an HME.

HMEs consume a lot of space.  The navigation overhead is
often larger than the PTE (or TTE).  But, the heaviest
price to be paid for this whole way of doing things is that
code operating on VA-ranges must be prepared at _any_
time to lose a race with pageout.  So, code for operating
on mappings is always more complex.

Boundaries within boundaries within boundaries
----------------------------------------------
If you look at all the HAT functions that operate on
VA ranges, they all have the same structure.  A common
problem must be solved, independent of what operation
is to be performed: respecting natural boundaries while
sweeping through the given VA range.  The MMU hardware for
any processor will have some natural boundaries that must
at least be tested for.  With either forward-mapped page
tables or linear page tables, the span of a page of PTEs is
a significant natural boundary.  It is more significant for
forward-mapped page tables, and there are possibly multiple
levels of super boundaries.  There may also be segments,
and perhaps a bit of adjustment has to be done when
crossing a segment boundary.  You might also have to check
for a VA hole, or otherwise test for some toxic VA range.
In addition to MMU hardware natural boundaries, even on
processors with TLBs that are managed entirely by software,
there may be units of software accounting for multiple
page sizes, or for clusters of translations, for which
there is an advantage to keeping their information grouped.

For some operations like modifying mappings for text, cache
line boundaries are significant, because, for example,
the I-cache for a set of pages may have to be invalidated.

All of this adds up to a whole mess of logic for sweeping
through an address range, descending down a hierarchy of
boundaries, watching for pitfalls along the way.
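The innermost of those boundaries can be shown concretely.  Here is
a minimal sketch, with invented constants (4K pages, 512 PTEs per
page-table page, so a 2M span), of a sweep that clamps each chunk to
the next page-of-PTEs boundary; a real HAT would layer segment
boundaries, VA holes, and large-page checks on top of this.

```c
#include <assert.h>

/*
 * Hypothetical sketch of sweeping a VA range while respecting one
 * natural boundary: the span of virtual memory covered by a single
 * page of PTEs.  Constants are illustrative (4K pages, 512 PTEs per
 * page-table page => 2M span), not tied to any particular MMU.
 */
#define PAGESIZE   4096UL
#define PTE_SPAN   (512UL * PAGESIZE)   /* VA covered by one page of PTEs */

/* Sweep [va, va+len); return how many boundary-clamped chunks it took. */
int sweep_range(unsigned long va, unsigned long len)
{
    unsigned long end = va + len;
    int chunks = 0;

    while (va < end) {
        /* Next PTE-page boundary strictly above va. */
        unsigned long boundary = (va + PTE_SPAN) & ~(PTE_SPAN - 1);
        unsigned long chunk_end = (boundary < end) ? boundary : end;

        /*
         * ... operate on all PTEs in [va, chunk_end); everything in
         * this chunk lives in one page-table page, so the per-PTE
         * work can proceed without re-checking the boundary ...
         */
        va = chunk_end;
        chunks++;
    }
    return chunks;
}
```

Each level of the real hierarchy (segments, holes, large pages)
repeats this same clamp-to-boundary pattern, which is where the bulk
of the code volume comes from.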

Like bcopy
----------
It is somewhat analogous to an optimized version of block
copy.  You have to take care of unaligned data, or data
that is not of the same phase with respect to word size,
cache line size, or page size, or all three.  There must
be logic to take care of a small portion at the beginning,
if any.  Also, there may be a small piece left over at
the end.  Once you get things set up on a natural boundary,
it is smooth sailing for a while.  Short blocks are a
pretty common special case.  You probably want to test
for that and do something special, with reduced overhead.
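The head/aligned-middle/tail structure described above can be
sketched directly.  This is only an illustration of the boundary
logic; real bcopy implementations are far more elaborate (cache-line
phases, special short-block paths per size class, and so on).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Sketch of the head/aligned-middle/tail structure of an optimized
 * block copy.  Illustrative only; a production bcopy would also
 * worry about cache-line phase, overlap, and per-size fast paths.
 */
void block_copy(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Short blocks: skip the alignment machinery entirely. */
    if (len < 2 * sizeof (uintptr_t)) {
        while (len--)
            *d++ = *s++;
        return;
    }

    /* Head: byte-copy until dst reaches a word boundary. */
    while ((uintptr_t)d & (sizeof (uintptr_t) - 1)) {
        *d++ = *s++;
        len--;
    }

    /* Middle: word-at-a-time while a full word remains. */
    while (len >= sizeof (uintptr_t)) {
        uintptr_t w;

        memcpy(&w, s, sizeof (w));   /* src may still be unaligned */
        memcpy(d, &w, sizeof (w));
        d += sizeof (uintptr_t);
        s += sizeof (uintptr_t);
        len -= sizeof (uintptr_t);
    }

    /* Tail: the small piece left over at the end, if any. */
    while (len--)
        *d++ = *s++;
}
```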

The bulk of the logic that makes it difficult to
understand, difficult to debug, and that takes up all that
code space, is in dealing with boundaries.  The actual
load and store is buried in there somewhere; it is trivial.

And, so it is with HAT code that operates on VA-ranges,
although to a lesser degree.  In several HAT implementations,
you will see the same logic repeated for mapping, unmapping,
changing protections, etc.  Only the call to the underlying
operation on a single PTE is different.  So, after
a while, you can look at a lot of HAT code and think,
"yada yada ... been there, done that ..." and then get
down to business and concentrate on the functions that
modify a single PTE.

Inversion of control
--------------------
In some HAT implementations the repeated logic is made
manageable using macros.  In the Solaris/IA64 HAT, there
really was just one function to sweep through a VA range
without regard to what is to be done to each PTE.  A set
of HAT operations was encoded and the 3-tuple,

 {hat_op, attr, flags}

was simply handed down to the bottom layer, without
interpretation.  Finally, the modify_pte function decided
what to do with the PTE.  This worked out well, partly
because the IA64 had enough registers so that the typical
set of arguments could be passed in registers and kept
in place to be passed down to the next level, without
saving and restoring.  On a register-starved architecture,
like IA32, it might not work out so well; maybe passing a
pointer to the 3-tuple would work better.  Or, maybe even
that would be prohibitive.  I have not tried it.  An idea
for a Solaris/PPC 2.<future> project would be to see how
well that works out with the PowerPC stack, general-purpose
registers, and calling conventions.
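The scheme can be sketched as follows.  This is a hypothetical
reconstruction, not the actual Solaris/IA64 code: one sweep function
walks the range without interpreting the operation, and only
modify_pte(), at the bottom, decodes the {hat_op, attr, flags}
3-tuple.  Names and encodings are invented for the example.

```c
#include <assert.h>

/*
 * Hypothetical sketch of the inversion-of-control scheme: one sweep
 * function for all VA-range operations; only modify_pte(), at the
 * bottom, interprets the {hat_op, attr, flags} tuple.
 */
#define PAGESIZE 4096UL

enum hat_op { HAT_LOAD, HAT_UNLOAD, HAT_CHGATTR };

struct hat_args {                 /* the uninterpreted 3-tuple */
    enum hat_op op;
    unsigned    attr;
    unsigned    flags;
};

/* The one place a PTE is actually modified. */
static void modify_pte(unsigned long *ptep, const struct hat_args *ap)
{
    switch (ap->op) {
    case HAT_LOAD:
        *ptep = 1UL | ap->attr;           /* valid bit + attributes */
        break;
    case HAT_UNLOAD:
        *ptep = 0;
        break;
    case HAT_CHGATTR:
        *ptep = (*ptep & 1UL) | ap->attr; /* keep valid bit, swap attrs */
        break;
    }
}

/* One sweep for all operations; 'ap' is simply handed down. */
void hat_range_op(unsigned long *ptes, unsigned long va,
    unsigned long len, const struct hat_args *ap)
{
    for (unsigned long off = 0; off < len; off += PAGESIZE)
        modify_pte(&ptes[(va + off) / PAGESIZE], ap);
}
```

The payoff is that the boundary-sweeping logic exists exactly once;
adding a new range operation means adding a case to modify_pte, not
cloning the sweep.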

Choke point
-----------
At least during bringup, it is a good idea to make sure
there is only one function to modify a PTE, or if that
is impractical, three functions: add, modify, delete.
That way, extra constraints and instrumentation, like
an MMU flight recorder, can be added to the one strategic
place.  Later, there may be performance reasons for having
optimized variants of that code for specific cases.
But, that comes at a price in flexibility and maintainability.
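A choke point of that kind might look like the following.  This is a
hypothetical sketch, with invented names: every PTE store funnels
through one function, which logs old and new values into a small
ring buffer, so recent MMU history can be inspected from a debugger
after a crash.

```c
#include <assert.h>

/*
 * Hypothetical sketch of a single PTE choke point instrumented with
 * an MMU flight recorder: a small ring buffer logging every change.
 * Sizes and names are illustrative.
 */
#define FR_SLOTS 8

struct fr_rec {
    unsigned long *ptep;           /* which PTE was touched */
    unsigned long  oldval;
    unsigned long  newval;
};

static struct fr_rec fr_ring[FR_SLOTS];
static unsigned long fr_seq;       /* total number of records logged */

/* The one strategic place every PTE modification goes through. */
void pte_store(unsigned long *ptep, unsigned long newval)
{
    struct fr_rec *r = &fr_ring[fr_seq % FR_SLOTS];

    r->ptep = ptep;
    r->oldval = *ptep;
    r->newval = newval;
    fr_seq++;

    *ptep = newval;                /* the actual, trivial operation */
}
```

Extra constraints (assertions on illegal transitions, statistics)
can be hung on the same spot with no change to any caller.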

-- Guy Shaw


Reply via email to