Here is a description of the z990 TLB from the IBM Journal of Research and
Development, Volume 48, Number 3/4, 2004, article "The IBM eServer z990
Microprocessor". Sorry for the graphics... The TLB1 has 512 entries.
The TLB2 has a unique two-part design that buffers a large number of
translations and is tagged by LPAR. Ultimately, 4K page-table entries are
stored there. Also, the main execution engine is not used to do the
translation; a separate programmable state machine does it. Thus the
processor and the translator can run in parallel until the processor
actually requires the result of the translation. z/VM manages virtual Linux
images by using separate TLB entries within its virtual space. That is, its
virtual machines share a real memory space that z/VM manages, so the TLB
entries do not collide, and z/VM does not have to purge TLB entries when
switching virtual machines. However, such a virtual machine environment
does put more stress on the TLB, causing more misses, and so the TLB in
System z is relatively large. Similarly, z/OS has multiple virtual spaces,
so it also drives TLB stress. There are also instances that require purging
of the TLBs. Finally, PR/SM maintains separate memory spaces for each of
its partitions, so you want to protect each partition from purges by the
others. Thus there are tags in the TLB. The information below describes how
this is done.
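Conceptually, the tagging works like the following minimal C sketch. The
struct layout, field names, and widths here are my own illustration, not
the z990's actual array format:

    /* Hypothetical LPAR-tagged TLB entry -- layout is illustrative only. */
    struct tlb_entry {
        unsigned long vpn;       /* virtual page number (address tag)    */
        unsigned long pfn;       /* absolute page frame number           */
        unsigned int  lpar_tag;  /* which LPAR created this entry        */
        unsigned int  valid : 1;
    };

    /* A hit requires both the address tag and the LPAR tag to match, so
     * entries created by different partitions never collide and need not
     * be purged on a partition switch. */
    static int tlb_hit(const struct tlb_entry *e,
                       unsigned long vpn, unsigned int lpar)
    {
        return e->valid && e->vpn == vpn && e->lpar_tag == lpar;
    }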
TLB2 and the programmable translator
We have found that many applications that run on zSeries systems can
obtain significantly better performance by using much larger TLBs than are
common in the industry. Part of this is due to the characteristics of the
applications themselves, but another factor is that many customers run
numerous z/OS* and Linux** images simultaneously in a logically
partitioned (LPAR) environment. With Linux, each image has its own virtual
machine environment created by z/VM* and actively uses numerous virtual
pages. Obviously, since each virtual page requires a TLB entry, very large
TLBs are required to avoid thrashing when context switches are performed
among the operating systems and virtual machines.
TLB2
The z990 microprocessor provides a TLB arrangement which advantageously
uses two buffers: a relatively small first-level TLB1 and a larger
second-level TLB2. The second-level TLB feeds address translation
information to the first-level TLB when the desired virtual address is not
contained in the first-level TLB. The TLB2 comprises two four-way
set-associative subunits: one, called the Combined Region Segment Table
Entry (CRSTE) TLB2, covers the higher-level address-translation levels;
the other one, the Page Table Entry (PTE) TLB2, covers the lowest
translation level. An advantage of this scheme is that the output of the
CRSTE TLB2 is a valid page-table origin when a match is found for the
higher address bits and a valid entry was built before. In this case,
since all accesses to the higher-level translation tables (region and
segment tables) are bypassed, there is a considerable performance gain
when there is a hit in the CRSTE TLB2 but a miss in the PTE TLB2. With
this feature, the start address of the page table can be found within one
cycle and can be used for the last table access to obtain the absolute
address. A diagram of the TLB2 is shown in Figure 5.
Figure 5: Diagram of the TLB2. (Embedded image moved to file: pic05446.jpg)
The linkage of the CRSTE to the PTE TLB2 is established by means of seven
bits of the segment index from the full 64-bit virtual address. These bits
serve as an index address covering the address range of the CRSTE TLB2;
the same information is used as tag information in the PTE TLB2 and is
used as a quick reference for any lookup operation in order to find the
absolute address of the relevant virtual address translation.
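To make the two-level lookup concrete, here is a rough C sketch under my
own simplifying assumptions. The bit positions, set counts, and structure
layouts are illustrative stand-ins, not the actual z990 arrays, and
fetch_pte() and full_table_walk() are hypothetical helpers for the paths
not modeled here:

    #include <stdint.h>

    #define WAYS        4    /* both subunits are four-way set-associative */
    #define CRSTE_SETS  128  /* reached through 7 bits of the segment index */
    #define PTE_SETS    64   /* illustrative */

    /* Illustrative bit positions -- not the real z/Architecture layout. */
    #define SEG_IDX7(va)  (((va) >> 20) & 0x7f) /* 7-bit index/link tag  */
    #define PTE_IDX(va)   (((va) >> 12) & 0x3f)

    struct crste_entry {     /* region and segment levels combined        */
        uint64_t vtag;       /* upper virtual-address bits                */
        uint64_t pto;        /* page-table origin saved from a prior walk */
        int      valid;
    };

    struct pte_entry {       /* lowest translation level                  */
        uint64_t vtag;       /* virtual page number                       */
        unsigned link7;      /* same 7 segment-index bits, used as tag    */
        uint64_t frame;      /* absolute address of the page frame        */
        int      valid;
    };

    static struct crste_entry crste_tlb2[CRSTE_SETS][WAYS];
    static struct pte_entry   pte_tlb2[PTE_SETS][WAYS];

    /* Stand-ins for hardware paths this sketch does not model. */
    extern uint64_t fetch_pte(uint64_t pto, uint64_t va); /* 1 table access */
    extern uint64_t full_table_walk(uint64_t va);         /* translator     */

    uint64_t tlb2_translate(uint64_t va)
    {
        unsigned link = SEG_IDX7(va);

        /* 1. PTE TLB2 hit: the absolute address is available at once. */
        for (int w = 0; w < WAYS; w++) {
            struct pte_entry *e = &pte_tlb2[PTE_IDX(va)][w];
            if (e->valid && e->link7 == link && e->vtag == (va >> 12))
                return e->frame | (va & 0xfff);
        }

        /* 2. CRSTE hit, PTE miss: the stored page-table origin bypasses
         *    the region- and segment-table accesses, leaving only the
         *    final page-table fetch. */
        for (int w = 0; w < WAYS; w++) {
            struct crste_entry *e = &crste_tlb2[link][w];
            if (e->valid && e->vtag == (va >> 27))
                return fetch_pte(e->pto, va);
        }

        /* 3. Full miss: hand the request to the programmable translator. */
        return full_table_walk(va);
    }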
Programmable translator
The z/Architecture provides software with numerous different modes for
defining virtual address spaces. These modes offer tremendous flexibility
and make software running on zSeries processors very robust, but they also
impose complexities on a processor implementing the z/Architecture. The
translator unit comprises a new control concept to ease the implementation
of the complicated algorithms for dynamic address and access-register
translation in a virtual guest environment. Instead of the hardwired state
machine used in prior processors, the overall control of the unit is
accomplished by an embedded programmable processor, called a picoengine,
with its control program stored in a small RAM, called the picocode RAM.
The programmable translator has essentially the same performance as prior
translator designs that were implemented purely in hardware.
The main advantages of the picoengine-based translator are as follows:
- All dataflow control functions are programmable. If a bug is found late
  in the development cycle, it can easily be fixed with a simple change to
  the contents of the picocode RAM. This RAM is loaded during the power-on
  reset phase of the system.
- New translation modes can be added to the instruction set architecture
  after the processor is committed to silicon.
- Design changes do not affect the cycle time of the control logic; in
  general, they can be implemented as a picocode change.
- The picoengine is composed of standard logical building blocks (picocode
  RAM, branch decoder, etc.), which simplifies problem analysis.
- Error checking is much easier to implement than in a complex state
  machine, since simple parity checking of the picocode RAM provides good
  coverage of the state controls. The remainder of the engine is checked
  using traditional methods.
A diagram of the picocode engine is shown in Figure 6. When the caches
miss in their TLB1s for a virtual address, the request is sent to the TLB2
and translator unit. If the request also misses in the TLB2, the
translator unit decodes the request to obtain the starting address for one
of the numerous translation algorithms stored in the picocode RAM. The
picocode instructions are horizontally organized and are executed in one
microprocessor clock cycle. They are of two different types: either they
control the multiplexers of the dataflow part (three-stage pipelined
dataflow control) or they are used as branch instructions to transfer
control to one out of four different branch targets (four-way branch
decoder) if a preset condition is met. If no condition is met, or if the
instruction type is a control instruction, the next sequential instruction
(NSI) address stored in RAM is used to access the next instruction.
Subroutine execution is supported by means of a hard-wired branch address
decoder combined with a branch-return stack. When the translation
operation is completed, the results are returned and are stored in the
TLB2 and the TLB1 from which the request originated.
Figure 6: Diagram of the picocode engine. (Embedded image moved to file:
pic18015.jpg)
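The control loop the article describes can be sketched in C roughly as
follows. The instruction encoding, RAM size, and stack depth are my
assumptions (the real picocode format is not published in the article),
and the extern functions are stand-ins for the surrounding hardware:

    #include <stdint.h>

    #define PICO_RAM_WORDS 1024   /* size is illustrative                */
    #define STACK_DEPTH    8

    /* Hypothetical horizontal instruction word, mirroring only the
     * fields the article names. */
    struct pico_insn {
        int      is_branch;       /* control word vs. branch word        */
        uint32_t mux_controls;    /* drives the 3-stage pipelined dataflow */
        uint16_t nsi;             /* next-sequential-instruction address */
        uint16_t target[4];       /* four-way branch decoder targets     */
        int      is_call, is_return; /* subroutine linkage               */
    };

    static struct pico_insn pico_ram[PICO_RAM_WORDS]; /* loaded at POR   */

    /* Stand-ins for hardware this sketch does not model. */
    extern void drive_dataflow(uint32_t mux_controls);
    extern int  branch_condition(const struct pico_insn *i); /* 0..3, -1 */
    extern int  translation_done(void);

    void picoengine_run(uint16_t entry_pc)
    {
        uint16_t pc = entry_pc;
        uint16_t ret_stack[STACK_DEPTH];   /* branch-return stack        */
        int sp = 0;

        while (!translation_done()) {
            const struct pico_insn *i = &pico_ram[pc % PICO_RAM_WORDS];

            if (!i->is_branch) {
                /* Control word: steer the dataflow multiplexers, then
                 * fall through to the NSI address stored in RAM.        */
                drive_dataflow(i->mux_controls);
                pc = i->nsi;
            } else if (i->is_return && sp > 0) {
                pc = ret_stack[--sp];      /* return from a subroutine   */
            } else {
                int cond = branch_condition(i); /* four-way decoder      */
                if (cond >= 0) {
                    if (i->is_call && sp < STACK_DEPTH)
                        ret_stack[sp++] = i->nsi; /* remember return pt. */
                    pc = i->target[cond];
                } else {
                    pc = i->nsi;           /* no condition met           */
                }
            }
        }
    }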
TLB purge operations
One of the drawbacks associated with large TLBs is the inordinately large
performance loss when they must be purged. Several instructions in the
z/Architecture require the TLBs to be purged of all entries or of selected
entries. The Purge TLB (PTLB) and Compare And Swap And Purge (CSP)
instructions respectively cause the TLBs to be completely purged on this
processor or on all processors in the system. The Invalidate Page Table
Entry (IPTE), Set Storage Key Extended (SSKE), and the new Invalidate DAT
Table Entry (IDTE) instructions cause TLB entries to be selectively purged
on all processors in the system. When TLBs on all processors have to be
purged, it causes the entire system to be quiesced; this quiesced state is
necessary so that the TLBs on all processors can be updated atomically
with the resource being modified. This is called a broadcast purge
operation.
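As a compact summary of the scopes just described (the enum is my own; the
instruction names are from the article):

    /* Scope of the z/Architecture TLB-purge instructions listed above. */
    enum purge_scope {
        PURGE_LOCAL_ALL,     /* PTLB: all entries, this processor only   */
        PURGE_GLOBAL_ALL,    /* CSP:  all entries, all processors        */
        PURGE_GLOBAL_SELECT, /* IPTE, SSKE, IDTE: selected entries on
                                all processors (broadcast purge)         */
    };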
A z990 system can have up to 64 physical processors installed (with up to
48 being normal processors accessible to a customer) and up to 60 logical
partitions (LPARs). It has been shown on a prior-generation 16-way z900
system that up to ten percent of all time was spent idling by processors
in the quiesced state or waiting for the last processor to reach a
quiesced state. This problem had been evident for some time and was
partially solved on the G5 processor [3]. Although those earlier
mechanisms were implemented on the z900, it still had this very
significant performance loss due to quiesce effects. To make matters
worse, the performance loss grows with the square of the number of
processors. Therefore, the z990 processor implements several new features
to combat this system-quiesce performance loss for TLB purge operations.
The first new feature is that each TLB2 entry stored in the higher-level
subunit is tagged with an identifier to indicate which LPAR partition
created that entry. This allows several improvements to purge
instructions:
- It is possible to keep the entries for several different LPARs in the
  TLB2 at one time. This significantly improves performance when numerous
  z/OS or Linux images are running on the system.
- A PTLB requires only those entries in the TLB2 that were formed by the
  currently active LPAR partition to be purged. On broadcast-purge CSP
  instructions, only those entries must be purged in which there is a
  match between the LPAR identifier stored in the TLB2 and the LPAR
  partition of the quiesce initiator processor. Similar limited purges are
  implemented for IPTE and SSKE. (A sketch of such a tag-filtered purge
  follows the list.)
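A minimal sketch of such a tag-filtered purge; the entry layout is the
same illustrative one used near the top of this note, not the real array
format:

    #include <stddef.h>

    struct tlb2_entry {
        unsigned long vtag;      /* virtual-address tag                  */
        unsigned int  lpar_tag;  /* partition that created the entry     */
        unsigned int  valid : 1;
    };

    /* PTLB under LPAR tagging: invalidate only the entries created by
     * the currently active partition; other partitions' entries survive
     * the purge. */
    void ptlb_current_lpar(struct tlb2_entry *tlb, size_t n, unsigned cur)
    {
        for (size_t i = 0; i < n; i++)
            if (tlb[i].valid && tlb[i].lpar_tag == cur)
                tlb[i].valid = 0;
    }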
Another new feature is LPAR partition filtering for broadcast purge
operations. Previously, all processors had to wait for the last processor
to respond to the broadcast quiesce. The enhancement added to the G5
processor generation was that after responding, a processor could continue
with normal work subject to the restriction that it had to stop if it
missed in its TLB. After the last processor responds to the quiesce
request, the restrictions are lifted. With partition filtering, when a
processor initiates a broadcast purge operation, only those other
processors which are currently operating in the same LPAR partition as the
initiator respond to the broadcast immediately. Other processors perform
the TLB purge operation when they are not doing other useful work. But the
real gain is that fewer processors have to respond, and hence overall
system performance is increased.
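In pseudocode-flavored C, the filtering decision on each receiving
processor might look like this; the function names are hypothetical, not
z990 interfaces:

    /* Hypothetical handler run when a broadcast purge request arrives.
     * Only processors in the initiator's partition respond immediately,
     * and the initiator waits only for them; the rest purge lazily.     */
    extern void purge_matching_entries(unsigned initiator_lpar);
    extern void respond_to_quiesce(void);
    extern void mark_pending_purge(unsigned initiator_lpar);

    void on_broadcast_purge(unsigned initiator_lpar, unsigned my_lpar)
    {
        if (my_lpar == initiator_lpar) {
            purge_matching_entries(initiator_lpar);
            respond_to_quiesce();       /* counted toward the quiesce    */
        } else {
            mark_pending_purge(initiator_lpar); /* done when not busy    */
        }
    }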
Finally, the z990 implements address filtering for certain broadcast purge
operations. When a processor receives a broadcast IPTE purge request, it
saves the page index portion of the address; then it resumes normal
processing subject to the restriction above. However, if it misses in the
TLB and has to translate a virtual address, it is allowed to continue as
long as the page index (or indices, in the event of a pageable guest)
needed for translation does not match the page index that was saved from
the prior broadcast IPTE; only on a match must the processor stop and wait
for the purge to complete. A similar mechanism is implemented for
broadcast SSKE purge operations, but here a portion of the absolute
address is saved and compared in the event of a storage key miss in the
TLB.
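A sketch of the address filter, with my own names and a single saved index
for simplicity (the pageable-guest case can involve more than one index):

    #include <stdint.h>

    /* Set by the broadcast-IPTE handler, cleared when the purge ends. */
    static uint64_t saved_page_index;
    static int      ipte_pending;

    /* After responding to the broadcast, a processor that misses in its
     * TLB may keep translating unless the page index it needs matches
     * the one saved from the IPTE; only a match forces it to stop and
     * wait for the system-wide purge to complete. */
    int may_continue_translation(uint64_t needed_page_index)
    {
        return !ipte_pending || needed_page_index != saved_page_index;
    }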
Joe Temple
Distinguished Engineer
Sr. Certified IT Specialist
[EMAIL PROTECTED]
845-435-6301 295/6301 cell 914-706-5211
Home office 845-338-1448 Home 845-338-8794
Alan Cox <[EMAIL PROTECTED]>
Sent by: Linux on 390 Port <[EMAIL PROTECTED]>
05/18/2006 01:17 PM
To: [email protected]
cc:
Subject: Re: Fw: [LINUX-390] Who's been reading our list...
Please respond to: Linux on 390 Port <[EMAIL PROTECTED]>
On Thu, 2006-05-18 at 09:51 -0400, Joseph Temple wrote:
> Yes tagging works, but you will find that the system z holds a lot more
> translations in a two tiered TLB and has tagging as well. Thus the
> System z does not have to retranslate as often.

How many tags does the Z have in the TLBs?
----------------------------------------------------------------------
For LINUX-390 subscribe / signoff / archive access instructions,
send email to [EMAIL PROTECTED] with the message: INFO LINUX-390 or
visit
http://www.marist.edu/htbin/wlvindex?LINUX-390
----------------------------------------------------------------------
For LINUX-390 subscribe / signoff / archive access instructions,
send email to [EMAIL PROTECTED] with the message: INFO LINUX-390 or visit
http://www.marist.edu/htbin/wlvindex?LINUX-390
<<attachment: pic05446.jpg>>
<<attachment: pic18015.jpg>>
