Here is a description of the z990 TLB from the IBM Journal of Research and
Development, Volume 48, Number 3/4, 2004, article "The IBM eServer z990
Microprocessor". Sorry for the graphics... The TLB1 has 512 entries.
The TLB2 has a unique two-part design that buffers a large number of
translations and is tagged by LPAR. Ultimately, 4K page-table entries are
stored there. Also, the main execution engine is not used to do the
translation; a separate programmable state machine does it. Thus the
processor and the translator can run in parallel until the processor
actually requires the result of the translation. z/VM manages virtual Linux
images by using separate TLB entries within its virtual space. That is, its
virtual machines share a real memory space that z/VM manages, so the TLB
entries do not collide, and z/VM does not have to purge TLB entries when
switching virtual machines. However, such a virtual machine environment
does put more stress on the TLB, causing more misses, and so the TLB in
System z is relatively large. Similarly, z/OS has multiple virtual spaces,
so it also drives TLB stress. There are also instances that require purging
of the TLBs. Finally, PR/SM maintains separate memory spaces for each of
its partitions, so you want to protect each partition from purges by the
others. Thus there are tags in the TLB. The information below describes how
this is done.
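Conceptually, the tagging works like the following minimal C sketch. The
struct layout, field names, and widths here are my own illustration, not
the z990's actual array format:

    /* Hypothetical LPAR-tagged TLB entry -- layout is illustrative only. */
    struct tlb_entry {
        unsigned long vpn;       /* virtual page number (address tag)    */
        unsigned long pfn;       /* absolute page frame number           */
        unsigned int  lpar_tag;  /* which LPAR created this entry        */
        unsigned int  valid : 1;
    };

    /* A hit requires both the address tag and the LPAR tag to match, so
     * entries created by different partitions never collide and need not
     * be purged on a partition switch. */
    static int tlb_hit(const struct tlb_entry *e,
                       unsigned long vpn, unsigned int lpar)
    {
        return e->valid && e->vpn == vpn && e->lpar_tag == lpar;
    }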
TLB2 and the programmable translator
We have found that many applications that run on zSeries systems can
obtain significantly better performance by using much larger TLBs than are
common in the industry. Part of this is due to the characteristics of the
applications themselves, but another factor is that many customers run
numerous z/OS* and Linux** images simultaneously in a logically
partitioned (LPAR) environment. With Linux, each image has its own virtual
machine environment created by z/VM* and actively uses numerous virtual
pages. Obviously, since each virtual page requires a TLB entry, very large
TLBs are required to avoid thrashing when context switches are performed
among the operating systems and virtual machines.
TLB2
The z990 microprocessor provides a TLB arrangement which advantageously
uses two buffers: a relatively small first-level TLB1 and a larger
second-level TLB2. The second-level TLB feeds address translation
information to the first-level TLB when the desired virtual address is not
contained in the first-level TLB. The TLB2 comprises two four-way
set-associative subunits: one, called the Combined Region Segment Table
Entry (CRSTE) TLB2, covers the higher-level address-translation levels;
the other one, the Page Table Entry (PTE) TLB2, covers the lowest
translation level. An advantage of this scheme is that the output of the
CRSTE TLB2 is a valid page-table origin when a match is found for the
higher address bits and a valid entry was built before. In this case,
since all accesses to the higher-level translation tables (region and
segment tables) are bypassed, there is a considerable performance gain
when there is a hit in the CRSTE TLB2 but a miss in the PTE TLB2. With
this feature, the start address of the page table can be found within one
cycle and can be used for the last table access to obtain the absolute
address. A diagram of the TLB2 is shown in Figure 5.
Figure 5: Diagram of the TLB2. (Embedded image moved to file: pic05446.jpg)
The linkage of the CRSTE to the PTE TLB2 is established by means of seven
bits of the segment index from the full 64-bit virtual address. These bits
serve as an index address covering the address range of the CRSTE TLB2;
the same information is used as tag information in the PTE TLB2 and is
used as a quick reference for any lookup operation in order to find the
absolute address of the relevant virtual address translation.
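To make the two-level lookup concrete, here is a rough C sketch under my
own simplifying assumptions. The bit positions, set counts, and structure
layouts are illustrative stand-ins, not the actual z990 arrays, and
fetch_pte() and full_table_walk() are hypothetical helpers for the paths
not modeled here:

    #include <stdint.h>

    #define WAYS        4    /* both subunits are four-way set-associative */
    #define CRSTE_SETS  128  /* reached through 7 bits of the segment index */
    #define PTE_SETS    64   /* illustrative */

    /* Illustrative bit positions -- not the real z/Architecture layout. */
    #define SEG_IDX7(va)  (((va) >> 20) & 0x7f) /* 7-bit index/link tag  */
    #define PTE_IDX(va)   (((va) >> 12) & 0x3f)

    struct crste_entry {     /* region and segment levels combined        */
        uint64_t vtag;       /* upper virtual-address bits                */
        uint64_t pto;        /* page-table origin saved from a prior walk */
        int      valid;
    };

    struct pte_entry {       /* lowest translation level                  */
        uint64_t vtag;       /* virtual page number                       */
        unsigned link7;      /* same 7 segment-index bits, used as tag    */
        uint64_t frame;      /* absolute address of the page frame        */
        int      valid;
    };

    static struct crste_entry crste_tlb2[CRSTE_SETS][WAYS];
    static struct pte_entry   pte_tlb2[PTE_SETS][WAYS];

    /* Stand-ins for hardware paths this sketch does not model. */
    extern uint64_t fetch_pte(uint64_t pto, uint64_t va); /* 1 table access */
    extern uint64_t full_table_walk(uint64_t va);         /* translator     */

    uint64_t tlb2_translate(uint64_t va)
    {
        unsigned link = SEG_IDX7(va);

        /* 1. PTE TLB2 hit: the absolute address is available at once. */
        for (int w = 0; w < WAYS; w++) {
            struct pte_entry *e = &pte_tlb2[PTE_IDX(va)][w];
            if (e->valid && e->link7 == link && e->vtag == (va >> 12))
                return e->frame | (va & 0xfff);
        }

        /* 2. CRSTE hit, PTE miss: the stored page-table origin bypasses
         *    the region- and segment-table accesses, leaving only the
         *    final page-table fetch. */
        for (int w = 0; w < WAYS; w++) {
            struct crste_entry *e = &crste_tlb2[link][w];
            if (e->valid && e->vtag == (va >> 27))
                return fetch_pte(e->pto, va);
        }

        /* 3. Full miss: hand the request to the programmable translator. */
        return full_table_walk(va);
    }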
Programmable translator
The z/Architecture provides software with numerous different modes for
defining virtual address spaces. These modes offer tremendous flexibility
and make software running on zSeries processors very robust, but they also
impose complexities on a processor implementing the z/Architecture. The
translator unit comprises a new control concept to ease the implementation
of the complicated algorithms for dynamic address and access-register
translation in a virtual guest environment. Instead of the hardwired state
machine used in prior processors, the overall control of the unit is
accomplished by an embedded programmable processor, called a picoengine,
with its control program stored in a small RAM, called the picocode RAM.
The programmable translator has essentially the same performance as prior
translator designs that were implemented purely in hardware.
The main advantages of the picoengine-based translator are as follows:
- All dataflow control functions are programmable. If a bug is found late
  in the development cycle, it can easily be fixed with a simple change to
  the contents of the picocode RAM. This RAM is loaded during the power-on
  reset phase of the system.
- New translation modes can be added to the instruction set architecture
  after the processor is committed to silicon.
- Design changes do not affect the cycle time of the control logic; in
  general, they can be implemented as a picocode change.
- The picoengine is composed of standard logical building blocks (picocode
  RAM, branch decoder, etc.), which simplifies problem analysis.
- Error checking is much easier to implement than in a complex state
  machine, since simple parity checking of the picocode RAM provides good
  coverage of the state controls. The remainder of the engine is checked
  using traditional methods.
A diagram of the picocode engine is shown in Figure 6. When the caches
miss in their TLB1s for a virtual address, the request is sent to the TLB2
and translator unit. If the request also misses in the TLB2, the
translator unit decodes the request to obtain the starting address for one
of the numerous translation algorithms stored in the picocode RAM. The
picocode instructions are horizontally organized and are executed in one
microprocessor clock cycle. They are of two different types: either they
control the multiplexers of the dataflow part (three-stage pipelined
dataflow control) or they are used as branch instructions to transfer
control to one out of four different branch targets (four-way branch
decoder) if a preset condition is met. If no condition is met, or if the
instruction type is a control instruction, the next sequential instruction
(NSI) address stored in RAM is used to access the next instruction.
Subroutine execution is supported by means of a hard-wired branch address
decoder combined with a branch-return stack. When the translation
operation is completed, the results are returned and are stored in the
TLB2 and the TLB1 from which the request originated.
Figure 6: Diagram of the picocode engine. (Embedded image moved to file:
pic18015.jpg)
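The control loop the article describes can be sketched in C roughly as
follows. The instruction encoding, RAM size, and stack depth are my
assumptions (the real picocode format is not published in the article),
and the extern functions are stand-ins for the surrounding hardware:

    #include <stdint.h>

    #define PICO_RAM_WORDS 1024   /* size is illustrative                */
    #define STACK_DEPTH    8

    /* Hypothetical horizontal instruction word, mirroring only the
     * fields the article names. */
    struct pico_insn {
        int      is_branch;       /* control word vs. branch word        */
        uint32_t mux_controls;    /* drives the 3-stage pipelined dataflow */
        uint16_t nsi;             /* next-sequential-instruction address */
        uint16_t target[4];       /* four-way branch decoder targets     */
        int      is_call, is_return; /* subroutine linkage               */
    };

    static struct pico_insn pico_ram[PICO_RAM_WORDS]; /* loaded at POR   */

    /* Stand-ins for hardware this sketch does not model. */
    extern void drive_dataflow(uint32_t mux_controls);
    extern int  branch_condition(const struct pico_insn *i); /* 0..3, -1 */
    extern int  translation_done(void);

    void picoengine_run(uint16_t entry_pc)
    {
        uint16_t pc = entry_pc;
        uint16_t ret_stack[STACK_DEPTH];   /* branch-return stack        */
        int sp = 0;

        while (!translation_done()) {
            const struct pico_insn *i = &pico_ram[pc % PICO_RAM_WORDS];

            if (!i->is_branch) {
                /* Control word: steer the dataflow multiplexers, then
                 * fall through to the NSI address stored in RAM.        */
                drive_dataflow(i->mux_controls);
                pc = i->nsi;
            } else if (i->is_return && sp > 0) {
                pc = ret_stack[--sp];      /* return from a subroutine   */
            } else {
                int cond = branch_condition(i); /* four-way decoder      */
                if (cond >= 0) {
                    if (i->is_call && sp < STACK_DEPTH)
                        ret_stack[sp++] = i->nsi; /* remember return pt. */
                    pc = i->target[cond];
                } else {
                    pc = i->nsi;           /* no condition met           */
                }
            }
        }
    }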
TLB purge operations
One of the drawbacks associated with large TLBs is the inordinately large
performance loss when they must be purged. Several instructions in the
z/Architecture require the TLBs to be purged of all entries or of selected
entries. The Purge TLB (PTLB) and Compare And Swap And Purge (CSP)
instructions respectively cause the TLBs to be completely purged on this
processor or on all processors in the system. The Invalidate Page Table
Entry (IPTE), Set Storage Key Extended (SSKE), and the new Invalidate DAT
Table Entry (IDTE) instructions cause TLB entries to be selectively purged
on all processors in the system. When TLBs on all processors have to be
purged, it causes the entire system to be quiesced; this quiesced state is
necessary so that the TLBs on all processors can be updated atomically
with the resource being modified. This is called a broadcast purge
operation.
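As a compact summary of the scopes just described (the enum is my own; the
instruction names are from the article):

    /* Scope of the z/Architecture TLB-purge instructions listed above. */
    enum purge_scope {
        PURGE_LOCAL_ALL,     /* PTLB: all entries, this processor only   */
        PURGE_GLOBAL_ALL,    /* CSP:  all entries, all processors        */
        PURGE_GLOBAL_SELECT, /* IPTE, SSKE, IDTE: selected entries on
                                all processors (broadcast purge)         */
    };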
A z990 system can have up to 64 physical processors installed (with up to
48 being normal processors accessible to a customer) and up to 60 logical
partitions (LPARs). It has been shown on a prior-generation 16-way z900
system that up to ten percent of all time was spent idling by processors
in the quiesced state or waiting for the last processor to reach a
quiesced state. This problem had been evident for some time and was
partially solved on the G5 processor [3]. Although those earlier
mechanisms were implemented on the z900, it still had this very
significant performance loss due to quiesce effects. To make matters
worse, the performance loss grows with the square of the number of
processors. Therefore, the z990 processor implements several new features
to combat this system-quiesce performance loss for TLB purge operations.
The first new feature is that each TLB2 entry stored in the higher-level
subunit is tagged with an identifier to indicate which LPAR partition
created that entry. This allows several improvements to purge
instructions:
- It is possible to keep the entries for several different LPARs in the
  TLB2 at one time. This significantly improves performance when numerous
  z/OS or Linux images are running on the system.
- A PTLB requires only those entries in the TLB2 that were formed by the
  currently active LPAR partition to be purged. On broadcast-purge CSP
  instructions, only those entries must be purged in which there is a
  match between the LPAR identifier stored in the TLB2 and the LPAR
  partition of the quiesce initiator processor. Similar limited purges are
  implemented for IPTE and SSKE. (A sketch of such a tag-filtered purge
  follows the list.)
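A minimal sketch of such a tag-filtered purge; the entry layout is the
same illustrative one used near the top of this note, not the real array
format:

    #include <stddef.h>

    struct tlb2_entry {
        unsigned long vtag;      /* virtual-address tag                  */
        unsigned int  lpar_tag;  /* partition that created the entry     */
        unsigned int  valid : 1;
    };

    /* PTLB under LPAR tagging: invalidate only the entries created by
     * the currently active partition; other partitions' entries survive
     * the purge. */
    void ptlb_current_lpar(struct tlb2_entry *tlb, size_t n, unsigned cur)
    {
        for (size_t i = 0; i < n; i++)
            if (tlb[i].valid && tlb[i].lpar_tag == cur)
                tlb[i].valid = 0;
    }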
Another new feature is LPAR partition filtering for broadcast purge
operations. Previously, all processors had to wait for the last processor
to respond to the broadcast quiesce. The enhancement added to the G5
processor generation was that after responding, a processor could continue
with normal work subject to the restriction that it had to stop if it
missed in its TLB. After the last processor responds to the quiesce
request, the restrictions are lifted. With partition filtering, when a
processor initiates a broadcast purge operation, only those other
processors which are currently operating in the same LPAR partition as the
initiator respond to the broadcast immediately. Other processors perform
the TLB purge operation when they are not doing other useful work. But the
real gain is that fewer processors have to respond, and hence overall
system performance is increased.
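In pseudocode-flavored C, the filtering decision on each receiving
processor might look like this; the function names are hypothetical, not
z990 interfaces:

    /* Hypothetical handler run when a broadcast purge request arrives.
     * Only processors in the initiator's partition respond immediately,
     * and the initiator waits only for them; the rest purge lazily.     */
    extern void purge_matching_entries(unsigned initiator_lpar);
    extern void respond_to_quiesce(void);
    extern void mark_pending_purge(unsigned initiator_lpar);

    void on_broadcast_purge(unsigned initiator_lpar, unsigned my_lpar)
    {
        if (my_lpar == initiator_lpar) {
            purge_matching_entries(initiator_lpar);
            respond_to_quiesce();       /* counted toward the quiesce    */
        } else {
            mark_pending_purge(initiator_lpar); /* done when not busy    */
        }
    }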
Finally, the z990 implements address filtering for certain broadcast purge
operations. When a processor receives a broadcast IPTE purge request, it
saves the page index portion of the address; then it resumes normal
processing subject to the restriction above. However, if it misses in the
TLB and has to translate a virtual address, it is allowed to continue as
long as the page index (or indices, in the event of a pageable guest)
needed for translation does not match the page index that was saved from
the prior broadcast IPTE; only on a match must the processor stop and wait
for the purge to complete. A similar mechanism is implemented for
broadcast SSKE purge operations, but here a portion of the absolute
address is saved and compared in the event of a storage key miss in the
TLB.
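A sketch of the address filter, with my own names and a single saved index
for simplicity (the pageable-guest case can involve more than one index):

    #include <stdint.h>

    /* Set by the broadcast-IPTE handler, cleared when the purge ends. */
    static uint64_t saved_page_index;
    static int      ipte_pending;

    /* After responding to the broadcast, a processor that misses in its
     * TLB may keep translating unless the page index it needs matches
     * the one saved from the IPTE; only a match forces it to stop and
     * wait for the system-wide purge to complete. */
    int may_continue_translation(uint64_t needed_page_index)
    {
        return !ipte_pending || needed_page_index != saved_page_index;
    }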
Joe Temple
Distinguished Engineer
Sr. Certified IT Specialist
[EMAIL PROTECTED]
845-435-6301 295/6301 cell 914-706-5211
Home office 845-338-1448 Home 845-338-8794
Alan Cox <[EMAIL PROTECTED]>
Sent by: Linux on 390 Port <[EMAIL PROTECTED]>
05/18/2006 01:17 PM
To: [email protected]
cc:
Subject: Re: Fw: [LINUX-390] Who's been reading our list...
Please respond to: Linux on 390 Port <[EMAIL PROTECTED]>
On Thu, 2006-05-18 at 09:51 -0400, Joseph Temple wrote:
> Yes tagging works, but you will find that the system z holds a lot more
> translations in a two tiered TLB and has tagging as well. Thus the
> System z does not have to retranslate as often.

How many tags does the Z have in the TLBs?
----------------------------------------------------------------------
For LINUX-390 subscribe / signoff / archive access instructions,
send email to [EMAIL PROTECTED] with the message: INFO LINUX-390 or
visit
http://www.marist.edu/htbin/wlvindex?LINUX-390
----------------------------------------------------------------------
For LINUX-390 subscribe / signoff / archive access instructions,
send email to [EMAIL PROTECTED] with the message: INFO LINUX-390 or visit
http://www.marist.edu/htbin/wlvindex?LINUX-390
<<attachment: pic05446.jpg>>
<<attachment: pic18015.jpg>>
