http://unix.derkeiler.com/Newsgroups/comp.unix.solaris/2003-04/1246.html

From: Gavin Maltby <[EMAIL PROTECTED]>
Date: Wed, 09 Apr 2003 12:06:02 +0100

Weiguang Shi wrote:

> I understand that in ultra-sparc II, a TLB entry can be configured to cover a
> large memory area instead of just one page. How is this feature exploited in
> Solaris?

Solaris 9 introduced ppgsz(1) and a preload library that let you
influence the page sizes used for stack, heap and anonymous memory.
In earlier releases large pages were used only by the kernel
(for its nucleus instruction and data pages, 4MB each) and by
intimate shared memory (ISM), which it would try to allocate
in 4MB pages.

> The reason I am asking is that I run a program to explore the latencies of the
> different levels of the memory hierarchy of a Sun box. I know the D-TLB size is
> 64 entries (from the h/w manual).

The ITLB and DTLB on US-II are each 64-entry, fully associative, with a
hardware replacement policy (pseudo-random or pseudo-LRU; I forget which).
Kernel and application share the same TLBs. In fact the
kernel locks down at least 3 entries - one in ITLB for
nucleus text, and two in DTLB for nucleus data and the
current user TSB. Further entries in the DTLB (up to 4)
can be locked for the kernel TSB. So don't think
of all 64 entries as being the reach of the TLB for
an application.

> If the program accesses the 1st word of each
> page of a 8M memory area in a round-robin fashion, and if the page size is 8K,
> there would be 1K pages to visit, one at a time. Then each access would cause a
> cache miss (for the L2 cache is 2M)

The L2 is direct mapped and physically-indexed physically-tagged. So just
which L2 line your read of the 1st word goes to depends on the
physical address of the page being used - i.e., its "physical colour".
The allocators try to arrange that an application gets pages of
all physical colours to make best use of L2. You've only got
1024 cache lines to read in, and you have 32768 possible indices
to which they may go in L2, depending on colouring. So I'm
not certain each access will get an L2 miss - mileage will
vary.

> as long as a TLB miss, thus the latency

But yes, you will get lots of DTLB misses, since you're accessing 1024
pages and the DTLB caches at most 64 translations at a time (more like
the high 50s, as above). You can watch the DTLB miss rate using
trapstat(1M) in Solaris 9.

> would be about twice the memory access latency: first fetch the page to TLB;

We just need to fetch the translation to install it in the TLB. Solaris
has a software-managed translation cache called the TSB (translation
storage buffer) in front of the full translation hash lists. If your
application is exercising these 1024 pages then all/most of those
translations will be in its TSB. The TSB is also cacheable in
L2, so the fetch to get the translation will very often hit in
L2 and not go to memory.

> then fetch the word to cache. This, however, is _not_ case; the latency is the
> same as that of the memory access. I am wondering if the OS is doing something,
> e.g., increasing the granularity of a TLB entry as long as the data area is
> larger than the size of the 2nd level cache.

Nothing like that for US-II. On US-III, which has more TLBs, there is some
dynamic balancing of how a particular process uses each TLB (e.g.,
if you use ppgsz to request lots of 64K pages you might get a TLB dedicated
to 64K pages for that process).

> Any hint, direction, reference is very much appreciated.

It would be nice if a hardware performance counter measured
access times - see man cpc. But I'm not sure this is one
of the things counted.

If you want to measure latencies of memory access, I guess it pays
to bypass both TLB and L2 where possible. You can perform
ASI accesses which do this. These includes physical addressing
ASIs (bypass TLB effects) and L2 bypass ASIs (some will snoop
in L2 but not fill it). It's also nice to be able
to resolve a physical address to a physical location (board etc.) -
I don't think there is a public interface to do that. But
I also remember some new tool which will tell you where your
application memory is resident - I can't remember its name now
(or whether it is included in 5.9).

Gavin

From: Gavin Maltby <[EMAIL PROTECTED]>
Date: Fri, 11 Apr 2003 10:19:19 +0100

Weiguang Shi wrote:

[cut]

>>The L2 is direct mapped and physically-indexed physically-tagged. So just
>
> This L2 (off-chip?) is 32-way associative, according to my measurement.

Yes, the L2 is off-chip for US-II, but it is definitely not 32-way associative.
The standard US-II E-cache is direct mapped; a couple of variants (IIi and/or
IIe) have 2-way (or maybe 4-way) associativity.

[cut]

>
> What I do not understand though, is that there is no difference in average
> access time between, e.g., the following situations:
>
> 1. Visiting the first word of every page of the 8M memory, round robin.
> 2. Visiting the first and 1024th words of every page of the 8M memory, round
> robin.
>
> The total numbers of visits are the same, say N. For situation 1, there will
> be N L2 misses and N TLB misses. For situation 2, there will be N L2 misses
> but N/2 TLB misses.

I'm not sure what you mean by the total number of visits being the same.
In the first there are 1024 loads (first word of each of 1024 8K pages
in an 8MB chunk) and around 1024 TLB misses (one for each page). In
the second there are 2048 loads (two words from each of 1024 pages)
and around 1024 TLB misses (one for each page)??

>So I am expecting the average latency calculated from
> situation 1 would be larger than that from situation 2. The measurement
> results, however, show the two are the same! There are more situations like
> this. In summary, as long as the program visits words that are more than one
> cache line (64B) apart, the latency is the same as that in situation 1.

In the second case I think pipelining may be defeating your measurements
a bit. Your two loads are likely close together in the instruction
stream - even if not adjacent or in the same instruction group there
is likely little of substance between them. Both loads storm down
the initial stages of the pipeline and then stall at the data stage
(the data is not yet available). Then, at a much more leisurely pace,
the data arrives - for both stalled instructions at much the same
time - and they continue down the pipeline.

[cut]

Hope that helps.
