All,

Here are detailed design documents for determining cache and TLB
geometry across our currently supported processor architectures,
with recommendations for implementation.

What I haven't addressed yet is how indirect consumers of the API might
use it (e.g. mutex consumers vs. UMA), in the context of allocating
cache-aligned mutexes from a mutex pool.

Please let me know your thoughts.

BMS

Detailed design for cache/tlb geometry discovery
------------------------------------------------

all
---
Review NetBSD's uvm_page_recolor() for applicability to the FreeBSD VM
and UMA (see the page-coloring sketch below).
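
A minimal sketch of the page-coloring calculation such a review would feed
into, using the hw_cacheinfo structures proposed later in this note (the
function and macro names here are illustrative only, not an existing
interface):

/*
 * Hypothetical sketch: derive the number of page colors from L2 geometry,
 * as a recoloring scheme in the style of NetBSD's uvm_page_recolor()
 * would need to.  Assumes the hw_cache layout proposed later in this
 * note, with nsets holding the associativity and nlines the total line
 * count.
 */
#include <sys/param.h>

static u_int
cache_ncolors(const struct hw_cache *c)
{
        u_int waysize;

        if (c->nsets == 0 || c->nlines == 0)
                return (1);                     /* cache absent or disabled */
        waysize = (c->nlines / c->nsets) * c->linesize;
        return (waysize > PAGE_SIZE ? waysize / PAGE_SIZE : 1);
}

/* A page's color is then (assuming a power-of-two color count): */
#define PAGE_COLOR(pa, ncolors) (((pa) >> PAGE_SHIFT) & ((ncolors) - 1))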

alpha
-----
Action: Add code to machdep.c in identifycpu() to fill out hw_cacheinfo.

Cache discovery: Static tables keyed on specific CPU model.
TLB discovery: Static tables keyed on specific CPU model.

Cache heuristic:
 8KB L1 Split Direct Mapped (21064)
 2MB L2 Unified Direct Mapped (21064)
 All CPUs below 21264 have a 32-byte L1 line size.
 21264 (EV6) has a 64-byte L1 line size.
 Optional L3 cache.
TLB heuristic:
 ITLB 8KB page 8 lines, 4MB page 4 lines (21064)
 DTLB 32 lines, all page sizes, fully associative. (21064)
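
A minimal sketch of the static-table approach for alpha, using the
structures proposed later in this note (the model identifiers and table
contents below are placeholders for illustration, not verified values):

/*
 * Hypothetical sketch: static cache geometry keyed on CPU model, to be
 * consulted from identifycpu().  nsets is the associativity and nlines
 * the total line count, per the hw_cache definition later in this note.
 */
struct alpha_cache_tab {
        int             cputype;        /* placeholder model identifier */
        struct hw_cache l1i, l1d, l2;
};

static const struct alpha_cache_tab alpha_cache_tab[] = {
        /* 21064 (EV4): 8KB split direct-mapped L1, 2MB direct-mapped L2,
         * 32-byte lines, per the heuristic above. */
        { CPU_21064,
          { CACHE_CODE,    32, 1, 256 },        /* 8KB / 32-byte lines */
          { CACHE_DATA,    32, 1, 256 },
          { CACHE_UNIFIED, 32, 1, 65536 } },    /* 2MB */
        /* 21264 (EV6): 64-byte L1 lines, per the heuristic above;
         * L2 geometry is board-dependent and left for later discovery. */
        { CPU_21264,
          { CACHE_CODE,    64, 2, 1024 },
          { CACHE_DATA,    64, 2, 1024 },
          { CACHE_NONE,    0,  0, 0 } },
};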

ia64
----
Action: Add code to machdep.c in identifycpu() to fill out hw_cacheinfo.
        Review Linux's pal.h and palinfo.c files.

Cache discovery: Call the platform functions PAL_CACHE_SUMMARY and
 PAL_CACHE_INFO to get this information.
TLB discovery: Static tables keyed on specific CPU model.

Cache heuristic:
 L1 typically split 4-way set-associative 16KB,
 L2 256KB unified, L3 3MB-6MB unified.
 Line size isn't defined by the architecture.
TLB heuristic:
 L1 TLB, split, data/instruction 32 entries each, fully associative
 L2 TLB, split, data/instruction 128 entries each, fully associative
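
A rough sketch of the PAL-based discovery (hypothetical only: pal_call()
stands in for whatever PAL calling-convention wrapper the ia64 port ends
up with, and decoding of the returned configuration words is left out;
the field layout is what Linux's pal.h would be reviewed for):

/*
 * Hypothetical sketch: walk the cache hierarchy via PAL.  The procedure
 * indices are architectural (PAL_CACHE_INFO = 2, PAL_CACHE_SUMMARY = 4);
 * everything else here is a placeholder.
 */
#include <sys/types.h>

#define PAL_CACHE_INFO          2
#define PAL_CACHE_SUMMARY       4

struct pal_result {
        long    status;
        u_long  ret0, ret1, ret2;
};

extern struct pal_result pal_call(u_long idx, u_long a1, u_long a2, u_long a3);

static void
ia64_cache_probe(struct hw_cacheinfo *ci)
{
        struct pal_result res;
        u_long level, levels, type;

        res = pal_call(PAL_CACHE_SUMMARY, 0, 0, 0);
        if (res.status != 0)
                return;
        levels = res.ret0;                      /* number of cache levels */

        for (level = 0; level < levels; level++) {
                for (type = 1; type <= 2; type++) {     /* 1 = I, 2 = D/unified */
                        res = pal_call(PAL_CACHE_INFO, level, type, 0);
                        if (res.status != 0)
                                continue;
                        /*
                         * res.ret0/res.ret1 carry the two config_info
                         * words (line size, associativity, size); decode
                         * them and fill the appropriate ci->l1/l2/l3 slot.
                         */
                }
        }
}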

i386 pc98 amd64
---------------
Action: Add code to identcpu.c to fill out hw_cacheinfo.

Cache discovery: Extended CPUID.
 Static tables if 486-class machine. No cache on 386.
TLB discovery: Extended CPUID.
 Static tables if 486-class machine or earlier.

Cache heuristic (Intel): L1: 4-way, 32 bytes/line
Cache heuristic (AMD): L2: 8-way, 64 bytes/line
TLB heuristic (Intel):
 4KB Code: 32 entries, 4-way, LRU
 4MB Code: 2 entries, Fully associative, LRU
 4KB Data: 64 entries, 4-way, LRU
 4MB Data: 8 entries, 4-way, LRU
TLB heuristic (AMD):
 4KB L1 Code: 16 entries, Fully associative, LRU
 4MB/2MB L1 Code: 8 entries, Fully associative, LRU
 4KB L1 Data: 24/32 entries, Fully associative, LRU
 4MB/2MB L1 Data: 8 entries, 4-way, LRU
 4KB L2 Code: 256 entries, 4-way, LRU
 4KB L2 Data: 256 entries, 4-way, LRU

(That's 6 distinct TLBs to deal with on AMD-based i386 architectures).
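
A minimal sketch of the extended-CPUID path for the AMD L1 data cache
(assuming the do_cpuid() helper from <machine/cpufunc.h>; the hw_cache
convention here, nsets == associativity and nlines == total lines, is the
one proposed later in this note):

/*
 * Sketch only: AMD reports the L1 data cache geometry in ECX of
 * extended leaf 0x80000005.
 */
#include <sys/types.h>
#include <machine/cpufunc.h>

static void
amd_l1_dcache_probe(struct hw_cache *c)
{
        u_int regs[4], sizekb, assoc, linesize;

        do_cpuid(0x80000000, regs);
        if (regs[0] < 0x80000005)
                return;                         /* leaf not implemented */

        do_cpuid(0x80000005, regs);
        sizekb   = (regs[2] >> 24) & 0xff;      /* size in KB */
        assoc    = (regs[2] >> 16) & 0xff;      /* 0xff == fully associative */
        linesize =  regs[2]        & 0xff;      /* line size in bytes */
        if (linesize == 0)
                return;

        c->type = CACHE_DATA;
        c->linesize = linesize;
        c->nlines = (sizekb * 1024) / linesize;
        c->nsets = (assoc == 0xff) ? c->nlines : assoc;
}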

powerpc
-------
Action: Adapt from NetBSD as appropriate.

Cache discovery:
 Open Firmware on CHRP if available.
 Static tables keyed on specific CPU model.
TLB discovery:
 Open Firmware on CHRP if available.
 Static tables keyed on specific CPU model.

Cache heuristic:
  L1 line size: 32 bytes across family.
   Pre-G5: 32KB/32KB Split, 8-way
   G5: 64KB/32KB Split, 1-way
  L2 line size: 32/64/128 bytes.
TLB heuristic:
 PPC 601e:
  4KB Instruction TLB, 4 entries, holds the most recently used instruction translations
  UTLB, 256 entries, 2-way set associative, software selectable block size

OFW properties [*]:
 i-cache-size i-cache-sets i-cache-block-size
 d-cache-size d-cache-sets d-cache-block-size
 tlb-size tlb-sets l2-cache

[*] CHRP only
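
A minimal sketch of pulling the CHRP properties above out of a cpu node
(assuming FreeBSD's Open Firmware glue, i.e. OF_getprop() from
<dev/ofw/openfirm.h>; locating the cpu node itself is left out):

/*
 * Sketch only: fill in the D-cache description from the CHRP
 * properties.  OFW reports the number of sets, so convert to the
 * associativity convention used by hw_cache later in this note.
 */
#include <sys/types.h>
#include <dev/ofw/openfirm.h>

static void
chrp_dcache_probe(phandle_t cpu, struct hw_cache *c)
{
        uint32_t size, sets, blocksize;

        if (OF_getprop(cpu, "d-cache-size", &size, sizeof(size)) <= 0 ||
            OF_getprop(cpu, "d-cache-sets", &sets, sizeof(sets)) <= 0 ||
            OF_getprop(cpu, "d-cache-block-size", &blocksize,
            sizeof(blocksize)) <= 0)
                return;                 /* fall back to the static tables */

        c->type = CACHE_DATA;
        c->linesize = blocksize;
        c->nlines = size / blocksize;
        c->nsets = (sets != 0) ? c->nlines / sets : 1;
}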

mips
----
Action: Adapt from NetBSD as appropriate.

Cache discovery: Static tables keyed on specific CPU model.
TLB discovery: MIPS32/MIPS64 Privileged Resource Architecture registers
Cache heuristic: Split/unified L1/L2, unified L3.
TLB heuristic: 16KB page size, 64 entries, fully associative (R10000)
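
A minimal sketch of the MIPS32/MIPS64 PRA decode (mips_rd_config1() is a
placeholder for whatever CP0 accessor the port provides; the Config1
field layout is from the PRA specification):

/*
 * Sketch only: Config1 describes the TLB size and the primary caches.
 */
#include <sys/types.h>

extern uint32_t mips_rd_config1(void);          /* placeholder CP0 accessor */

static void
mips_pra_probe(struct hw_cache *icache, uint32_t *ntlb)
{
        uint32_t cfg1, is, il, ia;

        cfg1 = mips_rd_config1();

        *ntlb = ((cfg1 >> 25) & 0x3f) + 1;      /* MMUSize field: entries - 1 */

        il = (cfg1 >> 19) & 0x7;                /* I-cache line size code */
        if (il == 0) {
                icache->type = CACHE_NONE;      /* no primary I-cache */
                return;
        }
        is = (cfg1 >> 22) & 0x7;                /* sets per way, 64 << IS */
        ia = (cfg1 >> 16) & 0x7;                /* associativity - 1 */

        icache->type = CACHE_CODE;
        icache->linesize = 2 << il;             /* 2^(IL+1) bytes */
        icache->nsets = ia + 1;
        icache->nlines = (64 << is) * (ia + 1);
}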

sparc64
-------
Action:
 Adapt existing code in cache.c to fill out and use hw_cacheinfo.
 Review assembly code, particularly that which abuses the TLB.
 Work closely with jake@ to avoid code churn.

Cache discovery: Open Firmware.
TLB discovery: Open Firmware.
Cache heuristic: Split L1, Unified L2.
TLB heuristic: Split L1 TLB. Fully Associative. NLU. 64 lines each.

OFW properties:
icache-size icache-line-size icache-associativity
dcache-size dcache-line-size dcache-associativity
ecache-size ecache-line-size ecache-associativity
#dtlb-entries #itlb-entries
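
A minimal sketch for sparc64, where the firmware reports size, line size
and associativity directly (again assuming OF_getprop() from
<dev/ofw/openfirm.h>, and the hw_cache convention described below with
nsets holding the associativity and nlines the total line count):

/*
 * Sketch only: the external (E-) cache description is a straight
 * conversion from the properties listed above.
 */
#include <sys/types.h>
#include <dev/ofw/openfirm.h>

static void
sparc64_ecache_probe(phandle_t cpu, struct hw_cache *c)
{
        uint32_t size, linesize, assoc;

        if (OF_getprop(cpu, "ecache-size", &size, sizeof(size)) <= 0 ||
            OF_getprop(cpu, "ecache-line-size", &linesize,
            sizeof(linesize)) <= 0 ||
            OF_getprop(cpu, "ecache-associativity", &assoc,
            sizeof(assoc)) <= 0)
                return;

        c->type = CACHE_UNIFIED;        /* the E-cache is a unified L2 */
        c->linesize = linesize;
        c->nsets = assoc;
        c->nlines = size / linesize;
}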

----------------------------------------------------------------------------

Maintain information about cache and TLB geometry in an MI structure.
The abstraction is intended to reflect current and future machine
architectures.

The contents of these structures are not expected to change over the
lifetime of the kernel. Keeping this information in a structure does not
significantly increase the cost of retrieving it from userland.

Userland consumers such as thread libraries and memory allocators should
take a copy of this structure upon initialization; a sketch follows below.
Kernel consumers are free to cache the information in local variables.
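
A minimal sketch of the userland side (the sysctl name "hw.cacheinfo" is
a placeholder for whatever the kernel ends up exporting, and the
hw_cacheinfo structure is the one defined below):

/*
 * Sketch only: a thread library or allocator takes its one-time copy
 * at initialization.
 */
#include <sys/types.h>
#include <sys/sysctl.h>

static struct hw_cacheinfo cacheinfo;
static int cacheinfo_valid;

static void
cacheinfo_init(void)
{
        size_t len = sizeof(cacheinfo);

        if (sysctlbyname("hw.cacheinfo", &cacheinfo, &len, NULL, 0) == 0 &&
            len == sizeof(cacheinfo))
                cacheinfo_valid = 1;
}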

TLBs are 'caches' for virtual address lookups. Like data and instruction
caches, they may employ set associativity to reduce the risk of
unnecessary cache flushes/misses in multiprogramming environments.

Some architectures segregate their TLBs according to page size. If this
is the case, set the pagesize member accordingly. These segregated
TLBs are counted as separate TLBs; see the example below. If a particular
TLB has a software-programmable page size, set pagesize to PAGESIZE_PROG.
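
For example, the AMD data TLBs from the table above and the
software-selectable PowerPC UTLB would be described as separate entries
(illustrative values only, using the hw_tlb structure defined below):

static const struct hw_tlb example_tlbs[] = {
        /* AMD 4KB L1 Data TLB: 32 entries, fully associative */
        { CACHE_DATA, 4 * 1024, 32, 32 },
        /* AMD 4KB L2 Data TLB: 256 entries, 4-way */
        { CACHE_DATA, 4 * 1024, 4, 256 },
        /* PowerPC UTLB: 256 entries, 2-way, software-selectable block size */
        { CACHE_UNIFIED, PAGESIZE_PROG, 2, 256 },
};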

At an extreme, it's possible to detect cache/TLB properties of a machine
at runtime using algorithms such as those used by Stefan Manegold's
'Calibrator' program (although this isn't suitable for boot-time kernel use):
<URL:http://homepages.cwi.nl/~manegold/Calibrator/calibrator.shtml>
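
A user-level sketch of that style of measurement, to illustrate the idea
(a pointer chase through a randomly permuted buffer; the per-access
latency jumps as the working set spills out of each cache level):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Average latency of one dependent load while chasing a chain through a
 * buffer of 'nptrs' pointers, in nanoseconds. */
static double
chase(size_t nptrs, long iters)
{
        void **buf = malloc(nptrs * sizeof(void *));
        size_t *order = malloc(nptrs * sizeof(size_t));
        struct timespec t0, t1;
        volatile void *sink;
        void **p;
        size_t i, j, t;
        long n;

        /* Sattolo's algorithm: a single-cycle random permutation, so the
         * chase visits every slot in an order prefetchers can't guess. */
        for (i = 0; i < nptrs; i++)
                order[i] = i;
        for (i = nptrs - 1; i > 0; i--) {
                j = (size_t)rand() % i;
                t = order[i]; order[i] = order[j]; order[j] = t;
        }
        for (i = 0; i < nptrs; i++)
                buf[order[i]] = &buf[order[(i + 1) % nptrs]];

        p = &buf[order[0]];
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (n = 0; n < iters; n++)
                p = (void **)*p;                /* dependent load chain */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        sink = p;                               /* keep the chain live */
        (void)sink;

        free(order);
        free(buf);
        return (((t1.tv_sec - t0.tv_sec) * 1e9 +
            (t1.tv_nsec - t0.tv_nsec)) / iters);
}

int
main(void)
{
        size_t kb;

        for (kb = 4; kb <= 8192; kb *= 2)
                printf("%5zu KB: %5.1f ns/access\n", kb,
                    chase(kb * 1024 / sizeof(void *), 5000000L));
        return (0);
}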

----------------------------------------------------------------------------

/*
 * Per-cache information structure.
 *
 * Properties of the cache may be determined using the simple macros
 * below. Where a unified cache exists, consumers should check the
 * icache member first; the dcache slot is then reserved for
 * CACHE_OTHER entries.
 */

struct hw_cache {
        u_int16_t       type;           /* CACHE_* type, defined below */
        u_int16_t       linesize;       /* line size in bytes */
        u_int32_t       nsets;          /* set associativity (1 == direct mapped) */
        u_int32_t       nlines;         /* total number of lines */
} __packed;

#define CACHE_ISDISABLED(_c)            ((_c)->nsets == 0)
#define CACHE_ISDIRECTMAPPED(_c)        ((_c)->nsets == 1)
#define CACHE_ISFULLYASSOC(_c)          ((_c)->nsets == (_c)->nlines)
#define CACHE_SIZE(_c)                  ((_c)->nlines * (_c)->linesize) /* bytes */

/* definitions for type member */
#define CACHE_NONE              0       /* does not exist */
#define CACHE_UNIFIED           1       /* is unified */
#define CACHE_CODE              2       /* is for instructions only */
#define CACHE_DATA              3       /* is for data only */
#define CACHE_OTHER             4       /* is something different */

/*
 * Per-TLB information structure.
 *
 * There can be up to MAX_TLB distinct TLBs present in current architectures.
 * By the rules above, AMD Athlon processors have six distinct TLBs.
 *
 * The properties of a TLB are represented in the same way as those of a
 * cache, and may be queried using the CACHE_* macros above.
 */

struct hw_tlb {
        u_int32_t type;         /* CACHE_* type, as above */
        u_int32_t pagesize;     /* page size in bytes, or PAGESIZE_PROG */
        u_int32_t nsets;        /* set associativity */
        u_int32_t nlines;       /* total number of entries */
} __packed;

#define PAGESIZE_PROG   0UL
#define MAX_TLB         8

/*
 * Maintain information about all caches and TLBs in the system. Assume that
 * this information is consistent across all CPUs in an SMP system.
 *
 * Export this structure to userland via a sysctl.
 *
 * XXX TODO: Make it easy for assembly language routines to use this
 * structure.
 */

struct hw_cacheinfo {
        struct {
                struct hw_cache icache;
                struct hw_cache dcache;
        } l1, l2, l3;
        struct hw_tlb tlb[MAX_TLB];
} __packed;

----------------------------------------------------------------------------
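
As an example of the kernel-side usage pattern, and of the mutex pool
case mentioned at the top of this mail (hw_cacheinfo_get() is a
placeholder for however the structure ends up being exposed to kernel
code):

/*
 * Sketch only: a mutex pool caches the L1 data line size once and uses
 * it as its allocation alignment.
 */
#include <sys/types.h>

extern const struct hw_cacheinfo *hw_cacheinfo_get(void);

static u_int mtx_pool_align;

static void
mtx_pool_setup(void)
{
        const struct hw_cache *dc = &hw_cacheinfo_get()->l1.dcache;

        /* Fall back to a conservative alignment if the cache is unknown. */
        if (dc->type == CACHE_NONE || CACHE_ISDISABLED(dc))
                mtx_pool_align = 64;
        else
                mtx_pool_align = dc->linesize;
}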
