> On Jan 4, 2018, at 4:03 AM, David Chisnall <thera...@freebsd.org> wrote:
>
> On 3 Jan 2018, at 22:12, Nathan Whitehorn <nwhiteh...@freebsd.org> wrote:
>>
>> On 01/03/18 13:37, Ed Schouten wrote:
>>> 2018-01-01 11:36 GMT+01:00 Konstantin Belousov <kostik...@gmail.com>:
>>>>>>> On x86, the CPUID instruction leaf 0x1 returns the information in
>>>>>>> %ebx register.
>>>>>> Hm, weird. Why don't we extend sysctl to include this info?
>>>> For the same reason we do not provide a sysctl to add two integers.
>>> I strongly agree with Kostik on this one. Why add stuff to the kernel,
>>> if userspace is already capable of extracting this? Adding that stuff
>>> to sysctl has the downside that it will effectively introduce yet
>>> another FreeBSDism, whereas something generic already exists.
>>>
>>
>> Well, kind of. The userspace version is platform-dependent and not always
>> available: for example, on PPC, you can't do this from userland and we
>> provide a sysctl machdep.cacheline_size to userland. It would be nice to
>> have an MI API.
>
> On ARMv8, similarly, sometimes the kernel needs to advertise the wrong size.
> A few big.LITTLE cores have 64-byte cache lines on one cluster and 32-byte on
> the other. If you query the size from userspace while running on a 64-byte
> cluster, then issue the zero-cache-line instruction while migrated to the
> 32-byte cluster, you only clear half the size. Linux works around this by
> trapping and emulating the instruction to query the cache size and always
> reporting the size for the smallest cache lines. ARM tells people not to
> build systems like this, but it doesn’t always stop them. Trapping and
> emulating is much slower than just providing the information in a shared
> page, elf aux args vector, or even (often) a system call.
>
> To give another example, Linux provides a very cheap way for a userspace
> process to enquire which core it’s running on. Some more recent
> high-performance mallocs use this to have a second-layer per-core cache after
> the per-thread cache for free blocks. Unlike the per-thread cache, the
> per-core cache does need a lock, but it’s very unlikely to be contended (it
> will only be contended if either a thread is migrated in between checking and
> locking, so acquires the wrong CPU’s lock, or if a thread is preempted in the
> middle of the very brief fill operation). The author of the
> SuperMalloc paper tried doing this with CPUID and found that it was slower by
> a sufficient margin to almost entirely offset the benefits of the extra layer
> of caching.
>
> Just because userspace can get at the information directly from the hardware
> doesn’t mean that this is the most efficient or best way for userspace to get
> at it.
>
> Oh, and some of these things are useful in portable code, so having to write
> some assembly for every target to get information that the kernel already
> knows is wasteful.
>
> David
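
For reference, the arithmetic behind two of the points quoted above can be sketched in C. This is only an illustration under stated assumptions: the function names are made up here, the caller is assumed to have already obtained the %ebx value from CPUID leaf 0x1 (where bits 15:8 give the CLFLUSH line size in 8-byte units), and the per-cluster line sizes are assumed to be known from somewhere else. The second helper mirrors the "report the smallest line size" workaround described for big.LITTLE; it is not Linux's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode the CLFLUSH line size, in bytes, from the EBX value returned by
 * CPUID leaf 0x1. Bits 15:8 hold the size in units of 8 bytes.
 * (Hypothetical helper name; no CPUID instruction is executed here.)
 */
unsigned
clflush_line_size(uint32_t ebx)
{
	return ((ebx >> 8) & 0xffu) * 8u;
}

/*
 * Given the cache line size of each cluster, return the smallest one.
 * Advertising the minimum means a per-line operation never covers fewer
 * bytes than the caller was told, whichever cluster it ends up running on.
 * (Hypothetical helper name; the sizes array is supplied by the caller.)
 */
unsigned
min_line_size(const unsigned *sizes, size_t n)
{
	assert(sizes != NULL && n > 0);

	unsigned min = sizes[0];
	for (size_t i = 1; i < n; i++) {
		if (sizes[i] < min)
			min = sizes[i];
	}
	return min;
}
```

With an %ebx value of 0x00100800 the first helper yields 64, and for clusters with 64-byte and 32-byte lines the second yields 32.
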
This idea of Arm big.LITTLE systems having cache lines of different lengths really, really bothers me - how on earth is the cache coherency supposed to work in such a system? I doubt the usual cache coherency protocols would work - probably need a really MESSY protocol to deal with this config :-)

Jon.