Correction ---

"A direct-mapping function to index a cache, to me, means the use of specific
well-known index bits within the physical address to determine the set into
which it will be placed when cached."

becomes

"A direct-mapping function to index a cache, to me, means the use of specific
well-known index bits within the address (virtual or physical) to determine
the set into which it will be placed when cached."

Apologies.
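For concreteness, "index bits" here means a fixed field of the address just
above the cache-line offset bits. A minimal sketch of that decomposition (the
geometry below is illustrative, not any particular CPU's):

#include <stdint.h>
#include <stdio.h>

/* Illustrative geometry only -- not a specific CPU. */
#define LINE_SHIFT  6u      /* 64-byte lines */
#define SET_BITS    7u      /* 128 sets      */
#define PAGE_SHIFT  12u     /* 4 KB pages    */

int main(void)
{
    uint64_t addr = 0x7f1234567890ull;  /* any address, virtual or physical */

    uint64_t offset = addr & ((1u << LINE_SHIFT) - 1);
    uint64_t set    = (addr >> LINE_SHIFT) & ((1u << SET_BITS) - 1);

    /* With 4 KB pages, only the set-index bits at or above PAGE_SHIFT can be
     * steered by the kernel's choice of physical page (page coloring); the
     * bits below PAGE_SHIFT are fixed by the offset within the page. */
    printf("line offset: %llu, set index: %llu\n",
           (unsigned long long)offset, (unsigned long long)set);
    return 0;
}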
On Mon, Jun 1, 2015 at 9:30 AM, Alex Merritt <merritt.a...@gmail.com> wrote:
> Matt,
>
> Thank you for the insight into the L1 operation. Since the L1 uses a
> direct-mapped indexing function, we can apply known techniques, such as the
> use of offsets, to ensure specific placement within the cache, as you
> mention.
>
> My question is not about whether virtual or physical addresses are used for
> indexing, but rather about the function itself that the hardware uses to
> perform the indexing. Below is sample code which parses CPUID to extract
> this information. On an Intel Haswell it reports the L3 as using "complex
> indexing" whereas the L1 and L2 are direct-mapped. On an Intel Westmere,
> all caches use direct mapping. I noticed that processors since Sandy Bridge
> have complex indexing in the LLC.
>
> A direct-mapping function to index a cache, to me, means the use of
> specific well-known index bits within the physical address to determine the
> set into which it will be placed when cached. Complex indexing suggests
> this is no longer true. If so, how can we be sure the coloring strategy
> used by the kernel to sort pages, which relies on specific index bits, will
> continue to have the same effect on modern processors?
>
> -Alex
>
> /* Intel programmer's manual, instruction set reference:
>  * CPUID leaf 04H (deterministic cache parameters), table 3-17.
>  */
> #include <stdio.h>
>
> static const char *CACHE_STR[4] = {
>     "NULL",
>     "Data cache",
>     "Instruction cache",
>     "Unified cache",
> };
>
> int main(void)
> {
>     unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
>     unsigned int func = 4, val;
>
>     while (1) {
>         func = 4;
>         unsigned int _ecx = ecx++;  /* sub-leaf: cache index 0, 1, 2, ... */
>         __asm__("cpuid \n\t"
>                 : "=a"(eax), "=b"(ebx), "=c"(_ecx), "=d"(edx)
>                 : "a"(func), "b"(ebx), "c"(_ecx), "d"(edx)
>                 :);
>
>         /* EAX[4:0]: cache type; 0 means no more caches */
>         if (!(val = (eax & 0x1f)))
>             break;
>         if (val > 3)    /* reserved encoding; stop rather than overrun */
>             break;
>         printf("\ntype: %s\n", CACHE_STR[val]);
>
>         /* EAX[7:5]: cache level */
>         val = ((eax >> 5) & 0x7);
>         printf("level: %d\n", val);
>
>         /* EBX[31:22]: ways of associativity, minus one */
>         val = ((ebx >> 22) & 0x3ff);
>         printf("ways: %d\n", val + 1);
>
>         /* ECX: number of sets, minus one */
>         val = _ecx;
>         printf("number of sets: %d\n", val + 1);
>
>         /* EDX[2]: complex cache indexing flag */
>         val = ((edx >> 2) & 0x1);
>         printf("complex index: %d (%s)\n",
>                val, (val ? "complex indexing" : "direct-mapped"));
>
>         printf("\n");
>     }
>     return 0;
> }
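To see why "complex indexing" undermines the well-known-index-bits
assumption, compare a direct index with a hashed one. Intel does not document
the actual hash used on Sandy Bridge and later parts, so the hash below is a
purely hypothetical stand-in meant only to show the general shape:

#include <stdint.h>
#include <stdio.h>

#define LINE_SHIFT 6u    /* 64-byte lines (illustrative)           */
#define SET_BITS   11u   /* 2048 sets per LLC slice (illustrative) */
#define SET_MASK   ((1u << SET_BITS) - 1u)

/* Direct indexing: the set is a fixed, known field of the address. */
static unsigned int direct_index(uint64_t paddr)
{
    return (paddr >> LINE_SHIFT) & SET_MASK;
}

/* Hypothetical "complex" indexing: upper address bits are XOR-folded into
 * the set index. NOT Intel's real hash -- just the general shape. */
static unsigned int hashed_index(uint64_t paddr)
{
    uint64_t line = paddr >> LINE_SHIFT;
    return (line ^ (line >> SET_BITS) ^ (line >> (2 * SET_BITS))) & SET_MASK;
}

int main(void)
{
    /* Two addresses with identical "index bits": same direct index, but the
     * hash can send them to different sets (and vice versa). */
    uint64_t a = 0x00200000, b = a + (1ull << (LINE_SHIFT + SET_BITS)) * 4;
    printf("direct: %u vs %u\n", direct_index(a), direct_index(b));
    printf("hashed: %u vs %u\n", hashed_index(a), hashed_index(b));
    return 0;
}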
> On Sat, May 30, 2015 at 7:21 PM, Matthew Dillon <dil...@backplane.com> wrote:
>>
>> I think what you are describing is Intel's virtually indexed physical
>> cache. It is designed to allow the L1 cache access to occur concurrently
>> with the PTE (page table entry) lookup, which is much more efficient than
>> having to wait for the page table lookup first and then start the memory
>> access on the L1 cache.
>>
>> The downside is that, being virtually indexed, many programs tend to load
>> at the same virtual memory address, and memory map operations also tend to
>> map at the same virtual memory address. When these represent private data
>> rather than shared data, the cpu caches can wind up not being fully
>> utilized. They are still N-way set associative, so all is not lost, but
>> they aren't optimally used.
>>
>> The general solution is to implement an offset in the userland memory
>> allocator (not so much in the kernel), which is what we do for larger
>> memory allocations.
>>
>> -Matt
>>
>> On Fri, May 29, 2015 at 8:52 AM, Alex Merritt <merritt.a...@gmail.com> wrote:
>>>
>>> I learned this recently, having gained access to newer Intel processors:
>>> these CPUs (Sandy Bridge, Haswell) use a form of indexing into the LLC
>>> which is no longer direct (i.e. taking specific bits from a physical
>>> address to determine which set of the LLC a cache line goes into), but
>>> rather what they call "complex indexing" [1]. Presumably this is some
>>> proprietary hashing.
>>>
>>> I wanted to ask -- does page coloring, which uses direct indexing logic
>>> in the kernel, still have an advantage if such hashing is used, and if we
>>> are unaware of the specific algorithm used to index the LLC? If we are
>>> unable to determine which pages will conflict in the cache without
>>> careful study, and assuming the algorithm may change between
>>> microarchitectures, it seems there may be less benefit to applying the
>>> technique.
>>>
>>> [1] Intel Manual Vol. 2A, Table 3-17, CPUID leaf 04H
>>>
>>> -Alex
>>>
>>> On Tue, Apr 14, 2015 at 10:47 AM, Matthew Dillon <dil...@backplane.com> wrote:
>>>>
>>>> If I recall, FreeBSD mostly removed page coloring from their VM page
>>>> allocation subsystem. DragonFly kept it and integrated it into the
>>>> fine-grained-locked VM page allocator. There's no advantage to
>>>> manipulating the parameters, for two reasons.
>>>>
>>>> First, all page coloring really does is try to avoid degenerate
>>>> situations in the cpu caches. The cpu caches are already 4-way or 8-way
>>>> set-associative. Page coloring improves on this, but frankly even the
>>>> set associativity in the base cpu caches gets us most of the way there.
>>>> So adjusting the page coloring algorithms will not yield any
>>>> improvements.
>>>>
>>>> Secondly, the L1 cache is a physical memory cache, but it is also
>>>> virtually indexed. This is a cpu hardware optimization that allows the
>>>> cache lookup to be initiated concurrently with the TLB lookup. Because
>>>> of this, physical set associativity does not actually solve all the
>>>> problems which can occur with a virtually indexed cache.
>>>>
>>>> So the userland memory allocator implements an offsetting feature for
>>>> allocations which attempts to address the virtually indexed cache
>>>> issues. This feature is just as important for performance as the
>>>> physical page coloring feature.
>>>>
>>>> -Matt
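For reference, the arithmetic behind the physical page coloring discussed
above is the standard textbook calculation below; this is only a sketch with
made-up cache geometry, not DragonFly's actual VM page allocator code:

#include <stdio.h>

/* Illustrative LLC geometry -- not any specific CPU. */
#define CACHE_SIZE  (8u * 1024 * 1024)  /* 8 MB                    */
#define ASSOC       16u                 /* 16-way set associative  */
#define PAGE_SIZE   4096u               /* 4 KB pages              */

int main(void)
{
    /* Lines that map to the same set are CACHE_SIZE / ASSOC bytes apart,
     * so consecutive physical pages cycle through this many "colors". */
    unsigned int ncolors = CACHE_SIZE / ASSOC / PAGE_SIZE;
    unsigned long pfn;

    printf("colors: %u\n", ncolors);

    /* A page's color is its physical frame number modulo ncolors; pages
     * with equal colors compete for the same cache sets. */
    for (pfn = 0; pfn < 4; pfn++)
        printf("pfn %lu -> color %lu, pfn %lu -> color %lu\n",
               pfn, pfn % ncolors, pfn + ncolors, (pfn + ncolors) % ncolors);
    return 0;
}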
>>>> On Tue, Apr 14, 2015 at 10:10 AM, Alex Merritt <merritt.a...@gmail.com> wrote:
>>>>>
>>>>> Hello!
>>>>>
>>>>> I am interested in learning whether DragonFly supports large pages (2M
>>>>> and 1G), and secondly, what mechanisms exist for applications to
>>>>> influence the colors used to assign the physical pages backing their
>>>>> memory, specifically for private anonymous mmap'd regions. Regarding
>>>>> coloring, I'd like to be able to evaluate applications with a small
>>>>> number of colors (restricting their access to the last-level cache)
>>>>> and compare their performance against having more or all colors
>>>>> available. Initially I am looking to hack this in for some preliminary
>>>>> experiments, perhaps by way of a kernel module or something.
>>>>>
>>>>> A cursory search of the code showed no hints of support for large
>>>>> pages, but I did find that there are more internal functions governing
>>>>> the allocation of pages based on colors than in FreeBSD (10.1). In
>>>>> FreeBSD it seems colors are only considered for regions that are
>>>>> backed by a file, but I am not 100% certain.
>>>>>
>>>>> I appreciate any help!
>>>>>
>>>>> Thanks,
>>>>> Alex
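Likewise, the userland allocation offsetting Matt mentions can be pictured as
the allocator staggering the start of large allocations so that allocations
with similar virtual alignment do not all index the same sets of a virtually
indexed cache. The sketch below illustrates that idea only, with made-up
constants; it is not DragonFly's actual allocator code:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative constants only: stagger large allocations across an assumed
 * 32 KB of virtually indexed L1 sets, in cache-line steps. */
#define CACHE_SPAN  (32u * 1024)
#define LINE_SIZE   64u

/* Hand out a different cache-line-aligned offset for each large allocation
 * so that successive allocations (which would otherwise start at similar
 * virtual alignments) index different L1 sets. */
static void *offset_alloc(size_t size)
{
    static unsigned int next_color;
    size_t offset = (size_t)(next_color++ * LINE_SIZE) % CACHE_SPAN;
    unsigned char *base = malloc(size + CACHE_SPAN);
    return base ? base + offset : NULL;
}

int main(void)
{
    /* Successive "large" allocations begin at staggered offsets. */
    for (int i = 0; i < 4; i++) {
        void *p = offset_alloc(1 << 20);
        printf("allocation %d at %p\n", i, p);
        /* freeing would require remembering the base pointer; omitted here */
    }
    return 0;
}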