* David Goulet ([email protected]) wrote:
>
>
> On 10-08-17 04:51 PM, Mathieu Desnoyers wrote:
>> * David Goulet ([email protected]) wrote:
>>>
>>>
>>> On 10-08-17 04:24 PM, Mathieu Desnoyers wrote:
>>>> * David Goulet ([email protected]) wrote:
>>>>> On 10-08-17 03:45 PM, Mathieu Desnoyers wrote:
>>>> [...]
>>>>>> Yes. The performance degradation caused by cache-line bouncing is _way_
>>>>>> worse than extra cache pressure.
>>>>>>
>>>>>
>>>>> There is something I don't understand here. Correct me if (most likely)
>>>>> I am wrong.
>>>>>
>>>>> How is cache line bouncing affected by the cache line size? If I
>>>>> understand correctly, cache line bouncing is the problem where CPUs share
>>>>> data and have to fetch it from CPU0 to CPU7 (between caches). And, I
>>>>> surely agree, this is costly!
>>>>
>>>> That's ok up to here.
>>>>
>>>>>
>>>>> However, if the assumed cache line size is bigger than the real one, you
>>>>> just lose space... For archs with a 64-byte cache line size, you lose two
>>>>> lines per aligned structure... How will lowering it down to 64 bytes
>>>>> cause cache line bouncing?
>>>>
>>>> Let's take the following example:
>>>>
>>>> A multiprocessor machine with a 256-byte cache line size.
>>>> The program is built thinking the cache line size is only 128 bytes.
>>>>
>>>> So we allocate an array of what we hope are per-cpu variables:
>>>>
>>>> malloc(nr_cpus * sizeof(struct type));
>>>>
>>>> Where struct type is __attribute__((aligned(128)))
>>>>
>>>> So we end up having two structures sharing a cache-line, and these will
>>>> bounce between CPUs, even though the structures are not shared: only the
>>>> cache-lines are shared, because the structures happen to be on the same
>>>> cache line.
>>>>
>>>> So for allocation of individual objects which are meant to be per-cpu,
>>>> e.g. a structure controlling the per-cpu buffer, the allocator can put
>>>> one structure next to another (belonging to another cpu), thus causing
>>>> cache line bouncing.
>>>>
>>>> This phenomenon is called "false sharing".
>>>>
>>>
>>> Very nice. That clarifies it, yes!
>>>
>>> However, please refer to Intel® 64 and IA-32 Architectures Software
>>> Developer's Manual Volume 3A: System Programming Guide.
>>>
>>> http://www.intel.com/Assets/PDF/manual/253668.pdf
>>>
>>> P. 527, Table 11-1
>>>
>>> • Pentium 4 and Intel Xeon processors (Based on Intel NetBurst
>>> microarchitecture): 8-KByte, 4-way set associative, 64-byte cache line
>>> size.
>>> • Pentium 4 and Intel Xeon processors (Based on Intel NetBurst
>>> microarchitecture): 16-KByte, 8-way set associative, 64-byte cache line
>>> size.
>>
>> Dunno why the Linux kernel chooses that for P4. But we definitely have to
>> handle NUMA systems.
>>
>
> arch_numa.h ... possible?

See my comment to Alexandre about multiplying the number of targets
needlessly. Which one will be chosen by distros?

We'll do it if you can find a real-world benchmark that is affected by
this. Good luck ;)

Mathieu

>
>> Mathieu
>>
>>>
>>> David
>>>
>>>> Mathieu
>>>>
>>>>>
>>>>> Thanks for your help on that!
>>>>> David
>>>>>
>>>>
>>>
>>> --
>>> David Goulet
>>> LTTng project, DORSAL Lab.
>>>
>>> PGP/GPG : 1024D/16BD8563
>>> BE3C 672B 9331 9796 291A 14C6 4AF7 C14B 16BD 8563
>>>
>>
>

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

_______________________________________________
ltt-dev mailing list
[email protected]
http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
