> > Also in the current P9 itself, two neighbouring core-pairs form a quad.
> > Cache latency within a quad is better than a latency to a distant core-pair.
> > Cache latency within a core pair is way better than latency within a quad.
> > So if we have only 4 threads running on a DIE all of them accessing the same
> > cache-lines, then we could probably benefit if all the tasks were to run
> > within the quad aka MC/Coregroup.
> >
>
> Did you test this? WRT load balance we do try to balance "load" over the
> different domain spans, so if you represent quads as their own MC domain,
> you would AFAICT end up spreading tasks over the quads (rather than packing
> them) when balancing at e.g. DIE level. The desired behaviour might be
> hackable with some more ASYM_PACKING, but I'm not sure I should be
> suggesting that :-)
>

Agreed, load balancing will try to spread the load across the quads. In my
hack, I explicitly marked the QUAD domains as !SD_PREFER_SIBLING and relaxed
a few load-spreading rules when SD_PREFER_SIBLING was not set. This was on a
slightly older kernel (without Vincent's recent load-balance overhaul). The
quad level itself was roughly the shape sketched below.
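To make that a bit more concrete, here is an illustrative sketch of the
topology side of the hack, reconstructed from memory rather than lifted from
the patch: cpu_quad_map, cpu_quad_mask() and quad_hack_topology are made-up
names, and the !SD_PREFER_SIBLING marking plus the relaxed spreading rules
were separate changes to the generic load-balance code that I am not
reproducing here.

/*
 * Illustrative sketch only, not what this series posts.  Assume the
 * hypothetical per-cpu mask cpu_quad_map is filled at boot with the CPUs
 * of the two neighbouring core-pairs that form a P9 quad.
 */
static DEFINE_PER_CPU(cpumask_var_t, cpu_quad_map);

static const struct cpumask *cpu_quad_mask(int cpu)
{
	return per_cpu(cpu_quad_map, cpu);
}

static struct sched_domain_topology_level quad_hack_topology[] = {
#ifdef CONFIG_SCHED_SMT
	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
#endif
	{ cpu_quad_mask, SD_INIT_NAME(QUAD) },	/* hacked-in quad level */
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};

The table was then handed over via set_sched_topology() in place of the
default one.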
> > I have found some benchmarks which are latency sensitive to benefit by
> > having a grouping at quad level (using kernel hacks and not backed by
> > firmware changes). Gautham also found similar results in his experiments
> > but he only used binding within the stock kernel.
> >
>
> IIUC you reflect this "fabric quirk" (i.e. coregroups) using this DT
> binding thing.
>
> That's also where things get interesting (for me) because I experienced
> something similar on another arm64 platform (ThunderX1). This was more
> about cache bandwidth than cache latency, but IMO it's in the same bag of
> fabric quirks. I blabbered a bit about this at last LPC [1], but kind of
> gave up on it given the TX1 was the only (arm64) platform where I could get
> both significant and reproducible results.
>
> Now, if you folks are seeing this on completely different hardware and have
> "real" workloads that truly benefit from this kind of domain partitioning,
> this might be another incentive to try and sort of generalize this. That's
> outside the scope of your series, but your findings give me some hope!
>
> I think what I had in mind back then was that if enough folks cared about
> it, we might get some bits added to the ACPI spec; something along the
> lines of proximity domains for the caches described in PPTT, IOW a cache
> distance matrix. I don't really know what it'll take to get there, but I
> figured I'd dump this in case someone's listening :-)
>

Very interesting.

> > I am not setting SD_SHARE_PKG_RESOURCES in MC/Coregroup sd_flags as in MC
> > domain need not be LLC domain for Power.
>
> From what I understood your MC domain does seem to map to LLC; but in any
> case, shouldn't you set that flag at least for BIGCORE (i.e. L2)? AIUI with
> your changes your sd_llc is gonna be SMT, and that's not going to be a very
> big mask. IMO you do want to correctly reflect your LLC situation via this
> flag to make cpus_share_cache() work properly.

I detect whether the LLC is shared at BIGCORE; if it is, I dynamically rename
that domain to CACHE and enable SD_SHARE_PKG_RESOURCES in it. (A rough sketch
of why that flag placement matters for cpus_share_cache() is at the end of
this mail.)

> [1]: https://linuxplumbersconf.org/event/4/contributions/484/

Thanks for the pointer.

-- 
Thanks and Regards
Srikar Dronamraju
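PS: for completeness, the chain that makes the flag placement matter looks
roughly like this. This is condensed from memory (names as I remember them
from arch/powerpc/kernel/smp.c and kernel/sched/), not a verbatim copy of the
upstream code; the generic parts omit sd_llc_size and the sched_domain_shared
handling.

/* arch/powerpc/kernel/smp.c: flags for the (renamed) CACHE domain */
static int powerpc_shared_cache_flags(void)
{
	return SD_SHARE_PKG_RESOURCES;
}

/* kernel/sched/topology.c: sd_llc is the highest domain with that flag */
static void update_top_cache_domain(int cpu)
{
	struct sched_domain *sd;
	int id = cpu;

	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
	if (sd)
		id = cpumask_first(sched_domain_span(sd));

	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
	per_cpu(sd_llc_id, cpu) = id;
}

/* kernel/sched/core.c: what the wakeup paths actually consult */
bool cpus_share_cache(int this_cpu, int that_cpu)
{
	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
}

So with SD_SHARE_PKG_RESOURCES only at SMT, sd_llc_id would differ between
threads of different cores even when they share the L2/L3, which is exactly
what the dynamic rename to CACHE avoids.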