Hi Chen,
> > > It seems that cpumask_first(llc_mask(i)) is accessing
> > > NULL cpu_coregroup_mask():
> >
> > > has_coregroup_support() is false, thus cpu_coregroup_map
> > > is never allocated in smp_prepare_cpus().
> > > This machine is a "shared system" VM. We should probably
> > > let the LLC id generation fall back to using L2 id if
> > > cpu_coregroup_mask is unavailable (which restores the
> > > behavior before this patch). I'm wondering if the following
> > > change would help(need IBM friends' help on this):
> >
> > Power9 and below systems, dont have coregroup.
> > Its not because of shared LPAR. But its true for dedicated LPARs too.
> > Only Power10 and above systems have hemisphere where we add MC/coregroup
> > support.
> >
>
> OK, thanks for the correction. Are you saying coregroup_enabled is false
> on Power9 and older hardware, and set to true on Power10? Power10 has a
> corresponding device-tree property, which is parsed to enable hemisphere
> support in find_possible_nodes(). This is why has_coregroup_support()
> returns true for Power10.
>
Yes, Chen.
coregroup_enabled is true only on Power10 and later.
And yes, we derive the coregroup from the device-tree properties.

> > > +struct cpumask *cpu_coregroup_mask(int cpu)
> > > +{
> > > +	if (!has_coregroup_support())
> > > +		return cpu_l2_cache_mask(cpu);
> > > +
> > > +	return per_cpu(cpu_coregroup_map, cpu);
> > > +}
> > > +
> >
> > While this is a work-around for the problem in Power9
> > It will hurt Power10 and Power11 systems.
> > As has been alluded by Prateek, MC is not LLC on Power.
>
> Could you please elaborate on the cache topology?
> Specifically, could you clarify what the LLC is for Power9
> and Power10 respectively? Is it always the L2 cache?
>
> I have checked the IBM documentation available at:
> https://hc32.hotchips.org/assets/program/conference/day1/HotChips2020_Server_Processors_IBM_Starke_POWER10_v33.pdf
> According to the document, a hemisphere corresponds to a 64MB
> L3 cache shared by 8 cores. Since the MC domain spans a single
> hemisphere, I wonder why the SD_SHARE_LLC flag is not enabled
> for the MC domain?

If we look at the presentation you pointed to above, L2 is 2MB per SMT8 core.
L3 is 8MB local per SMT8 core, and these together form a 64MB L3 buffer per
hemisphere. L3 is a victim cache, and all the L3 slices together form an
L3.1 buffer.

In practice, we split the cache per small core, aka SMT4 core. So we have
1MB L2 per SMT4 core and 4MB L3 per SMT4 core. Since L3 is a victim cache
whose slices combine into the L3.1 buffer, for now we still consider L2 to
be the LLC.

On Power9, L2 is at the CACHE domain.
On all other Power systems (P7, P8, P10, P11), L2 is at the SMT domain.
On Power, we have taken L2 as the LLC.
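
To summarize the two statements above in one place, here is a small
userspace sketch. It is not kernel code, and every name in it is
hypothetical; it only encodes what is said in this thread: coregroup (MC)
exists on Power10 and later, and L2 sits at the CACHE domain on Power9 and
at the SMT domain elsewhere.

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only; not kernel identifiers. */
enum power_gen { POWER7, POWER8, POWER9, POWER10, POWER11 };
enum l2_domain { DOMAIN_SMT, DOMAIN_CACHE };

/* Per this thread: only Power10 and later expose a coregroup (MC) level. */
static bool gen_has_coregroup(enum power_gen gen)
{
	return gen >= POWER10;
}

/* Per this thread: L2 (taken as LLC) is at CACHE on Power9, SMT elsewhere. */
static enum l2_domain gen_l2_domain(enum power_gen gen)
{
	return gen == POWER9 ? DOMAIN_CACHE : DOMAIN_SMT;
}
```

The point is that the two properties are independent: the domain that holds
L2 does not follow from whether a coregroup exists.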

lscpu (on Power 10)
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              480
On-line CPU(s) list: 0-479
Thread(s) per core:  8
Core(s) per socket:  15
Socket(s):           4
NUMA node(s):        4
Model:               2.0 (pvr 0080 0200)
Model name:          POWER10, altivec supported
CPU max MHz:         3249.0000
CPU min MHz:         3249.0000
L1d cache:           32K
L1i cache:           48K
L2 cache:            1024K
L3 cache:            4096K
NUMA node0 CPU(s):   0-119
NUMA node1 CPU(s):   120-239
NUMA node2 CPU(s):   240-359
NUMA node3 CPU(s):   360-479

L2 Cache reported here is for SMT4 Core.

lscpu (on Power 9)
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  8
Core(s) per socket:  8
Socket(s):           2
NUMA node(s):        2
Model:               2.2 (pvr 004e 0202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127
Physical sockets:    2
Physical chips:      1
Physical cores/chip: 8

L2 Cache reported here is for SMT8 Core aka CACHE domain.

>
> > So by using llc_mask as cpu_coregroup_mask() we run the trouble of assuming
> > MC to be similar to LLC. So it will impact Power 10/11 Systems.
> >
> > In commit b5ea300a17e3 sched/cache: Make LLC id continuous, we define
> > #define llc_mask(cpu) cpu_coregroup_mask(cpu)
> >
> > defining it llc_mask to cpu_coregroup_mask means MC should be LLC.
> > This is not true for some architectures atleast on Power.
> >
>
> OK.
>
> > So shouldn't it be using
> > #define llc_mask(cpu) per_cpu(sd_llc, cpu)
> >
> > This should work for systems where LLC is sub-coregroup, coregroup (or super
> > coregroup: Lets say some archs want LLC at PKG and cluster at coregroup).
> >
> > if we do that, I dont think we even need the else case where we say
> > #define llc_mask(cpu) cpumask_of(cpu)
> >
>
> I suppose you are referring to
> sched_domain_span(per_cpu(sd_llc, cpu)).
>
> Indeed, deriving the LLC from the SD_SHARE_LLC level offers
> better scalability. However, this approach would involve scheduler
> domains, which can be truncated by cpuset partitions - a scenario we
> prefer to avoid.
>
Shouldn't cache-aware scheduling be worried about cpuset partitions too?
If a cpuset has only a subset of an LLC's cores, should the scheduler
assume it can control the complete LLC?
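
To make the question concrete, here is a toy userspace model, not kernel
code. The toy_* names are hypothetical, and a 64-bit word stands in for
struct cpumask. It only shows that the part of an LLC visible inside a
partition can be smaller than the hardware LLC.

```c
#include <stdint.h>

/* One bit per CPU; a toy stand-in for struct cpumask (up to 64 CPUs). */
typedef uint64_t toy_cpumask_t;

/* CPUs of the hardware LLC that fall inside the cpuset partition. */
static toy_cpumask_t toy_llc_span_in_partition(toy_cpumask_t hw_llc,
					       toy_cpumask_t partition)
{
	return hw_llc & partition;
}

/* Non-zero when the partition holds only part of the hardware LLC. */
static int toy_llc_is_truncated(toy_cpumask_t hw_llc, toy_cpumask_t partition)
{
	return toy_llc_span_in_partition(hw_llc, partition) != hw_llc;
}

/* Number of CPUs set in the mask. */
static int toy_weight(toy_cpumask_t mask)
{
	int n = 0;

	while (mask) {
		mask &= mask - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}
```

For example, with an LLC covering CPUs 0-7 (0xff) and a partition holding
CPUs 0-3 (0x0f), the partition sees 4 of the 8 CPUs that share the cache,
while the other 4 are scheduled from outside the partition.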
--
Thanks and Regards
Srikar Dronamraju