> -----Original Message-----
> From: Dietmar Eggemann [mailto:dietmar.eggem...@arm.com]
> Sent: Wednesday, January 13, 2021 12:00 AM
> To: Morten Rasmussen <morten.rasmus...@arm.com>; Tim Chen
> <tim.c.c...@linux.intel.com>
> Cc: Song Bao Hua (Barry Song) <song.bao....@hisilicon.com>;
> valentin.schnei...@arm.com; catalin.mari...@arm.com; w...@kernel.org;
> r...@rjwysocki.net; vincent.guit...@linaro.org; l...@kernel.org;
> gre...@linuxfoundation.org; Jonathan Cameron <jonathan.came...@huawei.com>;
> mi...@redhat.com; pet...@infradead.org; juri.le...@redhat.com;
> rost...@goodmis.org; bseg...@google.com; mgor...@suse.de;
> mark.rutl...@arm.com; sudeep.ho...@arm.com; aubrey...@linux.intel.com;
> linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org;
> linux-a...@vger.kernel.org; linux...@openeuler.org; xuwei (O)
> <xuw...@huawei.com>; Zengtao (B) <prime.z...@hisilicon.com>; tiantao (H)
> <tiant...@hisilicon.com>
> Subject: Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and
> add cluster scheduler
>
> On 11/01/2021 10:28, Morten Rasmussen wrote:
> > On Fri, Jan 08, 2021 at 12:22:41PM -0800, Tim Chen wrote:
> >>
> >>
> >> On 1/8/21 7:12 AM, Morten Rasmussen wrote:
> >>> On Thu, Jan 07, 2021 at 03:16:47PM -0800, Tim Chen wrote:
> >>>> On 1/6/21 12:30 AM, Barry Song wrote:
>
> [...]
>
> >> I think it is going to depend on the workload. If there are dependent
> >> tasks that communicate with one another, putting them together
> >> in the same cluster will be the right thing to do to reduce communication
> >> costs. On the other hand, if the tasks are independent, putting them
> >> together on the same cluster
> >> will increase resource contention and spreading them out will be better.
> >
> > Agree. That is exactly where I'm coming from. This is all about the task
> > placement policy. We generally tend to spread tasks to avoid resource
> > contention, SMT and caches, which seems to be what you are proposing to
> > extend. I think that makes sense given it can produce significant
> > benefits.
> >
> >>
> >> Any thoughts on what is the right clustering "tag" to use to clump
> >> related tasks together?
> >> Cgroup? Pid? Tasks with same mm?
> >
> > I think this is the real question. I think the closest thing we have at
> > the moment is the wakee/waker flip heuristic. This seems to be related.
> > Perhaps the wake_affine tricks can serve as starting point?
>
> wake_wide() switches between packing (select_idle_sibling(), llc_size
> CPUs) and spreading (find_idlest_cpu(), all CPUs).
>
> AFAICS, since none of the sched domains set SD_BALANCE_WAKE, currently
> all wakeups are (llc-)packed.
>
> select_task_rq_fair()
>
>     for_each_domain(cpu, tmp)
>
>         if (tmp->flags & sd_flag)
>             sd = tmp;
>
> In case we would like to further distinguish between llc-packing and
> even narrower (cluster or MC-L2)-packing, we would introduce a 2. level
> packing vs. spreading heuristic further down in sis().
>
> IMHO, Barry's current implementation doesn't do this right now. Instead
> he's trying to pack on cluster first and if not successful look further
> among the remaining llc CPUs for an idle CPU.
Right now, in the main cases where wake_affine is used to get better
performance, processes are actually bound within one NUMA node, which is
also an LLC on Kunpeng 920. Probably LLC=NUMA is also true for x86
Jacobsville, Tim?

So one possible way to emulate 2-level packing might be: if the affinity
cpusets of the waker and the wakee are both subsets of the same LLC, use
the cluster alone as the factor to decide packing or not and ignore the
LLC (a rough sketch of what such a check might look like is appended at
the end of this mail). I haven't actually implemented this, but the code
below produces the same result by forcing llc_id = cluster_id:

diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index d72eb8d..3d78097 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -107,7 +107,7 @@ int __init parse_acpi_topology(void)
 			cpu_topology[cpu].cluster_id = topology_id;
 		topology_id = find_acpi_cpu_topology_package(cpu);
 		cpu_topology[cpu].package_id = topology_id;
-
+#if 0
 		i = acpi_find_last_cache_level(cpu);

 		if (i > 0) {
@@ -119,8 +119,11 @@ int __init parse_acpi_topology(void)
 			if (cache_id > 0)
 				cpu_topology[cpu].llc_id = cache_id;
 		}
-	}

+#else
+		cpu_topology[cpu].llc_id = cpu_topology[cpu].cluster_id;
+#endif
+	}
 	return 0;
 }
 #endif

With this, I have seen a major improvement in hackbench, especially for
the one-to-one communication model (fds_num=1, one sender paired with one
receiver):

numactl -N 0 hackbench -p -T -l 200000 -f 1 -g $1

I tested -g (group numbers) of 6, 12, 18, 24, 28 and 32. For each g, I
ran the benchmark 20 times and took the average. The results are as
below:

g    =       6       12       18       24       28       32
w/o       1.3243   1.6741   1.7560   1.9036   2.0262   2.1826
w/        1.1314   1.1864   1.4494   1.6159   1.9078   2.1249

Using "top -H" and pressing "f" to show the CPU of each thread, I can see
that the two threads in one group mostly run in the same cluster. That is
why the hackbench latency decreases so much.

Thanks
Barry
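
To make the 2-level packing idea above a bit more concrete, here is a
purely illustrative and untested sketch of where such a check could sit
in the wakeup path. select_idle_sibling() and nr_cpumask_bits are the
real kernel symbols; cpus_share_cluster() and select_idle_cluster() are
hypothetical helpers assumed to mirror the existing cpus_share_cache()
and select_idle_cpu():

/*
 * Illustrative only - not the patch in this series and not tested.
 * cpus_share_cluster() and select_idle_cluster() are hypothetical
 * helpers assumed to mirror cpus_share_cache() / select_idle_cpu().
 */
static int select_idle_two_level(struct task_struct *p, int prev, int target)
{
	int cpu;

	/* 1st level: prefer packing on an idle CPU inside the cluster */
	if (cpus_share_cluster(prev, target)) {		/* hypothetical */
		cpu = select_idle_cluster(p, target);	/* hypothetical */
		if ((unsigned int)cpu < nr_cpumask_bits)
			return cpu;
	}

	/* 2nd level: fall back to today's LLC-wide search */
	return select_idle_sibling(p, prev, target);
}

The point is only to show where a cluster-level packing vs. spreading
decision could be made before falling back to the existing LLC-wide
search; whether packing is the right default at that level is exactly the
open question discussed above.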