On 17 December 2012 16:24, Alex Shi <alex....@intel.com> wrote:
>>>>>>> The scheme below tries to summarise the idea:
>>>>>>>
>>>>>>> Socket      |   socket 0   |   socket 1   |   socket 2   |   socket 3   |
>>>>>>> LCPU        | 0  | 1-15    | 16 | 17-31   | 32 | 33-47   | 48 | 49-63   |
>>>>>>> buddy conf0 | 0  | 0       | 1  | 16      | 2  | 32      | 3  | 48      |
>>>>>>> buddy conf1 | 0  | 0       | 0  | 16      | 16 | 32      | 32 | 48      |
>>>>>>> buddy conf2 | 0  | 0       | 16 | 16      | 32 | 32      | 48 | 48      |
>>>>>>>
>>>>>>> But I don't know how this can interact with NUMA load balance, and
>>>>>>> the better option might be to use conf3.
>>>>>>
>>>>>> I mean conf2, not conf3.
>>>>>
>>>>> So it has 4 levels 0/16/32/ for socket 3 and 0 levels for socket 0;
>>>>> it is unbalanced between the different sockets.
>>>>
>>>> That's the target, because we have decided to pack the small tasks in
>>>> socket 0 when we parsed the topology at boot. We don't have to loop
>>>> over sched_domain or sched_group anymore to find the best LCPU when a
>>>> small task wakes up.
>>>
>>> Iterating over domains and groups is an advantage for the
>>> power-efficiency requirement, not a shortcoming. If some CPUs are
>>> already idle before forking, letting the waking CPU check their
>>> load/util and then decide which one is the best CPU can reduce late
>>> migrations, which saves both performance and power.
>>
>> In fact, we have already done this job once at boot, and we consider
>> that moving small tasks to the buddy CPU is always a benefit, so we
>> don't need to waste time looping over sched_domain and sched_group to
>> compute the current capacity of each LCPU for each wake-up of each
>> small task. We want all small tasks and background activity to wake up
>> on the same buddy CPU, and let the default behavior of the scheduler
>> choose the best CPU for heavy tasks or loaded CPUs.
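
For illustration only, here is a minimal, self-contained sketch of the
buddy idea quoted above: one buddy CPU per LCPU, chosen once at boot and
then looked up in O(1) at wake-up instead of walking sched_domain or
sched_group. It hard-codes the "buddy conf1" row of the table; every name
in it is invented for this example, and it is not the actual patch.

  /* Sketch only: encode the "buddy conf1" row of the table above and
   * use it as an O(1) wake-up target, with no domain/group walk.
   * NR_CPUS, init_buddies() and pack_target() are invented names. */
  #include <stdio.h>

  #define NR_CPUS 64                      /* 4 sockets x 16 LCPUs */

  static int buddy_conf1[NR_CPUS];        /* filled once, as at boot */

  static void init_buddies(void)
  {
          for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                  int socket = cpu / 16;

                  if (cpu % 16)   /* non-leading LCPU: pack on its socket leader */
                          buddy_conf1[cpu] = socket * 16;
                  else            /* socket leader: point one socket towards 0 */
                          buddy_conf1[cpu] = socket ? (socket - 1) * 16 : 0;
          }
  }

  /* A small task waking on @cpu is packed on its buddy. */
  static int pack_target(int cpu)
  {
          return buddy_conf1[cpu];
  }

  int main(void)
  {
          init_buddies();
          printf("CPU 50 -> buddy %d\n", pack_target(50));   /* 48 */
          printf("CPU 48 -> buddy %d\n", pack_target(48));   /* 32 */
          printf("CPU 16 -> buddy %d\n", pack_target(16));   /* 0  */
          return 0;
  }

Chasing pack_target() from CPU 50 gives 48 -> 32 -> 16 -> 0, i.e. the
conf1 chain down to socket 0 that the different confs in the table
shorten or lengthen.
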
> IMHO, the design should be very good for your scenario and your
> machine, but when the code moves into the general scheduler, we do want
> it to handle more general scenarios. Sometimes the 'small task' is not
> as small as the tasks in cyclictest, which can hardly run longer than
> the migration

Cyclictest is the ultimate small-task use case, which points out all the
weaknesses of a scheduler for such tasks. Music playback is a more
realistic one, and it also shows an improvement.

> granularity or one tick, so we really don't need to consider the task
> migration cost. But when the tasks are not too small, migration is

For which kind of machine are you stating that hypothesis?

> heavier than domain/group walking; that is common sense in
> fork/exec/wake balancing.

I would have said the opposite: the current scheduler limits its
computation of statistics during fork/exec/wake compared to a periodic
load balance because it's too heavy. That's even more true for wake-up
if wake affine is possible.

>
>>
>>> On the contrary, walking a task through each level of buddies is not
>>> only bad for performance but also bad for power. Consider the quite
>>> big latency of waking a deeply idle CPU; we lose too much.
>>
>> My results have shown a different conclusion.
>
> That should be because your tasks are too small to need to consider the
> migration cost.
>
>> In fact, there is a much higher chance that the buddy will not be in a
>> deep idle state, as all the small tasks and background activity are
>> already waking up on this CPU.
>
> powertop is helpful to tune your system for more idle time. Another
> reason is that the current kernel just tries to spread tasks on more
> CPUs for performance. My power scheduling patch should be helpful for
> this.
>
>>
>>>
>>>>
>>>>> And the ground level has just one buddy for 16 LCPUs - 8 cores;
>>>>> that's not a good design. Consider my previous examples: if there
>>>>> are 4 or 8 tasks in one socket, you have just 2 choices: spread them
>>>>> across all cores, or pack them onto one LCPU. Actually, moving them
>>>>> onto just 2 or 4 cores may be a better solution, but the design
>>>>> misses this.
>>>>
>>>> You speak about tasks without any notion of load. This patch only
>>>> cares about small tasks and light LCPU load, but it falls back to the
>>>> default behavior for other situations. So if there are 4 or 8 small
>>>> tasks, they will migrate to socket 0 after 1 and up to 3 migrations
>>>> (it depends on the conf and the LCPU they come from).
>>>
>>> According to your patch, what you mean by 'notion of load' is the
>>> utilization of the CPU, not the load weight of the tasks, right?
>>
>> Yes, but not only that. The number of tasks that run simultaneously is
>> another important input.
>>
>>> Yes, I just talked about task numbers, but it naturally extends to the
>>> task utilization of the CPU. Like 8 tasks with 25% util, which can
>>> fully fill just 2 CPUs but is clearly beyond the capacity of the
>>> buddy, so you need to wake up another CPU socket while the local
>>> socket has some LCPUs idle...
>>
>> 8 tasks with a running period of 25ms per 100ms that wake up
>> simultaneously should probably run on 8 different LCPUs in order to
>> race to idle.
>
> Nope, it's rare for 8 tasks to wake up simultaneously. And

Multimedia is one example of tasks waking up simultaneously.

> even so, they should run in the same socket for power-saving reasons
> (my power scheduling patch can do this), instead of being spread to all
> sockets.

This may be good for your scenario and your machine :-) Packing small
tasks is the best choice for any scenario and machine. It's a trickier
point for not-so-small tasks, because different machines will want
different behavior.
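
For illustration only again, a minimal sketch of the two inputs discussed
above -- the task's own utilization and how busy the buddy already is --
deciding whether to pack or to fall back to the default wake-up path. The
thresholds, structures and helper names are assumptions invented for this
example, not taken from the actual patch.

  #include <stdbool.h>
  #include <stdio.h>

  struct task_stats {
          unsigned int run_pct;      /* % of time the task has been running */
  };

  struct cpu_stats {
          unsigned int nr_running;   /* tasks currently runnable on the CPU */
          unsigned int usage_pct;    /* % of the CPU capacity already used */
  };

  /* A "small" task only uses a small fraction of one CPU. */
  static bool is_small_task(const struct task_stats *t)
  {
          return t->run_pct < 25;                    /* assumed threshold */
  }

  /* Don't pack when the buddy is already saturated: too many runnable
   * tasks, or not enough spare capacity left for one more small task. */
  static bool buddy_can_take(const struct cpu_stats *buddy,
                             const struct task_stats *t)
  {
          return buddy->nr_running < 2 &&
                 buddy->usage_pct + t->run_pct <= 100;
  }

  int main(void)
  {
          struct task_stats music = { .run_pct = 10 };
          struct cpu_stats buddy  = { .nr_running = 1, .usage_pct = 60 };

          if (is_small_task(&music) && buddy_can_take(&buddy, &music))
                  printf("pack the task on the buddy CPU\n");
          else
                  printf("fall back to the default wake-up path\n");
          return 0;
  }

In the 8-tasks-at-25% example, a check like buddy_can_take() would stop
packing once the buddy already runs a couple of them, and the default
balancer would handle the rest.
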
>>
>> Regards,
>> Vincent
>>
>>>>
>>>> Then, if too many small tasks wake up simultaneously on the same
>>>> LCPU, the default load balance will spread them across the
>>>> core/cluster/socket.
>>>>
>>>>> Obviously, more and more cores is the trend for any kind of CPU; the
>>>>> buddy system seems hard pressed to catch up with this.
>>>>>
>>>
>>> --
>>> Thanks
>>> Alex
>
> --
> Thanks
> Alex

_______________________________________________
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev