On Thu, Jan 12, 2017 at 4:14 AM, Mark Rutland <mark.rutl...@arm.com> wrote:
> On Tue, Jan 10, 2017 at 12:51:58PM -0800, David Carrillo-Cisneros wrote:
>> On Tue, Jan 10, 2017 at 8:38 AM, Mark Rutland <mark.rutl...@arm.com> wrote:
>> > Hi,
>> >
>> > On Tue, Jan 10, 2017 at 02:24:59AM -0800, David Carrillo-Cisneros wrote:
>> >> During sched in, only events that match the current CPU and cgroup can
>> >> be scheduled in. These events should be added to the PMU in increasing
>> >> timestamp order to guarantee fairness in the event rotation.
>> >>
>> >> In task contexts, events with no CPU affinity or with affinity to the
>> >> current CPU are eligible.
>> >> In CPU contexts, events with no cgroup restriction (CPU events) or with
>> >> a cgroup that is current (either current's task cgroup or one of its
>> >> ancestors) are eligible.
>> >>
>> >> For both context types, we need to iterate over events with different
>> >> rb-tree key groups (different CPUs or different cgroups) in timestamp
>> >> order.
>> >>
>> >> Finding the events in timestamp order is at odds with indexing by
>> >> CPU/cgroup.
>> >
>> > That's a fair point. If we sort by CPU before runtime, we won't get
>> > fairness unless we sort the events solely by runtime at sched_in time.
>> > If we sort by runtime before CPU, we'll have to skip events not
>> > targeting the current CPU when scheduling task events in. I note the
>> > latter is true today anyhow.
>>
>> That's where ctx->inactive_groups comes in. That list is sorted by
>> runtime, and the rb-tree is used to skip to the part of the list that
>> has the events that matter.
>
> Is the list only sorted by runtime and not {cpu,runtime}? If it's the
> latter, I'm not sure I follow. If it's the former, sorry for the noise!
The former. The list is only sorted by runtime.

> The case I'm worried about is a set of task-bound events that have CPU
> filters. For example, if the user opens a set of task-bound events for
> any CPU:
>
>   perf_event_open(attr1, pid, -1, -1, 0);
>   perf_event_open(attr2, pid, -1, -1, 0);
>
> ... and also some for the same task, but limited to a specific CPU:
>
>   perf_event_open(attr3, pid, 1, -1, 0);
>   perf_event_open(attr4, pid, 1, -1, 0);
>
> ... if CPU is before runtime in the sort, one of these groups will
> always be considered first when scheduling, and may starve the other
> group.
>
> Today, the rotation gives us some degree of fairness here that we would
> lose.

Yes, that case is the reason to have the sorted inactive_list and the
tree. I tried to explain this in the changelog of this patch. Please see
my new attempt below.

>> > In Peter's original suggestion we didn't sort by cgroup. IIRC there was
>> > some email thread where the cgroup was considered for the sort (maybe
>> > that was *only* for cpu contexts? I'm not too familiar with cgroups),
>> > though I can't find the relevant mail, if it existed. :/
>>
>> FWIW, in this approach, we only sort by cgroup in CPU contexts, since
>> cgroup events are only installed in CPU contexts.
>
> Sure. However, I think a similar issue to the above applies when
> scheduling events where some are bound to a specific cgroup, and others
> are not.

Yes, it's addressed in the same way. For both task and CPU contexts the
idea is the same:

1. For each "group" of events (either pmu/cpu or cgroup/NULL) that is
   eligible to sched_in on the current CPU/cgroup, find the pair of
   minimum and maximum events in timestamp order, using the rb-tree.
2. Each (min,max) pair is an interval in the ctx->inactive_groups list
   (the list is sorted by timestamp only).
3. Merge all those (min,max) pairs across all eligible CPUs/cgroups to
   obtain an overall (min,max). That is the interval of
   ctx->inactive_groups for ctx_sched_in to use.
The overall interval of events may contain ineligible events (i.e.
events bound to a CPU or cgroup that cannot be scheduled in), but that's
OK because:

  a) events are filtered using event_filter_match anyway, and
  b) since events are sorted by timestamp, events that do not pass the
     filter fall behind in the list on each ctx switch in/out cycle. All
     ineligible events are left behind after a few cycles of ctx switch
     in/out.

This works in amortized linear time for the worst case. The worst case
occurs after a CPU/cgroup change. The stationary case is optimal.

Thanks,
David

>
> Thanks,
> Mark.