Re: [Intel-gfx] [RFC 10/12] cgroup/drm: Introduce weight based drm cgroup control

2023-02-02 Thread Tejun Heo
Hello,

On Thu, Feb 02, 2023 at 02:26:06PM +, Tvrtko Ursulin wrote:
> When you say active/inactive - what are you referring to in the cgroup
> world? Offline/online? For those my understanding was that offline is a
> temporary state while the css is getting destroyed.

Oh, it's just based on activity. So, for example, iocost puts a cgroup on
its active list, which is scanned periodically, when an IO is issued from an
inactive cgroup. If an active cgroup doesn't have any activity between two
scans, it becomes inactive and is dropped from the list. drm can prolly use the
same approach?
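
As an illustration only - none of this is in the RFC, and the names
drmcs_active_list, active_node and saw_activity are made up - the iocost
style bookkeeping could look roughly like this on the drm side:

static LIST_HEAD(drmcs_active_list);
static DEFINE_SPINLOCK(drmcs_active_lock);

/* Called when GPU work is submitted for a group not yet on the list. */
static void drmcs_mark_active(struct drm_cgroup_state *drmcs)
{
        spin_lock(&drmcs_active_lock);
        if (list_empty(&drmcs->active_node))
                list_add_tail(&drmcs->active_node, &drmcs_active_list);
        drmcs->saw_activity = true;
        spin_unlock(&drmcs_active_lock);
}

/* The scanner walks only the active list instead of the whole css tree. */
static void drmcs_prune_idle(void)
{
        struct drm_cgroup_state *drmcs, *tmp;

        spin_lock(&drmcs_active_lock);
        list_for_each_entry_safe(drmcs, tmp, &drmcs_active_list, active_node) {
                if (!drmcs->saw_activity)
                        list_del_init(&drmcs->active_node); /* idle since last scan */
                else
                        drmcs->saw_activity = false; /* re-arm for the next scan */
        }
        spin_unlock(&drmcs_active_lock);
}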

> Also, I am really postponing implementing those changes until I hear at
> least something from the DRM community.

Yeah, that sounds like a good idea.

Thanks.

-- 
tejun


Re: [Intel-gfx] [RFC 10/12] cgroup/drm: Introduce weight based drm cgroup control

2023-02-02 Thread Tvrtko Ursulin



On 28/01/2023 01:11, Tejun Heo wrote:

On Thu, Jan 12, 2023 at 04:56:07PM +, Tvrtko Ursulin wrote:
...

+   /*
+* 1st pass - reset working values and update hierarchical weights and
+* GPU utilisation.
+*/
+   if (!__start_scanning(root, period_us))
+   goto out_retry; /*
+* Always come back later if scanner races with
+* core cgroup management. (Repeated pattern.)
+*/
+
+   css_for_each_descendant_pre(node, &root->css) {
+   struct drm_cgroup_state *drmcs = css_to_drmcs(node);
+   struct cgroup_subsys_state *css;
+   unsigned int over_weights = 0;
+   u64 unused_us = 0;
+
+   if (!css_tryget_online(node))
+   goto out_retry;
+
+   /*
+* 2nd pass - calculate initial budgets, mark over budget
+* siblings and add up unused budget for the group.
+*/
+   css_for_each_child(css, &drmcs->css) {
+   struct drm_cgroup_state *sibling = css_to_drmcs(css);
+
+   if (!css_tryget_online(css)) {
+   css_put(node);
+   goto out_retry;
+   }
+
+   sibling->per_s_budget_us  =
+   DIV_ROUND_UP_ULL(drmcs->per_s_budget_us *
+sibling->weight,
+drmcs->sum_children_weights);
+
+   sibling->over = sibling->active_us >
+   sibling->per_s_budget_us;
+   if (sibling->over)
+   over_weights += sibling->weight;
+   else
+   unused_us += sibling->per_s_budget_us -
+sibling->active_us;
+
+   css_put(css);
+   }
+
+   /*
+* 3rd pass - spread unused budget according to relative weights
+* of over budget siblings.
+*/
+   css_for_each_child(css, &drmcs->css) {
+   struct drm_cgroup_state *sibling = css_to_drmcs(css);
+
+   if (!css_tryget_online(css)) {
+   css_put(node);
+   goto out_retry;
+   }
+
+   if (sibling->over) {
+   u64 budget_us =
+   DIV_ROUND_UP_ULL(unused_us *
+sibling->weight,
+over_weights);
+   sibling->per_s_budget_us += budget_us;
+   sibling->over = sibling->active_us  >
+   sibling->per_s_budget_us;
+   }
+
+   css_put(css);
+   }
+
+   css_put(node);
+   }
+
+   /*
+* 4th pass - send out over/under budget notifications.
+*/
+   css_for_each_descendant_post(node, &root->css) {
+   struct drm_cgroup_state *drmcs = css_to_drmcs(node);
+
+   if (!css_tryget_online(node))
+   goto out_retry;
+
+   if (drmcs->over || drmcs->over_budget)
+   signal_drm_budget(drmcs,
+ drmcs->active_us,
+ drmcs->per_s_budget_us);
+   drmcs->over_budget = drmcs->over;
+
+   css_put(node);
+   }


It keeps bothering me that the distribution logic has no memory. Maybe this
is good enough for coarse control with long cycle durations but it likely
will get in trouble if pushed to finer grained control. State keeping
doesn't require a lot of complexity. The only state that needs tracking is
each cgroup's vtime and then the core should be able to tell specific
drivers how much each cgroup is over or under fairly accurately at any given
time.

That said, this isn't a blocker. What's implemented can work well enough
with coarse enough time grain and that might be enough for the time being
and we can get back to it later. I think Michal already mentioned it but it
might be a good idea to track active and inactive cgroups and build the
weight tree with only active ones. There are machines with a lot of mostly
idle cgroups (> tens of thousands) and tree wide scanning even at low
frequency can become a pretty bad bottleneck.


Right, that's the kind of experience (tens of thousands) I was missing,
thank you. Another item for my TODO list then, but I have a question
first.


When you say active/inactive - what are you referring to in the cgroup
world?

Re: [Intel-gfx] [RFC 10/12] cgroup/drm: Introduce weight based drm cgroup control

2023-02-02 Thread Michal Koutný
On Thu, Jan 12, 2023 at 04:56:07PM +, Tvrtko Ursulin wrote:
> +static int drmcs_can_attach(struct cgroup_taskset *tset)
> +{
> + int ret;
> +
> + /*
> +  * As processes are getting moved between groups we need to ensure
> +  * both that the old group does not see a sudden downward jump in the
> +  * GPU utilisation, and that the new group does not see a sudden jump
> +  * up from all the GPU time the clients belonging to the migrated
> +  * process have accumulated.
> +  *
> +  * To achieve that we suspend the scanner until the migration is
> +  * completed, where the resume at the end ensures both groups start
> +  * observing GPU utilisation from a reset state.
> +  */
> +
> + ret = mutex_lock_interruptible(&drmcg_mutex);
> + if (ret)
> + return ret;
> + start_suspend_scanning();
> + mutex_unlock(&drmcg_mutex);
> +
> + finish_suspend_scanning();

Here's scanning suspension, communicated via 

root_drmcs.scanning_suspended = true;
root_drmcs.suspended_period_us = root_drmcs.period_us;
root_drmcs.period_us = 0;

but I don't see those used in scan_worker() and the scanning traversal
can apparently run concurrently with a task migration.

> [...]
> +static bool
> +__start_scanning(struct drm_cgroup_state *root, unsigned int period_us)
> [...]
> + css_for_each_descendant_post(node, &root->css) {
> [...]
> + active = drmcs_get_active_time_us(drmcs);
> + if (period_us && active > drmcs->prev_active_us)
> + drmcs->active_us += active - drmcs->prev_active_us;
> + drmcs->prev_active_us = active;

drmcs_get_active_time_us() could count a task's contribution here,
the task could then migrate to a different drmcs,
and its time would be counted a second time.
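
A concrete (entirely hypothetical) sequence: the scan visits group A and
charges task T's 10 ms of GPU time to A->active_us; T then migrates to
group B before the scan reaches B; when the scan sums up B's tasks, T's
accumulated time shows up there as well, while B->prev_active_us knows
nothing about it, so the same GPU time ends up charged to both groups
within a single period.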






Re: [Intel-gfx] [RFC 10/12] cgroup/drm: Introduce weight based drm cgroup control

2023-02-02 Thread Michal Koutný
On Fri, Jan 27, 2023 at 01:31:54PM +, Tvrtko Ursulin wrote:
> I think you missed the finish_suspend_scanning() part:
> 
>   if (root_drmcs.suspended_period_us)
>   cancel_delayed_work_sync(&root_drmcs.scan_work);
> 
> So if scanning was in progress migration will wait until it finishes.

Indeed, I've missed that. Thank you!

> Not claiming I did not miss something because I was totally new to cgroup
> internals when I started working on this. So it is definitely useful to have
> more eyes looking.

The custom with (especially v2, especially horizontal) migrations
is that they're treated leniently to avoid performance costs.

I'm afraid waiting for scan in can_attach() can propagate globally (via
cgroup_update_dfl_csses() and cgroup_attach_lock()) sometimes.

OTOH, unless I misunderstood, you need to cover explicit (not task but
resource, when sending client FD around) migration anyway?
(I.e. my suggestion would be to mutually exclude scanning and explicit
migration but not scanning and task migration in order to avoid possible
global propagation.)

Thanks,
Michal




Re: [Intel-gfx] [RFC 10/12] cgroup/drm: Introduce weight based drm cgroup control

2023-01-27 Thread Tejun Heo
On Thu, Jan 12, 2023 at 04:56:07PM +, Tvrtko Ursulin wrote:
...
> + /*
> +  * 1st pass - reset working values and update hierarchical weights and
> +  * GPU utilisation.
> +  */
> + if (!__start_scanning(root, period_us))
> + goto out_retry; /*
> +  * Always come back later if scanner races with
> +  * core cgroup management. (Repeated pattern.)
> +  */
> +
> + css_for_each_descendant_pre(node, &root->css) {
> + struct drm_cgroup_state *drmcs = css_to_drmcs(node);
> + struct cgroup_subsys_state *css;
> + unsigned int over_weights = 0;
> + u64 unused_us = 0;
> +
> + if (!css_tryget_online(node))
> + goto out_retry;
> +
> + /*
> +  * 2nd pass - calculate initial budgets, mark over budget
> +  * siblings and add up unused budget for the group.
> +  */
> + css_for_each_child(css, &drmcs->css) {
> + struct drm_cgroup_state *sibling = css_to_drmcs(css);
> +
> + if (!css_tryget_online(css)) {
> + css_put(node);
> + goto out_retry;
> + }
> +
> + sibling->per_s_budget_us  =
> + DIV_ROUND_UP_ULL(drmcs->per_s_budget_us *
> +  sibling->weight,
> +  drmcs->sum_children_weights);
> +
> + sibling->over = sibling->active_us >
> + sibling->per_s_budget_us;
> + if (sibling->over)
> + over_weights += sibling->weight;
> + else
> + unused_us += sibling->per_s_budget_us -
> +  sibling->active_us;
> +
> + css_put(css);
> + }
> +
> + /*
> +  * 3rd pass - spread unused budget according to relative weights
> +  * of over budget siblings.
> +  */
> + css_for_each_child(css, &drmcs->css) {
> + struct drm_cgroup_state *sibling = css_to_drmcs(css);
> +
> + if (!css_tryget_online(css)) {
> + css_put(node);
> + goto out_retry;
> + }
> +
> + if (sibling->over) {
> + u64 budget_us =
> + DIV_ROUND_UP_ULL(unused_us *
> +  sibling->weight,
> +  over_weights);
> + sibling->per_s_budget_us += budget_us;
> + sibling->over = sibling->active_us  >
> + sibling->per_s_budget_us;
> + }
> +
> + css_put(css);
> + }
> +
> + css_put(node);
> + }
> +
> + /*
> +  * 4th pass - send out over/under budget notifications.
> +  */
> + css_for_each_descendant_post(node, &root->css) {
> + struct drm_cgroup_state *drmcs = css_to_drmcs(node);
> +
> + if (!css_tryget_online(node))
> + goto out_retry;
> +
> + if (drmcs->over || drmcs->over_budget)
> + signal_drm_budget(drmcs,
> +   drmcs->active_us,
> +   drmcs->per_s_budget_us);
> + drmcs->over_budget = drmcs->over;
> +
> + css_put(node);
> + }

It keeps bothering me that the distribution logic has no memory. Maybe this
is good enough for coarse control with long cycle durations but it likely
will get in trouble if pushed to finer grained control. State keeping
doesn't require a lot of complexity. The only state that needs tracking is
each cgroup's vtime and then the core should be able to tell specific
drivers how much each cgroup is over or under fairly accurately at any given
time.
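
For illustration only - none of this is in the RFC, and vtime_us plus the
two helpers are made-up names - the state Tejun describes could be as
small as:

/*
 * Hypothetical sketch of per-cgroup GPU vtime in the iocost spirit.
 * vtime advances inversely proportionally to weight, so comparing it
 * against wall clock time says how far over or under budget a group is.
 */
static void drmcs_charge(struct drm_cgroup_state *drmcs, u64 ran_us)
{
        /* Heavier weight => vtime advances more slowly per us of GPU use. */
        drmcs->vtime_us += div64_u64(ran_us * CGROUP_WEIGHT_DFL, drmcs->weight);
}

static s64 drmcs_lag_us(struct drm_cgroup_state *drmcs, u64 now_us)
{
        /* Positive => the group has consumed more than its fair share. */
        return (s64)drmcs->vtime_us - (s64)now_us;
}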

That said, this isn't a blocker. What's implemented can work well enough
with coarse enough time grain and that might be enough for the time being
and we can get back to it later. I think Michal already mentioned it but it
might be a good idea to track active and inactive cgroups and build the
weight tree with only active ones. There are machines with a lot of mostly
idle cgroups (> tens of thousands) and tree wide scanning even at low
frequency can become a pretty bad bottleneck.

Thanks.

-- 
tejun


Re: [Intel-gfx] [RFC 10/12] cgroup/drm: Introduce weight based drm cgroup control

2023-01-27 Thread Tvrtko Ursulin



On 27/01/2023 14:11, Michal Koutný wrote:

On Fri, Jan 27, 2023 at 01:31:54PM +, Tvrtko Ursulin wrote:

I think you missed the finish_suspend_scanning() part:

if (root_drmcs.suspended_period_us)
cancel_delayed_work_sync(&root_drmcs.scan_work);

So if scanning was in progress migration will wait until it finishes.


Indeed, I've missed that. Thank you!


Not claiming I did not miss something because I was totally new to cgroup
internals when I started working on this. So it is definitely useful to have
more eyes looking.


The custom with (especially v2, especially horizontal) migrations
is that they're treated leniently to avoid performance costs.

I'm afraid waiting for scan in can_attach() can propagate globally (via
cgroup_update_dfl_csses() and cgroup_attach_lock()) sometimes.


That something along those lines might be a concern was indeed worrying
me when coming up with the scheme. A good insider knowledge hint, thank
you. I will have a deeper look.



OTOH, unless I misunderstood, you need to cover explicit (not task but
resource, when sending client FD around) migration anyway?


Correct. So far that was handled outside the cgroup controller, in the
drm layer, and any lock dependency propagation was hidden behind RCU.
But that will likely change once I try your suggestion of eliminating
the struct pid based client tracking, at which point it will become relevant.



(I.e. my suggestion would be to mutually exclude scanning and explicit
migration but not scanning and task migration in order to avoid possible
global propagation.)


Thanks, I will hopefully look into all of this shortly. Perhaps what you
suggest will come naturally with the removal of struct pid based tracking.


Regards,

Tvrtko


Re: [Intel-gfx] [RFC 10/12] cgroup/drm: Introduce weight based drm cgroup control

2023-01-27 Thread Tvrtko Ursulin



On 27/01/2023 13:01, Michal Koutný wrote:

On Thu, Jan 12, 2023 at 04:56:07PM +, Tvrtko Ursulin wrote:

+static int drmcs_can_attach(struct cgroup_taskset *tset)
+{
+   int ret;
+
+   /*
+* As processes are getting moved between groups we need to ensure
+* both that the old group does not see a sudden downward jump in the
+* GPU utilisation, and that the new group does not see a sudden jump
+* up from all the GPU time the clients belonging to the migrated
+* process have accumulated.
+*
+* To achieve that we suspend the scanner until the migration is
+* completed, where the resume at the end ensures both groups start
+* observing GPU utilisation from a reset state.
+*/
+
+   ret = mutex_lock_interruptible(&drmcg_mutex);
+   if (ret)
+   return ret;
+   start_suspend_scanning();
+   mutex_unlock(&drmcg_mutex);
+
+   finish_suspend_scanning();


Here's scanning suspension, communicated via

root_drmcs.scanning_suspended = true;
root_drmcs.suspended_period_us = root_drmcs.period_us;
root_drmcs.period_us = 0;

but I don't see those used in scan_worker() and the scanning traversal
can apparently run concurrently with a task migration.


I think you missed the finish_suspend_scanning() part:

if (root_drmcs.suspended_period_us)
cancel_delayed_work_sync(&root_drmcs.scan_work);

So if scanning was in progress migration will wait until it finishes. 
Scanning is then re-started only when the migration is done (drmcs_attach)
or has failed (drmcs_cancel_attach).
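
Paraphrasing the snippets quoted in this thread (the exact code in the
patch may differ, and resume_scanning() below is a reconstruction, not a
name from the RFC), the suspend/resume pairing is roughly:

static void start_suspend_scanning(void)
{
        lockdep_assert_held(&drmcg_mutex);

        if (root_drmcs.scanning_suspended)
                return;

        root_drmcs.scanning_suspended = true;
        root_drmcs.suspended_period_us = root_drmcs.period_us;
        root_drmcs.period_us = 0; /* stops the worker from re-arming itself */
}

/* Called without the mutex so waiting for the worker cannot deadlock. */
static void finish_suspend_scanning(void)
{
        if (root_drmcs.suspended_period_us)
                cancel_delayed_work_sync(&root_drmcs.scan_work);
}

/* Invoked from drmcs_attach() / drmcs_cancel_attach() to restart scanning. */
static void resume_scanning(void)
{
        lockdep_assert_held(&drmcg_mutex);

        if (!root_drmcs.scanning_suspended)
                return;

        root_drmcs.scanning_suspended = false;
        root_drmcs.period_us = root_drmcs.suspended_period_us;
        if (root_drmcs.period_us)
                mod_delayed_work(system_wq, &root_drmcs.scan_work,
                                 usecs_to_jiffies(root_drmcs.period_us));
}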


Not claiming I did not miss something because I was totally new to
cgroup internals when I started working on this. So it is definitely
useful to have more eyes looking.



[...]
+static bool
+__start_scanning(struct drm_cgroup_state *root, unsigned int period_us)
[...]
+   css_for_each_descendant_post(node, &root->css) {
[...]
+   active = drmcs_get_active_time_us(drmcs);
+   if (period_us && active > drmcs->prev_active_us)
+   drmcs->active_us += active - drmcs->prev_active_us;
+   drmcs->prev_active_us = active;


drmcs_get_active_time_us() could count a task's contribution here,
the task could then migrate to a different drmcs,
and its time would be counted a second time.


Let's see... __start_scanning() can be called from the worker, so at most
one instance at a time - no issue there.


Then from resume scanning, where it is guaranteed the worker is not running
and cannot restart, since the mutex guards the re-start.


Finally from drmcs_write_period_us() - yes, there __start_scanning() can
race with the worker invoking it - oops! However... this is just a
debugging aid, as the cover letter explains. This file is not intended
to be present in the final version; rather, as per the earlier discussion
with Tejun, the idea is to only have a boot time option to control the
functionality (enable/disable or period).


I will nevertheless try to fix this race up for the next posting to 
avoid further confusion!


Regards,

Tvrtko


[Intel-gfx] [RFC 10/12] cgroup/drm: Introduce weight based drm cgroup control

2023-01-12 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

Similar to CPU scheduling, implement a concept of weight in the drm cgroup
controller.

Uses the same range and default as the CPU controller - CGROUP_WEIGHT_MIN,
CGROUP_WEIGHT_DFL and CGROUP_WEIGHT_MAX.

Later each cgroup is assigned a time budget proportionally, based on the
relative weights of its siblings. This time budget is in turn split by
the group's children and so on.
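
For illustration (the numbers are hypothetical): with a one second scanning
period and two siblings with weights 100 and 300, the first sibling's budget
is 1000000 * 100 / 400 = 250000 us and the second's is 750000 us; if the
heavier sibling in turn has two equally weighted children, each of them
receives 375000 us, and so on down the tree.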

This will be used to implement a soft, or best effort, signal from the drm
cgroup controller to the drm core, notifying it about groups which are over
their allotted budget.

No guarantees that the limit can be enforced are provided or implied.

Signed-off-by: Tvrtko Ursulin 
---
 Documentation/admin-guide/cgroup-v2.rst |  37 ++
 drivers/gpu/drm/Kconfig |   1 +
 init/Kconfig|   1 +
 kernel/cgroup/drm.c | 506 +++-
 4 files changed, 541 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index c8ae7c897f14..9894dd59e4c5 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2407,6 +2407,43 @@ HugeTLB Interface Files
 hugetlb pages of  in this cgroup.  Only active in
 use hugetlb pages are included.  The per-node values are in bytes.
 
+DRM
+---
+
+The DRM controller allows configuring scheduling soft limits.
+
+DRM scheduling soft limits
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Because of the heterogeneous hardware and driver DRM capabilities, soft limits
+are implemented as a loose co-operative (bi-directional) interface between the
+controller and DRM core.
+
+The controller configures the GPU time allowed per group and periodically scans
+the group's tasks to detect the over budget condition, at which point it
+invokes a callback notifying the DRM core of the condition.
+
+DRM core provides an API to query per process GPU utilization and a second API to
+receive notification from the cgroup controller when the group enters or exits
+the over budget condition.
+
+Individual DRM drivers which implement the interface are expected to act on this
+in a best-effort manner only. There are no guarantees that the soft limits
+will be respected.
+
+DRM scheduling soft limits interface files
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+  drm.weight
+   Standard cgroup weight based control [1, 10000] used to configure the
+   relative distribution of GPU time between the sibling groups.
+
+  drm.period_us (debugging aid during RFC only)
+   An integer representing the period with which the controller should look
+   at the GPU usage by the group and potentially send the over/under budget
+   signal.
+   Value of zero (default) disables the soft limit checking.
+
 Misc
 
 
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index b56b9f2fe8e6..0fbfd3026b71 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -7,6 +7,7 @@
 #
 menuconfig DRM
tristate "Direct Rendering Manager (XFree86 4.1.0 and higher DRI 
support)"
+   default y if CGROUP_DRM=y
depends on (AGP || AGP=n) && !EMULATED_CMPXCHG && HAS_DMA
select DRM_PANEL_ORIENTATION_QUIRKS
select HDMI
diff --git a/init/Kconfig b/init/Kconfig
index c5ace0d57007..304418674097 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1091,6 +1091,7 @@ config CGROUP_RDMA
 
 config CGROUP_DRM
bool "DRM controller"
+   select DRM
help
  Provides the DRM subsystem controller.
 
diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
index 3e7f165806de..527b7bf8c576 100644
--- a/kernel/cgroup/drm.c
+++ b/kernel/cgroup/drm.c
@@ -8,14 +8,43 @@
 #include 
 #include 
 
+#include 
+
 struct drm_cgroup_state {
struct cgroup_subsys_state css;
+
+   unsigned int weight;
+
+   /*
+* Below fields are owned and updated by the scan worker. Either the
+* worker accesses them, or worker needs to be suspended and synced
+* before they can be touched from the outside.
+*/
+   unsigned int sum_children_weights;
+
+   bool over;
+   bool over_budget;
+
+   u64 per_s_budget_us;
+   u64 prev_active_us;
+   u64 active_us;
 };
 
 struct drm_root_cgroup_state {
struct drm_cgroup_state drmcs;
+
+   unsigned int period_us;
+
+   ktime_t prev_timestamp;
+
+   bool scanning_suspended;
+   unsigned int suspended_period_us;
+
+   struct delayed_work scan_work;
 };
 
+static DEFINE_MUTEX(drmcg_mutex);
+
 static inline struct drm_cgroup_state *
 css_to_drmcs(struct cgroup_subsys_state *css)
 {
@@ -29,10 +58,355 @@ static inline struct drm_cgroup_state *get_task_drmcs(struct task_struct *task)
 
 static struct drm_root_cgroup_state root_drmcs;
 
+static u64 drmcs_get_active_time_us(struct drm_cgroup_state *drmcs)
+{
+   struct cgroup *cgrp = drmcs->css.cgroup;
+   struct task_struct *task;
+   struct