Re: [Intel-gfx] [RFC v5 00/17] DRM cgroup controller with scheduling control and memory stats

2023-07-20 Thread T.J. Mercier
On Thu, Jul 20, 2023 at 3:55 AM Tvrtko Ursulin wrote:
>
>
> Hi,
>
> On 19/07/2023 21:31, T.J. Mercier wrote:
> > On Wed, Jul 12, 2023 at 4:47 AM Tvrtko Ursulin wrote:
> >>
> >>   drm.memory.stat
> >>  A nested file containing cumulative memory statistics for the whole
> >>  sub-hierarchy, broken down into separate GPUs and separate memory
> >>  regions supported by the latter.
> >>
> >>  For example::
> >>
> >>    $ cat drm.memory.stat
> >>    card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
> >>    card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
> >>
> >>  Card designation corresponds to the DRM device names and multiple line
> >>  entries can be present per card.
> >>
> >>  Memory region names should be expected to be driver specific with the
> >>  exception of 'system' which is standardised and applicable for GPUs
> >>  which can operate on system memory buffers.
> >>
> >>  Sub-keys 'resident' and 'purgeable' are optional.
> >>
> >>  Per category region usage is reported in bytes.
> >>
> >>   * Feedback from people interested in drm.active_us and drm.memory.stat is
> >>     required to understand the use cases and their usefulness (of the fields).
> >>
> >> Memory stats are something which was easy to add to my series, since I was
> >> already working on the fdinfo memory stats patches, but the question is how
> >> useful it is.
> >>
> > Hi Tvrtko,
> >
> > I think this style of driver-defined categories for reporting of
> > memory could potentially allow us to eliminate the GPU memory tracking
> > tracepoint used on Android (gpu_mem_total). This would involve reading
> > drm.memory.stat at the root cgroup (I see it's currently disabled on
>
> I can put it available under root too, don't think there is any
> technical reason to not have it. In fact, now that I look at it again,
> memory.stat is present on root so that would align with my general
> guideline to keep the two as similar as possible.
>
> > the root), which means traversing the whole cgroup tree under the
> > cgroup lock to generate the values on-demand. This would be done
> > rarely, but I still wonder what the cost of that would turn out to be.
>
> Yeah that's ugly. I could eliminate cgroup_lock by being a bit smarter.
> Just didn't think it worth it for the RFC.
>
> Basically to account memory stats for any sub-tree I need the equivalent
> of one struct drm_memory_stats per DRM device present in the hierarchy. So
> I could pre-allocate a few and restart if run out of spares, or
> something. They are really small so pre-allocating a good number, based
> on past state or something, should be good enough. Or even total
> number of DRM devices in a system as a pessimistic and safe option for
> most reasonable deployments.
>
> > The drm_memory_stats categories in the output don't seem like a big
> > value-add for this use-case, but no real objection to them being
>
> You mean the fact there are different categories is not a value add for
> your use case because you would only use one?
>
Exactly, I guess that'd be just "private" (or pick another one) for
the driver-defined "regions" where
shared/private/resident/purgeable/active aren't really applicable.
That doesn't seem like a big problem to me since you already need an
understanding of what a driver-defined region means. It's just adding
a requirement to understand what fields are used, and a driver can
document that in the same place as the region itself. That does mean
performing arithmetic on values from different drivers might not make
sense. But this is just my perspective from trying to fit the
gpu_mem_total tracepoint here. I think we could probably change the
way drivers that use it report memory to fit closer into the
drm_memory_stats categories.

> The idea was to align 1:1 with DRM memory stats fdinfo and somewhat
> emulate how memory.stat also offers a breakdown.
>
> > there. I know it's called the DRM cgroup controller, but it'd be nice
> > if there were a way to make the mem tracking part work for any driver
> > that wishes to participate as many of our devices don't use a DRM
> > driver. But making that work doesn't look like it would fit very
>
> Ah that would be a challenge indeed to which I don't have any answers
> right now.
>
> Hm if you have a DRM device somewhere in the chain memory stats would
> still show up. Like if you had a dma-buf producer which is not a DRM
> driver, but then that buffer was imported by a DRM driver, it would show
> up in a cgroup. Or vice-versa. But if there aren't any in the whole
> chain then it would not.
>
Creating a dummy DRM driver underneath an existing driver as an
adaptation layer also came to mind, but yeah... probably not. :)

By the way I still want to try to add tracking for dma-bufs 

Re: [Intel-gfx] [RFC v5 00/17] DRM cgroup controller with scheduling control and memory stats

2023-07-20 Thread Tvrtko Ursulin



Hi,

On 19/07/2023 21:31, T.J. Mercier wrote:
> On Wed, Jul 12, 2023 at 4:47 AM Tvrtko Ursulin wrote:
>>
>>   drm.memory.stat
>>  A nested file containing cumulative memory statistics for the whole
>>  sub-hierarchy, broken down into separate GPUs and separate memory
>>  regions supported by the latter.
>>
>>  For example::
>>
>>    $ cat drm.memory.stat
>>    card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
>>    card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
>>
>>  Card designation corresponds to the DRM device names and multiple line
>>  entries can be present per card.
>>
>>  Memory region names should be expected to be driver specific with the
>>  exception of 'system' which is standardised and applicable for GPUs
>>  which can operate on system memory buffers.
>>
>>  Sub-keys 'resident' and 'purgeable' are optional.
>>
>>  Per category region usage is reported in bytes.
>>
>>   * Feedback from people interested in drm.active_us and drm.memory.stat is
>>     required to understand the use cases and their usefulness (of the fields).
>>
>> Memory stats are something which was easy to add to my series, since I was
>> already working on the fdinfo memory stats patches, but the question is how
>> useful it is.
>>
> Hi Tvrtko,
>
> I think this style of driver-defined categories for reporting of
> memory could potentially allow us to eliminate the GPU memory tracking
> tracepoint used on Android (gpu_mem_total). This would involve reading
> drm.memory.stat at the root cgroup (I see it's currently disabled on

I can put it available under root too, don't think there is any
technical reason to not have it. In fact, now that I look at it again,
memory.stat is present on root so that would align with my general
guideline to keep the two as similar as possible.

> the root), which means traversing the whole cgroup tree under the
> cgroup lock to generate the values on-demand. This would be done
> rarely, but I still wonder what the cost of that would turn out to be.

Yeah that's ugly. I could eliminate cgroup_lock by being a bit smarter.
Just didn't think it worth it for the RFC.

Basically to account memory stats for any sub-tree I need the equivalent
of one struct drm_memory_stats per DRM device present in the hierarchy. So
I could pre-allocate a few and restart if run out of spares, or
something. They are really small so pre-allocating a good number, based
on past state or something, should be good enough. Or even total
number of DRM devices in a system as a pessimistic and safe option for
most reasonable deployments.
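
The pre-allocate-and-restart idea can be sketched in user-space Python as below. This is only an illustration under stated assumptions, not code from the series: `accumulate`, `accumulate_with_retry`, `OutOfSpares` and the doubling policy are hypothetical names and choices, and plain integers stand in for struct drm_memory_stats.

```python
class OutOfSpares(Exception):
    """Raised when the walk meets more devices than pre-allocated slots."""


def accumulate(groups, spares):
    """Sum per-device totals using at most `spares` pre-allocated slots."""
    slots = [None] * spares          # allocated up front, before the walk
    index = {}                       # device name -> slot number
    for group in groups:             # stands in for the sub-tree walk
        for device, total in group.items():
            if device not in index:
                if len(index) == spares:
                    raise OutOfSpares
                index[device] = len(index)
                slots[index[device]] = 0
            slots[index[device]] += total
    return {device: slots[i] for device, i in index.items()}


def accumulate_with_retry(groups, spares=1):
    """Restart the whole walk with more spares whenever they run out."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return accumulate(groups, spares), attempts
        except OutOfSpares:
            spares *= 2              # hypothetical growth policy


groups = [{"card0": 100}, {"card0": 50, "card1": 7}]
result, attempts = accumulate_with_retry(groups, spares=1)
print(result, attempts)
```

Starting with the total number of DRM devices in the system as `spares` corresponds to the pessimistic option, where the walk never has to restart.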



> The drm_memory_stats categories in the output don't seem like a big
> value-add for this use-case, but no real objection to them being

You mean the fact there are different categories is not a value add for
your use case because you would only use one?

The idea was to align 1:1 with DRM memory stats fdinfo and somewhat
emulate how memory.stat also offers a breakdown.

> there. I know it's called the DRM cgroup controller, but it'd be nice
> if there were a way to make the mem tracking part work for any driver
> that wishes to participate as many of our devices don't use a DRM
> driver. But making that work doesn't look like it would fit very

Ah that would be a challenge indeed to which I don't have any answers
right now.

Hm if you have a DRM device somewhere in the chain memory stats would
still show up. Like if you had a dma-buf producer which is not a DRM
driver, but then that buffer was imported by a DRM driver, it would show
up in a cgroup. Or vice-versa. But if there aren't any in the whole
chain then it would not.

> cleanly into this controller, so I'll just shut up now.


Not at all, good feedback!

Regards,

Tvrtko


Re: [Intel-gfx] [RFC v5 00/17] DRM cgroup controller with scheduling control and memory stats

2023-07-19 Thread T.J. Mercier
On Wed, Jul 12, 2023 at 4:47 AM Tvrtko Ursulin wrote:
>
>   drm.memory.stat
> A nested file containing cumulative memory statistics for the whole
> sub-hierarchy, broken down into separate GPUs and separate memory
> regions supported by the latter.
>
> For example::
>
>   $ cat drm.memory.stat
>   card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
>   card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
>
> Card designation corresponds to the DRM device names and multiple line
> entries can be present per card.
>
> Memory region names should be expected to be driver specific with the
> exception of 'system' which is standardised and applicable for GPUs
> which can operate on system memory buffers.
>
> Sub-keys 'resident' and 'purgeable' are optional.
>
> Per category region usage is reported in bytes.
>
>  * Feedback from people interested in drm.active_us and drm.memory.stat is
>    required to understand the use cases and their usefulness (of the fields).
>
>    Memory stats are something which was easy to add to my series, since I was
>    already working on the fdinfo memory stats patches, but the question is how
>    useful it is.
>
Hi Tvrtko,

I think this style of driver-defined categories for reporting of
memory could potentially allow us to eliminate the GPU memory tracking
tracepoint used on Android (gpu_mem_total). This would involve reading
drm.memory.stat at the root cgroup (I see it's currently disabled on
the root), which means traversing the whole cgroup tree under the
cgroup lock to generate the values on-demand. This would be done
rarely, but I still wonder what the cost of that would turn out to be.
The drm_memory_stats categories in the output don't seem like a big
value-add for this use-case, but no real objection to them being
there. I know it's called the DRM cgroup controller, but it'd be nice
if there were a way to make the mem tracking part work for any driver
that wishes to participate as many of our devices don't use a DRM
driver. But making that work doesn't look like it would fit very
cleanly into this controller, so I'll just shut up now.

Thanks!
-T.J.


[Intel-gfx] [RFC v5 00/17] DRM cgroup controller with scheduling control and memory stats

2023-07-12 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

This series contains a proposal for a DRM cgroup controller which implements a
weight based hierarchical GPU usage budget, similar in concept to some of the
existing controllers, and also exposes GPU memory usage as a read-only field.

Motivation mostly comes from my earlier proposal where I identified that GPU
scheduling lags significantly behind what is available for CPU and IO. Whereas
back then I was proposing to somehow tie this with process nice, feedback mostly
was that people wanted cgroups. So here it is - in the world of heterogeneous
computing pipelines I think it is time to do something about this gap.

Code is not finished but should survive some light experimenting with. I am
sharing it early since the topic has been controversial in the past. I hope to
demonstrate there are gains to be had in real world usage(*), today, and that
the concepts the proposal relies on are well enough established and stable.

*) Specifically under ChromeOS which uses cgroups to control CPU bandwidth for
   VMs based on the window focused status. It can be demonstrated how GPU
   scheduling control can easily be integrated into that setup.

*) Another real world example later in the cover letter.

There should be no conflict between this proposal and any efforts to implement
a memory usage based controller. The skeleton DRM cgroup controller is
deliberately purely a skeleton patch where any further functionality can be added with no
real conflicts. [In fact, perhaps scheduling is even easier to deal with than
memory accounting.]

Structure of the series is as follows:

  1-5) A separate/different series which adds fdinfo memory stats support to
       i915. This is only a pre-requisite for patches 16-17 so can be ignored in
       scope of this series.
    6) Improve client ownership tracking in DRM core. Also a pre-requisite which
       can be ignored.
    7) Adds a skeleton DRM cgroup controller with no functionality.
 8-11) Laying down some infrastructure to enable the controller.
   12) The scheduling controller itself.
13-14) i915 support for the scheduling controller.
   15) Expose GPU utilisation from the controller.
   16) Add memory stats plumbing and core logic to the controller.
   17) i915 support for controller memory stats.

The proposal defines a delegation of duties between the three parties: cgroup
controller, DRM core and individual drivers. Two way communication interfaces
are then defined to enable the delegation to work.

DRM scheduling soft limits
~~~~~~~~~~~~~~~~~~~~~~~~~~

Because of the heterogeneous hardware and driver DRM capabilities, soft limits
are implemented as a loose co-operative (bi-directional) interface between the
controller and DRM core.

The controller configures the GPU time allowed per group and periodically scans
the belonging tasks to detect the over budget condition, at which point it
invokes a callback notifying the DRM core of the condition.

DRM core provides an API to query per process GPU utilization and a second API
to receive notification from the cgroup controller when the group enters or
exits the over budget condition.

Individual DRM drivers which implement the interface are expected to act on this
in the best-effort manner only. There are no guarantees that the soft limits
will be respected.
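
As a rough illustration of the co-operative scheme above, the following toy model splits a scan period between sibling groups by weight and signals only when a group enters or exits the over budget condition. It is a flat, user-space sketch and not the controller code: `budgets_from_weights`, `SoftLimitScanner`, the `notify` callback and the integer-division budget formula are all assumptions made for the example, and the real computation is hierarchical.

```python
def budgets_from_weights(weights, period_us):
    """Split one scan period of GPU time between sibling groups by weight."""
    total = sum(weights.values())
    return {name: period_us * w // total for name, w in weights.items()}


class SoftLimitScanner:
    """Toy model of the controller's periodic over budget scan."""

    def __init__(self, weights, period_us, notify):
        self.budgets = budgets_from_weights(weights, period_us)
        self.notify = notify         # stand-in for the callback into DRM core
        self.over_budget = {name: False for name in weights}

    def scan(self, used_us):
        """used_us maps group name to GPU time used in the last period."""
        for name, budget in self.budgets.items():
            over = used_us.get(name, 0) > budget
            if over != self.over_budget[name]:
                # Signal only on entering or exiting the over budget state.
                self.over_budget[name] = over
                self.notify(name, over)


events = []
scanner = SoftLimitScanner({"fg": 300, "bg": 100}, period_us=2000,
                           notify=lambda group, over: events.append((group, over)))
scanner.scan({"fg": 1200, "bg": 900})   # bg exceeds its 500us share
scanner.scan({"fg": 1200, "bg": 900})   # no state change, no new event
scanner.scan({"fg": 1200, "bg": 400})   # bg drops back under budget
print(events)
```

What a driver does on receiving the notification is left out on purpose, since drivers are only expected to act on it in a best-effort manner.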

DRM controller interface files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  drm.active_us
GPU time used by the group recursively including all child groups.

  drm.weight
Standard cgroup weight based control [1, 10000] used to configure the
relative distribution of GPU time between the sibling groups.

  drm.memory.stat
A nested file containing cumulative memory statistics for the whole
sub-hierarchy, broken down into separate GPUs and separate memory
regions supported by the latter.

For example::

  $ cat drm.memory.stat
  card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
  card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0

Card designation corresponds to the DRM device names and multiple line
entries can be present per card.

Memory region names should be expected to be driver specific with the
exception of 'system' which is standardised and applicable for GPUs
which can operate on system memory buffers.

Sub-keys 'resident' and 'purgeable' are optional.

Per category region usage is reported in bytes.
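
A minimal sketch of how userspace might consume this file, assuming each entry sits on a single line in the format shown in the example. The function name `parse_drm_memory_stat` is hypothetical, and the sample text is embedded rather than read from a cgroup directory.

```python
def parse_drm_memory_stat(text):
    """Parse drm.memory.stat content into {card: {region: {key: bytes}}}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[1].startswith("region="):
            continue  # skip blank or unrecognised lines
        card = fields[0]
        region = fields[1].split("=", 1)[1]
        # Remaining fields are key=value pairs in bytes. 'resident' and
        # 'purgeable' are optional, so only keys that are present are stored.
        values = {}
        for field in fields[2:]:
            key, _, value = field.partition("=")
            values[key] = int(value)
        stats.setdefault(card, {})[region] = values
    return stats


SAMPLE = """\
card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
"""

parsed = parse_drm_memory_stat(SAMPLE)
print(parsed["card0"]["system"]["total"])
```

Since region names other than 'system' are driver specific, a consumer should look regions up by name and use `.get()` for the optional sub-keys rather than assume a fixed set.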

This builds upon the per client GPU utilisation work which landed recently for a
few drivers. My thinking is that, in principle, the intersection of drivers which
support both that and some sort of scheduling control, like priorities, could
also support this controller.

Another really interesting angle for this controller is that it mimics the same
control method used by the CPU scheduler. That is the proportional/weight based
GPU time budgeting. Which