Re: [PATCH RFC 2/5] cgroup: Add mechanism to register vendor specific DRM devices

Ho, Kenny Wed, 21 Nov 2018 14:07:42 -0800

Hi Tejun,

On Tue, Nov 20, 2018 at 5:30 PM Tejun Heo <t...@kernel.org> wrote:
> On Tue, Nov 20, 2018 at 10:21:14PM +0000, Ho, Kenny wrote:
> > By this reply, are you suggesting that vendor specific resources
> > will never be acceptable to be managed under cgroup?  Let say a user
>
> I wouldn't say never but whatever which gets included as a cgroup
> controller should have clearly defined resource abstractions and the
> control schemes around them including support for delegation.  AFAICS,
> gpu side still seems to have a long way to go (and it's not clear
> whether that's somewhere it will or needs to end up).
Right, I totally understand that it's not obvious from this RFC because the 
'resource' counting demonstrated in this RFC is trivial in nature, mostly to 
illustrate the 'vendor' concept.  The structure of this patch actually gives us
the ability to support both the abstracted resources you mentioned and vendor
specific resources.  But it is probably not very clear as the RFC only includes 
two resources and they are both vendor specific.  To be clear, I am not saying 
there aren't abstracted resources in drm, there are (we are still working on 
those).  What I am saying is that not all resources are abstracted and for the 
purpose of this RFC I was hoping to get some feedback on the vendor specific 
parts early just so that we don't go down the wrong path.


That said, I think I am getting a better sense of what you are saying.  Please 
correct me if I misinterpreted: your concern is that abstracting by vendor is 
too high level and it's too much of a free-for-all.  Instead, resources should 
be abstracted at the controller level even if it's only available to a specific 
vendor (or even a specific product from a specific vendor).  Is that a fair 
read?

A couple of additional side questions:
* Are statistics/accounting-only use cases, like those enabled by the cpuacct
controller, no longer sufficient?  If they are still sufficient, can you
elaborate on what you mean by having control schemes and supporting delegation?
* When you wrote delegation, do you mean delegation in the sense described in 
https://www.kernel.org/doc/Documentation/cgroup-v2.txt ?

> > To put the questions in more concrete terms, let say a user wants to
> > expose certain part of a gpu to a particular cgroup similar to the
> > way selective cpu cores are exposed to a cgroup via cpuset, how
> > should we go about enabling such functionality?
>
> Do what the intel driver or bpf is doing?  It's not difficult to hook
> into cgroup for identification purposes.
Does the Intel driver or bpf present an interface file in cgroupfs for users to
configure the core selection like cpuset?  I must admit I am not too familiar 
with the bpf case as I was referencing mostly the way rdma was implemented when 
putting this RFC together.


Perhaps I wasn't communicating clearly so let me see if I can illustrate this 
discussion with a hypothetical but concrete example using our competitor's 
product.  Nvidia has something called Tensor Cores in some of their GPUs and 
the purpose of those cores is to accelerate matrix operations for machine 
learning applications.  This is something unique to Nvidia and to my knowledge 
no one else has something like it.  These cores are different from regular 
shader processors and there are multiple of them in a GPU.

Under the structure of this RFC, if Nvidia wants to make Tensor Cores
manageable via cgroup (with, let's say, the "Allocation" distribution model),
they will probably have an interface file called "drm.nvidia.tensor_core", in
which only Nvidia's GPUs will be listed.  If a GPU has Tensor Cores, it will
have a positive count, otherwise 0.
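
As a rough illustration only, here is how the contents of such a vendor file
might be rendered.  The device names, the counts, the "<device> <count>" line
layout and the helper function are all made up for this example; none of them
come from the RFC patches.

```python
# Hypothetical sketch: render the contents of a vendor-specific
# "drm.nvidia.tensor_core" file.  Device names, vendors and counts are
# invented; the "<device> <count>" layout is an assumption, not the RFC's.

DEVICES = [
    # (device name, vendor, tensor core count)
    ("card0", "nvidia", 640),
    ("card1", "nvidia", 0),
    ("card2", "amd", 0),
]


def render_vendor_file(devices, vendor):
    """Return one "<device> <count>" line per device owned by `vendor`."""
    lines = [f"{name} {count}"
             for name, dev_vendor, count in devices
             if dev_vendor == vendor]
    return "\n".join(lines)


vendor_file = render_vendor_file(DEVICES, "nvidia")
print(vendor_file)
```

Note that the non-Nvidia device does not appear in the file at all, which is
the "vendor as the top-level abstraction" behaviour being questioned here.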

If I understand you correctly, Tejun, your position is that they should not do
that.  What they should do is have an abstracted resource, possibly named
"drm.matrix_accelerator", where all drm devices available on a system will be
listed.  All GPUs except some of Nvidia's will have a count of 0.  Or perhaps
that is not sufficiently abstracted, so there should instead be just
"drm.cores", and that file lists devices, core types and counts.  One vendor
may have shader processors, texture mapping units, tensor cores and ray tracing
cores as types.  Others may have ALUs, EUs and subslices.
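
To make the two abstracted alternatives easier to compare, here is a small
sketch of what each file might contain.  Everything in it is hypothetical: the
inventory, the core type names, the counts and both line layouts are invented
for this example and are not defined anywhere in the RFC.

```python
# Hypothetical sketch of the two abstracted layouts described above.
# All device names, core type names and counts are invented.

INVENTORY = {
    "card0": {"shader_proc": 5120, "tensor_core": 640},
    "card1": {"alu": 4096},
}


def render_single_resource(inventory, core_type):
    """Render a "drm.matrix_accelerator"-style file.

    Every device is listed; devices without the core type get a count of 0.
    """
    return "\n".join(f"{dev} {cores.get(core_type, 0)}"
                     for dev, cores in sorted(inventory.items()))


def render_all_cores(inventory):
    """Render a "drm.cores"-style file: one line per device, type=count pairs."""
    return "\n".join(
        dev + " " + " ".join(f"{t}={n}" for t, n in sorted(cores.items()))
        for dev, cores in sorted(inventory.items()))


matrix_accelerator_file = render_single_resource(INVENTORY, "tensor_core")
cores_file = render_all_cores(INVENTORY)
print(matrix_accelerator_file)
print(cores_file)
```

In the first layout every device shows up even when its count is 0, while in
the second the set of core types differs per device, which is where the
question of how much abstraction is enough comes in.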

Is that an accurate representation of what you are recommending?

Regards,
Kenny
