Re: [Intel-gfx] [RFC 6/8] drm: Document fdinfo format specification
[AMD Official Use Only]

I just want to make a comment that with this approach (the ns values) calculating the percentage will take at least two reads of the fdinfo per pid over some time. Some engines may be able to provide a single-shot percentage usage over an internal integration period. That is, for example, what we currently have implemented, for that exact reason. I'd like to propose that we add an optional set of fields for this.

Also, I may have missed a message, but why did we remove the timestamp? It is needed for accurate measurements of engine usage.

David

From: Daniel Vetter
Sent: Friday, July 23, 2021 9:47 AM
To: Daniel Stone
Cc: Tvrtko Ursulin ; intel-gfx ; Tvrtko Ursulin ; Koenig, Christian ; dri-devel ; Nieto, David M
Subject: Re: [RFC 6/8] drm: Document fdinfo format specification

On Fri, Jul 23, 2021 at 05:43:01PM +0100, Daniel Stone wrote:
> Hi Tvrtko,
> Thanks for typing this up!
>
> On Thu, 15 Jul 2021 at 10:18, Tvrtko Ursulin wrote:
> > +Mandatory fully standardised keys
> > +---------------------------------
> > +
> > +- drm-driver: <str>
> > +
> > +String shall contain a fixed string uniquely identifying the driver handling
> > +the device in question. For example the name of the respective kernel module.
>
> I think let's be more prescriptive and just say that it is the module name.

Just a quick comment on this one. drm_driver.name is already uapi, so let's please not invent a new one. The shared code should probably make sure drivers don't get this wrong. Maybe good if we document the getversion ioctl, which also exposes this, and then link between the two.
-Daniel

> > +Optional fully standardised keys
> > +--------------------------------
> > +
> > +- drm-pdev: <aaaa:bb.cc.d>
> > +
> > +For PCI devices this should contain the PCI slot address of the device in
> > +question.
>
> How about just major:minor of the DRM render node device it's attached to?
>
> > +- drm-client-id: <uint>
> > +
> > +Unique value relating to the open DRM file descriptor used to distinguish
> > +duplicated and shared file descriptors. Conceptually the value should map 1:1
> > +to the in kernel representation of `struct drm_file` instances.
> > +
> > +Uniqueness of the value shall be either globally unique, or unique within the
> > +scope of each device, in which case `drm-pdev` shall be present as well.
> > +
> > +Userspace should make sure to not double account any usage statistics by using
> > +the above described criteria in order to associate data to individual clients.
> > +
> > +- drm-engine-<str>: <uint> ns
> > +
> > +GPUs usually contain multiple execution engines. Each shall be given a stable
> > +and unique name (str), with possible values documented in the driver specific
> > +documentation.
> > +
> > +Value shall be in specified time units which the respective GPU engine spent
> > +busy executing workloads belonging to this client.
> > +
> > +Values are not required to be constantly monotonic if it makes the driver
> > +implementation easier, but are required to catch up with the previously reported
> > +larger value within a reasonable period. Upon observing a value lower than what
> > +was previously read, userspace is expected to stay with that larger previous
> > +value until a monotonic update is seen.
>
> Yeah, that would work well for Mali/Panfrost. We can queue multiple
> jobs in the hardware, which can either be striped across multiple
> cores with an affinity mask (e.g. 3 cores for your client and 1 for
> your compositor), or picked according to priority, or ...
>
> The fine-grained performance counters (e.g. time spent waiting for
> sampler) are only GPU-global. So if you have two jobs running
> simultaneously, you have no idea who's responsible for what.
>
> But it does give us coarse-grained counters which are accounted
> per-job-slot, including exactly this metric: amount of 'GPU time'
> (whatever that means) occupied by that job slot during the sampling
> period. So we could support that nicely if we fenced job-slot updates
> with register reads/writes.
>
> Something I'm missing though is how we enable this information. Seems
> like it would be best to either only do it whilst fdinfo is open (and
> re-read it whenever you need an update), or on a per-driver sysfs
> toggle, or ... ?
>
> > +- drm-memory-<str>: <uint> [KiB|MiB]
> > +
> > +Each possible memory type which can be used to store buffer objects by the
> > +GPU in question shall be given a stable and unique name t
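As an aside, the parsing and clamping rules the spec quotes above are straightforward to implement in userspace. The following is an illustrative sketch (not code from the patch series): it extracts `drm-engine-<str>: <uint> ns` lines out of an fdinfo blob and applies the rule that, upon observing a value lower than the previous reading, userspace keeps reporting the previous value until a monotonic update is seen.

```python
import re

# Matches lines such as "drm-engine-render:  9288864723 ns".
_ENGINE_RE = re.compile(r"^drm-engine-(?P<name>[\w-]+):\s*(?P<ns>\d+) ns$")

def parse_engines(fdinfo_text):
    """Return {engine_name: busy_ns} parsed from raw fdinfo text."""
    engines = {}
    for line in fdinfo_text.splitlines():
        m = _ENGINE_RE.match(line.strip())
        if m:
            engines[m.group("name")] = int(m.group("ns"))
    return engines

def monotonic_clamp(prev_ns, new_ns):
    """Per the spec: stay with the larger previously-read value until
    a monotonic update is seen."""
    return new_ns if new_ns >= prev_ns else prev_ns
```

A gpu-top style tool would call `parse_engines()` on two snapshots of the same fdinfo file and divide the clamped busy-ns delta by the wall-clock delta.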
Re: [Intel-gfx] [RFC 7/7] drm/i915: Expose client engine utilisation via fdinfo
[AMD Official Use Only]

I would like to add a unit marker for the stats that we monitor in the fd. As we discussed, currently we are displaying the usage percentage, because we wanted to provide single-query percentages, but this may evolve with time.

May I suggest to add two new fields:

drm-stat-interval: <64 bit> ns
drm-stat-timestamp: <64 bit> ns

If interval is set, engine utilization is calculated by doing:

<utilization %> = 100 * <drm-engine-XX> / <drm-stat-interval>

If interval is not set, two reads are needed:

<utilization %> = 100 * (<drm-engine-XX (t1)> - <drm-engine-XX (t0)>) / (<drm-stat-timestamp (t1)> - <drm-stat-timestamp (t0)>)

Regards,
David

From: Tvrtko Ursulin
Sent: Thursday, May 20, 2021 8:12 AM
To: Intel-gfx@lists.freedesktop.org
Cc: dri-de...@lists.freedesktop.org ; Tvrtko Ursulin ; Nieto, David M ; Koenig, Christian ; Daniel Vetter
Subject: [RFC 7/7] drm/i915: Expose client engine utilisation via fdinfo

From: Tvrtko Ursulin

Similar to AMD commit 874442541133 ("drm/amdgpu: Add show_fdinfo() interface"), using the infrastructure added in previous patches, we add basic client info and GPU engine utilisation for i915.

Example of the output:

  pos:     0
  flags:   012
  mnt_id:  21
  drm-driver: i915
  drm-pdev:   0000:00:02.0
  drm-client-id:  7
  drm-engine-render:  9288864723 ns
  drm-engine-copy:    2035071108 ns
  drm-engine-video:   0 ns
  drm-engine-video-enhance:   0 ns

DRM related fields are appropriately prefixed for easy parsing and separation from generic fdinfo fields.

Idea is for some fields to become either fully or partially standardised in order to enable writing of generic top-like tools.

Initial proposal for fully standardised common fields:

 drm-driver: <str>
 drm-pdev: <aaaa:bb:cc.d>

Optional fully standardised:

 drm-client-id: <uint>

Optional partially standardised:

 engine-<str>: <u64> ns
 memory-<str>: <u64> KiB

Once agreed the format would need to go to some README or kerneldoc in DRM core.

Signed-off-by: Tvrtko Ursulin
Cc: David M Nieto
Cc: Christian König
Cc: Daniel Vetter
---
 drivers/gpu/drm/i915/i915_drm_client.c | 68 ++
 drivers/gpu/drm/i915/i915_drm_client.h |  4 ++
 drivers/gpu/drm/i915/i915_drv.c        |  3 ++
 3 files changed, 75 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_drm_client.c b/drivers/gpu/drm/i915/i915_drm_client.c
index 1e5db7753276..5e9cfba1116b 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.c
+++ b/drivers/gpu/drm/i915/i915_drm_client.c
@@ -9,6 +9,11 @@
 #include
+#include
+
+#include "gem/i915_gem_context.h"
+#include "gt/intel_engine_user.h"
+
 #include "i915_drm_client.h"
 #include "i915_drv.h"
 #include "i915_gem.h"
@@ -168,3 +173,66 @@ void i915_drm_clients_fini(struct i915_drm_clients *clients)
 	xa_destroy(&clients->xarray);
 }
+
+#ifdef CONFIG_PROC_FS
+static const char * const uabi_class_names[] = {
+	[I915_ENGINE_CLASS_RENDER] = "render",
+	[I915_ENGINE_CLASS_COPY] = "copy",
+	[I915_ENGINE_CLASS_VIDEO] = "video",
+	[I915_ENGINE_CLASS_VIDEO_ENHANCE] = "video-enhance",
+};
+
+static u64 busy_add(struct i915_gem_context *ctx, unsigned int class)
+{
+	struct i915_gem_engines_iter it;
+	struct intel_context *ce;
+	u64 total = 0;
+
+	for_each_gem_engine(ce, rcu_dereference(ctx->engines), it) {
+		if (ce->engine->uabi_class != class)
+			continue;
+
+		total += intel_context_get_total_runtime_ns(ce);
+	}
+
+	return total;
+}
+
+static void
+show_client_class(struct seq_file *m,
+		  struct i915_drm_client *client,
+		  unsigned int class)
+{
+	const struct list_head *list = &client->ctx_list;
+	u64 total = atomic64_read(&client->past_runtime[class]);
+	struct i915_gem_context *ctx;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(ctx, list, client_link)
+		total += busy_add(ctx, class);
+	rcu_read_unlock();
+
+	seq_printf(m, "drm-engine-%s:\t%llu ns\n",
+		   uabi_class_names[class], total);
+}
+
+void i915_drm_client_fdinfo(struct seq_file *m, struct file *f)
+{
+	struct drm_file *file = f->private_data;
+	struct drm_i915_file_private *file_priv = file->driver_priv;
+	struct drm_i915_private *i915 = file_priv->dev_priv;
+	struct i915_drm_client *client = file_priv->client;
+	struct pci_dev *pdev = to_pci_dev(i915->drm.dev);
+	unsigned int i;
+
+	seq_printf(m, "drm-driver:\ti915\n");
+	seq_printf(m, "drm-pdev:\t%04x:%02x:%02x.%d\n",
+		   pci_domain_nr(pdev->bus), pdev->bus->number,
+		   PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
+
+	seq_printf(m, "drm-client-id:\t%u\n", client->id);
+
+	for (i = 0; i < ARRAY_SIZE(uabi_class_names); i++)
+		show_client_class(m, client, i);
+}
+#endif
diff
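The two utilization recipes David proposes at the top of this message can be sketched as small helpers. This is an illustration, not merged uapi: the `drm-stat-interval` / `drm-stat-timestamp` fields are only a proposal at this point.

```python
def util_single_shot(engine_busy_ns, stat_interval_ns):
    """Single fdinfo read: the driver integrated busy time over a known
    internal interval (the proposed drm-stat-interval field)."""
    return 100.0 * engine_busy_ns / stat_interval_ns

def util_two_reads(busy0_ns, ts0_ns, busy1_ns, ts1_ns):
    """Two fdinfo reads: delta of busy-ns over delta of timestamps
    (the proposed drm-stat-timestamp field, or a local clock)."""
    return 100.0 * (busy1_ns - busy0_ns) / (ts1_ns - ts0_ns)
```

The single-shot form is what AMD's implementation already provides; the two-read form is what the plain `drm-engine-<str>: <uint> ns` counters require.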
Re: [Intel-gfx] [PATCH 0/7] Per client engine busyness
[AMD Official Use Only]

Parsing over 550 processes for fdinfo is taking between 40-100 ms single threaded in a 2 GHz Skylake (IBRS) within a VM, using simple string comparisons and dirent parsing. And that is pretty much the worst case scenario; more optimized implementations would do better.

David

From: Daniel Vetter
Sent: Wednesday, May 19, 2021 11:23 AM
To: Tvrtko Ursulin
Cc: Daniel Stone ; jhubb...@nvidia.com ; nouv...@lists.freedesktop.org ; Intel Graphics Development ; Maling list - DRI developers ; Simon Ser ; Koenig, Christian ; arit...@nvidia.com ; Nieto, David M
Subject: Re: [Intel-gfx] [PATCH 0/7] Per client engine busyness

On Wed, May 19, 2021 at 6:16 PM Tvrtko Ursulin wrote:
>
> On 18/05/2021 10:40, Tvrtko Ursulin wrote:
> >
> > On 18/05/2021 10:16, Daniel Stone wrote:
> >> Hi,
> >>
> >> On Tue, 18 May 2021 at 10:09, Tvrtko Ursulin wrote:
> >>> I was just wondering if stat(2) and a chrdev major check would be a
> >>> solid criteria to more efficiently (compared to parsing the text
> >>> content) detect drm files while walking procfs.
> >>
> >> Maybe I'm missing something, but is the per-PID walk actually a
> >> measurable performance issue rather than just a bit unpleasant?
> >
> > Per pid and per each open fd.
> >
> > As said in the other thread what bothers me a bit in this scheme is that
> > the cost of obtaining GPU usage scales based on non-GPU criteria.
> >
> > For use case of a top-like tool which shows all processes this is a
> > smaller additional cost, but then for a gpu-top like tool it is somewhat
> > higher.
>
> To further expand, not only would the cost scale per pid multiplied by per
> open fd, but to detect which of the fds are DRM I see these three options:
>
> 1) Open and parse fdinfo.
> 2) Name based matching ie /dev/dri/.. something.
> 3) Stat the symlink target and check for DRM major.

stat with symlink following should be plenty fast.

> All sound quite sub-optimal to me.
>
> Name based matching is probably the least evil on system resource usage
> (keeping the dentry cache too hot? too many syscalls?), even though
> fundamentally I don't think it is the right approach.
>
> What happens with dup(2) is another question.

We need benchmark numbers showing that on anything remotely realistic it's an actual problem. Until we've demonstrated it's a real problem we don't need to solve it.

E.g. top with any sorting enabled also parses way more than it displays on every update. It seems to be doing Just Fine (tm).

> Does anyone have any feedback on the /proc/<pid>/gpu idea at all?

When we know we have a problem to solve we can take a look at solutions.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch/

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
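For reference, option (3) from Tvrtko's list above, the one Daniel suggests should be plenty fast, can be sketched like this. This is an illustrative userspace snippet, not code from the thread; it assumes the conventional DRM character-device major of 226 discussed later in the thread.

```python
import os
import stat

DRM_MAJOR = 226  # chrdev major used by /dev/dri/* nodes

def is_drm_rdev(rdev):
    """True if a raw st_rdev value belongs to a DRM character device."""
    return os.major(rdev) == DRM_MAJOR

def drm_fds_of_pid(pid, proc_root="/proc"):
    """Return the fd numbers in `pid` whose target is a DRM char device,
    by stat()ing each /proc/<pid>/fd symlink (which follows the link)."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    found = []
    try:
        fds = os.listdir(fd_dir)
    except OSError:  # process exited, or insufficient permissions
        return found
    for fd in fds:
        try:
            st = os.stat(os.path.join(fd_dir, fd))
        except OSError:  # fd closed between listdir() and stat()
            continue
        if stat.S_ISCHR(st.st_mode) and is_drm_rdev(st.st_rdev):
            found.append(int(fd))
    return found
```

One stat(2) per fd, no open/read of fdinfo, so the per-fd cost of client discovery stays small; fdinfo is then only parsed for the fds that survive the filter.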
Re: [Intel-gfx] [PATCH 0/7] Per client engine busyness
[AMD Official Use Only]

The format is simple:

<ring name>: <utilization %>

we also have entries for the mapped memory:

mem <type>: <size> KiB

On my submission https://lists.freedesktop.org/archives/amd-gfx/2021-May/063149.html I added a python script to print out the info. It has a CPU usage lower than top, for example. To be absolutely honest, I agree that there is an overhead, but it might not be as much as you fear.

From: Tvrtko Ursulin
Sent: Monday, May 17, 2021 9:00 AM
To: Nieto, David M ; Daniel Vetter ; Koenig, Christian
Cc: Alex Deucher ; Intel Graphics Development ; Maling list - DRI developers
Subject: Re: [PATCH 0/7] Per client engine busyness

On 17/05/2021 15:39, Nieto, David M wrote:
> [AMD Official Use Only]
>
> Maybe we could try to standardize how the different submission ring
> usage gets exposed in the fdinfo? We went the simple way of just
> adding name and index, but if someone has a suggestion on how else we
> could format them so there is commonality across vendors we could just
> amend those.

Could you paste an example of your format?

Standardized fdinfo sounds good to me in principle. But I would also like people to look at the procfs proposal from Chris - link to which I have pasted elsewhere in the thread.

Only potential issue with fdinfo I see at the moment is a bit of an extra cost in DRM client discovery (compared to my sysfs series and also the procfs RFC from Chris). It would require reading all processes (well threads, then maybe aggregating threads into parent processes), all fd symlinks, and doing a stat on them to figure out which ones are DRM devices.

Btw is DRM_MAJOR 226 considered uapi? I don't see it in uapi headers.

> I'd really like to have the process manager tools display GPU usage
> regardless of what vendor is installed.

Definitely.

Regards,
Tvrtko
Re: [Intel-gfx] [PATCH 0/7] Per client engine busyness
[Public]

Cycling some of the Nvidia/nouveau guys here too. I think there is a benefit in trying to standardize how fdinfo can be used to expose per engine and device memory utilization.

Another of the advantages of going the /proc way instead of the sysfs/debugfs approach is that you inherit the access lists directly from the distribution and you don't need to start messing with ownership and group access. By default a user can monitor their own processes as long as /proc is mounted.

I am not saying that fdinfo or the way we implemented it is 100% the way to go, but I'd rather have a solution within the confines of proc first.

David

________________________________
From: Nieto, David M
Sent: Monday, May 17, 2021 11:02 AM
To: Tvrtko Ursulin ; Daniel Vetter ; Koenig, Christian
Cc: Alex Deucher ; Intel Graphics Development ; Maling list - DRI developers
Subject: Re: [PATCH 0/7] Per client engine busyness

The format is simple:

<ring name>: <utilization %>

we also have entries for the mapped memory:

mem <type>: <size> KiB

On my submission https://lists.freedesktop.org/archives/amd-gfx/2021-May/063149.html I added a python script to print out the info. It has a CPU usage lower than top, for example. To be absolutely honest, I agree that there is an overhead, but it might not be as much as you fear.

From: Tvrtko Ursulin
Sent: Monday, May 17, 2021 9:00 AM
To: Nieto, David M ; Daniel Vetter ; Koenig, Christian
Cc: Alex Deucher ; Intel Graphics Development ; Maling list - DRI developers
Subject: Re: [PATCH 0/7] Per client engine busyness

On 17/05/2021 15:39, Nieto, David M wrote:
> [AMD Official Use Only]
>
> Maybe we could try to standardize how the different submission ring
> usage gets exposed in the fdinfo? We went the simple way of just
> adding name and index, but if someone has a suggestion on how else we
> could format them so there is commonality across vendors we could just
> amend those.

Could you paste an example of your format?

Standardized fdinfo sounds good to me in principle. But I would also like people to look at the procfs proposal from Chris - link to which I have pasted elsewhere in the thread.

Only potential issue with fdinfo I see at the moment is a bit of an extra cost in DRM client discovery (compared to my sysfs series and also the procfs RFC from Chris). It would require reading all processes (well threads, then maybe aggregating threads into parent processes), all fd symlinks, and doing a stat on them to figure out which ones are DRM devices.

Btw is DRM_MAJOR 226 considered uapi? I don't see it in uapi headers.

> I'd really like to have the process manager tools display GPU usage
> regardless of what vendor is installed.

Definitely.

Regards,
Tvrtko
Re: [Intel-gfx] [PATCH 0/7] Per client engine busyness
[AMD Official Use Only]

Maybe we could try to standardize how the different submission ring usage gets exposed in the fdinfo? We went the simple way of just adding name and index, but if someone has a suggestion on how else we could format them so there is commonality across vendors we could just amend those.

I'd really like to have the process manager tools display GPU usage regardless of what vendor is installed.

From: Daniel Vetter
Sent: Monday, May 17, 2021 7:30:47 AM
To: Koenig, Christian
Cc: Tvrtko Ursulin ; Nieto, David M ; Alex Deucher ; Intel Graphics Development ; Maling list - DRI developers ; Daniel Vetter
Subject: Re: [PATCH 0/7] Per client engine busyness

On Fri, May 14, 2021 at 05:10:29PM +0200, Christian König wrote:
> Am 14.05.21 um 17:03 schrieb Tvrtko Ursulin:
> >
> > On 14/05/2021 15:56, Christian König wrote:
> > > Am 14.05.21 um 16:47 schrieb Tvrtko Ursulin:
> > > >
> > > > On 14/05/2021 14:53, Christian König wrote:
> > > > > >
> > > > > > David also said that you considered sysfs but were wary
> > > > > > of exposing process info in there. To clarify, my patch
> > > > > > is not exposing sysfs entry per process, but one per
> > > > > > open drm fd.
> > > > > >
> > > > >
> > > > > Yes, we discussed this as well, but then rejected the approach.
> > > > >
> > > > > To have useful information related to the open drm fd you
> > > > > need to relate that to the process(es) which have that file
> > > > > descriptor open. Just tracking who opened it first like DRM
> > > > > does is pretty useless on modern systems.
> > > >
> > > > We do update the pid/name for fds passed over unix sockets.
> > >
> > > Well I just double checked and that is not correct.
> > >
> > > Could be that i915 has some special code for that, but on my laptop
> > > I only see the X server under the "clients" debugfs file.
> >
> > Yes we have special code in i915 for this. Part of this series we are
> > discussing here.
>
> Ah, yeah you should mention that. Could we please separate that into common
> code instead? Cause I really see that as a bug in the current handling
> independent of the discussion here.
>
> As far as I know all IOCTLs go through some common place in DRM anyway.

Yeah, might be good to fix that confusion in debugfs. But since that's non-uapi, I guess no one ever cared (enough).

> > > > > But an "lsof /dev/dri/renderD128" for example does exactly
> > > > > what top does as well, it iterates over /proc and sees which
> > > > > process has that file open.
> > > >
> > > > Lsof is quite inefficient for this use case. It has to open
> > > > _all_ open files for _all_ processes on the system to find a
> > > > handful of ones which may have the DRM device open.
> > >
> > > Completely agree.
> > >
> > > The key point is you either need to have all references to an open
> > > fd, or at least track whoever last used that fd.
> > >
> > > At least the last time I looked even the fs layer didn't know which
> > > fd is open by which process. So there wasn't really any alternative
> > > to the lsof approach.
> >
> > I asked you about the use case you have in mind which you did not
> > answer. Otherwise I don't understand when you need to walk all files.
> > What information do you want to get?
>
> Per fd debugging information, e.g. instead of the top use case you know
> which process you want to look at.
>
> > For the use case of knowing which DRM file is using how much GPU time on
> > engine X we do not need to walk all open files either with my sysfs
> > approach or the proc approach from Chris. (In the former case we
> > optionally aggregate by PID at presentation time, and in the latter case
> > aggregation is implicit.)
>
> I'm unsure if we should go with the sysfs, proc or some completely different
> approach.
>
> In general it would be nice to have a way to find all the fd references for
> an open inode.

Yeah, but that maybe needs to be an ioctl or syscall or something on the inode, that gives you a list of (procfd, fd_nr) pairs pointing back at all open files? If this really is a real world problem, but given that top/lsof and everyone else hasn't asked for it yet maybe it's not.

Also I replied in some other thread, I really like the fdin
Re: [Intel-gfx] [PATCH 0/7] Per client engine busyness
[AMD Official Use Only - Internal Distribution Only]

We had entertained the idea of exposing the processes as sysfs nodes as you proposed, but we had concerns about exposing process info in there, especially since /proc already exists for that purpose.

I think if you were to follow that approach, we could have tools like top that support exposing GPU engine usage.

From: Alex Deucher
Sent: Thursday, May 13, 2021 10:58 PM
To: Tvrtko Ursulin ; Nieto, David M ; Koenig, Christian
Cc: Intel Graphics Development ; Maling list - DRI developers ; Daniel Vetter
Subject: Re: [PATCH 0/7] Per client engine busyness

+ David, Christian

On Thu, May 13, 2021 at 12:41 PM Tvrtko Ursulin wrote:
>
> Hi,
>
> On 13/05/2021 16:48, Alex Deucher wrote:
> > On Thu, May 13, 2021 at 7:00 AM Tvrtko Ursulin wrote:
> >>
> >> From: Tvrtko Ursulin
> >>
> >> Resurrect of the previously merged per client engine busyness patches. In a
> >> nutshell it enables intel_gpu_top to be more top(1) like useful and show not
> >> only physical GPU engine usage but per process view as well.
> >>
> >> Example screen capture:
> >>
> >> intel-gpu-top -  906/ 955 MHz;   0% RC6;  5.30 Watts;      933 irqs/s
> >>
> >>       IMC reads:     4414 MiB/s
> >>      IMC writes:     3805 MiB/s
> >>
> >>           ENGINE      BUSY                                 MI_SEMA MI_WAIT
> >>      Render/3D/0   93.46% |▋                             |      0%      0%
> >>        Blitter/0    0.00% |                              |      0%      0%
> >>          Video/0    0.00% |                              |      0%      0%
> >>   VideoEnhance/0    0.00% |                              |      0%      0%
> >>
> >>   PID            NAME   Render/3D   Blitter     Video      VideoEnhance
> >>  2733       neverball  |██▌       ||          ||          ||           |
> >>  2047            Xorg  |███▊      ||          ||          ||           |
> >>  2737        glxgears  |█▍        ||          ||          ||           |
> >>  2128           xfwm4  |          ||          ||          ||           |
> >>  2047            Xorg  |          ||          ||          ||           |
> >>
> >> Internally we track time spent on engines for each struct intel_context, both
> >> for current and past contexts belonging to each open DRM file.
> >>
> >> This can serve as a building block for several features from the wanted list:
> >> smarter scheduler decisions, getrusage(2)-like per-GEM-context functionality
> >> wanted by some customers, setrlimit(2) like controls, cgroups controller,
> >> dynamic SSEU tuning, ...
> >>
> >> To enable userspace access to the tracked data, we expose time spent on GPU per
> >> client and per engine class in sysfs with a hierarchy like the below:
> >>
> >> # cd /sys/class/drm/card0/clients/
> >> # tree
> >> .
> >> ├── 7
> >> │   ├── busy
> >> │   │   ├── 0
> >> │   │   ├── 1
> >> │   │   ├── 2
> >> │   │   └── 3
> >> │   ├── name
> >> │   └── pid
> >> ├── 8
> >> │   ├── busy
> >> │   │   ├── 0
> >> │   │   ├── 1
> >> │   │   ├── 2
> >> │   │   └── 3
> >> │   ├── name
> >> │   └── pid
> >> └── 9
> >>     ├── busy
> >>     │   ├── 0
> >>     │   ├── 1
> >>     │   ├── 2
> >>     │   └── 3
> >>     ├── name
> >>     └── pid
> >>
> >> Files in 'busy' directories are numbered using the engine class ABI values and
> >> they contain accumulated nanoseconds each client spent on engines of a
> >> respective class.
> >
> > We did something similar in amdgpu using the gpu scheduler.
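A consumer of the sysfs hierarchy quoted in the cover letter (`/sys/class/drm/card0/clients/<id>/{name,pid,busy/<class>}`) would look roughly like the sketch below. This is a hypothetical illustration, not code from the series; since the sysfs tree only exists with these patches applied, the sketch fakes the layout in a temporary directory so it is runnable anywhere.

```python
import os
import tempfile

def read_clients(root):
    """Map client id -> (pid, name, {engine_class: busy_ns}) from a
    clients/ directory laid out as in the cover letter."""
    clients = {}
    for cid in sorted(os.listdir(root)):
        cdir = os.path.join(root, cid)
        with open(os.path.join(cdir, "pid")) as f:
            pid = int(f.read())
        with open(os.path.join(cdir, "name")) as f:
            name = f.read().strip()
        busy = {}
        bdir = os.path.join(cdir, "busy")
        for klass in os.listdir(bdir):  # files named by engine class ABI value
            with open(os.path.join(bdir, klass)) as f:
                busy[int(klass)] = int(f.read())
        clients[cid] = (pid, name, busy)
    return clients

# Fake a clients/ tree mirroring the quoted example (values are invented).
root = tempfile.mkdtemp()
for cid, pid, name, render_ns in (("7", 2733, "neverball", 912345678),
                                  ("8", 2047, "Xorg", 512345678)):
    bdir = os.path.join(root, cid, "busy")
    os.makedirs(bdir)
    with open(os.path.join(root, cid, "pid"), "w") as f:
        f.write(str(pid))
    with open(os.path.join(root, cid, "name"), "w") as f:
        f.write(name)
    for klass in range(4):
        with open(os.path.join(bdir, str(klass)), "w") as f:
            f.write(str(render_ns if klass == 0 else 0))

clients = read_clients(root)
```

A top-like tool would sample `read_clients()` twice and aggregate the busy-ns deltas by pid; unlike the fdinfo scheme, discovery cost here scales with the number of DRM clients rather than the number of processes.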