A new Gallium HUD "value producer" could be added that reads fdinfo without calling the driver. I still think there is merit in having this in amdgpu_drm.h too.
Marek On Tue, Jan 24, 2023 at 3:13 AM Marek Olšák <mar...@gmail.com> wrote: > The table of exposed driver-specific counters: > > https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/gallium/drivers/radeonsi/si_query.c#L1751 > > Counter enums. They use the same interface as e.g. occlusion queries, > except that begin_query and end_query save the results in the driver/CPU. > > https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/gallium/drivers/radeonsi/si_query.h#L45 > > Counters exposed by the winsys: > > https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/gallium/include/winsys/radeon_winsys.h#L126 > > I just need to query the counters in the winsys and return them. > > Marek > > On Tue, Jan 24, 2023 at 2:58 AM Christian König < > ckoenig.leichtzumer...@gmail.com> wrote: > >> How are the counters which the HUD consumes declared? >> >> See what I want to avoid is a) to nail down the interface with the kernel >> on specific values and b) make it possible to easily expose new values. >> >> In other words what we could do with fdinfo is to have something like >> this: >> >> GALLIUM_FDINFO_HUD=drm-memory-vram,amd-evicted-vram,amd-mclk glxgears >> >> And the HUD just displays the values the kernel provides without the need >> to re-compile mesa when we want to add some more values nor have the values >> as part of the UAPI. >> >> Christian. >> >> Am 24.01.23 um 08:37 schrieb Marek Olšák: >> >> The Gallium HUD doesn't consume strings. It only consumes values that are >> exposed as counters from the driver. In this case, we need the driver to >> expose evicted stats as counters. Each counter can set whether the value is >> absolute (e.g. memory usage) or monotonic (e.g. perf counter). Parsing >> fdinfo to get the values is undesirable. >> >> Marek >> >> On Mon, Jan 23, 2023 at 4:31 AM Christian König < >> ckoenig.leichtzumer...@gmail.com> wrote: >> >>> Let's do this as valid in fdinfo. >>> >>> This way we can easily extend whatever the kernel wants to display as >>> statistics in the userspace HUD. >>> >>> Regards, >>> Christian. >>> >>> Am 21.01.23 um 01:45 schrieb Marek Olšák: >>> >>> We badly need a way to query evicted memory usage. It's essential for >>> investigating performance problems and it uncovered the buddy allocator >>> disaster. Please either suggest an alternative, suggest changes, or review. >>> We need it ASAP. >>> >>> Thanks, >>> Marek >>> >>> On Tue, Jan 10, 2023 at 11:55 AM Marek Olšák <mar...@gmail.com> wrote: >>> >>>> On Tue, Jan 10, 2023 at 11:23 AM Christian König < >>>> ckoenig.leichtzumer...@gmail.com> wrote: >>>> >>>>> Am 10.01.23 um 16:28 schrieb Marek Olšák: >>>>> >>>>> On Wed, Jan 4, 2023 at 9:51 AM Christian König < >>>>> ckoenig.leichtzumer...@gmail.com> wrote: >>>>> >>>>>> Am 04.01.23 um 00:08 schrieb Marek Olšák: >>>>>> >>>>>> I see about the access now, but did you even look at the patch? >>>>>> >>>>>> >>>>>> I did look at the patch, but I haven't fully understood yet what you >>>>>> are trying to do here. >>>>>> >>>>> >>>>> First and foremost, it returns the evicted size of VRAM and visible >>>>> VRAM, and returns visible VRAM usage. It should be obvious which stat >>>>> includes the size of another. >>>>> >>>>> >>>>>> Because what the patch does isn't even exposed to common drm code, >>>>>> such as the preferred domain and visible VRAM placement, so it can't be >>>>>> in >>>>>> fdinfo right now. >>>>>> >>>>>> Or do you even know what fdinfo contains? Because it contains nothing >>>>>> useful. It only has VRAM and GTT usage, which we already have in the INFO >>>>>> ioctl, so it has nothing that we need. We mainly need the eviction >>>>>> information and visible VRAM information now. Everything else is a bonus. >>>>>> >>>>>> >>>>>> Well the main question is what are you trying to get from that >>>>>> information? The eviction list for example is completely meaningless to >>>>>> userspace, that stuff is only temporary and will be cleared on the next >>>>>> CS >>>>>> again. >>>>>> >>>>> >>>>> I don't know what you mean. The returned eviction stats look correct >>>>> and are stable (they don't change much). You can suggest changes if you >>>>> think some numbers are not reported correctly. >>>>> >>>>> >>>>>> >>>>>> What we could expose is the VRAM over-commit value, e.g. how much BOs >>>>>> which where supposed to be in VRAM are in GTT now. I think that's what >>>>>> you >>>>>> are looking for here, right? >>>>>> >>>>> >>>>> The VRAM overcommit value is "evicted_vram". >>>>> >>>>> >>>>>> >>>>>> Also, it's undesirable to open and parse a text file if we can just >>>>>> call an ioctl. >>>>>> >>>>>> >>>>>> Well I see the reasoning for that, but I also see why other drivers >>>>>> do a lot of the stuff we have as IOCTL as separate files in sysfs, fdinfo >>>>>> or debugfs. >>>>>> >>>>>> Especially repeating all the static information which were already >>>>>> available under sysfs in the INFO IOCTL was a design mistake as far as I >>>>>> can see. Just compare what AMDGPU and the KFD code is doing to what for >>>>>> example i915 is doing. >>>>>> >>>>>> Same for things like debug information about a process. The fdinfo >>>>>> stuff can be queried from external tools (gdb, gputop, umr etc...) as >>>>>> well >>>>>> which makes that interface more preferred. >>>>>> >>>>> >>>>> Nothing uses fdinfo in Mesa. No driver uses sysfs in Mesa except drm >>>>> shims, noop drivers, and Intel for perf metrics. sysfs itself is an >>>>> unusable mess for the PCIe query and is missing information. >>>>> >>>>> I'm not against exposing more stuff through sysfs and fdinfo for >>>>> tools, but I don't see any reason why drivers should use it (other than >>>>> for >>>>> slowing down queries and initialization). >>>>> >>>>> >>>>> That's what I'm asking: Is this for some tool or to make some driver >>>>> decision based on it? >>>>> >>>>> If you just want the numbers for over displaying then I think it would >>>>> be better to put this into fdinfo together with the other existing stuff >>>>> there. >>>>> >>>> >>>>> If you want to make allocation decisions based on this then we should >>>>> have that as IOCTL or even better as mmap() page between kernel and >>>>> userspace. But in this case I would also calculation the numbers >>>>> completely >>>>> different as well. >>>>> >>>>> See we have at least the following things in the kernel: >>>>> 1. The eviction list in the VM. >>>>> Those are the BOs which are currently evicted and tried to moved >>>>> back in on the next CS. >>>>> >>>>> 2. The VRAM over commit value. >>>>> In other words how much more VRAM than available has the >>>>> application tried to allocate? >>>>> >>>>> 3. The visible VRAM usage by this application. >>>>> >>>>> The end goal is that the eviction list will go away, e.g. we will >>>>> always have stable allocations based on allocations of other applications >>>>> and not constantly swap things in and out. >>>>> >>>>> When you now expose the eviction list to userspace we will be stuck >>>>> with this interface forever. >>>>> >>>> >>>> It's for the GALLIUM HUD. >>>> >>>> The only missing thing is the size of all evicted VRAM allocations, and >>>> the size of all evicted visible VRAM allocations. >>>> >>>> 1. No list is exposed. Only sums of buffer sizes are exposed. Also, the >>>> eviction list has no meaning here. All lists are treated equally, and >>>> mem_type is compared with preferred_domains to determine where buffers are >>>> and where they should be. >>>> >>>> 2. I'm not interested in the overcommit value. I'm only interested in >>>> knowing the number of bytes of evicted VRAM right now. It can be as >>>> variable as the CPU load, but in practice it shouldn't be because PCIe >>>> doesn't have the bandwidth to move things quickly. >>>> >>>> 3. Yes, that's true. >>>> >>>> Marek >>>> >>>> >>> >>