Re: [Mesa-dev] GBM and the Device Memory Allocator Proposals

James Jones Wed, 06 Dec 2017 16:58:26 -0800

On 12/06/2017 03:25 AM, Nicolai Hähnle wrote:

On 06.12.2017 08:07, James Jones wrote:
[snip]
So let's say you have a setup where both display and GPU supported
FOO/tiled, but only GPU supported compressed (FOO/CC) and cached
(FOO/cached).  But the GPU supported the following transitions:
    trans_a: FOO/CC -> null
    trans_b: FOO/cached -> null

Then the sets for each device (in order of preference):

GPU:
    1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
    2: caps(FOO/tiled, FOO/CC); constraints(alignment=32k)
    3: caps(FOO/tiled); constraints(alignment=32k)

Display:
    1: caps(FOO/tiled); constraints(alignment=64k)

Merged Result:
    1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
       transition(GPU->display: trans_a, trans_b; display->GPU: none)
    2: caps(FOO/tiled, FOO/CC); constraints(alignment=64k);
       transition(GPU->display: trans_a; display->GPU: none)
    3: caps(FOO/tiled); constraints(alignment=64k);
       transition(GPU->display: none; display->GPU: none)
We definitely don't want to expose a way of getting uncached rendering
surfaces for radeonsi. I mean, I think we are supposed to be able to program
our hardware so that the backend bypasses all caches, but (a) nobody
validates that and (b) it's basically suicide in terms of performance. Let's
build fewer footguns :)
sure, this was just a hypothetical example.  But to take this case as
another example, if you didn't want to expose uncached rendering (or
cached w/ cache flushes after each draw), you would exclude the entry
from the GPU set which didn't have FOO/cached (I'm adding back a
cached but not CC config just to make it interesting), and end up
with:

    trans_a: FOO/CC -> null
    trans_b: FOO/cached -> null

GPU:
   1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
   2: caps(FOO/tiled, FOO/cached); constraints(alignment=32k)

Display:
   1: caps(FOO/tiled); constraints(alignment=64k)

Merged Result:
   1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
      transition(GPU->display: trans_a, trans_b; display->GPU: none)
   2: caps(FOO/tiled, FOO/cached); constraints(alignment=64k);
      transition(GPU->display: trans_b; display->GPU: none)

So there isn't anything in the result set that doesn't have GPU cache,
and the cache-flush transition is always in the set of required
transitions going from GPU -> display
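
The merged results above can be reproduced with a small sketch. It is only an illustration under assumptions not stated in the thread: capability sets are plain sets of names, the display offers no transitions of its own, a GPU-only capability survives the merge only if the GPU lists a transition that resolves it, and alignments are powers of two so the merged alignment is simply the larger one. The function name merge_sets and the data layout are hypothetical.

```python
def merge_sets(gpu_sets, display_sets, gpu_transitions):
    """Merge (caps, alignment) pairs from two devices.

    gpu_transitions maps a GPU-only capability to the name of the
    transition that resolves it before the display touches the buffer.
    """
    merged = []
    for gpu_caps, gpu_align in gpu_sets:
        for disp_caps, disp_align in display_sets:
            gpu_only = gpu_caps - disp_caps
            disp_only = disp_caps - gpu_caps
            # Assumption: the display has no transitions, so anything only
            # the display understands rules the pair out, as does having
            # no capability in common at all.
            if disp_only or not (gpu_caps & disp_caps):
                continue
            # Every GPU-only capability needs a transition to survive.
            if not all(cap in gpu_transitions for cap in gpu_only):
                continue
            needed = sorted(gpu_transitions[cap] for cap in gpu_only)
            merged.append((gpu_caps, max(gpu_align, disp_align), needed))
    return merged


TRANSITIONS = {"FOO/CC": "trans_a", "FOO/cached": "trans_b"}
GPU = [
    (frozenset({"FOO/tiled", "FOO/CC", "FOO/cached"}), 32),
    (frozenset({"FOO/tiled", "FOO/cached"}), 32),
]
DISPLAY = [(frozenset({"FOO/tiled"}), 64)]

for caps, align, trans in merge_sets(GPU, DISPLAY, TRANSITIONS):
    print(sorted(caps), f"alignment={align}k", "GPU->display:", trans or "none")
```

Both merged entries keep FOO/cached, and trans_b appears in the GPU->display transition list of each, matching the result listed above.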

Hmm, I guess this does require the concept of a required cap..
Which we already introduced to the allocator API when we realized we
would need them as we were prototyping.
Note I also posed the question of whether things like cached (and similarly compression, since I view compression as roughly an equivalent mechanism to a cache) belong in capability sets at all in one of the open issues on my XDC 2017 slides because of this very problem of over-pruning it causes. It's on slide 15, as "No device-local capabilities". You'll have to listen to my coverage of it in the recorded presentation for that slide to make any sense, but it's the same thing Nicolai has laid out here.
As I continued working through our prototype driver support, I found I didn't actually need to include cached or compressed as capabilities: The GPU just applies them as needed and the usage transitions make it transparent to the non-GPU engines. That does mean the GPU driver currently needs to be the one to realize the allocation from the capability set to get optimal behavior. We could fix that by reworking our driver though. At this point, not including device-local properties like on-device caching in capabilities seems like the right solution to me. I'm curious whether this applies universally though, or if other hardware doesn't fit the "compression and stuff all behaves like a cache" idiom.
Compression is a part of the memory layout for us: framebuffer compression uses an additional "meta surface". At the most basic level, an allocation with loss-less compression support is by necessity bigger than an allocation without.
We can allocate this meta surface separately, but then we're forced to decompress when passing the surface around (e.g. to a compositor.)
Consider also the example I gave elsewhere, where a cross-vendor tiling layout is combined with vendor-specific compression:
Device 1, rendering: caps(BASE/foo-tiling, VND1/compression)
Device 2, sampling/scanout: caps(BASE/foo-tiling, VND2/compression)

Some more thoughts on caching or "device-local" properties below.

Compression requires extra resources for us as well. That's probably universal. I think the distinction between the two approaches is whether the allocating driver deduces that compression can be used with a given capability set and hence adds the resources implicitly, or whether the capability set indicates it explicitly. My theory is that the implicit path is possible, but it has downsides. The explicit path is attractive due to its exact nature, as I alluded to in my talk: You can tell the exact properties of an allocation given the capability set used to allocate it. If that can be made to work, I prefer that path as well. Agreed that your path also works better for the multi-vendor+device example.

[snip]
I think I like the idea of having transitions being part of the
per-device/engine cap sets, so that such information can be used upon
merging to know which capabilities may remain or have to be dropped.

I think James's proposal for usage transitions was intended to work
with flows like:

   1. App gets GPU caps for RENDER usage
   2. App allocates GPU memory using a layout from (1)
   3. App now decides it wants to use the buffer for SCANOUT
   4. App queries usage transition metadata from RENDER to SCANOUT,
      given the current memory layout.
   5. Do the transition and hand the buffer off to display
No, all usages the app intends to transition to must be specified up front when initially querying caps in the model I assumed. The app then specifies some subset (up to the full set) of the specified usages as a src and dst when querying transition metadata.
The problem I see with this is that it isn't guaranteed that there will
be a chain of transitions for the buffer to be usable by display.
I hadn't thought hard about it, but my initial thoughts were that it would be required that the driver support transitioning to any single usage given the capabilities returned. However, transitioning to multiple usages (e.g., to simultaneously rendering and scanning out) could fail to produce a valid transition, in which case the app would have to fall back to a copy, or avoid that simultaneous usage combination in some other way.
Adding transition metadata to the original capability sets, and using
that information when merging could give us a compatible memory layout
that would be usable by both GPU and display.

I'll look into extending the current merging logic to also take into
account transitions.
Yes, it'll be good to see whether this can be made to work. I agree Rob's example outcomes above are ideal, but it's not clear to me how to code up such an algorithm. This also all seems unnecessary if "device local" capabilities aren't needed, as posited above.
although maybe the user doesn't need to know every possible transition
between devices once you have more than two devices..
We should be able to infer how buffers are going to be moved around
from the list of usages, shouldn't we?

Maybe we are missing some bits of information there, but I think the
allocator should be able to know what transitions the app will care
about and provide only those.
The allocator only knows the requested union of all usages currently. The number of possible transitions grows combinatorially for every usage requested I believe. I expect there will be cases where ~10 usages are specified, so generating all possible transitions all the time may be excessive, when the app will probably generally only care about 2 or 3 states, and in practice, there will probably only actually be 2 or 3 different underlying possible combinations of operations.
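
As a rough back-of-the-envelope illustration of that growth: assuming a "state" is any non-empty subset of the requested usages and a transition is an ordered pair of distinct states (one possible reading, not something the allocator defines), the count can be computed as below. The function name is hypothetical.

```python
def transition_count(num_usages: int) -> int:
    """Ordered (src, dst) pairs of distinct non-empty usage subsets."""
    states = 2 ** num_usages - 1      # every non-empty subset of usages
    return states * (states - 1)      # src != dst, direction matters


for n in (2, 3, 10):
    print(n, "usages ->", transition_count(n), "possible transitions")
```

Under that reading, 10 usages already give over a million candidate transitions, against the 2 or 3 states an app is expected to care about.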
Exactly. So I wonder if we can't just "cut through the bullshit" somehow?
I'm looking for something that would also eliminate another part of the design that makes me uncomfortable: the metadata for transitions. This makes me uncomfortable for a number of reasons. Who computes the metadata? What is the representation of the metadata? With cross-device usages (which is the whole point of the exercise), this quickly becomes infeasible.
So instead as a thought experiment, let's just use what we already have: capabilities and constraints (or properties/attributes).
I kind of already outlined this with the long example in my email here: https://lists.freedesktop.org/archives/mesa-dev/2017-December/179055.html
Let me try to summarize the transition algorithm. Its inputs are:
- the current (source) capability set
- the desired new usages
- the capability sets associated with these usages, as queried when the surface was allocated
Steps of the algorithm:
1. Compute the merged capability set for the new usages (the destination capability set).
2. Compute the transition capability set, which is the merger of the source and destination sets.
3. Determine whether a "release" transition is required on the source device(s):
   3a. For global properties, a transition is required if the source capability set is a superset of the transition set.
   3b. For device-local properties, a transition is required if there is some destination device for which the device-local properties are a subset of the source set.
4. Determine whether an "acquire" transition is required on the destination device(s) in a similar way.
Finally, execute the transitions using corresponding APIs, where the APIs simply receive the computed capability sets.
For example, release transitions would receive the source capability set (and perhaps the source usages), the transition capability set, and the set difference of device-local capabilities, and nothing else.
The point is that all steps of the algorithm can be implemented in a device-agnostic way in libdevicealloc, without calling into any device/driver callbacks.
I'm pretty sure this or something like it can be made to work. We need to think through a lot of example cases, but at least we'll have thought them through, which is better than relying on some opaque metadata thing and then finding out later that there are some new cross-device cases where things don't work out because the piece of (presumably device-specific driver) code that computes the metadata isn't aware of them.
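
The release check (steps 1 to 3) might be sketched as follows. This is only an illustration under assumptions not stated in the thread: each capability set is split into global and device-local capabilities, "merger" is plain set intersection, "superset" and "subset" are read as strict, and at least one destination is given. The names CapSet and release_required are hypothetical, not libdevicealloc API.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class CapSet:
    global_caps: frozenset = frozenset()  # e.g. a shared tiling layout
    local_caps: frozenset = frozenset()   # e.g. on-device compression, caching


def release_required(source, destinations):
    """Decide whether the source device must run a release transition."""
    # Step 1: destination set = merger of the sets for the new usages.
    dest_global = reduce(frozenset.intersection,
                         (d.global_caps for d in destinations))
    # Step 2: transition set = merger of source and destination sets.
    transition_global = source.global_caps & dest_global
    # Step 3a: the source holds global properties the transition set lacks.
    if source.global_caps > transition_global:
        return True
    # Step 3b: some destination understands strictly fewer device-local
    # properties than the source currently has applied.
    return any(d.local_caps < source.local_caps for d in destinations)


gpu = CapSet(frozenset({"FOO/tiled"}), frozenset({"FOO/CC", "FOO/cached"}))
display = CapSet(frozenset({"FOO/tiled"}))

print("GPU -> display release needed:", release_required(gpu, [display]))
print("display -> GPU release needed:", release_required(display, [gpu]))
```

Nothing here calls into a driver: the decision uses only set operations on the capability sets, which is the device-agnostic property argued for above. Step 4 (acquire) would mirror the same checks from the destination side.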

This sounds pretty good. I'd like to see more detailed pseudo-code of a full cycle (cap query, allocation, transition to and from a few usages), but it seems pretty solid. I very much like that it enables the explicit capability sets, but I'm mildly worried it might add API complexity overall rather than reduce it.

I think in the end our two proposals are very similar: Yours just moves the conversion from high-level properties -> device commands to the driver applying the transition. That's fine in theory, though it shifts some minor overhead to the time of the transition. We could design the APIs such that it's possible to cache/pre-bake the device commands for a given transition though to alleviate that if it proves meaningful.

To make it clearer what the "metadata" is in my version and hence perhaps make it clearer how similar the two are, a few notes:

Transitions are queried per device in my proposal. Note this means you need to query two different sets of transition metadata for a cross-device transition, one from the source that would be applied on that device in the source API, and one from the destination that would be applied on that device in the destination API. APIs/engines that don't require transitions would return some NULL metadata indicating no required transition on that side.


Some examples of the metadata approach:

1) transition from NVIDIA dev rendering -> NVIDIA dev texturing both in Vulkan, same device:

-Query transition. You'd get some metadata representing very simple cache management stuff if anything. You'd apply it using some form of pipeline barrier on the relevant image.

2) transition from NVIDIA dev rendering -> NVIDIA dev texturing both in Vulkan, different device:

-Query transition from each device. You'd get some metadata representing more complex cache management, and potentially a decompress depending on the compatibility of the two devices. The driver is the same for both devices in this case, so it can calculate the similarities exactly by examining the capability set and each device's properties. You'd apply it using some form of pipeline barrier with the respective metadata on the relevant image on each device.

3) transition from NVIDIA dev rendering -> AMD dev texturing both in Vulkan:

-Query transition from each device. NVIDIA driver would see the destination usage is a foreign device it has no knowledge of and perform a complete cache flush and decompress. AMD driver would see the source usage is something it doesn't recognize and perform a full cache invalidate (and compression surface invalidate, if any?). You'd apply it using some form of pipeline barrier with the respective metadata on the relevant image on each device.

4) transition from NVIDIA dev rendering -> NVIDIA encoder with cache coherence

-Query source transition on GPU dev. Query destination transition on video encoder dev. GPU recognizes the destination is a device it is aware has certain properties and hence returns a decompress only since it knows it has cache coherence. Video encoder dev returns NULL transition. Apply source transition on source graphics API. Note this case requires some careful coordination across a vendor's various driver stacks to perform optimally. It would automatically degrade to the foreign device case for naive/incomplete drivers though.
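
Examples 3 and 4 can be sketched as a pair of per-device queries. This is a hypothetical illustration, not the proposed API: the Device fields, the operation names, and the conservative fallback for a same-vendor device with no special knowledge are all assumptions made for the sketch. None stands in for the NULL metadata.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Device:
    name: str
    vendor: str
    coherent_with: frozenset = frozenset()  # devices sharing a coherent cache
    needs_transitions: bool = True


def source_transition(src: Device, dst: Device) -> Optional[Tuple[str, ...]]:
    """Metadata the source device applies before handing the buffer off."""
    if not src.needs_transitions:
        return None
    if dst.vendor != src.vendor:
        # Foreign device: assume nothing, flush and decompress fully.
        return ("flush_caches", "decompress")
    if dst.name in src.coherent_with:
        # Known coherent destination: a decompress is enough.
        return ("decompress",)
    # Assumption: without specific knowledge, degrade to the foreign case.
    return ("flush_caches", "decompress")


def destination_transition(src: Device, dst: Device) -> Optional[Tuple[str, ...]]:
    """Metadata the destination device applies before first use."""
    if not dst.needs_transitions:
        return None
    # Assumption: any destination that needs transitions invalidates fully.
    return ("invalidate_caches",)


nv_gpu = Device("nv-gpu", "NVIDIA", coherent_with=frozenset({"nv-enc"}))
nv_enc = Device("nv-enc", "NVIDIA", needs_transitions=False)
amd_gpu = Device("amd-gpu", "AMD")

for dst in (amd_gpu, nv_enc):
    print(dst.name, source_transition(nv_gpu, dst),
          destination_transition(nv_gpu, dst))
```

The two queries are independent, so each driver only ever reasons about its own side of a cross-device hand-off.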

[snip]
One final note: When I initially wrote up the capability merging logic, I treated "layout" as a sort of "special" capability, basically like Nicolai originally outlined above. Miguel suggested I add the "required" bit instead to generalize things, and it ended up working out much cleaner. Besides the layout, there is at least one other obvious candidate for a "required" capability that became obvious as soon as I started coding up the prototype driver: memory location. It might seem like memory location is a simple device-agnostic constraint rather than a capability, but it's actually too complicated (we need more memory locations than "device" and "host"). It has to be vendor specific, and hence fits in better as a capability.
Could you give more concrete examples of what you'd like to see, and why having this as constraints is insufficient?

We have more than one "device local" memory with different capabilities on some devices. I think you guys have this situation as well with your cards with an SSD on them or something if I'm interpreting the marketing stuff right. I'd like to be able to express those all without needing to code them into the device-agnostic portion of the allocator library ahead of time. That way, if we come up with any new clever ones, we don't need to wait for everyone to update their allocator library to make use of them.

Additionally, with things like SLI/Crossfire, we end up with a sort of NUMA memory architecture, where memory on a "remote" card might have similar but not exactly the same capabilities as device-local memory. This would be rather complex to represent in the generic constraints as well.

I think if possible, we should try to keep the design generalized to as few types of objects and special cases as possible. The more we can generalize the solutions to our existing problem set, the better the mechanism should hold up as we apply it to new and unknown problems as they arise.
I'm coming around to the fact that those things should perhaps live in a single list/array, but I still don't like the term "capability".
I admit it's a bit of bike-shedding, but I'm starting to think it would be better to go with the generic term "property" or "attribute", and then add flags/adjectives to that based on how merging should work.
This would include the constraints as well -- it seems arbitrary to me that those would be singled out into their own list.
Basically, the underlying principle is that a good API would have either one list that includes all the properties, or one list per merging-behavior. And I think one single list is easier on the API consumer and easier to extend.

Agreed with Rob. Constraints are different for a reason: They're non-extensible and hence can merge in more complex ways. Capabilities are extensible, but must be merged by simple memcmp()-style operations, currently more or less simple intersection.
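
The difference in merging behaviour can be sketched as follows. This is an illustration under assumptions, not the prototype's code: capabilities are modelled as opaque byte blobs compared only for equality (standing in for memcmp()), a lost "required" capability invalidates the whole merged set, and alignment is taken as an example of a constraint with its own merge rule (least common multiple). All names are hypothetical.

```python
from dataclasses import dataclass
from math import gcd


@dataclass(frozen=True)
class Capability:
    blob: bytes            # opaque vendor payload, only compared bytewise
    required: bool = False


def merge_caps(a, b):
    """Bytewise intersection; None if a required capability is dropped."""
    a_blobs = {cap.blob for cap in a}
    b_blobs = {cap.blob for cap in b}
    kept = [cap for cap in a if cap.blob in b_blobs]
    dropped = ([cap for cap in a if cap.blob not in b_blobs] +
               [cap for cap in b if cap.blob not in a_blobs])
    if any(cap.required for cap in dropped):
        return None        # the merged set is unusable
    return kept


def merge_alignment(a: int, b: int) -> int:
    """Constraint-specific rule: smallest alignment satisfying both."""
    return a * b // gcd(a, b)


tiled = Capability(b"FOO/tiled", required=True)
cc = Capability(b"FOO/CC")

print(merge_caps([tiled, cc], [tiled]))
print(merge_caps([tiled], [cc]))
print(merge_alignment(32 * 1024, 64 * 1024))
```

The generic code never interprets a capability's contents, which is what lets vendors add new ones without an allocator library update, while each constraint type needs a merge rule the library knows ahead of time.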

However, I also don't care about naming. "Constraints" was chosen because it connotes negatively since they "limit" what an allocation created from a capability set can do, and similarly "capabilities" connotes positively because it indicates things that are built up additively to describe abilities of an allocation. However, I don't know that that metaphor held up entirely as the design was realized, so it might be a good time to bikeshed new names anyway.


Thanks,
-James

Cheers,
Nicolai

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev
