Am 19.05.21 um 00:06 schrieb Jason Ekstrand:
[SNIP]
E.g. we can't add a fence which doesn't wait for the exclusive one as
shared.
Ok I think that's a real problem, and  guess it's also related to all
the ttm privatization tricks and all that. So essentially we'd need
the opposite of ttm_bo->moving, as in you can't ignore it, but
otherwise it completely ignores all the userspace implicit fence
stuff.
Would you mind explaining it to the rest of the class?  I get the need
to do a TLB flush after a BO is removed from the processes address
space and I get that it may be super-heavy and that it has to be
delayed.  I also get that the driver needs to hold a reference to the
underlying pages until that TLB flush is done.  What I don't get is
what this has to do with the exclusive fence.  Why can't the driver
just gather up all the dma_resv fences on the current object (or,
better yet, just the ones from the current amdgpu process) and wait on
them all?  Why does it need to insert an exclusive fence that then
clogs up the whole works?

Because we have mixed up resource management with implicit syncing.

When I sum up all fences in (for example) a dma_fence_array container and add that as explicit fence to the dma_resv object resource management will do what I want and wait for everything to finish before moving or freeing the buffer. But implicit sync will just horrible over sync and wait for stuff it shouldn't wait for in the first place.

When I add the fence as shared fence I can run into the problem the the TLB flush might finish before the exclusive fence. Which is not allowed according to the DMA-buf fencing rules.

We currently have some rather crude workarounds to make use cases like this work as expected. E.g. by using a dma_fence_chain()/dma_fence_array() and/or adding the explusive fence to the shared fences etc etc...

Let's say that you have a buffer which is shared between two drivers A
and B and let's say driver A has thrown a fence on it just to ensure
that the BO doesn't get swapped out to disk until it's at a good
stopping point.  Then driver B comes along and wants to throw a
write-fence on it.  Suddenly, your memory fence from driver A causes
driver B to have to stall waiting for a "good" time to throw in a
fence.  It sounds like this is the sort of scenario that Christian is
running into.  And, yes, with certain Vulkan drivers being a bit
sloppy about exactly when they throw in write fences, I could see it
being a real problem.
Yes this is a potential problem, and on the i915 side we need to do
some shuffling here most likely. Especially due to discrete, but the
problem is pre-existing. tbh I forgot about the implications here
until I pondered this again yesterday evening.

But afaiui the amdgpu code and winsys in mesa, this isn't (yet) the
problem amd vk drivers have. The issue is that with amdgpu, all you
supply are the following bits at CS time:
- list of always mapped private buffers, which is implicit and O(1) in
the kernel fastpath
- additional list of shared buffers that are used by the current CS

I didn't check how exactly that works wrt winsys buffer ownership, but
the thing is that on the kernel side _any_ buffer in there is treated
as a implicit sync'ed write. Which means if you render your winsys
with a bunch of command submission split over 3d and compute pipes,
you end up with horrendous amounts of oversync.
What are you talking about? We have no sync at all for submissions from
the same client.
Yes. Except when the buffer is shared with another driver, at which
point you sync a _lot_ and feel the pain.
Yes, exactly that's the problem.

We basically don't know during CS if a BO is shared or not.

We do know that during importing or exporting the BO thought.
No you don't. Or at least that's massively awkward, see Jason's reply.
Please.  In Vulkan, we know explicitly whether or not any BO will ever
be shared and, if a BO is ever flagged as shared even though it's not,
that's the app being stupid and they can eat the perf hit.

Yeah, that's not a problem at all. We already have the per BO flag in amdgpu for this as well.

In GL, things are more wishy-washy but GL has so many stupid cases where we
have to throw a buffer away and re-allocate that one more isn't going
to be all that bad.  Even there, you could do something where you add
an in-fence to the BO export operation so that the driver knows when
to switch from the shared internal dma_resv to the external one
without having to create a new BO and copy.

Hui what? What do you mean with in-fence here?

[SNIP]
Yeah but why does your userspace not know when a bo is used?
We always know when a BO is exported because we're the ones doing the
export call.  Always.  Of course, we don't know if that BO is shared
with another driver or re-imported back into the same one but is that
really the case we're optimizing for?

Yes, unfortunately. Exactly that's one of the reasons we couldn't go with the per CS per BO flag if it should be shared or exclusive.

Or very bluntly, why cant radv do what anv does (or amdvlk if you care
more about that, it's the same)? What's missing with lots of blantant
lying?
I'm also not buying this.  You keep claiming that userspace doesn't
know but GL definitely does know and Vulkan knows well enough.  You
say that it's motivated by Vulkan and use RADV as an example but the
only reason why the RADV guys haven't followed the ANV design is to
work around limitations in amdgpu.  We shouldn't then use RADV to
justify why this is the right uAPI and why i915 is wrong.

Well, I never said that this is because of RADV. The main motivation we had is because of MM engines, e.g. VA-API, VDPAU and OpenMax.

And when we expose a BO with the DMA-buf functions we simply doesn't know in userspace if that is then re-imported into VA-API or send to a different process.

[SNIP]
Yeah, and that is exactly the reason why I will NAK this uAPI change.

This doesn't works for amdgpu at all for the reasons outlined above.
Uh that's really not how uapi works. "my driver is right, everyone
else is wrong" is not how cross driver contracts are defined. If that
means a perf impact until you've fixed your rules, that's on you.

Also you're a few years too late with nacking this, it's already uapi
in the form of the dma-buf poll() support.
^^  My fancy new ioctl doesn't expose anything that isn't already
there.  It just lets you take a snap-shot of a wait instead of doing
an active wait which might end up with more fences added depending on
interrupts and retries.  The dma-buf poll waits on all fences for
POLLOUT and only the exclusive fence for POLLIN.  It's already uAPI.

Well that's not the stuff I'm concerned about. But rather that you want to add that as exclusive fence from the shared ones once more.

This prevents the TLB flush case I've outlined from working correctly.

So the way I see things right now:
- exclusive fence slot is for implicit sync. kmd should only set it
when userspace indicates, otherwise you will suffer. Explicit syncing
userspace needs to tell the kernel with a flag in the CS ioctl when it
should sync against this exclusive fence and when it should ignore it,
otherwise you'll suffer badly once more.
That is not sufficient. The explicit sync slot is for kernel internal
memory management.
Then we need to split it. But what I discussed with Thomas Hellstrom
is that at least for anything except p2p dma-buf ttm_bo->moving should
be enough.
This is starting to sound like maybe roughly the right direction to me
but I'm still unclear on exactly what problem we're trying to solve
for TLB invalidates.  I'd like to understand that better before giving
strong opinions.  I'm also not super-familiar with ttm_bo->moving but
it sounds like we need some third category of fence somewhere.

Well I would rather say that we should separate the use cases.

E.g. clear APIs for resource management vs. implicit sync.

Christian.


--Jason


Reply via email to