Am 14.02.2018 um 00:17 schrieb Felix Kuehling:
On 2018-02-13 02:18 PM, Felix Kuehling wrote:
On 2018-02-13 01:15 PM, Christian König wrote:
Am 13.02.2018 um 18:18 schrieb Felix Kuehling:
On 2018-02-13 12:06 PM, Christian König wrote:
Ah, yeah that is also a point I wanted to to talk about with you.

The approach of using the same buffer object with multiple amdgpu
devices doesn't work in general.

We need separate TTM object for each BO in each device or otherwise we
break A+A laptops.
I think it broke for VRAM BOs because we enabled P2P on systems that
didn't support it properly. But at least system memory BOs can be shared
quite well between devices and we do it all the time.
Sharing VRAM BOs is one issue, but the problem goes deeper than just

Starting with Carizzo we can scanout from system memory to avoid the
extra copy on A+A laptops. For this to work we need the BO mapped to
GART (and I mean a real VMID0 mapping, not just in the GTT domain).
And for this to work in turn we need a TTM object per device and not a
global one.
I still don't understand. I think what you're talking about applies only
to BOs used for scan-out. Every BO is allocated from a specific device
and can only be GART-mapped on that device.

Exactly that assumption is incorrect. BOs can be GART mapped into any device.

  What we do is map the same
BO in VMs on other devices. It has no effect on GART mappings.

Correct VM mapping is unaffected here.

I don't see how you can have separate TTM objects referring to the
same memory.
Well that is trivial, we do this all the time with prime and I+A laptops.
As I understand it, you use DMABuf to export/import buffers on multiple
devices. I believe all devices share a single amdgpu_bo, which contains
the ttm_buffer_object.

That's incorrect as well. Normally multiple devices have multiple ttm_buffer_object, one for each device.

Going a bit higher that actually makes sense because the status of each BO is deferent for each device. E.g. one device could have the BO in access while it could be idle on another device.

Same is true for placement of the BO. E.g. a VRAM placement of one device is actually a GTT placement for another.

  The only way you can have a new TTM buffer object
per device is by using SG tables and pinning the BOs. But I think we
want to avoid pinning BOs.

What we do does not involve pinning of BOs, even when they're shared
between multiple devices' VMs.

That is also the reason we had to disable this feature again in the
hybrid branches.
What you disabled on hybrid branches was P2P, which only affects
large-BAR systems. Sharing of system memory BOs between devices still
works fine.
No, it doesn't. It completely breaks any scanout on Carizzo, Stoney
and Raven. Additional to that we found that it breaks some aspects of
the user space interface.
Let me check that with my current patch series. The patches I submitted
here shouldn't include anything that breaks the use cases you describe.
But I'm quite sure it will support sharing BOs between multiple devices'
Confirmed with the current patch series. I can allocate buffers on GPU 1
and map them into a VM on GPU 2.

As I said VM mapping is not the problem here.

BTW, this is the normal mode of operation for system memory BOs on
multi-GPU systems. Logically system memory BOs belong to the CPU node in
the KFD topology. So the Thunk API isn't told which GPU to allocate the
system memory BO from. It just picks the first one. Then we can map
those BOs on any GPUs we want. If we want to use them on GPU2, that's
where they get mapped.

Well keeping NUMA in mind that actually sounds like a design problem to me.

On NUMA system some parts of the system memory can be "closer" to a GPU than other parts.

So I just put two GPUs in a system and ran a test on GPU2. The system
memory buffers are allocated from the GPU1 device, but mapped into the
GPU2 VM. The tests work normally.

If this is enabled by any changes that break existing buffer sharing for
A+A or A+I systems, please point it out to me. I'm not aware that this
patch series does anything to that effect.

As I said it completely breaks scanout with A+I systems.

Over all it looks like the change causes more problems than it solves. So I'm going to block upstreaming it until we have found a way to make it work for everybody.




So end result is that we probably need to revert it and find a
different solution. I'm already working on this for a couple of weeks
now and should have something ready after I'm done with the PASID


amd-gfx mailing list

Reply via email to