RE: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range

2024-05-03 Thread Zeng, Oak



> -Original Message-
> From: Jason Gunthorpe 
> Sent: Friday, May 3, 2024 12:28 PM
> To: Zeng, Oak 
> Cc: Thomas Hellström ; Daniel Vetter
> ; dri-devel@lists.freedesktop.org; intel-
> x...@lists.freedesktop.org; Brost, Matthew ;
> Welty, Brian ; Ghimiray, Himal Prasad
> ; Bommu, Krishnaiah
> ; Vishwanathapura, Niranjana
> ; Leon Romanovsky
> 
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table
> from hmm range
> 
> On Fri, May 03, 2024 at 02:43:19PM +, Zeng, Oak wrote:
> > > > 2.
> > > > Then call hmm_range_fault a second time
> > > > Setting the hmm_range start/end only to cover valid pfns
> > > > With all valid pfns, set the REQ_FAULT flag
> > >
> > > Why would you do this? The first already did the faults you needed and
> > > returned all the easy pfns that don't require faulting.
> >
> > But we have use case where we want to fault-in pages other than the
> > page which contains the GPU fault address, e.g., user malloc'ed or
> > mmap'ed 8MiB buffer, and no CPU touching of this buffer before GPU
> > access it. Let's say GPU access caused a GPU page fault a 2MiB
> > place. The first hmm-range-fault would only fault-in the page at
> > 2MiB place, because in the first call we only set REQ_FAULT to the
> > pfn at 2MiB place.
> 
> Honestly, that doesn't make alot of sense to me, but if you really
> want that you should add some new flag and have hmm_range_fault do
> this kind of speculative faulting. I think you will end up
> significantly over faulting.

The above two-step hmm_range_fault approach was just my guess at what you were
suggesting. Since you don't like the CPU VMA lookup, we came up with this
two-step hmm_range_fault scheme; the first step serves the same purpose as a
CPU VMA lookup.

I also think this approach doesn't make sense.

In our original approach, we look up the CPU VMA before migration and call
hmm_range_fault in a non-speculative way. There is no over-faulting, because
we only call hmm_range_fault within a valid range obtained from the CPU VMA.
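To make that concrete, here is a minimal sketch of the single bounded call I
am describing. The function name fault_window_within_vma is made up, the pfns
array is assumed to be large enough for the window, and the
notifier_seq/-EBUSY retry handling is omitted; this is an illustration, not
the actual xe code:

static int fault_window_within_vma(struct mm_struct *mm,
				   struct mmu_interval_notifier *notifier,
				   unsigned long fault_addr,
				   unsigned long window,	/* e.g. SZ_2M */
				   unsigned long *pfns)
{
	struct vm_area_struct *vma;
	struct hmm_range range = {
		.notifier = notifier,
		.hmm_pfns = pfns,
		.default_flags = HMM_PFN_REQ_FAULT,
	};
	int ret;

	mmap_read_lock(mm);
	vma = find_vma(mm, fault_addr);
	if (!vma || fault_addr < vma->vm_start) {
		mmap_read_unlock(mm);
		return -EFAULT;
	}

	/* Clamp the window to the CPU vma so hmm_range_fault() never
	 * walks into a hole and never over-faults.
	 */
	range.start = max(vma->vm_start, ALIGN_DOWN(fault_addr, window));
	range.end = min(vma->vm_end, range.start + window);

	/* notifier_seq setup (mmu_interval_read_begin()) and the -EBUSY
	 * retry loop are omitted here for brevity.
	 */
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);
	return ret;
}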

> 
> It also doesn't make sense to do faulting in hmm prefetch if you are
> going to do migration to force the fault anyhow.

What do you mean by hmm prefetch?

As explained, we call hmm_range_fault in two scenarios:

1) To get the current status of the CPU page table without causing CPU
faults. We hit this scenario when the address range has already been accessed
by the CPU before the GPU touches it, or when we migrate such a range.

2) When the CPU has never accessed the range and the driver determines there
is no need to migrate, we call hmm_range_fault to trigger CPU faults and
allocate system pages for the range.
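In terms of hmm_range setup, the difference between the two scenarios is
roughly the following. This is an illustrative sketch only: the helper name is
made up, and the caller is assumed to fill in notifier/start/end/hmm_pfns and
to retry on -EBUSY:

/* Illustrative only: how the two scenarios differ in hmm_range setup. */
static int snapshot_or_fault(struct hmm_range *range, bool do_fault)
{
	if (do_fault) {
		/* scenario 2: trigger CPU faults, allocate system pages */
		range->default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE;
	} else {
		/* scenario 1: read-only snapshot of the CPU page table */
		range->default_flags = 0;
	}
	range->pfn_flags_mask = 0;	/* ignore per-pfn input flags */

	return hmm_range_fault(range);	/* caller retries on -EBUSY */
}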

> 
> 
> > > > Basically use hmm_range_fault to figure out the valid address range
> > > > in the first round; then really fault (e.g., trigger cpu fault to
> > > > allocate system pages) in the second call the hmm range fault.
> > >
> > > You don't fault on prefetch. Prefetch is about mirroring already
> > > populated pages, it should not be causing new faults.
> >
> > Maybe there is different wording here. We have this scenario we call
> > it prefetch, or whatever you call it:
> >
> > GPU page fault at an address A, we want to map an address range
> > (e.g., 2MiB, or whatever size depending on setting) around address A
> > to GPU page table. The range around A could have no backing pages
> > when GPU page fault happens. We want to populate the 2MiB range. We
> > can call it prefetch because most of pages in this range is not
> > accessed by GPU yet, but we expect GPU to access it soon.
> 
> This isn't prefetch, that is prefaulting.

Sure, prefaulting is a better name. 

We do have another prefetch API which can be called from user space to prefetch 
before GPU job submission.


> 
> > You mentioned "already populated pages". Who populated those pages
> > then? Is it a CPU access populated them? If CPU access those pages
> > first, it is true pages can be already populated.
> 
> Yes, I would think that is a pretty common case
> 
> > But it is also a valid use case where GPU access address before CPU
> > so there is no "already populated pages" on GPU page fault. Please
> > let us know what is the picture in your head. We seem picture it
> > completely differently.
> 
> And sure, this could happen too, but I feel like it is an application
> issue to be not prefaulting the buffers it knows the GPU is going to
> touch.
> 
> Again, our experiments have shown that taking the fault path is so
> slow that sane applications must explicitly prefault and prefetch as
> much as possible to avo

RE: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range

2024-05-03 Thread Zeng, Oak



> -Original Message-
> From: Jason Gunthorpe 
> Sent: Friday, May 3, 2024 9:38 AM
> To: Zeng, Oak 
> Cc: Thomas Hellström ; Daniel Vetter
> ; dri-devel@lists.freedesktop.org; intel-
> x...@lists.freedesktop.org; Brost, Matthew ;
> Welty, Brian ; Ghimiray, Himal Prasad
> ; Bommu, Krishnaiah
> ; Vishwanathapura, Niranjana
> ; Leon Romanovsky
> 
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table
> from hmm range
> 
> On Thu, May 02, 2024 at 07:25:50PM +, Zeng, Oak wrote:
> > Hi Jason,
> >
> > I tried to understand how you supposed us to use hmm range fault... it
> seems you want us to call hmm range fault two times on each gpu page fault:
> 
> > 1.
> > Call Hmm_range_fault first time, pfn of the fault address is set with
> HMM_PFN_REQ_FAULT
> > Other pfns in the PREFAULT_SIZE range will be set as 0
> > Hmm_range fault returns:
> > Pfn with 0 flag or HMM_PFN_VALID flag, means a valid pfn
> > Pfn with HMM_PFN_ERROR flag, means invalid pfn
> >
> > 2.
> > Then call hmm_range_fault a second time
> > Setting the hmm_range start/end only to cover valid pfns
> > With all valid pfns, set the REQ_FAULT flag
> 
> Why would you do this? The first already did the faults you needed and
> returned all the easy pfns that don't require faulting.

But we have use cases where we want to fault in pages other than the one
containing the GPU fault address. For example, the user malloc'ed or mmap'ed
an 8MiB buffer, and the CPU never touched it before the GPU accessed it. Say a
GPU access causes a GPU page fault at the 2MiB offset. The first
hmm_range_fault would only fault in the page at the 2MiB offset, because in
the first call we only set REQ_FAULT on the pfn at that offset.

In such a case, we would go over all the pfns returned from the first
hmm_range_fault to learn which pfns are faultable but not yet faulted in (pfn
flag == 0) and which pfns can never be faulted in (pfn flag ==
HMM_PFN_ERROR), then call hmm_range_fault again with REQ_FAULT set on all
faultable pages.
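Just to make that second pass concrete (not arguing it is the right approach,
per your comment), here is a sketch. It assumes 'range' still describes the
window used in the first call and that its hmm_pfns[] holds the first call's
output; the function name is made up and this is not the actual xe code:

static int prefault_second_pass(struct hmm_range *range)
{
	unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
	unsigned long i;
	bool need_fault = false;

	range->default_flags = 0;
	range->pfn_flags_mask = HMM_PFN_REQ_FAULT;	/* honour per-pfn flags */

	for (i = 0; i < npages; i++) {
		/* skip pfns already mirrored or never faultable */
		if (range->hmm_pfns[i] & (HMM_PFN_VALID | HMM_PFN_ERROR))
			continue;
		/* request faulting only on the holes from the first pass */
		range->hmm_pfns[i] = HMM_PFN_REQ_FAULT;
		need_fault = true;
	}

	if (!need_fault)
		return 0;

	return hmm_range_fault(range);	/* caller holds mmap_read_lock() */
}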

> 
> > Basically use hmm_range_fault to figure out the valid address range
> > in the first round; then really fault (e.g., trigger cpu fault to
> > allocate system pages) in the second call the hmm range fault.
> 
> You don't fault on prefetch. Prefetch is about mirroring already
> populated pages, it should not be causing new faults.

Maybe we are just using different wording here. We have this scenario, which
we call prefetch (or whatever you want to call it):

A GPU page fault happens at an address A, and we want to map an address range
(e.g., 2MiB, or whatever size depending on settings) around A into the GPU
page table. The range around A could have no backing pages when the GPU page
fault happens. We want to populate the whole 2MiB range. We call it prefetch
because most pages in this range have not been accessed by the GPU yet, but we
expect the GPU to access them soon.

You mentioned "already populated pages". Who populated those pages then? Was
it a CPU access that populated them? If the CPU accesses those pages first,
then yes, the pages can already be populated. But it is also a valid use case
for the GPU to access an address before the CPU does, so there are no "already
populated pages" at the time of the GPU page fault. Please let us know what
picture you have in your head; we seem to picture it completely differently.



> 
> > Do I understand it correctly?
> 
> No
> 
> > This is strange to me. We should already know the valid address
> > range before we call hmm range fault, because the migration codes
> > need to look up cpu vma anyway. what is the point of the first
> > hmm_range fault?
> 
> I don't really understand why the GPU driver would drive migration off
> of faulting. It doesn't make alot of sense, especially if you are
> prefetching CPU pages into the GPU and thus won't get faults for them.
> 

Migration on GPU fault is definitely what we want to do. Unlike the RDMA
case, the GPU has its own device memory. The size of the device memory is
comparable to the size of CPU system memory, sometimes bigger. We rely heavily
on device memory for performance. This is why HMM introduced the
MEMORY_DEVICE_PRIVATE memory type.

On a GPU page fault, the driver decides whether to migrate pages to device
memory depending on many factors, such as user hints, atomic-correctness
requirements, etc. We could migrate, or we could leave the pages in CPU system
memory, all tuned for performance.
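For what it's worth, the migration step itself is the standard migrate_vma_*()
sequence. Below is a heavily simplified skeleton of what a fault handler could
do once it decides to migrate; the device page allocation and the copy are
elided, the function name is made up, and the caller is assumed to hold
mmap_read_lock():

static int migrate_range_to_vram(struct vm_area_struct *vma,
				 unsigned long start, unsigned long end,
				 void *pgmap_owner)
{
	unsigned long npages = (end - start) >> PAGE_SHIFT;
	unsigned long *src, *dst;
	struct migrate_vma migrate = {
		.vma		= vma,
		.start		= start,
		.end		= end,
		.pgmap_owner	= pgmap_owner,
		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
	};
	int ret;

	src = kvcalloc(npages, 2 * sizeof(*src), GFP_KERNEL);
	if (!src)
		return -ENOMEM;
	dst = src + npages;
	migrate.src = src;
	migrate.dst = dst;

	ret = migrate_vma_setup(&migrate);
	if (ret || !migrate.cpages)
		goto out;

	/* ... allocate device pages, fill migrate.dst[], start the copy ... */

	migrate_vma_pages(&migrate);
	/* ... wait for the copy to finish ... */
	migrate_vma_finalize(&migrate);
out:
	kvfree(src);
	return ret;
}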


> If your plan is to leave the GPU page tables unpopulated and then
> migrate on every fault to try to achieve some kind of locality then
> you'd want to drive the hmm prefetch on the migration window (so you
> don't populate unmigrated pages) and hope for the best.


Exactly what we did. We decide the migration window by

RE: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range

2024-05-02 Thread Zeng, Oak
Hi Jason,

I tried to understand how you intend us to use hmm_range_fault... it seems
you want us to call hmm_range_fault twice on each GPU page fault:

1.
Call hmm_range_fault the first time, with the pfn of the fault address set to
HMM_PFN_REQ_FAULT and the other pfns in the PREFAULT_SIZE range set to 0.
hmm_range_fault returns:
A pfn with flag 0 or HMM_PFN_VALID means a valid pfn.
A pfn with the HMM_PFN_ERROR flag means an invalid pfn.

2.
Then call hmm_range_fault a second time, setting the hmm_range start/end to
cover only the valid pfns and setting the REQ_FAULT flag on all of them.


Basically, use hmm_range_fault to figure out the valid address range in the
first round, then really fault (i.e., trigger CPU faults to allocate system
pages) in the second hmm_range_fault call.

Do I understand it correctly?

This is strange to me. We should already know the valid address range before
we call hmm_range_fault, because the migration code needs to look up the CPU
VMA anyway. What is the point of the first hmm_range_fault?

Oak

> -Original Message-
> From: Thomas Hellström 
> Sent: Thursday, May 2, 2024 11:02 AM
> To: Jason Gunthorpe 
> Cc: Daniel Vetter ; Zeng, Oak ; dri-
> de...@lists.freedesktop.org; intel...@lists.freedesktop.org; Brost,
> Matthew ; Welty, Brian
> ; Ghimiray, Himal Prasad
> ; Bommu, Krishnaiah
> ; Vishwanathapura, Niranjana
> ; Leon Romanovsky
> 
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table
> from hmm range
> 
> On Thu, 2024-05-02 at 09:46 -0300, Jason Gunthorpe wrote:
> > On Thu, May 02, 2024 at 11:11:04AM +0200, Thomas Hellström wrote:
> >
> > > It's true the cpu vma lookup is a remnant from amdkfd. The idea
> > > here is
> > > to replace that with fixed prefaulting ranges of tunable size. So
> > > far,
> > > as you mention, the prefaulting range has been determined by the
> > > CPU
> > > vma size. Given previous feedback, this is going to change.
> >
> > Perhaps limiting prefault to a VMA barrier is a reasonable thing to
> > do, but the implementation should be pushed into hmm_range_fault and
> > not open coded in the driver.
> >
> > > Still the prefaulting range needs to be restricted to avoid -EFAULT
> > > failures in hmm_range_fault(). That can ofc be done by calling it
> > > without HMM_PFN_REQ_FAULT for the range and interpret the returned
> > > pnfs.
> >
> > Yes, this is exactly what that feature is for, you mark your prefetch
> > differently from the fault critical page(s).
> >
> > > There is a performance concern of this approach as compared to
> > > peeking at the CPU vmas directly, since hmm_range_fault() would
> > > need to
> > > be called twice. Any guidelines ideas here?
> >
> > If there is something wrong with hmm_range_fault() then please fix
> > it. I'm not sure why you'd call it twice, the HMM_PFN_REQ_FAULT is
> > per
> > PFN?
> 
> Ah, yes you're right. I somehow thought it was per range. Makes sense
> now.
> 
> Thanks,
> Thomas
> 
> 
> 
> >
> > Jason



RE: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range

2024-04-24 Thread Zeng, Oak
Hi Jason,

I went through the conversation between you and Matt. I think we are pretty
much aligned. Here is what I took from that thread:

1) hmm range fault size / gpu page table map size: you prefer a bigger GPU VMA
size, with the VMA sparsely mapped to the GPU. Our VMA size is configurable
through a user madvise API. We do plan to try one gigantic VMA with sparse
mapping. That requires us to restructure the driver for a 1 VMA : N
page-table-state mapping. This will be stage-2 work.

2) invalidation: you prefer a giant notifier. We can consider this if it turns
out our implementation is not performant. Currently we don't know.

3) whether the driver can look up the CPU VMA: I think we need this for data
migration purposes.


See also comments inline.


> -Original Message-
> From: Jason Gunthorpe 
> Sent: Wednesday, April 24, 2024 9:49 AM
> To: Zeng, Oak 
> Cc: dri-devel@lists.freedesktop.org; intel...@lists.freedesktop.org; Brost,
> Matthew ; thomas.hellst...@linux.intel.com;
> Welty, Brian ; Ghimiray, Himal Prasad
> ; Bommu, Krishnaiah
> ; Vishwanathapura, Niranjana
> ; Leon Romanovsky
> 
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table
> from hmm range
> 
> On Tue, Apr 23, 2024 at 09:17:03PM +, Zeng, Oak wrote:
> > > On Tue, Apr 09, 2024 at 04:45:22PM +, Zeng, Oak wrote:
> > >
> > > > > I saw, I am saying this should not be done. You cannot unmap bits of
> > > > > a sgl mapping if an invalidation comes in.
> > > >
> > > > You are right, if we register a huge mmu interval notifier to cover
> > > > the whole address space, then we should use dma map/unmap pages
> to
> > > > map bits of the address space. We will explore this approach.
> > > >
> > > > Right now, in xe driver, mmu interval notifier is dynamically
> > > > registered with small address range. We map/unmap the whole small
> > > > address ranges each time. So functionally it still works. But it
> > > > might not be as performant as the method you said.
> > >
> > > Please don't do this, it is not how hmm_range_fault() should be
> > > used.
> > >
> > > It is intended to be page by page and there is no API surface
> > > available to safely try to construct covering ranges. Drivers
> > > definately should not try to invent such a thing.
> >
> > I need your help to understand this comment. Our gpu mirrors the
> > whole CPU virtual address space. It is the first design pattern in
> > your previous reply (entire exclusive owner of a single device
> > private page table and fully mirrors the mm page table into the
> > device table.)
> >
> > What do you mean by "page by page"/" no API surface available to
> > safely try to construct covering ranges"? As I understand it,
> > hmm_range_fault take a virtual address range (defined in hmm_range
> > struct), and walk cpu page table in this range. It is a range based
> > API.
> 
> Yes, but invalidation is not linked to range_fault so you can get
> invalidations of single pages. You are binding range_fault to
> dma_map_sg but then you can't handle invalidation at all sanely.

Ok, I understand your point now.

Yes, strictly speaking we can get invalidation of a single page. This can be
triggered by core mm NUMA balancing or by KSM (kernel samepage merging). At
present, my understanding is that single-page (or few-page) invalidation is
not a very common case. The more common cases are invalidation triggered by
user munmap, or invalidation triggered by hmm migration itself (triggered in
migrate_vma_setup). I will experiment with this.

User munmap obviously triggers range-based invalidation.

The invalidation triggered by hmm VMA migration is also range based, as we
chose to migrate at VMA granularity for the performance reasons explained.

I agree that in the case of single-page invalidation the current code is not
performant, because we invalidate the whole VMA. What I can do is look at the
mmu_notifier_range parameter of the invalidation callback and only invalidate
that range. This requires our driver to split the VMA state from the page
table state. It is a big change; we plan to do it in stage 2.
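To sketch what that stage-2 change would look like: my_zap_gpu_range() and
my_notifier_invalidate() are made-up names standing in for the real GPU page
table invalidation, and the driver lock that serializes this against the fault
path is omitted.

/* Hypothetical stand-in for the real GPU page table zap. */
extern void my_zap_gpu_range(struct mmu_interval_notifier *mni,
			     unsigned long start, unsigned long end);

static bool my_notifier_invalidate(struct mmu_interval_notifier *mni,
				   const struct mmu_notifier_range *range,
				   unsigned long cur_seq)
{
	/* Intersect the notifier's interval with the invalidated range,
	 * so a single-page invalidation no longer zaps the whole vma.
	 */
	unsigned long start = max(mni->interval_tree.start, range->start);
	unsigned long end = min(mni->interval_tree.last + 1, range->end);

	if (!mmu_notifier_range_blockable(range))
		return false;

	/* Driver lock against the fault path omitted for brevity. */
	mmu_interval_set_seq(mni, cur_seq);
	my_zap_gpu_range(mni, start, end);

	return true;
}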

> 
> > From your previous reply ("So I find it a quite strange that this
> > RFC is creating VMA's, notifiers and ranges on the fly "), it seems
> > you are concerning why we are creating vma and register mmu interval
> > notifier on the fly. Let me try to explain it. Xe_vma is a very
> > fundamental concept in xe driver.
> 
> I understand, but SVA/hmm_range_fault/invalidation are *NOT* VMA based
> and you do need to ensure the page table manipulation has an API that
> i

RE: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range

2024-04-23 Thread Zeng, Oak
Hi Jason,

Sorry for the late reply. I have been working on a v2 of this series:
https://patchwork.freedesktop.org/series/132229/. This version addresses some
of your concerns, such as removing the global character device and removing
the svm process concept (it needs further cleanup per Matt's feedback).

But the main concern you raised is not addressed yet. I need to make sure I
understand your concerns. See inline.



> -Original Message-
> From: Jason Gunthorpe 
> Sent: Tuesday, April 9, 2024 1:24 PM
> To: Zeng, Oak 
> Cc: dri-devel@lists.freedesktop.org; intel...@lists.freedesktop.org; Brost, 
> Matthew
> ; thomas.hellst...@linux.intel.com; Welty, Brian
> ; Ghimiray, Himal Prasad
> ; Bommu, Krishnaiah
> ; Vishwanathapura, Niranjana
> ; Leon Romanovsky 
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table 
> from
> hmm range
> 
> On Tue, Apr 09, 2024 at 04:45:22PM +, Zeng, Oak wrote:
> 
> > > I saw, I am saying this should not be done. You cannot unmap bits of
> > > a sgl mapping if an invalidation comes in.
> >
> > You are right, if we register a huge mmu interval notifier to cover
> > the whole address space, then we should use dma map/unmap pages to
> > map bits of the address space. We will explore this approach.
> >
> > Right now, in xe driver, mmu interval notifier is dynamically
> > registered with small address range. We map/unmap the whole small
> > address ranges each time. So functionally it still works. But it
> > might not be as performant as the method you said.
> 
> Please don't do this, it is not how hmm_range_fault() should be
> used.
> 
> It is intended to be page by page and there is no API surface
> available to safely try to construct covering ranges. Drivers
> definately should not try to invent such a thing.

I need your help to understand this comment. Our GPU mirrors the whole CPU
virtual address space. It is the first design pattern in your previous reply
(the entire exclusive owner of a single device private page table that fully
mirrors the mm page table into the device table).

What do you mean by "page by page" / "no API surface available to safely try
to construct covering ranges"? As I understand it, hmm_range_fault takes a
virtual address range (defined in the hmm_range struct) and walks the CPU page
table in that range. It is a range-based API.

From your previous reply ("So I find it a quite strange that this RFC is
creating VMA's, notifiers and ranges on the fly"), it seems you are asking why
we create the vma and register the mmu interval notifier on the fly. Let me
try to explain it. Xe_vma is a very fundamental concept in the xe driver. GPU
page table updates and invalidations are all vma based. This concept existed
before this svm work. For svm, we create a 2MiB (the size is user
configurable) vma during the GPU page fault handler and register this 2MiB
range with an mmu interval notifier.

Now, if we don't create a vma, what can we do? We could map just the one page
that contains the GPU fault address into the GPU page table, but that doesn't
work for us because the GPU cache and TLB would not be performant at 4KiB
granularity. One way to think of the vma is as a chunk size that is good for
GPU hardware performance.

And the mmu notifier... if we don't register the mmu notifier on the fly, do
we register one mmu notifier to cover the whole CPU virtual address space
(which would be huge, e.g., 0 to 2^56 on a 57-bit machine with a half/half
user/kernel split)? That would not be performant either, because for any
address range that is unmapped from the CPU program, even if it is never
touched by the GPU, the gpu driver would still get an invalidation callback.
In our approach, we only register an mmu notifier for address ranges that we
know the GPU will touch.
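Roughly like this (a sketch only; 'struct my_userptr', 'my_notifier_ops' and
the missing clamping of the chunk are placeholders and simplifications, not
the actual xe code):

/* Register an interval notifier for a 2MiB (user configurable) chunk
 * around the faulting address.
 */
struct my_userptr {
	struct mmu_interval_notifier notifier;
	/* ... GPU page table state for this chunk ... */
};

/* hypothetical ops; the .invalidate callback is assumed elsewhere */
extern const struct mmu_interval_notifier_ops my_notifier_ops;

static int register_fault_chunk(struct my_userptr *up, struct mm_struct *mm,
				unsigned long fault_addr, unsigned long chunk)
{
	unsigned long start = ALIGN_DOWN(fault_addr, chunk);

	/* Clamping start/chunk to the accessible range is omitted. */
	return mmu_interval_notifier_insert(&up->notifier, mm, start, chunk,
					    &my_notifier_ops);
}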

> 
> > > Please don't use sg list at all for this.
> >
> > As explained, we use sg list for device private pages so we can
> > re-used the gpu page table update codes.
> 
> I'm asking you not to use SGL lists for that too. SGL lists are not
> generic data structures to hold DMA lists.

Matt mentioned using drm_buddy_block. I will see how that works out.

> 
> > > This is not what I'm talking about. The GPU VMA is bound to a specific
> > > MM VA, it should not be created on demand.
> >
> > Today we have two places where we create gpu vma: 1) create gpu vma
> > during a vm_bind ioctl 2) create gpu vma during a page fault of the
> > system allocator range (this will be in v2 of this series).
> 
> Don't do 2.

As said, we will try the approach of one gigantic gpu vma with N page table
states. We will create the page table states in page fault handling. But this
is only planned for stage 2.

> 

RE: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range

2024-04-09 Thread Zeng, Oak
Hi Jason

We are re-spinning this series based on the previous community feedback. I
will send out a v2 soon. There are big changes compared to v1, so we had
better discuss this work based on v2.

See some replies inline.

> -Original Message-
> From: Jason Gunthorpe 
> Sent: Friday, April 5, 2024 2:02 PM
> To: Zeng, Oak 
> Cc: dri-devel@lists.freedesktop.org; intel...@lists.freedesktop.org; Brost,
> Matthew ; thomas.hellst...@linux.intel.com;
> Welty, Brian ; Ghimiray, Himal Prasad
> ; Bommu, Krishnaiah
> ; Vishwanathapura, Niranjana
> ; Leon Romanovsky 
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table 
> from
> hmm range
> 
> On Fri, Apr 05, 2024 at 04:42:14PM +, Zeng, Oak wrote:
> > > > Above codes deal with a case where dma map is not needed. As I
> > > > understand it, whether we need a dma map depends on the devices
> > > > topology. For example, when device access host memory or another
> > > > device's memory through pcie, we need dma mapping; if the connection
> > > > b/t devices is xelink (similar to nvidia's nvlink), all device's
> > > > memory can be in same address space, so no dma mapping is needed.
> > >
> > > Then you call dma_map_page to do your DMA side and you avoid it for
> > > the DEVICE_PRIVATE side. SG list doesn't help this anyhow.
> >
> > When dma map is needed, we used dma_map_sgtable, a different flavor
> > of the dma-map-page function.
> 
> I saw, I am saying this should not be done. You cannot unmap bits of
> a sgl mapping if an invalidation comes in.

You are right: if we register a huge mmu interval notifier to cover the whole
address space, then we should use dma map/unmap of pages to map bits of the
address space. We will explore this approach.

Right now, in the xe driver, the mmu interval notifier is dynamically
registered with a small address range. We map/unmap the whole small address
range each time. So functionally it still works, but it might not be as
performant as the method you described. This is the existing logic for our
userptr code. Our system allocator inherits this logic automatically, as our
system allocator design is built on top of userptr (will send out v2 soon). We
plan to make things work in the first stage and then do performance
improvements like you suggested here.

> 
> > The reason we also used (mis-used) sg list for non-dma-map cases, is
> > because we want to re-use some data structure. Our goal here is, for
> > a hmm_range, build a list of device physical address (can be
> > dma-mapped or not), which will be used later on to program the
> > device page table. We re-used the sg list structure for the
> > non-dma-map cases so those two cases can share the same page table
> > programming codes. Since sg list was originally designed for
> > dma-map, it does look like this is mis-used here.
> 
> Please don't use sg list at all for this.

As explained, we use an sg list for device private pages so we can reuse the
gpu page table update code. The input of the gpu page table update code in
this case is a list of dma addresses (in the case of system memory) or device
physical addresses (in the case of device private pages). The gpu page table
update code in the xe driver is pretty complicated, so reusing that code is
preferable for us. If we introduced a different data structure, we would have
to rewrite part of the gpu page table update code.

I don't see an obvious problem with this approach. But if you see a problem, I
am open to changing it.

> 
> > Need to mention, even for some DEVICE_PRIVATE memory, we also need
> > dma-mapping. For example, if you have two devices connected to each
> > other through PCIe, both devices memory are registered as
> > DEVICE_PRIVATE to hmm.
> 
> Yes, but you don't ever dma map DEVICE_PRIVATE.
> 
> > I believe we need a dma-map when one device access another device's
> > memory. Two devices' memory doesn't belong to same address space in
> > this case. For modern GPU with xeLink/nvLink/XGMI, this is not
> > needed.
> 
> Review my emails here:
> 
> https://lore.kernel.org/dri-devel/20240403125712.ga1744...@nvidia.com/
> 
> Which explain how it should work.

You are right. A dma map is not needed for device private pages.

> 
> > > A 1:1 SVA mapping is a special case of this where there is a single
> > > GPU VMA that spans the entire process address space with a 1:1 VA (no
> > > offset).
> >
> > From implementation perspective, we can have one device page table
> > for one process for such 1:1 va mapping, but it is not necessary to
> > have a single gpu vma. We can have many gpu vma each cover a segment
> > of this address space.
> 

RE: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range

2024-04-05 Thread Zeng, Oak



> -Original Message-
> From: dri-devel  On Behalf Of Jason
> Gunthorpe
> Sent: Friday, April 5, 2024 8:37 AM
> To: Zeng, Oak 
> Cc: dri-devel@lists.freedesktop.org; intel...@lists.freedesktop.org; Brost,
> Matthew ; thomas.hellst...@linux.intel.com;
> Welty, Brian ; Ghimiray, Himal Prasad
> ; Bommu, Krishnaiah
> ; Vishwanathapura, Niranjana
> ; Leon Romanovsky 
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table 
> from
> hmm range
> 
> On Fri, Apr 05, 2024 at 03:33:10AM +, Zeng, Oak wrote:
> > >
> > > I didn't look at this series a lot but I wanted to make a few
> > > remarks.. This I don't like quite a lot. Yes, the DMA API interaction
> > > with hmm_range_fault is pretty bad, but it should not be hacked
> > > around like this. Leon is working on a series to improve it:
> > >
> > > https://lore.kernel.org/linux-rdma/cover.1709635535.git.l...@kernel.org/
> >
> >
> > I completely agree above codes are really ugly. We definitely want
> > to adapt to a better way. I will spend some time on above series.
> >
> > >
> > > Please participate there too. In the mean time you should just call
> > > dma_map_page for every single page like ODP does.
> >
> > Above codes deal with a case where dma map is not needed. As I
> > understand it, whether we need a dma map depends on the devices
> > topology. For example, when device access host memory or another
> > device's memory through pcie, we need dma mapping; if the connection
> > b/t devices is xelink (similar to nvidia's nvlink), all device's
> > memory can be in same address space, so no dma mapping is needed.
> 
> Then you call dma_map_page to do your DMA side and you avoid it for
> the DEVICE_PRIVATE side. SG list doesn't help this anyhow.

When a dma map is needed, we used dma_map_sgtable, a different flavor of the
dma_map_page function.

The reason we also used (mis-used) the sg list for the non-dma-map cases is
that we want to re-use some data structures. Our goal here is, for a
hmm_range, to build a list of device physical addresses (dma-mapped or not)
that will be used later on to program the device page table. We re-used the sg
list structure for the non-dma-map cases so the two cases can share the same
page table programming code. Since the sg list was originally designed for dma
mapping, it does look like it is mis-used here.

I need to mention that even for some DEVICE_PRIVATE memory we also need dma
mapping. For example, if you have two devices connected to each other through
PCIe, both devices' memory is registered as DEVICE_PRIVATE with hmm. I believe
we need a dma map when one device accesses another device's memory; the two
devices' memory does not belong to the same address space in this case. For
modern GPUs with xeLink/nvLink/XGMI, this is not needed.
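For comparison, the per-page mapping you suggest (like ODP does) would look
roughly like the sketch below for the system-memory case. 'map_hmm_pfns' and
the flat dma_addrs array are my own placeholders, and error unwinding of
already-mapped pages is omitted:

/* Map hmm_range_fault() output page by page, instead of building an
 * sg table. System-memory pages only; device-private pages would be
 * translated to device physical addresses without any dma mapping.
 */
static int map_hmm_pfns(struct device *dev, const unsigned long *pfns,
			dma_addr_t *dma_addrs, unsigned long npages)
{
	unsigned long i;

	for (i = 0; i < npages; i++) {
		struct page *page;

		if (!(pfns[i] & HMM_PFN_VALID))
			return -EFAULT;

		page = hmm_pfn_to_page(pfns[i]);
		dma_addrs[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
					    DMA_BIDIRECTIONAL);
		if (dma_mapping_error(dev, dma_addrs[i]))
			return -EIO;
	}
	return 0;
}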


> 
> > > Also, I tried to follow the large discussion in the end but it was
> > > quite hard to read the text in Lore for some reason.
> >
> > Did you mean this discussion: https://lore.kernel.org/dri-
> devel/?q=Making+drm_gpuvm+work+across+gpu+devices? This link works good
> for me with chrome browser.
> 
> That is the one I am referring to
> 
> > > I would just opine some general points on how I see hmm_range_fault
> > > being used by drivers.
> > >
> > > First of all, the device should have a private page table. At least
> > > one, but ideally many. Obviously it should work, so I found it a bit
> > > puzzling the talk about problems with virtualization. Either the
> > > private page table works virtualized, including faults, or it should
> > > not be available..
> >
> > To be very honest, I was also very confused. In this series, I had
> > one very fundamental assumption that with hmm any valid cpu virtual
> > address is also a valid gpu virtual address. But Christian had a
> > very strong opinion that the gpu va can have an offset to cpu va. He
> > mentioned a failed use case with amdkfd and claimed an offset can
> > solve their problem.
> 
> Offset is something different, I said the VM's view of the page table
> should fully work. You shouldn't get into a weird situation where the
> VM is populating the page table and can't handle faults or something.
> 

We don't have such a weird situation. There are two layers of translation
when running in a virtualized environment. From the guest VM's perspective,
the first-level page table is in the guest device physical address space. It
is no different from the bare-metal situation. Our driver doesn't need to know
whether it runs virtualized or bare-metal for first-level page table
programming and page fault handling.

> If the VMM h

RE: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range

2024-04-04 Thread Zeng, Oak
Hi Jason,

> -Original Message-
> From: Jason Gunthorpe 
> Sent: Thursday, April 4, 2024 8:39 PM
> To: Zeng, Oak 
> Cc: dri-devel@lists.freedesktop.org; intel...@lists.freedesktop.org; Brost,
> Matthew ; thomas.hellst...@linux.intel.com;
> Welty, Brian ; Ghimiray, Himal Prasad
> ; Bommu, Krishnaiah
> ; Vishwanathapura, Niranjana
> ; Leon Romanovsky 
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table 
> from
> hmm range
> 
> On Wed, Jan 17, 2024 at 05:12:06PM -0500, Oak Zeng wrote:
> > +/**
> > + * xe_svm_build_sg() - build a scatter gather table for all the physical
> pages/pfn
> > + * in a hmm_range.
> > + *
> > + * @range: the hmm range that we build the sg table from. range-
> >hmm_pfns[]
> > + * has the pfn numbers of pages that back up this hmm address range.
> > + * @st: pointer to the sg table.
> > + *
> > + * All the contiguous pfns will be collapsed into one entry in
> > + * the scatter gather table. This is for the convenience of
> > + * later on operations to bind address range to GPU page table.
> > + *
> > + * This function allocates the storage of the sg table. It is
> > + * caller's responsibility to free it calling sg_free_table.
> > + *
> > + * Returns 0 if successful; -ENOMEM if fails to allocate memory
> > + */
> > +int xe_svm_build_sg(struct hmm_range *range,
> > +struct sg_table *st)
> > +{
> > +   struct scatterlist *sg;
> > +   u64 i, npages;
> > +
> > +   sg = NULL;
> > +   st->nents = 0;
> > +   npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >>
> PAGE_SHIFT) + 1;
> > +
> > +   if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
> > +   return -ENOMEM;
> > +
> > +   for (i = 0; i < npages; i++) {
> > +   unsigned long addr = range->hmm_pfns[i];
> > +
> > +   if (sg && (addr == (sg_dma_address(sg) + sg->length))) {
> > +   sg->length += PAGE_SIZE;
> > +   sg_dma_len(sg) += PAGE_SIZE;
> > +   continue;
> > +   }
> > +
> > +   sg =  sg ? sg_next(sg) : st->sgl;
> > +   sg_dma_address(sg) = addr;
> > +   sg_dma_len(sg) = PAGE_SIZE;
> > +   sg->length = PAGE_SIZE;
> > +   st->nents++;
> > +   }
> > +
> > +   sg_mark_end(sg);
> > +   return 0;
> > +}
> 
> I didn't look at this series a lot but I wanted to make a few
> remarks.. This I don't like quite a lot. Yes, the DMA API interaction
> with hmm_range_fault is pretty bad, but it should not be hacked
> around like this. Leon is working on a series to improve it:
> 
> https://lore.kernel.org/linux-rdma/cover.1709635535.git.l...@kernel.org/


I completely agree the above code is really ugly. We definitely want to adapt
to a better way. I will spend some time on the above series.

> 
> Please participate there too. In the mean time you should just call
> dma_map_page for every single page like ODP does.

The above code deals with a case where a dma map is not needed. As I
understand it, whether we need a dma map depends on the device topology. For
example, when a device accesses host memory or another device's memory through
PCIe, we need dma mapping; if the connection between devices is xeLink
(similar to Nvidia's nvlink), all devices' memory can be in the same address
space, so no dma mapping is needed.


> 
> Also, I tried to follow the large discussion in the end but it was
> quite hard to read the text in Lore for some reason.

Did you mean this discussion:
https://lore.kernel.org/dri-devel/?q=Making+drm_gpuvm+work+across+gpu+devices?
This link works well for me with the Chrome browser.


> 
> I would just opine some general points on how I see hmm_range_fault
> being used by drivers.
> 
> First of all, the device should have a private page table. At least
> one, but ideally many. Obviously it should work, so I found it a bit
> puzzling the talk about problems with virtualization. Either the
> private page table works virtualized, including faults, or it should
> not be available..

To be very honest, I was also very confused. In this series, I had one very
fundamental assumption: with hmm, any valid cpu virtual address is also a
valid gpu virtual address. But Christian had a very strong opinion that the
gpu va can have an offset from the cpu va. He mentioned a failed use case with
amdkfd and claimed an offset can solve their problem.

For all our known use cases, gpu va == cpu va. But we have agreed to make the
uAPI flexible so we can introduce an offset if such a use case comes up in the
future.


RE: Making drm_gpuvm work across gpu devices

2024-03-07 Thread Zeng, Oak
Hello all,

Since I didn't get a reply to this one, I assume the points below are agreed.
But feel free to let us know if you don't agree.

Thanks,
Oak

-Original Message-
From: dri-devel  On Behalf Of Zeng, Oak
Sent: Thursday, February 29, 2024 1:23 PM
To: Christian König ; Daniel Vetter 
; David Airlie 
Cc: Thomas Hellström ; Brost, Matthew 
; Felix Kuehling ; Welty, 
Brian ; dri-devel@lists.freedesktop.org; Ghimiray, Himal 
Prasad ; Bommu, Krishnaiah 
; Gupta, saurabhg ; 
Vishwanathapura, Niranjana ; 
intel...@lists.freedesktop.org; Danilo Krummrich ; Shah, Ankur 
N ; jgli...@redhat.com; rcampb...@nvidia.com; 
apop...@nvidia.com
Subject: RE: Making drm_gpuvm work across gpu devices

Hi Christian/Daniel/Dave/Felix/Thomas, and all,

We have been refining our design internally in the past month. Below is our 
plan. Please let us know if you have any concern.

1) Remove pseudo /dev/xe-svm device. All system allocator interfaces will be 
through /dev/dri/render devices. Not global interface.

2) Unify userptr and system allocator codes. We will treat userptr as a 
speciality of system allocator without migration capability. We will introduce 
the hmmptr concept for system allocator. We will extend vm_bind API to map a 
range A..B of process address space to a range C..D of GPU address space for 
hmmptr. For hmmptr, if gpu program accesses an address which is not backed by 
core mm vma, it is a fatal error.

3) Multiple device support. We have identified p2p use-cases where we might 
want to leave memory on a foreign device or direct migrations to a foreign 
device and therefore might need a global structure that tracks or caches the 
migration state per process address space. We didn't completely settle down 
this design. We will come back when we have more details.

4) We will first work on this code in xekmd, then look at moving some common
code to the drm layer so it can also be used by other vendors.

Thomas and I still have open questions for Christian. We will follow up.

Thanks all for this discussion.

Regards,
Oak

> -Original Message-
> From: Christian König 
> Sent: Thursday, February 1, 2024 3:52 AM
> To: Zeng, Oak ; Daniel Vetter ; David
> Airlie 
> Cc: Thomas Hellström ; Brost, Matthew
> ; Felix Kuehling ; Welty,
> Brian ; dri-devel@lists.freedesktop.org; Ghimiray, 
> Himal
> Prasad ; Bommu, Krishnaiah
> ; Gupta, saurabhg ;
> Vishwanathapura, Niranjana ; intel-
> x...@lists.freedesktop.org; Danilo Krummrich ; Shah, Ankur N
> ; jgli...@redhat.com; rcampb...@nvidia.com;
> apop...@nvidia.com
> Subject: Re: Making drm_gpuvm work across gpu devices
> 
> Hi Oak,
> 
> Am 31.01.24 um 21:17 schrieb Zeng, Oak:
> > Hi Sima, Dave,
> >
> > I am well aware nouveau driver is not what Nvidia do with their customer. 
> > The
> key argument is, can we move forward with the concept shared virtual address
> space b/t CPU and GPU? This is the foundation of HMM. We already have split
> address space support with other driver API. SVM, from its name, it means
> shared address space. Are we allowed to implement another driver model to
> allow SVM work, along with other APIs supporting split address space? Those 
> two
> scheme can co-exist in harmony. We actually have real use cases to use both
> models in one application.
> >
> > Hi Christian, Thomas,
> >
> > In your scheme, GPU VA can != GPU VA. This does introduce some flexibility.
> But this scheme alone doesn't solve the problem of the proxy process/para-
> virtualization. You will still need a second mechanism to partition GPU VA 
> space
> b/t guest process1 and guest process2 because proxy process (or the host
> hypervisor whatever you call it) use one single gpu page table for all the
> guest/client processes. GPU VA for different guest process can't overlap. If 
> this
> second mechanism exist, we of course can use the same mechanism to partition
> CPU VA space between guest processes as well, then we can still use shared VA
> b/t CPU and GPU inside one process, but process1 and process2's address space
> (for both cpu and gpu) doesn't overlap. This second mechanism is the key to
> solve the proxy process problem, not the flexibility you introduced.
> 
> That approach was suggested before, but it doesn't work. First of all
> you create a massive security hole when you give the GPU full access to
> the QEMU CPU process which runs the virtualization.
> 
> So even if you say CPU VA == GPU VA you still need some kind of
> flexibility otherwise you can't implement this use case securely.
> 
> Additional to this the CPU VAs are usually controlled by the OS and not
> some driver, so to make sure that host and guest VAs don't overlap you
> would need to add some kind of sync between the guest and host OS kernels.
> 
> > In practice, your scheme also have a

RE: Making drm_gpuvm work across gpu devices

2024-02-29 Thread Zeng, Oak
Hi Christian/Daniel/Dave/Felix/Thomas, and all,

We have been refining our design internally in the past month. Below is our 
plan. Please let us know if you have any concern.

1) Remove pseudo /dev/xe-svm device. All system allocator interfaces will be 
through /dev/dri/render devices. Not global interface.

2) Unify userptr and system allocator codes. We will treat userptr as a 
speciality of system allocator without migration capability. We will introduce 
the hmmptr concept for system allocator. We will extend vm_bind API to map a 
range A..B of process address space to a range C..D of GPU address space for 
hmmptr. For hmmptr, if gpu program accesses an address which is not backed by 
core mm vma, it is a fatal error.

3) Multiple device support. We have identified p2p use-cases where we might 
want to leave memory on a foreign device or direct migrations to a foreign 
device and therefore might need a global structure that tracks or caches the 
migration state per process address space. We didn't completely settle down 
this design. We will come back when we have more details.

4) We will first work on this code in xekmd, then look at moving some common
code to the drm layer so it can also be used by other vendors.

Thomas and I still have open questions for Christian. We will follow up.

Thanks all for this discussion.

Regards,
Oak

> -Original Message-
> From: Christian König 
> Sent: Thursday, February 1, 2024 3:52 AM
> To: Zeng, Oak ; Daniel Vetter ; David
> Airlie 
> Cc: Thomas Hellström ; Brost, Matthew
> ; Felix Kuehling ; Welty,
> Brian ; dri-devel@lists.freedesktop.org; Ghimiray, 
> Himal
> Prasad ; Bommu, Krishnaiah
> ; Gupta, saurabhg ;
> Vishwanathapura, Niranjana ; intel-
> x...@lists.freedesktop.org; Danilo Krummrich ; Shah, Ankur N
> ; jgli...@redhat.com; rcampb...@nvidia.com;
> apop...@nvidia.com
> Subject: Re: Making drm_gpuvm work across gpu devices
> 
> Hi Oak,
> 
> Am 31.01.24 um 21:17 schrieb Zeng, Oak:
> > Hi Sima, Dave,
> >
> > I am well aware nouveau driver is not what Nvidia do with their customer. 
> > The
> key argument is, can we move forward with the concept shared virtual address
> space b/t CPU and GPU? This is the foundation of HMM. We already have split
> address space support with other driver API. SVM, from its name, it means
> shared address space. Are we allowed to implement another driver model to
> allow SVM work, along with other APIs supporting split address space? Those 
> two
> scheme can co-exist in harmony. We actually have real use cases to use both
> models in one application.
> >
> > Hi Christian, Thomas,
> >
> > In your scheme, GPU VA can != GPU VA. This does introduce some flexibility.
> But this scheme alone doesn't solve the problem of the proxy process/para-
> virtualization. You will still need a second mechanism to partition GPU VA 
> space
> b/t guest process1 and guest process2 because proxy process (or the host
> hypervisor whatever you call it) use one single gpu page table for all the
> guest/client processes. GPU VA for different guest process can't overlap. If 
> this
> second mechanism exist, we of course can use the same mechanism to partition
> CPU VA space between guest processes as well, then we can still use shared VA
> b/t CPU and GPU inside one process, but process1 and process2's address space
> (for both cpu and gpu) doesn't overlap. This second mechanism is the key to
> solve the proxy process problem, not the flexibility you introduced.
> 
> That approach was suggested before, but it doesn't work. First of all
> you create a massive security hole when you give the GPU full access to
> the QEMU CPU process which runs the virtualization.
> 
> So even if you say CPU VA == GPU VA you still need some kind of
> flexibility otherwise you can't implement this use case securely.
> 
> Additional to this the CPU VAs are usually controlled by the OS and not
> some driver, so to make sure that host and guest VAs don't overlap you
> would need to add some kind of sync between the guest and host OS kernels.
> 
> > In practice, your scheme also have a risk of running out of process space
> because you have to partition whole address space b/t processes. Apparently
> allowing each guest process to own the whole process space and using separate
> GPU/CPU page table for different processes is a better solution than using 
> single
> page table and partition process space b/t processes.
> 
> Yeah that you run out of address space is certainly possible. But as I
> said CPUs are switching to 5 level of pages tables and if you look at
> for example a "cat maps | cut -c-4 | sort -u" of process you will find
> that only a handful of 4GiB segments are actually used and thanks to
> recoverable page fau

RE: Making drm_gpuvm work across gpu devices

2024-02-29 Thread Zeng, Oak
Hi Christian,

Can you elaborate the mirror on demand/userfaultfd idea?

userfaultfd is a way for user space to take over page fault handling of a
user-registered range. At first look, it seems you want a user space page
fault handler to mirror a large chunk of memory to the GPU. I would imagine
this handler is in the UMD, because the whole purpose of the system svm
allocator is to let the user use a cpu address (such as a malloc'ed one) in a
gpu program without an extra driver api call, so the registration and
mirroring of this large chunk can't be in the user program. With this, I
pictured the sequence below:

At process initialization time, the umd registers a large chunk (let's say
1GiB) of memory using userfaultfd; this includes:

  1.  mem = mmap(NULL, 1GiB, MAP_ANON)
  2.  register the range [mem, mem + 1GiB] through userfaultfd
  3.  after that, the umd can wait on page fault events; when a page fault
happens, the umd calls vm_bind to mirror the [mem, mem+1GiB] range to the GPU

now in a user program:
ptr = malloc(size);
submit a GPU program which uses ptr

This is what I can picture (a rough user-space sketch of the registration part
follows below). It doesn't work, because ptr can't belong to the [mem,
mem+1GiB] range, so you can't vm_bind/mirror ptr on demand to the GPU.
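To spell out steps 1 and 2 above in user-space terms, a rough sketch with
error handling dropped; 'register_chunk' is a made-up name and this is only
the registration I am picturing, not a working scheme:

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define CHUNK (1UL << 30)	/* 1GiB */

int register_chunk(void **out_mem)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg;
	void *mem;

	ioctl(uffd, UFFDIO_API, &api);

	mem = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	reg.range.start = (unsigned long)mem;
	reg.range.len = CHUNK;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/* Step 3 would be: poll(uffd), read a uffd_msg, then vm_bind the
	 * faulting range to the GPU -- but the fault only fires on *CPU*
	 * access, and ptr = malloc(...) is outside [mem, mem+1GiB] anyway.
	 */
	*out_mem = mem;
	return uffd;
}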

Also, the page fault event in 3) above can't happen at all. A page fault only
happens when the *CPU* accesses mem, but in our case it could be *only the
GPU* that touches the memory.

The point is, with the system svm allocator, the user can use *any* valid CPU
address in a GPU program. This address can be anything in the range
[0~2^57-1]. This design requirement is quite simple and clean. I don't see how
to solve this with userfaultfd / on-demand mirroring.

Regards,
Oak

From: Christian König 
Sent: Thursday, February 29, 2024 4:41 AM
To: Zeng, Oak ; Danilo Krummrich ; Dave 
Airlie ; Daniel Vetter ; Felix Kuehling 
; jgli...@redhat.com
Cc: Welty, Brian ; dri-devel@lists.freedesktop.org; 
intel...@lists.freedesktop.org; Bommu, Krishnaiah ; 
Ghimiray, Himal Prasad ; 
thomas.hellst...@linux.intel.com; Vishwanathapura, Niranjana 
; Brost, Matthew 
; Gupta, saurabhg 
Subject: Re: Making drm_gpuvm work across gpu devices

On 28.02.24 at 20:51, Zeng, Oak wrote:


The mail wasn't indented/quoted correctly. Manually reformatting it.


From: Christian König <christian.koe...@amd.com>
Sent: Tuesday, February 27, 2024 1:54 AM
To: Zeng, Oak <oak.z...@intel.com>; Danilo Krummrich <d...@redhat.com>; Dave 
Airlie <airl...@redhat.com>; Daniel Vetter <dan...@ffwll.ch>; Felix Kuehling 
<felix.kuehl...@amd.com>; jgli...@redhat.com
Cc: Welty, Brian <brian.we...@intel.com>; dri-devel@lists.freedesktop.org; 
intel...@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bo...@intel.com>; 
Ghimiray, Himal Prasad <himal.prasad.ghimi...@intel.com>; 
thomas.hellst...@linux.intel.com; Vishwanathapura, Niranjana 
<niranjana.vishwanathap...@intel.com>; Brost, Matthew <matthew.br...@intel.com>; 
Gupta, saurabhg <saurabhg.gu...@intel.com>
Subject: Re: Making drm_gpuvm work across gpu devices

Hi Oak,
On 23.02.24 at 21:12, Zeng, Oak wrote:
Hi Christian,

I go back this old email to ask a question.

sorry totally missed that one.



Quote from your email:
“Those ranges can then be used to implement the SVM feature required for higher 
level APIs and not something you need at the UAPI or even inside the low level 
kernel memory management.”
“SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have 
any influence on the design of the kernel UAPI.”

There are two category of SVM:

1.   driver svm allocator: this is implemented in user space,  i.g., 
cudaMallocManaged (cuda) or zeMemAllocShared (L0) or clSVMAlloc(openCL). Intel 
already have gem_create/vm_bind in xekmd and our umd implemented clSVMAlloc and 
zeMemAllocShared on top of gem_create/vm_bind. Range A..B of the process 
address space is mapped into a range C..D of the GPU address space, exactly as 
you said.

2.   system svm allocator:  This doesn’t introduce extra driver API for 
memory allocation. Any valid CPU virtual address can be used directly 
transparently in a GPU program without any extra driver API call. Quote from 
kernel Documentation/vm/hmm.hst: “Any application memory region (private 
anonymous, shared memory, or regular file backed memory) can be used by a 
device transparently” and “to share the address space by duplicating the CPU 
page table in the device page table so the same address points to the same 
physical memory for any valid main memory address in the process address 
space”. In system svm allocator, we don’t need that A..B C..D mapping.

It looks like you were talking of 1). Were you?

No, even when you fully mirror the whole address space from a process into the
GPU you still need to enable this somehow with an IOCTL.

RE: Making drm_gpuvm work across gpu devices

2024-02-28 Thread Zeng, Oak

The mail wasn't indented/quoted correctly. Manually reformatting it.


From: Christian König <christian.koe...@amd.com>
Sent: Tuesday, February 27, 2024 1:54 AM
To: Zeng, Oak <oak.z...@intel.com>; Danilo Krummrich <d...@redhat.com>; Dave 
Airlie <airl...@redhat.com>; Daniel Vetter <dan...@ffwll.ch>; Felix Kuehling 
<felix.kuehl...@amd.com>; jgli...@redhat.com
Cc: Welty, Brian <brian.we...@intel.com>; dri-devel@lists.freedesktop.org; 
intel...@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bo...@intel.com>; 
Ghimiray, Himal Prasad <himal.prasad.ghimi...@intel.com>; 
thomas.hellst...@linux.intel.com; Vishwanathapura, Niranjana 
<niranjana.vishwanathap...@intel.com>; Brost, Matthew <matthew.br...@intel.com>; 
Gupta, saurabhg <saurabhg.gu...@intel.com>
Subject: Re: Making drm_gpuvm work across gpu devices

Hi Oak,
On 23.02.24 at 21:12, Zeng, Oak wrote:
Hi Christian,

I go back this old email to ask a question.

sorry totally missed that one.


Quote from your email:
“Those ranges can then be used to implement the SVM feature required for higher 
level APIs and not something you need at the UAPI or even inside the low level 
kernel memory management.”
“SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have 
any influence on the design of the kernel UAPI.”

There are two category of SVM:

1.   driver svm allocator: this is implemented in user space,  i.g., 
cudaMallocManaged (cuda) or zeMemAllocShared (L0) or clSVMAlloc(openCL). Intel 
already have gem_create/vm_bind in xekmd and our umd implemented clSVMAlloc and 
zeMemAllocShared on top of gem_create/vm_bind. Range A..B of the process 
address space is mapped into a range C..D of the GPU address space, exactly as 
you said.

2.   system svm allocator:  This doesn’t introduce extra driver API for 
memory allocation. Any valid CPU virtual address can be used directly 
transparently in a GPU program without any extra driver API call. Quote from 
kernel Documentation/vm/hmm.hst: “Any application memory region (private 
anonymous, shared memory, or regular file backed memory) can be used by a 
device transparently” and “to share the address space by duplicating the CPU 
page table in the device page table so the same address points to the same 
physical memory for any valid main memory address in the process address 
space”. In system svm allocator, we don’t need that A..B C..D mapping.

It looks like you were talking of 1). Were you?

No, even when you fully mirror the whole address space from a process into the 
GPU you still need to enable this somehow with an IOCTL.

And while enabling this you absolutely should specify to which part of the 
address space this mirroring applies and where it maps to.


[Zeng, Oak]
Let's say we have a hardware platform where both CPU and GPU support a 57-bit
virtual address range (using 57 bits as an example; the statement applies to
any address range). How do you decide "which part of the address space this
mirroring applies" to? You have to mirror the whole address space [0~2^57-1],
don't you? As you designed it, the gigantic vm_bind/mirroring happens at
process initialization time, and at that time you don't know which part of the
address space will be used by the gpu program. Remember that for the system
allocator, *any* valid CPU address can be used by a GPU program. If you add an
offset to [0~2^57-1], you get an address outside the 57-bit address range. Is
this a valid concern?


I see the system svm allocator as just a special case of the driver allocator 
where not fully backed buffer objects are allocated, but rather sparse one 
which are filled and migrated on demand.


[Zeng, Oak]
The above statement is true to me. We don't have a BO for the system svm
allocator. It is a sparse one, as we can sparsely map a vma to the GPU. Our
migration policy decides which pages / how much of the vma is migrated/mapped
to the GPU page table.

The difference between your view and mine is: you want a gigantic vma (created
during the gigantic vm_bind) to be sparsely populated to the gpu, while I
thought of the vma (xe_vma in the xekmd code) as a place to save memory
attributes (such as caching, user-preferred placement, etc.). All those memory
attributes are range based, i.e., the user can specify that range1 is cached
while range2 is uncached. So I don't see how you can manage that with the
gigantic vma. Do you split your gigantic vma later to save range-based memory
attributes?

Regards,
Oak


Regards,
Christian.




RE: Making drm_gpuvm work across gpu devices

2024-02-27 Thread Zeng, Oak


From: Christian König 
Sent: Tuesday, February 27, 2024 1:54 AM
To: Zeng, Oak ; Danilo Krummrich ; Dave 
Airlie ; Daniel Vetter ; Felix Kuehling 
; jgli...@redhat.com
Cc: Welty, Brian ; dri-devel@lists.freedesktop.org; 
intel...@lists.freedesktop.org; Bommu, Krishnaiah ; 
Ghimiray, Himal Prasad ; 
thomas.hellst...@linux.intel.com; Vishwanathapura, Niranjana 
; Brost, Matthew 
; Gupta, saurabhg 
Subject: Re: Making drm_gpuvm work across gpu devices

Hi Oak,
On 23.02.24 at 21:12, Zeng, Oak wrote:
Hi Christian,

I go back this old email to ask a question.

sorry totally missed that one.



Quote from your email:
“Those ranges can then be used to implement the SVM feature required for higher 
level APIs and not something you need at the UAPI or even inside the low level 
kernel memory management.”
“SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have 
any influence on the design of the kernel UAPI.”

There are two categories of SVM:

  1.  driver svm allocator: this is implemented in user space, e.g., cudaMallocManaged 
(cuda), zeMemAllocShared (L0) or clSVMAlloc (openCL). Intel already has 
gem_create/vm_bind in xekmd, and our umd implemented clSVMAlloc and zeMemAllocShared on 
top of gem_create/vm_bind. Range A..B of the process address space is mapped into a 
range C..D of the GPU address space, exactly as you said.
  2.  system svm allocator: this doesn't introduce an extra driver API for memory 
allocation. Any valid CPU virtual address can be used directly and transparently in a 
GPU program without any extra driver API call. Quoting from the kernel's 
Documentation/vm/hmm.rst: “Any application memory region (private anonymous, shared 
memory, or regular file backed memory) can be used by a device transparently” and “to 
share the address space by duplicating the CPU page table in the device page table so 
the same address points to the same physical memory for any valid main memory address 
in the process address space”. In the system svm allocator, we don’t need that A..B to 
C..D mapping.

It looks like you were talking of 1). Were you?

No, even when you fully mirror the whole address space from a process into the 
GPU you still need to enable this somehow with an IOCTL.

And while enabling this you absolutely should specify to which part of the 
address space this mirroring applies and where it maps to.


Let's say we have a hardware platform where both CPU and GPU support a 57-bit virtual 
address range. How do you decide “which part of the address space this mirroring 
applies” to? You have to mirror the whole address space (0 to 2^57-1), don't you? As you 
designed it, the gigantic vm_bind/mirroring happens at process initialization time, and 
at that time you don't know which part of the address space will be used by the gpu 
program.


I see the system svm allocator as just a special case of the driver allocator 
where not fully backed buffer objects are allocated, but rather sparse one 
which are filled and migrated on demand.

The above statement is true to me. We don't have a BO for the system svm allocator. It 
is a sparse one, as we don't map the whole vma to the GPU. Our migration policy decides 
which pages, and how much of the vma, are migrated/mapped to the GPU page table.

The difference between your view and mine is: you want a gigantic vma (created during 
the gigantic vm_bind) to be sparsely populated to the gpu, while I thought of a vma 
(xe_vma in the xekmd code) as the place to store memory attributes (such as caching, 
user preferred placement etc). All those memory attributes are range based, i.e., the 
user can specify that range1 is cached while range2 is uncached. So I don't see how you 
can manage that with the gigantic vma.

Regards,
Oak


Regards,
Christian.





RE: Making drm_gpuvm work across gpu devices

2024-02-23 Thread Zeng, Oak
Hi Christian,

I go back this old email to ask a question.

Quote from your email:
“Those ranges can then be used to implement the SVM feature required for higher 
level APIs and not something you need at the UAPI or even inside the low level 
kernel memory management.”
“SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have 
any influence on the design of the kernel UAPI.”

There are two categories of SVM:

  1.  driver svm allocator: this is implemented in user space, e.g., cudaMallocManaged 
(cuda), zeMemAllocShared (L0) or clSVMAlloc (openCL). Intel already has 
gem_create/vm_bind in xekmd, and our umd implemented clSVMAlloc and zeMemAllocShared on 
top of gem_create/vm_bind. Range A..B of the process address space is mapped into a 
range C..D of the GPU address space, exactly as you said.
  2.  system svm allocator: this doesn't introduce an extra driver API for memory 
allocation. Any valid CPU virtual address can be used directly and transparently in a 
GPU program without any extra driver API call. Quoting from the kernel's 
Documentation/vm/hmm.rst: “Any application memory region (private anonymous, shared 
memory, or regular file backed memory) can be used by a device transparently” and “to 
share the address space by duplicating the CPU page table in the device page table so 
the same address points to the same physical memory for any valid main memory address 
in the process address space”. In the system svm allocator, we don’t need that A..B to 
C..D mapping (a short sketch contrasting the two models follows this list).
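
As a minimal user-space sketch of the contrast (clSVMAlloc is the real OpenCL 2.0 entry 
point; the helper function names are made up for illustration, and a valid cl_context is 
assumed to exist already):

#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>
#include <stdlib.h>

/* Category 1: driver svm allocator. An explicit runtime allocation call;
 * the UMD sets up the GPU VA mapping underneath (gem_create + vm_bind).
 */
static void *alloc_driver_svm(cl_context ctx, size_t size)
{
    return clSVMAlloc(ctx, CL_MEM_READ_WRITE, size, 0);
}

/* Category 2: system svm allocator. No driver call at all; any valid CPU
 * pointer (malloc'ed memory, stack variables, globals) can be handed to a
 * GPU program directly.
 */
static void *alloc_system_svm(size_t size)
{
    return malloc(size);
}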

It looks like you were talking of 1). Were you?

Oak
From: Christian König 
Sent: Wednesday, January 24, 2024 3:33 AM
To: Zeng, Oak ; Danilo Krummrich ; Dave 
Airlie ; Daniel Vetter ; Felix Kuehling 

Cc: Welty, Brian ; dri-devel@lists.freedesktop.org; 
intel...@lists.freedesktop.org; Bommu, Krishnaiah ; 
Ghimiray, Himal Prasad ; 
thomas.hellst...@linux.intel.com; Vishwanathapura, Niranjana 
; Brost, Matthew 
; Gupta, saurabhg 
Subject: Re: Making drm_gpuvm work across gpu devices

Am 23.01.24 um 20:37 schrieb Zeng, Oak:

[SNIP]



Yes most API are per device based.



One exception I know is actually the kfd SVM API. If you look at the svm_ioctl 
function, it is per-process based. Each kfd_process represent a process across 
N gpu devices.

Yeah and that was a big mistake in my opinion. We should really not do that 
ever again.



Need to say, kfd SVM represent a shared virtual address space across CPU and 
all GPU devices on the system. This is by the definition of SVM (shared virtual 
memory). This is very different from our legacy gpu *device* driver which works 
for only one device (i.e., if you want one device to access another device's 
memory, you will have to use dma-buf export/import etc).

Exactly that thinking is what we have currently found as blocker for a 
virtualization projects. Having SVM as device independent feature which somehow 
ties to the process address space turned out to be an extremely bad idea.

The background is that this only works for some use cases but not all of them.

What's working much better is to just have a mirror functionality which says 
that a range A..B of the process address space is mapped into a range C..D of 
the GPU address space.

Those ranges can then be used to implement the SVM feature required for higher 
level APIs and not something you need at the UAPI or even inside the low level 
kernel memory management.

When you talk about migrating memory to a device you also do this on a per 
device basis and *not* tied to the process address space. If you then get 
crappy performance because userspace gave contradicting information where to 
migrate memory then that's a bug in userspace and not something the kernel 
should try to prevent somehow.

[SNIP]


I think if you start using the same drm_gpuvm for multiple devices you

will sooner or later start to run into the same mess we have seen with

KFD, where we moved more and more functionality from the KFD to the DRM

render node because we found that a lot of the stuff simply doesn't work

correctly with a single object to maintain the state.



As I understand it, KFD is designed to work across devices. A single pseudo 
/dev/kfd device represent all hardware gpu devices. That is why during kfd 
open, many pdd (process device data) is created, each for one hardware device 
for this process.

Yes, I'm perfectly aware of that. And I can only repeat myself that I see this 
design as a rather extreme failure. And I think it's one of the reasons why 
NVidia is so dominant with Cuda.

This whole approach KFD takes was designed with the idea of extending the CPU 
process into the GPUs, but this idea only works for a few use cases and is not 
something we should apply to drivers in general.

A very good example are virtualization use cases where you end up with CPU 
address != GPU address because the VAs are actually coming from the guest VM 
and not the host process.

SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have

RE: Making drm_gpuvm work across gpu devices

2024-01-31 Thread Zeng, Oak
Fixed one typo: GPU VA != GPU VA should be GPU VA can != CPU VA

> -Original Message-
> From: Zeng, Oak
> Sent: Wednesday, January 31, 2024 3:17 PM
> To: Daniel Vetter ; David Airlie 
> Cc: Christian König ; Thomas Hellström
> ; Brost, Matthew
> ; Felix Kuehling ; Welty,
> Brian ; dri-devel@lists.freedesktop.org; Ghimiray, 
> Himal
> Prasad ; Bommu, Krishnaiah
> ; Gupta, saurabhg ;
> Vishwanathapura, Niranjana ; intel-
> x...@lists.freedesktop.org; Danilo Krummrich ; Shah, Ankur N
> ; jgli...@redhat.com; rcampb...@nvidia.com;
> apop...@nvidia.com
> Subject: RE: Making drm_gpuvm work across gpu devices
> 
> Hi Sima, Dave,
> 
> I am well aware nouveau driver is not what Nvidia do with their customer. The
> key argument is, can we move forward with the concept shared virtual address
> space b/t CPU and GPU? This is the foundation of HMM. We already have split
> address space support with other driver API. SVM, from its name, it means
> shared address space. Are we allowed to implement another driver model to
> allow SVM work, along with other APIs supporting split address space? Those 
> two
> scheme can co-exist in harmony. We actually have real use cases to use both
> models in one application.
> 
> Hi Christian, Thomas,
> 
> In your scheme, GPU VA can != CPU VA. This does introduce some flexibility. 
> But
> this scheme alone doesn't solve the problem of the proxy process/para-
> virtualization. You will still need a second mechanism to partition GPU VA 
> space
> b/t guest process1 and guest process2 because proxy process (or the host
> hypervisor whatever you call it) use one single gpu page table for all the
> guest/client processes. GPU VA for different guest process can't overlap. If 
> this
> second mechanism exist, we of course can use the same mechanism to partition
> CPU VA space between guest processes as well, then we can still use shared VA
> b/t CPU and GPU inside one process, but process1 and process2's address space
> (for both cpu and gpu) doesn't overlap. This second mechanism is the key to
> solve the proxy process problem, not the flexibility you introduced.
> 
> In practice, your scheme also have a risk of running out of process space 
> because
> you have to partition whole address space b/t processes. Apparently allowing
> each guest process to own the whole process space and using separate GPU/CPU
> page table for different processes is a better solution than using single 
> page table
> and partition process space b/t processes.
> 
> For Intel GPU, para-virtualization (xenGT, see https://github.com/intel/XenGT-
> Preview-kernel. It is similar idea of the proxy process in Flex's email. They 
> are all
> SW-based GPU virtualization technology) is an old project. It is now replaced 
> with
> HW accelerated SRIOV/system virtualization. XenGT is abandoned long time ago.
> So agreed your scheme add some flexibility. The question is, do we have a 
> valid
> use case to use such flexibility? I don't see a single one ATM.
> 
> I also pictured into how to implement your scheme. You basically rejected the
> very foundation of hmm design which is shared address space b/t CPU and GPU.
> In your scheme, GPU VA = CPU VA + offset. In every single place where driver
> need to call hmm facilities such as hmm_range_fault, migrate_vma_setup and in
> mmu notifier call back, you need to offset the GPU VA to get a CPU VA. From
> application writer's perspective, whenever he want to use a CPU pointer in his
> GPU program, he add to add that offset. Do you think this is awkward?
> 
> Finally, to implement SVM, we need to implement some memory hint API which
> applies to a virtual address range across all GPU devices. For example, user 
> would
> say, for this virtual address range, I prefer the backing store memory to be 
> on
> GPU deviceX (because user knows deviceX would use this address range much
> more than other GPU devices or CPU). It doesn't make sense to me to make such
> API per device based. For example, if you tell device A that the preferred
> memory location is device B memory, this doesn't sounds correct to me because
> in your scheme, device A is not even aware of the existence of device B. 
> right?
> 
> Regards,
> Oak
> > -Original Message-
> > From: Daniel Vetter 
> > Sent: Wednesday, January 31, 2024 4:15 AM
> > To: David Airlie 
> > Cc: Zeng, Oak ; Christian König
> > ; Thomas Hellström
> > ; Daniel Vetter ; Brost,
> > Matthew ; Felix Kuehling
> > ; Welty, Brian ; dri-
> > de...@lists.freedesktop.org; Ghimiray, Himal Prasad
> > ; Bommu, Krishnaiah
> > ; Gupta, saurabhg
> ;
> > Vishwanathapura, Niranjana ; intel-
> > x

RE: Making drm_gpuvm work across gpu devices

2024-01-31 Thread Zeng, Oak
Hi Sima, Dave,

I am well aware the nouveau driver is not what Nvidia does with their customers. The 
key argument is: can we move forward with the concept of a shared virtual address space 
b/t CPU and GPU? This is the foundation of HMM. We already have split address space 
support with other driver APIs. SVM, as its name says, means a shared address space. 
Are we allowed to implement another driver model to make SVM work, alongside other APIs 
supporting a split address space? Those two schemes can co-exist in harmony. We 
actually have real use cases that use both models in one application.

Hi Christian, Thomas,

In your scheme, GPU VA can != GPU VA. This does introduce some flexibility. But 
this scheme alone doesn't solve the problem of the proxy 
process/para-virtualization. You will still need a second mechanism to 
partition GPU VA space b/t guest process1 and guest process2 because proxy 
process (or the host hypervisor whatever you call it) use one single gpu page 
table for all the guest/client processes. GPU VA for different guest process 
can't overlap. If this second mechanism exist, we of course can use the same 
mechanism to partition CPU VA space between guest processes as well, then we 
can still use shared VA b/t CPU and GPU inside one process, but process1 and 
process2's address space (for both cpu and gpu) doesn't overlap. This second 
mechanism is the key to solve the proxy process problem, not the flexibility 
you introduced. 

In practice, your scheme also has the risk of running out of process space, because you 
have to partition the whole address space b/t processes. Apparently, allowing each 
guest process to own the whole process space, and using separate GPU/CPU page tables 
for different processes, is a better solution than using a single page table and 
partitioning the process space b/t processes.

For Intel GPU, para-virtualization (xenGT, see 
https://github.com/intel/XenGT-Preview-kernel; it is a similar idea to the proxy 
process in Felix's email, and they are all SW-based GPU virtualization technologies) is 
an old project. It has been replaced with HW accelerated SRIOV/system virtualization, 
and XenGT was abandoned a long time ago. So agreed, your scheme adds some flexibility. 
The question is, do we have a valid use case for such flexibility? I don't see a single 
one ATM.

I also looked into how to implement your scheme. You basically rejected the very 
foundation of the hmm design, which is a shared address space b/t CPU and GPU. In your 
scheme, GPU VA = CPU VA + offset. In every single place where the driver needs to call 
hmm facilities such as hmm_range_fault or migrate_vma_setup, and in the mmu notifier 
callback, you need to offset the GPU VA to get a CPU VA. From the application writer's 
perspective, whenever he wants to use a CPU pointer in his GPU program, he has to add 
that offset. Do you think this is awkward?

Finally, to implement SVM, we need to implement some memory hint API which applies to a 
virtual address range across all GPU devices. For example, the user would say: for this 
virtual address range, I prefer the backing store memory to be on GPU deviceX (because 
the user knows deviceX would use this address range much more than the other GPU 
devices or the CPU). It doesn't make sense to me to make such an API per-device based. 
For example, if you tell device A that the preferred memory location is device B's 
memory, that doesn't sound correct to me, because in your scheme device A is not even 
aware of the existence of device B. Right?
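
Purely as an illustration of the kind of range-based hint being argued for here (a 
hypothetical structure, not an existing drm or xekmd uapi):

#include <stdint.h>

/* Hypothetical range-based memory hint, keyed on a virtual address range
 * rather than on a single device handle. Invented for illustration only.
 */
struct svm_mem_hint {
    uint64_t start;          /* start of the virtual address range */
    uint64_t length;         /* length of the range in bytes */
    int32_t preferred_loc;   /* -1 = system memory, N = GPU device N */
    uint32_t flags;          /* reserved */
};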

Regards,
Oak
> -Original Message-
> From: Daniel Vetter 
> Sent: Wednesday, January 31, 2024 4:15 AM
> To: David Airlie 
> Cc: Zeng, Oak ; Christian König
> ; Thomas Hellström
> ; Daniel Vetter ; Brost,
> Matthew ; Felix Kuehling
> ; Welty, Brian ; dri-
> de...@lists.freedesktop.org; Ghimiray, Himal Prasad
> ; Bommu, Krishnaiah
> ; Gupta, saurabhg ;
> Vishwanathapura, Niranjana ; intel-
> x...@lists.freedesktop.org; Danilo Krummrich ; Shah, Ankur N
> ; jgli...@redhat.com; rcampb...@nvidia.com;
> apop...@nvidia.com
> Subject: Re: Making drm_gpuvm work across gpu devices
> 
> On Wed, Jan 31, 2024 at 09:12:39AM +1000, David Airlie wrote:
> > On Wed, Jan 31, 2024 at 8:29 AM Zeng, Oak  wrote:
> > >
> > > Hi Christian,
> > >
> > >
> > >
> > > Nvidia Nouveau driver uses exactly the same concept of SVM with HMM,
> GPU address in the same process is exactly the same with CPU virtual address. 
> It
> is already in upstream Linux kernel. We Intel just follow the same direction 
> for
> our customers. Why we are not allowed?
> >
> >
> > Oak, this isn't how upstream works, you don't get to appeal to
> > customer or internal design. nouveau isn't "NVIDIA"'s and it certainly
> > isn't something NVIDIA would ever suggest for their customers. We also
> > likely wouldn't just accept NVIDIA's current solution upstream withou

RE: Making drm_gpuvm work across gpu devices

2024-01-30 Thread Zeng, Oak
Hi Christian,

The Nvidia Nouveau driver uses exactly the same concept of SVM with HMM: the GPU 
address in the same process is exactly the same as the CPU virtual address. It is 
already in the upstream Linux kernel. We at Intel just follow the same direction for 
our customers. Why are we not allowed to?

From: Christian König 
Sent: Tuesday, January 30, 2024 3:40 AM
To: Zeng, Oak ; Thomas Hellström 
; Daniel Vetter ; Dave 
Airlie 
Cc: Brost, Matthew ; Felix Kuehling 
; Welty, Brian ; 
dri-devel@lists.freedesktop.org; Ghimiray, Himal Prasad 
; Bommu, Krishnaiah 
; Gupta, saurabhg ; 
Vishwanathapura, Niranjana ; 
intel...@lists.freedesktop.org; Danilo Krummrich 
Subject: Re: Making drm_gpuvm work across gpu devices

Am 30.01.24 um 01:21 schrieb Zeng, Oak:

The example you used to prove that KFD is a design failure, is against *any* 
design which utilize system allocator and hmm. The way that one proxy process 
running on host to handle many guest processes, doesn’t fit into the concept of 
“share address space b/t cpu and gpu”. The shared address space has to be 
within one process. Your proxy process represent many guest processes. It is a 
fundamental conflict.

Also your userptr proposal does’t solve this problem either:
Imagine you have guest process1 mapping CPU address range A…B to GPU address 
range C…D
And you have guest process 2 mapping CPU address range A…B to GPU address range 
C…D, since process 1 and 2 are two different process, it is legal for process 2 
to do the exact same mapping.
Now when gpu shader access address C…D, a gpu page fault happens, what does 
your proxy process do? Which guest process will this fault be directed to and 
handled? Except you have extra information/API to tell proxy process and GPU 
HW, there is no way to figure out.

Well yes, as far as I can see the fundamental design issue in the KFD is that 
it ties together CPU and GPU address space. That came from the implementation 
using the ATS/PRI feature to access the CPU address space from the GPU.

If you don't do ATS/PRI then you don't have that restriction and you can do as 
many GPU address spaces per CPU process as you want. This is just how the hw 
works.

So in your example above when you have multiple mappings for the range A..B you 
also have multiple GPU address spaces and so can distinct where the page fault 
is coming from just by looking at the source of it. All you then need is 
userfaultfd() to forward the fault to the client and you are pretty much done.



Compared to the shared virtual address space concept of HMM, the userptr design 
is nothing new except it allows CPU and GPU to use different address to access 
the same object. If you replace above C…D with A…B, above description becomes a 
description of the “problem” of HMM/shared virtual address design.

Both design has the same difficulty with your example of the special 
virtualization environment setup.

As said, we spent effort scoped the userptr solution some time ago. The problem 
we found enabling userptr with migration were:

  1.  The user interface of userptr is not as convenient as system allocator. 
With the userptr solution, user need to call userptr_ioctl and vm_bind for 
*every* single cpu pointer that he want to use in a gpu program. While with 
system allocator, programmer just use any cpu pointer directly in gpu program 
without any extra driver ioctls.

And I think exactly that is questionable. Why not at least call it for the 
whole address space once during initialization?

> We don’t see the real benefit of using a different Gpu address C…D than 
> the A..B, except you can prove my above reasoning is wrong. In most use 
> cases, you can make GPU C…D == CPU A…B, why bother then?

Because there are cases where this isn't true. We just recently ran into 
exactly that use case with a customer. It might be that you will never need 
this, but again the approach should generally be that the kernel exposes the 
hardware features and as far as I can see the hardware can do this.

And apart from those use cases there is also another good reason for this: CPU 
are going towards 5 level of page tables and GPUs are lacking behind. It's not 
unrealistic to run into cases where you can only mirror parts of the CPU 
address space into the GPU address space because of hardware restrictions. And 
in this case you absolutely do want the flexibility to have different address 
ranges.


> Looked into implementation details, since hmm fundamentally assume a 
> shared virtual address space b/t cpu and device, for the userptr solution to 
> leverage hmm, you need perform address space conversion every time you calls 
> into hmm functions.

Correct, but that is trivial. I mean we do nothing else with VMAs mapping into 
the address space of files on the CPU either.

Which is by the way a good analogy. The CPU address space consists of anonymous 
memory and file mappings, where the later covers both real files on a file 
syst

RE: re:Making drm_gpuvm work across gpu devices

2024-01-29 Thread Zeng, Oak
Hi Chunming,

In this email thread, Christian mentioned a very special virtualization environment 
where multiple guest processes rely on a host proxy process to talk to kfd. Such a 
setup conflicts fundamentally with the SVM concept, as SVM means a shared virtual 
address space within *one* process, while the host proxy process in this setup needs to 
represent multiple guest processes. Thus SVM doesn't work in such a setup.

Normal GPU virtualization such as SRIOV, or system virtualization (such as passing the 
whole GPU device to the guest machine), works perfectly fine with the SVM design.

Regards,
Oak

From: 周春明(日月) 
Sent: Monday, January 29, 2024 10:55 PM
To: Felix Kuehling ; Christian König 
; Daniel Vetter 
Cc: Brost, Matthew ; thomas.hellst...@linux.intel.com; 
Welty, Brian ; Ghimiray, Himal Prasad 
; dri-devel@lists.freedesktop.org; Gupta, 
saurabhg ; Danilo Krummrich ; Zeng, 
Oak ; Bommu, Krishnaiah ; Dave 
Airlie ; Vishwanathapura, Niranjana 
; intel...@lists.freedesktop.org; 毛钧(落提) 

Subject: re:Making drm_gpuvm work across gpu devices


Hi Felix,

Following your thread, you mentioned many times that SVM API cannot run in 
virtualization env, I still don't get it why.
Why you often said need a host proxy process? Cannot HW report page fault 
interrupt per VF via vfid? Isn't it sriov env?

Regargs,
-Chunming
--
From: Felix Kuehling <felix.kuehl...@amd.com>
Sent: Tuesday, January 30, 2024 04:24
To: "Christian König" <christian.koe...@amd.com>; Daniel Vetter <dan...@ffwll.ch>
Cc: "Brost, Matthew" <matthew.br...@intel.com>; thomas.hellst...@linux.intel.com; 
"Welty, Brian" <brian.we...@intel.com>; "Ghimiray, Himal Prasad" 
<himal.prasad.ghimi...@intel.com>; dri-devel@lists.freedesktop.org; "Gupta, saurabhg" 
<saurabhg.gu...@intel.com>; Danilo Krummrich <d...@redhat.com>; "Zeng, Oak" 
<oak.z...@intel.com>; "Bommu, Krishnaiah" <krishnaiah.bo...@intel.com>; Dave Airlie 
<airl...@redhat.com>; "Vishwanathapura, Niranjana" 
<niranjana.vishwanathap...@intel.com>; intel...@lists.freedesktop.org
Subject: Re: Making drm_gpuvm work across gpu devices


On 2024-01-29 14:03, Christian König wrote:
> Am 29.01.24 um 18:52 schrieb Felix Kuehling:
>> On 2024-01-29 11:28, Christian König wrote:
>>> Am 29.01.24 um 17:24 schrieb Felix Kuehling:
>>>> On 2024-01-29 10:33, Christian König wrote:
>>>>> Am 29.01.24 um 16:03 schrieb Felix Kuehling:
>>>>>> On 2024-01-25 13:32, Daniel Vetter wrote:
>>>>>>> On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:
>>>>>>>> Am 23.01.24 um 20:37 schrieb Zeng, Oak:
>>>>>>>>> [SNIP]
>>>>>>>>> Yes most API are per device based.
>>>>>>>>>
>>>>>>>>> One exception I know is actually the kfd SVM API. If you look
>>>>>>>>> at the svm_ioctl function, it is per-process based. Each
>>>>>>>>> kfd_process represent a process across N gpu devices.
>>>>>>>> Yeah and that was a big mistake in my opinion. We should really
>>>>>>>> not do that
>>>>>>>> ever again.
>>>>>>>>
>>>>>>>>> Need to say, kfd SVM represent a shared virtual address space
>>>>>>>>> across CPU and all GPU devices on the system. This is by the
>>>>>>>>> definition of SVM (shared virtual memory). This is very
>>>>>>>>> different from our legacy gpu *device* driver which works for
>>>>>>>>> only one device (i.e., if you want one device to access
>>>>>>>>> another device's memory, you will have to use dma-buf
>>>>>>>>> export/import etc).
>>>>>>>> Exactly that thinking is what we have currently found as
>>>>>>>> blocker for a
>>>>>>>> virtualization projects. Having SVM as device independent
>>>>>>>> feature which
>>>>>>>> somehow ties to the process address space turned out to be an
>>>>>>>> extremely bad
>>>>>>>> idea.
>>>>>>>>
>>>>>&g

RE: Making drm_gpuvm work across gpu devices

2024-01-29 Thread Zeng, Oak
The example you used to prove that KFD is a design failure argues against *any* design 
which utilizes a system allocator and hmm. The way that one proxy process running on 
the host handles many guest processes doesn't fit into the concept of “shared address 
space b/t cpu and gpu”. The shared address space has to be within one process, while 
your proxy process represents many guest processes. It is a fundamental conflict.

Also, your userptr proposal doesn't solve this problem either:
Imagine you have guest process 1 mapping CPU address range A…B to GPU address range C…D.
And you have guest process 2 mapping CPU address range A…B to GPU address range C…D; 
since process 1 and 2 are two different processes, it is legal for process 2 to do the 
exact same mapping.
Now when a gpu shader accesses address C…D, a gpu page fault happens. What does your 
proxy process do? Which guest process will this fault be directed to and handled by? 
Unless you have extra information/an API to tell the proxy process and the GPU HW, 
there is no way to figure it out.

Compared to the shared virtual address space concept of HMM, the userptr design is 
nothing new, except that it allows the CPU and GPU to use different addresses to access 
the same object. If you replace C…D above with A…B, the above description becomes a 
description of the “problem” of the HMM/shared virtual address design.

Both designs have the same difficulty with your example of the special virtualization 
environment setup.

As said, we spent effort scoping the userptr solution some time ago. The problems we 
found when enabling userptr with migration were:

  1.  The user interface of userptr is not as convenient as the system allocator. With 
the userptr solution, the user needs to call userptr_ioctl and vm_bind for *every* 
single cpu pointer that he wants to use in a gpu program, while with the system 
allocator the programmer just uses any cpu pointer directly in the gpu program without 
any extra driver ioctls.
  2.  We don't see the real benefit of using a different GPU address C…D than the A..B, 
unless you can prove my above reasoning wrong. In most use cases you can make GPU C…D 
== CPU A…B, so why bother?
  3.  Looking into the implementation details: since hmm fundamentally assumes a shared 
virtual address space b/t cpu and device, for the userptr solution to leverage hmm you 
need to perform an address space conversion every time you call into the hmm functions 
(see the sketch after this list).
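
A minimal sketch of that conversion, assuming the GPU VA = CPU VA + offset scheme 
(hmm_range_fault() and struct hmm_range are the real kernel interfaces; fault_gpu_range() 
is a made-up helper name, and notifier sequence/locking handling is omitted):

#include <linux/hmm.h>
#include <linux/mmu_notifier.h>

/* With GPU VA = CPU VA + offset, every call into the core-mm facilities has
 * to translate back first, because hmm_range_fault() operates on CPU VAs.
 */
static int fault_gpu_range(struct mmu_interval_notifier *notifier,
                           u64 gpu_start, u64 gpu_end, u64 gpu_to_cpu_offset,
                           unsigned long *pfns)
{
    struct hmm_range range = {
        .notifier      = notifier,
        .start         = gpu_start - gpu_to_cpu_offset, /* the conversion */
        .end           = gpu_end - gpu_to_cpu_offset,
        .hmm_pfns      = pfns,
        .default_flags = HMM_PFN_REQ_FAULT,
    };

    /* Real code must also handle mmu_interval_read_begin()/retry and hold
     * mmap_read_lock() around the fault; omitted in this sketch.
     */
    return hmm_range_fault(&range);
}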

In summary, a GPU device is just a piece of HW to accelerate your CPU program. If the 
HW allows it, it is more convenient to use a shared address space b/t cpu and GPU. On 
old HW (for example, no gpu page fault support, or a gpu with only a very limited 
address space), we can disable the system allocator/SVM. If you use a different address 
space on a modern GPU, why don't you also use a different address space on different 
CPU cores?

Regards,
Oak
From: dri-devel  On Behalf Of 
Christian König
Sent: Monday, January 29, 2024 5:20 AM
To: Zeng, Oak ; Thomas Hellström 
; Daniel Vetter ; Dave 
Airlie 
Cc: Brost, Matthew ; Felix Kuehling 
; Welty, Brian ; 
dri-devel@lists.freedesktop.org; Ghimiray, Himal Prasad 
; Bommu, Krishnaiah 
; Gupta, saurabhg ; 
Vishwanathapura, Niranjana ; 
intel...@lists.freedesktop.org; Danilo Krummrich 
Subject: Re: Making drm_gpuvm work across gpu devices

Well Daniel and Dave noted it as well, so I'm just repeating it: Your design 
choices are not an argument to get something upstream.

It's the job of the maintainers and at the end of the Linus to judge of 
something is acceptable or not.

As far as I can see a good part of this this idea has been exercised lengthy 
with KFD and it turned out to not be the best approach.

So from what I've seen the design you outlined is extremely unlikely to go 
upstream.

Regards,
Christian.
Am 27.01.24 um 03:21 schrieb Zeng, Oak:
Regarding the idea of expanding userptr to support migration, we explored this 
idea long time ago. It provides similar functions of the system allocator but 
its interface is not as convenient as system allocator. Besides the shared 
virtual address space, another benefit of a system allocator is, you can 
offload cpu program to gpu easier, you don’t need to call driver specific API 
(such as register_userptr and vm_bind in this case) for memory allocation.

We also scoped the implementation. It turned out to be big, and not as 
beautiful as hmm. Why we gave up this approach.

From: Christian König <christian.koe...@amd.com>
Sent: Friday, January 26, 2024 7:52 AM
To: Thomas Hellström <thomas.hellst...@linux.intel.com>; Daniel Vetter <dan...@ffwll.ch>
Cc: Brost, Matthew <matthew.br...@intel.com>; Felix Kuehling <felix.kuehl...@amd.com>; 
Welty, Brian <brian.we...@intel.com>; Ghimiray, Himal Prasad 
<himal.prasad.ghimi...@intel.com>; Zeng, Oak <oak.z...@intel.com>; Gupta, saurabhg 
<saurabhg.gu...@intel.com>; Danilo Krummrich <d...@redhat.com>; 
dri-devel@lists.freedesktop.org

RE: [PATCH] drm/xe: Fix a build error

2024-01-29 Thread Zeng, Oak
Hi Thomas,

My patch was based on drm-tip because I found drm-tip is broken

As long as drm-tip can build, I am all good.

Thanks,
Oak

> -Original Message-
> From: Thomas Hellström 
> Sent: Monday, January 29, 2024 3:26 PM
> To: Christian König ; Zeng, Oak
> ; dri-devel@lists.freedesktop.org; intel-
> x...@lists.freedesktop.org
> Cc: amaranath.somalapu...@amd.com; De Marchi, Lucas
> 
> Subject: Re: [PATCH] drm/xe: Fix a build error
> 
> Hi,
> 
> On 1/29/24 17:48, Christian König wrote:
> > Am 27.01.24 um 16:53 schrieb Oak Zeng:
> >> This fixes a build failure on drm-tip. This issue was introduced during
> >> merge of "drm/ttm: replace busy placement with flags v6". For some
> >> reason, the xe_bo.c part of above change is not merged. Manually merge
> >> the missing part to drm_tip
> >
> > Mhm, I provided this as manual fixup for drm-tip in this rerere commit:
> >
> > commit afc5797e8c03bed3ec47a34f2bc3cf03fce24411
> > Author: Christian König 
> > Date:   Thu Jan 25 10:44:54 2024 +0100
> >
> >     2024y-01m-25d-09h-44m-07s UTC: drm-tip rerere cache update
> >
> >     git version 2.34.1
> >
> >
> > And for me compiling xe in drm-tip worked fine after that. No idea why
> > that didn't work for you.
> >
> > Anyway feel free to add my rb to this patch here if it helps in any way.
> >
> > Regards,
> > Christian.
> 
> I reverted that rerere cache update and added another one, so now it
> works. Not sure exactly what the difference was, but the resulting patch
> was for the drm-misc-next merge in my case, and It was for
> drm-xe-something in your case.
> 
> /Thomas
> 
> 
> >
> >>
> >> Signed-off-by: Oak Zeng 
> >> ---
> >>   drivers/gpu/drm/xe/xe_bo.c | 33 +++--
> >>   1 file changed, 15 insertions(+), 18 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> >> index 686d716c5581..d6a193060cc0 100644
> >> --- a/drivers/gpu/drm/xe/xe_bo.c
> >> +++ b/drivers/gpu/drm/xe/xe_bo.c
> >> @@ -38,22 +38,26 @@ static const struct ttm_place sys_placement_flags
> >> = {
> >>   static struct ttm_placement sys_placement = {
> >>   .num_placement = 1,
> >>   .placement = &sys_placement_flags,
> >> -    .num_busy_placement = 1,
> >> -    .busy_placement = &sys_placement_flags,
> >>   };
> >>   -static const struct ttm_place tt_placement_flags = {
> >> -    .fpfn = 0,
> >> -    .lpfn = 0,
> >> -    .mem_type = XE_PL_TT,
> >> -    .flags = 0,
> >> +static const struct ttm_place tt_placement_flags[] = {
> >> +    {
> >> +    .fpfn = 0,
> >> +    .lpfn = 0,
> >> +    .mem_type = XE_PL_TT,
> >> +    .flags = TTM_PL_FLAG_DESIRED,
> >> +    },
> >> +    {
> >> +    .fpfn = 0,
> >> +    .lpfn = 0,
> >> +    .mem_type = XE_PL_SYSTEM,
> >> +    .flags = TTM_PL_FLAG_FALLBACK,
> >> +    }
> >>   };
> >>     static struct ttm_placement tt_placement = {
> >> -    .num_placement = 1,
> >> -    .placement = &tt_placement_flags,
> >> -    .num_busy_placement = 1,
> >> -    .busy_placement = &tt_placement_flags,
> >> +    .num_placement = 2,
> >> +    .placement = tt_placement_flags,
> >>   };
> >>     bool mem_type_is_vram(u32 mem_type)
> >> @@ -230,8 +234,6 @@ static int __xe_bo_placement_for_flags(struct
> >> xe_device *xe, struct xe_bo *bo,
> >>   bo->placement = (struct ttm_placement) {
> >>   .num_placement = c,
> >>   .placement = bo->placements,
> >> -    .num_busy_placement = c,
> >> -    .busy_placement = bo->placements,
> >>   };
> >>     return 0;
> >> @@ -251,7 +253,6 @@ static void xe_evict_flags(struct
> >> ttm_buffer_object *tbo,
> >>   /* Don't handle scatter gather BOs */
> >>   if (tbo->type == ttm_bo_type_sg) {
> >>   placement->num_placement = 0;
> >> -    placement->num_busy_placement = 0;
> >>   return;
> >>   }
> >>   @@ -1391,8 +1392,6 @@ static int __xe_bo_fixed_placement(struct
> >> xe_device *xe,
> >>   bo->placement = (struct ttm_placement) {
> >>   .num_placement = 1,
> >>   .placement = place,
> >> -    .num_busy_placement = 1,
> >> -    .busy_placement = place,
> >>   };
> >>     return 0;
> >> @@ -2150,9 +2149,7 @@ int xe_bo_migrate(struct xe_bo *bo, u32 mem_type)
> >>     xe_place_from_ttm_type(mem_type, );
> >>   placement.num_placement = 1;
> >> -    placement.num_busy_placement = 1;
> >>   placement.placement = 
> >> -    placement.busy_placement = 
> >>     /*
> >>    * Stolen needs to be handled like below VRAM handling if we
> >> ever need
> >


RE: Making drm_gpuvm work across gpu devices

2024-01-29 Thread Zeng, Oak
Hi Christian,

Even though this email thread was started to discuss a shared virtual address space b/t 
multiple GPU devices, I eventually found that you don't even agree with a shared 
virtual address space b/t the CPU and the GPU program. So let's forget about multiple 
GPU devices for now. I will try to explain the shared address space b/t the cpu and one 
gpu.

HMM was designed to solve the GPU programmability problem with a very fundamental 
assumption, which is that the GPU program shares the same virtual address space with 
the CPU program; for example, with HMM any CPU pointer (such as malloc'ed memory, stack 
variables and globals) can be used directly in your GPU shader program. Are you against 
this design goal? HMM is already part of the linux core MM and Linus approved this 
design. CC'ed Jérôme.

Here is an example of how an application can use the system allocator (hmm); I copied 
it from 
https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/.
 CC'ed a few Nvidia folks.

void sortfile(FILE* fp, int N) {
  char* data;
  data = (char*)malloc(N);

  fread(data, 1, N, fp);
  qsort<<<...>>>(data, N, 1, cmp);
  cudaDeviceSynchronize();

  use_data(data);
  free(data);
}

As you can see, the malloc'ed ptr is used directly in the GPU program: no userptr 
ioctl, no vm_bind. This is the model Intel also wants to support, besides AMD and 
Nvidia.

Lastly, nouveau in the kernel already supports hmm and the system allocator. It also 
supports a shared virtual address space b/t the CPU and the GPU program. All the code 
is already merged upstream.


See also comments inline to your questions.

I will address your other email separately.

Regards,
Oak

From: Christian König 
Sent: Monday, January 29, 2024 5:11 AM
To: Zeng, Oak ; David Airlie 
Cc: Ghimiray, Himal Prasad ; 
thomas.hellst...@linux.intel.com; Winiarski, Michal 
; Felix Kuehling ; Welty, 
Brian ; Shah, Ankur N ; 
dri-devel@lists.freedesktop.org; intel...@lists.freedesktop.org; Gupta, 
saurabhg ; Danilo Krummrich ; Daniel 
Vetter ; Brost, Matthew ; Bommu, 
Krishnaiah ; Vishwanathapura, Niranjana 

Subject: Re: Making drm_gpuvm work across gpu devices

Am 26.01.24 um 21:13 schrieb Zeng, Oak:

-Original Message-

From: Christian König <christian.koe...@amd.com>
Sent: Friday, January 26, 2024 5:10 AM
To: Zeng, Oak <oak.z...@intel.com>; David Airlie <airl...@redhat.com>
Cc: Ghimiray, Himal Prasad <himal.prasad.ghimi...@intel.com>; 
thomas.hellst...@linux.intel.com; Winiarski, Michal <michal.winiar...@intel.com>; Felix 
Kuehling <felix.kuehl...@amd.com>; Welty, Brian <brian.we...@intel.com>; Shah, Ankur N 
<ankur.n.s...@intel.com>; dri-devel@lists.freedesktop.org; 
intel...@lists.freedesktop.org; Gupta, saurabhg <saurabhg.gu...@intel.com>; Danilo 
Krummrich <d...@redhat.com>; Daniel Vetter <dan...@ffwll.ch>; Brost, Matthew 
<matthew.br...@intel.com>; Bommu, Krishnaiah <krishnaiah.bo...@intel.com>; 
Vishwanathapura, Niranjana <niranjana.vishwanathap...@intel.com>
Subject: Re: Making drm_gpuvm work across gpu devices



Hi Oak,

you can still use SVM, but it should not be a design criteria for the kernel UAPI. In 
other words the UAPI should be designed in such a way that the GPU virtual address can 
be equal to the CPU virtual address of a buffer, but can also be different to support 
use cases where this isn't the case.


Terminology:

SVM: any technology which can achieve a shared virtual address space b/t cpu and 
devices. The virtual address space can be managed by user space or kernel space. Intel 
implemented a SVM, based on the BO-centric gpu driver (gem-create, vm-bind) where 
virtual address space is managed by UMD.

System allocator: another way of implement SVM. User just use malloc'ed memory for gpu 
submission. Virtual address space is managed by Linux core mm. In practice, we leverage 
HMM to implement system allocator.

This article described details of all those different model: 
https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/

Our programming model allows a mixture use of system allocator (even though system 
allocator is ) and traditional vm_bind (where cpu address can != gpu address). Let me 
re-post the pseudo codes:

 1. Fd0 = open("/dev/dri/render0")
 2. Fd1 = open("/dev/dri/render1")
 3. Fd3 = open("/dev/dri/xe-svm")
 4. Gpu_Vm0 = xe_vm_create(fd0)
 5. Gpu_Vm1 = xe_vm_create(fd1)
 6. Queue0 = xe_exec_queue_create(fd0, gpu_vm0)
 7. Queue1 = xe_exec_queue_create(fd1, gpu_vm1)
 8. ptr = malloc()
 9. bo = xe_bo_create(fd0)
 10. Vm_bind(bo, gpu_vm0, va) // va is from UMD, cpu can access bo with same or 
different va. It is UMD's responsibility that va doesn't conflict with malloc'ed PTRs.

RE: Re: Re: [PATCH 3/5] drm/ttm: replace busy placement with flags v6

2024-01-27 Thread Zeng, Oak
Hi Lucas,

I encountered this build issue as well. I submitted a fix for drm-tip.

Oak

> -Original Message-
> From: dri-devel  On Behalf Of Lucas
> De Marchi
> Sent: Friday, January 26, 2024 5:23 PM
> To: Thomas Hellström 
> Cc: kher...@redhat.com; michel.daen...@mailbox.org;
> nouv...@lists.freedesktop.org; intel-...@lists.freedesktop.org; dri-
> de...@lists.freedesktop.org; Christian König
> ; za...@vmware.com
> Subject: Re: Re: Re: [PATCH 3/5] drm/ttm: replace busy placement with flags v6
> 
> On Fri, Jan 26, 2024 at 04:16:58PM -0600, Lucas De Marchi wrote:
> >On Thu, Jan 18, 2024 at 05:38:16PM +0100, Thomas Hellström wrote:
> >>
> >>On 1/17/24 13:27, Thomas Hellström wrote:
> >>>
> >>>On 1/17/24 11:47, Thomas Hellström wrote:
> Hi, Christian
> 
> Xe changes look good. Will send the series to xe ci to check for
> regressions.
> >>>
> >>>Hmm, there are some checkpatch warnings about author / SOB email
> >>>mismatch,
> >>
> >>With those fixed, this patch is
> >>
> >>Reviewed-by: Thomas Hellström 
> >
> >
> >it actually broke drm-tip now that this is merged:
> >
> >../drivers/gpu/drm/xe/xe_bo.c:41:10: error: ‘struct ttm_placement’ has no
> member named ‘num_busy_placement’; did you mean ‘num_placement’
> >   41 | .num_busy_placement = 1,
> >  |  ^~
> >  |  num_placement
> >../drivers/gpu/drm/xe/xe_bo.c:41:31: error: excess elements in struct 
> >initializer
> [-Werror]
> >   41 | .num_busy_placement = 1,
> >  |   ^
> >
> >
> >Apparently a conflict with another patch that got applied a few days
> >ago: a201c6ee37d6 ("drm/xe/bo: Evict VRAM to TT rather than to system")
> 
> oh, no... apparently that commit is  from a long time ago. The problem
> was that drm-misc-next was not yet in sync with drm-next. Thomas, do you
> have a fixup for this to put in rerere?
> 
> Lucas De Marchi


RE: Making drm_gpuvm work across gpu devices

2024-01-26 Thread Zeng, Oak
Regarding the idea of expanding userptr to support migration: we explored this idea a 
long time ago. It provides similar functionality to the system allocator, but its 
interface is not as convenient as the system allocator's. Besides the shared virtual 
address space, another benefit of a system allocator is that you can offload a cpu 
program to the gpu more easily; you don't need to call a driver-specific API (such as 
register_userptr and vm_bind in this case) for memory allocation.

We also scoped the implementation. It turned out to be big, and not as beautiful as 
hmm. That is why we gave up this approach.

From: Christian König 
Sent: Friday, January 26, 2024 7:52 AM
To: Thomas Hellström ; Daniel Vetter 

Cc: Brost, Matthew ; Felix Kuehling 
; Welty, Brian ; Ghimiray, Himal 
Prasad ; Zeng, Oak ; 
Gupta, saurabhg ; Danilo Krummrich ; 
dri-devel@lists.freedesktop.org; Bommu, Krishnaiah 
; Dave Airlie ; 
Vishwanathapura, Niranjana ; 
intel...@lists.freedesktop.org
Subject: Re: Making drm_gpuvm work across gpu devices

Am 26.01.24 um 09:21 schrieb Thomas Hellström:

Hi, all

On Thu, 2024-01-25 at 19:32 +0100, Daniel Vetter wrote:

On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:

Am 23.01.24 um 20:37 schrieb Zeng, Oak:

[SNIP]

Yes most API are per device based.

One exception I know is actually the kfd SVM API. If you look at the svm_ioctl 
function, it is per-process based. Each kfd_process represent a process across N gpu 
devices.

Yeah and that was a big mistake in my opinion. We should really not do that ever again.

Need to say, kfd SVM represent a shared virtual address space across CPU and all GPU 
devices on the system. This is by the definition of SVM (shared virtual memory). This 
is very different from our legacy gpu *device* driver which works for only one device 
(i.e., if you want one device to access another device's memory, you will have to use 
dma-buf export/import etc).

Exactly that thinking is what we have currently found as blocker for a virtualization 
projects. Having SVM as device independent feature which somehow ties to the process 
address space turned out to be an extremely bad idea.

The background is that this only works for some use cases but not all of them.

What's working much better is to just have a mirror functionality which says that a 
range A..B of the process address space is mapped into a range C..D of the GPU address 
space.

Those ranges can then be used to implement the SVM feature required for higher level 
APIs and not something you need at the UAPI or even inside the low level kernel memory 
management.

When you talk about migrating memory to a device you also do this on a per device basis 
and *not* tied to the process address space. If you then get crappy performance because 
userspace gave contradicting information where to migrate memory then that's a bug in 
userspace and not something the kernel should try to prevent somehow.

[SNIP]

I think if you start using the same drm_gpuvm for multiple devices you will sooner or 
later start to run into the same mess we have seen with KFD, where we moved more and 
more functionality from the KFD to the DRM render node because we found that a lot of 
the stuff simply doesn't work correctly with a single object to maintain the state.

As I understand it, KFD is designed to work across devices. A single pseudo /dev/kfd 
device represent all hardware gpu devices. That is why during kfd open, many pdd 
(process device data) is created, each for one hardware device for this process.

Yes, I'm perfectly aware of that. And I can only repeat myself that I see this design 
as a rather extreme failure. And I think it's one of the reasons why NVidia is so 
dominant with Cuda.

This whole approach KFD takes was designed with the idea of extending the CPU process 
into the GPUs, but this idea only works for a few use cases and is not something we 
should apply to drivers in general.

A very good example are virtualization use cases where you end up with CPU address != 
GPU address because the VAs are actually coming from the guest VM and not the host 
process.

SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have any 
influence on the design of the kernel UAPI.

If you want to do something similar as KFD for Xe I think you need to get explicit 
permission to do this from Dave and Daniel and maybe even Linus.

I think the one and only one exception where an SVM uapi like in kfd makes sense, is if 
the _hardware_ itself, not the software stack defined semantics that you've happened to 
build on top of that hw, enforces a 1:1 mapping with the cpu process address space.

Which means your hardware is using PASID, IOMMU based translation, PCI-ATS (address 
translation services) or whatever your hw calls it and has _no_ device-side pagetables 
on top. Which from what I've seen all devices with device-memory have, simply because 
they need some place

RE: Making drm_gpuvm work across gpu devices

2024-01-26 Thread Zeng, Oak


> -Original Message-
> From: Christian König 
> Sent: Friday, January 26, 2024 5:10 AM
> To: Zeng, Oak ; David Airlie 
> Cc: Ghimiray, Himal Prasad ;
> thomas.hellst...@linux.intel.com; Winiarski, Michal
> ; Felix Kuehling ; Welty,
> Brian ; Shah, Ankur N ; dri-
> de...@lists.freedesktop.org; intel...@lists.freedesktop.org; Gupta, saurabhg
> ; Danilo Krummrich ; Daniel
> Vetter ; Brost, Matthew ; Bommu,
> Krishnaiah ; Vishwanathapura, Niranjana
> 
> Subject: Re: Making drm_gpuvm work across gpu devices
> 
> Hi Oak,
> 
> you can still use SVM, but it should not be a design criteria for the
> kernel UAPI. In other words the UAPI should be designed in such a way
> that the GPU virtual address can be equal to the CPU virtual address of
> a buffer, but can also be different to support use cases where this
> isn't the case.

Terminology:
SVM: any technology which can achieve a shared virtual address space b/t cpu and 
devices. The virtual address space can be managed by user space or kernel space. Intel 
implemented an SVM based on the BO-centric gpu driver (gem-create, vm-bind), where the 
virtual address space is managed by the UMD.
System allocator: another way of implementing SVM. The user just uses malloc'ed memory 
for gpu submission. The virtual address space is managed by the Linux core mm. In 
practice, we leverage HMM to implement the system allocator.
This article describes the details of all those different models: 
https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/

Our programming model allows a mixture use of system allocator (even though 
system allocator is ) and traditional vm_bind (where cpu address can != gpu 
address). Let me re-post the pseudo codes:

1. Fd0 = open("/dev/dri/render0")
2. Fd1 = open("/dev/dri/render1")
3. Fd3 = open("/dev/dri/xe-svm")
4. Gpu_Vm0 = xe_vm_create(fd0)
5. Gpu_Vm1 = xe_vm_create(fd1) 
6. Queue0 = xe_exec_queue_create(fd0, gpu_vm0)
7. Queue1 = xe_exec_queue_create(fd1, gpu_vm1)
8. ptr = malloc()
9. bo = xe_bo_create(fd0)
10. Vm_bind(bo, gpu_vm0, va)//va is from UMD, cpu can access bo with 
same or different va. It is UMD's responsibility that va doesn't conflict with 
malloc'ed PTRs.
11. Xe_exec(queue0, ptr)//submit gpu job which use ptr, on card0
12. Xe_exec(queue1, ptr)//submit gpu job which use ptr, on card1
13. Xe_exec(queue0, va)//submit gpu job which use va, on card0

In the above code, the va used in vm_bind (line 10, Intel's API to bind an object to a 
va for GPU access) can be different from the CPU address when the cpu accesses the same 
object. But whenever the user uses a malloc'ed ptr for GPU submission (lines 11 and 12, 
the so-called system allocator), it implies that the CPU and the GPUs use the same ptr 
for access.

In the above vm_bind, it is the user/UMD's responsibility to guarantee that the vm_bind 
va doesn't conflict with any malloc'ed ptr. Otherwise it is treated as a programming 
error.
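
One way a UMD could meet that responsibility (a user-space sketch only; nothing here is 
an xekmd requirement): reserve the CPU VA range that will be handed to vm_bind with an 
inaccessible anonymous mapping, so the core mm never gives that range out to malloc():

#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

/* Reserve a CPU VA range for GPU-only use. Because the range is now owned
 * by this PROT_NONE mapping, later malloc()/mmap() calls (the system
 * allocator pointers used on lines 11-12 above) can never land inside it,
 * so the vm_bind va on line 10 cannot conflict with them.
 */
static uint64_t reserve_gpu_va(size_t size)
{
    void *va = mmap(NULL, size, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

    if (va == MAP_FAILED) {
        perror("mmap");
        return 0;
    }
    return (uint64_t)(uintptr_t)va;
}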

I think this design still meets your design restrictions. 


> 
> Additionally to what Dave wrote I can summarize a few things I have
> learned while working on the AMD GPU drivers in the last decade or so:
> 
> 1. Userspace requirements are *not* relevant for UAPI or even more
> general kernel driver design.
> 
> 2. What should be done is to look at the hardware capabilities and try
> to expose those in a save manner to userspace.
> 
> 3. The userspace requirements are then used to validate the kernel
> driver and especially the UAPI design to ensure that nothing was missed.
> 
> The consequence of this is that nobody should ever use things like Cuda,
> Vulkan, OpenCL, OpenGL etc.. as argument to propose a certain UAPI design.
> 
> What should be done instead is to say: My hardware works in this and
> that way -> we want to expose it like this -> because that enables us to
> implement the high level API in this and that way.
> 
> Only this gives then a complete picture of how things interact together
> and allows the kernel community to influence and validate the design.

What you described above is mainly bottom up. I know other people do top down, 
or whole system vertical HW-SW co-design. I don't have strong opinion here.

Regards,
Oak

> 
> This doesn't mean that you need to throw away everything, but it gives a
> clear restriction that designs are not nailed in stone and for example
> you can't use something like a waterfall model.
> 
> Going to answer on your other questions separately.
> 
> Regards,
> Christian.
> 
> Am 25.01.24 um 06:25 schrieb Zeng, Oak:
> > Hi Dave,
> >
> > Let me step back. When I wrote " shared virtual address space b/t cpu and 
> > all
> gpu devices is a hard requirement for our system allocator design"

RE: Making drm_gpuvm work across gpu devices

2024-01-25 Thread Zeng, Oak



> -Original Message-
> From: Daniel Vetter 
> Sent: Thursday, January 25, 2024 1:33 PM
> To: Christian König 
> Cc: Zeng, Oak ; Danilo Krummrich ;
> Dave Airlie ; Daniel Vetter ; Felix
> Kuehling ; Welty, Brian ; dri-
> de...@lists.freedesktop.org; intel...@lists.freedesktop.org; Bommu, Krishnaiah
> ; Ghimiray, Himal Prasad
> ; thomas.hellst...@linux.intel.com;
> Vishwanathapura, Niranjana ; Brost,
> Matthew ; Gupta, saurabhg
> 
> Subject: Re: Making drm_gpuvm work across gpu devices
> 
> On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:
> > Am 23.01.24 um 20:37 schrieb Zeng, Oak:
> > > [SNIP]
> > > Yes most API are per device based.
> > >
> > > One exception I know is actually the kfd SVM API. If you look at the 
> > > svm_ioctl
> function, it is per-process based. Each kfd_process represent a process 
> across N
> gpu devices.
> >
> > Yeah and that was a big mistake in my opinion. We should really not do that
> > ever again.
> >
> > > Need to say, kfd SVM represent a shared virtual address space across CPU
> and all GPU devices on the system. This is by the definition of SVM (shared 
> virtual
> memory). This is very different from our legacy gpu *device* driver which 
> works
> for only one device (i.e., if you want one device to access another device's
> memory, you will have to use dma-buf export/import etc).
> >
> > Exactly that thinking is what we have currently found as blocker for a
> > virtualization projects. Having SVM as device independent feature which
> > somehow ties to the process address space turned out to be an extremely bad
> > idea.
> >
> > The background is that this only works for some use cases but not all of
> > them.
> >
> > What's working much better is to just have a mirror functionality which says
> > that a range A..B of the process address space is mapped into a range C..D
> > of the GPU address space.
> >
> > Those ranges can then be used to implement the SVM feature required for
> > higher level APIs and not something you need at the UAPI or even inside the
> > low level kernel memory management.
> >
> > When you talk about migrating memory to a device you also do this on a per
> > device basis and *not* tied to the process address space. If you then get
> > crappy performance because userspace gave contradicting information where
> to
> > migrate memory then that's a bug in userspace and not something the kernel
> > should try to prevent somehow.
> >
> > [SNIP]
> > > > I think if you start using the same drm_gpuvm for multiple devices you
> > > > will sooner or later start to run into the same mess we have seen with
> > > > KFD, where we moved more and more functionality from the KFD to the
> DRM
> > > > render node because we found that a lot of the stuff simply doesn't work
> > > > correctly with a single object to maintain the state.
> > > As I understand it, KFD is designed to work across devices. A single 
> > > pseudo
> /dev/kfd device represent all hardware gpu devices. That is why during kfd 
> open,
> many pdd (process device data) is created, each for one hardware device for 
> this
> process.
> >
> > Yes, I'm perfectly aware of that. And I can only repeat myself that I see
> > this design as a rather extreme failure. And I think it's one of the reasons
> > why NVidia is so dominant with Cuda.
> >
> > This whole approach KFD takes was designed with the idea of extending the
> > CPU process into the GPUs, but this idea only works for a few use cases and
> > is not something we should apply to drivers in general.
> >
> > A very good example are virtualization use cases where you end up with CPU
> > address != GPU address because the VAs are actually coming from the guest
> VM
> > and not the host process.
> >
> > SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have
> > any influence on the design of the kernel UAPI.
> >
> > If you want to do something similar as KFD for Xe I think you need to get
> > explicit permission to do this from Dave and Daniel and maybe even Linus.
> 
> I think the one and only one exception where an SVM uapi like in kfd makes
> sense, is if the _hardware_ itself, not the software stack defined
> semantics that you've happened to build on top of that hw, enforces a 1:1
> mapping with the cpu process address space.
> 
> Which means your hardware is using PASID, IOMMU based translation, PCI-ATS
> (address translation services) or whatever your hw calls it an

RE: Making drm_gpuvm work across gpu devices

2024-01-25 Thread Zeng, Oak


> -Original Message-
> From: Felix Kuehling 
> Sent: Thursday, January 25, 2024 12:16 PM
> To: Zeng, Oak ; Christian König
> ; Danilo Krummrich ; Dave
> Airlie ; Daniel Vetter ; Shah, Ankur N
> ; Winiarski, Michal 
> Cc: Welty, Brian ; dri-devel@lists.freedesktop.org; 
> intel-
> x...@lists.freedesktop.org; Bommu, Krishnaiah ;
> Ghimiray, Himal Prasad ;
> thomas.hellst...@linux.intel.com; Vishwanathapura, Niranjana
> ; Brost, Matthew
> ; Gupta, saurabhg 
> Subject: Re: Making drm_gpuvm work across gpu devices
> 
> 
> On 2024-01-24 20:17, Zeng, Oak wrote:
> >
> > Hi Christian,
> >
> > Even though I mentioned KFD design, I didn’t mean to copy the KFD
> > design. I also had hard time to understand the difficulty of KFD under
> > virtualization environment.
> >
> The problem with virtualization is related to virtualization design
> choices. There is a single process that proxies requests from multiple
> processes in one (or more?) VMs to the GPU driver. That means, we need a
> single process with multiple contexts (and address spaces). One proxy
> process on the host must support multiple guest address spaces.

My first response is: why can't processes in the virtual machine open the
/dev/kfd device themselves?

Also try to picture why the base amdgpu driver (which is per hardware device)
doesn't have this problem... it creates multiple contexts under a single amdgpu
device, each context servicing one guest process.
> 
> I don't know much more than these very high level requirements, and I
> only found out about those a few weeks ago. Due to my own bias I can't
> comment whether there are bad design choices in the proxy architecture
> or in KFD or both. The way we are considering fixing this, is to enable
> creating multiple KFD contexts in the same process. Each of those
> contexts will still represent a shared virtual address space across
> devices (but not the CPU). Because the device address space is not
> shared with the CPU, we cannot support our SVM API in this situation.
> 

One kfd process, multiple contexts, each context with a shared address space
across devices: I do see some complications with that.

> I still believe that it makes sense to have the kernel mode driver aware
> of a shared virtual address space at some level. A per-GPU API and an
> API that doesn't require matching CPU and GPU virtual addresses would
> enable more flexibility at the cost duplicate information tracking for
> multiple devices and duplicate overhead for things like MMU notifiers
> and interval tree data structures. Having to coordinate multiple devices
> with potentially different address spaces would probably make it more
> awkward to implement memory migration. The added flexibility would go
> mostly unused, except in some very niche applications.
> 
> Regards,
>    Felix
> 
> 
> > For us, Xekmd doesn't need to know it is running under bare metal or
> > virtualized environment. Xekmd is always a guest driver. All the
> > virtual address used in xekmd is guest virtual address. For SVM, we
> > require all the VF devices share one single shared address space with
> > guest CPU program. So all the design works in bare metal environment
> > can automatically work under virtualized environment. +@Shah, Ankur N
> > <mailto:ankur.n.s...@intel.com> +@Winiarski, Michal
> > <mailto:michal.winiar...@intel.com> to backup me if I am wrong.
> >
> > Again, shared virtual address space b/t cpu and all gpu devices is a
> > hard requirement for our system allocator design (which means
> > malloc’ed memory, cpu stack variables, globals can be directly used in
> > gpu program. Same requirement as kfd SVM design). This was aligned
> > with our user space software stack.
> >
> > For anyone who want to implement system allocator, or SVM, this is a
> > hard requirement. I started this thread hoping I can leverage the
> > drm_gpuvm design to manage the shared virtual address space (as the
> > address range split/merge function was scary to me and I didn’t want
> > re-invent). I guess my takeaway from this you and Danilo is this
> > approach is a NAK. Thomas also mentioned to me drm_gpuvm is a overkill
> > for our svm address range split/merge. So I will make things work
> > first by manage address range xekmd internally. I can re-look
> > drm-gpuvm approach in the future.
> >
> > Maybe a pseudo user program can illustrate our programming model:
> >
> > Fd0 = open(card0)
> >
> > Fd1 = open(card1)
> >
> > Vm0 =xe_vm_create(fd0) //driver create process xe_svm on the process's
> > first vm_create
> >
> > Vm1 = xe_vm_create(fd1) //driver re-use xe_s

RE: Re: Making drm_gpuvm work across gpu devices

2024-01-25 Thread Zeng, Oak
Hi Chunming,


From: 周春明(日月) 
Sent: Thursday, January 25, 2024 6:01 AM
To: Zeng, Oak ; Christian König ; 
Danilo Krummrich ; Dave Airlie ; Daniel 
Vetter ; Felix Kuehling ; Shah, Ankur 
N ; Winiarski, Michal 
Cc: Brost, Matthew ; thomas.hellst...@linux.intel.com; 
Welty, Brian ; dri-devel@lists.freedesktop.org; 
Ghimiray, Himal Prasad ; Gupta, saurabhg 
; Bommu, Krishnaiah ; 
Vishwanathapura, Niranjana ; 
intel...@lists.freedesktop.org
Subject: Re: Making drm_gpuvm work across gpu devices

[snip]

Fd0 = open(card0)

Fd1 = open(card1)

Vm0 =xe_vm_create(fd0) //driver create process xe_svm on the process's first 
vm_create

Vm1 = xe_vm_create(fd1) //driver re-use xe_svm created above if called from 
same process

Queue0 = xe_exec_queue_create(fd0, vm0)

Queue1 = xe_exec_queue_create(fd1, vm1)

//check p2p capability calling L0 API….

ptr = malloc()//this replace bo_create, vm_bind, dma-import/export

Xe_exec(queue0, ptr)//submit gpu job which use ptr, on card0

Xe_exec(queue1, ptr)//submit gpu job which use ptr, on card1

//Gpu page fault handles memory allocation/migration/mapping to gpu
[snip]
Hi Oak,
From your sample code, you not only need a VA manager across gpu devices, but
also across the cpu, right?

No. Per the feedback from Christian and Danilo, I am giving up the idea of
making drm_gpuvm work across gpu devices. I might want to come back to it
later, but for now it is not the plan anymore.

I think you need a UVA (unified VA) manager in user space and have the range of
drm_gpuvm reserved from the cpu VA space. That way, malloc'ed VAs and gpu VAs
are in the same space and will not conflict. Then, via the HMM mechanism, gpu
devices can safely use the VAs passed from HMM.

Under HMM, both GPU and CPU simply live in the same address space. The same
virtual address represents the same allocation for both the CPU and the GPUs.
See the hmm doc here: https://www.kernel.org/doc/Documentation/vm/hmm.rst. The
user space program doesn't need to reserve any address range. All the address
ranges are managed by the linux kernel core mm. Today the GPU kmd driver keeps
some structure to record address-range-based memory attributes; a minimal
sketch of such bookkeeping is shown below.
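For illustration only, here is a minimal sketch (not from the xe series) of how
such address-range-based attribute bookkeeping could look on top of the
kernel's generic interval tree. Only struct interval_tree_node,
interval_tree_insert() and interval_tree_iter_first() are real kernel APIs; the
xe_svm* names and fields are hypothetical.

#include <linux/interval_tree.h>
#include <linux/kernel.h>
#include <linux/mutex.h>
#include <linux/slab.h>

/* Hypothetical per-range attribute record; the VA space itself stays with core mm. */
struct xe_svm_range_attr {
	struct interval_tree_node it;	/* it.start / it.last span the VA range */
	unsigned int preferred_loc;	/* e.g. system memory or a tile's vram */
	bool migrate_on_atomic;
};

/* Hypothetical per-process container for the attribute ranges. */
struct xe_svm_attrs {
	struct rb_root_cached ranges;
	struct mutex lock;
};

static int xe_svm_set_range_attr(struct xe_svm_attrs *attrs,
				 unsigned long start, unsigned long last,
				 unsigned int preferred_loc)
{
	struct xe_svm_range_attr *r = kzalloc(sizeof(*r), GFP_KERNEL);

	if (!r)
		return -ENOMEM;

	r->it.start = start;
	r->it.last = last;
	r->preferred_loc = preferred_loc;

	mutex_lock(&attrs->lock);
	/* Real code would split/merge overlapping ranges first. */
	interval_tree_insert(&r->it, &attrs->ranges);
	mutex_unlock(&attrs->lock);
	return 0;
}

/* Caller is assumed to hold attrs->lock. */
static struct xe_svm_range_attr *
xe_svm_lookup_range_attr(struct xe_svm_attrs *attrs, unsigned long addr)
{
	struct interval_tree_node *it;

	it = interval_tree_iter_first(&attrs->ranges, addr, addr);
	return it ? container_of(it, struct xe_svm_range_attr, it) : NULL;
}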

Regards,
Oak

By the way, I'm not familiar with drm_gpuvm. Traditionally, gpu drivers often
put the VA manager in user space, so I'm not sure what benefit we get from a
drm_gpuvm invented in kernel space. Can anyone help explain more?

- Chunming
--
From: Zeng, Oak <oak.z...@intel.com>
Sent: Thursday, January 25, 2024 09:17
To: "Christian König" <christian.koe...@amd.com>; Danilo Krummrich
<d...@redhat.com>; Dave Airlie <airl...@redhat.com>; Daniel Vetter
<dan...@ffwll.ch>; Felix Kuehling <felix.kuehl...@amd.com>; "Shah, Ankur N"
<ankur.n.s...@intel.com>; "Winiarski, Michal" <michal.winiar...@intel.com>
Cc: "Brost, Matthew" <matthew.br...@intel.com>;
thomas.hellst...@linux.intel.com; "Welty, Brian" <brian.we...@intel.com>;
dri-devel@lists.freedesktop.org; "Ghimiray, Himal Prasad"
<himal.prasad.ghimi...@intel.com>; "Gupta, saurabhg" <saurabhg.gu...@intel.com>;
"Bommu, Krishnaiah" <krishnaiah.bo...@intel.com>; "Vishwanathapura, Niranjana"
<niranjana.vishwanathap...@intel.com>; intel...@lists.freedesktop.org
Subject: RE: Making drm_gpuvm work across gpu devices

Hi Christian,

Even though I mentioned the KFD design, I didn't mean to copy it. I also had a
hard time understanding the difficulty KFD has under a virtualization
environment.

For us, Xekmd doesn't need to know whether it is running on bare metal or in a
virtualized environment. Xekmd is always a guest driver. All the virtual
addresses used in xekmd are guest virtual addresses. For SVM, we require all
the VF devices to share one single address space with the guest CPU program. So
a design that works in a bare metal environment automatically works in a
virtualized environment. +@Shah, Ankur N<mailto:ankur.n.s...@intel.com>
+@Winiarski, Michal<mailto:michal.winiar...@intel.com> to back me up if I am
wrong.

Again, a shared virtual address space b/t the cpu and all gpu devices is a hard
requirement for our system allocator design (which means malloc'ed memory, cpu
stack variables, and globals can be used directly in a gpu program; the same
requirement as the kfd SVM design). This was aligned with our user space
software stack.

For anyone who wants to implement a system allocator, or SVM, this is a hard
requirement. I started this thread hoping I could leverage the drm_gpuvm des

RE: Making drm_gpuvm work across gpu devices

2024-01-25 Thread Zeng, Oak
Hi Christian,

I got a few more questions inline

From: Christian König 
Sent: Wednesday, January 24, 2024 3:33 AM
To: Zeng, Oak ; Danilo Krummrich ; Dave 
Airlie ; Daniel Vetter ; Felix Kuehling 

Cc: Welty, Brian ; dri-devel@lists.freedesktop.org; 
intel...@lists.freedesktop.org; Bommu, Krishnaiah ; 
Ghimiray, Himal Prasad ; 
thomas.hellst...@linux.intel.com; Vishwanathapura, Niranjana 
; Brost, Matthew 
; Gupta, saurabhg 
Subject: Re: Making drm_gpuvm work across gpu devices

Am 23.01.24 um 20:37 schrieb Zeng, Oak:

[SNIP]



Yes most API are per device based.



One exception I know is actually the kfd SVM API. If you look at the svm_ioctl 
function, it is per-process based. Each kfd_process represent a process across 
N gpu devices.

Yeah and that was a big mistake in my opinion. We should really not do that 
ever again.



Need to say, kfd SVM represent a shared virtual address space across CPU and 
all GPU devices on the system. This is by the definition of SVM (shared virtual 
memory). This is very different from our legacy gpu *device* driver which works 
for only one device (i.e., if you want one device to access another device's 
memory, you will have to use dma-buf export/import etc).

Exactly that thinking is what we have currently found as blocker for a 
virtualization projects. Having SVM as device independent feature which somehow 
ties to the process address space turned out to be an extremely bad idea.

The background is that this only works for some use cases but not all of them.

What's working much better is to just have a mirror functionality which says 
that a range A..B of the process address space is mapped into a range C..D of 
the GPU address space.

Those ranges can then be used to implement the SVM feature required for higher 
level APIs and not something you need at the UAPI or even inside the low level 
kernel memory management.


The whole purpose of the HMM design is to create a shared address space b/t the
cpu and gpu program. See here: https://www.kernel.org/doc/Documentation/vm/hmm.rst.
Mapping process address range A..B to C..D of the GPU address space is exactly
what is referred to as "split address space" in the HMM design.



When you talk about migrating memory to a device you also do this on a per 
device basis and *not* tied to the process address space. If you then get 
crappy performance because userspace gave contradicting information where to 
migrate memory then that's a bug in userspace and not something the kernel 
should try to prevent somehow.

[SNIP]


I think if you start using the same drm_gpuvm for multiple devices you

will sooner or later start to run into the same mess we have seen with

KFD, where we moved more and more functionality from the KFD to the DRM

render node because we found that a lot of the stuff simply doesn't work

correctly with a single object to maintain the state.



As I understand it, KFD is designed to work across devices. A single pseudo 
/dev/kfd device represent all hardware gpu devices. That is why during kfd 
open, many pdd (process device data) is created, each for one hardware device 
for this process.

Yes, I'm perfectly aware of that. And I can only repeat myself that I see this 
design as a rather extreme failure. And I think it's one of the reasons why 
NVidia is so dominant with Cuda.

This whole approach KFD takes was designed with the idea of extending the CPU 
process into the GPUs, but this idea only works for a few use cases and is not 
something we should apply to drivers in general.

A very good example are virtualization use cases where you end up with CPU 
address != GPU address because the VAs are actually coming from the guest VM 
and not the host process.


Are you talking about a general virtualization setup such as SR-IOV or GPU
device pass-through, or something else?

In a typical virtualization setup, a gpu driver such as xekmd or amdgpu is
always a guest driver. In the xekmd case, xekmd doesn't need to know it is
operating in a virtualized environment. So the virtual addresses in the driver
are guest virtual addresses. From the kmd driver's perspective, there is no
difference b/t bare metal and virtualized.

Are you talking about a special virtualized setup such as para-virtualization
or VirGL? I need more background info to understand why you end up with CPU
address != GPU address in SVM....


SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This should not have 
any influence on the design of the kernel UAPI.


Maybe there is a terminology problem here. I agree with what you said above. We
have also achieved the SVM design with our BO-centric drivers such as i915 and
xekmd.

But here we are mainly talking about a system allocator, i.e. using malloc'ed
memory directly in a GPU program, and we want to leverage HMM. A system
allocator can be used to implement the same SVM concept of OpenCL/Cuda/ROCm,
but SVM can also be implemented with a BO-centric driver.


If you want to do something similar as KFD for Xe I think you need to get 
explicit permission to do

RE: Making drm_gpuvm work across gpu devices

2024-01-24 Thread Zeng, Oak
Hi Dave,

Let me step back. When I wrote "shared virtual address space b/t cpu and all
gpu devices is a hard requirement for our system allocator design", I meant
that this is not only Intel's design requirement. Rather, it is a common
requirement for Intel, AMD and Nvidia alike. Take a look at the cuda driver API
definition of cuMemAllocManaged (search for this API on
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM),
which says:

"The pointer is valid on the CPU and on all GPUs in the system that support 
managed memory."

This means the program's virtual address space is shared b/t the CPU and all
GPU devices on the system. The system allocator we are discussing is just one
step beyond cuMemAllocManaged: it allows malloc'ed memory to be shared b/t the
CPU and all GPU devices.

I hope we all agree with this point.

With that, I agree with Christian that in kmd we should make driver code
per-device based instead of managing all devices in one driver instance. Our
system allocator (and generally xekmd) design follows this rule: we make xe_vm
per-device based; one device is *not* aware of other devices' address spaces,
as I explained in the previous email. I started this email thread seeking one
drm_gpuvm instance to cover all GPU devices. I gave up this approach (at least
for now) per Danilo and Christian's feedback: we will continue to have
per-device based drm_gpuvm. I hope this is aligned with Christian, but I will
have to wait for Christian's reply to my previous email.

I hope this clarifies things a little.

Regards,
Oak 

> -Original Message-
> From: dri-devel  On Behalf Of David
> Airlie
> Sent: Wednesday, January 24, 2024 8:25 PM
> To: Zeng, Oak 
> Cc: Ghimiray, Himal Prasad ;
> thomas.hellst...@linux.intel.com; Winiarski, Michal
> ; Felix Kuehling ; Welty,
> Brian ; Shah, Ankur N ; dri-
> de...@lists.freedesktop.org; intel...@lists.freedesktop.org; Gupta, saurabhg
> ; Danilo Krummrich ; Daniel
> Vetter ; Brost, Matthew ; Bommu,
> Krishnaiah ; Vishwanathapura, Niranjana
> ; Christian König
> 
> Subject: Re: Making drm_gpuvm work across gpu devices
> 
> >
> >
> > For us, Xekmd doesn't need to know it is running under bare metal or
> virtualized environment. Xekmd is always a guest driver. All the virtual 
> address
> used in xekmd is guest virtual address. For SVM, we require all the VF devices
> share one single shared address space with guest CPU program. So all the 
> design
> works in bare metal environment can automatically work under virtualized
> environment. +@Shah, Ankur N +@Winiarski, Michal to backup me if I am wrong.
> >
> >
> >
> > Again, shared virtual address space b/t cpu and all gpu devices is a hard
> requirement for our system allocator design (which means malloc’ed memory,
> cpu stack variables, globals can be directly used in gpu program. Same
> requirement as kfd SVM design). This was aligned with our user space software
> stack.
> 
> Just to make a very general point here (I'm hoping you listen to
> Christian a bit more and hoping he replies in more detail), but just
> because you have a system allocator design done, it doesn't in any way
> enforce the requirements on the kernel driver to accept that design.
> Bad system design should be pushed back on, not enforced in
> implementation stages. It's a trap Intel falls into regularly since
> they say well we already agreed this design with the userspace team
> and we can't change it now. This isn't acceptable. Design includes
> upstream discussion and feedback, if you say misdesigned the system
> allocator (and I'm not saying you definitely have), and this is
> pushing back on that, then you have to go fix your system
> architecture.
> 
> KFD was an experiment like this, I pushed back on AMD at the start
> saying it was likely a bad plan, we let it go and got a lot of
> experience in why it was a bad design.
> 
> Dave.



RE: Making drm_gpuvm work across gpu devices

2024-01-24 Thread Zeng, Oak
Thank you Felix for sharing. See a few comments inline

> -Original Message-
> From: Felix Kuehling 
> Sent: Tuesday, January 23, 2024 3:17 PM
> To: Zeng, Oak ; Christian König 
> ;
> Danilo Krummrich ; Dave Airlie ; Daniel
> Vetter 
> Cc: Welty, Brian ; dri-devel@lists.freedesktop.org; 
> intel-
> x...@lists.freedesktop.org; Bommu, Krishnaiah ;
> Ghimiray, Himal Prasad ;
> thomas.hellst...@linux.intel.com; Vishwanathapura, Niranjana
> ; Brost, Matthew
> ; Gupta, saurabhg 
> Subject: Re: Making drm_gpuvm work across gpu devices
> 
> On 2024-01-23 14:37, Zeng, Oak wrote:
> > Thanks Christian. I have some comment inline below.
> >
> > Danilo, can you also take a look and give your feedback? Thanks.
> 
> Sorry, just catching up with this thread now. I'm also not familiar with
> drm_gpuvm.
> 
> Some general observations based on my experience with KFD, amdgpu and
> SVM. With SVM we have a single virtual address space managed in user
> mode (basically using mmap) with attributes per virtual address range
> maintained in the kernel mode driver. Different devices get their
> mappings of pieces of that address space using the same virtual
> addresses. We also support migration to different DEVICE_PRIVATE memory
> spaces.

I think the same virtual address can be mapped into different devices. For
different devices, reading from the same virtual address gives the same
content. The driver either makes the page tables point to the same physical
location, or migrates before mapping. I guess this is what you imply.

> 
> However, we still have page tables managed per device. Each device can
> have different page table formats and layout (e.g. different GPU
> generations in the same system) and the same memory may be mapped with
> different flags on different devices in order to get the right coherence
> behaviour. We also need to maintain per-device DMA mappings somewhere.
> That means, as far as the device page tables are concerned, we still
> have separate address spaces. SVM only adds a layer on top, which
> coordinates these separate device virtual address spaces so that some
> parts of them provide the appearance of a shared virtual address space.
> 

Yes exactly the same understanding.

> At some point you need to decide, where you draw the boundary between
> managing a per-process shared virtual address space and managing
> per-device virtual address spaces. In amdgpu that boundary is currently
> where kfd_svm code calls amdgpu_vm code to manage the per-device page
> tables.

Exactly, in the xe driver it is xe_svm and xe_vm. Just different names.

> 
> In the amdgpu driver, we still have the traditional memory management
> APIs in the render nodes that don't do SVM. They share the device
> virtual address spaces with SVM. We have to be careful that we don't try
> to manage the same device virtual address ranges with these two
> different memory managers. In practice, we let the non-SVM APIs use the
> upper half of the canonical address space, while the lower half can be
> used almost entirely for SVM.

In xekmd, we also have to support a mixed usage of the traditional
gem_create/vm_bind api and malloc.

I am just wondering why you had to split the canonical address space b/t those
two usage models. To illustrate the two usages:

Traditional model:
ptr = mmap(anonymous)
vm_bind(ptr, bo)
submit gpu kernel using ptr

System allocator model:
ptr = mmap(anonymous) or malloc()
submit gpu kernel using ptr

The point is, both ptrs are allocated anonymously in user space inside one
process address space, so there is no collision even if you don't deliberately
divide the canonical address space.

Thanks,
Oak
> 
> Regards,
>    Felix
> 
> 
> >
> >> -Original Message-
> >> From: Christian König 
> >> Sent: Tuesday, January 23, 2024 6:13 AM
> >> To: Zeng, Oak ; Danilo Krummrich ;
> >> Dave Airlie ; Daniel Vetter 
> >> Cc: Welty, Brian ; dri-devel@lists.freedesktop.org; 
> >> intel-
> >> x...@lists.freedesktop.org; Bommu, Krishnaiah ;
> >> Ghimiray, Himal Prasad ;
> >> thomas.hellst...@linux.intel.com; Vishwanathapura, Niranjana
> >> ; Brost, Matthew
> >> 
> >> Subject: Re: Making drm_gpuvm work across gpu devices
> >>
> >> Hi Oak,
> >>
> >> Am 23.01.24 um 04:21 schrieb Zeng, Oak:
> >>> Hi Danilo and all,
> >>>
> >>> During the work of Intel's SVM code, we came up the idea of making
> >> drm_gpuvm to work across multiple gpu devices. See some discussion here:
> >> https://lore.kernel.org/dri-
> >>
> devel/PH7PR11MB70049E7E6A2F40BF6282ECC292742@PH7PR11MB7004.namprd
> >> 11.prod.outlook.com/
> >>> The re

RE: Making drm_gpuvm work across gpu devices

2024-01-24 Thread Zeng, Oak
Hi Christian,

Even though I mentioned the KFD design, I didn't mean to copy it. I also had a
hard time understanding the difficulty KFD has under a virtualization
environment.

For us, Xekmd doesn't need to know whether it is running on bare metal or in a
virtualized environment. Xekmd is always a guest driver. All the virtual
addresses used in xekmd are guest virtual addresses. For SVM, we require all
the VF devices to share one single address space with the guest CPU program. So
a design that works in a bare metal environment automatically works in a
virtualized environment. +@Shah, Ankur N<mailto:ankur.n.s...@intel.com>
+@Winiarski, Michal<mailto:michal.winiar...@intel.com> to back me up if I am
wrong.

Again, a shared virtual address space b/t the cpu and all gpu devices is a hard
requirement for our system allocator design (which means malloc'ed memory, cpu
stack variables, and globals can be used directly in a gpu program; the same
requirement as the kfd SVM design). This was aligned with our user space
software stack.

For anyone who wants to implement a system allocator, or SVM, this is a hard
requirement. I started this thread hoping I could leverage the drm_gpuvm design
to manage the shared virtual address space (the address range split/merge
function was scary to me and I didn't want to re-invent it). I guess my
takeaway from you and Danilo is that this approach is a NAK. Thomas also
mentioned to me that drm_gpuvm is overkill for our svm address range
split/merge. So I will make things work first by managing address ranges
internally in xekmd. I can re-look at the drm-gpuvm approach in the future.

Maybe a pseudo user program can illustrate our programming model:


Fd0 = open(card0)

Fd1 = open(card1)

Vm0 =xe_vm_create(fd0) //driver create process xe_svm on the process's first 
vm_create

Vm1 = xe_vm_create(fd1) //driver re-use xe_svm created above if called from 
same process

Queue0 = xe_exec_queue_create(fd0, vm0)

Queue1 = xe_exec_queue_create(fd1, vm1)

//check p2p capability calling L0 API….

ptr = malloc()//this replace bo_create, vm_bind, dma-import/export

Xe_exec(queue0, ptr)//submit gpu job which use ptr, on card0

Xe_exec(queue1, ptr)//submit gpu job which use ptr, on card1

//Gpu page fault handles memory allocation/migration/mapping to gpu

As you can see from the above model, our design is a little different from the
KFD design. The user needs to explicitly create a gpuvm (vm0 and vm1 above) for
each gpu device. Internally the driver has an xe_svm that represents the shared
address space b/t the cpu and multiple gpu devices. But the end user doesn't
see it and has no need to create xe_svm. The shared virtual address space is
really managed by linux core mm (through the vma struct, mm_struct etc). From
each gpu device's perspective, it just operates under its own gpuvm, not aware
of the existence of other gpuvms, even though in reality all those gpuvms share
the same virtual address space.

See one more comment inline

From: Christian König 
Sent: Wednesday, January 24, 2024 3:33 AM
To: Zeng, Oak ; Danilo Krummrich ; Dave 
Airlie ; Daniel Vetter ; Felix Kuehling 

Cc: Welty, Brian ; dri-devel@lists.freedesktop.org; 
intel...@lists.freedesktop.org; Bommu, Krishnaiah ; 
Ghimiray, Himal Prasad ; 
thomas.hellst...@linux.intel.com; Vishwanathapura, Niranjana 
; Brost, Matthew 
; Gupta, saurabhg 
Subject: Re: Making drm_gpuvm work across gpu devices

Am 23.01.24 um 20:37 schrieb Zeng, Oak:

[SNIP]



Yes most API are per device based.



One exception I know is actually the kfd SVM API. If you look at the svm_ioctl 
function, it is per-process based. Each kfd_process represent a process across 
N gpu devices.

Yeah and that was a big mistake in my opinion. We should really not do that 
ever again.



Need to say, kfd SVM represent a shared virtual address space across CPU and 
all GPU devices on the system. This is by the definition of SVM (shared virtual 
memory). This is very different from our legacy gpu *device* driver which works 
for only one device (i.e., if you want one device to access another device's 
memory, you will have to use dma-buf export/import etc).

Exactly that thinking is what we have currently found as blocker for a 
virtualization projects. Having SVM as device independent feature which somehow 
ties to the process address space turned out to be an extremely bad idea.

The background is that this only works for some use cases but not all of them.

What's working much better is to just have a mirror functionality which says 
that a range A..B of the process address space is mapped into a range C..D of 
the GPU address space.

Those ranges can then be used to implement the SVM feature required for higher 
level APIs and not something you need at the UAPI or even inside the low level 
kernel memory management.

When you talk about migrating memory to a device you also do this on a per 
device basis and *not* tied to the process address space. If you then get 
crappy performance because userspace gave contradicting

RE: Making drm_gpuvm work across gpu devices

2024-01-23 Thread Zeng, Oak
Danilo,

Maybe before I give up, I should also ask: currently drm_gpuvm is designed for
a BO-centric world. Is it easy to make the va range split/merge work simply on
va ranges, without a BO? Conceptually this should work, as we are
merging/splitting virtual address ranges, which can be decoupled completely
from BOs.

> -Original Message-
> From: dri-devel  On Behalf Of Zeng,
> Oak
> Sent: Tuesday, January 23, 2024 10:57 PM
> To: Danilo Krummrich ; Christian König
> ; Dave Airlie ; Daniel Vetter
> ; Felix Kuehling ; Welty, Brian
> 
> Cc: Brost, Matthew ;
> thomas.hellst...@linux.intel.com; dri-devel@lists.freedesktop.org; Ghimiray,
> Himal Prasad ; Gupta, saurabhg
> ; Bommu, Krishnaiah
> ; Vishwanathapura, Niranjana
> ; intel...@lists.freedesktop.org
> Subject: RE: Making drm_gpuvm work across gpu devices
> 
> Thanks a lot Danilo.
> 
> Maybe I wasn't clear enough. In the solution I proposed, each device still 
> have
> separate vm/page tables. Each device still need to manage the mapping, page
> table flags etc. It is just in svm use case, all devices share one drm_gpuvm
> instance. As I understand it, drm_gpuvm's main function is the va range split 
> and
> merging. I don't see why it doesn't work across gpu devices.
> 
> But I read more about drm_gpuvm. Its split merge function takes a
> drm_gem_object parameter, see drm_gpuvm_sm_map_ops_create and
> drm_gpuvm_sm_map. Actually the whole drm_gpuvm is designed for BO-centric
> driver, for example, it has a drm_gpuvm_bo concept to keep track of the
> 1BO:Ngpuva mapping. The whole purpose of leveraging drm_gpuvm is to re-use
> the va split/merge functions for SVM. But in our SVM implementation, there is 
> no
> buffer object at all. So I don't think our SVM codes can leverage drm_gpuvm.
> 
> I will give up this approach, unless Matt or Brian can see a way.
> 
> A few replies inline @Welty, Brian I had more thoughts inline to one of 
> your
> original question....
> 
> > -Original Message-
> > From: Danilo Krummrich 
> > Sent: Tuesday, January 23, 2024 6:57 PM
> > To: Zeng, Oak ; Christian König
> > ; Dave Airlie ; Daniel Vetter
> > ; Felix Kuehling 
> > Cc: Welty, Brian ; dri-devel@lists.freedesktop.org;
> intel-
> > x...@lists.freedesktop.org; Bommu, Krishnaiah ;
> > Ghimiray, Himal Prasad ;
> > thomas.hellst...@linux.intel.com; Vishwanathapura, Niranjana
> > ; Brost, Matthew
> > ; Gupta, saurabhg 
> > Subject: Re: Making drm_gpuvm work across gpu devices
> >
> > Hi Oak,
> >
> > On 1/23/24 20:37, Zeng, Oak wrote:
> > > Thanks Christian. I have some comment inline below.
> > >
> > > Danilo, can you also take a look and give your feedback? Thanks.
> >
> > I agree with everything Christian already wrote. Except for the KFD parts, 
> > which
> > I'm simply not familiar with, I had exactly the same thoughts after reading 
> > your
> > initial mail.
> >
> > Please find some more comments below.
> >
> > >
> > >> -Original Message-
> > >> From: Christian König 
> > >> Sent: Tuesday, January 23, 2024 6:13 AM
> > >> To: Zeng, Oak ; Danilo Krummrich
> ;
> > >> Dave Airlie ; Daniel Vetter 
> > >> Cc: Welty, Brian ; 
> > >> dri-devel@lists.freedesktop.org;
> > intel-
> > >> x...@lists.freedesktop.org; Bommu, Krishnaiah
> > ;
> > >> Ghimiray, Himal Prasad ;
> > >> thomas.hellst...@linux.intel.com; Vishwanathapura, Niranjana
> > >> ; Brost, Matthew
> > >> 
> > >> Subject: Re: Making drm_gpuvm work across gpu devices
> > >>
> > >> Hi Oak,
> > >>
> > >> Am 23.01.24 um 04:21 schrieb Zeng, Oak:
> > >>> Hi Danilo and all,
> > >>>
> > >>> During the work of Intel's SVM code, we came up the idea of making
> > >> drm_gpuvm to work across multiple gpu devices. See some discussion here:
> > >> https://lore.kernel.org/dri-
> > >>
> >
> devel/PH7PR11MB70049E7E6A2F40BF6282ECC292742@PH7PR11MB7004.namprd
> > >> 11.prod.outlook.com/
> > >>>
> > >>> The reason we try to do this is, for a SVM (shared virtual memory across
> cpu
> > >> program and all gpu program on all gpu devices) process, the address 
> > >> space
> > has
> > >> to be across all gpu devices. So if we make drm_gpuvm to work across
> devices,
> > >> then our SVM code can leverage drm_gpuvm as well.
> > >>>
> > >>> At a first look, it seems feasible be

RE: Making drm_gpuvm work across gpu devices

2024-01-23 Thread Zeng, Oak
Thanks a lot Danilo.

Maybe I wasn't clear enough. In the solution I proposed, each device still has
separate vm/page tables. Each device still needs to manage the mappings, page
table flags etc. It is just that in the svm use case, all devices share one
drm_gpuvm instance. As I understand it, drm_gpuvm's main function is va range
splitting and merging. I don't see why it doesn't work across gpu devices.

But I read more about drm_gpuvm. Its split/merge functions take a
drm_gem_object parameter, see drm_gpuvm_sm_map_ops_create and drm_gpuvm_sm_map.
Actually the whole drm_gpuvm is designed for BO-centric drivers; for example,
it has a drm_gpuvm_bo concept to keep track of the 1 BO : N gpuva mapping. The
whole purpose of leveraging drm_gpuvm was to re-use the va split/merge
functions for SVM. But in our SVM implementation, there is no buffer object at
all. So I don't think our SVM code can leverage drm_gpuvm.

I will give up this approach, unless Matt or Brian can see a way.

A few replies inline. @Welty, Brian I had more thoughts inline to one of
your original questions.

> -Original Message-
> From: Danilo Krummrich 
> Sent: Tuesday, January 23, 2024 6:57 PM
> To: Zeng, Oak ; Christian König
> ; Dave Airlie ; Daniel Vetter
> ; Felix Kuehling 
> Cc: Welty, Brian ; dri-devel@lists.freedesktop.org; 
> intel-
> x...@lists.freedesktop.org; Bommu, Krishnaiah ;
> Ghimiray, Himal Prasad ;
> thomas.hellst...@linux.intel.com; Vishwanathapura, Niranjana
> ; Brost, Matthew
> ; Gupta, saurabhg 
> Subject: Re: Making drm_gpuvm work across gpu devices
> 
> Hi Oak,
> 
> On 1/23/24 20:37, Zeng, Oak wrote:
> > Thanks Christian. I have some comment inline below.
> >
> > Danilo, can you also take a look and give your feedback? Thanks.
> 
> I agree with everything Christian already wrote. Except for the KFD parts, 
> which
> I'm simply not familiar with, I had exactly the same thoughts after reading 
> your
> initial mail.
> 
> Please find some more comments below.
> 
> >
> >> -----Original Message-
> >> From: Christian König 
> >> Sent: Tuesday, January 23, 2024 6:13 AM
> >> To: Zeng, Oak ; Danilo Krummrich ;
> >> Dave Airlie ; Daniel Vetter 
> >> Cc: Welty, Brian ; dri-devel@lists.freedesktop.org;
> intel-
> >> x...@lists.freedesktop.org; Bommu, Krishnaiah
> ;
> >> Ghimiray, Himal Prasad ;
> >> thomas.hellst...@linux.intel.com; Vishwanathapura, Niranjana
> >> ; Brost, Matthew
> >> 
> >> Subject: Re: Making drm_gpuvm work across gpu devices
> >>
> >> Hi Oak,
> >>
> >> Am 23.01.24 um 04:21 schrieb Zeng, Oak:
> >>> Hi Danilo and all,
> >>>
> >>> During the work of Intel's SVM code, we came up the idea of making
> >> drm_gpuvm to work across multiple gpu devices. See some discussion here:
> >> https://lore.kernel.org/dri-
> >>
> devel/PH7PR11MB70049E7E6A2F40BF6282ECC292742@PH7PR11MB7004.namprd
> >> 11.prod.outlook.com/
> >>>
> >>> The reason we try to do this is, for a SVM (shared virtual memory across 
> >>> cpu
> >> program and all gpu program on all gpu devices) process, the address space
> has
> >> to be across all gpu devices. So if we make drm_gpuvm to work across 
> >> devices,
> >> then our SVM code can leverage drm_gpuvm as well.
> >>>
> >>> At a first look, it seems feasible because drm_gpuvm doesn't really use 
> >>> the
> >> drm_device *drm pointer a lot. This param is used only for 
> >> printing/warning.
> So I
> >> think maybe we can delete this drm field from drm_gpuvm.
> >>>
> >>> This way, on a multiple gpu device system, for one process, we can have 
> >>> only
> >> one drm_gpuvm instance, instead of multiple drm_gpuvm instances (one for
> >> each gpu device).
> >>>
> >>> What do you think?
> >>
> >> Well from the GPUVM side I don't think it would make much difference if
> >> we have the drm device or not.
> >>
> >> But the experience we had with the KFD I think I should mention that we
> >> should absolutely *not* deal with multiple devices at the same time in
> >> the UAPI or VM objects inside the driver.
> >>
> >> The background is that all the APIs inside the Linux kernel are build
> >> around the idea that they work with only one device at a time. This
> >> accounts for both low level APIs like the DMA API as well as pretty high
> >> level things like for example file system address space etc...
> >
> > Yes most API are per device 

RE: Making drm_gpuvm work across gpu devices

2024-01-23 Thread Zeng, Oak
Thanks Christian. I have some comment inline below.

Danilo, can you also take a look and give your feedback? Thanks.

> -Original Message-
> From: Christian König 
> Sent: Tuesday, January 23, 2024 6:13 AM
> To: Zeng, Oak ; Danilo Krummrich ;
> Dave Airlie ; Daniel Vetter 
> Cc: Welty, Brian ; dri-devel@lists.freedesktop.org; 
> intel-
> x...@lists.freedesktop.org; Bommu, Krishnaiah ;
> Ghimiray, Himal Prasad ;
> thomas.hellst...@linux.intel.com; Vishwanathapura, Niranjana
> ; Brost, Matthew
> 
> Subject: Re: Making drm_gpuvm work across gpu devices
> 
> Hi Oak,
> 
> Am 23.01.24 um 04:21 schrieb Zeng, Oak:
> > Hi Danilo and all,
> >
> > During the work of Intel's SVM code, we came up the idea of making
> drm_gpuvm to work across multiple gpu devices. See some discussion here:
> https://lore.kernel.org/dri-
> devel/PH7PR11MB70049E7E6A2F40BF6282ECC292742@PH7PR11MB7004.namprd
> 11.prod.outlook.com/
> >
> > The reason we try to do this is, for a SVM (shared virtual memory across cpu
> program and all gpu program on all gpu devices) process, the address space has
> to be across all gpu devices. So if we make drm_gpuvm to work across devices,
> then our SVM code can leverage drm_gpuvm as well.
> >
> > At a first look, it seems feasible because drm_gpuvm doesn't really use the
> drm_device *drm pointer a lot. This param is used only for printing/warning. 
> So I
> think maybe we can delete this drm field from drm_gpuvm.
> >
> > This way, on a multiple gpu device system, for one process, we can have only
> one drm_gpuvm instance, instead of multiple drm_gpuvm instances (one for
> each gpu device).
> >
> > What do you think?
> 
> Well from the GPUVM side I don't think it would make much difference if
> we have the drm device or not.
> 
> But the experience we had with the KFD I think I should mention that we
> should absolutely *not* deal with multiple devices at the same time in
> the UAPI or VM objects inside the driver.
> 
> The background is that all the APIs inside the Linux kernel are build
> around the idea that they work with only one device at a time. This
> accounts for both low level APIs like the DMA API as well as pretty high
> level things like for example file system address space etc...

Yes, most APIs are per-device based.

One exception I know of is actually the kfd SVM API. If you look at the
svm_ioctl function, it is per-process based. Each kfd_process represents a
process across N gpu devices. Cc Felix.

Need to say, kfd SVM represents a shared virtual address space across the CPU
and all GPU devices on the system. This is by the definition of SVM (shared
virtual memory). This is very different from our legacy gpu *device* driver
which works for only one device (i.e., if you want one device to access another
device's memory, you have to use dma-buf export/import etc).

We have the same design requirement for SVM. For anyone who wants to implement
the SVM concept, this is a hard requirement. Since drm now has the drm_gpuvm
concept, which strictly speaking is designed for one device, I want to see
whether we can extend drm_gpuvm to make it work for both a single device (as
used in xe) and multiple devices (as will be used in the SVM code). That is why
I brought up this topic.

> 
> So when you have multiple GPUs you either have an inseparable cluster of
> them which case you would also only have one drm_device. Or you have
> separated drm_device which also results in separate drm render nodes and
> separate virtual address spaces and also eventually separate IOMMU
> domains which gives you separate dma_addresses for the same page and so
> separate GPUVM page tables

I am thinking we can still have each device keep its separate drm_device/render
node/iommu domains/gpu page table, just as we have today. I do not plan to
change this picture.

But the virtual address space will support two modes of operation:
1. one drm_gpuvm per device. This is when svm is not in the picture.
2. all devices in the process share one single drm_gpuvm, when svm is in the
picture. In the xe driver design, we have to support a mixed use of the legacy
mode (such as gem_create and vm_bind) and svm (such as malloc'ed memory for gpu
submission). So whenever SVM is in the picture, we want one single process
address space across all devices. Drm_gpuvm doesn't need to be aware of those
two operation modes; it is the driver's responsibility to use the right mode.

For example, in mode #1, a driver's vm structure (such as xe_vm) can inherit
from drm_gpuvm. In mode #2, a driver's svm structure (xe_svm in this series:
https://lore.kernel.org/dri-devel/20240117221223.18540-1-oak.z...@intel.com/)
can inherit from drm_gpuvm while each xe_vm (still a per-device based struct)
will just have a pointer to the drm_gp
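For illustration only, a minimal sketch of the two modes just described. struct
drm_gpuvm is the real GPUVM base structure and xe_vm exists in the xe driver,
but the exact layouts, the xe_svm struct and the mode split shown here are
hypothetical and merely mirror the proposal in this mail.

#include <drm/drm_gpuvm.h>
#include <linux/list.h>

/* Mode #2 only: one process-wide VA space shared by all devices. */
struct xe_svm {
	struct drm_gpuvm gpuvm;		/* VA split/merge state for the whole process */
	struct list_head vm_list;	/* every per-device xe_vm attached to it */
};

struct xe_vm {
	/* Mode #1 (no SVM): this embedded gpuvm manages the device's VA space. */
	struct drm_gpuvm gpuvm;
	/*
	 * Mode #2 (SVM): VA ranges are managed by svm->gpuvm instead; this
	 * pointer is NULL in mode #1. Per-device page tables, dma mappings
	 * and so on stay in xe_vm in both modes.
	 */
	struct xe_svm *svm;
	struct list_head svm_link;
};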

Making drm_gpuvm work across gpu devices

2024-01-22 Thread Zeng, Oak
Hi Danilo and all,

During the work of Intel's SVM code, we came up the idea of making drm_gpuvm to 
work across multiple gpu devices. See some discussion here: 
https://lore.kernel.org/dri-devel/ph7pr11mb70049e7e6a2f40bf6282ecc292...@ph7pr11mb7004.namprd11.prod.outlook.com/

The reason we try to do this is, for a SVM (shared virtual memory across cpu 
program and all gpu program on all gpu devices) process, the address space has 
to be across all gpu devices. So if we make drm_gpuvm to work across devices, 
then our SVM code can leverage drm_gpuvm as well.

At a first look, it seems feasible because drm_gpuvm doesn't really use the 
drm_device *drm pointer a lot. This param is used only for printing/warning. So 
I think maybe we can delete this drm field from drm_gpuvm.

This way, on a multiple gpu device system, for one process, we can have only 
one drm_gpuvm instance, instead of multiple drm_gpuvm instances (one for each 
gpu device).

What do you think?

Thanks,
Oak


RE: [PATCH 21/23] drm/xe/svm: GPU page fault support

2024-01-22 Thread Zeng, Oak


> -Original Message-
> From: Welty, Brian 
> Sent: Monday, January 22, 2024 9:06 PM
> To: Zeng, Oak ; dri-devel@lists.freedesktop.org; intel-
> x...@lists.freedesktop.org
> Cc: Bommu, Krishnaiah ; Ghimiray, Himal Prasad
> ; thomas.hellst...@linux.intel.com;
> Vishwanathapura, Niranjana ; Brost,
> Matthew 
> Subject: Re: [PATCH 21/23] drm/xe/svm: GPU page fault support
> 
> 
> On 1/17/2024 2:12 PM, Oak Zeng wrote:
> > On gpu page fault of a virtual address, try to fault in the virtual
> > address range to gpu page table and let HW to retry on the faulty
> > address.
> >
> > Right now, we always migrate the whole vma which contains the fault
> > address to GPU. This is subject to change of a more sophisticated
> > migration policy: decide whether to migrate memory to GPU or map
> > in place with CPU memory; migration granularity.
> >
> > There is rather complicated locking strategy in this patch. See more
> > details in xe_svm_doc.h, lock design section.
> >
> > Signed-off-by: Oak Zeng 
> > Cc: Niranjana Vishwanathapura 
> > Cc: Matthew Brost 
> > Cc: Thomas Hellström 
> > Cc: Brian Welty 
> > ---
> >   drivers/gpu/drm/xe/xe_gt_pagefault.c |   7 ++
> >   drivers/gpu/drm/xe/xe_svm.c  | 116 +++
> >   drivers/gpu/drm/xe/xe_svm.h  |   6 ++
> >   drivers/gpu/drm/xe/xe_svm_range.c|  43 ++
> >   4 files changed, 172 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > index 467d68f8332e..462603abab8a 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > +++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > @@ -22,6 +22,7 @@
> >   #include "xe_pt.h"
> >   #include "xe_trace.h"
> >   #include "xe_vm.h"
> > +#include "xe_svm.h"
> >
> >   enum fault_type {
> > NOT_PRESENT = 0,
> > @@ -131,6 +132,11 @@ static int handle_pagefault(struct xe_gt *gt, struct
> pagefault *pf)
> > if (!vm || !xe_vm_in_fault_mode(vm))
> > return -EINVAL;
> >
> > +   if (vm->svm) {
> > +   ret = xe_svm_handle_gpu_fault(vm, gt, pf);
> > +   goto put_vm;
> > +   }
> > +
> >   retry_userptr:
> > /*
> >  * TODO: Avoid exclusive lock if VM doesn't have userptrs, or
> > @@ -219,6 +225,7 @@ static int handle_pagefault(struct xe_gt *gt, struct
> pagefault *pf)
> > if (ret >= 0)
> > ret = 0;
> > }
> > +put_vm:
> > xe_vm_put(vm);
> >
> > return ret;
> > diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> > index 0c13690a19f5..1ade8d7f0ab2 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.c
> > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > @@ -12,6 +12,7 @@
> >   #include "xe_svm.h"
> >   #include 
> >   #include 
> > +#include 
> >   #include "xe_pt.h"
> >   #include "xe_assert.h"
> >   #include "xe_vm_types.h"
> > @@ -206,3 +207,118 @@ static int svm_populate_range(struct xe_svm_range
> *svm_range,
> > kvfree(pfns);
> > return ret;
> >   }
> > +
> > +/**
> > + * svm_access_allowed() -  Determine whether read or/and write to vma is
> allowed
> > + *
> > + * @write: true means a read and write access; false: read only access
> > + */
> > +static bool svm_access_allowed(struct vm_area_struct *vma, bool write)
> > +{
> > +   unsigned long access = VM_READ;
> > +
> > +   if (write)
> > +   access |= VM_WRITE;
> > +
> > +   return (vma->vm_flags & access) == access;
> > +}
> > +
> > +/**
> > + * svm_should_migrate() - Determine whether we should migrate a range to
> > + * a destination memory region
> > + *
> > + * @range: The svm memory range to consider
> > + * @dst_region: target destination memory region
> > + * @is_atomic_fault: Is the intended migration triggered by a atomic 
> > access?
> > + * On some platform, we have to migrate memory to guarantee atomic
> correctness.
> > + */
> > +static bool svm_should_migrate(struct xe_svm_range *range,
> > +   struct xe_mem_region *dst_region, bool
> is_atomic_fault)
> > +{
> > +   return true;
> > +}
> > +
> > +/**
> > + * xe_svm_handle_gpu_fault() - gpu page fault handler for svm subsystem
> > + *
> > + * @vm: The vm of the fault.
> > + * @gt: The gt hard

RE: [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices

2023-11-30 Thread Zeng, Oak
See inline comments

> -Original Message-
> From: dri-devel  On Behalf Of
> zhuweixi
> Sent: Thursday, November 30, 2023 5:48 AM
> To: Christian König ; Zeng, Oak
> ; Christian König ; linux-
> m...@kvack.org; linux-ker...@vger.kernel.org; a...@linux-foundation.org;
> Danilo Krummrich ; Dave Airlie ; Daniel
> Vetter 
> Cc: tvrtko.ursu...@linux.intel.com; rcampb...@nvidia.com; apop...@nvidia.com;
> z...@nvidia.com; weixi@openeuler.sh; jhubb...@nvidia.com; intel-
> g...@lists.freedesktop.org; mhairgr...@nvidia.com; Wang, Zhi A
> ; xinhui@amd.com; amd-...@lists.freedesktop.org;
> jgli...@redhat.com; dri-devel@lists.freedesktop.org; j...@nvidia.com; Vivi,
> Rodrigo ; alexander.deuc...@amd.com;
> felix.kuehl...@amd.com; intel-gvt-...@lists.freedesktop.org;
> ogab...@kernel.org; leo...@nvidia.com; mgor...@suse.de
> Subject: RE: [RFC PATCH 0/6] Supporting GMEM (generalized memory
> management) for external memory devices
> 
> Glad to know that there is a common demand for a new syscall like hmadvise(). 
> I
> expect it would also be useful for homogeneous NUMA cases. Credits to
> cudaMemAdvise() API which brought this idea to GMEM's design.
> 
> To answer @Oak's questions about GMEM vs. HMM,
> 
> Here is the major difference:
>   GMEM's main target is to stop drivers from reinventing MM code, while
> HMM/MMU notifiers provide a compatible struct page solution and a
> coordination mechanism for existing device driver MMs that requires adding
> extra code to interact with CPU MM.
> 
> A straightforward qualitative result for the main target: after integrating 
> Huawei's
> Ascend NPU driver with GMEM's interface, 30,000 lines of MM code were cut,
> leaving <100 lines invoking GMEM interface and 3,700 lines implementing 
> vendor-
> specific functions. Some code from the 3,700 lines should be further moved to
> GMEM as a generalized feature like device memory oversubscription, but not
> included in this RFC patch yet.
> 
> A list of high-level differences:
>   1. With HMM/MMU notifiers, drivers need to first implement a full MM
> subsystem. With GMEM, drivers can reuse Linux's core MM.

A full mm subsystem essentially provides the functions below:

Physical memory management: neither your approach nor the hmm-based solution
provides device physical memory management. You mentioned you have a plan, but
at least for now the driver needs to manage device physical memory.

Virtual address space management: both approaches leverage the linux core mm
(vma) for this.

Data eviction, migration: with hmm, the driver needs to implement this. It is
not clear whether gmem has this function. I guess even if gmem has it, it might
be a slow cpu data copy, compared to a modern gpu's fast copy engine.

Device page table update, va-pa mapping: I think this is the driver's
responsibility in both approaches.

So from the point of view of re-using core MM, I don't see a big difference.
Maybe you did it more elegantly. I think it is very possible that with your
approach the driver can be simpler, with less code.

> 
>   2. HMM encodes device mapping information in the CPU arch-dependent PTEs,
> while GMEM proposes an abstraction layer in vm_object. Since GMEM's
> approach further decouples the arch-related stuff, drivers do not need to
> implement separate code for X86/ARM and etc.

I don't understand this... With hmm, when a virtual address range's backing
store is in device memory, the cpu pte is encoded to point to device memory.
The device page table is also encoded to point to the same device memory
location. But since device memory is not accessible to the CPU
(DEVICE_PRIVATE), when the cpu accesses this virtual address there is a cpu
page fault. The device mapping info is still in the device page table, not in
the cpu ptes. A simplified sketch of that cpu fault path is shown below.

I do not see why, with hmm, the driver needs to implement x86/arm-specific
code... The driver only takes care of the device page table. Hmm/core mm takes
care of the cpu page table, right?
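For illustration only, a heavily simplified sketch of that cpu fault path: when
the CPU touches a DEVICE_PRIVATE page, core mm invokes the pgmap's
migrate_to_ram() callback and the driver migrates the data back to system
memory with the migrate_vma helpers. The dev_pagemap_ops callback and the
migrate_vma_setup()/migrate_vma_pages()/migrate_vma_finalize() API are real
kernel interfaces; the device-side copy, error handling and the example_*
naming here are elided or made up.

#include <linux/memremap.h>
#include <linux/migrate.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

static vm_fault_t example_devmem_migrate_to_ram(struct vm_fault *vmf)
{
	unsigned long src = 0, dst = 0;
	vm_fault_t ret = 0;
	struct page *dpage;
	struct migrate_vma mig = {
		.vma		= vmf->vma,
		.start		= vmf->address,
		.end		= vmf->address + PAGE_SIZE,
		.src		= &src,
		.dst		= &dst,
		.pgmap_owner	= vmf->page->pgmap->owner,
		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
		.fault_page	= vmf->page,
	};

	/* Collect the faulting device-private page for migration to system ram. */
	if (migrate_vma_setup(&mig))
		return VM_FAULT_SIGBUS;
	if (!(src & MIGRATE_PFN_MIGRATE))
		goto out;			/* nothing to migrate */

	dpage = alloc_page(GFP_HIGHUSER);	/* system memory destination */
	if (!dpage) {
		ret = VM_FAULT_OOM;
		goto out;
	}
	lock_page(dpage);

	/* Device-specific copy from the vram page to dpage goes here (elided). */

	dst = migrate_pfn(page_to_pfn(dpage));
	migrate_vma_pages(&mig);		/* install the new cpu mapping */
out:
	migrate_vma_finalize(&mig);
	return ret;
}

/* Registered when the driver creates its DEVICE_PRIVATE dev_pagemap. */
static const struct dev_pagemap_ops example_devmem_pagemap_ops = {
	.migrate_to_ram = example_devmem_migrate_to_ram,
};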

> 
>   3. MMU notifiers register hooks at certain core MM events, while GMEM
> declares basic functions and internally invokes them. GMEM requires less from
> the driver side -- no need to understand what core MM behaves at certain MMU
> events. GMEM also expects fewer bugs than MMU notifiers: implementing basic
> operations with standard declarations vs. implementing whatever random device
> MM logic in MMU notifiers.

This seems true to me. I feel the mmu notifier thing, especially the 
synchronization/lock design (those sequence numbers, interacting with driver 
lock, and the mmap lock) are very complicated. I indeed spent time to 
understand the specification documented in hmm.rst...

Your approach seems better.

> 
>   4. GMEM plans to support a more lightweight physical memory management.
> The discussion about this part can be found in my cover letter. The question 
> is
> whether struct page should be compatible (

RE: [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices

2023-11-29 Thread Zeng, Oak
Hi Weixi,

Even though Christian has listed reasons for rejecting this proposal (yes, they
are very reasonable to me), I would like to keep an open mind and further
explore the possibility here. Since current GPU drivers use an hmm-based
implementation (AMD and NV have done this; at Intel we are catching up), I want
to explore how much we can benefit from the proposed approach and how your
approach can solve some pain points of our development. So basically what I am
questioning here is: what is the advantage of your approach over hmm?

To implement a UVM (unified virtual address space b/t cpu and gpu device) with
hmm, the driver essentially needs to implement the functions below:

1. Device page table update. Your approach requires the same because this is
device-specific code.

2. Some migration functions to migrate memory b/t system memory and GPU local
memory. My understanding is, even though you generalized this a bit, such as
the modified cpu page fault path and the provided "general" gm_dev_fault
handler... the device driver still needs to provide migration functions,
because migration functions have to be device specific (i.e., using the device
dma/copy engine for performance). Right?

3. GPU physical memory management. This part is now in drm/buddy, shared by all
drivers. I think with your approach, the driver still needs to provide callback
functions to allocate/free physical pages. Right? Or do you let the linux core
mm buddy manage device memory directly?

4. madvise/hints/virtual address range management. This has been a pain point
for us. Right now the device driver has to maintain certain virtual address
range data structures to hold hints and other virtual-address-range-based
memory attributes. The driver needs to sync with linux vmas. The driver needs
to explicitly deal with range split/merging... HMM doesn't provide support in
this area. Your approach seems cleaner/simpler to me...


So above I have examined some key factors of a gpu UVM memory manager. I think
for #1 and #2, hmm provides pretty good abstractions/tools for address space
mirroring and migration helpers. For #3, since we have a common drm/buddy
layer, I don't think it is a big problem for driver writers now.

I do see that #4 is something you solved more beautifully, though it requires a
new system call.

Oak


> -Original Message-
> From: dri-devel  On Behalf Of
> Christian König
> Sent: Tuesday, November 28, 2023 8:09 AM
> To: Weixi Zhu ; linux...@kvack.org; linux-
> ker...@vger.kernel.org; a...@linux-foundation.org; Danilo Krummrich
> ; Dave Airlie ; Daniel Vetter
> 
> Cc: dri-devel@lists.freedesktop.org; leo...@nvidia.com; apop...@nvidia.com;
> amd-...@lists.freedesktop.org; mgor...@suse.de; z...@nvidia.com; Wang, Zhi
> A ; rcampb...@nvidia.com; j...@nvidia.com;
> weixi@openeuler.sh; jhubb...@nvidia.com; intel-...@lists.freedesktop.org;
> mhairgr...@nvidia.com; jgli...@redhat.com; Vivi, Rodrigo
> ; intel-gvt-...@lists.freedesktop.org;
> tvrtko.ursu...@linux.intel.com; felix.kuehl...@amd.com; xinhui@amd.com;
> alexander.deuc...@amd.com; ogab...@kernel.org
> Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory
> management) for external memory devices
> 
> Adding a few missing important people to the explicit to list.
> 
> Am 28.11.23 um 13:50 schrieb Weixi Zhu:
> > The problem:
> >
> > Accelerator driver developers are forced to reinvent external MM subsystems
> > case by case, because Linux core MM only considers host memory resources.
> > These reinvented MM subsystems have similar orders of magnitude of LoC as
> > Linux MM (80K), e.g. Nvidia-UVM has 70K, AMD GPU has 14K and Huawei NPU
> has
> > 30K. Meanwhile, more and more vendors are implementing their own
> > accelerators, e.g. Microsoft's Maia 100. At the same time,
> > application-level developers suffer from poor programmability -- they must
> > consider parallel address spaces and be careful about the limited device
> > DRAM capacity. This can be alleviated if a malloc()-ed virtual address can
> > be shared by the accelerator, or the abundant host DRAM can further
> > transparently backup the device local memory.
> >
> > These external MM systems share similar mechanisms except for the
> > hardware-dependent part, so reinventing them is effectively introducing
> > redundant code (14K~70K for each case). Such developing/maintaining is not
> > cheap. Furthermore, to share a malloc()-ed virtual address, device drivers
> > need to deeply interact with Linux MM via low-level MM APIs, e.g. MMU
> > notifiers/HMM. This raises the bar for driver development, since developers
> > must understand how Linux MM works. Further, it creates code maintenance
> > problems -- any changes to Linux MM potentially require coordinated changes
> > to accelerator drivers using low-level MM APIs.
> >
> > Putting a cache-coherent bus between host and device will not make these
> > external MM subsystems disappear. For example, a throughput-oriented
> > accelerator will not tolerate 

RE: [RFC 03/11] drm: introduce drm evictable LRU

2023-11-03 Thread Zeng, Oak


From: Christian König 
Sent: Friday, November 3, 2023 5:36 AM
To: Zeng, Oak ; dri-devel@lists.freedesktop.org; 
intel...@lists.freedesktop.org
Cc: thomas.hellst...@linux.intel.com; felix.kuehl...@amd.com; 
airl...@gmail.com; Welty, Brian 
Subject: Re: [RFC 03/11] drm: introduce drm evictable LRU

Am 03.11.23 um 05:04 schrieb Zeng, Oak:[SNIP]

I also want to have a more advanced iterator at some point where we grab
the BO lock for keeping a reference into the LRU list. Not sure how to
do this if we don't have the BO here any more.

Need to think about that further,

Don't quite get what you want to do with the advanced iterator. But with this 
work, the lru entity is a base class of ttm_resource or any other resource 
struct in hmm/svm. The LRU is decoupled from the bo concept - this is why this 
lru can be shared with the svm code, which is bo-less.

This is just a crazy idea I had because TTM tends to perform badly on certain 
tasks.

When we start to evict something, we use a callback which indicates whether an 
eviction is valuable or not. So it can happen that we have to skip quite a 
bunch of BOs on the LRU until we find one which is worth evicting.

Now it can be that the first eviction doesn't make enough room to fulfill the 
allocation requirement; in this case we currently start over at the beginning, 
searching for some BO to evict.

I want to avoid this by being able to have cursors into the LRU, e.g. the next 
BO which can't move until we have evicted the current one.


Got you now. I didn't know about this problem, so I didn't try to fix this 
efficiency problem in this series. Theoretically I think we can fix this issue 
as follows: change ttm_mem_evict_first to ttm_mem_evict_first_n and add a 
parameter to this function to specify how much room we want to yield; then we 
evict the first n objects to make enough room before returning, or fail if we 
can't make enough room. This scheme would need the caller of 
ttm_mem_evict_first to tell how much room it needs - I think that is reasonable.


BTW: How do you handle eviction here? I mean we can't call the evict callback 
with the spinlock held easily?

I was actually struggling when I refactored the ttm_mem_evict_first function. I 
moved this function to the lru manager and abstracted 3 callback functions 
(evict_allowable/valuable, evict_entity, evict_busy_entity) - those will need a 
relook when the hmm/svm code comes into the picture. I tried not to change any 
logic of this function - people have worked on this function over the past 15 
years, so it is better to be very careful.

So in my current implementation, the spinlock is held when calling the 
evict_entity callback. The spinlock is unlocked before calling ttm_bo_evict in 
the evict_entity callback and re-taken if we need to move the entity in the lru 
list. See details in patch 4 and patch 10. So it keeps exactly the original 
call sequence, but it does look awkward.

But I think you are right. We can release the spinlock in the 
drm_lru_evict_first function before calling the evict callback.
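Roughly like this (a sketch only; the function and callback names follow this 
RFC, and keeping the entity valid across the unlock is glossed over here):

int drm_lru_evict_first(struct drm_lru_manager *mgr, u32 mem_type)
{
	struct drm_lru_entity *entity;
	int ret;

	spin_lock(mgr->lru_lock);
	entity = drm_lru_first(mgr, mem_type);	/* hypothetical iterator */
	if (!entity) {
		spin_unlock(mgr->lru_lock);
		return -ENOSPC;
	}

	/* Drop the lock before the potentially sleeping eviction. */
	spin_unlock(mgr->lru_lock);

	ret = entity->evict_func(entity);	/* e.g. ends up in ttm_bo_evict() */

	/* Retake the lock only if the entity has to be moved on the LRU. */
	if (!ret) {
		spin_lock(mgr->lru_lock);
		list_move_tail(&entity->lru, &mgr->lru[mem_type]);
		spin_unlock(mgr->lru_lock);
	}

	return ret;
}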

Oak


Christian.






Oak




RE: [RFC 03/11] drm: introduce drm evictable LRU

2023-11-02 Thread Zeng, Oak


> -Original Message-
> From: Christian König 
> Sent: Thursday, November 2, 2023 9:24 AM
> To: Zeng, Oak ; dri-devel@lists.freedesktop.org; intel-
> x...@lists.freedesktop.org
> Cc: thomas.hellst...@linux.intel.com; felix.kuehl...@amd.com;
> airl...@gmail.com; Welty, Brian 
> Subject: Re: [RFC 03/11] drm: introduce drm evictable LRU
> 
> Am 02.11.23 um 05:32 schrieb Oak Zeng:
> > drm LRU manager is introduced for resource eviction purposes. It maintains
> > a LRU list per resource type.
> 
> Shouldn't we first add the possible resource types in a separate patch?

Resource type in my description message is not a good name. For resource types 
we have:
System memory
Gpu stolen memory
Normal gpu vram

Some devices, such as Intel's PVC, have a sub-device concept (it is called a 
tile in the xe driver). For such devices, we create multiple vram type ttm 
resource managers and multiple lru managers, one for each sub-device... So 
currently we only define a DRM_NUM_MEM_TYPES in the lru manager, but we do not 
define each resource (memory) type.
> 
> >   It provides functions to add or remove
> > resource to or from the list. It also provides function to retrieve the
> > first entity on the LRU list.
> 
> + functions to iterate over them.

Yes, basic iterator functions are implemented in this patch. Will add that to 
the description message.
> 
> >
> > drm LRU manager also provides functions for bulk moving resources
> > on the LRU lists.
> >
> > drm LRU manager also does very basic memory accounting function, i.e.,
> > LRU manager keeps a size of this resource type and a usage member
> > for how much of resource has been added to this LRU manager's LRU
> > list. TTM resource manager memory accounting functions such as
> > struct ttm_resource_manager::size and struct ttm_resource_manager::usage
> > are still kept. In the future, when SVM codes are in the picture,
> > those memory accounting functions need some rework to consider
> > the memory used by both TTM and SVM.
> 
> Please keep in mind that this structure needs to extremely small to be
> usable for SVM. E.g. struct page size small :)
> 
> At least HMM based implementations ideally wants to have one for each
> page or something like that.

Very good point. The list node and the eviction function pointer are necessary 
for drm_lru_entity. I will look into whether we can remove the other members. 
At least we can remove the drm_device pointer if we make drm_device the base 
class of ttm_device, as you suggested in the previous patch comment.

And mem_type and priority can use bitfields, so a single dword is enough for 
both.
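Just to illustrate the packing I mean (a sketch only, not the next revision of 
the struct):

struct drm_lru_entity {
	struct list_head lru;		/* link into the evictable LRU list */
	u64 size;			/* resource size, for accounting */
	u32 mem_type : 16;		/* memory type and priority packed */
	u32 priority : 16;		/*   into a single dword */
	int (*evict_func)(struct drm_lru_entity *entity);
};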


> 
> > For one device, a global drm LRU manager per resource type should be
> > created/initialized at device initialization time. Drm LRU manager
> > instances are embedded in struct drm_device.
> >
> > It is pretty much moving some of the ttm resource manager functions
> > to the drm layer. The reason of this code refactory is, we want to
> > create a single LRU list for memory allocated from BO(buffer object)
> > based driver and hmm/svm(shared virtual memory) based driver, thus BO
> > driver and svm driver can evict memory from each other.
> >
> > Previously the LRU list in TTM resource manager (lru field in struct
> > ttm_reource_manager) is coupled with ttm_buffer_object concept, i.e.,
> > each ttm resource is backed by a ttm_buffer_object and the LRU list
> > is essentially a list of ttm_buffer_object.
> 
> Actually it's the other way around. The resource provides the backing of
> the BO.

You are right. Will fix this description.
> 
> And when a BO moves around it can temporary be that multiple resource
> point to the same BO.
> 
> I also want to have a more advanced iterator at some point where we grab
> the BO lock for keeping a reference into the LRU list. Not sure how to
> do this if we don't have the BO here any more.
> 
> Need to think about that further,

Don't quite get what you want to do with the advanced iterator. But with this 
work, the lru entity is a base class of ttm_resource or any other resource 
struct in hmm/svm. The LRU is decoupled from the bo concept - this is why this 
lru can be shared with the svm code, which is bo-less.

Oak 

> Christian.
> 
> >   Due to this behavior, the
> > TTM resource manager can't be used by hmm/svm driver as we don't plan
> > to have the BO concept for the hmm/svm implementation. So we decouple
> > the evictable LRU list from the BO concept in this series.
> >
> > The design goal of drm lru manager is to make it as lean as possible.
> > So each lru entity only has a list node member used to link this entity
> > to the evictable LRU list, and the basic resource size/type/priority
> > of this entity. It doesn't

RE: [RFC 02/11] drm: move lru_lock from ttm_device to drm_device

2023-11-02 Thread Zeng, Oak


> -Original Message-
> From: Christian König 
> Sent: Thursday, November 2, 2023 8:53 AM
> To: Zeng, Oak ; dri-devel@lists.freedesktop.org; intel-
> x...@lists.freedesktop.org
> Cc: thomas.hellst...@linux.intel.com; felix.kuehl...@amd.com;
> airl...@gmail.com; Welty, Brian 
> Subject: Re: [RFC 02/11] drm: move lru_lock from ttm_device to drm_device
> 
> Am 02.11.23 um 05:32 schrieb Oak Zeng:
> > In the coming patches, we will share the lru list b/t
> > ttm bo based memory allocator and hmm/svm based memory
> > allocator. Thus lru_lock (which is used mainly to protect
> > the lru list) is moved from struct ttm_device to struct
> > drm_device, so this lock can be shared b/t those two
> > memory allocators.
> >
> > To minimize code change, struct ttm_device still hold
> > a weak reference of lru_lock, so ttm layer can still
> > reference to this lock easily.
> 
> I would rather like to see drm_device to become the base class of
> ttm_device.

Yah...so drm_dev is the base of ttm_device, and ttm_device is the base of 
amdgpu_device or xe_device...
> 
> Similar to how drm_gem_object is the base class of ttm_buffer_object.
And ttm_buffer_object is the base of amdgpu_bo.

Pretty uniform structure.
> 
> That is probably a bit more work, but would also eliminate some of the
> duplicate house keeping we currently have (e.g. bdev pointer in
> ttm_buffer_object etc...).

Right, if we do that, we can cast a ttm buffer object to an amdgpu_bo, then get 
the amdgpu_device of this amdgpu_bo, then get the ttm_device. We need to write 
a helper for this. Yes, this way we only need to keep an amdgpu_device pointer 
in amdgpu_bo. Cleaner than keeping a bdev pointer in the tbo.
> 
> Moving then stuff from the ttm_device into the drm_device becomes trivial.

Agree.

Oak
> 
> Regards,
> Christian.
> 
> >
> > Signed-off-by: Oak Zeng 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c   |  4 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c |  4 +-
> >   drivers/gpu/drm/drm_drv.c|  1 +
> >   drivers/gpu/drm/i915/gem/i915_gem_ttm.c  |  4 +-
> >   drivers/gpu/drm/ttm/ttm_bo.c | 40 +--
> >   drivers/gpu/drm/ttm/ttm_device.c | 18 -
> >   drivers/gpu/drm/ttm/ttm_resource.c   | 42 ++--
> >   drivers/gpu/drm/xe/xe_bo.c   |  4 +-
> >   drivers/gpu/drm/xe/xe_exec.c |  4 +-
> >   drivers/gpu/drm/xe/xe_vm.c   |  4 +-
> >   include/drm/drm_device.h |  5 +++
> >   include/drm/ttm/ttm_bo.h |  4 +-
> >   include/drm/ttm/ttm_device.h |  4 +-
> >   13 files changed, 72 insertions(+), 66 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > index f5daadcec865..747bcad86d5d 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > @@ -368,9 +368,9 @@ int amdgpu_vm_lock_pd(struct amdgpu_vm *vm, struct
> drm_exec *exec,
> >   void amdgpu_vm_move_to_lru_tail(struct amdgpu_device *adev,
> > struct amdgpu_vm *vm)
> >   {
> > -   spin_lock(&adev->mman.bdev.lru_lock);
> > +   spin_lock(adev->mman.bdev.lru_lock);
> > ttm_lru_bulk_move_tail(&vm->lru_bulk_move);
> > -   spin_unlock(&adev->mman.bdev.lru_lock);
> > +   spin_unlock(adev->mman.bdev.lru_lock);
> >   }
> >
> >   /* Create scheduler entities for page table updates */
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> > index c7085a747b03..b83e1741905e 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> > @@ -290,9 +290,9 @@ static void amdgpu_vram_mgr_do_reserve(struct
> ttm_resource_manager *man)
> >
> > vis_usage = amdgpu_vram_mgr_vis_size(adev, block);
> > atomic64_add(vis_usage, &mgr->vis_usage);
> > -   spin_lock(&man->bdev->lru_lock);
> > +   spin_lock(man->bdev->lru_lock);
> > man->usage += rsv->size;
> > -   spin_unlock(&man->bdev->lru_lock);
> > +   spin_unlock(man->bdev->lru_lock);
> > list_move(&rsv->blocks, &mgr->reserved_pages);
> > }
> >   }
> > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > index 3eda026ffac6..1943c38815aa 100644
> > --- a/drivers/gpu/drm/drm_drv.c
> > +++ b/drivers/gpu/drm/drm_drv.c
> > @@ -623,6 +623,7 @

unified LRU for ttm and svm

2023-10-19 Thread Zeng, Oak
Hello all,

As a follow up to this thread 
https://www.spinics.net/lists/dri-devel/msg410740.html, I looked further into 
the idea of a shared LRU list for both ttm/bo and svm (to achieve mutual 
eviction b/t them). I came up with a rough design which I think is better to 
align with you on before I move too far.

As illustrated in below diagram:


  1.  There will be a global drm_lru_manager to maintain the shared LRU list. 
Each memory type will have a list, i.e., system memory has a list, gpu memory 
has a list. On systems which have multiple gpu memory regions, we can have 
multiple GPU LRUs.
  2.  Move the LRU operation functions (such as bulk_move related) from 
ttm_resource_manager to drm_lru_manager
  3.  Drm_lru_manager should be initialized during device initialization. Ttm 
layer or svm layer can have weak reference to it for convenience.
  4.  Abstract a drm_lru_entity: This is supposed to be embedded in the 
ttm_resource and svm_resource structs, as illustrated. Since ttm_resource and 
svm_resource are quite different in nature (ttm_resource is coupled with a bo 
and svm_resource is struct page/pfn based), we can't provide a unified eviction 
function for them. So an evict_func pointer is introduced in drm_lru_entity 
[Note 1]; see the sketch after Note 1 below.
  5.  Lru_lock. Currently the lru_lock is in the ttm_device structure. Ideally 
this can be moved to drm_lru_manager. But besides the lru list, lru_lock also 
protects other ttm-specific things such as ttm_device's pinned list. The 
current plan is to move lru_lock to xe_device/amdgpu_device, and ttm_device or 
svm can keep a weak reference for convenience.

[diagram: drm_lru_manager shared LRU design (inline image omitted)]


Note 1: I have been considering a structure like the one below. Each hmm/svm 
resource page is backed by a struct page, and struct page already has a lru 
member. So theoretically the LRU list can be as below. This way we don't need 
to introduce the drm_lru_entity struct. The difficulty is, without modifying 
the linux struct page, we can't cast a lru node to struct page or struct 
ttm_resource, since we don't know whether this node is used by ttm or svm. This 
is why I had to introduce drm_lru_entity to hold an evict_function above. But 
let me know if you have a better idea.

[diagram: LRU list linking struct page lru members directly (inline image omitted)]
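To make the evict_func idea in point 4 a bit more concrete, here is a sketch of 
how the two resource types could share one list while keeping their own 
eviction paths (illustrative only, not the actual structs):

struct drm_lru_entity {
	struct list_head lru;			/* node on the shared LRU */
	int (*evict_func)(struct drm_lru_entity *entity);
};

struct ttm_resource {				/* BO/TTM side */
	struct drm_lru_entity lru_entity;
	/* bo, mem_type, placement, ... */
};

struct svm_resource {				/* hmm/svm side, struct page based */
	struct drm_lru_entity lru_entity;
	/* backing pages, pfn range, ... */
};

static int ttm_lru_evict(struct drm_lru_entity *entity)
{
	struct ttm_resource *res =
		container_of(entity, struct ttm_resource, lru_entity);

	/* BO based eviction of @res, e.g. ending up in ttm_bo_evict() */
	return 0;
}

static int svm_lru_evict(struct drm_lru_entity *entity)
{
	struct svm_resource *res =
		container_of(entity, struct svm_resource, lru_entity);

	/* hmm based migration of @res's backing pages to system memory */
	return 0;
}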

Thanks,
Oak



RE: bulk_move in ttm_resource manager

2023-10-04 Thread Zeng, Oak

> -Original Message-
> From: Christian König 
> Sent: Wednesday, October 4, 2023 8:45 AM
> To: Thomas Hellström ; Zeng, Oak
> 
> Cc: intel...@lists.freedesktop.org; dri-devel@lists.freedesktop.org
> Subject: Re: bulk_move in ttm_resource manager
> 
> Am 04.10.23 um 09:17 schrieb Thomas Hellström:
> > On Wed, 2023-10-04 at 03:52 +, Zeng, Oak wrote:
> >> Hi Christian,
> >>
> >> As a follow up to this thread:
> >> https://www.spinics.net/lists/dri-devel/msg410740.html, I started the
> >> work of moving the lru out of ttm_resource_manager and make it a
> >> common library for both ttm and svm. While look into the details of
> >> the bulk_move in ttm resource manager, I found a potential problem:
> >>
> >> For simplicity, let’s say we only have one memory type and one
> >> priority, so ttm resource manager only maintains one global lru list.
> >> Let’s say this list has 10 nodes, node1 to node10.
> >>
> >> But the lru_bulk_move is per vm. Let’s say vm1 has a bulk_move
> >> covering node range [node4, node7] and vm2 has a bulk_move covering
> >> node range [node6, node9]. Notice those two range has an overlap.
> >> Since two vm can simultaneously add nodes to lru, I think this
> >> scenario can happen.
> 
> That can't happen. See what ttm_resource_move_to_lru_tail() does when
> the BO has a bulk move associated with it.

I spent more time reading the code and I am convinced the code guarantees that 
all nodes in a bulk move range belong to one vm. Yes, each time we add a node 
to a bulk move range, ttm_resource_move_to_lru_tail (and other helpers such as 
ttm_resource_add_bulk_move) moves the newly added node to the tail of the bulk 
move. When the first node is added to the bulk move, the first and last 
pointers of the bulk move both point to that same first node - this is the 
initial condition in which nodes in a bulk move are not separated. Eventually, 
when new nodes are added, we always move them to the tail of the bulk move. So 
after the move, all nodes in a bulk move are still not separated (by nodes from 
other vms).
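In other words, roughly the following invariant (simplified pseudo-code, not 
the upstream implementation; the struct names here are made up):

struct lru_node {
	struct list_head lru;
	struct vm_bulk_range *bulk;	/* NULL if not part of a bulk move */
};

struct vm_bulk_range {			/* per vm, per memory type/priority */
	struct lru_node *first, *last;	/* contiguous sub-range on the LRU */
};

static void lru_touch_node(struct list_head *lru, struct lru_node *node)
{
	struct vm_bulk_range *range = node->bulk;

	if (!range) {
		list_move_tail(&node->lru, lru);
		return;
	}

	if (!range->first) {
		/* First node of this vm's range: append to the LRU tail. */
		list_move_tail(&node->lru, lru);
		range->first = range->last = node;
		return;
	}

	if (node == range->last)
		return;			/* already the tail of the range */

	/* Otherwise insert right after the current last node of the range,
	 * so nodes of different vms can never interleave on the LRU. */
	list_move(&node->lru, &range->last->lru);
	range->last = node;
}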

I doubt whether this implementation of bulk move can actually cut the LRU 
maintenance overhead. Even though we can move the bulk of nodes at once at the 
end, when *each* node is added to the LRU or moved in the LRU, we move it to 
the tail of the bulk move range due to the above bulk move restriction (when 
bulk move is enabled) - this is already a linked list operation. Why not just 
add the node to the tail of the LRU, or just move the node to the LRU tail when 
the node is touched by the GPU?
 
> 
> >>
> >> Now if we perform a bulk move for vm1, moving [node4, node7] to the
> >> tail of the lru list. The lru after this bulk move will be: node1,
> >> node2, node3,node8,node9, node10, node4, node5, node6, node7. Now
> >> notice that for vm2’s bulk_move, the first pointer  (pointing to
> >> node6) is actually after the last pointer (pointing to node9), which
> >> doesn’t make sense.
> >>
> >> Is this a real problem? As I understand it, with this issue, we only
> >> mess up the lru list order, but there won’t be any functional
> >> problem. If it is a real problem, should we make the bulk_move global
> >> instead of per vm based?
> >>
> >> Thanks,
> >> Oak
> >>
> > FWIW I have a patch set that converts the TTM bulk move code to using
> > sublists; a list item is either a resource or a sublist, and when
> > performing a bulk move essentially the sublist is moved. Bumping
> > resource LRU within a VM would touch only the sublist.
> 
> That sounds like my very first attempt at bulk moves which we abandoned
> for various reasons.
> 
> That's easily >5years ago, but the history of that should still be on
> the mailing list if I'm not completely mistaken.

So for my refactor work, I plan to do it based on the current upstream 
implementation. I will revisit if we end up using the sublists.

Regards,
Oak

> 
> Regards,
> Christian.
> 
> >
> > Currently functionality and TTM API is essentially the same but when
> > experimenting with LRU traversal for exhaustive WW-locking eviction
> > this concept was easier to use. Also hopefully this would reduce
> > fragility and improve understanding since a scenario like the above
> > could really never happen...
> >
> > Let me know if I should send it out as an RFC.
> >
> > Code is here:
> > https://gitlab.freedesktop.org/drm/xe/kernel/-
> /merge_requests/351/commits
> >
> > /Thomas
> >
> >
> >
> >
> >



bulk_move in ttm_resource manager

2023-10-03 Thread Zeng, Oak
Hi Christian,

As a follow up to this thread: 
https://www.spinics.net/lists/dri-devel/msg410740.html, I started the work of 
moving the lru out of ttm_resource_manager and making it a common library for 
both ttm and svm. While looking into the details of the bulk_move in the ttm 
resource manager, I found a potential problem:

For simplicity, let's say we only have one memory type and one priority, so ttm 
resource manager only maintains one global lru list. Let's say this list has 10 
nodes, node1 to node10.

But the lru_bulk_move is per vm. Let's say vm1 has a bulk_move covering the 
node range [node4, node7] and vm2 has a bulk_move covering the node range 
[node6, node9]. Notice those two ranges have an overlap. Since two vms can 
simultaneously add nodes to the lru, I think this scenario can happen.

Now if we perform a bulk move for vm1, moving [node4, node7] to the tail of the 
lru list. The lru after this bulk move will be: node1, node2, node3,node8, 
node9, node10, node4, node5, node6, node7. Now notice that for vm2's bulk_move, 
the first pointer  (pointing to node6) is actually after the last pointer 
(pointing to node9), which doesn't make sense.

Is this a real problem? As I understand it, with this issue, we only mess up 
the lru list order, but there won't be any functional problem. If it is a real 
problem, should we make the bulk_move global instead of per vm based?

Thanks,
Oak



RE: Implement svm without BO concept in xe driver

2023-08-22 Thread Zeng, Oak

> -Original Message-
> From: Ruhl, Michael J 
> Sent: August 22, 2023 7:44 AM
> To: Felix Kuehling ; Zeng, Oak ;
> Dave Airlie 
> Cc: Brost, Matthew ; Thomas Hellström
> ; Philip Yang ;
> Welty, Brian ; dri-devel@lists.freedesktop.org; 
> Christian
> König ; Vishwanathapura, Niranjana
> ; intel...@lists.freedesktop.org
> Subject: RE: Implement svm without BO concept in xe driver
> 
> >-Original Message-
> >From: Felix Kuehling 
> >Sent: Monday, August 21, 2023 4:57 PM
> >To: Zeng, Oak ; Dave Airlie 
> >Cc: Brost, Matthew ; Thomas Hellström
> >; Philip Yang ;
> >Welty, Brian ; dri-devel@lists.freedesktop.org;
> >Christian König ; Vishwanathapura, Niranjana
> >; intel...@lists.freedesktop.org;
> >Ruhl, Michael J 
> >Subject: Re: Implement svm without BO concept in xe driver
> >
> >
> >On 2023-08-21 15:41, Zeng, Oak wrote:
> >>> I have thought about emulating BO allocation APIs on top of system SVM.
> >>> This was in the context of KFD where memory management is not tied into
> >>> command submissions APIs, which would add a whole other layer of
> >>> complexity. The main unsolved (unsolvable?) problem I ran into was, that
> >>> there is no way to share SVM memory as DMABufs. So there is no good
> >way
> >>> to support applications that expect to share memory in that way.
> >> Great point. I also discussed the dmabuf thing with Mike (cc'ed). dmabuf 
> >> is a
> >particular technology created specially for the BO driver (and other driver) 
> >to
> >share buffer b/t devices. Hmm/system SVM doesn't need this technology:
> >malloc'ed memory by the nature is already shared b/t different devices (in
> >one process) and CPU. We just can simply submit GPU kernel to all devices
> >with malloc'ed memory and let kmd decide the memory placement (such as
> >map in place or migrate). No need of buffer export/import in hmm/system
> >SVM world.
> >
> >I disagree. DMABuf can be used for sharing memory between processes. And
> >it can be used for sharing memory with 3rd-party devices via PCIe P2P
> >(e.g. a Mellanox NIC). You cannot easily do that with malloc'ed memory.
> >POSIX IPC requires that you know that you'll be sharing the memory at
> >allocation time. It adds overhead. And because it's file-backed, it's
> >currently incompatible with migration. And HMM currently doesn't have a
> >solution for P2P. Any access by a different device causes a migration to
> >system memory.
> 
> Hey Oak,
> 
> I think we were discussing this solution in the context of using the P2P_DMA
> feature.  This has an allocation path and a device 2 device capabilities.


I was thinking of sharing malloc'ed memory b/t the CPU and multiple devices 
inside one process. I thought this should work. With Felix's words above, I 
looked into more details. Now I agree with Felix that this doesn't work with 
hmm.

And as Felix pointed out, POSIX IPC also doesn't work with hmm. Theoretically 
the driver could do a similar migration b/t device memory and file-backed 
memory, just as we did with anonymous memory. But I am not sure whether people 
want to do that.

Anyway, buffer sharing with hmm/system SVM seems to be a big open question. I 
will not try to solve this problem for now.

Cheers,
Oak

> 
> Mike
> 
> 
> >Regards,
> >   Felix
> >
> >
> >>
> >> So yes from buffer sharing perspective, the design philosophy is also very
> >different.
> >>
> >> Thanks,
> >> Oak
> >>


RE: Implement svm without BO concept in xe driver

2023-08-21 Thread Zeng, Oak

> -Original Message-
> From: dri-devel  On Behalf Of Felix
> Kuehling
> Sent: August 21, 2023 3:18 PM
> To: Zeng, Oak ; Dave Airlie 
> Cc: Brost, Matthew ; Thomas Hellström
> ; Philip Yang ;
> Welty, Brian ; dri-devel@lists.freedesktop.org; 
> Christian
> König ; Vishwanathapura, Niranjana
> ; intel...@lists.freedesktop.org
> Subject: Re: Implement svm without BO concept in xe driver
> 
> 
> On 2023-08-21 11:10, Zeng, Oak wrote:
> > Accidently deleted Brian. Add back.
> >
> > Thanks,
> > Oak
> >
> >> -Original Message-
> >> From: Zeng, Oak
> >> Sent: August 21, 2023 11:07 AM
> >> To: Dave Airlie 
> >> Cc: Brost, Matthew ; Thomas Hellström
> >> ; Philip Yang ;
> Felix
> >> Kuehling ; dri-devel@lists.freedesktop.org; intel-
> >> x...@lists.freedesktop.org; Vishwanathapura, Niranjana
> >> ; Christian König
> >> 
> >> Subject: RE: Implement svm without BO concept in xe driver
> >>
> >>> -Original Message-
> >>> From: dri-devel  On Behalf Of
> Dave
> >>> Airlie
> >>> Sent: August 20, 2023 6:21 PM
> >>> To: Zeng, Oak 
> >>> Cc: Brost, Matthew ; Thomas Hellström
> >>> ; Philip Yang ;
> >> Felix
> >>> Kuehling ; Welty, Brian ;
> >> dri-
> >>> de...@lists.freedesktop.org; intel...@lists.freedesktop.org;
> Vishwanathapura,
> >>> Niranjana ; Christian König
> >>> 
> >>> Subject: Re: Implement svm without BO concept in xe driver
> >>>
> >>> On Thu, 17 Aug 2023 at 12:13, Zeng, Oak  wrote:
> >>>>> -Original Message-
> >>>>> From: Dave Airlie 
> >>>>> Sent: August 16, 2023 6:52 PM
> >>>>> To: Felix Kuehling 
> >>>>> Cc: Zeng, Oak ; Christian König
> >>>>> ; Thomas Hellström
> >>>>> ; Brost, Matthew
> >>>>> ; maarten.lankho...@linux.intel.com;
> >>>>> Vishwanathapura, Niranjana ;
> >> Welty,
> >>>>> Brian ; Philip Yang ;
> intel-
> >>>>> x...@lists.freedesktop.org; dri-devel@lists.freedesktop.org
> >>>>> Subject: Re: Implement svm without BO concept in xe driver
> >>>>>
> >>>>> On Thu, 17 Aug 2023 at 08:15, Felix Kuehling 
> >>> wrote:
> >>>>>> On 2023-08-16 13:30, Zeng, Oak wrote:
> >>>>>>> I spoke with Thomas. We discussed two approaches:
> >>>>>>>
> >>>>>>> 1) make ttm_resource a central place for vram management functions
> >>> such as
> >>>>> eviction, cgroup memory accounting. Both the BO-based driver and BO-
> less
> >>> SVM
> >>>>> codes call into ttm_resource_alloc/free functions for vram 
> >>>>> allocation/free.
> >>>>>>>   *This way BO driver and SVM driver shares the eviction/cgroup 
> >>>>>>> logic,
> >> no
> >>>>> need to reimplment LRU eviction list in SVM driver. Cgroup logic should 
> >>>>> be
> >> in
> >>>>> ttm_resource layer. +Maarten.
> >>>>>>>   *ttm_resource is not a perfect match for SVM to allocate vram. 
> >>>>>>> It is
> >> still
> >>> a
> >>>>> big overhead. The *bo* member of ttm_resource is not needed for SVM -
> >>> this
> >>>>> might end up with invasive changes to ttm...need to look into more 
> >>>>> details
> >>>>>> Overhead is a problem. We'd want to be able to allocate, free and evict
> >>>>>> memory at a similar granularity as our preferred migration and page
> >>>>>> fault granularity, which defaults to 2MB in our SVM implementation.
> >>>>>>
> >>>>>>
> >>>>>>> 2) svm code allocate memory directly from drm-buddy allocator, and
> >>> expose
> >>>>> memory eviction functions from both ttm and svm so they can evict
> >> memory
> >>>>> from each other. For example, expose the ttm_mem_evict_first function
> >>> from
> >>>>> ttm side so hmm/svm code can call it; expose a similar function from svm
> >> side
> >>> so
> >>>>> ttm can evict hmm memory.
> >>>>>> I like this option. One thing that needs some thou

RE: Implement svm without BO concept in xe driver

2023-08-21 Thread Zeng, Oak
Accidentally deleted Brian. Add back.

Thanks,
Oak

> -Original Message-
> From: Zeng, Oak
> Sent: August 21, 2023 11:07 AM
> To: Dave Airlie 
> Cc: Brost, Matthew ; Thomas Hellström
> ; Philip Yang ; Felix
> Kuehling ; dri-devel@lists.freedesktop.org; intel-
> x...@lists.freedesktop.org; Vishwanathapura, Niranjana
> ; Christian König
> 
> Subject: RE: Implement svm without BO concept in xe driver
> 
> > -Original Message-
> > From: dri-devel  On Behalf Of Dave
> > Airlie
> > Sent: August 20, 2023 6:21 PM
> > To: Zeng, Oak 
> > Cc: Brost, Matthew ; Thomas Hellström
> > ; Philip Yang ;
> Felix
> > Kuehling ; Welty, Brian ;
> dri-
> > de...@lists.freedesktop.org; intel...@lists.freedesktop.org; 
> > Vishwanathapura,
> > Niranjana ; Christian König
> > 
> > Subject: Re: Implement svm without BO concept in xe driver
> >
> > On Thu, 17 Aug 2023 at 12:13, Zeng, Oak  wrote:
> > >
> > > > -Original Message-
> > > > From: Dave Airlie 
> > > > Sent: August 16, 2023 6:52 PM
> > > > To: Felix Kuehling 
> > > > Cc: Zeng, Oak ; Christian König
> > > > ; Thomas Hellström
> > > > ; Brost, Matthew
> > > > ; maarten.lankho...@linux.intel.com;
> > > > Vishwanathapura, Niranjana ;
> Welty,
> > > > Brian ; Philip Yang ; intel-
> > > > x...@lists.freedesktop.org; dri-devel@lists.freedesktop.org
> > > > Subject: Re: Implement svm without BO concept in xe driver
> > > >
> > > > On Thu, 17 Aug 2023 at 08:15, Felix Kuehling 
> > wrote:
> > > > >
> > > > > On 2023-08-16 13:30, Zeng, Oak wrote:
> > > > > > I spoke with Thomas. We discussed two approaches:
> > > > > >
> > > > > > 1) make ttm_resource a central place for vram management functions
> > such as
> > > > eviction, cgroup memory accounting. Both the BO-based driver and BO-less
> > SVM
> > > > codes call into ttm_resource_alloc/free functions for vram 
> > > > allocation/free.
> > > > > >  *This way BO driver and SVM driver shares the eviction/cgroup 
> > > > > > logic,
> no
> > > > need to reimplment LRU eviction list in SVM driver. Cgroup logic should 
> > > > be
> in
> > > > ttm_resource layer. +Maarten.
> > > > > >  *ttm_resource is not a perfect match for SVM to allocate vram. 
> > > > > > It is
> still
> > a
> > > > big overhead. The *bo* member of ttm_resource is not needed for SVM -
> > this
> > > > might end up with invasive changes to ttm...need to look into more 
> > > > details
> > > > >
> > > > > Overhead is a problem. We'd want to be able to allocate, free and 
> > > > > evict
> > > > > memory at a similar granularity as our preferred migration and page
> > > > > fault granularity, which defaults to 2MB in our SVM implementation.
> > > > >
> > > > >
> > > > > >
> > > > > > 2) svm code allocate memory directly from drm-buddy allocator, and
> > expose
> > > > memory eviction functions from both ttm and svm so they can evict
> memory
> > > > from each other. For example, expose the ttm_mem_evict_first function
> > from
> > > > ttm side so hmm/svm code can call it; expose a similar function from svm
> side
> > so
> > > > ttm can evict hmm memory.
> > > > >
> > > > > I like this option. One thing that needs some thought with this is how
> > > > > to get some semblance of fairness between the two types of clients.
> > > > > Basically how to choose what to evict. And what share of the available
> > > > > memory does each side get to use on average. E.g. an idle client may 
> > > > > get
> > > > > all its memory evicted while a busy client may get a bigger share of 
> > > > > the
> > > > > available memory.
> > > >
> > > > I'd also like to suggest we try to write any management/generic code
> > > > in driver agnostic way as much as possible here. I don't really see
> > > > much hw difference should be influencing it.
> > > >
> > > > I do worry about having effectively 2 LRUs here, you can't really have
> > > > two "leasts".
> > > >
> > > > Like if we hit the shrinker paths wh

RE: Implement svm without BO concept in xe driver

2023-08-21 Thread Zeng, Oak
> -Original Message-
> From: dri-devel  On Behalf Of Dave
> Airlie
> Sent: August 20, 2023 6:21 PM
> To: Zeng, Oak 
> Cc: Brost, Matthew ; Thomas Hellström
> ; Philip Yang ; Felix
> Kuehling ; Welty, Brian ; dri-
> de...@lists.freedesktop.org; intel...@lists.freedesktop.org; Vishwanathapura,
> Niranjana ; Christian König
> 
> Subject: Re: Implement svm without BO concept in xe driver
> 
> On Thu, 17 Aug 2023 at 12:13, Zeng, Oak  wrote:
> >
> > > -Original Message-
> > > From: Dave Airlie 
> > > Sent: August 16, 2023 6:52 PM
> > > To: Felix Kuehling 
> > > Cc: Zeng, Oak ; Christian König
> > > ; Thomas Hellström
> > > ; Brost, Matthew
> > > ; maarten.lankho...@linux.intel.com;
> > > Vishwanathapura, Niranjana ; Welty,
> > > Brian ; Philip Yang ; intel-
> > > x...@lists.freedesktop.org; dri-devel@lists.freedesktop.org
> > > Subject: Re: Implement svm without BO concept in xe driver
> > >
> > > On Thu, 17 Aug 2023 at 08:15, Felix Kuehling 
> wrote:
> > > >
> > > > On 2023-08-16 13:30, Zeng, Oak wrote:
> > > > > I spoke with Thomas. We discussed two approaches:
> > > > >
> > > > > 1) make ttm_resource a central place for vram management functions
> such as
> > > eviction, cgroup memory accounting. Both the BO-based driver and BO-less
> SVM
> > > codes call into ttm_resource_alloc/free functions for vram 
> > > allocation/free.
> > > > >  *This way BO driver and SVM driver shares the eviction/cgroup 
> > > > > logic, no
> > > need to reimplment LRU eviction list in SVM driver. Cgroup logic should 
> > > be in
> > > ttm_resource layer. +Maarten.
> > > > >  *ttm_resource is not a perfect match for SVM to allocate vram. 
> > > > > It is still
> a
> > > big overhead. The *bo* member of ttm_resource is not needed for SVM -
> this
> > > might end up with invasive changes to ttm...need to look into more details
> > > >
> > > > Overhead is a problem. We'd want to be able to allocate, free and evict
> > > > memory at a similar granularity as our preferred migration and page
> > > > fault granularity, which defaults to 2MB in our SVM implementation.
> > > >
> > > >
> > > > >
> > > > > 2) svm code allocate memory directly from drm-buddy allocator, and
> expose
> > > memory eviction functions from both ttm and svm so they can evict memory
> > > from each other. For example, expose the ttm_mem_evict_first function
> from
> > > ttm side so hmm/svm code can call it; expose a similar function from svm 
> > > side
> so
> > > ttm can evict hmm memory.
> > > >
> > > > I like this option. One thing that needs some thought with this is how
> > > > to get some semblance of fairness between the two types of clients.
> > > > Basically how to choose what to evict. And what share of the available
> > > > memory does each side get to use on average. E.g. an idle client may get
> > > > all its memory evicted while a busy client may get a bigger share of the
> > > > available memory.
> > >
> > > I'd also like to suggest we try to write any management/generic code
> > > in driver agnostic way as much as possible here. I don't really see
> > > much hw difference should be influencing it.
> > >
> > > I do worry about having effectively 2 LRUs here, you can't really have
> > > two "leasts".
> > >
> > > Like if we hit the shrinker paths who goes first? do we shrink one
> > > object from each side in turn?
> >
> > One way to solve this fairness problem is to create a driver agnostic
> drm_vram_mgr. Maintain a single LRU in drm_vram_mgr. Move the memory
> eviction/cgroups memory accounting logic from ttm_resource manager to
> drm_vram_mgr. Both BO-based driver and SVM driver calls to drm_vram_mgr to
> allocate/free memory.
> >
> > I am not sure whether this meets the 2M allocate/free/evict granularity
> requirement Felix mentioned above. SVM can allocate 2M size blocks. But BO
> driver should be able to allocate any arbitrary sized blocks - So the 
> eviction is also
> arbitrary size.
> >
> > >
> > > Also will we have systems where we can expose system SVM but userspace
> > > may choose to not use the fine grained SVM and use one of the older
> > > modes, will that path get emulated on top of SVM or use the BO paths?

RE: Implement svm without BO concept in xe driver

2023-08-18 Thread Zeng, Oak
Thanks Thomas. I will then look into more details of option 3:

   * create a lean drm layer vram manager, a central control place for vram 
eviction and cgroup accounting. Single LRU for eviction fairness.
   * pretty much move the current ttm_resource eviction/cgroups logic to drm 
layer
   * the eviction/allocation granularity should be flexible so svm can do 2M 
while ttm can do arbitrary size
   * both ttm_resource and svm code should call the new drm_vram_manager for 
eviction/accounting

I will come back with some RFC proof of concept codes later.
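To make the intent a bit more concrete, the interface I have in mind looks 
roughly like this (everything below is a placeholder for the RFC, not existing 
code):

struct drm_vram_mgr;	/* one per vram region, owns the drm_buddy + single LRU */

/* Allocation granularity is up to the caller: 2M blocks for svm,
 * arbitrary sizes for the BO/TTM path. Blocks come from drm_buddy,
 * and cgroup charging would live behind alloc/free. */
int drm_vram_mgr_alloc(struct drm_vram_mgr *mgr, u64 size, u64 min_block_size,
		       struct list_head *blocks, bool evictable);
void drm_vram_mgr_free(struct drm_vram_mgr *mgr, struct list_head *blocks);

/* Evict from the single shared LRU until @size bytes are available;
 * both ttm_resource and svm allocations sit on that one list. */
int drm_vram_mgr_evict(struct drm_vram_mgr *mgr, u64 size);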

Cheers,
Oak

> -Original Message-
> From: Thomas Hellström 
> Sent: August 18, 2023 3:36 AM
> To: Zeng, Oak ; Dave Airlie ; Felix
> Kuehling 
> Cc: Christian König ; Brost, Matthew
> ; maarten.lankho...@linux.intel.com;
> Vishwanathapura, Niranjana ; Welty,
> Brian ; Philip Yang ; intel-
> x...@lists.freedesktop.org; dri-devel@lists.freedesktop.org
> Subject: Re: Implement svm without BO concept in xe driver
> 
> 
> On 8/17/23 04:12, Zeng, Oak wrote:
> >> -Original Message-
> >> From: Dave Airlie 
> >> Sent: August 16, 2023 6:52 PM
> >> To: Felix Kuehling 
> >> Cc: Zeng, Oak ; Christian König
> >> ; Thomas Hellström
> >> ; Brost, Matthew
> >> ; maarten.lankho...@linux.intel.com;
> >> Vishwanathapura, Niranjana ; Welty,
> >> Brian ; Philip Yang ; intel-
> >> x...@lists.freedesktop.org; dri-devel@lists.freedesktop.org
> >> Subject: Re: Implement svm without BO concept in xe driver
> >>
> >> On Thu, 17 Aug 2023 at 08:15, Felix Kuehling  
> >> wrote:
> >>> On 2023-08-16 13:30, Zeng, Oak wrote:
> >>>> I spoke with Thomas. We discussed two approaches:
> >>>>
> >>>> 1) make ttm_resource a central place for vram management functions such
> as
> >> eviction, cgroup memory accounting. Both the BO-based driver and BO-less
> SVM
> >> codes call into ttm_resource_alloc/free functions for vram allocation/free.
> >>>>   *This way BO driver and SVM driver shares the eviction/cgroup 
> >>>> logic, no
> >> need to reimplment LRU eviction list in SVM driver. Cgroup logic should be 
> >> in
> >> ttm_resource layer. +Maarten.
> >>>>   *ttm_resource is not a perfect match for SVM to allocate vram. It 
> >>>> is still a
> >> big overhead. The *bo* member of ttm_resource is not needed for SVM - this
> >> might end up with invasive changes to ttm...need to look into more details
> >>> Overhead is a problem. We'd want to be able to allocate, free and evict
> >>> memory at a similar granularity as our preferred migration and page
> >>> fault granularity, which defaults to 2MB in our SVM implementation.
> >>>
> >>>
> >>>> 2) svm code allocate memory directly from drm-buddy allocator, and
> expose
> >> memory eviction functions from both ttm and svm so they can evict memory
> >> from each other. For example, expose the ttm_mem_evict_first function
> from
> >> ttm side so hmm/svm code can call it; expose a similar function from svm 
> >> side
> so
> >> ttm can evict hmm memory.
> >>> I like this option. One thing that needs some thought with this is how
> >>> to get some semblance of fairness between the two types of clients.
> >>> Basically how to choose what to evict. And what share of the available
> >>> memory does each side get to use on average. E.g. an idle client may get
> >>> all its memory evicted while a busy client may get a bigger share of the
> >>> available memory.
> >> I'd also like to suggest we try to write any management/generic code
> >> in driver agnostic way as much as possible here. I don't really see
> >> much hw difference should be influencing it.
> >>
> >> I do worry about having effectively 2 LRUs here, you can't really have
> >> two "leasts".
> >>
> >> Like if we hit the shrinker paths who goes first? do we shrink one
> >> object from each side in turn?
> > One way to solve this fairness problem is to create a driver agnostic
> drm_vram_mgr. Maintain a single LRU in drm_vram_mgr. Move the memory
> eviction/cgroups memory accounting logic from ttm_resource manager to
> drm_vram_mgr. Both BO-based driver and SVM driver calls to drm_vram_mgr to
> allocate/free memory.
> >
> > I am not sure whether this meets the 2M allocate/free/evict granularity
> requirement Felix mentioned above. SVM can allocate 2M size blocks. But BO
> driver should be able to allocat

RE: Implement svm without BO concept in xe driver

2023-08-16 Thread Zeng, Oak
> -Original Message-
> From: Dave Airlie 
> Sent: August 16, 2023 6:52 PM
> To: Felix Kuehling 
> Cc: Zeng, Oak ; Christian König
> ; Thomas Hellström
> ; Brost, Matthew
> ; maarten.lankho...@linux.intel.com;
> Vishwanathapura, Niranjana ; Welty,
> Brian ; Philip Yang ; intel-
> x...@lists.freedesktop.org; dri-devel@lists.freedesktop.org
> Subject: Re: Implement svm without BO concept in xe driver
> 
> On Thu, 17 Aug 2023 at 08:15, Felix Kuehling  wrote:
> >
> > On 2023-08-16 13:30, Zeng, Oak wrote:
> > > I spoke with Thomas. We discussed two approaches:
> > >
> > > 1) make ttm_resource a central place for vram management functions such as
> eviction, cgroup memory accounting. Both the BO-based driver and BO-less SVM
> codes call into ttm_resource_alloc/free functions for vram allocation/free.
> > >  *This way BO driver and SVM driver shares the eviction/cgroup logic, 
> > > no
> need to reimplment LRU eviction list in SVM driver. Cgroup logic should be in
> ttm_resource layer. +Maarten.
> > >  *ttm_resource is not a perfect match for SVM to allocate vram. It is 
> > > still a
> big overhead. The *bo* member of ttm_resource is not needed for SVM - this
> might end up with invasive changes to ttm...need to look into more details
> >
> > Overhead is a problem. We'd want to be able to allocate, free and evict
> > memory at a similar granularity as our preferred migration and page
> > fault granularity, which defaults to 2MB in our SVM implementation.
> >
> >
> > >
> > > 2) svm code allocate memory directly from drm-buddy allocator, and expose
> memory eviction functions from both ttm and svm so they can evict memory
> from each other. For example, expose the ttm_mem_evict_first function from
> ttm side so hmm/svm code can call it; expose a similar function from svm side 
> so
> ttm can evict hmm memory.
> >
> > I like this option. One thing that needs some thought with this is how
> > to get some semblance of fairness between the two types of clients.
> > Basically how to choose what to evict. And what share of the available
> > memory does each side get to use on average. E.g. an idle client may get
> > all its memory evicted while a busy client may get a bigger share of the
> > available memory.
> 
> I'd also like to suggest we try to write any management/generic code
> in driver agnostic way as much as possible here. I don't really see
> much hw difference should be influencing it.
> 
> I do worry about having effectively 2 LRUs here, you can't really have
> two "leasts".
> 
> Like if we hit the shrinker paths who goes first? do we shrink one
> object from each side in turn?

One way to solve this fairness problem is to create a driver agnostic 
drm_vram_mgr: maintain a single LRU in drm_vram_mgr and move the memory 
eviction/cgroups memory accounting logic from the ttm_resource manager to 
drm_vram_mgr. Both the BO-based driver and the SVM driver would call into 
drm_vram_mgr to allocate/free memory.

I am not sure whether this meets the 2M allocate/free/evict granularity 
requirement Felix mentioned above. SVM can allocate 2M-sized blocks, but the BO 
driver should be able to allocate arbitrarily sized blocks - so eviction is 
also of arbitrary size.

> 
> Also will we have systems where we can expose system SVM but userspace
> may choose to not use the fine grained SVM and use one of the older
> modes, will that path get emulated on top of SVM or use the BO paths?


If by "older modes" you mean gem_bo_create (such as xe_gem_create or 
amdgpu_gem_create), then today both amd and intel implement those interfaces 
using the BO path. We don't have a plan to emulate that old mode on top of SVM, 
afaict.

Thanks,
Oak

> 
> Dave.


RE: Implement svm without BO concept in xe driver

2023-08-16 Thread Zeng, Oak
I spoke with Thomas. We discussed two approaches:

1) make ttm_resource a central place for vram management functions such as 
eviction, cgroup memory accounting. Both the BO-based driver and BO-less SVM 
codes call into ttm_resource_alloc/free functions for vram allocation/free.
    *This way the BO driver and the SVM driver share the eviction/cgroup logic; 
no need to reimplement the LRU eviction list in the SVM driver. Cgroup logic 
should be in the ttm_resource layer. +Maarten.
*ttm_resource is not a perfect match for SVM to allocate vram. It is still 
a big overhead. The *bo* member of ttm_resource is not needed for SVM - this 
might end up with invasive changes to ttm...need to look into more details

2) svm code allocate memory directly from drm-buddy allocator, and expose 
memory eviction functions from both ttm and svm so they can evict memory from 
each other. For example, expose the ttm_mem_evict_first function from ttm side 
so hmm/svm code can call it; expose a similar function from svm side so ttm can 
evict hmm memory.
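For approach #2, the memory pressure path could look roughly like this (a 
sketch; all function names below are placeholders, including the wrappers 
around the ttm and svm eviction entry points mentioned above):

/* Try to make @size bytes of vram available, evicting from the BO/TTM pool
 * and the hmm/svm pool in turn. How to balance the two sides fairly is
 * still an open question, as discussed in this thread. */
int xe_vram_make_room(struct xe_device *xe, u64 size)
{
	while (!xe_vram_has_room(xe, size)) {	/* placeholder check */
		if (!xe_ttm_evict_one(xe))	/* wraps the ttm eviction path */
			continue;
		if (!xe_svm_evict_one(xe))	/* migrates svm pages to sysmem */
			continue;
		return -ENOSPC;			/* nothing left to evict */
	}

	return 0;
}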


Today we don't know which approach is better. I will work on some proof of 
concept code, starting with approach #1 first.

Btw, I talked with application engineers and they said most applications 
actually use a mixture of gem_bo create and malloc, so we definitely need to 
solve this problem. 

Cheers,
Oak

> -Original Message-
> From: Christian König 
> Sent: August 16, 2023 2:06 AM
> To: Zeng, Oak ; Felix Kuehling ;
> Thomas Hellström ; Brost, Matthew
> ; Vishwanathapura, Niranjana
> ; Welty, Brian ;
> Philip Yang ; intel...@lists.freedesktop.org; dri-
> de...@lists.freedesktop.org
> Subject: Re: Implement svm without BO concept in xe driver
> 
> Hi Oak,
> 
> yeah, I completely agree with you and Felix. The main problem here is
> getting the memory pressure visible on both sides.
> 
> At the moment I have absolutely no idea how to handle that, maybe
> something like the ttm_resource object shared between TTM and HMM?
> 
> Regards,
> Christian.
> 
> Am 16.08.23 um 05:47 schrieb Zeng, Oak:
> > Hi Felix,
> >
> > It is great to hear from you!
> >
> > When I implement the HMM-based SVM for intel devices, I found this
> interesting problem: HMM uses struct page based memory management scheme
> which is completely different against the BO/TTM style memory management
> philosophy. Writing SVM code upon the BO/TTM concept seems overkill and
> awkward. So I thought we better make the SVM code BO-less and TTM-less. But
> on the other hand, currently vram eviction and cgroup memory accounting are 
> all
> hooked to the TTM layer, which means a TTM-less SVM driver won't be able to
> evict vram allocated through TTM/gpu_vram_mgr.
> >
> > Ideally HMM migration should use drm-buddy for vram allocation, but we need
> to solve this TTM/HMM mutual eviction problem as you pointed out (I am
> working with application engineers to figure out whether mutual eviction can
> truly benefit applications). Maybe we can implement a TTM-less vram
> management block which can be shared b/t the HMM-based driver and the BO-
> based driver:
> > * allocate/free memory from drm-buddy, buddy-block based
> > * memory eviction logics, allow driver to specify which allocation is 
> > evictable
> > * memory accounting, cgroup logic
> >
> > Maybe such a block can be placed at drm layer (say, call it drm_vram_mgr for
> now), so it can be shared b/t amd and intel. So I involved amd folks. Today 
> both
> amd and intel-xe driver implemented a TTM-based vram manager which doesn't
> serve above design goal. Once the drm_vram_mgr is implemented, both amd
> and intel's BO-based/TTM-based vram manager, and the HMM-based vram
> manager can call into this drm-vram-mgr.
> >
> > Thanks again,
> > Oak
> >
> >> -Original Message-
> >> From: Felix Kuehling 
> >> Sent: August 15, 2023 6:17 PM
> >> To: Zeng, Oak ; Thomas Hellström
> >> ; Brost, Matthew
> >> ; Vishwanathapura, Niranjana
> >> ; Welty, Brian
> ;
> >> Christian König ; Philip Yang
> >> ; intel...@lists.freedesktop.org; dri-
> >> de...@lists.freedesktop.org
> >> Subject: Re: Implement svm without BO concept in xe driver
> >>
> >> Hi Oak,
> >>
> >> I'm not sure what you're looking for from AMD? Are we just CC'ed FYI? Or
> >> are you looking for comments about
> >>
> >>* Our plans for VRAM management with HMM
> >>* Our experience with BO-based VRAM management
> >>* Something else?
> >>
> >> IMO, having separate memory pools for HMM and TTM is a non-starter for
> >> AMD. We need access to the full VRAM in

RE: Implement svm without BO concept in xe driver

2023-08-15 Thread Zeng, Oak
Hi Felix,

It is great to hear from you!

While implementing the HMM-based SVM for Intel devices, I found this 
interesting problem: HMM uses a struct page based memory management scheme 
which is completely different from the BO/TTM style memory management 
philosophy. Writing SVM code upon the BO/TTM concept seems overkill and 
awkward. So I thought we had better make the SVM code BO-less and TTM-less. But 
on the other hand, currently vram eviction and cgroup memory accounting are all 
hooked to the TTM layer, which means a TTM-less SVM driver won't be able to 
evict vram allocated through TTM/gpu_vram_mgr.

Ideally HMM migration should use drm-buddy for vram allocation, but we need to 
solve this TTM/HMM mutual eviction problem as you pointed out (I am working 
with application engineers to figure out whether mutual eviction can truly 
benefit applications). Maybe we can implement a TTM-less vram management block 
which can be shared b/t the HMM-based driver and the BO-based driver:
   * allocate/free memory from drm-buddy, buddy-block based
   * memory eviction logic, allowing the driver to specify which allocations 
are evictable
   * memory accounting, cgroup logic

Maybe such a block can be placed at the drm layer (say, call it drm_vram_mgr 
for now), so it can be shared b/t amd and intel. So I involved the amd folks. 
Today both the amd and intel-xe drivers implement a TTM-based vram manager 
which doesn't serve the above design goal. Once the drm_vram_mgr is 
implemented, both amd's and intel's BO-based/TTM-based vram managers, and the 
HMM-based vram manager, can call into this drm-vram-mgr.

Thanks again,
Oak

> -Original Message-
> From: Felix Kuehling 
> Sent: August 15, 2023 6:17 PM
> To: Zeng, Oak ; Thomas Hellström
> ; Brost, Matthew
> ; Vishwanathapura, Niranjana
> ; Welty, Brian ;
> Christian König ; Philip Yang
> ; intel...@lists.freedesktop.org; dri-
> de...@lists.freedesktop.org
> Subject: Re: Implement svm without BO concept in xe driver
> 
> Hi Oak,
> 
> I'm not sure what you're looking for from AMD? Are we just CC'ed FYI? Or
> are you looking for comments about
> 
>   * Our plans for VRAM management with HMM
>   * Our experience with BO-based VRAM management
>   * Something else?
> 
> IMO, having separate memory pools for HMM and TTM is a non-starter for
> AMD. We need access to the full VRAM in either of the APIs for it to be
> useful. That also means, we need to handle memory pressure in both
> directions. That's one of the main reasons we went with the BO-based
> approach initially. I think in the long run, using the buddy allocator,
> or the amdgpu_vram_mgr directly for HMM migrations would be better,
> assuming we can handle memory pressure in both directions between HMM
> and TTM sharing the same pool of physical memory.
> 
> Regards,
>    Felix
> 
> 
> On 2023-08-15 16:34, Zeng, Oak wrote:
> >
> > Also + Christian
> >
> > Thanks,
> >
> > Oak
> >
> > *From:*Intel-xe  *On Behalf Of
> > *Zeng, Oak
> > *Sent:* August 14, 2023 11:38 PM
> > *To:* Thomas Hellström ; Brost,
> > Matthew ; Vishwanathapura, Niranjana
> > ; Welty, Brian
> > ; Felix Kuehling ;
> > Philip Yang ; intel...@lists.freedesktop.org;
> > dri-devel@lists.freedesktop.org
> > *Subject:* [Intel-xe] Implement svm without BO concept in xe driver
> >
> > Hi Thomas, Matt and all,
> >
> > This came up when I port i915 svm codes to xe driver. In i915
> > implementation, we have i915_buddy manage gpu vram and svm codes
> > directly call i915_buddy layer to allocate/free vram. There is no
> > gem_bo/ttm bo concept involved in the svm implementation.
> >
> > In xe driver,  we have drm_buddy, xe_ttm_vram_mgr and ttm layer to
> > manage vram. Drm_buddy is initialized during xe_ttm_vram_mgr
> > initialization. Vram allocation/free is done through xe_ttm_vram_mgr
> > functions which call into drm_buddy layer to allocate vram blocks.
> >
> > I plan to implement xe svm driver the same way as we did in i915,
> > which means there will not be bo concept in the svm implementation.
> > Drm_buddy will be passed to svm layer during vram initialization and
> > svm will allocate/free memory directly from drm_buddy, bypassing
> > ttm/xee vram manager. Here are a few considerations/things we are
> > aware of:
> >
> >  1. This approach seems match hmm design better than bo concept. Our
> > svm implementation will be based on hmm. In hmm design, each vram
> > page is backed by a struct page. It is very easy to perform page
> > granularity migrations (b/t vram and system memory). If BO concept
> > is involved, we will have to split/remerge BOs during page
> > granularity migrations.
&g

RE: Implement svm without BO concept in xe driver

2023-08-15 Thread Zeng, Oak
Also + Christian

Thanks,
Oak

From: Intel-xe  On Behalf Of Zeng, Oak
Sent: August 14, 2023 11:38 PM
To: Thomas Hellström ; Brost, Matthew 
; Vishwanathapura, Niranjana 
; Welty, Brian ; 
Felix Kuehling ; Philip Yang ; 
intel...@lists.freedesktop.org; dri-devel@lists.freedesktop.org
Subject: [Intel-xe] Implement svm without BO concept in xe driver

Hi Thomas, Matt and all,

This came up when I port i915 svm codes to xe driver. In i915 implementation, 
we have i915_buddy manage gpu vram and svm codes directly call i915_buddy layer 
to allocate/free vram. There is no gem_bo/ttm bo concept involved in the svm 
implementation.

In xe driver,  we have drm_buddy, xe_ttm_vram_mgr and ttm layer to manage vram. 
Drm_buddy is initialized during xe_ttm_vram_mgr initialization. Vram 
allocation/free is done through xe_ttm_vram_mgr functions which call into 
drm_buddy layer to allocate vram blocks.

I plan to implement xe svm driver the same way as we did in i915, which means 
there will not be bo concept in the svm implementation. Drm_buddy will be 
passed to svm layer during vram initialization and svm will allocate/free 
memory directly from drm_buddy, bypassing ttm/xee vram manager. Here are a few 
considerations/things we are aware of:


  1.  This approach seems to match the hmm design better than the bo concept. 
Our svm implementation will be based on hmm. In the hmm design, each vram page 
is backed by a struct page. It is very easy to perform page granularity 
migrations (b/t vram and system memory). If the BO concept is involved, we will 
have to split/remerge BOs during page granularity migrations.

  2.  We have a proof of concept of this approach in i915, originally 
implemented by Niranjana. It seems to work, but it only has basic 
functionalities for now. We don't have advanced features such as memory 
eviction etc.

  3.  With this approach, vram will be divided into two separate pools: one 
for xe_gem_created BOs and one for vram used by svm. Those two pools are not 
connected: memory pressure from one pool won't be able to evict vram from the 
other pool. At this point, we don't know whether this aspect is good or not.

  4.  Amdkfd svm went a different route, which is BO based. The benefit of 
that approach is that a lot of existing driver facilities (such as memory 
eviction/cgroup/accounting) can be reused.



Do you have any comment to this approach? Should I come back with a RFC of some 
POC codes?

Thanks,
Oak



Implement svm without BO concept in xe driver

2023-08-14 Thread Zeng, Oak
Hi Thomas, Matt and all,

This came up when I port i915 svm codes to xe driver. In i915 implementation, 
we have i915_buddy manage gpu vram and svm codes directly call i915_buddy layer 
to allocate/free vram. There is no gem_bo/ttm bo concept involved in the svm 
implementation.

In xe driver,  we have drm_buddy, xe_ttm_vram_mgr and ttm layer to manage vram. 
Drm_buddy is initialized during xe_ttm_vram_mgr initialization. Vram 
allocation/free is done through xe_ttm_vram_mgr functions which call into 
drm_buddy layer to allocate vram blocks.

I plan to implement xe svm driver the same way as we did in i915, which means 
there will not be bo concept in the svm implementation. Drm_buddy will be 
passed to svm layer during vram initialization and svm will allocate/free 
memory directly from drm_buddy, bypassing ttm/xee vram manager. Here are a few 
considerations/things we are aware of:


  1.  This approach seems to match the hmm design better than the bo concept. 
Our svm implementation will be based on hmm. In the hmm design, each vram page 
is backed by a struct page. It is very easy to perform page granularity 
migrations (b/t vram and system memory). If the BO concept is involved, we will 
have to split/remerge BOs during page granularity migrations.

  2.  We have a proof of concept of this approach in i915, originally 
implemented by Niranjana. It seems to work, but it only has basic 
functionalities for now.

  3.  With this approach, vram will be divided into two separate pools: one 
for xe_gem_created BOs and one for vram used by svm. Those two pools are not 
connected: memory pressure from one pool won't be able to evict vram from the 
other pool. At this point, we don't know whether this aspect is good or not.

  4.  Amdkfd svm went a different route, which is BO based. The benefit of 
that approach is that a lot of existing driver facilities can be reused.



Do you have any comments on this approach? Should I come back with an RFC with some 
POC code?
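
To make the proposal a little more concrete, allocation in the svm layer would 
talk to drm_buddy directly, roughly like the sketch below. This is only my 
illustration: the helper names are made up, the start/end/alignment policy is 
arbitrary, and the drm_buddy signatures may differ between kernel versions.

#include <drm/drm_buddy.h>

/* Allocate size bytes of vram for svm straight from the drm_buddy that was
 * handed over at vram init time, bypassing ttm/xe_ttm_vram_mgr. */
static int svm_alloc_vram(struct drm_buddy *mm, u64 size,
                          struct list_head *blocks)
{
        return drm_buddy_alloc_blocks(mm, 0, mm->size, size,
                                      PAGE_SIZE, blocks, 0);
}

/* Give the blocks back to drm_buddy when svm migrates or frees the range. */
static void svm_free_vram(struct drm_buddy *mm, struct list_head *blocks)
{
        drm_buddy_free_list(mm, blocks);
}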

Thanks,
Oak



RE: [RFC PATCH] Documentation/gpu: Add a VM_BIND async draft document.

2023-05-30 Thread Zeng, Oak
Hi Thomas,

Thanks for the document. See one question inline.

Thanks,
Oak

> -Original Message-
> From: dri-devel  On Behalf Of
> Thomas Hellström
> Sent: May 30, 2023 4:43 AM
> To: intel...@lists.freedesktop.org
> Cc: Brost, Matthew ; Thomas Hellström
> ; linux-ker...@vger.kernel.org; dri-
> de...@lists.freedesktop.org; Danilo Krummrich 
> Subject: [RFC PATCH] Documentation/gpu: Add a VM_BIND async draft
> document.
> 
> Add a motivation for and description of asynchronous VM_BIND operation
> 
> Signed-off-by: Thomas Hellström 
> ---
>  Documentation/gpu/drm-vm-bind-async.rst | 138
> 
>  1 file changed, 138 insertions(+)
>  create mode 100644 Documentation/gpu/drm-vm-bind-async.rst
> 
> diff --git a/Documentation/gpu/drm-vm-bind-async.rst
> b/Documentation/gpu/drm-vm-bind-async.rst
> new file mode 100644
> index ..7f7f8f7ddfea
> --- /dev/null
> +++ b/Documentation/gpu/drm-vm-bind-async.rst
> @@ -0,0 +1,138 @@
> +
> +Asynchronous VM_BIND
> +
> +
> +Nomenclature:
> +=
> +
> +* VRAM: On-device memory. Sometimes referred to as device local memory.
> +
> +* vm: A GPU address space. Typically per process, but can be shared by
> +  multiple processes.
> +
> +* VM_BIND: An operation or a list of operations to modify a vm using
> +  an IOCTL. The operations include mapping and unmapping system- or
> +  VRAM memory.
> +
> +* syncobj: A container that abstracts synchronization objects. The
> +  synchronization objects can be either generic, like dma-fences or
> +  driver specific. A syncobj typically indicates the type of the
> +  underlying synchronization object.
> +
> +* in-syncobj: Argument to a VM_BIND IOCTL, the VM_BIND operation waits
> +  for these before starting.
> +
> +* out-syncbj: Argument to a VM_BIND_IOCTL, the VM_BIND operation
> +  signals these when the bind operation is complete.
> +
> +* memory fence: A synchronization object, different from a dma-fence
> +  that uses the value of a specified memory location to determine
> +  signaled status. 

Are you saying a memory fence (user fence) uses a specific memory location to 
determine signaled status, while a dma-fence doesn't use a specific memory 
location to determine status?

My understanding is that both user fences and dma-fences use memory to determine 
status... in the dma-fence case, it is the seqno field of struct dma_fence. The 
difference between the two is that, for a dma-fence, people agreed it has to be 
signaled within a certain amount of time, while a user fence doesn't have such a 
contract.
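
To make the comparison concrete, here is a minimal sketch of the two checks as I 
understand them (my own illustration, not code from the patch); the structural 
difference is only who owns the memory and whether there is a signaling deadline:

#include <linux/dma-fence.h>
#include <linux/types.h>

/* dma-fence: kernel-owned seqno, with the contract that it signals within a
 * bounded time. */
static bool dma_fence_style_signaled(struct dma_fence *f)
{
        return dma_fence_is_signaled(f);
}

/* memory fence / user fence: a plain value at a (typically user-visible)
 * address; no contract on when, or whether, it reaches the wait value. */
static bool memory_fence_signaled(const u64 *addr, u64 wait_value)
{
        return READ_ONCE(*addr) >= wait_value;
}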

-Oak

A memory fence can be awaited and signaled by both
> +  the GPU and CPU. Memory fences are sometimes referred to as
> +  user-fences.
> +
> +* long-running workload: A workload that may take more than the
> +  current stipulated dma-fence maximum signal delay to complete and
> +  which therefore needs to set the VM or the GPU execution context in
> +  a certain mode that disallows completion dma-fences.
> +
> +* UMD: User-mode driver.
> +
> +* KMD: Kernel-mode driver.
> +
> +
> +Synchronous / Asynchronous VM_BIND operation
> +
> +
> +Synchronous VM_BIND
> +___
> +With Synchronous VM_BIND, the VM_BIND operations all complete before the
> +ioctl returns. A synchronous VM_BIND takes neither in-fences nor
> +out-fences. Synchronous VM_BIND may block and wait for GPU operations;
> +for example swapin or clearing, or even previous binds.
> +
> +Asynchronous VM_BIND
> +
> +Asynchronous VM_BIND accepts both in-syncobjs and out-syncobjs. While the
> +IOCTL may return immediately, the VM_BIND operations wait for the in-
> syncobjs
> +before modifying the GPU page-tables, and signal the out-syncobjs when
> +the modification is done in the sense that the next execbuf that
> +awaits for the out-syncobjs will see the change. Errors are reported
> +synchronously assuming that the asynchronous part of the job never errors.
> +In low-memory situations the implementation may block, performing the
> +VM_BIND synchronously, because there might not be enough memory
> +immediately available for preparing the asynchronous operation.
> +
> +If the VM_BIND IOCTL takes a list or an array of operations as an argument,
> +the in-syncobjs needs to signal before the first operation starts to
> +execute, and the out-syncobjs signal after the last operation
> +completes. Operations in the operation list can be assumed, where it
> +matters, to complete in order.
> +
> +To aid in supporting user-space queues, the VM_BIND may take a bind context
> +AKA bind engine identifier argument. All VM_BIND operations using the same
> +bind engine can then be assumed, where it matters, to complete in
> +order. No such assumptions can be made between VM_BIND operations
> +using separate bind contexts.
> +
> +The purpose of an Asynchronous VM_BIND operation is for user-mode
> +drivers to be able to pipeline interleaved vm modifications and
> 

RE: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-06 Thread Zeng, Oak
So this series basically goes with option 2. The part of option 2 that makes me 
uncomfortable is: dma-fence doesn't work for long running workloads, so why do we 
generate it in the first place? As long as a dma-fence is generated, it will 
become a source of confusion in the future, no matter how much you annotate or 
document it. So if we decide to go with option 2, the bottom line is: don't 
generate a dma-fence for long running workloads during job submission. This 
requires some rework in the drm scheduler.
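
As a rough illustration of what I mean (all structures and helpers below are 
hypothetical, this is not drm_sched or xe code): at submission time a long-running 
queue would simply never get a completion dma-fence attached, and completion 
would be reported through a user/memory fence instead.

/* Hypothetical sketch only. */
static void attach_completion(struct sketch_queue *q, struct sketch_job *job,
                              struct dma_fence *hw_fence, u64 seqno)
{
        if (q->long_running) {
                /* No dma-fence is created or exported for this job; nothing
                 * in the kernel is ever allowed to wait on its completion. */
                job->completion_fence = NULL;
                job->user_fence_addr = q->user_fence_gpu_addr;
                job->user_fence_value = seqno;
        } else {
                /* Short-running path keeps the existing dma-fence contract. */
                job->completion_fence = dma_fence_get(hw_fence);
        }
}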

The cleanest solution to me is option 3. Dma-fence is a very old technology. 
When it was created, no gpu supported page faults. Obviously it is not a good 
fit for modern gpus with page fault support. I think the best way is to create a 
new scheduler and dependency tracking mechanism that works for both 
page-fault-enabled and page-fault-disabled contexts. I think this matches what 
Christian said below. Maybe nobody thinks this is easy?

Thanks,
Oak

> -Original Message-
> From: Brost, Matthew 
> Sent: April 5, 2023 2:53 PM
> To: Zeng, Oak 
> Cc: Christian König ; Vetter, Daniel
> ; Thomas Hellström
> ; dri-devel@lists.freedesktop.org; intel-
> x...@lists.freedesktop.org; robdcl...@chromium.org; airl...@linux.ie;
> l...@asahilina.net; boris.brezil...@collabora.com; 
> faith.ekstr...@collabora.com
> Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> plans
> 
> On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:
> > Hi,
> >
> > Using dma-fence for completion/dependency tracking for long-run
> workload(more precisely on-demand paging/page fault enabled workload) can
> cause deadlock. This seems the significant issue here. Other issues such as 
> the
> drm scheduler completion order implication etc are minors which can be solve
> inside the framework of drm scheduler. We need to evaluate below paths:
> >
> > 1) still use drm scheduler for job submission, and use dma-fence for job
> completion waiting/dependency tracking. This is solution proposed in this 
> series.
> Annotate dma-fence for long-run workload: user can still wait dma-fence for 
> job
> completion but can't wait dma-fence while holding any memory management
> locks.  We still use dma-fence for dependency tracking. But it is just very 
> easily
> run into deadlock when on-demand paging is in the picture. The annotation 
> helps
> us to detect deadlock but not solve deadlock problems. Seems *not* a complete
> solution: It is almost impossible to completely avoid dependency deadlock in
> complex runtime environment
> >
> 
> No one can wait on LR fence, so it is impossible to deadlock. The
> annotations enforce this. Literally this is only for flow controling the
> ring / hold pending jobs in in the DRM schedule list.
> 
> > 2) Still use drm scheduler but not use dma-fence for completion 
> > signaling
> and dependency tracking. This way we still get some free functions (reset, err
> handling ring flow control as Matt said)from drm scheduler, but push the
> dependency/completion tracking completely to user space using techniques such
> as user space fence. User space doesn't have chance to wait fence while 
> holding
> a kernel memory management lock, thus the dma-fence deadlock issue is solved.
> >
> 
> We use user space fence for syncs.
> 
> > 3) Completely discard drm scheduler and dma-fence for long-run
> workload. Use user queue/doorbell for super fast submission, directly interact
> with fw scheduler. Use user fence for completion/dependency tracking.
> >
> 
> This is a hard no from me, I want 1 submission path in Xe. Either we use
> the DRM scheduler or we don't.
> 
> Matt
> 
> > Thanks,
> > Oak
> >
> > > -Original Message-
> > > From: Christian König 
> > > Sent: April 5, 2023 3:30 AM
> > > To: Brost, Matthew ; Zeng, Oak
> > > 
> > > Cc: dri-devel@lists.freedesktop.org; intel...@lists.freedesktop.org;
> > > robdcl...@chromium.org; thomas.hellst...@linux.intel.com;
> airl...@linux.ie;
> > > l...@asahilina.net; boris.brezil...@collabora.com;
> faith.ekstr...@collabora.com
> > > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > > plans
> > >
> > > Am 04.04.23 um 20:08 schrieb Matthew Brost:
> > > > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
> > > >> Hi Matt, Thomas,
> > > >>
> > > >> Some very bold out of box thinking in this area:
> > > >>
> > > >> 1. so you want to use drm scheduler and dma-fence for long running
> workload.
> > > Why you want to do this in the first place? Wha

RE: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-05 Thread Zeng, Oak
Hi,

Using dma-fence for completion/dependency tracking for long-run workloads (more 
precisely, on-demand paging/page fault enabled workloads) can cause deadlock. 
This seems to be the significant issue here. Other issues, such as the drm 
scheduler completion order implication, are minor and can be solved inside the 
framework of the drm scheduler. We need to evaluate the paths below:

1) Still use the drm scheduler for job submission, and use dma-fence for 
job completion waiting/dependency tracking. This is the solution proposed in this 
series. Annotate dma-fences for long-run workloads: user can still wait on a 
dma-fence for job completion but can't wait on a dma-fence while holding any 
memory management locks. We still use dma-fence for dependency tracking, but it 
very easily runs into deadlock when on-demand paging is in the picture. The 
annotation helps us detect deadlocks but does not solve them. Seems *not* a 
complete solution: it is almost impossible to completely avoid dependency 
deadlocks in a complex runtime environment.

2) Still use the drm scheduler but don't use dma-fence for completion 
signaling and dependency tracking. This way we still get some functionality for 
free (reset, error handling, ring flow control, as Matt said) from the drm 
scheduler, but push the dependency/completion tracking completely to user space 
using techniques such as user space fences (a minimal sketch of such a fence 
wait follows below this list). User space doesn't have a chance to wait on a 
fence while holding a kernel memory management lock, thus the dma-fence deadlock 
issue is solved.

3) Completely discard the drm scheduler and dma-fence for long-run 
workloads. Use a user queue/doorbell for super fast submission, directly 
interacting with the fw scheduler. Use user fences for completion/dependency 
tracking.
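
For reference, a user space fence wait in its simplest form could look like the 
sketch below. This is my own illustration of the idea, not a real UMD interface; 
the fence here is just a 64-bit value at an address the GPU (or firmware) writes 
when the work completes.

#include <stdatomic.h>
#include <stdint.h>

/* Wait until the fence value at fence_va reaches the expected point. */
static void user_fence_wait(const _Atomic uint64_t *fence_va, uint64_t value)
{
        /* Busy-wait purely for illustration; a real UMD would back off or
         * use a driver-provided wait primitive. The important property is
         * that no kernel memory management lock is held while waiting. */
        while (atomic_load_explicit(fence_va, memory_order_acquire) < value)
                ;
}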

Thanks,
Oak

> -Original Message-
> From: Christian König 
> Sent: April 5, 2023 3:30 AM
> To: Brost, Matthew ; Zeng, Oak
> 
> Cc: dri-devel@lists.freedesktop.org; intel...@lists.freedesktop.org;
> robdcl...@chromium.org; thomas.hellst...@linux.intel.com; airl...@linux.ie;
> l...@asahilina.net; boris.brezil...@collabora.com; 
> faith.ekstr...@collabora.com
> Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> plans
> 
> Am 04.04.23 um 20:08 schrieb Matthew Brost:
> > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
> >> Hi Matt, Thomas,
> >>
> >> Some very bold out of box thinking in this area:
> >>
> >> 1. so you want to use drm scheduler and dma-fence for long running 
> >> workload.
> Why you want to do this in the first place? What is the benefit? Drm 
> scheduler is
> pretty much a software scheduler. Modern gpu has scheduler built at fw/hw
> level, as you said below for intel this is Guc. Can xe driver just directly 
> submit job
> to Guc, bypassing drm scheduler?
> >>
> > If we did that now we have 2 paths for dependency track, flow controling
> > the ring, resets / error handling / backend submission implementations.
> > We don't want this.
> 
> Well exactly that's the point: Why?
> 
> As far as I can see that are two completely distinct use cases, so you
> absolutely do want two completely distinct implementations for this.
> 
> >> 2. using dma-fence for long run workload: I am well aware that page fault 
> >> (and
> the consequent memory allocation/lock acquiring to fix the fault) can cause
> deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be
> used purely because the nature of the workload that it runs very long 
> (indefinite).
> I did a math: the dma_fence_wait_timeout function's third param is the timeout
> which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days 
> is not long
> enough, can we just change the timeout parameter to signed 64 bits so it is 
> much
> longer than our life time...
> >>
> >> So I mainly argue we can't use dma-fence for long-run workload is not
> because the workload runs very long, rather because of the fact that we use
> page fault for long-run workload. If we enable page fault for short-run 
> workload,
> we can't use dma-fence either. Page fault is the key thing here.
> >>
> >> Now since we use page fault which is *fundamentally* controversial with
> dma-fence design, why now just introduce a independent concept such as user-
> fence instead of extending existing dma-fence?
> >>
> >> I like unified design. If drm scheduler, dma-fence can be extended to work 
> >> for
> everything, it is beautiful. But seems we have some fundamental problem here.
> >>
> > Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
> > the signal / CB infrastructure) and enforce we don't use use these
> > dma-fenc

RE: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-04 Thread Zeng, Oak
Hi Matt, Thomas,

Some very bold out-of-the-box thinking in this area:

1. So you want to use the drm scheduler and dma-fence for long running workloads. 
Why do you want to do this in the first place? What is the benefit? The drm 
scheduler is pretty much a software scheduler. Modern gpus have a scheduler built 
at the fw/hw level; as you said below, for intel this is the GuC. Can the xe 
driver just directly submit jobs to the GuC, bypassing the drm scheduler? 

2. Using dma-fence for long run workloads: I am well aware that a page fault (and 
the consequent memory allocation/lock acquisition to fix the fault) can cause 
deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be 
used purely because of the nature of the workload, namely that it runs very long 
(indefinitely). I did the math: the dma_fence_wait_timeout function's third param 
is the timeout, which is a signed long type. If HZ is 1000, this is about 23 
days. If 23 days is not long enough, can we just change the timeout parameter 
to signed 64 bits so it is much longer than our lifetime... 
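
(For the record, the arithmetic behind that estimate, assuming HZ=1000 so one 
jiffy is 1 ms and a 32-bit signed long for the timeout: LONG_MAX jiffies = 
2^31 - 1 ms ~ 2.1e6 s ~ 24.8 days, i.e. the same roughly-three-week ballpark as 
above. With a signed 64-bit timeout the bound becomes 2^63 - 1 ms ~ 2.9e8 years, 
effectively unbounded.)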

So my main argument is that we can't use dma-fence for long-run workloads not 
because the workload runs very long, but rather because of the fact that we use 
page faults for long-run workloads. If we enable page faults for short-run 
workloads, we can't use dma-fence either. Page fault is the key thing here.

Now, since we use page faults, which are *fundamentally* at odds with the 
dma-fence design, why not just introduce an independent concept such as a 
user-fence instead of extending the existing dma-fence? 

I like a unified design. If the drm scheduler and dma-fence can be extended to 
work for everything, that is beautiful. But it seems we have some fundamental 
problems here.

Thanks,
Oak

> -Original Message-
> From: dri-devel  On Behalf Of
> Matthew Brost
> Sent: April 3, 2023 8:22 PM
> To: dri-devel@lists.freedesktop.org; intel...@lists.freedesktop.org
> Cc: robdcl...@chromium.org; thomas.hellst...@linux.intel.com; 
> airl...@linux.ie;
> l...@asahilina.net; boris.brezil...@collabora.com; Brost, Matthew
> ; christian.koe...@amd.com;
> faith.ekstr...@collabora.com
> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
> 
> Hello,
> 
> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> have been asked to merge our common DRM scheduler patches first as well
> as develop a common solution for long running workloads with the DRM
> scheduler. This RFC series is our first attempt at doing this. We
> welcome any and all feedback.
> 
> This can we thought of as 4 parts detailed below.
> 
> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> entity (patches 1-3)
> 
> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> GuC) which is a new paradigm WRT to the DRM scheduler and presents
> severals problems as the DRM was originally designed to schedule jobs on
> hardware queues. The main problem being that DRM scheduler expects the
> submission order of jobs to be the completion order of jobs even across
> multiple entities. This assumption falls apart with a firmware scheduler
> as a firmware scheduler has no concept of jobs and jobs can complete out
> of order. A novel solution for was originally thought of by Faith during
> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> and entity. I believe the AGX driver [3] is using this approach and
> Boris may use approach as well for the Mali driver [4].
> 
> To support a 1 to 1 relationship we move the main execution function
> from a kthread to a work queue and add a new scheduling mode which
> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> The new scheduling mode should unify all drivers usage with a 1 to 1
> relationship and can be thought of as using scheduler as a dependency /
> infligt job tracker rather than a true scheduler.
> 
> - Generic messaging interface for DRM scheduler
> 
> Idea is to be able to communicate to the submission backend with in band
> (relative to main execution function) messages. Messages are backend
> defined and flexable enough for any use case. In Xe we use these
> messages to clean up entites, set properties for entites, and suspend /
> resume execution of an entity [5]. I suspect other driver can leverage
> this messaging concept too as it a convenient way to avoid races in the
> backend.
> 
> - Support for using TDR for all error paths of a scheduler / entity
> 
> Fix a few races / bugs, add function to dynamically set the TDR timeout.
> 
> - Annotate dma-fences for long running workloads.
> 
> The idea here is to use dma-fences only as sync points within the
> scheduler and never export them for long running workloads. By
> annotating these fences as long running we ensure that these dma-fences
> are never used in a way that breaks the dma-fence rules. A benefit of
> thus approach is the scheduler can still safely flow control the
> execution ring buffer via the job limit without breaking the dma-fence
> 

RE: [Intel-gfx] [RFC v4 06/14] drm/i915/vm_bind: Handle persistent vmas

2022-09-26 Thread Zeng, Oak



Regards,
Oak

> -Original Message-
> From: Intel-gfx  On Behalf Of 
> Niranjana
> Vishwanathapura
> Sent: September 21, 2022 3:10 AM
> To: intel-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org
> Cc: Zanoni, Paulo R ; Hellstrom, Thomas
> ; Auld, Matthew ; Vetter,
> Daniel ; christian.koe...@amd.com
> Subject: [Intel-gfx] [RFC v4 06/14] drm/i915/vm_bind: Handle persistent vmas
> 
> Treat VM_BIND vmas as persistent across execbuf ioctl calls and handle
> them during the request submission in the execbuff path.
> 
> Support eviction by maintaining a list of evicted persistent vmas
> for rebinding during next submission.
> 
> Signed-off-by: Niranjana Vishwanathapura 
> Signed-off-by: Andi Shyti 
> ---
>  .../drm/i915/gem/i915_gem_vm_bind_object.c|  7 +++
>  drivers/gpu/drm/i915/gt/intel_gtt.c   |  2 +
>  drivers/gpu/drm/i915/gt/intel_gtt.h   |  4 ++
>  drivers/gpu/drm/i915/i915_gem_gtt.c   | 39 
>  drivers/gpu/drm/i915/i915_gem_gtt.h   |  3 ++
>  drivers/gpu/drm/i915/i915_vma.c   | 46 +++
>  drivers/gpu/drm/i915/i915_vma.h   | 45 +-
>  drivers/gpu/drm/i915/i915_vma_types.h | 17 +++
>  8 files changed, 151 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_vm_bind_object.c
> b/drivers/gpu/drm/i915/gem/i915_gem_vm_bind_object.c
> index 7ca6a41fc981..236f901b8b9c 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_vm_bind_object.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_vm_bind_object.c
> @@ -91,6 +91,12 @@ static void i915_gem_vm_bind_remove(struct i915_vma
> *vma, bool release_obj)
>  {
>   lockdep_assert_held(>vm->vm_bind_lock);
> 
> + spin_lock(>vm->vm_rebind_lock);
> + if (!list_empty(>vm_rebind_link))
> + list_del_init(>vm_rebind_link);
> + i915_vma_set_purged(vma);
> + spin_unlock(>vm->vm_rebind_lock);
> +
>   list_del_init(>vm_bind_link);
>   list_del_init(>non_priv_vm_bind_link);
>   i915_vm_bind_it_remove(vma, >vm->va);
> @@ -181,6 +187,7 @@ static struct i915_vma *vm_bind_get_vma(struct
> i915_address_space *vm,
> 
>   vma->start = va->start;
>   vma->last = va->start + va->length - 1;
> + i915_vma_set_persistent(vma);
> 
>   return vma;
>  }
> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.c
> b/drivers/gpu/drm/i915/gt/intel_gtt.c
> index da4f9dee0397..6db31197fa87 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gtt.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.c
> @@ -296,6 +296,8 @@ void i915_address_space_init(struct i915_address_space
> *vm, int subclass)
>   INIT_LIST_HEAD(>non_priv_vm_bind_list);
>   vm->root_obj = i915_gem_object_create_internal(vm->i915, PAGE_SIZE);
>   GEM_BUG_ON(IS_ERR(vm->root_obj));
> + INIT_LIST_HEAD(>vm_rebind_list);
> + spin_lock_init(>vm_rebind_lock);
>  }
> 
>  void *__px_vaddr(struct drm_i915_gem_object *p)
> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h
> b/drivers/gpu/drm/i915/gt/intel_gtt.h
> index 3f2e87d3bf34..b73d35b4e05d 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> @@ -273,6 +273,10 @@ struct i915_address_space {
>   struct list_head vm_bind_list;
>   /** @vm_bound_list: List of vm_binding completed */
>   struct list_head vm_bound_list;
> + /* @vm_rebind_list: list of vmas to be rebinded */
> + struct list_head vm_rebind_list;
> + /* @vm_rebind_lock: protects vm_rebound_list */
> + spinlock_t vm_rebind_lock;
>   /* @va: tree of persistent vmas */
>   struct rb_root_cached va;
>   struct list_head non_priv_vm_bind_list;
> diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c
> b/drivers/gpu/drm/i915/i915_gem_gtt.c
> index 329ff75b80b9..b7d0844de561 100644
> --- a/drivers/gpu/drm/i915/i915_gem_gtt.c
> +++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
> @@ -25,6 +25,45 @@
>  #include "i915_trace.h"
>  #include "i915_vgpu.h"
> 
> +/**
> + * i915_vm_sync() - Wait until address space is not in use
> + * @vm: address space
> + *
> + * Waits until all requests using the address space are complete.
> + *
> + * Returns: 0 if success, -ve err code upon failure
> + */
> +int i915_vm_sync(struct i915_address_space *vm)
> +{
> + int ret;
> +
> + /* Wait for all requests under this vm to finish */
> + ret = dma_resv_wait_timeout(vm->root_obj->base.resv,
> + DMA_RESV_USAGE_BOOKKEEP, false,
> + MAX_SCHEDULE_TIMEOUT);
> + if (ret < 0)
> + return ret;
> + else if (ret > 0)
> + return 0;
> + else
> + return -ETIMEDOUT;
> +}
> +
> +/**
> + * i915_vm_is_active() - Check if address space is being used
> + * @vm: address space
> + *
> + * Check if any request using the specified address space is
> + * active.
> + *
> + * Returns: true if address space is active, false otherwise.
> + */
> +bool i915_vm_is_active(const struct 

RE: [Intel-gfx] [RFC 05/10] drm/i915/vm_bind: Handle persistent vmas

2022-07-05 Thread Zeng, Oak


Thanks,
Oak

> -Original Message-
> From: C, Ramalingam 
> Sent: July 5, 2022 5:20 AM
> To: Zeng, Oak 
> Cc: Vishwanathapura, Niranjana ;
> intel-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter,
> Daniel ; christian.koe...@amd.com; Hellstrom,
> Thomas ; Zanoni, Paulo R
> ; Auld, Matthew 
> Subject: Re: [Intel-gfx] [RFC 05/10] drm/i915/vm_bind: Handle persistent
> vmas
> 
> On 2022-07-04 at 17:05:38 +, Zeng, Oak wrote:
> >
> >
> > Thanks,
> > Oak
> >
> > > -Original Message-
> > > From: Intel-gfx  On Behalf
> > > Of Niranjana Vishwanathapura
> > > Sent: July 1, 2022 6:51 PM
> > > To: intel-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org
> > > Cc: Zanoni, Paulo R ; Hellstrom, Thomas
> > > ; Auld, Matthew
> > > ; Vetter, Daniel ;
> > > christian.koe...@amd.com
> > > Subject: [Intel-gfx] [RFC 05/10] drm/i915/vm_bind: Handle persistent
> > > vmas
> > >
> > > Treat VM_BIND vmas as persistent and handle them during the request
> > > submission in the execbuff path.
> >
> > Hi Niranjana,
> >
> > Is the meaning of "persistent" above persistent across all the subsequent
> execbuf ioctls?
> 
> Yes oak. Thats correct. persistent across multiple execbuf ioctls.

Thank you, Ram. Maybe we can add that to the commit message: "Treat VM_BIND 
vmas as persistent across multiple execbuf ioctls"?  I think this is in contrast 
to the old execbuf mode, where we bind in the execbuf ioctl and the bindings are 
only valid for that execbuf. For those who don't have that background knowledge, 
it is hard to guess the meaning of persistent.

Thanks,
Oak
> 
> Regards,
> Ram.
> >
> > Thanks,
> > Oak
> >
> > >
> > > Support eviction by maintaining a list of evicted persistent vmas
> > > for rebinding during next submission.
> > >
> > > Signed-off-by: Niranjana Vishwanathapura
> > > 
> > > ---
> > >  drivers/gpu/drm/i915/gem/i915_gem_object.c|  1 +
> > >  drivers/gpu/drm/i915/gem/i915_gem_vm_bind.h   |  3 +
> > >  .../drm/i915/gem/i915_gem_vm_bind_object.c| 12 ++-
> > >  drivers/gpu/drm/i915/gt/intel_gtt.c   |  2 +
> > >  drivers/gpu/drm/i915/gt/intel_gtt.h   |  2 +
> > >  drivers/gpu/drm/i915/i915_gem_gtt.h   | 22 ++
> > >  drivers/gpu/drm/i915/i915_vma.c   | 32 +++-
> > >  drivers/gpu/drm/i915/i915_vma.h   | 78 +--
> > >  drivers/gpu/drm/i915/i915_vma_types.h | 23 ++
> > >  9 files changed, 163 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.c
> > > b/drivers/gpu/drm/i915/gem/i915_gem_object.c
> > > index ccec4055fde3..5121f02ba95c 100644
> > > --- a/drivers/gpu/drm/i915/gem/i915_gem_object.c
> > > +++ b/drivers/gpu/drm/i915/gem/i915_gem_object.c
> > > @@ -38,6 +38,7 @@
> > >  #include "i915_gem_mman.h"
> > >  #include "i915_gem_object.h"
> > >  #include "i915_gem_ttm.h"
> > > +#include "i915_gem_vm_bind.h"
> > >  #include "i915_memcpy.h"
> > >  #include "i915_trace.h"
> > >
> > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_vm_bind.h
> > > b/drivers/gpu/drm/i915/gem/i915_gem_vm_bind.h
> > > index 849bf3c1061e..eaadf5a6ab09 100644
> > > --- a/drivers/gpu/drm/i915/gem/i915_gem_vm_bind.h
> > > +++ b/drivers/gpu/drm/i915/gem/i915_gem_vm_bind.h
> > > @@ -6,6 +6,7 @@
> > >  #ifndef __I915_GEM_VM_BIND_H
> > >  #define __I915_GEM_VM_BIND_H
> > >
> > > +#include 
> > >  #include "i915_drv.h"
> > >
> > >  #define assert_vm_bind_held(vm)   lockdep_assert_held(&(vm)-
> > > >vm_bind_lock)
> > > @@ -26,6 +27,8 @@ static inline void i915_gem_vm_bind_unlock(struct
> > > i915_address_space *vm)
> > >   mutex_unlock(>vm_bind_lock);
> > >  }
> > >
> > > +#define assert_vm_priv_held(vm)   assert_object_held((vm)->root_obj)
> > > +
> > >  static inline int i915_gem_vm_priv_lock(struct i915_address_space *vm,
> > >   struct i915_gem_ww_ctx *ww)
> > >  {
> > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_vm_bind_object.c
> > > b/drivers/gpu/drm/i915/gem/i915_gem_vm_bind_object.c
> > > index 96f139cc8060..1a8efa83547f 100644
> > > --- a/drivers

RE: [Intel-gfx] [RFC 05/10] drm/i915/vm_bind: Handle persistent vmas

2022-07-04 Thread Zeng, Oak



Thanks,
Oak

> -Original Message-
> From: Intel-gfx  On Behalf Of
> Niranjana Vishwanathapura
> Sent: July 1, 2022 6:51 PM
> To: intel-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org
> Cc: Zanoni, Paulo R ; Hellstrom, Thomas
> ; Auld, Matthew ;
> Vetter, Daniel ; christian.koe...@amd.com
> Subject: [Intel-gfx] [RFC 05/10] drm/i915/vm_bind: Handle persistent vmas
> 
> Treat VM_BIND vmas as persistent and handle them during the request
> submission in the execbuff path.

Hi Niranjana,

Is the meaning of "persistent" above persistent across all the subsequent 
execbuf ioctls?

Thanks,
Oak 

> 
> Support eviction by maintaining a list of evicted persistent vmas for 
> rebinding
> during next submission.
> 
> Signed-off-by: Niranjana Vishwanathapura
> 
> ---
>  drivers/gpu/drm/i915/gem/i915_gem_object.c|  1 +
>  drivers/gpu/drm/i915/gem/i915_gem_vm_bind.h   |  3 +
>  .../drm/i915/gem/i915_gem_vm_bind_object.c| 12 ++-
>  drivers/gpu/drm/i915/gt/intel_gtt.c   |  2 +
>  drivers/gpu/drm/i915/gt/intel_gtt.h   |  2 +
>  drivers/gpu/drm/i915/i915_gem_gtt.h   | 22 ++
>  drivers/gpu/drm/i915/i915_vma.c   | 32 +++-
>  drivers/gpu/drm/i915/i915_vma.h   | 78 +--
>  drivers/gpu/drm/i915/i915_vma_types.h | 23 ++
>  9 files changed, 163 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.c
> b/drivers/gpu/drm/i915/gem/i915_gem_object.c
> index ccec4055fde3..5121f02ba95c 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_object.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_object.c
> @@ -38,6 +38,7 @@
>  #include "i915_gem_mman.h"
>  #include "i915_gem_object.h"
>  #include "i915_gem_ttm.h"
> +#include "i915_gem_vm_bind.h"
>  #include "i915_memcpy.h"
>  #include "i915_trace.h"
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_vm_bind.h
> b/drivers/gpu/drm/i915/gem/i915_gem_vm_bind.h
> index 849bf3c1061e..eaadf5a6ab09 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_vm_bind.h
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_vm_bind.h
> @@ -6,6 +6,7 @@
>  #ifndef __I915_GEM_VM_BIND_H
>  #define __I915_GEM_VM_BIND_H
> 
> +#include 
>  #include "i915_drv.h"
> 
>  #define assert_vm_bind_held(vm)   lockdep_assert_held(&(vm)-
> >vm_bind_lock)
> @@ -26,6 +27,8 @@ static inline void i915_gem_vm_bind_unlock(struct
> i915_address_space *vm)
>   mutex_unlock(>vm_bind_lock);
>  }
> 
> +#define assert_vm_priv_held(vm)   assert_object_held((vm)->root_obj)
> +
>  static inline int i915_gem_vm_priv_lock(struct i915_address_space *vm,
>   struct i915_gem_ww_ctx *ww)
>  {
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_vm_bind_object.c
> b/drivers/gpu/drm/i915/gem/i915_gem_vm_bind_object.c
> index 96f139cc8060..1a8efa83547f 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_vm_bind_object.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_vm_bind_object.c
> @@ -85,6 +85,13 @@ void i915_gem_vm_bind_remove(struct i915_vma
> *vma, bool release_obj)  {
>   assert_vm_bind_held(vma->vm);
> 
> + spin_lock(>vm->vm_rebind_lock);
> + if (!list_empty(>vm_rebind_link))
> + list_del_init(>vm_rebind_link);
> + i915_vma_set_purged(vma);
> + i915_vma_set_freed(vma);
> + spin_unlock(>vm->vm_rebind_lock);
> +
>   if (!list_empty(>vm_bind_link)) {
>   list_del_init(>vm_bind_link);
>   list_del_init(>non_priv_vm_bind_link);
> @@ -220,6 +227,7 @@ static struct i915_vma *vm_bind_get_vma(struct
> i915_address_space *vm,
> 
>   vma->start = va->start;
>   vma->last = va->start + va->length - 1;
> + i915_vma_set_persistent(vma);
> 
>   return vma;
>  }
> @@ -304,8 +312,10 @@ int i915_gem_vm_bind_obj(struct
> i915_address_space *vm,
> 
>   i915_vm_bind_put_fence(vma);
>  put_vma:
> - if (ret)
> + if (ret) {
> + i915_vma_set_freed(vma);
>   i915_vma_destroy(vma);
> + }
> 
>   i915_gem_ww_ctx_fini();
>  unlock_vm:
> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.c
> b/drivers/gpu/drm/i915/gt/intel_gtt.c
> index df0a8459c3c6..55d5389b2c6c 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gtt.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.c
> @@ -293,6 +293,8 @@ void i915_address_space_init(struct
> i915_address_space *vm, int subclass)
>   INIT_LIST_HEAD(>non_priv_vm_bind_list);
>   vm->root_obj = i915_gem_object_create_internal(vm->i915,
> PAGE_SIZE);
>   GEM_BUG_ON(IS_ERR(vm->root_obj));
> + INIT_LIST_HEAD(>vm_rebind_list);
> + spin_lock_init(>vm_rebind_lock);
>  }
> 
>  void *__px_vaddr(struct drm_i915_gem_object *p) diff --git
> a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
> index f538ce9115c9..fe5485c4a1cd 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> @@ -265,6 +265,8 @@ struct i915_address_space {
>   struct mutex vm_bind_lock;  /* Protects vm_bind lists */
>   struct 

RE: [Intel-gfx] [PATCH v3 3/3] drm/doc/rfc: VM_BIND uapi definition

2022-06-28 Thread Zeng, Oak


Thanks,
Oak

> -Original Message-
> From: Tvrtko Ursulin 
> Sent: June 28, 2022 4:58 AM
> To: Zeng, Oak ; Landwerlin, Lionel G
> ; Vishwanathapura, Niranjana
> 
> Cc: Zanoni, Paulo R ; intel-
> g...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Hellstrom,
> Thomas ; Wilson, Chris P
> ; Vetter, Daniel ;
> christian.koe...@amd.com; Auld, Matthew 
> Subject: Re: [Intel-gfx] [PATCH v3 3/3] drm/doc/rfc: VM_BIND uapi definition
> 
> 
> On 27/06/2022 19:58, Zeng, Oak wrote:
> >
> >
> > Thanks,
> > Oak
> >
> >> -Original Message-
> >> From: Tvrtko Ursulin 
> >> Sent: June 27, 2022 4:30 AM
> >> To: Zeng, Oak ; Landwerlin, Lionel G
> >> ; Vishwanathapura, Niranjana
> >> 
> >> Cc: Zanoni, Paulo R ; intel-
> >> g...@lists.freedesktop.org; dri-devel@lists.freedesktop.org;
> >> Hellstrom, Thomas ; Wilson, Chris P
> >> ; Vetter, Daniel ;
> >> christian.koe...@amd.com; Auld, Matthew 
> >> Subject: Re: [Intel-gfx] [PATCH v3 3/3] drm/doc/rfc: VM_BIND uapi
> >> definition
> >>
> >>
> >> On 24/06/2022 21:23, Zeng, Oak wrote:
> >>> Let's compare "tlb invalidate at vm unbind" vs "tlb invalidate at
> >>> backing
> >> storage":
> >>>
> >>> Correctness:
> >>> consider this sequence of:
> >>> 1. unbind va1 from pa1,
> >>> 2. then bind va1 to pa2. //user space has the freedom to do this as
> >>> it manages virtual address space 3. Submit shader code using va1, 4.
> >>> Then retire pa1.
> >>>
> >>> If you don't perform tlb invalidate at step #1, in step #3, shader
> >>> will use
> >> stale entries in tlb and pa1 will be used for the shader. User want to use
> pa2.
> >> So I don't think invalidate tlb at step #4 make correctness.
> >>
> >> Define step 3. Is it a new execbuf? If so then there will be a TLB flush
> there.
> >> Unless the plan is to stop doing that with eb3 but I haven't picked
> >> up on that anywhere so far.
> >
> > In Niranjana's latest patch series, he removed the TLB flushing from
> vm_unbind. He also said explicitly TLB invalidation will be performed at job
> submission and backing storage releasing time, which is the existing behavior
> of the current i915 driver.
> >
> > I think if we invalidate TLB on each vm_unbind, then we don't need to
> invalidate at submission and backing storage releasing. It doesn't make a lot
> of sense to me to perform a tlb invalidation at execbuf time. Maybe it is a
> behavior for the old implicit binding programming model. For vm_bind and
> eb3, we separate the binding and job submission into two APIs. It is more
> natural the TLB invalidation be coupled with the vm bind/unbind, not job
> submission. So in my opinion we should remove tlb invalidation from
> submission and backing storage releasing and add it to vm unbind. This is
> method is cleaner to me.
> 
> You can propose this model (not flushing in eb3) but I have my doubts.
> Consider the pointlessness of flushing on N unbinds for 99% of clients which
> are not infinite compute batch. And consider how you make the behaviour
> consistent on all platforms (selective vs global tlb flush).

When I thought about eb3, compute workloads and ulls were also in the picture. 
Under ulls, user mode keeps submitting jobs without calling execbuf (it uses a 
semaphore to notify the HW of a new batch). The execbuf + backing release flush 
has a correctness issue there, as I pointed out. Now that we have decided eb3 is 
only for mesa, not for compute, we don't have this correctness problem for now. 
We can close this conversation for now and revive it when we move to Xe and vm 
bind for compute.

Regards,
Oak


> 
> Also note that this discussion is orthogonal to unbind vs backing store 
> release.
> 
> > Regarding performance, we don't have data. In my opinion, we should
> make things work in a most straight forward way as the first step. Then
> consider performance improvement if necessary. Consider some delayed tlb
> invalidation at submission and backing release time without performance
> data support wasn't a good decision.
> 
> It is quite straightforward though. ;) It aligns with the eb2 model and
> argument can be made backing store release is (much) less frequent than
> unbind (consider softpin where client could trigger a lot of pointless 
> flushes).
> 
> Regards,
> 
> Tvrtko


RE: [Intel-gfx] [PATCH v3 3/3] drm/doc/rfc: VM_BIND uapi definition

2022-06-27 Thread Zeng, Oak


Thanks,
Oak

> -Original Message-
> From: Tvrtko Ursulin 
> Sent: June 27, 2022 4:30 AM
> To: Zeng, Oak ; Landwerlin, Lionel G
> ; Vishwanathapura, Niranjana
> 
> Cc: Zanoni, Paulo R ; intel-
> g...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Hellstrom,
> Thomas ; Wilson, Chris P
> ; Vetter, Daniel ;
> christian.koe...@amd.com; Auld, Matthew 
> Subject: Re: [Intel-gfx] [PATCH v3 3/3] drm/doc/rfc: VM_BIND uapi definition
> 
> 
> On 24/06/2022 21:23, Zeng, Oak wrote:
> > Let's compare "tlb invalidate at vm unbind" vs "tlb invalidate at backing
> storage":
> >
> > Correctness:
> > consider this sequence of:
> > 1. unbind va1 from pa1,
> > 2. then bind va1 to pa2. //user space has the freedom to do this as it
> > manages virtual address space 3. Submit shader code using va1, 4. Then
> > retire pa1.
> >
> > If you don't perform tlb invalidate at step #1, in step #3, shader will use
> stale entries in tlb and pa1 will be used for the shader. User want to use 
> pa2.
> So I don't think invalidate tlb at step #4 make correctness.
> 
> Define step 3. Is it a new execbuf? If so then there will be a TLB flush 
> there.
> Unless the plan is to stop doing that with eb3 but I haven't picked up on that
> anywhere so far.

In Niranjana's latest patch series, he removed the TLB flushing from vm_unbind. 
He also said explicitly that TLB invalidation will be performed at job submission 
and backing storage release time, which is the existing behavior of the 
current i915 driver.

I think if we invalidate the TLB on each vm_unbind, then we don't need to 
invalidate at submission and backing storage release. It doesn't make a lot 
of sense to me to perform a tlb invalidation at execbuf time. Maybe that is a 
behavior for the old implicit binding programming model. For vm_bind and eb3, 
we separate binding and job submission into two APIs. It is more natural for 
TLB invalidation to be coupled with vm bind/unbind, not job submission. So 
in my opinion we should remove tlb invalidation from submission and backing 
storage release and add it to vm unbind. This method is cleaner to me.

Regarding performance, we don't have data. In my opinion, we should make things 
work in the most straightforward way as the first step, then consider 
performance improvements if necessary. Adopting delayed tlb invalidation at 
submission and backing release time without performance data to support it 
wasn't a good decision.

Regards,
Oak

> 
> > Performance:
> > It is straight forward to invalidate tlb at step 1. If platform support 
> > range
> based tlb invalidation, we can perform range based invalidation easily at
> step1.
> 
> If the platform supports range base yes. If it doesn't _and_ the flush at
> unbind is not needed for 99% of use cases then it is simply a waste.
> 
> > If you do it at step 4, you either need to perform a whole gt tlb 
> > invalidation
> (worse performance), or you need to record all the VAs that this pa has been
> bound to and invalidate all the VA ranges - ugly program.
> 
> Someone can setup some benchmarking? :)
> 
> Regards,
> 
> Tvrtko
> 
> >
> >
> > Thanks,
> > Oak
> >
> >> -Original Message-
> >> From: Tvrtko Ursulin 
> >> Sent: June 24, 2022 4:32 AM
> >> To: Zeng, Oak ; Landwerlin, Lionel G
> >> ; Vishwanathapura, Niranjana
> >> 
> >> Cc: Zanoni, Paulo R ; intel-
> >> g...@lists.freedesktop.org; dri-devel@lists.freedesktop.org;
> >> Hellstrom, Thomas ; Wilson, Chris P
> >> ; Vetter, Daniel ;
> >> christian.koe...@amd.com; Auld, Matthew 
> >> Subject: Re: [Intel-gfx] [PATCH v3 3/3] drm/doc/rfc: VM_BIND uapi
> >> definition
> >>
> >>
> >> On 23/06/2022 22:05, Zeng, Oak wrote:
> >>>> -Original Message-
> >>>> From: Intel-gfx  On Behalf
> >>>> Of Tvrtko Ursulin
> >>>> Sent: June 23, 2022 7:06 AM
> >>>> To: Landwerlin, Lionel G ;
> >>>> Vishwanathapura, Niranjana 
> >>>> Cc: Zanoni, Paulo R ;
> >>>> intel-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org;
> >>>> Hellstrom, Thomas ; Wilson, Chris P
> >>>> ; Vetter, Daniel
> >>>> ; christian.koe...@amd.com; Auld,
> Matthew
> >>>> 
> >>>> Subject: Re: [Intel-gfx] [PATCH v3 3/3] drm/doc/rfc: VM_BIND uapi
> >>>> definition
> >>>>
> >>>>
> >>>> On 23/06/2022 09:57, Lionel Landwerlin wrote:
> >>>>> On 23/06/2022 11:27, Tvrtko Ursulin wrote:
> &g

RE: [Intel-gfx] [PATCH v3 3/3] drm/doc/rfc: VM_BIND uapi definition

2022-06-24 Thread Zeng, Oak
Let's compare "tlb invalidate at vm unbind" vs "tlb invalidate at backing 
storage release":

Correctness: 
consider this sequence:
1. unbind va1 from pa1, 
2. then bind va1 to pa2. // user space has the freedom to do this as it manages 
the virtual address space
3. submit shader code using va1, 
4. then retire pa1. 

If you don't perform a tlb invalidate at step #1, then in step #3 the shader will 
use stale entries in the tlb and pa1 will be used for the shader. The user wants 
to use pa2. So I don't think invalidating the tlb at step #4 ensures correctness.


Performance: 
It is straightforward to invalidate the tlb at step 1. If the platform supports 
range based tlb invalidation, we can perform a range based invalidation easily at 
step 1.
If you do it at step 4, you either need to perform a whole gt tlb invalidation 
(worse performance), or you need to record all the VAs that this pa has been 
bound to and invalidate all those VA ranges - an ugly program.
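
As a sketch of the step-1 flow I am arguing for (every helper name below is 
hypothetical, not an actual i915 interface):

/* Hedged illustration only: tear down the PTEs for the unbound range, then
 * invalidate just that range when the platform supports it, falling back to
 * a whole-gt invalidation otherwise. */
static void vm_unbind_and_invalidate(struct sketch_vm *vm, u64 start, u64 length)
{
        clear_gpu_ptes(vm, start, length);

        if (vm_has_range_tlb_invalidation(vm))
                gpu_tlb_invalidate_range(vm, start, length);
        else
                gpu_tlb_invalidate_full(vm);
}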


Thanks,
Oak

> -Original Message-
> From: Tvrtko Ursulin 
> Sent: June 24, 2022 4:32 AM
> To: Zeng, Oak ; Landwerlin, Lionel G
> ; Vishwanathapura, Niranjana
> 
> Cc: Zanoni, Paulo R ; intel-
> g...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Hellstrom,
> Thomas ; Wilson, Chris P
> ; Vetter, Daniel ;
> christian.koe...@amd.com; Auld, Matthew 
> Subject: Re: [Intel-gfx] [PATCH v3 3/3] drm/doc/rfc: VM_BIND uapi definition
> 
> 
> On 23/06/2022 22:05, Zeng, Oak wrote:
> >> -Original Message-
> >> From: Intel-gfx  On Behalf
> >> Of Tvrtko Ursulin
> >> Sent: June 23, 2022 7:06 AM
> >> To: Landwerlin, Lionel G ;
> >> Vishwanathapura, Niranjana 
> >> Cc: Zanoni, Paulo R ;
> >> intel-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org;
> >> Hellstrom, Thomas ; Wilson, Chris P
> >> ; Vetter, Daniel ;
> >> christian.koe...@amd.com; Auld, Matthew 
> >> Subject: Re: [Intel-gfx] [PATCH v3 3/3] drm/doc/rfc: VM_BIND uapi
> >> definition
> >>
> >>
> >> On 23/06/2022 09:57, Lionel Landwerlin wrote:
> >>> On 23/06/2022 11:27, Tvrtko Ursulin wrote:
> >>>>>
> >>>>> After a vm_unbind, UMD can re-bind to same VA range against an
> >>>>> active VM.
> >>>>> Though I am not sue with Mesa usecase if that new mapping is
> >>>>> required for running GPU job or it will be for the next
> >>>>> submission. But ensuring the tlb flush upon unbind, KMD can ensure
> >>>>> correctness.
> >>>>
> >>>> Isn't that their problem? If they re-bind for submitting _new_ work
> >>>> then they get the flush as part of batch buffer pre-amble.
> >>>
> >>> In the non sparse case, if a VA range is unbound, it is invalid to
> >>> use that range for anything until it has been rebound by something else.
> >>>
> >>> We'll take the fence provided by vm_bind and put it as a wait fence
> >>> on the next execbuffer.
> >>>
> >>> It might be safer in case of memory over fetching?
> >>>
> >>>
> >>> TLB flush will have to happen at some point right?
> >>>
> >>> What's the alternative to do it in unbind?
> >>
> >> Currently TLB flush happens from the ring before every BB_START and
> >> also when i915 returns the backing store pages to the system.
> >
> >
> > Can you explain more why tlb flush when i915 retire the backing storage? I
> never figured that out when I looked at the codes. As I understand it, tlb
> caches the gpu page tables which map a va to a pa. So it is straight forward 
> to
> me that we perform a tlb flush when we change the page table (either at vm
> bind time or unbind time. Better at unbind time for performance reason).
> 
> I don't know what performs better - someone can measure the two
> approaches? Certainly on platforms where we only have global TLB flushing
> the cost is quite high so my thinking was to allow i915 to control when it 
> will
> be done and not guarantee it in the uapi if it isn't needed for security 
> reasons.
> 
> > But it is rather tricky to me to flush tlb when we retire a backing 
> > storage. I
> don't see how backing storage can be connected to page table. Let's say user
> unbind va1 from pa1, then bind va1 to pa2. Then retire pa1. Submit shader
> code using va1. If we don't tlb flush after unbind va1, the new shader code
> which is supposed to use pa2 will still use pa1 due to the stale entries in 
> tlb,
> right? The point is, tlb cached is tagged with virtual address, not physical
> address. so after we unbind va1 from 

RE: [Intel-gfx] [PATCH v3 3/3] drm/doc/rfc: VM_BIND uapi definition

2022-06-23 Thread Zeng, Oak


Regards,
Oak

> -Original Message-
> From: Intel-gfx  On Behalf Of Tvrtko
> Ursulin
> Sent: June 23, 2022 7:06 AM
> To: Landwerlin, Lionel G ; Vishwanathapura,
> Niranjana 
> Cc: Zanoni, Paulo R ; 
> intel-...@lists.freedesktop.org;
> dri-devel@lists.freedesktop.org; Hellstrom, Thomas 
> ;
> Wilson, Chris P ; Vetter, Daniel
> ; christian.koe...@amd.com; Auld, Matthew
> 
> Subject: Re: [Intel-gfx] [PATCH v3 3/3] drm/doc/rfc: VM_BIND uapi definition
> 
> 
> On 23/06/2022 09:57, Lionel Landwerlin wrote:
> > On 23/06/2022 11:27, Tvrtko Ursulin wrote:
> >>>
> >>> After a vm_unbind, UMD can re-bind to same VA range against an active
> >>> VM.
> >>> Though I am not sue with Mesa usecase if that new mapping is required
> >>> for
> >>> running GPU job or it will be for the next submission. But ensuring the
> >>> tlb flush upon unbind, KMD can ensure correctness.
> >>
> >> Isn't that their problem? If they re-bind for submitting _new_ work
> >> then they get the flush as part of batch buffer pre-amble.
> >
> > In the non sparse case, if a VA range is unbound, it is invalid to use
> > that range for anything until it has been rebound by something else.
> >
> > We'll take the fence provided by vm_bind and put it as a wait fence on
> > the next execbuffer.
> >
> > It might be safer in case of memory over fetching?
> >
> >
> > TLB flush will have to happen at some point right?
> >
> > What's the alternative to do it in unbind?
> 
> Currently TLB flush happens from the ring before every BB_START and also
> when i915 returns the backing store pages to the system.


Can you explain more why we tlb flush when i915 retires the backing storage? I 
never figured that out when I looked at the code. As I understand it, the tlb 
caches the gpu page tables, which map a va to a pa. So it is straightforward to 
me that we perform a tlb flush when we change the page table (either at vm bind 
time or at unbind time; better at unbind time for performance reasons).

But it is rather tricky to me to flush the tlb when we retire a backing storage. 
I don't see how the backing storage can be connected to the page table. Let's say 
user unbinds va1 from pa1, then binds va1 to pa2, then retires pa1, then submits 
shader code using va1. If we don't tlb flush after unbinding va1, the new shader 
code, which is supposed to use pa2, will still use pa1 due to the stale entries 
in the tlb, right? The point is, the tlb cache is tagged with the virtual 
address, not the physical address. So after we unbind va1 from pa1, regardless of 
whether we retire pa1 or not, va1 can be bound to another pa2.

Thanks,
Oak 


> 
> For the former, I haven't seen any mention that for execbuf3 there are
> plans to stop doing it? Anyway, as long as this is kept and sequence of
> bind[1..N]+execbuf is safe and correctly sees all the preceding binds.
> Hence about the alternative to doing it in unbind - first I think lets
> state the problem that is trying to solve.
> 
> For instance is it just for the compute "append work to the running
> batch" use case? I honestly don't remember how was that supposed to work
> so maybe the tlb flush on bind was supposed to deal with that scenario?
> 
> Or you see a problem even for Mesa with the current model?
> 
> Regards,
> 
> Tvrtko


RE: [PATCH v2 3/3] drm/doc/rfc: VM_BIND uapi definition

2022-06-20 Thread Zeng, Oak


Thanks,
Oak

> -Original Message-
> From: Vishwanathapura, Niranjana 
> Sent: June 17, 2022 1:15 AM
> To: intel-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter,
> Daniel 
> Cc: Hellstrom, Thomas ; Wilson, Chris P
> ; ja...@jlekstrand.net;
> christian.koe...@amd.com; Brost, Matthew ;
> Ursulin, Tvrtko ; Auld, Matthew
> ; Landwerlin, Lionel G
> ; Zanoni, Paulo R
> ; Zeng, Oak 
> Subject: [PATCH v2 3/3] drm/doc/rfc: VM_BIND uapi definition
> 
> VM_BIND and related uapi definitions
> 
> v2: Reduce the scope to simple Mesa use case.
> 
> Signed-off-by: Niranjana Vishwanathapura
> 
> ---
>  Documentation/gpu/rfc/i915_vm_bind.h | 226
> +++
>  1 file changed, 226 insertions(+)
>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
> 
> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h
> b/Documentation/gpu/rfc/i915_vm_bind.h
> new file mode 100644
> index ..b7540ddb526d
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
> @@ -0,0 +1,226 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2022 Intel Corporation
> + */
> +
> +/**
> + * DOC: I915_PARAM_HAS_VM_BIND
> + *
> + * VM_BIND feature availability.
> + * See typedef drm_i915_getparam_t param.
> + */
> +#define I915_PARAM_HAS_VM_BIND   57
> +
> +/**
> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
> + *
> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
> + * See struct drm_i915_gem_vm_control flags.
> + *
> + * The older execbuf2 ioctl will not support VM_BIND mode of operation.
> + * For VM_BIND mode, we have new execbuf3 ioctl which will not accept
> any
> + * execlist (See struct drm_i915_gem_execbuffer3 for more details).
> + *
> + */
> +#define I915_VM_CREATE_FLAGS_USE_VM_BIND (1 << 0)
> +
> +/* VM_BIND related ioctls */
> +#define DRM_I915_GEM_VM_BIND 0x3d
> +#define DRM_I915_GEM_VM_UNBIND   0x3e
> +#define DRM_I915_GEM_EXECBUFFER3 0x3f
> +
> +#define DRM_IOCTL_I915_GEM_VM_BIND
>   DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND,
> struct drm_i915_gem_vm_bind)
> +#define DRM_IOCTL_I915_GEM_VM_UNBIND
>   DRM_IOWR(DRM_COMMAND_BASE +
> DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)
> +#define DRM_IOCTL_I915_GEM_EXECBUFFER3
>   DRM_IOWR(DRM_COMMAND_BASE +
> DRM_I915_GEM_EXECBUFFER3, struct drm_i915_gem_execbuffer3)
> +
> +/**
> + * struct drm_i915_gem_vm_bind_fence - Bind/unbind completion
> notification.
> + *
> + * A timeline out fence for vm_bind/unbind completion notification.
> + */
> +struct drm_i915_gem_vm_bind_fence {
> + /** @handle: User's handle for a drm_syncobj to signal. */
> + __u32 handle;
> +
> + /** @rsvd: Reserved, MBZ */
> + __u32 rsvd;
> +
> + /**
> +  * @value: A point in the timeline.
> +  * Value must be 0 for a binary drm_syncobj. A Value of 0 for a
> +  * timeline drm_syncobj is invalid as it turns a drm_syncobj into a
> +  * binary one.
> +  */
> + __u64 value;
> +};
> +
> +/**
> + * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
> + *
> + * This structure is passed to VM_BIND ioctl and specifies the mapping of
> GPU
> + * virtual address (VA) range to the section of an object that should be
> bound
> + * in the device page table of the specified address space (VM).
> + * The VA range specified must be unique (ie., not currently bound) and can
> + * be mapped to whole object or a section of the object (partial binding).
> + * Multiple VA mappings can be created to the same section of the object
> + * (aliasing).
> + *
> + * The @start, @offset and @length should be 4K page aligned. However
> the DG2
> + * and XEHPSDV has 64K page size for device local-memory and has compact
> page
> + * table. On those platforms, for binding device local-memory objects, the
> + * @start should be 2M aligned, @offset and @length should be 64K aligned.
> + * Also, on those platforms, it is not allowed to bind an device local-memory
> + * object and a system memory object in a single 2M section of VA range.
> + */
> +struct drm_i915_gem_vm_bind {
> + /** @vm_id: VM (address space) id to bind */
> + __u32 vm_id;
> +
> + /** @handle: Object handle */
> + __u32 handle;
> +
> + /** @start: Virtual Address start to bind */
> + __u64 start;
> +
> + /** @offset: Offset in object to bind */
> + __u64 offset;
> +
> + /** @length: Length of mapping to bind */
> + __u64 length;
> +
> + /**
> +  * @flags: Supported flags are:
> +  *
> +  * I915_GEM_VM_BIND_READONLY:
> +

RE: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document

2022-06-14 Thread Zeng, Oak


Thanks,
Oak

> -Original Message-
> From: dri-devel  On Behalf Of
> Zeng, Oak
> Sent: June 14, 2022 5:13 PM
> To: Vishwanathapura, Niranjana ;
> Landwerlin, Lionel G 
> Cc: Intel GFX ; Wilson, Chris P
> ; Hellstrom, Thomas
> ; Maling list - DRI developers  de...@lists.freedesktop.org>; Vetter, Daniel ;
> Christian König 
> Subject: RE: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design
> document
> 
> 
> 
> Thanks,
> Oak
> 
> > -Original Message-
> > From: Vishwanathapura, Niranjana 
> > Sent: June 14, 2022 1:02 PM
> > To: Landwerlin, Lionel G 
> > Cc: Zeng, Oak ; Intel GFX  > g...@lists.freedesktop.org>; Maling list - DRI developers  > de...@lists.freedesktop.org>; Hellstrom, Thomas
> > ; Wilson, Chris P
> ;
> > Vetter, Daniel ; Christian König
> > 
> > Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design
> > document
> >
> > On Tue, Jun 14, 2022 at 10:04:00AM +0300, Lionel Landwerlin wrote:
> > >On 13/06/2022 21:02, Niranjana Vishwanathapura wrote:
> > >>On Mon, Jun 13, 2022 at 06:33:07AM -0700, Zeng, Oak wrote:
> > >>>
> > >>>
> > >>>Regards,
> > >>>Oak
> > >>>
> > >>>>-Original Message-
> > >>>>From: Intel-gfx  On
> > >>>>Behalf Of Niranjana
> > >>>>Vishwanathapura
> > >>>>Sent: June 10, 2022 1:43 PM
> > >>>>To: Landwerlin, Lionel G 
> > >>>>Cc: Intel GFX ; Maling list -
> > >>>>DRI developers  > >>>>de...@lists.freedesktop.org>; Hellstrom, Thomas
> > >>>>;
> > >>>>Wilson, Chris P ; Vetter, Daniel
> > >>>>; Christian König
> > 
> > >>>>Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND
> > >>>>feature design
> > >>>>document
> > >>>>
> > >>>>On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
> > >>>>>On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
> > >>>>>>On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
> > >>>>>>>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
> > >>>>>>>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin
> wrote:
> > >>>>>>>>>  On 09/06/2022 00:55, Jason Ekstrand wrote:
> > >>>>>>>>>
> > >>>>>>>>>    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
> > >>>>>>>>>  wrote:
> > >>>>>>>>>
> > >>>>>>>>>  On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko
> > >>>>Ursulin wrote:
> > >>>>>>>>>  >
> > >>>>>>>>>  >
> > >>>>>>>>>  >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
> > >>>>>>>>>  >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana
> > >>>>>>>>>Vishwanathapura
> > >>>>>>>>>  wrote:
> > >>>>>>>>>  >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason
> > >>>>>>>>>Ekstrand wrote:
> > >>>>>>>>>  >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana
> > >>>>Vishwanathapura
> > >>>>>>>>>  >>>>  wrote:
> > >>>>>>>>>  >>>>
> > >>>>>>>>>  >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel
> > >>>>>>>>>Landwerlin
> > >>>>>>>>>  wrote:
> > >>>>>>>>>  >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
> > >>>>>>>>>  >>>>   >
> > >>>>>>>>>  >>>>   > On Thu, Jun 2, 2022 at 3:11 PM Niranjana
> > >>>>>>>>>Vishwanathapura
> > >>>>>>>>>  >>>>   >  wrote:
> > >>>>>>>>>  >>>>   >
> > >>>>>>>>>  >>>>   >   On Wed, Jun 01, 2022 at 01:28:36PM
> > >>>>-0700, Matt

RE: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document

2022-06-14 Thread Zeng, Oak


Thanks,
Oak

> -Original Message-
> From: Vishwanathapura, Niranjana 
> Sent: June 14, 2022 1:02 PM
> To: Landwerlin, Lionel G 
> Cc: Zeng, Oak ; Intel GFX  g...@lists.freedesktop.org>; Maling list - DRI developers  de...@lists.freedesktop.org>; Hellstrom, Thomas
> ; Wilson, Chris P ;
> Vetter, Daniel ; Christian König
> 
> Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design
> document
> 
> On Tue, Jun 14, 2022 at 10:04:00AM +0300, Lionel Landwerlin wrote:
> >On 13/06/2022 21:02, Niranjana Vishwanathapura wrote:
> >>On Mon, Jun 13, 2022 at 06:33:07AM -0700, Zeng, Oak wrote:
> >>>
> >>>
> >>>Regards,
> >>>Oak
> >>>
> >>>>-Original Message-
> >>>>From: Intel-gfx  On
> >>>>Behalf Of Niranjana
> >>>>Vishwanathapura
> >>>>Sent: June 10, 2022 1:43 PM
> >>>>To: Landwerlin, Lionel G 
> >>>>Cc: Intel GFX ; Maling list -
> >>>>DRI developers  >>>>de...@lists.freedesktop.org>; Hellstrom, Thomas
> >>>>;
> >>>>Wilson, Chris P ; Vetter, Daniel
> >>>>; Christian König
> 
> >>>>Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND
> >>>>feature design
> >>>>document
> >>>>
> >>>>On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
> >>>>>On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
> >>>>>>On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
> >>>>>>>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
> >>>>>>>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
> >>>>>>>>>  On 09/06/2022 00:55, Jason Ekstrand wrote:
> >>>>>>>>>
> >>>>>>>>>    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
> >>>>>>>>>  wrote:
> >>>>>>>>>
> >>>>>>>>>  On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko
> >>>>Ursulin wrote:
> >>>>>>>>>  >
> >>>>>>>>>  >
> >>>>>>>>>  >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
> >>>>>>>>>  >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana
> >>>>>>>>>Vishwanathapura
> >>>>>>>>>  wrote:
> >>>>>>>>>  >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason
> >>>>>>>>>Ekstrand wrote:
> >>>>>>>>>  >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana
> >>>>Vishwanathapura
> >>>>>>>>>  >>>>  wrote:
> >>>>>>>>>  >>>>
> >>>>>>>>>  >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel
> >>>>>>>>>Landwerlin
> >>>>>>>>>  wrote:
> >>>>>>>>>  >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
> >>>>>>>>>  >>>>   >
> >>>>>>>>>  >>>>   > On Thu, Jun 2, 2022 at 3:11 PM Niranjana
> >>>>>>>>>Vishwanathapura
> >>>>>>>>>  >>>>   >  wrote:
> >>>>>>>>>  >>>>   >
> >>>>>>>>>  >>>>   >   On Wed, Jun 01, 2022 at 01:28:36PM
> >>>>-0700, Matthew
> >>>>>>>>>  >>>>Brost wrote:
> >>>>>>>>>  >>>>   >   >On Wed, Jun 01, 2022 at 05:25:49PM
> >>>>+0300, Lionel
> >>>>>>>>>  Landwerlin
> >>>>>>>>>  >>>>   wrote:
> >>>>>>>>>  >>>>   > >> On 17/05/2022 21:32, Niranjana Vishwanathapura
> >>>>>>>>>  wrote:
> >>>>>>>>>  >>>>   > >> > +VM_BIND/UNBIND ioctl will immediately start
> >>>>>>>>>  >>>>   binding/unbinding
> >>>>>>>>>  &

RE: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document

2022-06-13 Thread Zeng, Oak


Regards,
Oak

> -Original Message-
> From: Intel-gfx  On Behalf Of 
> Niranjana
> Vishwanathapura
> Sent: June 10, 2022 1:43 PM
> To: Landwerlin, Lionel G 
> Cc: Intel GFX ; Maling list - DRI developers 
>  de...@lists.freedesktop.org>; Hellstrom, Thomas ;
> Wilson, Chris P ; Vetter, Daniel
> ; Christian König 
> Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design
> document
> 
> On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
> >On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
> >>On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
> >>>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
> On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
> >  On 09/06/2022 00:55, Jason Ekstrand wrote:
> >
> >    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
> >     wrote:
> >
> >  On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
> >  >
> >  >
> >  >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
> >  >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana
> >Vishwanathapura
> >  wrote:
> >  >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason
> >Ekstrand wrote:
> >   On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
> >    wrote:
> >  
> >     On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel
> >Landwerlin
> >  wrote:
> >     >   On 02/06/2022 23:35, Jason Ekstrand wrote:
> >     >
> >     > On Thu, Jun 2, 2022 at 3:11 PM Niranjana
> >Vishwanathapura
> >     >  wrote:
> >     >
> >     >   On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
> >  Brost wrote:
> >     >   >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
> >  Landwerlin
> >     wrote:
> >     >   >> On 17/05/2022 21:32, Niranjana Vishwanathapura
> >  wrote:
> >     >   >> > +VM_BIND/UNBIND ioctl will immediately start
> >     binding/unbinding
> >     >   the mapping in an
> >     >   >> > +async worker. The binding and
> >unbinding will
> >  work like a
> >     special
> >     >   GPU engine.
> >     >   >> > +The binding and unbinding operations are
> >  serialized and
> >     will
> >     >   wait on specified
> >     >   >> > +input fences before the operation
> >and will signal
> >  the
> >     output
> >     >   fences upon the
> >     >   >> > +completion of the operation. Due to
> >  serialization,
> >     completion of
> >     >   an operation
> >     >   >> > +will also indicate that all
> >previous operations
> >  are also
> >     >   complete.
> >     >   >>
> >     >   >> I guess we should avoid saying "will
> >immediately
> >  start
> >     >   binding/unbinding" if
> >     >   >> there are fences involved.
> >     >   >>
> >     >   >> And the fact that it's happening in an async
> >  worker seem to
> >     imply
> >     >   it's not
> >     >   >> immediate.
> >     >   >>
> >     >
> >     >   Ok, will fix.
> >     >   This was added because in earlier design
> >binding was
> >  deferred
> >     until
> >     >   next execbuff.
> >     >   But now it is non-deferred (immediate in
> >that sense).
> >  But yah,
> >     this is
> >     >   confusing
> >     >   and will fix it.
> >     >
> >     >   >>
> >     >   >> I have a question on the behavior of the bind
> >  operation when
> >     no
> >     >   input fence
> >     >   >> is provided. Let say I do :
> >     >   >>
> >     >   >> VM_BIND (out_fence=fence1)
> >     >   >>
> >     >   >> VM_BIND (out_fence=fence2)
> >     >   >>
> >     >   >> VM_BIND (out_fence=fence3)
> >     >   >>
> >     >   >>
> >     >   >> In what order are the fences going to
> >be signaled?
> >     >   >>
> >     >   >> In the order of VM_BIND ioctls? Or out
> >of order?
> >     >   >>
> >     >   >> Because you wrote "serialized I assume
> 

RE: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document

2022-06-06 Thread Zeng, Oak



Regards,
Oak

> -Original Message-
> From: Vishwanathapura, Niranjana 
> Sent: June 2, 2022 4:49 PM
> To: Zeng, Oak 
> Cc: intel-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter,
> Daniel ; Brost, Matthew ;
> Hellstrom, Thomas ; ja...@jlekstrand.net;
> Wilson, Chris P ; christian.koe...@amd.com
> Subject: Re: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
> 
> On Wed, Jun 01, 2022 at 07:13:16PM -0700, Zeng, Oak wrote:
> >
> >
> >Regards,
> >Oak
> >
> >> -Original Message-
> >> From: dri-devel  On Behalf Of
> >> Niranjana Vishwanathapura
> >> Sent: May 17, 2022 2:32 PM
> >> To: intel-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; 
> >> Vetter,
> >> Daniel 
> >> Cc: Brost, Matthew ; Hellstrom, Thomas
> >> ; ja...@jlekstrand.net; Wilson, Chris P
> >> ; christian.koe...@amd.com
> >> Subject: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
> >>
> >> VM_BIND design document with description of intended use cases.
> >>
> >> v2: Add more documentation and format as per review comments
> >> from Daniel.
> >>
> >> Signed-off-by: Niranjana Vishwanathapura
> >> 
> >> ---
> >>  Documentation/driver-api/dma-buf.rst   |   2 +
> >>  Documentation/gpu/rfc/i915_vm_bind.rst | 304
> >> +
> >>  Documentation/gpu/rfc/index.rst|   4 +
> >>  3 files changed, 310 insertions(+)
> >>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
> >>
> >> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-
> >> api/dma-buf.rst
> >> index 36a76cbe9095..64cb924ec5bb 100644
> >> --- a/Documentation/driver-api/dma-buf.rst
> >> +++ b/Documentation/driver-api/dma-buf.rst
> >> @@ -200,6 +200,8 @@ DMA Fence uABI/Sync File
> >>  .. kernel-doc:: include/linux/sync_file.h
> >> :internal:
> >>
> >> +.. _indefinite_dma_fences:
> >> +
> >>  Indefinite DMA Fences
> >>  ~
> >>
> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst
> >> b/Documentation/gpu/rfc/i915_vm_bind.rst
> >> new file mode 100644
> >> index ..f1be560d313c
> >> --- /dev/null
> >> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> >> @@ -0,0 +1,304 @@
> >> +==
> >> +I915 VM_BIND feature design and use cases
> >> +==
> >> +
> >> +VM_BIND feature
> >> +
> >> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM
> >> buffer
> >> +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
> >> +specified address space (VM). These mappings (also referred to as 
> >> persistent
> >> +mappings) will be persistent across multiple GPU submissions (execbuff 
> >> calls)
> >> +issued by the UMD, without user having to provide a list of all required
> >> +mappings during each submission (as required by older execbuff mode).
> >> +
> >> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace
> >> +to specify how the binding/unbinding should sync with other operations
> >> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
> >> +for non-Compute contexts (See struct
> >> drm_i915_vm_bind_ext_timeline_fences).
> >> +For Compute contexts, they will be user/memory fences (See struct
> >> +drm_i915_vm_bind_ext_user_fence).
> >> +
> >> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
> >> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
> >> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND
> >> extension.
> >> +
> >> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the
> mapping in
> >> an
> >> +async worker. The binding and unbinding will work like a special GPU 
> >> engine.
> >> +The binding and unbinding operations are serialized and will wait on 
> >> specified
> >> +input fences before the operation and will signal the output fences upon 
> >> the
> >> +completion of the operation. Due to serialization, completion of an 
> >> operation
> >> +will also indicate that all previous operations are also complete.
> >
> >Hi,
> >
> >Is

RE: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document

2022-06-01 Thread Zeng, Oak



Regards,
Oak

> -Original Message-
> From: dri-devel  On Behalf Of
> Niranjana Vishwanathapura
> Sent: May 17, 2022 2:32 PM
> To: intel-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter,
> Daniel 
> Cc: Brost, Matthew ; Hellstrom, Thomas
> ; ja...@jlekstrand.net; Wilson, Chris P
> ; christian.koe...@amd.com
> Subject: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
> 
> VM_BIND design document with description of intended use cases.
> 
> v2: Add more documentation and format as per review comments
> from Daniel.
> 
> Signed-off-by: Niranjana Vishwanathapura
> 
> ---
>  Documentation/driver-api/dma-buf.rst   |   2 +
>  Documentation/gpu/rfc/i915_vm_bind.rst | 304
> +
>  Documentation/gpu/rfc/index.rst|   4 +
>  3 files changed, 310 insertions(+)
>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
> 
> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-
> api/dma-buf.rst
> index 36a76cbe9095..64cb924ec5bb 100644
> --- a/Documentation/driver-api/dma-buf.rst
> +++ b/Documentation/driver-api/dma-buf.rst
> @@ -200,6 +200,8 @@ DMA Fence uABI/Sync File
>  .. kernel-doc:: include/linux/sync_file.h
> :internal:
> 
> +.. _indefinite_dma_fences:
> +
>  Indefinite DMA Fences
>  ~
> 
> diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst
> b/Documentation/gpu/rfc/i915_vm_bind.rst
> new file mode 100644
> index ..f1be560d313c
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> @@ -0,0 +1,304 @@
> +==
> +I915 VM_BIND feature design and use cases
> +==
> +
> +VM_BIND feature
> +
> +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM
> buffer
> +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
> +specified address space (VM). These mappings (also referred to as persistent
> +mappings) will be persistent across multiple GPU submissions (execbuff calls)
> +issued by the UMD, without user having to provide a list of all required
> +mappings during each submission (as required by older execbuff mode).
> +
> +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace
> +to specify how the binding/unbinding should sync with other operations
> +like the GPU job submission. These fences will be timeline 'drm_syncobj's
> +for non-Compute contexts (See struct
> drm_i915_vm_bind_ext_timeline_fences).
> +For Compute contexts, they will be user/memory fences (See struct
> +drm_i915_vm_bind_ext_user_fence).
> +
> +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
> +User has to opt-in for VM_BIND mode of binding for an address space (VM)
> +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND
> extension.
> +
> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in
> an
> +async worker. The binding and unbinding will work like a special GPU engine.
> +The binding and unbinding operations are serialized and will wait on 
> specified
> +input fences before the operation and will signal the output fences upon the
> +completion of the operation. Due to serialization, completion of an operation
> +will also indicate that all previous operations are also complete.

Hi,

Is the user required to wait for the out fence to be signaled before submitting a
GPU job that uses the vm_bind address? Or is the user required to order the GPU job
so that it only runs after the vm_bind out fence has signaled?

I think the behavior could differ between a non-faultable platform and a faultable
one: on a non-faultable platform the GPU job has to be ordered after the vm_bind out
fence signals, while on a faultable platform there is no such restriction, since the
vm_bind can be completed from the fault handler. Is that right?

Should we document this behavior?
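
To make the question concrete, below is a rough pseudo-C sketch (illustration only,
not the proposed uAPI) of the two orderings I have in mind. vm_bind_async(),
submit_job(), submit_job_with_in_fence() and batch_using() are hypothetical helpers;
only drmSyncobjTimelineWait() is a real libdrm call, and bind_done/bind_point are the
timeline drm_syncobj and point handed to VM_BIND as its out fence.

    uint64_t bind_point = 1;   /* timeline point the bind out fence signals */

    /* Ordering 1: CPU waits for the bind to complete before submitting. */
    vm_bind_async(vm, bo, gpu_va, /*out_fence=*/bind_done, bind_point);
    drmSyncobjTimelineWait(fd, &bind_done, &bind_point, 1, INT64_MAX,
                           DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT, NULL);
    submit_job(vm, batch_using(gpu_va));

    /* Ordering 2: GPU-side ordering only. The job takes the bind out fence
     * as an in fence, so it starts after the bind completes without blocking
     * the CPU. On a faultable platform even this may be unnecessary, if the
     * bind can be resolved from the GPU page-fault handler.
     */
    vm_bind_async(vm, bo, gpu_va, /*out_fence=*/bind_done, bind_point);
    submit_job_with_in_fence(vm, batch_using(gpu_va), bind_done, bind_point);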

Regards,
Oak 


> +
> +VM_BIND features include:
> +
> +* Multiple Virtual Address (VA) mappings can map to the same physical pages
> +  of an object (aliasing).
> +* VA mapping can map to a partial section of the BO (partial binding).
> +* Support capture of persistent mappings in the dump upon GPU error.
> +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
> +  use cases will be helpful.
> +* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
> +* Support for userptr gem objects (no special uapi is required for this).
> +
> +Execbuff ioctl in VM_BIND mode
> +---
> +The execbuff ioctl handling in VM_BIND mode differs significantly from the
> +older method. A VM in VM_BIND mode will not support older execbuff mode of
> +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
> +no support for implicit sync. It is expected that the below work will be able
> +to support requirements of object dependency setting in all use cases:
> +
> +"dma-buf: Add an API for exporting sync files"
> 

RE: [Intel-gfx] [PATCH v4 2/4] drm/i915: Use the vma resource as argument for gtt binding / unbinding

2022-01-05 Thread Zeng, Oak


Regards,
Oak

> -Original Message-
> From: Thomas Hellström 
> Sent: January 5, 2022 8:44 AM
> To: Zeng, Oak ; intel-...@lists.freedesktop.org; 
> dri-devel@lists.freedesktop.org; Bloomfield, Jon
> ; Vetter, Daniel ; Wilson, 
> Chris P 
> Cc: Auld, Matthew 
> Subject: Re: [Intel-gfx] [PATCH v4 2/4] drm/i915: Use the vma resource as 
> argument for gtt binding / unbinding
> 
> 
> On 1/4/22 17:07, Thomas Hellström wrote:
> > Hi, Oak,
> >
> > On 1/4/22 16:35, Zeng, Oak wrote:
> >>
> >> Regards,
> >> Oak
> >>
> >>> -----Original Message-
> >>> From: Thomas Hellström 
> >>> Sent: January 4, 2022 3:29 AM
> >>> To: Zeng, Oak ; intel-...@lists.freedesktop.org;
> >>> dri-devel@lists.freedesktop.org
> >>> Cc: Auld, Matthew 
> >>> Subject: Re: [Intel-gfx] [PATCH v4 2/4] drm/i915: Use the vma
> >>> resource as argument for gtt binding / unbinding
> >>>
> >>> Hi, Oak.
> >>>
> >>> On 1/4/22 00:08, Zeng, Oak wrote:
> >>>> Regards,
> >>>> Oak
> >>> Looks like your emails always start with "Regards, Oak". a
> >>> misconfiguration?
> >> My mail client (outlook) is set to automatically add a regards, when
> >> I compose new mail or reply email. Not a big problem 
> >>
> >>>
> >>>>> -Original Message-
> >>>>> From: Thomas Hellström 
> >>>>> Sent: January 3, 2022 1:58 PM
> >>>>> To: Zeng, Oak ;
> >>>>> intel-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org
> >>>>> Cc: Auld, Matthew 
> >>>>> Subject: Re: [Intel-gfx] [PATCH v4 2/4] drm/i915: Use the vma
> >>>>> resource as argument for gtt binding / unbinding
> >>>>>
> >>>>> Hi, Oak.
> >>>>>
> >>>>> On 1/3/22 19:17, Zeng, Oak wrote:
> >>>>>> Regards,
> >>>>>> Oak
> >>>>>>
> >>>>>>> -Original Message-
> >>>>>>> From: Intel-gfx  On
> >>>>>>> Behalf Of Thomas Hellström
> >>>>>>> Sent: January 3, 2022 7:00 AM
> >>>>>>> To: intel-...@lists.freedesktop.org;
> >>>>>>> dri-devel@lists.freedesktop.org
> >>>>>>> Cc: Thomas Hellström ; Auld,
> >>>>>>> Matthew 
> >>>>>>> Subject: [Intel-gfx] [PATCH v4 2/4] drm/i915: Use the vma
> >>>>>>> resource as argument for gtt binding / unbinding
> >>>>>>>
> >>>>>>> When introducing asynchronous unbinding, the vma itself may no
> >>>>>>> longer
> >>>>>>> be alive when the actual binding or unbinding takes place.
> >>>>>> Can we take an extra reference counter of the vma to keep the vma
> >>>>>> alive, until the actual binding/unbinding takes place?
> >>>>> The point here is that that's not needed, and should be avoided.
> >>>> Can you explain more why "keeping vma alive until unbinding takes
> >>>> place" should be avoided?
> >>>>
> >>>> As I understand it, your series introduce asynchronized unbinding.
> >>>> But since vma might be no longer alive at the time of unbinding.
> >>> To overcome this difficulty, you introduce a vma resource structure
> >>> and you guarantee vma resource is alive at bind/unbind time. So
> >>> you can use vma resource for the bind/unbind operation. My question
> >>> is, can we achieve the asynchronized unbinding still using vma
> >>> structure by keeping vma structure alive ( by ref count it). This
> >>> way the change should be much smaller (compared to this series). Why
> >>> it is harmful to keep the vma alive? Maybe you have other reasons to
> >>> introduce vma resource that I don't see.
> >>>
> >>> When we allow asynchronous unbinding, it's allowed to immediately
> >>> rebind
> >>> the vma, possibly into the same gpu virtual address, but with different
> >>> pages. And when doing that we don't want to block waiting for the
> >>> unbind
> >>> to execute.
> >> Imagine this sequence:
> >>
> >> 1. Virtual address a1 is bound to physical page p1
> >> 2. Unbind a1 from p1, asynchronous

RE: [Intel-gfx] [PATCH v4 2/4] drm/i915: Use the vma resource as argument for gtt binding / unbinding

2022-01-04 Thread Zeng, Oak


Regards,
Oak

> -Original Message-
> From: Thomas Hellström 
> Sent: January 4, 2022 3:29 AM
> To: Zeng, Oak ; intel-...@lists.freedesktop.org; 
> dri-devel@lists.freedesktop.org
> Cc: Auld, Matthew 
> Subject: Re: [Intel-gfx] [PATCH v4 2/4] drm/i915: Use the vma resource as 
> argument for gtt binding / unbinding
> 
> Hi, Oak.
> 
> On 1/4/22 00:08, Zeng, Oak wrote:
> >
> > Regards,
> > Oak
> 
> Looks like your emails always start with "Regards, Oak". a misconfiguration?

My mail client (Outlook) is set to automatically add a "Regards" when I compose a
new mail or reply to an email. Not a big problem.

> 
> 
> >> -Original Message-----
> >> From: Thomas Hellström 
> >> Sent: January 3, 2022 1:58 PM
> >> To: Zeng, Oak ; intel-...@lists.freedesktop.org; 
> >> dri-devel@lists.freedesktop.org
> >> Cc: Auld, Matthew 
> >> Subject: Re: [Intel-gfx] [PATCH v4 2/4] drm/i915: Use the vma resource as 
> >> argument for gtt binding / unbinding
> >>
> >> Hi, Oak.
> >>
> >> On 1/3/22 19:17, Zeng, Oak wrote:
> >>> Regards,
> >>> Oak
> >>>
> >>>> -Original Message-
> >>>> From: Intel-gfx  On Behalf Of 
> >>>> Thomas Hellström
> >>>> Sent: January 3, 2022 7:00 AM
> >>>> To: intel-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org
> >>>> Cc: Thomas Hellström ; Auld, Matthew 
> >>>> 
> >>>> Subject: [Intel-gfx] [PATCH v4 2/4] drm/i915: Use the vma resource as 
> >>>> argument for gtt binding / unbinding
> >>>>
> >>>> When introducing asynchronous unbinding, the vma itself may no longer
> >>>> be alive when the actual binding or unbinding takes place.
> >>> Can we take an extra reference counter of the vma to keep the vma alive, 
> >>> until the actual binding/unbinding takes place?
> >> The point here is that that's not needed, and should be avoided.
> > Can you explain more why "keeping vma alive until unbinding takes place" 
> > should be avoided?
> >
> > As I understand it, your series introduce asynchronized unbinding. But 
> > since vma might be no longer alive at the time of unbinding.
> To overcome this difficulty, you introduce a vma resource structure and you 
> guarantee vma resource is alive at bind/unbind time. So
> you can use vma resource for the bind/unbind operation. My question is, can 
> we achieve the asynchronized unbinding still using vma
> structure by keeping vma structure alive ( by ref count it). This way the 
> change should be much smaller (compared to this series). Why
> it is harmful to keep the vma alive? Maybe you have other reasons to 
> introduce vma resource that I don't see.
> 
> When we allow asynchronous unbinding, it's allowed to immediately rebind
> the vma, possibly into the same gpu virtual address, but with different
> pages. And when doing that we don't want to block waiting for the unbind
> to execute. 

Imagine this sequence:

1. Virtual address a1 is bound to physical page p1.
2. a1 is unbound from p1 asynchronously, so the actual unbind has not happened yet.
3. a1 is bound to physical page p2; the page table is changed, so a1 now points to
   p2 in the page table.
4. The unbind from #2 finally executes. Since a1 points to p2 in the page table at
   this point, does a1 end up pointing to the scratch page after #4?

In fact, we could allow a large number of outstanding binds
> and unbinds for a vma, which makes the vma structure unsuitable to track
> this, since there will no longer be a single mapping between a set of
> active pages and a vma, or a virtual gpu range and a vma.

I agree that if the pages - vma - virtual GPU range relationship is not a 1:1:1
mapping, we need to introduce a finer-grained vma resource to track the non-1:1
mapping; my rough mental model of what such a per-bind resource has to carry is
sketched below. I also understand that asynchronous unbinding utilizes the virtual
address space more effectively. But my feeling is that this non-1:1 mapping makes the
code harder to understand and maintain. Since the non-1:1 mapping is introduced by
asynchronous binding/unbinding, maybe the real question here is: is it really
beneficial to introduce asynchronous unbinding?
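
A plain-C sketch of that mental model (illustration only, not the actual i915
definitions from this series):

    #include <stdint.h>

    /* Each bind/unbind gets its own resource object, so the GPU VA range,
     * the pages that were bound and the fence for the asynchronous unbind
     * outlive the vma itself, and several such objects can exist per vma.
     */
    struct example_bind_resource {
            uint64_t start;          /* GPU VA covered by this binding */
            uint64_t size;
            void    *pages;          /* the pages this binding maps */
            void    *unbind_fence;   /* signals once the async unbind ran */
            uint32_t bound_flags;    /* updated only under the vm mutex */
    };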

I am still not familiar enough with the code. I suggest other experts take a look as
well. @Bloomfield, Jon @Vetter, Daniel @Wilson, Chris P.

Regards,
Oak
> 
> Thanks,
> 
> /Thomas
> 
> >
> > Regards,
> > Oak
> >
> >   If the
> >> vma is no longer alive, that means nobody uses it anymore, but the GPU
> >> may still have work in the pipe that references the GPU virtual address.
> >>
> >> /Thomas.
> >>


RE: [Intel-gfx] [PATCH v4 2/4] drm/i915: Use the vma resource as argument for gtt binding / unbinding

2022-01-03 Thread Zeng, Oak


Regards,
Oak

> -Original Message-
> From: Thomas Hellström 
> Sent: January 3, 2022 1:58 PM
> To: Zeng, Oak ; intel-...@lists.freedesktop.org; 
> dri-devel@lists.freedesktop.org
> Cc: Auld, Matthew 
> Subject: Re: [Intel-gfx] [PATCH v4 2/4] drm/i915: Use the vma resource as 
> argument for gtt binding / unbinding
> 
> Hi, Oak.
> 
> On 1/3/22 19:17, Zeng, Oak wrote:
> >
> > Regards,
> > Oak
> >
> >> -Original Message-
> >> From: Intel-gfx  On Behalf Of 
> >> Thomas Hellström
> >> Sent: January 3, 2022 7:00 AM
> >> To: intel-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org
> >> Cc: Thomas Hellström ; Auld, Matthew 
> >> 
> >> Subject: [Intel-gfx] [PATCH v4 2/4] drm/i915: Use the vma resource as 
> >> argument for gtt binding / unbinding
> >>
> >> When introducing asynchronous unbinding, the vma itself may no longer
> >> be alive when the actual binding or unbinding takes place.
> > Can we take an extra reference counter of the vma to keep the vma alive, 
> > until the actual binding/unbinding takes place?
> 
> The point here is that that's not needed, and should be avoided.

Can you explain in more detail why "keeping the vma alive until unbinding takes
place" should be avoided?

As I understand it, your series introduces asynchronous unbinding, but the vma might
no longer be alive at the time of the unbind. To overcome this difficulty, you
introduce a vma resource structure and guarantee that the vma resource is alive at
bind/unbind time, so you can use the vma resource for the bind/unbind operation. My
question is: can we still achieve asynchronous unbinding using the vma structure, by
keeping the vma structure alive (by ref-counting it)? That way the change should be
much smaller (compared to this series). Why is it harmful to keep the vma alive?
Maybe you have other reasons to introduce the vma resource that I don't see.

Regards,
Oak

 If the
> vma is no longer alive, that means nobody uses it anymore, but the GPU
> may still have work in the pipe that references the GPU virtual address.
> 
> /Thomas.
> 



RE: [Intel-gfx] [PATCH v4 2/4] drm/i915: Use the vma resource as argument for gtt binding / unbinding

2022-01-03 Thread Zeng, Oak


Regards,
Oak

> -Original Message-
> From: Intel-gfx  On Behalf Of Thomas 
> Hellström
> Sent: January 3, 2022 7:00 AM
> To: intel-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org
> Cc: Thomas Hellström ; Auld, Matthew 
> 
> Subject: [Intel-gfx] [PATCH v4 2/4] drm/i915: Use the vma resource as 
> argument for gtt binding / unbinding
> 
> When introducing asynchronous unbinding, the vma itself may no longer
> be alive when the actual binding or unbinding takes place.

Can we take an extra reference on the vma to keep it alive until the actual
binding/unbinding takes place?

Regards,
Oak

> 
> Update the gtt i915_vma_ops accordingly to take a struct i915_vma_resource
> instead of a struct i915_vma for the bind_vma() and unbind_vma() ops.
> Similarly change the insert_entries() op for struct i915_address_space.
> 
> Replace a couple of i915_vma_snapshot members with their newly introduced
> i915_vma_resource counterparts, since they have the same lifetime.
> 
> Also make sure to avoid changing the struct i915_vma_flags (in particular
> the bind flags) async. That should now only be done sync under the
> vm mutex.
> 
> v2:
> - Update the vma_res::bound_flags when binding to the aliased ggtt
> 
> Signed-off-by: Thomas Hellström 
> ---
>  drivers/gpu/drm/i915/display/intel_dpt.c  | 27 ++---
>  .../gpu/drm/i915/gem/i915_gem_object_types.h  | 27 +
>  .../gpu/drm/i915/gem/selftests/huge_pages.c   | 37 +++
>  drivers/gpu/drm/i915/gt/gen6_ppgtt.c  | 19 ++--
>  drivers/gpu/drm/i915/gt/gen8_ppgtt.c  | 37 +++
>  drivers/gpu/drm/i915/gt/intel_engine_cs.c |  4 +-
>  drivers/gpu/drm/i915/gt/intel_ggtt.c  | 70 ++---
>  drivers/gpu/drm/i915/gt/intel_gtt.h   | 16 +--
>  drivers/gpu/drm/i915/gt/intel_ppgtt.c | 22 +++--
>  drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c  | 13 ++-
>  drivers/gpu/drm/i915/gt/uc/intel_uc_fw.h  |  2 +-
>  drivers/gpu/drm/i915/i915_debugfs.c   |  3 +-
>  drivers/gpu/drm/i915/i915_gpu_error.c |  6 +-
>  drivers/gpu/drm/i915/i915_vma.c   | 24 -
>  drivers/gpu/drm/i915/i915_vma.h   | 11 +--
>  drivers/gpu/drm/i915/i915_vma_resource.c  |  9 +-
>  drivers/gpu/drm/i915/i915_vma_resource.h  | 99 ++-
>  drivers/gpu/drm/i915/i915_vma_snapshot.c  |  4 -
>  drivers/gpu/drm/i915/i915_vma_snapshot.h  |  8 --
>  drivers/gpu/drm/i915/selftests/i915_gem_gtt.c | 64 
>  drivers/gpu/drm/i915/selftests/mock_gtt.c | 12 +--
>  21 files changed, 308 insertions(+), 206 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/display/intel_dpt.c 
> b/drivers/gpu/drm/i915/display/intel_dpt.c
> index 8f674745e7e0..63a83d5f85a1 100644
> --- a/drivers/gpu/drm/i915/display/intel_dpt.c
> +++ b/drivers/gpu/drm/i915/display/intel_dpt.c
> @@ -48,7 +48,7 @@ static void dpt_insert_page(struct i915_address_space *vm,
>  }
> 
>  static void dpt_insert_entries(struct i915_address_space *vm,
> -struct i915_vma *vma,
> +struct i915_vma_resource *vma_res,
>  enum i915_cache_level level,
>  u32 flags)
>  {
> @@ -64,8 +64,8 @@ static void dpt_insert_entries(struct i915_address_space 
> *vm,
>* not to allow the user to override access to a read only page.
>*/
> 
> - i = vma->node.start / I915_GTT_PAGE_SIZE;
> - for_each_sgt_daddr(addr, sgt_iter, vma->pages)
> + i = vma_res->start / I915_GTT_PAGE_SIZE;
> + for_each_sgt_daddr(addr, sgt_iter, vma_res->bi.pages)
>   gen8_set_pte(&base[i++], pte_encode | addr);
>  }
> 
> @@ -76,35 +76,38 @@ static void dpt_clear_range(struct i915_address_space *vm,
> 
>  static void dpt_bind_vma(struct i915_address_space *vm,
>struct i915_vm_pt_stash *stash,
> -  struct i915_vma *vma,
> +  struct i915_vma_resource *vma_res,
>enum i915_cache_level cache_level,
>u32 flags)
>  {
> - struct drm_i915_gem_object *obj = vma->obj;
>   u32 pte_flags;
> 
> + if (vma_res->bound_flags)
> + return;
> +
>   /* Applicable to VLV (gen8+ do not support RO in the GGTT) */
>   pte_flags = 0;
> - if (vma->vm->has_read_only && i915_gem_object_is_readonly(obj))
> + if (vm->has_read_only && vma_res->bi.readonly)
>   pte_flags |= PTE_READ_ONLY;
> - if (i915_gem_object_is_lmem(obj))
> + if (vma_res->bi.lmem)
>   pte_flags |= PTE_LM;
> 
> - vma->vm->insert_entries(vma->vm, vma, cache_level, pte_flags);
> + vm->insert_entries(vm, vma_res, cache_level, pte_flags);
> 
> - vma->page_sizes.gtt = I915_GTT_PAGE_SIZE;
> + vma_res->page_sizes_gtt = I915_GTT_PAGE_SIZE;
> 
>   /*
>* Without aliasing PPGTT there's no difference between
>* GLOBAL/LOCAL_BIND, it's all the same ptes. Hence 

RE: [Intel-gfx] [PATCH v5 09/13] drm/i915/ttm: add tt shmem backend

2021-10-05 Thread Zeng, Oak
Thanks for the explanation. This patch is Acked-by: Oak Zeng 

Regards,
Oak

> -Original Message-
> From: Auld, Matthew 
> Sent: October 5, 2021 1:07 PM
> To: Zeng, Oak ; Thomas Hellström
> ; intel-...@lists.freedesktop.org
> Cc: dri-devel@lists.freedesktop.org; Christian König
> 
> Subject: Re: [Intel-gfx] [PATCH v5 09/13] drm/i915/ttm: add tt shmem
> backend
> 
> On 05/10/2021 15:23, Zeng, Oak wrote:
> >
> >
> > Regards,
> > Oak
> >
> >> -Original Message-
> >> From: Thomas Hellström 
> >> Sent: October 5, 2021 9:48 AM
> >> To: Zeng, Oak ; Auld, Matthew
> >> ; intel-...@lists.freedesktop.org
> >> Cc: dri-devel@lists.freedesktop.org; Christian König
> >> 
> >> Subject: Re: [Intel-gfx] [PATCH v5 09/13] drm/i915/ttm: add tt shmem
> >> backend
> >>
> >>
> >> On 10/5/21 04:05, Zeng, Oak wrote:
> >>> Hi Matthew/Thomas,
> >>>
> >>> See one question inline
> >>>
> >>> Regards,
> >>> Oak
> >>>
> >>> -Original Message-
> >>> From: Intel-gfx  On Behalf Of
> >> Matthew Auld
> >>> Sent: September 27, 2021 7:41 AM
> >>> To: intel-...@lists.freedesktop.org
> >>> Cc: dri-devel@lists.freedesktop.org; Thomas Hellström
> >> ; Christian König
> >> 
> >>> Subject: [Intel-gfx] [PATCH v5 09/13] drm/i915/ttm: add tt shmem
> backend
> >>>
> >>> For cached objects we can allocate our pages directly in shmem. This
> should
> >> make it possible(in a later patch) to utilise the existing i915-gem 
> >> shrinker
> code
> >> for such objects. For now this is still disabled.
> >>>
> >>> v2(Thomas):
> >>> - Add optional try_to_writeback hook for objects. Importantly we need
> >>>   to check if the object is even still shrinkable; in between us
> >>>   dropping the shrinker LRU lock and acquiring the object lock it 
> >>> could for
> >>>   example have been moved. Also we need to differentiate between
> >>>   "lazy" shrinking and the immediate writeback mode. Also later we
> need
> >> to
> >>>   handle objects which don't even have mm.pages, so bundling this into
> >>>   put_pages() would require somehow handling that edge case, hence
> >>>   just letting the ttm backend handle everything in try_to_writeback
> >>>   doesn't seem too bad.
> >>> v3(Thomas):
> >>> - Likely a bad idea to touch the object from the unpopulate hook,
> >>>   since it's not possible to hold a reference, without also creating
> >>>   circular dependency, so likely this is too fragile. For now just
> >>>   ensure we at least mark the pages as dirty/accessed when called from
> the
> >>>   shrinker on WILLNEED objects.
> >>> - s/try_to_writeback/shrinker_release_pages, since this can do more
> >>>   than just writeback.
> >>> - Get rid of do_backup boolean and just set the SWAPPED flag prior to
> >>>   calling unpopulate.
> >>> - Keep shmem_tt as lowest priority for the TTM LRU bo_swapout walk,
> >> since
> >>>   these just get skipped anyway. We can try to come up with something
> >>>   better later.
> >>>
> >>> Signed-off-by: Matthew Auld 
> >>> Cc: Thomas Hellström 
> >>> Cc: Christian König 
> >>> ---
> >>>drivers/gpu/drm/i915/gem/i915_gem_object.h|   8 +
> >>>.../gpu/drm/i915/gem/i915_gem_object_types.h  |   2 +
> >>>drivers/gpu/drm/i915/gem/i915_gem_shmem.c |  14 +-
> >>>drivers/gpu/drm/i915/gem/i915_gem_shrinker.c  |  17 +-
> >>>drivers/gpu/drm/i915/gem/i915_gem_ttm.c   | 240
> -
> >> -
> >>>5 files changed, 245 insertions(+), 36 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.h
> >> b/drivers/gpu/drm/i915/gem/i915_gem_object.h
> >>> index 3043fcbd31bd..1c9a1d8d3434 100644
> >>> --- a/drivers/gpu/drm/i915/gem/i915_gem_object.h
> >>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_object.h
> >>> @@ -601,6 +601,14 @@ int i915_gem_object_wait_migration(struct
> >> drm_i915_gem_object *obj,  bool
> >> i915_gem_object_placement_possible(struct drm_i915_gem_object *obj,
> >>> 

RE: [Intel-gfx] [PATCH v5 09/13] drm/i915/ttm: add tt shmem backend

2021-10-05 Thread Zeng, Oak


Regards,
Oak

> -Original Message-
> From: Thomas Hellström 
> Sent: October 5, 2021 9:48 AM
> To: Zeng, Oak ; Auld, Matthew
> ; intel-...@lists.freedesktop.org
> Cc: dri-devel@lists.freedesktop.org; Christian König
> 
> Subject: Re: [Intel-gfx] [PATCH v5 09/13] drm/i915/ttm: add tt shmem
> backend
> 
> 
> On 10/5/21 04:05, Zeng, Oak wrote:
> > Hi Matthew/Thomas,
> >
> > See one question inline
> >
> > Regards,
> > Oak
> >
> > -Original Message-
> > From: Intel-gfx  On Behalf Of
> Matthew Auld
> > Sent: September 27, 2021 7:41 AM
> > To: intel-...@lists.freedesktop.org
> > Cc: dri-devel@lists.freedesktop.org; Thomas Hellström
> ; Christian König
> 
> > Subject: [Intel-gfx] [PATCH v5 09/13] drm/i915/ttm: add tt shmem backend
> >
> > For cached objects we can allocate our pages directly in shmem. This should
> make it possible(in a later patch) to utilise the existing i915-gem shrinker 
> code
> for such objects. For now this is still disabled.
> >
> > v2(Thomas):
> >- Add optional try_to_writeback hook for objects. Importantly we need
> >  to check if the object is even still shrinkable; in between us
> >  dropping the shrinker LRU lock and acquiring the object lock it could 
> > for
> >  example have been moved. Also we need to differentiate between
> >  "lazy" shrinking and the immediate writeback mode. Also later we need
> to
> >  handle objects which don't even have mm.pages, so bundling this into
> >  put_pages() would require somehow handling that edge case, hence
> >  just letting the ttm backend handle everything in try_to_writeback
> >  doesn't seem too bad.
> > v3(Thomas):
> >- Likely a bad idea to touch the object from the unpopulate hook,
> >  since it's not possible to hold a reference, without also creating
> >  circular dependency, so likely this is too fragile. For now just
> >  ensure we at least mark the pages as dirty/accessed when called from 
> > the
> >  shrinker on WILLNEED objects.
> >- s/try_to_writeback/shrinker_release_pages, since this can do more
> >  than just writeback.
> >- Get rid of do_backup boolean and just set the SWAPPED flag prior to
> >  calling unpopulate.
> >- Keep shmem_tt as lowest priority for the TTM LRU bo_swapout walk,
> since
> >  these just get skipped anyway. We can try to come up with something
> >  better later.
> >
> > Signed-off-by: Matthew Auld 
> > Cc: Thomas Hellström 
> > Cc: Christian König 
> > ---
> >   drivers/gpu/drm/i915/gem/i915_gem_object.h|   8 +
> >   .../gpu/drm/i915/gem/i915_gem_object_types.h  |   2 +
> >   drivers/gpu/drm/i915/gem/i915_gem_shmem.c |  14 +-
> >   drivers/gpu/drm/i915/gem/i915_gem_shrinker.c  |  17 +-
> >   drivers/gpu/drm/i915/gem/i915_gem_ttm.c   | 240 -
> -
> >   5 files changed, 245 insertions(+), 36 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.h
> b/drivers/gpu/drm/i915/gem/i915_gem_object.h
> > index 3043fcbd31bd..1c9a1d8d3434 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_object.h
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_object.h
> > @@ -601,6 +601,14 @@ int i915_gem_object_wait_migration(struct
> drm_i915_gem_object *obj,  bool
> i915_gem_object_placement_possible(struct drm_i915_gem_object *obj,
> > enum intel_memory_type type);
> >
> > +struct sg_table *shmem_alloc_st(struct drm_i915_private *i915,
> > +   size_t size, struct intel_memory_region *mr,
> > +   struct address_space *mapping,
> > +   unsigned int max_segment);
> > +void shmem_free_st(struct sg_table *st, struct address_space *mapping,
> > +  bool dirty, bool backup);
> > +void __shmem_writeback(size_t size, struct address_space *mapping);
> > +
> >   #ifdef CONFIG_MMU_NOTIFIER
> >   static inline bool
> >   i915_gem_object_is_userptr(struct drm_i915_gem_object *obj) diff --git
> a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
> b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
> > index fa2ba9e2a4d0..f0fb17be2f7a 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
> > @@ -56,6 +56,8 @@ struct drm_i915_gem_object_ops {
> >   struct sg_table *pages);
> > void (*truncate)(struct drm_i915_gem_object *obj);
> > void (*wri

RE: [Intel-gfx] [PATCH v5 09/13] drm/i915/ttm: add tt shmem backend

2021-10-04 Thread Zeng, Oak
Hi Matthew/Thomas,

See one question inline

Regards,
Oak

-Original Message-
From: Intel-gfx  On Behalf Of Matthew 
Auld
Sent: September 27, 2021 7:41 AM
To: intel-...@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org; Thomas Hellström 
; Christian König 
Subject: [Intel-gfx] [PATCH v5 09/13] drm/i915/ttm: add tt shmem backend

For cached objects we can allocate our pages directly in shmem. This should 
make it possible(in a later patch) to utilise the existing i915-gem shrinker 
code for such objects. For now this is still disabled.

v2(Thomas):
  - Add optional try_to_writeback hook for objects. Importantly we need
to check if the object is even still shrinkable; in between us
dropping the shrinker LRU lock and acquiring the object lock it could for
example have been moved. Also we need to differentiate between
"lazy" shrinking and the immediate writeback mode. Also later we need to
handle objects which don't even have mm.pages, so bundling this into
put_pages() would require somehow handling that edge case, hence
just letting the ttm backend handle everything in try_to_writeback
doesn't seem too bad.
v3(Thomas):
  - Likely a bad idea to touch the object from the unpopulate hook,
since it's not possible to hold a reference, without also creating
circular dependency, so likely this is too fragile. For now just
ensure we at least mark the pages as dirty/accessed when called from the
shrinker on WILLNEED objects.
  - s/try_to_writeback/shrinker_release_pages, since this can do more
than just writeback.
  - Get rid of do_backup boolean and just set the SWAPPED flag prior to
calling unpopulate.
  - Keep shmem_tt as lowest priority for the TTM LRU bo_swapout walk, since
these just get skipped anyway. We can try to come up with something
better later.

Signed-off-by: Matthew Auld 
Cc: Thomas Hellström 
Cc: Christian König 
---
 drivers/gpu/drm/i915/gem/i915_gem_object.h|   8 +
 .../gpu/drm/i915/gem/i915_gem_object_types.h  |   2 +
 drivers/gpu/drm/i915/gem/i915_gem_shmem.c |  14 +-
 drivers/gpu/drm/i915/gem/i915_gem_shrinker.c  |  17 +-
 drivers/gpu/drm/i915/gem/i915_gem_ttm.c   | 240 --
 5 files changed, 245 insertions(+), 36 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.h 
b/drivers/gpu/drm/i915/gem/i915_gem_object.h
index 3043fcbd31bd..1c9a1d8d3434 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object.h
@@ -601,6 +601,14 @@ int i915_gem_object_wait_migration(struct 
drm_i915_gem_object *obj,  bool i915_gem_object_placement_possible(struct 
drm_i915_gem_object *obj,
enum intel_memory_type type);
 
+struct sg_table *shmem_alloc_st(struct drm_i915_private *i915,
+   size_t size, struct intel_memory_region *mr,
+   struct address_space *mapping,
+   unsigned int max_segment);
+void shmem_free_st(struct sg_table *st, struct address_space *mapping,
+  bool dirty, bool backup);
+void __shmem_writeback(size_t size, struct address_space *mapping);
+
 #ifdef CONFIG_MMU_NOTIFIER
 static inline bool
 i915_gem_object_is_userptr(struct drm_i915_gem_object *obj) diff --git 
a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h 
b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
index fa2ba9e2a4d0..f0fb17be2f7a 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
@@ -56,6 +56,8 @@ struct drm_i915_gem_object_ops {
  struct sg_table *pages);
void (*truncate)(struct drm_i915_gem_object *obj);
void (*writeback)(struct drm_i915_gem_object *obj);
+   int (*shrinker_release_pages)(struct drm_i915_gem_object *obj,
+ bool should_writeback);
 
int (*pread)(struct drm_i915_gem_object *obj,
 const struct drm_i915_gem_pread *arg); diff --git 
a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c 
b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index 36b711ae9e28..19e55cc29a15 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -25,8 +25,8 @@ static void check_release_pagevec(struct pagevec *pvec)
cond_resched();
 }
 
-static void shmem_free_st(struct sg_table *st, struct address_space *mapping,
- bool dirty, bool backup)
+void shmem_free_st(struct sg_table *st, struct address_space *mapping,
+  bool dirty, bool backup)
 {
struct sgt_iter sgt_iter;
struct pagevec pvec;
@@ -52,10 +52,10 @@ static void shmem_free_st(struct sg_table *st, struct 
address_space *mapping,
kfree(st);
 }
 
-static struct sg_table *shmem_alloc_st(struct drm_i915_private *i915,
-  size_t size, struct intel_memory_region 
*mr,
-

Re: [PATCH v6 05/13] drm/amdkfd: generic type as sys mem on migration to ram

2021-08-16 Thread Zeng, Oak


Regards,
Oak 

 

On 2021-08-16, 3:53 PM, "amd-gfx on behalf of Sierra Guiza, Alejandro (Alex)" 
 wrote:


On 8/15/2021 10:38 AM, Christoph Hellwig wrote:
> On Fri, Aug 13, 2021 at 01:31:42AM -0500, Alex Sierra wrote:
>>  migrate.vma = vma;
>>  migrate.start = start;
>>  migrate.end = end;
>> -migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
>>  migrate.pgmap_owner = SVM_ADEV_PGMAP_OWNER(adev);
>>   
>> +if (adev->gmc.xgmi.connected_to_cpu)
>> +migrate.flags = MIGRATE_VMA_SELECT_SYSTEM;
>> +else
>> +migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
> It's been a while since I touched this migrate code, but doesn't this
> mean that if the range already contains system memory the migration
> now won't do anything? for the connected_to_cpu case?

For above’s condition equal to connected_to_cpu , we’re explicitly 
migrating from
device memory to system memory with device generic type. 

For the MEMORY_DEVICE_GENERIC memory type, why do we need to explicitly migrate it
from device memory to normal system memory? I thought the design was that, for this
type of memory, the CPU can access it in place without migration (just like the CPU
accesses normal system memory), so there is no need to migrate such memory to normal
system memory.

With this patch, the migration behavior will be: when the memory is accessed by the
CPU, it is migrated to normal system memory; when it is accessed by the GPU, it is
migrated to device VRAM. This is basically the same behavior as when VRAM is treated
as DEVICE_PRIVATE.

I thought the whole goal of introducing DEVICE_GENERIC was to avoid such
back-and-forth migration between device memory and normal system memory. But maybe I
am missing something here.

Regards,
Oak

In this type, 
device PTEs are
present in CPU page table.

During migrate_vma_collect_pmd walk op at migrate_vma_setup call, 
there’s a condition
for present pte that require migrate->flags be set for 
MIGRATE_VMA_SELECT_SYSTEM.
Otherwise, the migration for this entry will be ignored.

Regards,
Alex S.




Re: [PATCH v4 06/13] include/linux/mm.h: helpers to check zone device generic type

2021-07-19 Thread Zeng, Oak


Regards,
Oak 

 

On 2021-07-17, 3:22 PM, "amd-gfx on behalf of Alex Sierra" 
 wrote:

Two helpers added. One checks if zone device page is generic
type. The other if page is either private or generic type.

Signed-off-by: Alex Sierra 
---
 include/linux/mm.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d8d79bb94be8..f5b247a63044 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1125,6 +1125,14 @@ static inline bool is_device_private_page(const 
struct page *page)
page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }

+static inline bool is_device_page(const struct page *page)

The function name here is confusing. Theoretically, as long as a page's zone is
ZONE_DEVICE, the page is a device page. You made the condition below stricter only
because the KFD SVM implementation uses just MEMORY_DEVICE_PRIVATE/GENERIC, but
MEMORY_DEVICE_FS_DAX and MEMORY_DEVICE_PCI_P2PDMA are also device memory (compared
to normal CPU system memory). A sketch of a more precise name is included after the
quoted patch below.

+{
+   return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+   is_zone_device_page(page) &&
+   (page->pgmap->type == MEMORY_DEVICE_PRIVATE ||
+page->pgmap->type == MEMORY_DEVICE_GENERIC);
+}
+
 static inline bool is_pci_p2pdma_page(const struct page *page)
 {
return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
-- 
2.32.0
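
For reference, here is the stricter naming I have in mind, as a sketch only (the same
check as the helper in the patch above, not a concrete replacement proposal):

static inline bool is_device_private_or_generic_page(const struct page *page)
{
	/* The name spells out that only the two ZONE_DEVICE types used by
	 * KFD SVM are covered; FS_DAX and PCI_P2PDMA pages are also device
	 * pages but are deliberately excluded here.
	 */
	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
	       is_zone_device_page(page) &&
	       (page->pgmap->type == MEMORY_DEVICE_PRIVATE ||
		page->pgmap->type == MEMORY_DEVICE_GENERIC);
}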

___
amd-gfx mailing list
amd-...@lists.freedesktop.org

https://lists.freedesktop.org/mailman/listinfo/amd-gfx



Re: [PATCH v2 00/10] Implement multi-GPU DMA mappings for KFD

2021-04-27 Thread Zeng, Oak
This series is Acked-by: Oak Zeng  

Regards,
Oak 

 

On 2021-04-21, 9:31 PM, "dri-devel on behalf of Felix Kuehling" 
 
wrote:

This patch series fixes DMA-mappings of system memory (GTT and userptr)
for KFD running on multi-GPU systems with IOMMU enabled. One SG-BO per
GPU is needed to maintain the DMA mappings of each BO.

Changes in v2:
- Made the original BO parent of the SG BO to fix bo destruction order
- Removed individualiation hack that is, not needed with parent BO
- Removed resv locking hace in amdgpu_ttm_unpopulate, not needed without
  the individualization hack
- Added a patch to enable the Intel IOMMU driver in rock-dbg_defconfig
- Added a patch to move dmabuf attach/detach into backend_(un)bind

I'm still seeing some IOMMU access faults in the eviction test. They seem
to be related to userptr handling. They happen even without this patch
series on a single-GPU system, where this patch series is not needed. I
believe this is an old problem in KFD or amdgpu that is being exposed by
device isolation from the IOMMU. I'm debugging it, but it should not hold
up this patch series.

"drm/ttm: Don't count pages in SG BOs against pages_limit" was already
applied to drm-misc (I think). I'm still including it here because my
patches depend on it. Without that, the SG BOs created for DMA mappings
cause many tests fail because TTM incorrectly thinks it's out of memory.

Felix Kuehling (10):
  rock-dbg_defconfig: Enable Intel IOMMU
  drm/amdgpu: Rename kfd_bo_va_list to kfd_mem_attachment
  drm/amdgpu: Keep a bo-reference per-attachment
  drm/amdgpu: Simplify AQL queue mapping
  drm/amdgpu: Add multi-GPU DMA mapping helpers
  drm/amdgpu: DMA map/unmap when updating GPU mappings
  drm/amdgpu: Move kfd_mem_attach outside reservation
  drm/amdgpu: Add DMA mapping of GTT BOs
  drm/ttm: Don't count pages in SG BOs against pages_limit
  drm/amdgpu: Move dmabuf attach/detach to backend_(un)bind

 arch/x86/configs/rock-dbg_defconfig   |  11 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|  18 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 530 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c   |  51 +-
 drivers/gpu/drm/ttm/ttm_tt.c  |  27 +-
 5 files changed, 437 insertions(+), 200 deletions(-)

-- 
2.31.1

___
dri-devel mailing list
dri-devel@lists.freedesktop.org

https://lists.freedesktop.org/mailman/listinfo/dri-devel

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH v2 08/10] drm/amdgpu: Add DMA mapping of GTT BOs

2021-04-27 Thread Zeng, Oak


Regards,
Oak 

 

On 2021-04-26, 11:56 PM, "Kuehling, Felix"  wrote:

Am 2021-04-26 um 8:35 p.m. schrieb Zeng, Oak:
> Regards,
> Oak 
>
>  
>
> On 2021-04-21, 9:31 PM, "amd-gfx on behalf of Felix Kuehling" 
 
wrote:
>
> Use DMABufs with dynamic attachment to DMA-map GTT BOs on other GPUs.
>
> Signed-off-by: Felix Kuehling 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|  2 +
>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 76 
++-
>  2 files changed, 77 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index 63668433f5a6..b706e5a54782 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -41,6 +41,7 @@ struct amdgpu_device;
>  enum kfd_mem_attachment_type {
>   KFD_MEM_ATT_SHARED, /* Share kgd_mem->bo or another 
attachment's */
>   KFD_MEM_ATT_USERPTR,/* SG bo to DMA map pages from a 
userptr bo */
> + KFD_MEM_ATT_DMABUF, /* DMAbuf to DMA map TTM BOs */
>  };
>
>  struct kfd_mem_attachment {
> @@ -56,6 +57,7 @@ struct kfd_mem_attachment {
>  struct kgd_mem {
>   struct mutex lock;
>   struct amdgpu_bo *bo;
> + struct dma_buf *dmabuf;
>   struct list_head attachments;
>   /* protected by amdkfd_process_info.lock */
>   struct ttm_validate_buffer validate_list;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index 9eeedd0c7920..18a1f9222a59 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -524,6 +524,16 @@ kfd_mem_dmamap_userptr(struct kgd_mem *mem,
>   return ret;
>  }
>
> +static int
> +kfd_mem_dmamap_dmabuf(struct kfd_mem_attachment *attachment)
> +{
> + struct ttm_operation_ctx ctx = {.interruptible = true};
> + struct amdgpu_bo *bo = attachment->bo_va->base.bo;
> +
> + amdgpu_bo_placement_from_domain(bo, AMDGPU_GEM_DOMAIN_GTT);
> + return ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
> How does this work? The function name says this is dma mapping a buffer 
but from the implementation, it is just a placement and validation

Conceptually, calling ttm_bo_validate ensures that the BO is in the
specified domain, in this case GTT. Before calling validate, it can be
in the CPU domain, which means it may be swapped to disk so it's not GPU
accessible. For a DMABuf attachment, the CPU domain means, that the
DMABuf is not attached because the underlying memory object may be on
the move or swapped out.

The actual implementation of the dmabuf attachment is currently in
amdgpu_ttm_populate/unpopulate. This is incorrect. Patch 10 in this
series fixes that to move the actual dmabuf attachment into
amdgpu_ttm_backend_bind/unbind, which is called from amdgpu_bo_move when
a BO is moved between the CPU and GTT domains.

Thanks for the explanation. One more thing I don't quite understand: before this
series, GTT memory should already have been validated somewhere before it is mapped
to the GPU. You added a GTT validation here - will this validation be duplicated?

The function name kfd_mem_dmamap_dmabuf is still confusing, since it seems to me it
is only preparation work before dynamically DMA-mapping GTT memory. But I understand
that from this series' perspective, compared to userptr (where you actually do the
DMA-mapping in kfd_mem_dmamap_userptr), for GTT memory you leverage the amdgpu TTM
dynamic DMA-mapping machinery. So maybe the naming makes sense from that perspective.

Another thing related, but not directly, to this series: GTT memory is DMA-mapped
when it is allocated; see ttm_populate_and_map_pages calling dma_map_page. The
question is, will the GTT pages first be DMA-unmapped before they are mapped again in
amdgpu_ttm_backend_bind? That is existing behavior, not from your series. Maybe there
is no issue, but I just want to make sure while we are looking at this area.
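
To check my understanding of the dynamic attachment flow you describe, here is a
simplified kernel-style sketch (illustration only, not the actual amdgpu code) of
what I expect happens for a dmabuf attachment once the attach/map is done at
backend_bind time:

#include <linux/dma-buf.h>
#include <linux/dma-direction.h>
#include <linux/err.h>

/* With a dynamic dma-buf attachment, the DMA mapping is (re)created at
 * backend_bind time rather than at allocation time, so moving the BO and
 * re-validating it to GTT redoes the mapping; the matching
 * dma_buf_unmap_attachment() would happen at backend_unbind time.
 */
static int example_backend_bind(struct dma_buf_attachment *attach,
				struct sg_table **sgt_out)
{
	struct sg_table *sgt;

	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	if (IS_ERR(sgt))
		return PTR_ERR(sgt);

	*sgt_out = sgt;
	return 0;
}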

Regards,
  Felix


> +}
> +
>  static int
>  kfd_mem_dmamap_attachment(struct kgd_mem *mem,
> struct kfd_mem_attachment *attachment)
> @@ -533,6 +543,8 @@ kfd_mem_dmamap_attachment(struct kgd_mem *mem,
>   return 0;
&g

Re: [PATCH v2 08/10] drm/amdgpu: Add DMA mapping of GTT BOs

2021-04-26 Thread Zeng, Oak


Regards,
Oak 

 

On 2021-04-21, 9:31 PM, "amd-gfx on behalf of Felix Kuehling" 
 
wrote:

Use DMABufs with dynamic attachment to DMA-map GTT BOs on other GPUs.

Signed-off-by: Felix Kuehling 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|  2 +
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 76 ++-
 2 files changed, 77 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 63668433f5a6..b706e5a54782 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -41,6 +41,7 @@ struct amdgpu_device;
 enum kfd_mem_attachment_type {
KFD_MEM_ATT_SHARED, /* Share kgd_mem->bo or another attachment's */
KFD_MEM_ATT_USERPTR,/* SG bo to DMA map pages from a userptr bo */
+   KFD_MEM_ATT_DMABUF, /* DMAbuf to DMA map TTM BOs */
 };

 struct kfd_mem_attachment {
@@ -56,6 +57,7 @@ struct kfd_mem_attachment {
 struct kgd_mem {
struct mutex lock;
struct amdgpu_bo *bo;
+   struct dma_buf *dmabuf;
struct list_head attachments;
/* protected by amdkfd_process_info.lock */
struct ttm_validate_buffer validate_list;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 9eeedd0c7920..18a1f9222a59 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -524,6 +524,16 @@ kfd_mem_dmamap_userptr(struct kgd_mem *mem,
return ret;
 }

+static int
+kfd_mem_dmamap_dmabuf(struct kfd_mem_attachment *attachment)
+{
+   struct ttm_operation_ctx ctx = {.interruptible = true};
+   struct amdgpu_bo *bo = attachment->bo_va->base.bo;
+
+   amdgpu_bo_placement_from_domain(bo, AMDGPU_GEM_DOMAIN_GTT);
+   return ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
How does this work? The function name says this is DMA-mapping a buffer, but from
the implementation it is just a placement change and a validation.
+}
+
 static int
 kfd_mem_dmamap_attachment(struct kgd_mem *mem,
  struct kfd_mem_attachment *attachment)
@@ -533,6 +543,8 @@ kfd_mem_dmamap_attachment(struct kgd_mem *mem,
return 0;
case KFD_MEM_ATT_USERPTR:
return kfd_mem_dmamap_userptr(mem, attachment);
+   case KFD_MEM_ATT_DMABUF:
+   return kfd_mem_dmamap_dmabuf(attachment);
default:
WARN_ON_ONCE(1);
}
@@ -562,6 +574,19 @@ kfd_mem_dmaunmap_userptr(struct kgd_mem *mem,
ttm->sg = NULL;
 }

+static void
+kfd_mem_dmaunmap_dmabuf(struct kfd_mem_attachment *attachment)
+{
+   struct ttm_operation_ctx ctx = {.interruptible = true};
+   struct amdgpu_bo *bo = attachment->bo_va->base.bo;
+
+   amdgpu_bo_placement_from_domain(bo, AMDGPU_GEM_DOMAIN_CPU);
+   ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
+   /* FIXME: This does not guarantee that amdgpu_ttm_tt_unpopulate is
+* called
+*/
+}
+
 static void
 kfd_mem_dmaunmap_attachment(struct kgd_mem *mem,
struct kfd_mem_attachment *attachment)
@@ -572,6 +597,9 @@ kfd_mem_dmaunmap_attachment(struct kgd_mem *mem,
case KFD_MEM_ATT_USERPTR:
kfd_mem_dmaunmap_userptr(mem, attachment);
break;
+   case KFD_MEM_ATT_DMABUF:
+   kfd_mem_dmaunmap_dmabuf(attachment);
+   break;
default:
WARN_ON_ONCE(1);
}
@@ -605,6 +633,38 @@ kfd_mem_attach_userptr(struct amdgpu_device *adev, 
struct kgd_mem *mem,
return 0;
 }

+static int
+kfd_mem_attach_dmabuf(struct amdgpu_device *adev, struct kgd_mem *mem,
+ struct amdgpu_bo **bo)
+{
+   struct drm_gem_object *gobj;
+
+   if (!mem->dmabuf) {
+   mem->dmabuf = amdgpu_gem_prime_export(&mem->bo->tbo.base,
+   mem->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE ?
+   DRM_RDWR : 0);
+   if (IS_ERR(mem->dmabuf)) {
+   mem->dmabuf = NULL;
+   return PTR_ERR(mem->dmabuf);
+   }
+   }
+
+   gobj = amdgpu_gem_prime_import(&adev->ddev, mem->dmabuf);
+   if (IS_ERR(gobj))
+   return PTR_ERR(gobj);
+
+   /* Import takes an extra reference on the dmabuf. Drop it now to
+* avoid leaking it. We only need the one reference in
+* kgd_mem->dmabuf.
+*/
+   dma_buf_put(mem->dmabuf);
+
+   *bo = gem_to_amdgpu_bo(gobj);
+   (*bo)->parent = amdgpu_bo_ref(mem->bo);
+
+   return 0;
+}
+
 /* kfd_mem_attach - Add a BO to a VM
  *
  * Everything that needs to bo done only once when a BO is first 

Re: [PATCH v2 06/10] drm/amdgpu: DMA map/unmap when updating GPU mappings

2021-04-26 Thread Zeng, Oak


Regards,
Oak 

 

On 2021-04-21, 9:31 PM, "dri-devel on behalf of Felix Kuehling" 
 
wrote:

DMA map kfd_mem_attachments in update_gpuvm_pte. This function is called
with the BO and page tables reserved, so we can safely update the DMA
mapping.

DMA unmap when a BO is unmapped from a GPU and before updating mappings
in restore workers.

Signed-off-by: Felix Kuehling 
---
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 56 ++-
 1 file changed, 29 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 49d1af4aa5f1..7d25d886b98c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -961,11 +961,12 @@ static int unreserve_bo_and_vms(struct 
bo_vm_reservation_context *ctx,
return ret;
 }

-static int unmap_bo_from_gpuvm(struct amdgpu_device *adev,
+static void unmap_bo_from_gpuvm(struct kgd_mem *mem,
struct kfd_mem_attachment *entry,
struct amdgpu_sync *sync)
 {
struct amdgpu_bo_va *bo_va = entry->bo_va;
+   struct amdgpu_device *adev = entry->adev;
struct amdgpu_vm *vm = bo_va->base.vm;

amdgpu_vm_bo_unmap(adev, bo_va, entry->va);
@@ -974,15 +975,20 @@ static int unmap_bo_from_gpuvm(struct amdgpu_device 
*adev,

amdgpu_sync_fence(sync, bo_va->last_pt_update);

-   return 0;
+   kfd_mem_dmaunmap_attachment(mem, entry);
 }

-static int update_gpuvm_pte(struct amdgpu_device *adev,
-   struct kfd_mem_attachment *entry,
-   struct amdgpu_sync *sync)
+static int update_gpuvm_pte(struct kgd_mem *mem,
+   struct kfd_mem_attachment *entry,
+   struct amdgpu_sync *sync)
 {
-   int ret;
struct amdgpu_bo_va *bo_va = entry->bo_va;
+   struct amdgpu_device *adev = entry->adev;
+   int ret;
+
+   ret = kfd_mem_dmamap_attachment(mem, entry);
Should the DMA mapping be done in the kfd_mem_attach function, when a memory 
object is attached to a VM for the first time? Since each memory object can be 
mapped to many GPUs or many VMs, doing the DMA mapping the first time it is 
attached could simplify the logic. Or, even simpler, maybe we can just DMA-map 
when a memory object is created - it wastes some IOMMU page table entries but 
really simplifies the logic of this patch series. I found this series is not 
very easy to understand. (A rough sketch of the attach-time alternative follows 
at the end of this quoted patch.)
+   if (ret)
+   return ret;

/* Update the page tables  */
ret = amdgpu_vm_bo_update(adev, bo_va, false);
@@ -994,14 +1000,15 @@ static int update_gpuvm_pte(struct amdgpu_device 
*adev,
return amdgpu_sync_fence(sync, bo_va->last_pt_update);
 }

-static int map_bo_to_gpuvm(struct amdgpu_device *adev,
-   struct kfd_mem_attachment *entry, struct amdgpu_sync *sync,
-   bool no_update_pte)
+static int map_bo_to_gpuvm(struct kgd_mem *mem,
+  struct kfd_mem_attachment *entry,
+  struct amdgpu_sync *sync,
+  bool no_update_pte)
 {
int ret;

/* Set virtual address for the allocation */
-   ret = amdgpu_vm_bo_map(adev, entry->bo_va, entry->va, 0,
+   ret = amdgpu_vm_bo_map(entry->adev, entry->bo_va, entry->va, 0,
   amdgpu_bo_size(entry->bo_va->base.bo),
   entry->pte_flags);
if (ret) {
@@ -1013,7 +1020,7 @@ static int map_bo_to_gpuvm(struct amdgpu_device *adev,
if (no_update_pte)
return 0;

-   ret = update_gpuvm_pte(adev, entry, sync);
+   ret = update_gpuvm_pte(mem, entry, sync);
if (ret) {
pr_err("update_gpuvm_pte() failed\n");
goto update_gpuvm_pte_failed;
@@ -1022,7 +1029,7 @@ static int map_bo_to_gpuvm(struct amdgpu_device *adev,
return 0;

 update_gpuvm_pte_failed:
-   unmap_bo_from_gpuvm(adev, entry, sync);
+   unmap_bo_from_gpuvm(mem, entry, sync);
return ret;
 }

@@ -1596,7 +1603,7 @@ int amdgpu_amdkfd_gpuvm_map_memory_to_gpu(
pr_debug("\t map VA 0x%llx - 0x%llx in entry %p\n",
 entry->va, entry->va + bo_size, entry);

-   ret = map_bo_to_gpuvm(adev, entry, ctx.sync,
+   ret = map_bo_to_gpuvm(mem, entry, ctx.sync,
  is_invalid_userptr);
if (ret) {
pr_err("Failed to map bo to gpuvm\n");
@@ -1635,7 +1642,6 @@ int amdgpu_amdkfd_gpuvm_map_memory_to_gpu(
 int amdgpu_amdkfd_gpuvm_unmap_memory_from_gpu(
struct kgd_dev *kgd, struct kgd_mem *mem, void *drm_priv)
 {
-   struct amdgpu_device *adev = 
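
[Oak] To make the attach-time alternative I mentioned above concrete, here is a
rough sketch that reuses the helpers introduced in this series. It is only an
illustration of the suggestion (error unwinding and the SHARED/AQL cases are
ignored; the function name is made up):

static int kfd_mem_attach_and_dmamap_sketch(struct kgd_mem *mem,
					    struct kfd_mem_attachment *attachment)
{
	int ret;

	/* attachment has just been created and added to mem->attachments
	 * by kfd_mem_attach(); DMA-map it once here instead of in
	 * update_gpuvm_pte() on every map call.
	 */
	ret = kfd_mem_dmamap_attachment(mem, attachment);
	if (ret)
		list_del(&attachment->list);

	return ret;
}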

Re: [PATCH v2 05/10] drm/amdgpu: Add multi-GPU DMA mapping helpers

2021-04-26 Thread Zeng, Oak
As I understand it, when one GPU maps another GPU's VRAM, that VRAM should also 
be mapped in the IOMMU page table. Normal GTT memory (versus userptr) also needs 
to be mapped in the IOMMU. But I don't see that code below; I only see you map 
userptr in the IOMMU. Maybe you map them in the IOMMU at some point other than 
memory attachment time?

Also see a nit-pick inline

Regards,
Oak 

 

On 2021-04-21, 9:31 PM, "dri-devel on behalf of Felix Kuehling" 
 
wrote:

Add BO-type specific helpers functions to DMA-map and unmap
kfd_mem_attachments. Implement this functionality for userptrs by creating
one SG BO per GPU and filling it with a DMA mapping of the pages from the
original mem->bo.

Signed-off-by: Felix Kuehling 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|   8 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 146 +-
 2 files changed, 145 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index c24b2478f445..63668433f5a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -38,11 +38,17 @@ extern uint64_t amdgpu_amdkfd_total_mem_size;

 struct amdgpu_device;

+enum kfd_mem_attachment_type {
+   KFD_MEM_ATT_SHARED, /* Share kgd_mem->bo or another attachment's */
+   KFD_MEM_ATT_USERPTR,/* SG bo to DMA map pages from a userptr bo */
+};
+
 struct kfd_mem_attachment {
struct list_head list;
+   enum kfd_mem_attachment_type type;
+   bool is_mapped;
struct amdgpu_bo_va *bo_va;
struct amdgpu_device *adev;
-   bool is_mapped;
uint64_t va;
uint64_t pte_flags;
 };
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index fbd7e786b54e..49d1af4aa5f1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -473,12 +473,117 @@ static uint64_t get_pte_flags(struct amdgpu_device 
*adev, struct kgd_mem *mem)
return pte_flags;
 }

+static int
+kfd_mem_dmamap_userptr(struct kgd_mem *mem,
+  struct kfd_mem_attachment *attachment)
+{
+   enum dma_data_direction direction =
+   mem->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE ?
+   DMA_BIDIRECTIONAL : DMA_TO_DEVICE;
+   struct ttm_operation_ctx ctx = {.interruptible = true};
+   struct amdgpu_bo *bo = attachment->bo_va->base.bo;
+   struct amdgpu_device *adev = attachment->adev;
+   struct ttm_tt *src_ttm = mem->bo->tbo.ttm;
+   struct ttm_tt *ttm = bo->tbo.ttm;
+   int ret;
+
+   ttm->sg = kmalloc(sizeof(*ttm->sg), GFP_KERNEL);
+   if (unlikely(!ttm->sg))
+   return -ENOMEM;
+
+   if (WARN_ON(ttm->num_pages != src_ttm->num_pages))
+   return -EINVAL;
+
+   /* Same sequence as in amdgpu_ttm_tt_pin_userptr */
+   ret = sg_alloc_table_from_pages(ttm->sg, src_ttm->pages,
+   ttm->num_pages, 0,
+   (u64)ttm->num_pages << PAGE_SHIFT,
+   GFP_KERNEL);
+   if (unlikely(ret))
+   goto release_sg;
Should this go to a label starting from the kfree below, since 
sg_alloc_table_from_pages failed and the sg table was never allocated? (See the 
sketch at the end of this quoted patch.)
+
+   ret = dma_map_sgtable(adev->dev, ttm->sg, direction, 0);
+   if (unlikely(ret))
+   goto release_sg;
+
+   drm_prime_sg_to_dma_addr_array(ttm->sg, ttm->dma_address,
+  ttm->num_pages);
+
+   amdgpu_bo_placement_from_domain(bo, AMDGPU_GEM_DOMAIN_GTT);
+   ret = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
+   if (ret)
+   goto release_sg;
+
+   return 0;
+
+release_sg:
+   pr_err("DMA map userptr failed: %d\n", ret);
+   sg_free_table(ttm->sg);
+   kfree(ttm->sg);
+   ttm->sg = NULL;
+   return ret;
+}
+
+static int
+kfd_mem_dmamap_attachment(struct kgd_mem *mem,
+ struct kfd_mem_attachment *attachment)
+{
+   switch (attachment->type) {
+   case KFD_MEM_ATT_SHARED:
+   return 0;
+   case KFD_MEM_ATT_USERPTR:
+   return kfd_mem_dmamap_userptr(mem, attachment);
+   default:
+   WARN_ON_ONCE(1);
+   }
+   return -EINVAL;
+}
+
+static void
+kfd_mem_dmaunmap_userptr(struct kgd_mem *mem,
+struct kfd_mem_attachment *attachment)
+{
+   enum dma_data_direction direction =
+   mem->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE ?
+   DMA_BIDIRECTIONAL : DMA_TO_DEVICE;
+   struct ttm_operation_ctx ctx = {.interruptible = false};
+   struct amdgpu_bo *bo = attachment->bo_va->base.bo;
+   struct amdgpu_device *adev = attachment->adev;
+   
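
[Oak] To make my comment about the error label after sg_alloc_table_from_pages()
concrete, here is a trimmed sketch of only the allocation/mapping steps with the
split error path I had in mind (the function name is made up; the dma_address
and validate steps are omitted):

static int dmamap_userptr_error_path_sketch(struct amdgpu_device *adev,
					    struct ttm_tt *ttm,
					    struct ttm_tt *src_ttm,
					    enum dma_data_direction direction)
{
	int ret;

	ttm->sg = kmalloc(sizeof(*ttm->sg), GFP_KERNEL);
	if (unlikely(!ttm->sg))
		return -ENOMEM;

	ret = sg_alloc_table_from_pages(ttm->sg, src_ttm->pages,
					ttm->num_pages, 0,
					(u64)ttm->num_pages << PAGE_SHIFT,
					GFP_KERNEL);
	if (unlikely(ret))
		goto free_sg;		/* no sg table to free yet */

	ret = dma_map_sgtable(adev->dev, ttm->sg, direction, 0);
	if (unlikely(ret))
		goto release_sg;	/* free the table we just built */

	return 0;

release_sg:
	sg_free_table(ttm->sg);
free_sg:
	kfree(ttm->sg);
	ttm->sg = NULL;
	return ret;
}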

Re: [PATCH v2 04/10] drm/amdgpu: Simplify AQL queue mapping

2021-04-22 Thread Zeng, Oak


Regards,
Oak 

 

On 2021-04-21, 9:31 PM, "amd-gfx on behalf of Felix Kuehling" 
 
wrote:

Do AQL queue double-mapping with a single attach call. That will make it
easier to create per-GPU BOs later, to be shared between the two BO VA
mappings on the same GPU.

Freeing the attachments is not necessary if map_to_gpu fails. These will be
cleaned up when the kdg_mem object is destroyed in
amdgpu_amdkfd_gpuvm_free_memory_of_gpu.

Signed-off-by: Felix Kuehling 
---
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 103 --
 1 file changed, 48 insertions(+), 55 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 34c9a2d0028e..fbd7e786b54e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -486,70 +486,76 @@ static uint64_t get_pte_flags(struct amdgpu_device 
*adev, struct kgd_mem *mem)
  * 4a.  Validate new page tables and directories
  */
 static int kfd_mem_attach(struct amdgpu_device *adev, struct kgd_mem *mem,
-   struct amdgpu_vm *vm, bool is_aql,
-   struct kfd_mem_attachment **p_attachment)
+   struct amdgpu_vm *vm, bool is_aql)
 {
unsigned long bo_size = mem->bo->tbo.base.size;
uint64_t va = mem->va;
-   struct kfd_mem_attachment *attachment;
-   struct amdgpu_bo *bo;
-   int ret;
+   struct kfd_mem_attachment *attachment[2] = {NULL, NULL};
+   struct amdgpu_bo *bo[2] = {NULL, NULL};
+   int i, ret;

if (!va) {
pr_err("Invalid VA when adding BO to VM\n");
return -EINVAL;
}

-   if (is_aql)
-   va += bo_size;
-
-   attachment = kzalloc(sizeof(*attachment), GFP_KERNEL);
-   if (!attachment)
-   return -ENOMEM;
+   for (i = 0; i <= is_aql; i++) {
+   attachment[i] = kzalloc(sizeof(*attachment[i]), GFP_KERNEL);
+   if (unlikely(!attachment[i])) {
+   ret = -ENOMEM;
+   goto unwind;
+   }

-   pr_debug("\t add VA 0x%llx - 0x%llx to vm %p\n", va,
-   va + bo_size, vm);
+   pr_debug("\t add VA 0x%llx - 0x%llx to vm %p\n", va,
+va + bo_size, vm);

-   /* FIXME: For now all attachments use the same BO. This is incorrect
-* because one BO can only have one DMA mapping for one GPU. We need
-* one BO per GPU, e.g. a DMABuf import with dynamic attachment. This
-* will be addressed one BO-type at a time in subsequent patches.
-*/
-   bo = mem->bo;
-   drm_gem_object_get(&bo->tbo.base);
+   /* FIXME: For now all attachments use the same BO. This is
+* incorrect because one BO can only have one DMA mapping
+* for one GPU. We need one BO per GPU, e.g. a DMABuf
+* import with dynamic attachment. This will be addressed
+* one BO-type at a time in subsequent patches.
+*/
+   bo[i] = mem->bo;
+   drm_gem_object_get(&bo[i]->tbo.base);

-   /* Add BO to VM internal data structures*/
-   attachment->bo_va = amdgpu_vm_bo_add(adev, vm, bo);
-   if (!attachment->bo_va) {
-   ret = -EINVAL;
-   pr_err("Failed to add BO object to VM. ret == %d\n",
-   ret);
-   goto err_vmadd;
-   }
+   /* Add BO to VM internal data structures */
+   attachment[i]->bo_va = amdgpu_vm_bo_add(adev, vm, bo[i]);
Just for discussion: are we allowed to add one BO twice to a VM? When I looked 
at amdgpu_vm_bo_base_init (called by amdgpu_vm_bo_add), at the line:
bo->vm_bo = base;
when you add the same BO to the VM a second time, bo->vm_bo will be overwritten. 
I am not sure whether this will cause an issue later.
This is not introduced by your code. The original code (calling kfd_mem_attach 
twice for AQL) has the same problem.
+   if (unlikely(!attachment[i]->bo_va)) {
+   ret = -ENOMEM;
+   pr_err("Failed to add BO object to VM. ret == %d\n",
+  ret);
+   goto unwind;
+   }

-   attachment->va = va;
-   attachment->pte_flags = get_pte_flags(adev, mem);
-   attachment->adev = adev;
-   list_add(&attachment->list, &mem->attachments);
+   attachment[i]->va = va;
+   attachment[i]->pte_flags = get_pte_flags(adev, mem);
+   attachment[i]->adev = adev;
+   list_add(&attachment[i]->list, &mem->attachments);

-   if (p_attachment)
-   *p_attachment = attachment;
+   va += bo_size;
+   }

/* Allocate validate page tables if needed */
ret = vm_validate_pt_pd_bos(vm);
if (unlikely(ret)) {
   

Re: [PATCH] drm/amdgpu: Mark mmhub_v1_7_setup_vm_pt_regs() as static

2021-03-12 Thread Zeng, Oak
[AMD Official Use Only - Internal Distribution Only]

Thank you, Joarder, for the fix. But this has already been fixed in Alex's 
drm-next branch.

Regards,
Oak



On 2021-03-12, 5:19 PM, "Souptick Joarder"  wrote:

Kernel test robot throws below warning ->

drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c:56:6: warning: no previous
prototype for 'mmhub_v1_7_setup_vm_pt_regs' [-Wmissing-prototypes]

Mark mmhub_v1_7_setup_vm_pt_regs() as static.

Reported-by: kernel test robot 
Signed-off-by: Souptick Joarder 
---
 drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c 
b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c
index 4df0b73..ae7d8a1 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c
+++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c
@@ -53,7 +53,7 @@ static u64 mmhub_v1_7_get_fb_location(struct 
amdgpu_device *adev)
 return base;
 }

-void mmhub_v1_7_setup_vm_pt_regs(struct amdgpu_device *adev, uint32_t vmid,
+static void mmhub_v1_7_setup_vm_pt_regs(struct amdgpu_device *adev, 
uint32_t vmid,
 uint64_t page_table_base)
 {
 struct amdgpu_vmhub *hub = &adev->vmhub[AMDGPU_MMHUB_0];
--
1.9.1


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


RE: [PATCH] drm/ttm: ioremap buffer according to TTM mem caching setting

2021-03-03 Thread Zeng, Oak
[AMD Official Use Only - Internal Distribution Only]

Hi Christian,

Can you explain why the __iomem annotation is mandatory for the amdgpu driver? 
If that is the case, we can't switch to memremap. The only fix, it seems to me, 
is to add a #ifdef __x86_64__ around the ioremap_cache code.

Regards,
Oak

From: Christian König 
Sent: Wednesday, March 3, 2021 5:46 AM
To: Zeng, Oak ; amd-...@lists.freedesktop.org; 
dri-devel@lists.freedesktop.org; Daniel Vetter ; Dave Airlie 
; Thomas Hellström (Intel) ; 
dan.j.willi...@intel.com
Cc: kbuild-...@lists.01.org; Kuehling, Felix ; 
Kasiviswanathan, Harish ; Deucher, Alexander 
; Huang, JinHuiEric ; 
Koenig, Christian 
Subject: Re: [PATCH] drm/ttm: ioremap buffer according to TTM mem caching 
setting

Hi Oak,



config: parisc-randconfig-r012-20210302 (attached as .config)

It's not the Intel driver build which fails here, but the build bot is just 
hosted by Intel.

The problem is that the parisc architecture doesn't defines the ioremap_cache() 
function.

I've looked at using memremap() instead of ioremap_cache(). The problem is that 
we do support architectures with the TTM as well as amdgpu code where the 
__iomem annotation is mandatory and correct.

Regards,
Christian.
Am 02.03.21 um 23:45 schrieb Zeng, Oak:

[AMD Official Use Only - Internal Distribution Only]

Hi Daniel, Thomas, Dan,

Does the message below mean that calling ioremap_cache failed Intel's driver 
build? I can see both ioremap_cache and ioremap_wc are defined in 
arch/x86/mm/ioremap.c - why doesn't ioremap_wc break the Intel driver's build?

Are we supposed to use memremap (offset, size, MEMREMAP_WB) to replace 
ioremap_cache? When I read https://lwn.net/Articles/653585/ I felt that 
ioremap_cache returns an address annotated with __iomem while memremap returns 
an address without the __iomem annotation. In our use case, GPU memory is 
treated as UEFI SPM (specific purpose memory). I am not very sure whether 
memremap (thus no __iomem annotation) is the right thing to do. What I am sure 
of is, we have tested ioremap_cache and it works fine on AMD systems.

I will send out a test patch replacing ioremap_cache with ioremap_wc, to 
trigger the Intel build robot to see whether it fails the Intel build. I 
suppose it will not.

Regards,
Oak

From: Christian König 
<mailto:ckoenig.leichtzumer...@gmail.com>
Sent: Tuesday, March 2, 2021 6:31 AM
To: amd-...@lists.freedesktop.org<mailto:amd-...@lists.freedesktop.org>; 
dri-devel@lists.freedesktop.org<mailto:dri-devel@lists.freedesktop.org>; Daniel 
Vetter <mailto:dan...@ffwll.ch>; Dave Airlie 
<mailto:airl...@redhat.com>; Thomas Hellström (Intel) 
<mailto:thomas...@shipmail.org>
Cc: Zeng, Oak <mailto:oak.z...@amd.com>; 
kbuild-...@lists.01.org<mailto:kbuild-...@lists.01.org>; Kuehling, Felix 
<mailto:felix.kuehl...@amd.com>; Kasiviswanathan, 
Harish <mailto:harish.kasiviswanat...@amd.com>; 
Deucher, Alexander 
<mailto:alexander.deuc...@amd.com>; Huang, 
JinHuiEric <mailto:jinhuieric.hu...@amd.com>; Koenig, 
Christian <mailto:christian.koe...@amd.com>
Subject: Re: [PATCH] drm/ttm: ioremap buffer according to TTM mem caching 
setting

Hi guys,

adding the usual suspects directly. Does anybody know off hand how to check 
whether an architecture supports ioremap_cache()?

For now we only need this on X86, but I would feel better if we don't use an 
#ifdef here.

Regards,
Christian.
Am 02.03.21 um 05:12 schrieb kernel test robot:

Hi Oak,



Thank you for the patch! Yet something to improve:



[auto build test ERROR on drm-intel/for-linux-next]

[also build test ERROR on drm-tip/drm-tip linus/master v5.12-rc1 next-20210302]

[cannot apply to tegra-drm/drm/tegra/for-next drm-exynos/exynos-drm-next 
drm/drm-next]

[If your patch is applied to the wrong git tree, kindly drop us a note.

And when submitting patch, we suggest to use '--base' as documented in

https://git-scm.com/docs/git-format-patch]



url:
https://github.com/0day-ci/linux/commits/Oak-Zeng/drm-ttm-ioremap-buffer-according-to-TTM-mem-caching-setting/20210302-064500

RE: [PATCH] drm/ttm: ioremap buffer according to TTM mem caching setting

2021-03-02 Thread Zeng, Oak
[AMD Official Use Only - Internal Distribution Only]

Hi Daniel, Thomas, Dan,

Does the message below mean that calling ioremap_cache failed Intel's driver 
build? I can see both ioremap_cache and ioremap_wc are defined in 
arch/x86/mm/ioremap.c - why doesn't ioremap_wc break the Intel driver's build?

Are we supposed to use memremap (offset, size, MEMREMAP_WB) to replace 
ioremap_cache? When I read https://lwn.net/Articles/653585/ I felt that 
ioremap_cache returns an address annotated with __iomem while memremap returns 
an address without the __iomem annotation. In our use case, GPU memory is 
treated as UEFI SPM (specific purpose memory). I am not very sure whether 
memremap (thus no __iomem annotation) is the right thing to do. What I am sure 
of is, we have tested ioremap_cache and it works fine on AMD systems.
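
To illustrate the annotation difference I am referring to, here is a standalone
sketch (not the TTM patch itself; the base/size values are placeholders):

#include <linux/io.h>

static void annotation_example(resource_size_t base, size_t size)
{
	void __iomem *io_addr;	/* ioremap_cache() returns an __iomem pointer */
	void *wb_addr;		/* memremap() returns a plain kernel pointer */

	io_addr = ioremap_cache(base, size);
	if (io_addr)
		iounmap(io_addr);

	wb_addr = memremap(base, size, MEMREMAP_WB);
	if (wb_addr)
		memunmap(wb_addr);
}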

I will send out a test patch replacing ioremap_cache with ioremap_wc, to 
trigger the Intel build robot to see whether it fails the Intel build. I 
suppose it will not.

Regards,
Oak

From: Christian König 
Sent: Tuesday, March 2, 2021 6:31 AM
To: amd-...@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Daniel 
Vetter ; Dave Airlie ; Thomas Hellström 
(Intel) 
Cc: Zeng, Oak ; kbuild-...@lists.01.org; Kuehling, Felix 
; Kasiviswanathan, Harish 
; Deucher, Alexander 
; Huang, JinHuiEric ; 
Koenig, Christian 
Subject: Re: [PATCH] drm/ttm: ioremap buffer according to TTM mem caching 
setting

Hi guys,

adding the usual suspects directly. Does anybody know off hand how to check 
whether an architecture supports ioremap_cache()?

For now we only need this on X86, but I would feel better if we don't use an 
#ifdef here.

Regards,
Christian.
Am 02.03.21 um 05:12 schrieb kernel test robot:

Hi Oak,



Thank you for the patch! Yet something to improve:



[auto build test ERROR on drm-intel/for-linux-next]

[also build test ERROR on drm-tip/drm-tip linus/master v5.12-rc1 next-20210302]

[cannot apply to tegra-drm/drm/tegra/for-next drm-exynos/exynos-drm-next 
drm/drm-next]

[If your patch is applied to the wrong git tree, kindly drop us a note.

And when submitting patch, we suggest to use '--base' as documented in

https://git-scm.com/docs/git-format-patch]



url:
https://github.com/0day-ci/linux/commits/Oak-Zeng/drm-ttm-ioremap-buffer-according-to-TTM-mem-caching-setting/20210302-064500

base:   git://anongit.freedesktop.org/drm-intel for-linux-next

config: parisc-randconfig-r012-20210302 (attached as .config)

compiler: hppa-linux-gcc (GCC) 9.3.0

reproduce (this is a W=1 build):

wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross
 -O ~/bin/make.cross

chmod +x ~/bin/make.cross

# 
https://github.com/0day-ci/linux/commit/225bb3711439ec559dd72ae5af8e62d34ea60a64

git remote add linux-review 
https://github.com/0day-ci/linux

git fetch --no-tags linux-review 
Oak-Zeng/dr

RE: [PATCH] drm/ttm: fix ttm_bo_unreserve

2019-06-05 Thread Zeng, Oak


Regards,
Oak

-Original Message-
From: Christian König  
Sent: Wednesday, June 5, 2019 7:25 AM
To: Zeng, Oak ; Kuehling, Felix ; 
dri-devel@lists.freedesktop.org; amd-...@lists.freedesktop.org
Subject: Re: [PATCH] drm/ttm: fix ttm_bo_unreserve

Am 04.06.19 um 21:03 schrieb Zeng, Oak:
>
> Regards,
> Oak
>
> -Original Message-
> From: amd-gfx  On Behalf Of 
> Kuehling, Felix
> Sent: Tuesday, June 4, 2019 2:47 PM
> To: Christian König ; 
> dri-devel@lists.freedesktop.org; amd-...@lists.freedesktop.org
> Subject: Re: [PATCH] drm/ttm: fix ttm_bo_unreserve
>
> On 2019-06-04 11:23, Christian König wrote:
>
>> Since we now keep BOs on the LRU we need to make sure that they are 
>> removed when they are pinned.
>>
>> Signed-off-by: Christian König 
>> ---
>>include/drm/ttm/ttm_bo_driver.h | 14 ++
>>1 file changed, 6 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/drm/ttm/ttm_bo_driver.h 
>> b/include/drm/ttm/ttm_bo_driver.h index 9f54cf9c60df..c9b8ba492f24
>> 100644
>> --- a/include/drm/ttm/ttm_bo_driver.h
>> +++ b/include/drm/ttm/ttm_bo_driver.h
>> @@ -767,14 +767,12 @@ static inline int ttm_bo_reserve_slowpath(struct 
>> ttm_buffer_object *bo,
>> */
>>static inline void ttm_bo_unreserve(struct ttm_buffer_object *bo)
>>{
>> -if (!(bo->mem.placement & TTM_PL_FLAG_NO_EVICT)) {
>> -spin_lock(&bo->bdev->glob->lru_lock);
>> -if (list_empty(&bo->lru))
>> -ttm_bo_add_to_lru(bo);
>> -else
>> -ttm_bo_move_to_lru_tail(bo, NULL);
>> -spin_unlock(&bo->bdev->glob->lru_lock);
>> -}
>> +spin_lock(&bo->bdev->glob->lru_lock);
>> +if (list_empty(&bo->lru))
>> +ttm_bo_add_to_lru(bo);
>> +else
>> +ttm_bo_move_to_lru_tail(bo, NULL);
> Going just by the function names, this seems to do the exact opposite of what 
> the change description says.
>
> [Oak] +1, when I read the description, I also got lost... So please do add a 
> more accurate description.

I'm puzzled why you are confused. We now keep the BOs on the LRU while they are 
reserved, so on unreserve we now need to explicitly remove them from the LRU 
when they are pinned.

[Oak] When I read the description, I thought you meant to remove the BO from 
the LRU on a pin action, but from the code, it is done on unreserve. In other 
words, it is better to say "if it is pinned" than "when it is pinned". Sorry 
for being picky. Also, in the code before your change, there was a condition 
"!(bo->mem.placement & TTM_PL_FLAG_NO_EVICT)". Is this condition there to check 
whether the BO is not pinned? How do you check whether a BO is pinned in the 
new code? To me the condition "list_empty(&bo->lru)" only means this BO is 
currently not on the LRU list; I am not sure whether that also means it is not 
pinned. Also, can ttm_bo_move_to_lru_tail be replaced with ttm_bo_del_from_lru? 
From your description, this is more like a function to remove it from the LRU. 
Sorry, too many questions. I really don't know the context here...
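
To make my question concrete, below is a minimal sketch of the kind of explicit
check I was expecting, using TTM_PL_FLAG_NO_EVICT as the "pinned" indicator and
the ttm_bo_del_from_lru helper I mentioned (illustration only - I am not sure
this is the right check for the new code):

static inline void ttm_bo_unreserve_sketch(struct ttm_buffer_object *bo)
{
	spin_lock(&bo->bdev->glob->lru_lock);
	if (bo->mem.placement & TTM_PL_FLAG_NO_EVICT)
		ttm_bo_del_from_lru(bo);	/* pinned: keep it off the LRU */
	else if (list_empty(&bo->lru))
		ttm_bo_add_to_lru(bo);
	else
		ttm_bo_move_to_lru_tail(bo, NULL);
	spin_unlock(&bo->bdev->glob->lru_lock);
	reservation_object_unlock(bo->resv);
}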

>
> Anway, this patch is Reviewed-by: Felix Kuehling 
>
> BTW, this fix is needed for KFD. It fixes our eviction test that was broken 
> by your previous patch series. This test specifically triggers interactions 
> between KFD and graphics under memory pressure. It's something we rarely see 
> in real world compute application testing without a targeted test. But when 
> it breaks it leads to some painful intermittent failures that are hard to 
> regress and debug.
>
> Do you have any targeted tests to trigger evictions when you work on TTM 
> internals?

Cat amdgpu_evict_gtt in debugfs is a good test for this.

Christian.

>
> Regards,
>     Felix
>
>
>> +spin_unlock(&bo->bdev->glob->lru_lock);
>>  reservation_object_unlock(bo->resv);
>>}
>>
> ___
> amd-gfx mailing list
> amd-...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

RE: [PATCH] drm/ttm: fix ttm_bo_unreserve

2019-06-04 Thread Zeng, Oak


Regards,
Oak

-Original Message-
From: amd-gfx  On Behalf Of Kuehling, 
Felix
Sent: Tuesday, June 4, 2019 2:47 PM
To: Christian König ; 
dri-devel@lists.freedesktop.org; amd-...@lists.freedesktop.org
Subject: Re: [PATCH] drm/ttm: fix ttm_bo_unreserve

On 2019-06-04 11:23, Christian König wrote:

> Since we now keep BOs on the LRU we need to make sure that they are 
> removed when they are pinned.
>
> Signed-off-by: Christian König 
> ---
>   include/drm/ttm/ttm_bo_driver.h | 14 ++
>   1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/include/drm/ttm/ttm_bo_driver.h 
> b/include/drm/ttm/ttm_bo_driver.h index 9f54cf9c60df..c9b8ba492f24 
> 100644
> --- a/include/drm/ttm/ttm_bo_driver.h
> +++ b/include/drm/ttm/ttm_bo_driver.h
> @@ -767,14 +767,12 @@ static inline int ttm_bo_reserve_slowpath(struct 
> ttm_buffer_object *bo,
>*/
>   static inline void ttm_bo_unreserve(struct ttm_buffer_object *bo)
>   {
> - if (!(bo->mem.placement & TTM_PL_FLAG_NO_EVICT)) {
> - spin_lock(&bo->bdev->glob->lru_lock);
> - if (list_empty(&bo->lru))
> - ttm_bo_add_to_lru(bo);
> - else
> - ttm_bo_move_to_lru_tail(bo, NULL);
> - spin_unlock(&bo->bdev->glob->lru_lock);
> - }
> + spin_lock(&bo->bdev->glob->lru_lock);
> + if (list_empty(&bo->lru))
> + ttm_bo_add_to_lru(bo);
> + else
> + ttm_bo_move_to_lru_tail(bo, NULL);

Going just by the function names, this seems to do the exact opposite of what 
the change description says.

[Oak] +1, when I read the description, I also got lost... So please do add a 
more accurate description.

Anway, this patch is Reviewed-by: Felix Kuehling 

BTW, this fix is needed for KFD. It fixes our eviction test that was broken by 
your previous patch series. This test specifically triggers interactions 
between KFD and graphics under memory pressure. It's something we rarely see in 
real world compute application testing without a targeted test. But when it 
breaks it leads to some painful intermittent failures that are hard to regress 
and debug.

Do you have any targeted tests to trigger evictions when you work on TTM 
internals?

Regards,
   Felix


> + spin_unlock(&bo->bdev->glob->lru_lock);
>   reservation_object_unlock(bo->resv);
>   }
>   
___
amd-gfx mailing list
amd-...@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel