Re: [PATCH] drm/amdkfd: Use partial migrations in GPU page faults

2023-09-05 Thread Chen, Xiaogang



On 9/5/2023 9:02 AM, Philip Yang wrote:



On 2023-08-31 17:29, Chen, Xiaogang wrote:


On 8/31/2023 3:59 PM, Felix Kuehling wrote:

On 2023-08-31 16:33, Chen, Xiaogang wrote:
That said, I'm not actually sure why we're freeing the DMA 
address array after migration to RAM at all. I think we still 
need it even when we're using VRAM. We call svm_range_dma_map 
in svm_range_validate_and_map regardless of whether the range 
is in VRAM or system memory. So it will just allocate a new 
array the next time the range is validated anyway. VRAM pages 
use a special address encoding to indicate VRAM pages to the 
GPUVM code.


I think we do not need to free the DMA address array, as you said; that 
is a separate issue though.


We need to unmap the dma addresses (dma_unmap_page) after migrating from 
ram to vram, because we always do dma_map_page at 
svm_range_validate_and_map. If not, we would have multiple dma mappings 
for the same sys ram page.


svm_range_dma_map_dev calls dma_unmap_page before overwriting an 
existing valid entry in the dma_addr array. Anyway, dma 
unmapping the old pages in bulk may still be cleaner. And it 
avoids delays in cleaning up DMA mappings after migrations.
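A condensed sketch of the overwrite path described above. It only 
illustrates the ordering (unmap a still-valid sysram entry before 
writing the new one); the two helpers marked hypothetical are stand-ins, 
not the real kfd_svm.c code:

	for (i = 0; i < npages; i++) {
		/* a still-valid sysram entry is unmapped before reuse */
		if (is_valid_sysram_dma_addr(dev, addr[i]))	/* hypothetical */
			dma_unmap_page(dev, addr[i], PAGE_SIZE, dir);

		page = hmm_pfn_to_page(hmm_pfns[i]);
		if (is_zone_device_page(page))
			/* VRAM page: store an encoded address, no DMA map */
			addr[i] = encode_vram_addr(page);	/* hypothetical */
		else
			addr[i] = dma_map_page(dev, page, 0, PAGE_SIZE, dir);
	}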


Regards,
  Felix


Then we may not need to do dma_unmap after migrating from ram to vram, 
since svm_range_dma_map_dev always does dma_unmap_page if the address is 
a valid dma address for sys ram, and after migrating from ram to vram we 
always do the gpu mapping?


I think with XNACK enabled, the DMA mapping may be delayed until a 
page fault. For example on a multi-GPU system, GPU1 page faults 
and migrates data from system memory to its VRAM. Immediately 
afterwards, the page fault handler should use svm_validate_and_map 
to update GPU1 page tables. But GPU2 page tables are not updated 
immediately. So the now stale DMA mappings for GPU2 would continue 
to exist until the next page fault on GPU2.
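A comment-only timeline of that window (the call sites shown are an 
assumption, for illustration):

	/*
	 * XNACK on, two GPUs, prange migrates to GPU1's VRAM:
	 *
	 *   GPU1 fault -> svm_migrate_ram_to_vram() moves the pages
	 *              -> svm_range_validate_and_map() remaps GPU1
	 *   GPU2 still holds DMA mappings of the old sysram pages
	 *   GPU2 fault -> svm_range_validate_and_map() runs for GPU2;
	 *                 only now are its stale entries unmapped
	 *
	 * Bulk-unmapping at migration time closes this window.
	 */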


Regards,
  Felix

If I understand correctly: when the user calls svm_range_set_attr, if 
p->xnack_enabled is true, we can skip calling 
svm_range_validate_and_map. We postpone the buffer validation and gpu 
mapping until a page fault, or whenever the buffer really gets used by a 
GPU, and then dma map and gpu map only for that GPU.


The current implementation of svm_range_set_attr skips the 
validation after migration if XNACK is off, because it is handled by 
svm_range_restore_work that gets scheduled by the MMU notifier 
triggered by the migration.


With XNACK on, svm_range_set_attr currently validates and maps after 
migration assuming that the data will be used by the GPU(s) soon. 
That is something we could change and let page faults take care of 
the mappings as needed.
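A simplified sketch of the svm_range_set_attr tail these two paragraphs 
describe; the control flow follows the text, but the exact argument list 
of the validate call is an assumption:

	/* inside the loop over pranges, after (optionally) migrating
	 * prange to best_loc */
	if (migrated && !p->xnack_enabled) {
		/* XNACK off: svm_range_restore_work, scheduled by the
		 * MMU notifier the migration triggered, will revalidate
		 * and map, so skip it here */
		continue;
	}
	/* XNACK on: validate and map right away, assuming the GPU(s)
	 * will touch the data soon; the proposal above would drop this
	 * and let per-GPU page faults map on demand */
	r = svm_range_validate_and_map(mm, prange, MAX_GPU_INSTANCE,
				       true, flush_tlb); /* args assumed */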


Yes, with xnack on, my understanding is we can skip 
svm_range_validate_and_map at svm_range_set_attr after migration; the 
page fault handler will then do the dma and gpu mapping. That would save 
the initial dma and gpu mapping, which applies to all GPUs the user 
asked access for. The current gpu page fault handler then just does the 
dma and gpu mapping for the GPU that triggered the page fault. Is that 
ok?


With xnack on, after prefetching the range to a GPU, we still need 
svm_range_validate_and_map to update the mapping of the GPU migrated to 
(also the mappings of GPUs with access_in_place), because the app 
prefetches to the GPU precisely to avoid GPU page faults.


With xnack on, postponing the gpu mapping to the page fault handler may 
save some operations, since we would update the mapping only on the gpu 
that needs it, but that is not for this patch anyway.


After migrating to VRAM, we only need dma_unmap_page for the entries in 
the prange->dma_addr array; we don't need to free the dma_addr array 
itself, as it can be reused to store the VRAM addresses to map to the 
GPU.


Yes, we do not need to free the dma array; we only need dma_unmap_page 
at svm_range_free_dma_mappings. The array stores both system ram dma 
addresses and vram physical addresses. We can free the dma array at 
svm_range_free.
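A sketch of how one array can hold both kinds of entries. The 
domain-flag name follows kfd_svm.h, but the exact encoding shown here is 
an assumption:

	#define SVM_RANGE_VRAM_DOMAIN	(1UL << 0)	/* marks a VRAM entry */

	dma_addr_t entry;

	if (is_zone_device_page(page))		/* page resides in VRAM */
		entry = vram_offset_of(page) |	/* hypothetical helper */
			SVM_RANGE_VRAM_DOMAIN;
	else					/* regular sysram page */
		entry = dma_map_page(dev, page, 0, PAGE_SIZE,
				     DMA_BIDIRECTIONAL);
	prange->dma_addr[gpuidx][i] = entry;	/* same array either way */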


Regards

Xiaogang





Re: [PATCH] drm/amdkfd: Use partial migrations in GPU page faults

2023-08-30 Thread Chen, Xiaogang



On 8/28/2023 5:37 PM, Felix Kuehling wrote:


On 2023-08-28 16:57, Chen, Xiaogang wrote:


On 8/28/2023 2:06 PM, Felix Kuehling wrote:


On 2023-08-24 18:08, Xiaogang.Chen wrote:

From: Xiaogang Chen 

This patch implements partial migration in gpu page faults according to 
migration granularity (default 2MB) and does not split the svm range in 
cpu page fault handling. Now a svm range may have pages from both system 
ram and the vram of one gpu.
These changes are expected to improve migration performance and reduce
mmu callback and TLB flush workloads.

Signed-off-by: xiaogang chen 
---

[...]

@@ -544,10 +547,12 @@ svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc,
 
 	if (cpages) {
 		prange->actual_loc = best_loc;
-		svm_range_free_dma_mappings(prange, true);
-	} else {
+		/* only free dma mapping in the migrated range */
+		svm_range_free_dma_mappings(prange, true, start_mgr - prange->start,
+					    last_mgr - start_mgr + 1);


This is wrong. If we only migrated some of the pages, we should not 
free the DMA mapping array at all. The array is needed as long as 
there are any valid DMA mappings in it.


Yes, I realized it after submitting; I cannot free the DMA mapping array 
at this stage.


The concern (also related to the comments below) is that I do not know 
how many pages are in vram after a partial migration. Originally I used 
a bitmap to record which pages were migrated: inside each migration 
function we know which pages got migrated, so we can update the bitmap 
there directly, without needing hmm functions to get that info.




I think the condition above with cpages should be updated. Instead 
of cpages, we need to keep track of a count of pages in VRAM in 
struct svm_range. See more below.


I think you want to add a new integer in svm_range to remember how many 
pages are on the vram side of each svm_range, instead of a bitmap. There 
is a problem I saw: when we need to split a prange (such as when the 
user uses the set_attr api), how do we know how many pages are in vram 
for each split prange?


Right, that's a bit problematic. But it should be a relatively rare 
corner case. It may be good enough to make a "pessimistic" assumption 
when splitting ranges that have some pages in VRAM, that everything is 
in VRAM. And update that to 0 after migrate_to_ram for the entire 
range, to allow the BO reference to be released.
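A hedged sketch of that pessimistic bookkeeping; the vram_pages field 
and the helper are assumptions for illustration, not the actual patch:

	/* minimal stand-in for the relevant svm_range fields */
	struct range_sketch {
		unsigned long start, last;	/* page range, inclusive */
		unsigned long vram_pages;	/* pages resident in VRAM */
	};

	/* on split, assume the worst: if any pages were in VRAM, treat
	 * both halves as fully resident so the VRAM BO is never
	 * released too early */
	static void split_vram_pages(struct range_sketch *old_half,
				     struct range_sketch *new_half,
				     unsigned long old_vram_pages)
	{
		if (old_vram_pages) {
			old_half->vram_pages =
				old_half->last - old_half->start + 1;
			new_half->vram_pages =
				new_half->last - new_half->start + 1;
		}
	}

	/* once migrate_to_ram has covered the entire range, reset the
	 * count to 0 so the BO reference can be dropped */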


migrate_to_ram is a partial migration too: only 2MB of vram gets 
migrated. After a split, if we assume all pages are vram (pessimistic), 
we will set the new vram page count to the full page count (in each 
split prange). Then, after the 2MB of vram has migrated to ram, we still 
do not know if we can release the BO

Re: [PATCH] drm/amdkfd: Use partial migrations in GPU page faults

2023-08-28 Thread Felix Kuehling



On 2023-08-28 16:57, Chen, Xiaogang wrote:


[...]


I think you want to add a new integer in svm_range to remember how many 
pages are on the vram side of each svm_range, instead of a bitmap. There 
is a problem I saw: when we need to split a prange (such as when the 
user uses the set_attr api), how do we know how many pages are in vram 
for each split prange?


Right, that's a bit problematic. But it should be a relatively rare 
corner case. It may be good enough to make a "pessimistic" assumption 
when splitting ranges that have some pages in VRAM, that everything is 
in VRAM. And update that to 0 after migrate_to_ram for the entire range, 
to allow the BO reference to be released.


So in the worst case, you keep your DMA addresses and BOs allocated 
slightly longer than necessary. If that doesn't work, I agree that we 
need a bitmap with one bit per 4KB page. But I hope that can be avoided.
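For reference, a sketch of that bitmap fallback (an assumption, not the 
patch): one bit per 4KB page of the range, set while the page is 
resident in VRAM:

	unsigned long *vram_bitmap = bitmap_zalloc(npages, GFP_KERNEL);

	/* pages [s..l] (range-relative) migrated to VRAM: */
	bitmap_set(vram_bitmap, s, l - s + 1);
	/* migrated back to system RAM: */
	bitmap_clear(vram_bitmap, s, l - s + 1);
	/* the VRAM BO can be released once nothing is resident: */
	if (bitmap_empty(vram_bitmap, npages))
		svm_range_vram_node_free(prange);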


That said, I'm not actually sure why we're freeing the DMA address array 
after migration to RAM at all. I think we still need it even 


Re: [PATCH] drm/amdkfd: Use partial migrations in GPU page faults

2023-08-28 Thread Felix Kuehling



On 2023-08-24 18:08, Xiaogang.Chen wrote:

From: Xiaogang Chen 

This patch implements partial migration in gpu page faults according to 
migration granularity (default 2MB) and does not split the svm range in 
cpu page fault handling.
Now a svm range may have pages from both system ram and the vram of one gpu.
These changes are expected to improve migration performance and reduce
mmu callback and TLB flush workloads.

Signed-off-by: xiaogang chen 
---
  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 153 +++
  drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |   6 +-
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c |  87 -
  drivers/gpu/drm/amd/amdkfd/kfd_svm.h |   7 +-
  4 files changed, 162 insertions(+), 91 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 7d82c7da223a..5a3aa80a1834 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -479,6 +479,8 @@ svm_migrate_vma_to_vram(struct kfd_node *node, struct svm_range *prange,
  * svm_migrate_ram_to_vram - migrate svm range from system to device
  * @prange: range structure
  * @best_loc: the device to migrate to
+ * @start_mgr: start page to migrate
+ * @last_mgr: last page to migrate
  * @mm: the process mm structure
  * @trigger: reason of migration
  *
@@ -489,6 +491,7 @@ svm_migrate_vma_to_vram(struct kfd_node *node, struct svm_range *prange,
  */
 static int
 svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc,
+			unsigned long start_mgr, unsigned long last_mgr,
 			struct mm_struct *mm, uint32_t trigger)
 {
 	unsigned long addr, start, end;
@@ -498,9 +501,9 @@ svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc,
 	unsigned long cpages = 0;
 	long r = 0;
 
-	if (prange->actual_loc == best_loc) {
-		pr_debug("svms 0x%p [0x%lx 0x%lx] already on best_loc 0x%x\n",
-			 prange->svms, prange->start, prange->last, best_loc);
+	if (!best_loc) {
+		pr_debug("request svms 0x%p [0x%lx 0x%lx] migrate to sys ram\n",
+			 prange->svms, start_mgr, last_mgr);
 		return 0;
 	}
 
@@ -513,8 +516,8 @@ svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc,
 	pr_debug("svms 0x%p [0x%lx 0x%lx] to gpu 0x%x\n", prange->svms,
 		 prange->start, prange->last, best_loc);
 
-	start = prange->start << PAGE_SHIFT;
-	end = (prange->last + 1) << PAGE_SHIFT;
+	start = start_mgr << PAGE_SHIFT;
+	end = (last_mgr + 1) << PAGE_SHIFT;
 
 	r = svm_range_vram_node_new(node, prange, true);
 	if (r) {
@@ -544,10 +547,12 @@ svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc,
 
 	if (cpages) {
 		prange->actual_loc = best_loc;
-		svm_range_free_dma_mappings(prange, true);
-	} else {
+		/* only free dma mapping in the migrated range */
+		svm_range_free_dma_mappings(prange, true,  start_mgr - prange->start,
+					    last_mgr - start_mgr + 1);


This is wrong. If we only migrated some of the pages, we should not free 
the DMA mapping array at all. The array is needed as long as there are 
any valid DMA mappings in it.


I think the condition above with cpages should be updated. Instead of 
cpages, we need to keep track of a count of pages in VRAM in struct 
svm_range. See more below.
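A sketch of the counter variant suggested above (field and counter names 
are assumed, not actual code; assumes cpages/mpages count only pages 
that actually moved):

	/* after a ram->vram migration that moved cpages pages */
	prange->vram_pages += cpages;
	if (prange->vram_pages)
		prange->actual_loc = best_loc;

	/* after a vram->ram migration that moved mpages pages */
	prange->vram_pages -= mpages;
	if (prange->vram_pages == 0) {
		prange->actual_loc = 0;
		svm_range_vram_node_free(prange);
	}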




+	} else if (!prange->actual_loc)
+		/* if all pages from prange are at sys ram */
 		svm_range_vram_node_free(prange);
-	}
 
 	return r < 0 ? r : 0;
 }
@@ -762,6 +767,8 @@ svm_migrate_vma_to_ram(struct kfd_node *node, struct svm_range *prange,
  * svm_migrate_vram_to_ram - migrate svm range from device to system
  * @prange: range structure
  * @mm: process mm, use current->mm if NULL
+ * @start_mgr: start page need be migrated to sys ram
+ * @last_mgr: last page need be migrated to sys ram
  * @trigger: reason of migration
  * @fault_page: is from vmf->page, svm_migrate_to_ram(), this is CPU page fault callback
  *
@@ -771,7 +778,8 @@ svm_migrate_vma_to_ram(struct kfd_node *node, struct svm_range *prange,
  * 0 - OK, otherwise error code
  */
 int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm,
-			    uint32_t trigger, struct page *fault_page)
+			    unsigned long start_mgr, unsigned long last_mgr,
+			    uint32_t trigger, struct page *fault_page)
 {
 	struct kfd_node *node;
 	struct vm_area_struct *vma;
@@ -781,23 +789,30 @@ int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm,
 	unsigned long upages = 0;
 	long r = 0;
 
+	/* this prange has no vram pages to migrate to sys ram */
 	if