Hi Felix:

This patch uses "data migration rate(per second, DMR)" to decide if data migration is too intense already(higher than data_migrate_rate_threshold defined at kfd_migrate.c). The data migration amount and time are recorded for each page fault that already happened. These data are used to calculate DMR(at update_pf_migration_data)

Whether to stop migration is decided by the DMR. When the DMR is too high, migration is stopped (in processing_pf_migration_data): all data accesses to that range from non-current devices become remote. For a CPU page fault we still need to move the data to system RAM; after that, all following data accesses (from all GPUs) are remote.

The hot device is decided only *after* the DMR is higher than the threshold (data_migrate_rate_threshold). The condition is that most of the data migration amount (more than 1/2) was caused by this device.

If the DMR is lower than data_migrate_rate_threshold, the driver still uses the current migration scheme.
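
To make the decision flow explicit, here is a condensed pseudo-C sketch of what happens on each GPU page fault (it mirrors processing_pf_migration_data() and the svm_range_restore_pages() changes in the patch below; migrate_locally() and map_remotely() are placeholders for the existing migration/mapping paths, not real functions):

    dmr_too_high = processing_pf_migration_data(prange, &hot_device);

    if (!dmr_too_high) {
        /* DMR below data_migrate_rate_threshold: keep the current scheme,
         * migrate the faulted granule to the faulting device (best_loc)
         */
        migrate_locally(best_loc);
    } else if (hot_device == INVALID_COMPUTE_DEVICE) {
        /* rate too high and no single dominant device: stop migrating,
         * the faulting device accesses the data remotely
         */
        map_remotely();
    } else if (gpuidx == hot_device) {
        /* the faulting device caused most of the migrated data:
         * park the data in its VRAM
         */
        migrate_locally(gpuidx);
    } else {
        /* another device is hot: leave the data where it is and have the
         * faulting device access it remotely
         */
        map_remotely();
    }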

Let's use the thrashing case to see what happens with this patch: when data keeps getting migrated among devices, the DMR eventually becomes higher than data_migrate_rate_threshold (which can be adjusted through debugfs). From then on the driver does not migrate data for the following page faults; they access the data remotely. If the driver also finds a device that caused most of the migrated data (again, only after the DMR is too high), the data is parked at that device and all other devices access it remotely. Either way, the data stops being moved around.

From your comments below I think you may have missed this part of the new data migration behavior, so I want to clarify that first.

The patch also makes this scheme dynamic. The DMR is recalculated as time evolves: page faults can be intensive for some time, then become relatively quiet, then intensive again, and the hot device can switch to another device. The patch keeps re-checking the DMR and the hot device to decide when to stop/resume data migration and whether there is a device the driver should favor.

The code changes may look like a lot. I think we can first check whether the ideas of the scheme above are OK.

On 11/14/2025 7:51 PM, Felix Kuehling wrote:
I have lots of detailed comments inline, most of which were just made as I was reading through the patch. But I want to put my conclusions about the overall scheme up top:

I think with your logic, you're replacing thrashing with "sloshing". Maybe you could call that an improvement, but it's far from ideal.

I also think that your logic cannot reliably detect thrashing between two devices because one of those devices will always have slightly more than 1/2 of the page faults. So one of the devices will always be allowed to migrate data, although it would be better to leave it in one place that's good for both devices.
As said above, the DMR decides whether to stop data migration. If the DMR is higher than the threshold, there is no migration. If the driver additionally finds a hot device, it favors that device and all other devices access the data remotely. So there is no data migration once thrashing is detected: the data is parked either at its current location or at the hot device.

Your logic in svm_range_restore_pages is also not helping. If you had a case of true thrashing and detected it correctly by not designating a hot device, things would continue to migrate unchanged. So I'm really doubtful that this scheme, as is, would meaningfully improve the migration logic in the most problematic cases.
The hot device does not decide whether to stop migration. It is only determined after the DMR is found to be too high (a hot device may or may not exist). The DMR decides when to stop/resume data migration.

On the other hand, I could see regressions on real workloads. Imagine a range that's bigger than num_pf_samples*migration_granularity. The range is created on the GPU first by e.g. hipMallocManaged. Then the CPU is used to initialize it, that migrates it to the CPU and makes the CPU the hot device. Now the GPU starts the computation, but because the CPU is hot, the first few page faults won't migrate the data to the GPU. That will cause performance regressions.

The hot device is not decided by a few page faults. data_migrate_rate_threshold is the threshold, which can also be changed through debugfs; the DMR is compared against it to decide whether the migration rate is too high. Until the driver considers the DMR too high, it still uses the current migration scheme that migrates data to the GPU that accesses it. A hot device can exist only after the driver detects that the DMR is too high.

As discussed above, the DMR and the hot device are decided dynamically as time evolves. In this case the driver considers the CPU the hot device after multiple CPU page faults, so the first few GPU page faults are handled with remote access from the GPUs to the CPU. As time goes on, the patch calculates a new DMR and decides a new hot device on each new page fault. Once accesses from the GPUs dominate, the DMR is above the threshold, the hot device is no longer the CPU but a GPU, and the data is moved to that GPU.

This scheme looks at "trend" of data accessing with time evolving. If devices access data remotely for a few page faults it would not affect performance too much overall. We cannot predict accurately which device will access data in future. This patch predicts data accessing pattern based on previous data accessing samples.


Unless I see some data showing an improvement in some realistic or contrived scenarios, and negligible performance regressions in real workloads, I'm not going to approve this patch.

I did not find a real application that demonstrates bad data access patterns between CPU<->GPU or GPU<->GPU, so I made test cases through kfdtest:

1: Trigger data that keeps moving between the CPU and one GPU: the driver calculates the DMR and decides the hot device. Once the DMR is too high it stops migration, and all following page faults access the data remotely.

2: Trigger data that keeps moving between two GPUs: same as above, the intensive data migration rate stops migration and switches to remote access. Then the test sleeps for some time so that no page faults happen. Then it triggers page faults again. Now the DMR is smaller because the time interval is longer while the migrated data amount is unchanged, so data migration happens again and devices access the data locally. As time goes on the DMR becomes high again, migration is stopped, and devices access the data remotely again.
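
To put rough numbers on this (made-up values, with the default data_migrate_rate_threshold of 512 in 2MB units, i.e. about 1GB/s):

    20 samples, 512MB migrated within 0.1s -> DMR ~ 5GB/s   > 1GB/s -> stop migration, remote access
    test sleeps 2s, then faults again; the same 512MB is now spread over ~2s
                                           -> DMR ~ 0.25GB/s < 1GB/s -> migration resumes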

    The test repeats this procedure to simulate the page fault (data access) frequency going high->low->high..., so data migration goes stop->resume->stop... It demonstrates how the driver handles data accesses locally or remotely based on the runtime scenario: it does not migrate so much data that the application could hang, and it keeps accesses local when not much data is being migrated. data_migration_rate_threshold is used to adjust this new migration behavior.
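
Roughly, the test case looks like the loop below (illustrative pseudo-code; touch_range_from_gpu() and the constants are made-up names, not actual kfdtest APIs):

    for (round = 0; round < NUM_ROUNDS; round++) {
        /* phase 1: fault the same range alternately from both GPUs so the
         * migration rate climbs above data_migrate_rate_threshold and the
         * driver switches to remote access
         */
        for (i = 0; i < FAULTS_PER_PHASE; i++) {
            touch_range_from_gpu(gpu0, buf, size);
            touch_range_from_gpu(gpu1, buf, size);
        }

        /* phase 2: go quiet; with no new faults the rate measured over the
         * sample window drops back below the threshold and migration resumes
         */
        sleep(QUIET_SECONDS);
    }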

Regards

Xiaogang


More comments inline ...


On 2025-10-20 17:24, Xiaogang.Chen wrote:
From: Xiaogang Chen <[email protected]>

Data migration between CPU and GPU or between GPUs is a common operation in kfd/svm during page fault recovery. It adds overhead and consumes bus bandwidth, so it becomes crucial for compute performance.

There are two objectives that are mostly considered for data accesses:

1: Locality:  move data to the compute device's local memory, which is fast.
2: Zero copy: do not move data; compute devices access it remotely.

They usually conflict: local access requires a data copy, while avoiding the
copy means devices access the data remotely.

The new scheme balances the two access patterns by using statistical samples
that the driver records from previous page faults to predict the access pattern
of future page faults, then uses the prediction to handle new page faults.

The scheme is also adjusted dynamically by following the application's data
access pattern over time. Each new decision is based on the most recent
page fault handling data, which are saved in a ring buffer.

Signed-off-by: Xiaogang Chen <[email protected]>
---
  drivers/gpu/drm/amd/amdkfd/kfd_debugfs.c | 102 ++++++++++-
  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 212 ++++++++++++++++++++++-
  drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |  12 +-
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c     | 151 +++++++++++++---
  drivers/gpu/drm/amd/amdkfd/kfd_svm.h     |  30 ++++
  5 files changed, 473 insertions(+), 34 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debugfs.c b/drivers/gpu/drm/amd/amdkfd/kfd_debugfs.c
index 9bde2c64540f..d495af31fa3c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_debugfs.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_debugfs.c
@@ -38,6 +38,9 @@ struct debugfs_proc_entry {
    #define MAX_DEBUGFS_FILENAME_LEN 32
  +extern uint32_t num_pf_sample;
+extern uint32_t data_migrate_rate_threshold;
+
  static int kfd_debugfs_open(struct inode *inode, struct file *file)
  {
      int (*show)(struct seq_file *, void *) = inode->i_private;
@@ -50,6 +53,80 @@ static int kfd_debugfs_hang_hws_read(struct seq_file *m, void *data)
      return 0;
  }
  +static int kfd_debugfs_num_pf_samples_read(struct seq_file *m, void *data)
+{
+    seq_printf(m, "number of page fault samples per svm range: %d\n",
+               num_pf_sample);
+    seq_puts(m, "echo 'num' > num_pf_sample to change it\n");
+    return 0;
+}
+
+static int kfd_debugfs_data_migration_rate_threshold(struct seq_file *m,
+                              void *data)
+{
+    seq_printf(m, "data migration rate threshold in 2MB unit: %d\n",
+               data_migrate_rate_threshold);
+    seq_puts(m, "echo 'num' > data_migrate_rate_threshold to change it\n");
+    return 0;
+}
+
+static ssize_t kfd_debugfs_num_pf_samples_write(struct file *file,
+    const char __user *user_buf, size_t size, loff_t *ppos)
+{
+    char tmp[16];
+    int ret = -EINVAL;
+
+    memset(tmp, 0, 16);
+    if (size >= 16) {
+        pr_err("Invalid input for num of page fault samples\n");
+        goto out;
+    }
+
+    if (copy_from_user(tmp, user_buf, size)) {
+        ret = -EFAULT;
+        goto out;
+    }
+
+    if (kstrtoint(tmp, 10, &num_pf_sample)) {
+        pr_err("Invalid input for num of page fault samples.\n");
+        goto out;
+    }
+
+    ret = size;
+
+out:
+    return ret;
+}
+
+static ssize_t kfd_debugfs_dmrt_write(struct file *file,
+    const char __user *user_buf, size_t size, loff_t *ppos)
+{
+    char tmp[16];
+    int ret = -EINVAL;
+
+    memset(tmp, 0, 16);
+    if (size >= 16) {
+        pr_err("Invalid input for num of page fault samples\n");
+        goto out;
+    }
+
+    if (copy_from_user(tmp, user_buf, size)) {
+        ret = -EFAULT;
+        goto out;
+    }
+
+    if (kstrtoint(tmp, 10, &data_migrate_rate_threshold)) {
+        pr_err("Invalid input for num of page fault samples.\n");
+        goto out;
+    }
+
+    ret = size;
+
+out:
+    return ret;
+}
+
+
  static ssize_t kfd_debugfs_hang_hws_write(struct file *file,
      const char __user *user_buf, size_t size, loff_t *ppos)
  {
@@ -99,6 +176,24 @@ static const struct file_operations kfd_debugfs_hang_hws_fops = {
      .release = single_release,
  };
  +static const struct file_operations kfd_debugfs_num_pf_samples_fops = {
+    .owner = THIS_MODULE,
+    .open = kfd_debugfs_open,
+    .read = seq_read,
+    .write = kfd_debugfs_num_pf_samples_write,
+    .llseek = seq_lseek,
+    .release = single_release,
+};
+
+static const struct file_operations kfd_debugfs_dmrt_fops = {
+    .owner = THIS_MODULE,
+    .open = kfd_debugfs_open,
+    .read = seq_read,
+    .write = kfd_debugfs_dmrt_write,
+    .llseek = seq_lseek,
+    .release = single_release,
+};
+
  void kfd_debugfs_init(void)
  {
      debugfs_root = debugfs_create_dir("kfd", NULL);
@@ -111,10 +206,15 @@ void kfd_debugfs_init(void)
                  kfd_debugfs_hqds_by_device, &kfd_debugfs_fops);
      debugfs_create_file("rls", S_IFREG | 0444, debugfs_root,
                  kfd_debugfs_rls_by_device, &kfd_debugfs_fops);
-    debugfs_create_file("hang_hws", S_IFREG | 0200, debugfs_root,
+    debugfs_create_file("hang_hws", S_IFREG | 0600, debugfs_root,
                  kfd_debugfs_hang_hws_read, &kfd_debugfs_hang_hws_fops);
      debugfs_create_file("mem_limit", S_IFREG | 0200, debugfs_root,
                  kfd_debugfs_kfd_mem_limits, &kfd_debugfs_fops);
+    debugfs_create_file("num_pf_samples", S_IFREG | 0644, debugfs_root,
+             kfd_debugfs_num_pf_samples_read, &kfd_debugfs_num_pf_samples_fops);
+    debugfs_create_file("data_migration_rate_threshold", S_IFREG | 0644,
+             debugfs_root, kfd_debugfs_data_migration_rate_threshold,
+             &kfd_debugfs_dmrt_fops);
  }
    void kfd_debugfs_fini(void)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 5d7eb0234002..8c9ffb258062 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -39,6 +39,16 @@
  #endif
  #define dev_fmt(fmt) "kfd_migrate: " fmt
  +/* threshold of data migration rate(per second, DMR) per svm range
+ * in 2MB unit, so 512 is 1GB */
+uint32_t data_migrate_rate_threshold = 512;

These global variables should have a proper prefix. svm_... to match the conventions in this file.


+
+/* nubmer of most recent page fault related data driver keeps per svm range
+ * driver uses them to check how data got accessed after page fault recovery
+ * then decide migration scheme for new page fault recoveries.
+ */
+uint32_t num_pf_sample = 20;
+
  static uint64_t
  svm_migrate_direct_mapping_addr(struct amdgpu_device *adev, uint64_t addr)
  {
@@ -476,7 +486,8 @@ svm_migrate_vma_to_vram(struct kfd_node *node, struct svm_range *prange,
  static int
  svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc,
              unsigned long start_mgr, unsigned long last_mgr,
-            struct mm_struct *mm, uint32_t trigger)
+            struct mm_struct *mm, uint32_t trigger,
+            unsigned long* migration_amount)
  {
      unsigned long addr, start, end;
      struct vm_area_struct *vma;
@@ -554,6 +565,10 @@ svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc,
                      prange->npages * PAGE_SIZE,
                      KFD_IOC_ALLOC_MEM_FLAGS_VRAM,
                      node->xcp ? node->xcp->id : 0);
+
+    if (migration_amount)
+        *migration_amount += mpages;
+
      return r < 0 ? r : 0;
  }
  @@ -778,6 +793,8 @@ svm_migrate_vma_to_ram(struct kfd_node *node, struct svm_range *prange,
   * @last_mgr: last page need be migrated to sys ram
   * @trigger: reason of migration
   * @fault_page: is from vmf->page, svm_migrate_to_ram(), this is CPU page fault callback
+ * @migration_amount: return number of pages got migrated after call
+ *

Extra blank line.


   *
   * Context: Process context, caller hold mmap read lock, prange->migrate_mutex
   *
@@ -786,7 +803,8 @@ svm_migrate_vma_to_ram(struct kfd_node *node, struct svm_range *prange,
   */
  int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm,
                  unsigned long start_mgr, unsigned long last_mgr,
-                uint32_t trigger, struct page *fault_page)
+                uint32_t trigger, struct page *fault_page,
+                unsigned long* migration_amount)
  {
      struct kfd_node *node;
      struct vm_area_struct *vma;
@@ -856,6 +874,9 @@ int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm,
              prange->actual_loc = 0;
              svm_range_vram_node_free(prange);
          }
+
+        if (migration_amount)
+            *migration_amount += mpages;
      }
        return r < 0 ? r : 0;
@@ -881,7 +902,8 @@ int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm,
  static int
  svm_migrate_vram_to_vram(struct svm_range *prange, uint32_t best_loc,
              unsigned long start, unsigned long last,
-            struct mm_struct *mm, uint32_t trigger)
+            struct mm_struct *mm, uint32_t trigger,
+            unsigned long* migration_amount)
  {
      int r, retries = 3;
@@ -893,8 +915,11 @@ svm_migrate_vram_to_vram(struct svm_range *prange, uint32_t best_loc,
      pr_debug("from gpu 0x%x to gpu 0x%x\n", prange->actual_loc, best_loc);
        do {
+        if (migration_amount)
+            *migration_amount = 0;
+
          r = svm_migrate_vram_to_ram(prange, mm, prange->start, prange->last,
-                        trigger, NULL);
+                        trigger, NULL, migration_amount);
          if (r)
              return r;
      } while (prange->actual_loc && --retries);
@@ -902,22 +927,179 @@ svm_migrate_vram_to_vram(struct svm_range *prange, uint32_t best_loc,
      if (prange->actual_loc)
          return -EDEADLK;
-    return svm_migrate_ram_to_vram(prange, best_loc, start, last, mm, trigger);
+    return svm_migrate_ram_to_vram(prange, best_loc, start, last, mm, trigger,
+                       migration_amount);
  }
    int
  svm_migrate_to_vram(struct svm_range *prange, uint32_t best_loc,
              unsigned long start, unsigned long last,
-            struct mm_struct *mm, uint32_t trigger)
+            struct mm_struct *mm, uint32_t trigger,
+            unsigned long* migration_amount)
  {
+    if (migration_amount)
+        *migration_amount = 0;
+
      if  (!prange->actual_loc || prange->actual_loc == best_loc)
          return svm_migrate_ram_to_vram(prange, best_loc, start, last,
-                           mm, trigger);
+                           mm, trigger, migration_amount);
        else
          return svm_migrate_vram_to_vram(prange, best_loc, start, last,
-                        mm, trigger);
+                        mm, trigger, migration_amount);
+

Extra blank line.


+}
+
+/**
+ * @prange: range structure
+ * @start: start addr in page of page fault triggred migration
+ * @migration_amount:  migration data amount in 2MB unit
+ * @src:   from which page fault happened
+ * @migration_or_remote: migrate data or remote access
+ *
+ * update svm_range->page_fault_data after some pages got
+ * migrated
+ *
+ * need be under svm_range->migrate_mutex protection
+ */
+
+void update_pf_migration_data(struct svm_range* prange, int32_t pf_src,
+                bool migration_or_remote, unsigned long migration_amount)

Missing name prefix.


+{
+    uint8_t pf_data_index = prange->pf_data_index & 0x7f;
+
+    lockdep_assert_held(&prange->migrate_mutex);
+
+    prange->page_fault_data[pf_data_index].ktime = ktime_get();
+    prange->page_fault_data[pf_data_index].migration_info =
+            MIGRATIOM_INFO(migration_or_remote, pf_src, migration_amount);

MIGRATION_INFO can only store 16 bits of migration_amount (up to 64K * 2MB = 128GB), but you're passing in an unsigned long that can be bigger than that. If that ever happens, you'll overwrite the "device" field. You need a range check here.


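For illustration, a simple clamp before packing would avoid the overflow (sketch only, reusing the macro name from the patch):

    /* amount field is 16 bits (2MB units); clamp so it cannot spill into
     * the device bits
     */
    migration_amount = min_t(unsigned long, migration_amount, 0xffff);
    prange->page_fault_data[pf_data_index].migration_info =
            MIGRATIOM_INFO(migration_or_remote, pf_src, migration_amount);
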
+
+    /* check if page_fault_data buf is full */
+    if (!(prange->pf_data_index & 0x80) &&
+        (prange->pf_data_index + 1 >= num_pf_sample))
+            prange->pf_data_index |= 0x80;
+
+    prange->pf_data_index = (prange->pf_data_index & 0x80) |
+                (((prange->pf_data_index & 0x7f) + 1) % num_pf_sample);

Num_pf_samples can be changed through debugfs after the page_fault_data of a range was allocated. That can lead to out-of-bounds accesses. You probably need to store the allocated page_fault_data size in the prange or some other place that can't be changed at runtime to be safe.


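A minimal sketch of that (the num_pf_samples field name here is made up):

    /* svm_range_new(): snapshot the module-wide setting once ... */
    prange->num_pf_samples = num_pf_sample;
    prange->page_fault_data = kvmalloc_array(prange->num_pf_samples,
                         sizeof(struct svm_pf_info), GFP_KERNEL);

    /* ... and use prange->num_pf_samples instead of the global everywhere
     * the ring buffer is indexed
     */
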
+}
+
+/**
+ * @prange: range structure
+ * @hot_device: device who accesses data(by migration or remote)
+ *            mostly

This is an output parameter and should be documented as such.


+ *
+ * Return:
+ *
+ * false: data migration rate is not higher than
+ * DATA_MIGRATE_RATE_THRESHOLD
+ *
+ * true: data migration rate is higher than
+ * DATA_MIGRATE_RATE_THRESHOLD
+ *
+ * return value will be used to decide migration scheme in
+ * following operations.
+ */
+bool processing_pf_migration_data(struct svm_range* prange,
+                   uint32_t* hot_device)

Missing name prefix.


+{
+    uint8_t start_index, last_index, next_index, device_idx;
+    bool ret = false;
+    uint32_t i;
  +    /* in nanosecond */
+    ktime_t time_last, time_first;
+    s64 time_interval;
+
+    /* in 2MB unit */
+    /* data for cpu is saved at MAX_GPU_INSTANCE-th entry */
+    uint32_t migration_amount_device[MAX_GPU_INSTANCE + 1] = {0};
+    uint32_t remote_amount_device[MAX_GPU_INSTANCE + 1] = {0};
+
+    uint64_t total_migration_amount = 0;
+    uint64_t total_remote_amount = 0;
+
+    /* page fault data buf is not full */
+    if (!(prange->pf_data_index & 0x80))
+        return false;

You probably don't need this check. If your buffer was not filled, the time_first will be 0 and your time interval will be very long. So you'll be below the migration rate threshold. Not treating the high-bit special could simplify your handling of the pf_data_index everywhere.


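A sketch of that simplification (pf_data_count and num_pf_samples are made-up per-range fields replacing the tagged pf_data_index):

    uint32_t idx = prange->pf_data_count % prange->num_pf_samples;

    prange->page_fault_data[idx].ktime = ktime_get();
    prange->page_fault_data[idx].migration_info =
            MIGRATIOM_INFO(migration_or_remote, pf_src, migration_amount);
    prange->pf_data_count++;

    /* unfilled entries keep ktime == 0, so the measured interval is huge and
     * the computed rate stays below the threshold until the buffer wraps
     */
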
+
+    for (i = 0; i < num_pf_sample; i++) {
+
+        bool is_migrartion =
+ MIGRATIOM_OR_MAP(prange->page_fault_data[i].migration_info);
+
+        uint16_t migration_or_map_amount =
+ MIGRATIOM_AMOUNT(prange->page_fault_data[i].migration_info);
+
+        device_idx =
+ PF_DEVICE(prange->page_fault_data[i].migration_info);
+        /* convert to array index */
+        device_idx = (device_idx != CPU_DEVICE) ? device_idx : MAX_GPU_INSTANCE;
+
+        if (is_migrartion) {
+            migration_amount_device[device_idx] += migration_or_map_amount;
+            total_migration_amount += migration_or_map_amount;
+        } else {
+            remote_amount_device[device_idx] += migration_or_map_amount;
+            total_remote_amount += migration_or_map_amount;
+        }
+
+    }
+
+    *hot_device = INVALID_COMPUTE_DEVICE;
+
+    /* prange->page_fault_data's next index is prange->pf_data_index
+     * it is cycular buff. get start and last entries to cal time(in nanosec)
+     * during which these page faults happened
+     */
+    next_index = prange->pf_data_index & 0x7f;
+    last_index  = (next_index == 0) ? (num_pf_sample - 1) : (next_index - 1);
+    start_index = (next_index == 0) ? 0 : next_index;

This is just next_index.


+
+    time_last  = prange->page_fault_data[last_index].ktime;
+    time_first = prange->page_fault_data[start_index].ktime;
+
+    time_interval = ktime_to_ns(ktime_sub(time_last, time_first));
+
+    /* should not happend as ktime_get is monotonically increasing */
+    if (time_interval < 0) {
+        pr_warn("Negative time difference detected\n");
+        return  0;
+    }
+
+    /* page fault is very intensive, had NUM_PF_SAMPLE page faults
+     * in less 1 nanosecond, treat that as 1 nanosecond
+     */
+    if (time_interval == 0)
+        time_interval = 1;
+
+    /* check migration data rate first*/
+    if (total_migration_amount * NSEC_PER_SEC >
+        data_migrate_rate_threshold * time_interval) {
+
+        /* check if there is a significant device */
+        for (i = 0; i <= MAX_GPU_INSTANCE; i++) {
+            if (migration_amount_device[i] > total_migration_amount / 2) {
+                *hot_device = i;
+                break;
+            }
+        }
+
+        ret = true;
+
+    } else if (total_remote_amount * NSEC_PER_SEC >
+        data_migrate_rate_threshold * time_interval) {
+
+        for (i = 0; i <= MAX_GPU_INSTANCE; i++) {
+            if (migration_amount_device[i] > total_migration_amount / 2) {

Did you mean to check remote_amount_device and total_remote_amount here?

I'm not sure why you're treating those separately anyway. I'm trying to imagine what will happen. E.g. device 1 has a lot of page faults, migrates data and becomes the hot_device. Now device 2 starts having lots of page faults. Because it's not hot and below the threshold at first, it is not migrating and is counted in remote_amount. As long as device 1 keeps faulting enough that it stays above the threshold, device 2 will keep accessing data remotely, even if its rate of page faults is much higher than device 1. Eventually it becomes hot and starts migrating. Now device 1 experiences lots of page faults and so the cycle repeats.

I think with this logic, you're replacing thrashing with "sloshing". Maybe you could call that an improvement, but it's far from ideal.


+                *hot_device = i;
+                break;
+            }
+        }
+
+        ret = true;
+    }
+
+    return ret;
  }
    /**
@@ -932,6 +1114,7 @@ svm_migrate_to_vram(struct svm_range *prange, uint32_t best_loc,
   */
  static vm_fault_t svm_migrate_to_ram(struct vm_fault *vmf)
  {
+    unsigned long migration_amount_pages = 0;
      unsigned long start, last, size;
      unsigned long addr = vmf->address;
      struct svm_range_bo *svm_bo;
@@ -989,10 +1172,21 @@ static vm_fault_t svm_migrate_to_ram(struct vm_fault *vmf)
      last = min(ALIGN(addr + 1, size) - 1, prange->last);
        r = svm_migrate_vram_to_ram(prange, vmf->vma->vm_mm, start, last,
-                    KFD_MIGRATE_TRIGGER_PAGEFAULT_CPU, vmf->page);
+                    KFD_MIGRATE_TRIGGER_PAGEFAULT_CPU, vmf->page,
+                    &migration_amount_pages);
+
      if (r)
          pr_debug("failed %d migrate svms 0x%p range 0x%p [0x%lx 0x%lx]\n",
              r, prange->svms, prange, start, last);
+    else {

The braces are not needed here.


+        /* When cpu page fault happen always migrate data to sys ram
+         * cpu cannot access device ram remotely, record page fault data
+         * migration amount is in 2MB unit, migration_amount_pages is
+         * in page. If it is smaller than 2MB, choose 2MB
+         */
+        update_pf_migration_data(prange, CPU_DEVICE, true,
+                    ALIGN(migration_amount_pages, 1UL << 9) >> 9);
+    }
    out_unlock_prange:
      mutex_unlock(&prange->migrate_mutex);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
index 2eebf67f9c2c..160d4c1bd07f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
@@ -42,15 +42,23 @@ enum MIGRATION_COPY_DIR {
    int svm_migrate_to_vram(struct svm_range *prange,  uint32_t best_loc,
              unsigned long start, unsigned long last,
-            struct mm_struct *mm, uint32_t trigger);
+            struct mm_struct *mm, uint32_t trigger,
+            unsigned long *migration_amount);
    int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm,
                  unsigned long start, unsigned long last,
-                uint32_t trigger, struct page *fault_page);
+                uint32_t trigger, struct page *fault_page,
+                unsigned long *migration_amount);
    unsigned long
  svm_migrate_addr_to_pfn(struct amdgpu_device *adev, unsigned long addr);

+void update_pf_migration_data(struct svm_range *prange, int32_t pf_src,
+                   bool migration_or_remote, unsigned long migration_amount);
+
+bool processing_pf_migration_data(struct svm_range *prange,
+                   uint32_t *hot_device);
+
  #endif /* IS_ENABLED(CONFIG_HSA_AMD_SVM) */
    #endif /* KFD_MIGRATE_H_ */
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 521c14c7a789..c2843836661e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -63,6 +63,8 @@
   */
  static uint64_t max_svm_range_pages;
  +extern uint32_t num_pf_sample;
+
  struct criu_svm_metadata {
      struct list_head list;
      struct kfd_criu_svm_range_priv_data data;
@@ -303,6 +305,9 @@ static void svm_range_free(struct svm_range *prange, bool do_unmap)
          }
      }
  +    kvfree(prange->page_fault_data);
+    prange->page_fault_data = NULL;
+
      mutex_destroy(&prange->lock);
      mutex_destroy(&prange->migrate_mutex);
      kfree(prange);
@@ -355,6 +360,14 @@ svm_range *svm_range_new(struct svm_range_list *svms, uint64_t start,
      mutex_init(&prange->migrate_mutex);
      mutex_init(&prange->lock);
  +    prange->pf_data_index = 0;
+    prange->page_fault_data = kvmalloc_array(num_pf_sample,
+                         sizeof(struct svm_pf_info), GFP_KERNEL);
+    for (int i = 0; i < num_pf_sample; i++) {
+        prange->page_fault_data[i].ktime = 0;
+        prange->page_fault_data[i].migration_info = 0 | INVALID_COMPUTE_DEVICE;

This is just INVALID_COMPUTE_DEVICE.


+    }
+
      if (p->xnack_enabled)
          bitmap_copy(prange->bitmap_access, svms->bitmap_supported,
                  MAX_GPU_INSTANCE);
@@ -1068,9 +1081,27 @@ svm_range_split_adjust(struct svm_range *new, struct svm_range *old,
              return r;
      }
  +    for (int i = 0; i < num_pf_sample; i++) {
+        uint16_t old_amount =
+ MIGRATIOM_AMOUNT(old->page_fault_data[i].migration_info);
+        unsigned long temp = (new->last - new->start + 1) * old_amount;
+        uint16_t new_amount = temp / (old->last - old->start + 1);
+
+        /* proportionally devide data amount betwee new and old pranges */
+        new->page_fault_data[i].migration_info =
+            (new->page_fault_data[i].migration_info & (~0xffff)) | new_amount;
+
+        old->page_fault_data[i].migration_info =
+            (old->page_fault_data[i].migration_info & (~0xffff)) |
+            (old_amount - new_amount);
+
+        new->page_fault_data[i].ktime = old->page_fault_data[i].ktime;
+    }
+
      old->npages = last - start + 1;
      old->start = start;
      old->last = last;
+    new->pf_data_index = old->pf_data_index;
      new->flags = old->flags;
      new->preferred_loc = old->preferred_loc;
      new->prefetch_loc = old->prefetch_loc;
@@ -2973,6 +3004,8 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
              uint32_t vmid, uint32_t node_id,
              uint64_t addr, uint64_t ts, bool write_fault)
  {
+    unsigned long migration_amount_pages;
+    bool migration_or_remote_high;
      unsigned long start, last, size;
      struct mm_struct *mm = NULL;
      struct svm_range_list *svms;
@@ -2985,6 +3018,7 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
      bool write_locked = false;
      struct vm_area_struct *vma;
      bool migration = false;
+    uint32_t hot_device;
      int r = 0;
        if (!KFD_IS_SVM_API_SUPPORTED(adev)) {
@@ -3135,33 +3169,103 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
      size = 1UL << prange->granularity;
      start = max_t(unsigned long, ALIGN_DOWN(addr, size), prange->start);
      last = min_t(unsigned long, ALIGN(addr + 1, size) - 1, prange->last);
-    if (prange->actual_loc != 0 || best_loc != 0) {
-        if (best_loc) {
-            r = svm_migrate_to_vram(prange, best_loc, start, last,
-                    mm, KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU);
+
+    /* check previous page fault data to decide how to migrate data */
+    hot_device = INVALID_COMPUTE_DEVICE;
+    migration_or_remote_high =
+            processing_pf_migration_data(prange, &hot_device);
+
+    /* migrated or remote accessing data rate is not high
+     * use automatic migration scheme
+     */
+    if (!migration_or_remote_high) {
+        migration_amount_pages = 0;
+
+        if (prange->actual_loc != 0 || best_loc != 0) {
+
+            if (best_loc) {
+                r = svm_migrate_to_vram(prange, best_loc, start, last,
+                        mm, KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU,
+                        &migration_amount_pages);
+                if (r) {
+                    pr_debug("svm_migrate_to_vram failed (%d) at %llx,"
+                        " falling back to system memory\n",
+                         r, addr);
+                    /* Fallback to system memory if migration to
+                     * VRAM failed
+                     */
+                    if (prange->actual_loc && prange->actual_loc != best_loc)
+                        r = svm_migrate_vram_to_ram(prange, mm, start, last,
+                            KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU, NULL, NULL);
+                    else
+                        r = 0;
+                }
+
+            } else {
+                r = svm_migrate_vram_to_ram(prange, mm, start, last,
+                        KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU, NULL,
+                        &migration_amount_pages);
+            }
              if (r) {
-                pr_debug("svm_migrate_to_vram failed (%d) at %llx, falling back to system memory\n",
-                     r, addr);
-                /* Fallback to system memory if migration to
-                 * VRAM failed
+                pr_debug("failed %d to migrate svms %p [0x%lx 0x%lx]\n",
+                     r, svms, start, last);
+                goto out_migrate_fail;
+            } else {
+                migration = true;
+                /* migration amount is in 2MB unit.
+                 * if it is smaller than 2MB, use 2MB.
                   */
-                if (prange->actual_loc && prange->actual_loc != best_loc)
-                    r = svm_migrate_vram_to_ram(prange, mm, start, last,
-                        KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU, NULL);
-                else
-                    r = 0;
+                update_pf_migration_data(prange, gpuidx, true,
+                            ALIGN(migration_amount_pages, 1UL << 9) >> 9);
              }
-        } else {
-            r = svm_migrate_vram_to_ram(prange, mm, start, last,
-                    KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU, NULL);
          }
+    } else if (hot_device == INVALID_COMPUTE_DEVICE) {
+        migration = false;
+        /* there is no significant device, use AIP */
+        /* no data migration for page recovery
+         * the gpu(gpuidx) will access data remotely after map
+         * count data amount that this gpu remotely access
+         */
+        update_pf_migration_data(prange, gpuidx, false,
+                     ALIGN((last - start + 1), 1UL << 9) >> 9);
+
+    } else if (gpuidx == hot_device) {
+        /* gpuidx(where page fault happened) is same as hot_device
+         * move data to gpuidx gpu by svm_migrate_to_vram
+         * it either moves prange to system ram, then 2MB to gpuidx
+         * or move 2MB from sys ram to gpuidx vram
+         */
+        r = svm_migrate_to_vram(prange, gpuid, start, last,
+                    mm, KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU,
+                    &migration_amount_pages);
+        if (r) {
+            pr_debug("svm_migrate_to_vram failed (%d) at %llx, "
+                "falling back to system memory\n",
+                    r, addr);
+            /* Fallback to system memory if migration to
+             * VRAM failed
+             */
+            if (prange->actual_loc && prange->actual_loc != gpuidx)
+                r = svm_migrate_vram_to_ram(prange, mm, start, last,
+ KFD_MIGRATE_TRIGGER_PAGEFAULT_GPU, NULL, NULL);
+                else
+                    r = 0;
+        } else
+            update_pf_migration_data(prange, gpuidx, true,
+                ALIGN(migration_amount_pages, 1UL << 9) >> 9);

This looks like you duplicated a lot of migration and error handling code for different cases. It would be cleaner if you took the hot_device into account in svm_range_best_restore_location. Then the code that does the actual migration wouldn't need to change at all (other than adding calls to update_pf_migration_data).

You're also currently not considering whether the faulting GPU can remotely access the memory in question. svm_range_best_restore_location does consider this in its choice of the best location. Therefore I believe svm_range_best_restore_location is the right place for your policy logic for changing migration decisions based on the migration rates.


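One possible shape of that, as a sketch only (not the actual svm_range_best_restore_location() code, and the CPU hot-device case would still need to map to best_loc 0 for system memory):

    /* after computing best_loc the usual way */
    if (processing_pf_migration_data(prange, &hot_device)) {
        if (hot_device == INVALID_COMPUTE_DEVICE)
            best_loc = prange->actual_loc;  /* park the data where it is */
        else
            best_loc = hot_device;          /* favor the hot device */
    }
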
+
          if (r) {
              pr_debug("failed %d to migrate svms %p [0x%lx 0x%lx]\n",
-                 r, svms, start, last);
+                         r, svms, start, last);
              goto out_migrate_fail;
-        } else {
+        } else
              migration = true;
-        }
+
+    } else {
+        /* gpu gpuidx is not hot device, have it access data remotely */
+        update_pf_migration_data(prange, gpuidx, false,
+                     ALIGN((last - start + 1), 1UL << 9) >> 9);
+        migration = false;
      }
      r = svm_range_validate_and_map(mm, start, last, prange, gpuidx, false,
@@ -3535,14 +3639,16 @@ svm_range_trigger_migration(struct mm_struct *mm, struct svm_range *prange,
          return 0;
        if (!best_loc) {
+        /* not count migration amount for prefetch */
          r = svm_migrate_vram_to_ram(prange, mm, prange->start, prange->last,
-                    KFD_MIGRATE_TRIGGER_PREFETCH, NULL);
+                    KFD_MIGRATE_TRIGGER_PREFETCH, NULL, NULL);
          *migrated = !r;
          return r;
      }
  +    /* not count migration amount for prefetch */
      r = svm_migrate_to_vram(prange, best_loc, prange->start, prange->last,
-                mm, KFD_MIGRATE_TRIGGER_PREFETCH);
+                mm, KFD_MIGRATE_TRIGGER_PREFETCH, NULL);
      *migrated = !r;
        return 0;
@@ -3599,9 +3705,10 @@ static void svm_range_evict_svm_bo_worker(struct work_struct *work)
              /* migrate all vram pages in this prange to sys ram
               * after that prange->actual_loc should be zero
               */
+            /* not count migration amount for eviction now */
              r = svm_migrate_vram_to_ram(prange, mm,
                      prange->start, prange->last,
-                    KFD_MIGRATE_TRIGGER_TTM_EVICTION, NULL);
+                    KFD_MIGRATE_TRIGGER_TTM_EVICTION, NULL, NULL);
          } while (!r && prange->actual_loc && --retries);
            if (!r && prange->actual_loc)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
index 01c7a4877904..5e6f07eba0c6 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
@@ -39,6 +39,31 @@
  #define SVM_ADEV_PGMAP_OWNER(adev)\
              ((adev)->hive ? (void *)(adev)->hive : (void *)(adev))
  +/* indicator for cpu device */
+#define CPU_DEVICE 0xFF
+
+/* indicator for invalid device */
+#define INVALID_COMPUTE_DEVICE (1 << 30)
+
+#define GETMASK32(h, l) ((((uint32_t)(~0)) << (l)) & ((uint32_t)(~0) >> (32 - 1 - (h))))
+
+#define MIGRATIOM_OR_MAP(migration_info) (((migration_info) & GETMASK32(31, 31)) >> 31)
+#define PF_DEVICE(migration_info) (((migration_info) & GETMASK32(23, 16)) >> 16)
+#define MIGRATIOM_AMOUNT(migration_info) ((migration_info) & GETMASK32(15, 0))

This use of GET_MASK is not helping readability. It took me two minutes to figure out what this was doing. It's much more obvious and readable to just spell out your masks in hexadecimal. If you do the shift before the AND mask, it's even easier:

#define MIGRATION_OR_MAP(migration_info) (((migration_info) >> 31) & 0x1)
#define PF_DEVICE(migration_info) (((migration_info) >> 16) & 0xff)
#define MIGRATION_AMOUNT(migration_info) ((migration_info) & 0xffff)

Also, some typos: MIGRATIOM->MIGRATION.


+
+#define MIGRATIOM_INFO(flag, device, amount) (((flag) << 31) | ((device) << 16) | (amount))
+
+struct svm_pf_info {
+    /* kernel time when page fault happen */
+    ktime_t ktime;
+    /* enclosing info:
+     * 1: data manipulation for page fault recovery: local or remote access
+     * 2: device where page fault happened, use 0xff for cpu
+     * 3: data amount involved
+     */
+    uint32_t migration_info;

Since ktime is 64-bit, the structure size will get aligned to 64-bit anyway. So you may as well make the migration info 64-bit and give yourself more space. To simplify access to the fields, you could also define the elements as separate fields in the struct.


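For example, the per-sample record could be laid out with explicit fields (names here are just a suggestion):

    struct svm_pf_info {
        ktime_t  ktime;       /* when the page fault was handled */
        uint32_t device;      /* faulting device, CPU_DEVICE for the CPU */
        bool     migrated;    /* true: data migrated, false: remote access */
        uint64_t amount;      /* data amount in 2MB units, no 16-bit limit */
    };
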
+};
+
  struct svm_range_bo {
      struct amdgpu_bo        *bo;
      struct kref            kref;
@@ -138,6 +163,11 @@ struct svm_range {
      DECLARE_BITMAP(bitmap_aip, MAX_GPU_INSTANCE);
      bool                mapped_to_gpu;
      atomic_t            queue_refcount;
+    struct svm_pf_info *page_fault_data;
+    /* most significant bit indicates if page_fault_data is full
+     * so the max size of page_fault_data is 127
+     */
+    uint8_t pf_data_index;
  };
    static inline void svm_range_lock(struct svm_range *prange)
