Hey Oded,

Where can I find a repo with kfdtest?

I tried looking here but couldn't find it:

https://cgit.freedesktop.org/~gabbayo/

-Andres


On 2017-02-10 05:35 AM, Oded Gabbay wrote:
So the warning in dmesg is gone of course, but the test (that I
mentioned in previous email) still fails, and this time it caused the
kernel to crash. In addition, now other tests fail as well, e.g.
KFDEventTest.SignalEvent

I honestly suggest taking some time to debug this patch-set on an
actual Kaveri machine and then re-sending the patches.

Thanks,
Oded

log of crash from KFDQMTest.CreateMultipleCpQueues:

[  160.900137] kfd: qcm fence wait loop timeout expired
[  160.900143] kfd: the cp might be in an unrecoverable state due to
an unsuccessful queues preemption
[  160.916765] show_signal_msg: 36 callbacks suppressed
[  160.916771] kfdtest[2498]: segfault at 100007f8a ip
00007f8ae932ee5d sp 00007ffc52219cd0 error 4 in
libhsakmt-1.so.0.0.1[7f8ae932b000+8000]
[  163.152229] kfd: qcm fence wait loop timeout expired
[  163.152250] BUG: unable to handle kernel NULL pointer dereference
at 000000000000005a
[  163.152299] IP: kfd_get_process_device_data+0x6/0x30 [amdkfd]
[  163.152323] PGD 2333aa067
[  163.152323] PUD 230f64067
[  163.152335] PMD 0

[  163.152364] Oops: 0000 [#1] SMP
[  163.152379] Modules linked in: joydev edac_mce_amd edac_core
input_leds kvm_amd snd_hda_codec_realtek kvm irqbypass
snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_core
snd_hwdep pcbc snd_pcm aesni_intel snd_seq_midi snd_seq_midi_event
snd_rawmidi snd_seq aes_x86_64 crypto_simd snd_seq_device glue_helper
cryptd snd_timer snd fam15h_power k10temp soundcore i2c_piix4 shpchp
tpm_infineon mac_hid parport_pc ppdev nfsd auth_rpcgss nfs_acl lockd
lp grace sunrpc parport autofs4 hid_logitech_hidpp hid_logitech_dj
hid_generic usbhid hid uas usb_storage amdkfd amd_iommu_v2 radeon
i2c_algo_bit ttm drm_kms_helper syscopyarea ahci sysfillrect sysimgblt
libahci fb_sys_fops drm r8169 mii fjes video
[  163.152668] CPU: 3 PID: 2498 Comm: kfdtest Not tainted 4.10.0-rc5+ #3
[  163.152695] Hardware name: Gigabyte Technology Co., Ltd. To be
filled by O.E.M./F2A88XM-D3H, BIOS F5 01/09/2014
[  163.152735] task: ffff995e73d16580 task.stack: ffffb41144458000
[  163.152764] RIP: 0010:kfd_get_process_device_data+0x6/0x30 [amdkfd]
[  163.152790] RSP: 0018:ffffb4114445bab0 EFLAGS: 00010246
[  163.152812] RAX: ffffffffffffffea RBX: ffff995e75909c00 RCX: 0000000000000000
[  163.152841] RDX: 0000000000000000 RSI: ffffffffffffffea RDI: ffff995e75909600
[  163.152869] RBP: ffffb4114445bae0 R08: 00000000000252a5 R09: 0000000000000414
[  163.152898] R10: 0000000000000000 R11: ffffffffb412d38d R12: 00000000ffffffc2
[  163.152926] R13: 0000000000000000 R14: ffff995e75909ca8 R15: ffff995e75909c00
[  163.152956] FS:  00007f8ae975e740(0000) GS:ffff995e7ed80000(0000)
knlGS:0000000000000000
[  163.152988] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  163.153012] CR2: 000000000000005a CR3: 00000002216ab000 CR4: 00000000000406e0
[  163.153041] Call Trace:
[  163.153059]  ? destroy_queues_cpsch+0x166/0x190 [amdkfd]
[  163.153086]  execute_queues_cpsch+0x2e/0xc0 [amdkfd]
[  163.153113]  destroy_queue_cpsch+0xbd/0x140 [amdkfd]
[  163.153139]  pqm_destroy_queue+0x111/0x1d0 [amdkfd]
[  163.153164]  pqm_uninit+0x3f/0xb0 [amdkfd]
[  163.153186]  kfd_unbind_process_from_device+0x51/0xd0 [amdkfd]
[  163.153214]  iommu_pasid_shutdown_callback+0x20/0x30 [amdkfd]
[  163.153239]  mn_release+0x37/0x70 [amd_iommu_v2]
[  163.153261]  __mmu_notifier_release+0x44/0xc0
[  163.153281]  exit_mmap+0x15a/0x170
[  163.153297]  ? __wake_up+0x44/0x50
[  163.153314]  ? exit_robust_list+0x5c/0x110
[  163.153333]  mmput+0x57/0x140
[  163.153347]  do_exit+0x26b/0xb30
[  163.153362]  do_group_exit+0x43/0xb0
[  163.153379]  get_signal+0x293/0x620
[  163.153396]  do_signal+0x37/0x760
[  163.153411]  ? print_vma_addr+0x82/0x100
[  163.153429]  ? vprintk_default+0x29/0x50
[  163.153447]  ? bad_area+0x46/0x50
[  163.153463]  ? __do_page_fault+0x3c7/0x4e0
[  163.153481]  exit_to_usermode_loop+0x76/0xb0
[  163.153500]  prepare_exit_to_usermode+0x2f/0x40
[  163.153521]  retint_user+0x8/0x10
[  163.153536] RIP: 0033:0x7f8ae932ee5d
[  163.153551] RSP: 002b:00007ffc52219cd0 EFLAGS: 00010202
[  163.153573] RAX: 0000000000000003 RBX: 0000000100007f8a RCX: 00007ffc52219d00
[  163.153602] RDX: 00007f8ae9534220 RSI: 00007f8ae8b5eb28 RDI: 0000000100007f8a
[  163.153630] RBP: 00007ffc52219d20 R08: 0000000001cc1890 R09: 0000000000000000
[  163.153659] R10: 0000000000000027 R11: 00007f8ae932ee10 R12: 0000000001cc52a0
[  163.153687] R13: 00007ffc5221a200 R14: 0000000000000021 R15: 0000000000000000
[  163.153716] Code: e0 04 00 00 48 3b 91 f0 03 00 00 74 01 c3 55 48
89 e5 e8 2e f9 ff ff 5d c3 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
44 00 00 55 <48> 8b 46 70 48 83 c6 70 48 89 e5 48 39 f0 74 16 48 3b 78
10 75
[  163.153818] RIP: kfd_get_process_device_data+0x6/0x30 [amdkfd] RSP:
ffffb4114445bab0
[  163.153848] CR2: 000000000000005a
[  163.160389] ---[ end trace f6a8177c7119c1f5 ]---
[  163.160390] Fixing recursive fault but reboot is needed!

On Thu, Feb 9, 2017 at 10:38 PM, Andres Rodriguez <andre...@gmail.com> wrote:
Hey Oded,

Sorry to be a nuisance, but if you still have everything set up, could you
give this fix a quick go?

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 5321d18..9f70ee0 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -667,7 +667,7 @@ static int set_sched_resources(struct device_queue_manager *dqm)
                 /* This situation may be hit in the future if a new HW
                  * generation exposes more than 64 queues. If so, the
                  * definition of res.queue_mask needs updating */
-               if (WARN_ON(i > sizeof(res.queue_mask))) {
+               if (WARN_ON(i > (sizeof(res.queue_mask)*8))) {
                         pr_err("Invalid queue enabled by amdgpu: %d\n", i);
                         break;
                 }

John/Felix,

Any chance I could borrow a carrizo/kaveri for a few days? Or maybe you
could help me run some final tests on this patch series?

- Andres



On 2017-02-09 03:11 PM, Oded Gabbay wrote:
Andres,

I tried your patches on Kaveri with airlied's drm-next branch.
I used radeon+amdkfd

The following test failed: KFDQMTest.CreateMultipleCpQueues
However, I can't debug it because I don't have the sources of kfdtest.

In dmesg, I saw the following warning during boot:
WARNING: CPU: 0 PID: 150 at
drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c:670
start_cpsch+0xc5/0x220 [amdkfd]
[    4.393796] Modules linked in: hid_logitech_hidpp hid_logitech_dj
hid_generic usbhid hid uas usb_storage amdkfd amd_iommu_v2 radeon(+)
i2c_algo_bit ttm drm_kms_helper syscopyarea ahci sysfillrect sysimgblt
libahci fb_sys_fops drm r8169 mii fjes video
[    4.393811] CPU: 0 PID: 150 Comm: systemd-udevd Not tainted 4.10.0-rc5+
#1
[    4.393811] Hardware name: Gigabyte Technology Co., Ltd. To be
filled by O.E.M./F2A88XM-D3H, BIOS F5 01/09/2014
[    4.393812] Call Trace:
[    4.393818]  dump_stack+0x63/0x90
[    4.393822]  __warn+0xcb/0xf0
[    4.393823]  warn_slowpath_null+0x1d/0x20
[    4.393830]  start_cpsch+0xc5/0x220 [amdkfd]
[    4.393836]  ? initialize_cpsch+0xa0/0xb0 [amdkfd]
[    4.393841]  kgd2kfd_device_init+0x375/0x490 [amdkfd]
[    4.393883]  radeon_kfd_device_init+0xaf/0xd0 [radeon]
[    4.393911]  radeon_driver_load_kms+0x11e/0x1f0 [radeon]
[    4.393933]  drm_dev_register+0x14a/0x200 [drm]
[    4.393946]  drm_get_pci_dev+0x9d/0x160 [drm]
[    4.393974]  radeon_pci_probe+0xb8/0xe0 [radeon]
[    4.393976]  local_pci_probe+0x45/0xa0
[    4.393978]  pci_device_probe+0x103/0x150
[    4.393981]  driver_probe_device+0x2bf/0x460
[    4.393982]  __driver_attach+0xdf/0xf0
[    4.393984]  ? driver_probe_device+0x460/0x460
[    4.393985]  bus_for_each_dev+0x6c/0xc0
[    4.393987]  driver_attach+0x1e/0x20
[    4.393988]  bus_add_driver+0x1fd/0x270
[    4.393989]  ? 0xffffffffc05c8000
[    4.393991]  driver_register+0x60/0xe0
[    4.393992]  ? 0xffffffffc05c8000
[    4.393993]  __pci_register_driver+0x4c/0x50
[    4.394007]  drm_pci_init+0xeb/0x100 [drm]
[    4.394008]  ? 0xffffffffc05c8000
[    4.394031]  radeon_init+0x98/0xb6 [radeon]
[    4.394034]  do_one_initcall+0x53/0x1a0
[    4.394037]  ? __vunmap+0x81/0xd0
[    4.394039]  ? kmem_cache_alloc_trace+0x152/0x1c0
[    4.394041]  ? vfree+0x2e/0x70
[    4.394044]  do_init_module+0x5f/0x1ff
[    4.394046]  load_module+0x24cc/0x29f0
[    4.394047]  ? __symbol_put+0x60/0x60
[    4.394050]  ? security_kernel_post_read_file+0x6b/0x80
[    4.394052]  SYSC_finit_module+0xdf/0x110
[    4.394054]  SyS_finit_module+0xe/0x10
[    4.394056]  entry_SYSCALL_64_fastpath+0x1e/0xad
[    4.394058] RIP: 0033:0x7f9cda77c8e9
[    4.394059] RSP: 002b:00007ffe195d3378 EFLAGS: 00000246 ORIG_RAX:
0000000000000139
[    4.394060] RAX: ffffffffffffffda RBX: 00007f9cdb8dda7e RCX:
00007f9cda77c8e9
[    4.394061] RDX: 0000000000000000 RSI: 00007f9cdac7ce2a RDI:
0000000000000013
[    4.394062] RBP: 00007ffe195d2450 R08: 0000000000000000 R09:
0000000000000000
[    4.394063] R10: 0000000000000013 R11: 0000000000000246 R12:
00007ffe195d245a
[    4.394063] R13: 00007ffe195d1378 R14: 0000563f70cc93b0 R15:
0000563f70cba4d0
[    4.394091] ---[ end trace 9c5af17304d998bb ]---
[    4.394092] Invalid queue enabled by amdgpu: 9

I suggest you get a Kaveri/Carrizo machine to debug these issues.

Until then, I don't think we should merge this patch-set.

Oded

On Wed, Feb 8, 2017 at 9:47 PM, Andres Rodriguez <andre...@gmail.com> wrote:
Thank you Oded.

- Andres


On 2017-02-08 02:32 PM, Oded Gabbay wrote:
On Wed, Feb 8, 2017 at 6:23 PM, Andres Rodriguez <andre...@gmail.com> wrote:
Hey Felix,

Thanks for the pointer to the ROCm mqd commit. I like that the workarounds
are easy to spot. I'll add that to a new patch series I'm working on for
some bug-fixes for perf being lower on pipes other than pipe 0.

I haven't tested this yet on kaveri/carrizo. I'm hoping someone with the HW
will be able to give it a go. I put in a few small hacks to get KFD to boot
but do nothing on polaris10.

Regards,
Andres


On 2017-02-06 03:20 PM, Felix Kuehling wrote:
Hi Andres,

Thank you for tackling this task. It's more involved than I expected,
mostly because I didn't have much awareness of the MQD management in
amdgpu.

I made one comment in a separate message about the unified MQD commit
function, if you want to bring that more in line with our latest ROCm
release on github.

Also, were you able to test the upstream KFD with your changes on a
Kaveri or Carrizo?

Regards,
     Felix


On 17-02-03 11:51 PM, Andres Rodriguez wrote:
The current queue/pipe split policy is for amdgpu to take the first pipe of
MEC0 and leave the rest for amdkfd to use. This policy is taken as an
assumption in a few areas of the implementation.

This patch series aims to allow for flexible/tunable queue/pipe split
policies between kgd and kfd. It also updates the queue/pipe split policy to
one that allows better compute app concurrency for both drivers.

In the process some duplicate code and hardcoded constants were removed.

Any suggestions or feedback on improvements welcome.

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Hi Andres,
I will try to find some time to test it on my Kaveri machine.

Oded

