On 02/16/2017 03:00 PM, Bridgman, John wrote:
> Any objections to authorizing Oded to post the kfdtest binary he is using to
> some public place (if not there already) so others (like Andres) can test
> changes which touch on amdkfd?
>
> We should check it for embarrassing symbols but otherwise it should be OK.
Someone was up late for a deadline? lol

> That said, since we are getting perilously close to actually sending dGPU
> support changes upstream we will need (IMO) to maintain a sanitized source
> repo for kfdtest as well... sharing the binary just gets us started.

Hi John,

Yes, this is the sort of thing I've been referring to for some time now. We
definitely need some kind of centralized mechanism to test/validate kfd
stuff, so if you can get this out that would be great! A binary would be a
start; I am sure we can make do, and it is certainly better than nothing.
However, source, much like what happened with UMR, would of course be ideal.

Perhaps we could arrange some kind of IRC meeting regarding kfd, since there
seems to be a bit of fragmented effort here. I have my own ioctl()s locally
for pinning for my own project, which I am not sure are suitable to upstream
as-is, since AMD has its own take, so what should we do? I have heard so
much about dGPU support for a couple of years now, but have only seen bits
thrown over the wall. Can we begin a more serious incremental approach ASAP?

I created #amdkfd on freenode some time ago, where a couple of interested
academics and users hang out.

Kind Regards,
Edward.

> Thanks,
> John
>
>> -----Original Message-----
>> From: Oded Gabbay [mailto:oded.gab...@gmail.com]
>> Sent: Friday, February 10, 2017 12:57 PM
>> To: Andres Rodriguez
>> Cc: Kuehling, Felix; Bridgman, John; amd-gfx@lists.freedesktop.org;
>> Deucher, Alexander; Jay Cornwall
>> Subject: Re: Change queue/pipe split between amdkfd and amdgpu
>>
>> I don't have a repo, nor do I have the source code.
>> It is a tool that we developed inside AMD (when I was working there), and
>> after I left AMD I got permission to use the binary for regression testing.
>>
>> Oded
>>
>> On Fri, Feb 10, 2017 at 6:33 PM, Andres Rodriguez <andre...@gmail.com> wrote:
>>> Hey Oded,
>>>
>>> Where can I find a repo with kfdtest?
>>>
>>> I tried looking here but couldn't find it:
>>>
>>> https://cgit.freedesktop.org/~gabbayo/
>>>
>>> -Andres
>>>
>>> On 2017-02-10 05:35 AM, Oded Gabbay wrote:
>>>>
>>>> So the warning in dmesg is gone of course, but the test (that I
>>>> mentioned in a previous email) still fails, and this time it caused
>>>> the kernel to crash. In addition, other tests now fail as well, e.g.
>>>> KFDEventTest.SignalEvent.
>>>>
>>>> I honestly suggest taking some time to debug this patch set on an
>>>> actual Kaveri machine and then re-sending the patches.
>>>>
>>>> Thanks,
>>>> Oded
>>>>
>>>> Log of crash from KFDQMTest.CreateMultipleCpQueues:
>>>>
>>>> [ 160.900137] kfd: qcm fence wait loop timeout expired
>>>> [ 160.900143] kfd: the cp might be in an unrecoverable state due to an unsuccessful queues preemption
>>>> [ 160.916765] show_signal_msg: 36 callbacks suppressed
>>>> [ 160.916771] kfdtest[2498]: segfault at 100007f8a ip 00007f8ae932ee5d sp 00007ffc52219cd0 error 4 in libhsakmt-1.so.0.0.1[7f8ae932b000+8000]
>>>> [ 163.152229] kfd: qcm fence wait loop timeout expired
>>>> [ 163.152250] BUG: unable to handle kernel NULL pointer dereference at 000000000000005a
>>>> [ 163.152299] IP: kfd_get_process_device_data+0x6/0x30 [amdkfd]
>>>> [ 163.152323] PGD 2333aa067
>>>> [ 163.152323] PUD 230f64067
>>>> [ 163.152335] PMD 0
>>>> [ 163.152364] Oops: 0000 [#1] SMP
>>>> [ 163.152379] Modules linked in: joydev edac_mce_amd edac_core input_leds kvm_amd snd_hda_codec_realtek kvm irqbypass snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_core snd_hwdep pcbc snd_pcm aesni_intel snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq aes_x86_64 crypto_simd snd_seq_device glue_helper cryptd snd_timer snd fam15h_power k10temp soundcore i2c_piix4 shpchp tpm_infineon mac_hid parport_pc ppdev nfsd auth_rpcgss nfs_acl lockd lp grace sunrpc parport autofs4 hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid uas usb_storage amdkfd amd_iommu_v2 radeon i2c_algo_bit ttm drm_kms_helper syscopyarea ahci sysfillrect sysimgblt libahci fb_sys_fops drm r8169 mii fjes video
>>>> [ 163.152668] CPU: 3 PID: 2498 Comm: kfdtest Not tainted 4.10.0-rc5+ #3
>>>> [ 163.152695] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A88XM-D3H, BIOS F5 01/09/2014
>>>> [ 163.152735] task: ffff995e73d16580 task.stack: ffffb41144458000
>>>> [ 163.152764] RIP: 0010:kfd_get_process_device_data+0x6/0x30 [amdkfd]
>>>> [ 163.152790] RSP: 0018:ffffb4114445bab0 EFLAGS: 00010246
>>>> [ 163.152812] RAX: ffffffffffffffea RBX: ffff995e75909c00 RCX: 0000000000000000
>>>> [ 163.152841] RDX: 0000000000000000 RSI: ffffffffffffffea RDI: ffff995e75909600
>>>> [ 163.152869] RBP: ffffb4114445bae0 R08: 00000000000252a5 R09: 0000000000000414
>>>> [ 163.152898] R10: 0000000000000000 R11: ffffffffb412d38d R12: 00000000ffffffc2
>>>> [ 163.152926] R13: 0000000000000000 R14: ffff995e75909ca8 R15: ffff995e75909c00
>>>> [ 163.152956] FS: 00007f8ae975e740(0000) GS:ffff995e7ed80000(0000) knlGS:0000000000000000
>>>> [ 163.152988] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [ 163.153012] CR2: 000000000000005a CR3: 00000002216ab000 CR4: 00000000000406e0
>>>> [ 163.153041] Call Trace:
>>>> [ 163.153059]  ? destroy_queues_cpsch+0x166/0x190 [amdkfd]
>>>> [ 163.153086]  execute_queues_cpsch+0x2e/0xc0 [amdkfd]
>>>> [ 163.153113]  destroy_queue_cpsch+0xbd/0x140 [amdkfd]
>>>> [ 163.153139]  pqm_destroy_queue+0x111/0x1d0 [amdkfd]
>>>> [ 163.153164]  pqm_uninit+0x3f/0xb0 [amdkfd]
>>>> [ 163.153186]  kfd_unbind_process_from_device+0x51/0xd0 [amdkfd]
>>>> [ 163.153214]  iommu_pasid_shutdown_callback+0x20/0x30 [amdkfd]
>>>> [ 163.153239]  mn_release+0x37/0x70 [amd_iommu_v2]
>>>> [ 163.153261]  __mmu_notifier_release+0x44/0xc0
>>>> [ 163.153281]  exit_mmap+0x15a/0x170
>>>> [ 163.153297]  ? __wake_up+0x44/0x50
>>>> [ 163.153314]  ? exit_robust_list+0x5c/0x110
>>>> [ 163.153333]  mmput+0x57/0x140
>>>> [ 163.153347]  do_exit+0x26b/0xb30
>>>> [ 163.153362]  do_group_exit+0x43/0xb0
>>>> [ 163.153379]  get_signal+0x293/0x620
>>>> [ 163.153396]  do_signal+0x37/0x760
>>>> [ 163.153411]  ? print_vma_addr+0x82/0x100
>>>> [ 163.153429]  ? vprintk_default+0x29/0x50
>>>> [ 163.153447]  ? bad_area+0x46/0x50
>>>> [ 163.153463]  ? __do_page_fault+0x3c7/0x4e0
>>>> [ 163.153481]  exit_to_usermode_loop+0x76/0xb0
>>>> [ 163.153500]  prepare_exit_to_usermode+0x2f/0x40
>>>> [ 163.153521]  retint_user+0x8/0x10
>>>> [ 163.153536] RIP: 0033:0x7f8ae932ee5d
>>>> [ 163.153551] RSP: 002b:00007ffc52219cd0 EFLAGS: 00010202
>>>> [ 163.153573] RAX: 0000000000000003 RBX: 0000000100007f8a RCX: 00007ffc52219d00
>>>> [ 163.153602] RDX: 00007f8ae9534220 RSI: 00007f8ae8b5eb28 RDI: 0000000100007f8a
>>>> [ 163.153630] RBP: 00007ffc52219d20 R08: 0000000001cc1890 R09: 0000000000000000
>>>> [ 163.153659] R10: 0000000000000027 R11: 00007f8ae932ee10 R12: 0000000001cc52a0
>>>> [ 163.153687] R13: 00007ffc5221a200 R14: 0000000000000021 R15: 0000000000000000
>>>> [ 163.153716] Code: e0 04 00 00 48 3b 91 f0 03 00 00 74 01 c3 55 48 89 e5 e8 2e f9 ff ff 5d c3 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 <48> 8b 46 70 48 83 c6 70 48 89 e5 48 39 f0 74 16 48 3b 78 10 75
>>>> [ 163.153818] RIP: kfd_get_process_device_data+0x6/0x30 [amdkfd] RSP: ffffb4114445bab0
>>>> [ 163.153848] CR2: 000000000000005a
>>>> [ 163.160389] ---[ end trace f6a8177c7119c1f5 ]---
>>>> [ 163.160390] Fixing recursive fault but reboot is needed!
>>>>
>>>> On Thu, Feb 9, 2017 at 10:38 PM, Andres Rodriguez <andre...@gmail.com> wrote:
>>>>>
>>>>> Hey Oded,
>>>>>
>>>>> Sorry to be a nuisance, but if you have everything still set up,
>>>>> could you give this fix a quick go?
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>>> index 5321d18..9f70ee0 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>>> @@ -667,7 +667,7 @@ static int set_sched_resources(struct device_queue_manager *dqm)
>>>>>                 /* This situation may be hit in the future if a new HW
>>>>>                  * generation exposes more than 64 queues. If so, the
>>>>>                  * definition of res.queue_mask needs updating */
>>>>> -               if (WARN_ON(i > sizeof(res.queue_mask))) {
>>>>> +               if (WARN_ON(i > (sizeof(res.queue_mask)*8))) {
>>>>>                         pr_err("Invalid queue enabled by amdgpu: %d\n", i);
>>>>>                         break;
>>>>>                 }
>>>>>
>>>>> John/Felix,
>>>>>
>>>>> Any chance I could borrow a carrizo/kaveri for a few days? Or maybe
>>>>> you could help me run some final tests on this patch series?
>>>>>
>>>>> - Andres
>>>>>
>>>>> On 2017-02-09 03:11 PM, Oded Gabbay wrote:
>>>>>>
>>>>>> Andres,
>>>>>>
>>>>>> I tried your patches on Kaveri with airlied's drm-next branch.
>>>>>> I used radeon+amdkfd.
>>>>>>
>>>>>> The following test failed: KFDQMTest.CreateMultipleCpQueues
>>>>>> However, I can't debug it because I don't have the sources of kfdtest.
>>>>>>
>>>>>> In dmesg, I saw the following warning during boot:
>>>>>> WARNING: CPU: 0 PID: 150 at drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c:670 start_cpsch+0xc5/0x220 [amdkfd]
>>>>>> [ 4.393796] Modules linked in: hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid uas usb_storage amdkfd amd_iommu_v2 radeon(+) i2c_algo_bit ttm drm_kms_helper syscopyarea ahci sysfillrect sysimgblt libahci fb_sys_fops drm r8169 mii fjes video
>>>>>> [ 4.393811] CPU: 0 PID: 150 Comm: systemd-udevd Not tainted 4.10.0-rc5+ #1
>>>>>> [ 4.393811] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A88XM-D3H, BIOS F5 01/09/2014
>>>>>> [ 4.393812] Call Trace:
>>>>>> [ 4.393818]  dump_stack+0x63/0x90
>>>>>> [ 4.393822]  __warn+0xcb/0xf0
>>>>>> [ 4.393823]  warn_slowpath_null+0x1d/0x20
>>>>>> [ 4.393830]  start_cpsch+0xc5/0x220 [amdkfd]
>>>>>> [ 4.393836]  ? initialize_cpsch+0xa0/0xb0 [amdkfd]
>>>>>> [ 4.393841]  kgd2kfd_device_init+0x375/0x490 [amdkfd]
>>>>>> [ 4.393883]  radeon_kfd_device_init+0xaf/0xd0 [radeon]
>>>>>> [ 4.393911]  radeon_driver_load_kms+0x11e/0x1f0 [radeon]
>>>>>> [ 4.393933]  drm_dev_register+0x14a/0x200 [drm]
>>>>>> [ 4.393946]  drm_get_pci_dev+0x9d/0x160 [drm]
>>>>>> [ 4.393974]  radeon_pci_probe+0xb8/0xe0 [radeon]
>>>>>> [ 4.393976]  local_pci_probe+0x45/0xa0
>>>>>> [ 4.393978]  pci_device_probe+0x103/0x150
>>>>>> [ 4.393981]  driver_probe_device+0x2bf/0x460
>>>>>> [ 4.393982]  __driver_attach+0xdf/0xf0
>>>>>> [ 4.393984]  ? driver_probe_device+0x460/0x460
>>>>>> [ 4.393985]  bus_for_each_dev+0x6c/0xc0
>>>>>> [ 4.393987]  driver_attach+0x1e/0x20
>>>>>> [ 4.393988]  bus_add_driver+0x1fd/0x270
>>>>>> [ 4.393989]  ? 0xffffffffc05c8000
>>>>>> [ 4.393991]  driver_register+0x60/0xe0
>>>>>> [ 4.393992]  ? 0xffffffffc05c8000
>>>>>> [ 4.393993]  __pci_register_driver+0x4c/0x50
>>>>>> [ 4.394007]  drm_pci_init+0xeb/0x100 [drm]
>>>>>> [ 4.394008]  ? 0xffffffffc05c8000
>>>>>> [ 4.394031]  radeon_init+0x98/0xb6 [radeon]
>>>>>> [ 4.394034]  do_one_initcall+0x53/0x1a0
>>>>>> [ 4.394037]  ? __vunmap+0x81/0xd0
>>>>>> [ 4.394039]  ? kmem_cache_alloc_trace+0x152/0x1c0
>>>>>> [ 4.394041]  ? vfree+0x2e/0x70
>>>>>> [ 4.394044]  do_init_module+0x5f/0x1ff
>>>>>> [ 4.394046]  load_module+0x24cc/0x29f0
>>>>>> [ 4.394047]  ? __symbol_put+0x60/0x60
>>>>>> [ 4.394050]  ? security_kernel_post_read_file+0x6b/0x80
>>>>>> [ 4.394052]  SYSC_finit_module+0xdf/0x110
>>>>>> [ 4.394054]  SyS_finit_module+0xe/0x10
>>>>>> [ 4.394056]  entry_SYSCALL_64_fastpath+0x1e/0xad
>>>>>> [ 4.394058] RIP: 0033:0x7f9cda77c8e9
>>>>>> [ 4.394059] RSP: 002b:00007ffe195d3378 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
>>>>>> [ 4.394060] RAX: ffffffffffffffda RBX: 00007f9cdb8dda7e RCX: 00007f9cda77c8e9
>>>>>> [ 4.394061] RDX: 0000000000000000 RSI: 00007f9cdac7ce2a RDI: 0000000000000013
>>>>>> [ 4.394062] RBP: 00007ffe195d2450 R08: 0000000000000000 R09: 0000000000000000
>>>>>> [ 4.394063] R10: 0000000000000013 R11: 0000000000000246 R12: 00007ffe195d245a
>>>>>> [ 4.394063] R13: 00007ffe195d1378 R14: 0000563f70cc93b0 R15: 0000563f70cba4d0
>>>>>> [ 4.394091] ---[ end trace 9c5af17304d998bb ]---
>>>>>> [ 4.394092] Invalid queue enabled by amdgpu: 9
>>>>>>
>>>>>> I suggest you get a Kaveri/Carrizo machine to debug these issues.
>>>>>>
>>>>>> Until then, I don't think we should merge this patch set.
>>>>>>
>>>>>> Oded
>>>>>>
>>>>>> On Wed, Feb 8, 2017 at 9:47 PM, Andres Rodriguez <andre...@gmail.com> wrote:
>>>>>>>
>>>>>>> Thank you Oded.
>>>>>>>
>>>>>>> - Andres
>>>>>>>
>>>>>>> On 2017-02-08 02:32 PM, Oded Gabbay wrote:
>>>>>>>>
>>>>>>>> On Wed, Feb 8, 2017 at 6:23 PM, Andres Rodriguez <andre...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hey Felix,
>>>>>>>>>
>>>>>>>>> Thanks for the pointer to the ROCm mqd commit. I like that the
>>>>>>>>> workarounds are easy to spot. I'll add that to a new patch
>>>>>>>>> series I'm working on for some bug fixes for perf being lower on
>>>>>>>>> pipes other than pipe 0.
>>>>>>>>>
>>>>>>>>> I haven't tested this yet on kaveri/carrizo. I'm hoping someone
>>>>>>>>> with the HW will be able to give it a go. I put in a few small
>>>>>>>>> hacks to get KFD to boot but do nothing on polaris10.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Andres
>>>>>>>>>
>>>>>>>>> On 2017-02-06 03:20 PM, Felix Kuehling wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Andres,
>>>>>>>>>>
>>>>>>>>>> Thank you for tackling this task. It's more involved than I
>>>>>>>>>> expected, mostly because I didn't have much awareness of the
>>>>>>>>>> MQD management in amdgpu.
>>>>>>>>>>
>>>>>>>>>> I made one comment in a separate message about the unified MQD
>>>>>>>>>> commit function, if you want to bring that more in line with
>>>>>>>>>> our latest ROCm release on github.
>>>>>>>>>>
>>>>>>>>>> Also, were you able to test the upstream KFD with your changes
>>>>>>>>>> on a Kaveri or Carrizo?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Felix
>>>>>>>>>>
>>>>>>>>>> On 17-02-03 11:51 PM, Andres Rodriguez wrote:
>>>>>>>>>>>
>>>>>>>>>>> The current queue/pipe split policy is for amdgpu to take the
>>>>>>>>>>> first pipe of MEC0 and leave the rest for amdkfd to use. This
>>>>>>>>>>> policy is taken as an assumption in a few areas of the
>>>>>>>>>>> implementation.
>>>>>>>>>>>
>>>>>>>>>>> This patch series aims to allow for flexible/tunable
>>>>>>>>>>> queue/pipe split policies between kgd and kfd. It also updates
>>>>>>>>>>> the queue/pipe split policy to one that allows better compute
>>>>>>>>>>> app concurrency for both drivers.
>>>>>>>>>>>
>>>>>>>>>>> In the process some duplicate code and hardcoded constants
>>>>>>>>>>> were removed.
>>>>>>>>>>>
>>>>>>>>>>> Any suggestions or feedback on improvements welcome.
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> amd-gfx mailing list
>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>
>>>>>>>> Hi Andres,
>>>>>>>> I will try to find some time to test it on my Kaveri machine.
>>>>>>>>
>>>>>>>> Oded