On 02/16/2017 03:00 PM, Bridgman, John wrote:
> Any objections to authorizing Oded to post the kfdtest binary he is using to
> some public place (if not there already) so others (like Andres) can test
> changes which touch on amdkfd?
>
> We should check it for embarrassing symbols but otherwise it should be OK.
Someone was up late for a deadline? lol

> That said, since we are getting perilously close to actually sending dGPU
> support changes upstream we will need (IMO) to maintain a sanitized source
> repo for kfdtest as well... sharing the binary just gets us started.

Hi John,

Yes, this is the sort of thing I've been referring to for some time now. We
definitely need some kind of centralized mechanism to test/validate kfd
stuff, so if you can get this out that would be great! A binary would be a
start; I am sure we can make do, and it is certainly better than nothing.
However, source, much like what happened with UMR, would of course be ideal.

Perhaps we could arrange some kind of IRC meeting regarding kfd, since there
seems to be a bit of fragmented effort here. I have my own ioctl()s locally
for pinning for my own project, which I am not sure are suitable to upstream
as-is, since AMD has its own take, so what should we do? I have heard so
much about dGPU support for a couple of years now, but have only seen bits
thrown over the wall. Can we begin a more serious incremental approach ASAP?

I created #amdkfd on freenode some time ago, where a couple of interested
academics and users hang out.

Kind Regards,
Edward.

> Thanks,
> John
>
>> -----Original Message-----
>> From: Oded Gabbay [mailto:oded.gab...@gmail.com]
>> Sent: Friday, February 10, 2017 12:57 PM
>> To: Andres Rodriguez
>> Cc: Kuehling, Felix; Bridgman, John; amd-gfx@lists.freedesktop.org;
>> Deucher, Alexander; Jay Cornwall
>> Subject: Re: Change queue/pipe split between amdkfd and amdgpu
>>
>> I don't have a repo, nor do I have the source code.
>> It is a tool that we developed inside AMD (when I was working there), and
>> after I left AMD I got permission to use the binary for regression testing.
>>
>> Oded
>>
>> On Fri, Feb 10, 2017 at 6:33 PM, Andres Rodriguez <andre...@gmail.com> wrote:
>>> Hey Oded,
>>>
>>> Where can I find a repo with kfdtest?
>>>
>>> I tried looking here but couldn't find it:
>>>
>>> https://cgit.freedesktop.org/~gabbayo/
>>>
>>> -Andres
>>>
>>> On 2017-02-10 05:35 AM, Oded Gabbay wrote:
>>>>
>>>> So the warning in dmesg is gone of course, but the test (that I
>>>> mentioned in a previous email) still fails, and this time it caused
>>>> the kernel to crash. In addition, other tests now fail as well, e.g.
>>>> KFDEventTest.SignalEvent.
>>>>
>>>> I honestly suggest taking some time to debug this patch set on an
>>>> actual Kaveri machine and then re-sending the patches.
>>>>
>>>> Thanks,
>>>> Oded
>>>>
>>>> Log of crash from KFDQMTest.CreateMultipleCpQueues:
>>>>
>>>> [ 160.900137] kfd: qcm fence wait loop timeout expired
>>>> [ 160.900143] kfd: the cp might be in an unrecoverable state due to an unsuccessful queues preemption
>>>> [ 160.916765] show_signal_msg: 36 callbacks suppressed
>>>> [ 160.916771] kfdtest[2498]: segfault at 100007f8a ip 00007f8ae932ee5d sp 00007ffc52219cd0 error 4 in libhsakmt-1.so.0.0.1[7f8ae932b000+8000]
>>>> [ 163.152229] kfd: qcm fence wait loop timeout expired
>>>> [ 163.152250] BUG: unable to handle kernel NULL pointer dereference at 000000000000005a
>>>> [ 163.152299] IP: kfd_get_process_device_data+0x6/0x30 [amdkfd]
>>>> [ 163.152323] PGD 2333aa067
>>>> [ 163.152323] PUD 230f64067
>>>> [ 163.152335] PMD 0
>>>> [ 163.152364] Oops: 0000 [#1] SMP
>>>> [ 163.152379] Modules linked in: joydev edac_mce_amd edac_core input_leds kvm_amd snd_hda_codec_realtek kvm irqbypass snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_core snd_hwdep pcbc snd_pcm aesni_intel snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq aes_x86_64 crypto_simd snd_seq_device glue_helper cryptd snd_timer snd fam15h_power k10temp soundcore i2c_piix4 shpchp tpm_infineon mac_hid parport_pc ppdev nfsd auth_rpcgss nfs_acl lockd lp grace sunrpc parport autofs4 hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid uas usb_storage amdkfd amd_iommu_v2 radeon i2c_algo_bit ttm drm_kms_helper syscopyarea ahci sysfillrect sysimgblt libahci fb_sys_fops drm r8169 mii fjes video
>>>> [ 163.152668] CPU: 3 PID: 2498 Comm: kfdtest Not tainted 4.10.0-rc5+ #3
>>>> [ 163.152695] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A88XM-D3H, BIOS F5 01/09/2014
>>>> [ 163.152735] task: ffff995e73d16580 task.stack: ffffb41144458000
>>>> [ 163.152764] RIP: 0010:kfd_get_process_device_data+0x6/0x30 [amdkfd]
>>>> [ 163.152790] RSP: 0018:ffffb4114445bab0 EFLAGS: 00010246
>>>> [ 163.152812] RAX: ffffffffffffffea RBX: ffff995e75909c00 RCX: 0000000000000000
>>>> [ 163.152841] RDX: 0000000000000000 RSI: ffffffffffffffea RDI: ffff995e75909600
>>>> [ 163.152869] RBP: ffffb4114445bae0 R08: 00000000000252a5 R09: 0000000000000414
>>>> [ 163.152898] R10: 0000000000000000 R11: ffffffffb412d38d R12: 00000000ffffffc2
>>>> [ 163.152926] R13: 0000000000000000 R14: ffff995e75909ca8 R15: ffff995e75909c00
>>>> [ 163.152956] FS: 00007f8ae975e740(0000) GS:ffff995e7ed80000(0000) knlGS:0000000000000000
>>>> [ 163.152988] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [ 163.153012] CR2: 000000000000005a CR3: 00000002216ab000 CR4: 00000000000406e0
>>>> [ 163.153041] Call Trace:
>>>> [ 163.153059]  ? destroy_queues_cpsch+0x166/0x190 [amdkfd]
>>>> [ 163.153086]  execute_queues_cpsch+0x2e/0xc0 [amdkfd]
>>>> [ 163.153113]  destroy_queue_cpsch+0xbd/0x140 [amdkfd]
>>>> [ 163.153139]  pqm_destroy_queue+0x111/0x1d0 [amdkfd]
>>>> [ 163.153164]  pqm_uninit+0x3f/0xb0 [amdkfd]
>>>> [ 163.153186]  kfd_unbind_process_from_device+0x51/0xd0 [amdkfd]
>>>> [ 163.153214]  iommu_pasid_shutdown_callback+0x20/0x30 [amdkfd]
>>>> [ 163.153239]  mn_release+0x37/0x70 [amd_iommu_v2]
>>>> [ 163.153261]  __mmu_notifier_release+0x44/0xc0
>>>> [ 163.153281]  exit_mmap+0x15a/0x170
>>>> [ 163.153297]  ? __wake_up+0x44/0x50
>>>> [ 163.153314]  ? exit_robust_list+0x5c/0x110
>>>> [ 163.153333]  mmput+0x57/0x140
>>>> [ 163.153347]  do_exit+0x26b/0xb30
>>>> [ 163.153362]  do_group_exit+0x43/0xb0
>>>> [ 163.153379]  get_signal+0x293/0x620
>>>> [ 163.153396]  do_signal+0x37/0x760
>>>> [ 163.153411]  ? print_vma_addr+0x82/0x100
>>>> [ 163.153429]  ? vprintk_default+0x29/0x50
>>>> [ 163.153447]  ? bad_area+0x46/0x50
>>>> [ 163.153463]  ? __do_page_fault+0x3c7/0x4e0
>>>> [ 163.153481]  exit_to_usermode_loop+0x76/0xb0
>>>> [ 163.153500]  prepare_exit_to_usermode+0x2f/0x40
>>>> [ 163.153521]  retint_user+0x8/0x10
>>>> [ 163.153536] RIP: 0033:0x7f8ae932ee5d
>>>> [ 163.153551] RSP: 002b:00007ffc52219cd0 EFLAGS: 00010202
>>>> [ 163.153573] RAX: 0000000000000003 RBX: 0000000100007f8a RCX: 00007ffc52219d00
>>>> [ 163.153602] RDX: 00007f8ae9534220 RSI: 00007f8ae8b5eb28 RDI: 0000000100007f8a
>>>> [ 163.153630] RBP: 00007ffc52219d20 R08: 0000000001cc1890 R09: 0000000000000000
>>>> [ 163.153659] R10: 0000000000000027 R11: 00007f8ae932ee10 R12: 0000000001cc52a0
>>>> [ 163.153687] R13: 00007ffc5221a200 R14: 0000000000000021 R15: 0000000000000000
>>>> [ 163.153716] Code: e0 04 00 00 48 3b 91 f0 03 00 00 74 01 c3 55 48 89 e5 e8 2e f9 ff ff 5d c3 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 <48> 8b 46 70 48 83 c6 70 48 89 e5 48 39 f0 74 16 48 3b 78 10 75
>>>> [ 163.153818] RIP: kfd_get_process_device_data+0x6/0x30 [amdkfd] RSP: ffffb4114445bab0
>>>> [ 163.153848] CR2: 000000000000005a
>>>> [ 163.160389] ---[ end trace f6a8177c7119c1f5 ]---
>>>> [ 163.160390] Fixing recursive fault but reboot is needed!
>>>>
>>>> On Thu, Feb 9, 2017 at 10:38 PM, Andres Rodriguez <andre...@gmail.com> wrote:
>>>>>
>>>>> Hey Oded,
>>>>>
>>>>> Sorry to be a nuisance, but if you have everything still set up,
>>>>> could you give this fix a quick go?
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>>> index 5321d18..9f70ee0 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>>> @@ -667,7 +667,7 @@ static int set_sched_resources(struct device_queue_manager *dqm)
>>>>>                 /* This situation may be hit in the future if a new HW
>>>>>                  * generation exposes more than 64 queues. If so, the
>>>>>                  * definition of res.queue_mask needs updating */
>>>>> -               if (WARN_ON(i > sizeof(res.queue_mask))) {
>>>>> +               if (WARN_ON(i > (sizeof(res.queue_mask)*8))) {
>>>>>                         pr_err("Invalid queue enabled by amdgpu: %d\n", i);
>>>>>                         break;
>>>>>                 }
>>>>>
>>>>> John/Felix,
>>>>>
>>>>> Any chance I could borrow a carrizo/kaveri for a few days? Or maybe
>>>>> you could help me run some final tests on this patch series?
>>>>>
>>>>> - Andres
>>>>>
>>>>> On 2017-02-09 03:11 PM, Oded Gabbay wrote:
>>>>>>
>>>>>> Andres,
>>>>>>
>>>>>> I tried your patches on Kaveri with airlied's drm-next branch.
>>>>>> I used radeon+amdkfd.
>>>>>>
>>>>>> The following test failed: KFDQMTest.CreateMultipleCpQueues
>>>>>> However, I can't debug it because I don't have the sources of kfdtest.
>>>>>>
>>>>>> In dmesg, I saw the following warning during boot:
>>>>>> WARNING: CPU: 0 PID: 150 at drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c:670 start_cpsch+0xc5/0x220 [amdkfd]
>>>>>> [ 4.393796] Modules linked in: hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid uas usb_storage amdkfd amd_iommu_v2 radeon(+) i2c_algo_bit ttm drm_kms_helper syscopyarea ahci sysfillrect sysimgblt libahci fb_sys_fops drm r8169 mii fjes video
>>>>>> [ 4.393811] CPU: 0 PID: 150 Comm: systemd-udevd Not tainted 4.10.0-rc5+ #1
>>>>>> [ 4.393811] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A88XM-D3H, BIOS F5 01/09/2014
>>>>>> [ 4.393812] Call Trace:
>>>>>> [ 4.393818]  dump_stack+0x63/0x90
>>>>>> [ 4.393822]  __warn+0xcb/0xf0
>>>>>> [ 4.393823]  warn_slowpath_null+0x1d/0x20
>>>>>> [ 4.393830]  start_cpsch+0xc5/0x220 [amdkfd]
>>>>>> [ 4.393836]  ? initialize_cpsch+0xa0/0xb0 [amdkfd]
>>>>>> [ 4.393841]  kgd2kfd_device_init+0x375/0x490 [amdkfd]
>>>>>> [ 4.393883]  radeon_kfd_device_init+0xaf/0xd0 [radeon]
>>>>>> [ 4.393911]  radeon_driver_load_kms+0x11e/0x1f0 [radeon]
>>>>>> [ 4.393933]  drm_dev_register+0x14a/0x200 [drm]
>>>>>> [ 4.393946]  drm_get_pci_dev+0x9d/0x160 [drm]
>>>>>> [ 4.393974]  radeon_pci_probe+0xb8/0xe0 [radeon]
>>>>>> [ 4.393976]  local_pci_probe+0x45/0xa0
>>>>>> [ 4.393978]  pci_device_probe+0x103/0x150
>>>>>> [ 4.393981]  driver_probe_device+0x2bf/0x460
>>>>>> [ 4.393982]  __driver_attach+0xdf/0xf0
>>>>>> [ 4.393984]  ? driver_probe_device+0x460/0x460
>>>>>> [ 4.393985]  bus_for_each_dev+0x6c/0xc0
>>>>>> [ 4.393987]  driver_attach+0x1e/0x20
>>>>>> [ 4.393988]  bus_add_driver+0x1fd/0x270
>>>>>> [ 4.393989]  ? 0xffffffffc05c8000
>>>>>> [ 4.393991]  driver_register+0x60/0xe0
>>>>>> [ 4.393992]  ? 0xffffffffc05c8000
>>>>>> [ 4.393993]  __pci_register_driver+0x4c/0x50
>>>>>> [ 4.394007]  drm_pci_init+0xeb/0x100 [drm]
>>>>>> [ 4.394008]  ? 0xffffffffc05c8000
>>>>>> [ 4.394031]  radeon_init+0x98/0xb6 [radeon]
>>>>>> [ 4.394034]  do_one_initcall+0x53/0x1a0
>>>>>> [ 4.394037]  ? __vunmap+0x81/0xd0
>>>>>> [ 4.394039]  ? kmem_cache_alloc_trace+0x152/0x1c0
>>>>>> [ 4.394041]  ? vfree+0x2e/0x70
>>>>>> [ 4.394044]  do_init_module+0x5f/0x1ff
>>>>>> [ 4.394046]  load_module+0x24cc/0x29f0
>>>>>> [ 4.394047]  ? __symbol_put+0x60/0x60
>>>>>> [ 4.394050]  ? security_kernel_post_read_file+0x6b/0x80
>>>>>> [ 4.394052]  SYSC_finit_module+0xdf/0x110
>>>>>> [ 4.394054]  SyS_finit_module+0xe/0x10
>>>>>> [ 4.394056]  entry_SYSCALL_64_fastpath+0x1e/0xad
>>>>>> [ 4.394058] RIP: 0033:0x7f9cda77c8e9
>>>>>> [ 4.394059] RSP: 002b:00007ffe195d3378 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
>>>>>> [ 4.394060] RAX: ffffffffffffffda RBX: 00007f9cdb8dda7e RCX: 00007f9cda77c8e9
>>>>>> [ 4.394061] RDX: 0000000000000000 RSI: 00007f9cdac7ce2a RDI: 0000000000000013
>>>>>> [ 4.394062] RBP: 00007ffe195d2450 R08: 0000000000000000 R09: 0000000000000000
>>>>>> [ 4.394063] R10: 0000000000000013 R11: 0000000000000246 R12: 00007ffe195d245a
>>>>>> [ 4.394063] R13: 00007ffe195d1378 R14: 0000563f70cc93b0 R15: 0000563f70cba4d0
>>>>>> [ 4.394091] ---[ end trace 9c5af17304d998bb ]---
>>>>>> [ 4.394092] Invalid queue enabled by amdgpu: 9
>>>>>>
>>>>>> I suggest you get a Kaveri/Carrizo machine to debug these issues.
>>>>>>
>>>>>> Until then, I don't think we should merge this patch set.
>>>>>>
>>>>>> Oded
>>>>>>
>>>>>> On Wed, Feb 8, 2017 at 9:47 PM, Andres Rodriguez <andre...@gmail.com> wrote:
>>>>>>>
>>>>>>> Thank you Oded.
>>>>>>>
>>>>>>> - Andres
>>>>>>>
>>>>>>> On 2017-02-08 02:32 PM, Oded Gabbay wrote:
>>>>>>>>
>>>>>>>> On Wed, Feb 8, 2017 at 6:23 PM, Andres Rodriguez <andre...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hey Felix,
>>>>>>>>>
>>>>>>>>> Thanks for the pointer to the ROCm mqd commit. I like that the
>>>>>>>>> workarounds are easy to spot. I'll add that to a new patch
>>>>>>>>> series I'm working on for some bug fixes for perf being lower on
>>>>>>>>> pipes other than pipe 0.
>>>>>>>>>
>>>>>>>>> I haven't tested this yet on kaveri/carrizo. I'm hoping someone
>>>>>>>>> with the HW will be able to give it a go. I put in a few small
>>>>>>>>> hacks to get KFD to boot but do nothing on polaris10.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Andres
>>>>>>>>>
>>>>>>>>> On 2017-02-06 03:20 PM, Felix Kuehling wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Andres,
>>>>>>>>>>
>>>>>>>>>> Thank you for tackling this task. It's more involved than I
>>>>>>>>>> expected, mostly because I didn't have much awareness of the
>>>>>>>>>> MQD management in amdgpu.
>>>>>>>>>>
>>>>>>>>>> I made one comment in a separate message about the unified MQD
>>>>>>>>>> commit function, if you want to bring that more in line with
>>>>>>>>>> our latest ROCm release on github.
>>>>>>>>>>
>>>>>>>>>> Also, were you able to test the upstream KFD with your changes
>>>>>>>>>> on a Kaveri or Carrizo?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Felix
>>>>>>>>>>
>>>>>>>>>> On 17-02-03 11:51 PM, Andres Rodriguez wrote:
>>>>>>>>>>>
>>>>>>>>>>> The current queue/pipe split policy is for amdgpu to take the
>>>>>>>>>>> first pipe of MEC0 and leave the rest for amdkfd to use. This
>>>>>>>>>>> policy is taken as an assumption in a few areas of the
>>>>>>>>>>> implementation.
>>>>>>>>>>>
>>>>>>>>>>> This patch series aims to allow for flexible/tunable
>>>>>>>>>>> queue/pipe split policies between kgd and kfd. It also updates
>>>>>>>>>>> the queue/pipe split policy to one that allows better compute
>>>>>>>>>>> app concurrency for both drivers.
>>>>>>>>>>>
>>>>>>>>>>> In the process some duplicate code and hardcoded constants
>>>>>>>>>>> were removed.
>>>>>>>>>>>
>>>>>>>>>>> Any suggestions or feedback on improvements welcome.
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> amd-gfx mailing list
>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>
>>>>>>>> Hi Andres,
>>>>>>>> I will try to find some time to test it on my Kaveri machine.
>>>>>>>>
>>>>>>>> Oded