Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor
Am 13.02.2018 um 20:22 schrieb Felix Kuehling: On 2018-02-13 01:45 PM, Christian König wrote: Am 13.02.2018 um 17:56 schrieb Felix Kuehling: [SNIP]

[Felix Kuehling:] Each process gets a whole page of the doorbell aperture assigned to it. The assumption is that amdgpu only uses the first page of the doorbell aperture, so KFD uses all the rest. On GFX8 and before, the queue ID is used as the offset into the doorbell page. On GFX9 the hardware does some engine-specific doorbell routing, so we added another layer of doorbell management that's decoupled from the queue ID. Either way, an entire doorbell page gets mapped into user mode, and user mode knows the offset of the doorbells for specific queues. The mapping is currently handled by kfd_mmap in kfd_chardev.c.

[Christian König:] Ok, wait a second. Taking a look at kfd_doorbell_mmap() it almost looks like you map different doorbells with the same offset depending on which process is calling this. Is that correct? If yes then that would be illegal and a problem, if I'm not completely mistaken.

[Felix Kuehling:] Why is that a problem? Each process has its own file descriptor. The mapping is done using io_remap_pfn_range in kfd_doorbell_mmap.

[Christian König:] Yeah, but they all share the same file address space from the inode, don't they? E.g. imagine that you map an offset from a normal file into process A and the same offset from the same file into process B. What do you get? Exactly: the same page mapped into both processes. Now take a look at the KFD interface: you map the same offset from the same file (or rather inode) into two different processes and get two different things.

[Felix Kuehling:] This is nothing new. It's been done like this forever, even on Kaveri and Carrizo.

[Christian König:] It most likely doesn't cause any problems off hand, or otherwise we would have noticed already, but at a bare minimum it totally confuses the reverse-mapping code. I need to think about whether that is really an issue or not, and if yes, how to fix/mitigate it.

Regards, Christian.
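The property under discussion can be illustrated with a toy model (plain Python, not driver code; all names are hypothetical): each process is assigned its own doorbell page, so mmap()ing the same file offset resolves to a different physical page per process. This is exactly the inode-offset-to-page assumption that Christian says confuses the reverse-mapping code.

```python
# Hypothetical model of the per-process doorbell mapping described above.
# Page 0 of the doorbell aperture is assumed reserved for amdgpu; KFD
# hands out one whole page per process.

DOORBELL_PAGE_SIZE = 4096

class DoorbellManager:
    def __init__(self, aperture_base):
        self.aperture_base = aperture_base
        self.next_page = 1          # page 0 reserved for amdgpu
        self.process_pages = {}     # pid -> doorbell page index

    def page_for_process(self, pid):
        # Assign a whole doorbell page to each process on first use.
        if pid not in self.process_pages:
            self.process_pages[pid] = self.next_page
            self.next_page += 1
        return self.process_pages[pid]

    def mmap(self, pid, offset=0):
        # Same file offset, different physical address per process --
        # unlike a normal file, where one inode offset means one page.
        page = self.page_for_process(pid)
        return self.aperture_base + page * DOORBELL_PAGE_SIZE + offset

mgr = DoorbellManager(aperture_base=0xF0000000)
a = mgr.mmap(pid=100)
b = mgr.mmap(pid=200)
assert a != b   # same offset, two different doorbell pages
```

The model shows why the mapping is harmless from each process's point of view (every process sees a stable page) while still violating the one-page-per-inode-offset expectation.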
[Christian König:] Do you simply assume that after evicting a process it always needs to be restarted, without checking if it actually does something? Or how does that work?

[Felix Kuehling:] Exactly.

[Christian König:] Ok, understood. Well, that limits the usefulness of the whole eviction drastically.

[Felix Kuehling:] With the later addition of GPU self-dispatch, a page-fault-based mechanism wouldn't work any more. We have to restart the queues blindly with a timer. See evict_process_worker, which schedules the restore with a delayed worker. The user mode queue ABI specifies that user mode updates both the doorbell and a WPTR in memory. When we restart queues, we (or the CP firmware) use the WPTR to make sure we catch up with any work that was submitted while the queues were unmapped.

[Christian König:] Putting cross-process work dispatch aside for a moment, GPU self-dispatch works only when there is work running on the GPU. So you can still check if there is some work pending after you unmapped everything, and only restart the queues when there is new work, based on the page fault. In other words, either there is work pending and it doesn't matter if it was sent by the GPU or by the CPU, or there is no work pending and we can delay restarting everything until there is.

[Felix Kuehling:] That sounds like a useful optimization.

Regards, Felix

Regards, Christian.

___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
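The doorbell-plus-WPTR protocol Felix describes can be sketched as a small simulation (illustrative Python, hypothetical names, not the real ABI): user mode appends packets, advances a WPTR in memory, and rings the doorbell; while the queue is unmapped the doorbell is ignored, but on restore the firmware re-reads the in-memory WPTR and catches up.

```python
# Toy model of the user-mode queue restart path discussed above.

class UserQueue:
    def __init__(self):
        self.ring = []
        self.wptr = 0        # write pointer, mirrored in memory by user mode
        self.rptr = 0        # firmware read pointer
        self.mapped = True

    def submit(self, packet):
        self.ring.append(packet)
        self.wptr += 1       # user mode always updates the WPTR in memory
        if self.mapped:
            self.process_doorbell()   # doorbell ring is only a "kick"

    def process_doorbell(self):
        # Firmware consumes everything up to the WPTR in memory.
        self.rptr = self.wptr

    def evict(self):
        self.mapped = False  # queue unmapped, doorbells not serviced

    def restore(self):
        # On restart, CP catches up using the WPTR in memory, picking up
        # any work submitted while the queue was unmapped.
        self.mapped = True
        self.process_doorbell()

q = UserQueue()
q.submit("pkt0")
q.evict()
q.submit("pkt1")         # doorbell lost, but WPTR in memory advanced
q.restore()
assert q.rptr == 2       # both packets observed after restore
```

This is also why a blind timer-based restore is safe: restoring a queue with no pending work is a no-op, since RPTR already equals WPTR.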
Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor
Am 14.02.2018 um 00:17 schrieb Felix Kuehling: On 2018-02-13 02:18 PM, Felix Kuehling wrote: On 2018-02-13 01:15 PM, Christian König wrote: Am 13.02.2018 um 18:18 schrieb Felix Kuehling: On 2018-02-13 12:06 PM, Christian König wrote: [SNIP]

[Christian König:] Ah, yeah, that is also a point I wanted to talk about with you. The approach of using the same buffer object with multiple amdgpu devices doesn't work in general. We need a separate TTM object for each BO in each device, or otherwise we break A+A laptops.

[Felix Kuehling:] I think it broke for VRAM BOs because we enabled P2P on systems that didn't support it properly. But at least system memory BOs can be shared quite well between devices, and we do it all the time.

[Christian König:] Sharing VRAM BOs is one issue, but the problem goes deeper than just that. Starting with Carrizo we can scan out from system memory to avoid the extra copy on A+A laptops. For this to work we need the BO mapped to GART (and I mean a real VMID0 mapping, not just in the GTT domain). And for this to work in turn we need a TTM object per device and not a global one.

[Felix Kuehling:] I still don't understand. I think what you're talking about applies only to BOs used for scan-out. Every BO is allocated from a specific device and can only be GART-mapped on that device.

[Christian König:] Exactly that assumption is incorrect. BOs can be GART-mapped into any device.

[Felix Kuehling:] What we do is map the same BO in VMs on other devices. It has no effect on GART mappings.

[Christian König:] Correct, VM mapping is unaffected here.

[Felix Kuehling:] I don't see how you can have separate TTM objects referring to the same memory.

[Christian König:] Well, that is trivial; we do this all the time with prime and I+A laptops.

[Felix Kuehling:] As I understand it, you use DMABuf to export/import buffers on multiple devices. I believe all devices share a single amdgpu_bo, which contains the ttm_buffer_object.

[Christian König:] That's incorrect as well. Normally multiple devices have multiple ttm_buffer_objects, one for each device. Going a bit higher, that actually makes sense, because the status of each BO is different for each device. E.g. one device could have the BO in use while it could be idle on another device. The same is true for the placement of the BO. E.g. a VRAM placement on one device is actually a GTT placement for another.

[Felix Kuehling:] The only way you can have a new TTM buffer object per device is by using SG tables and pinning the BOs. But I think we want to avoid pinning BOs. What we do does not involve pinning of BOs, even when they're shared between multiple devices' VMs.

[Christian König:] That is also the reason we had to disable this feature again in the hybrid branches.

[Felix Kuehling:] What you disabled on hybrid branches was P2P, which only affects large-BAR systems. Sharing of system memory BOs between devices still works fine.

[Christian König:] No, it doesn't. It completely breaks any scanout on Carrizo, Stoney and Raven. In addition to that, we found that it breaks some aspects of the user space interface.

[Felix Kuehling:] Let me check that with my current patch series. The patches I submitted here shouldn't include anything that breaks the use cases you describe. But I'm quite sure it will support sharing BOs between multiple devices' VMs. Confirmed with the current patch series: I can allocate buffers on GPU 1 and map them into a VM on GPU 2.

[Christian König:] As I said, VM mapping is not the problem here.

[Felix Kuehling:] BTW, this is the normal mode of operation for system memory BOs on multi-GPU systems. Logically, system memory BOs belong to the CPU node in the KFD topology. So the Thunk API isn't told which GPU to allocate the system memory BO from. It just picks the first one. Then we can map those BOs on any GPUs we want. If we want to use them on GPU2, that's where they get mapped.

[Christian König:] Well, keeping NUMA in mind, that actually sounds like a design problem to me. On a NUMA system some parts of system memory can be "closer" to a GPU than other parts.

[Felix Kuehling:] So I just put two GPUs in a system and ran a test on GPU2. The system memory buffers are allocated from the GPU1 device, but mapped into the GPU2 VM. The tests work normally. If this is enabled by any changes that break existing buffer sharing for A+A or A+I systems, please point it out to me. I'm not aware that this patch series does anything to that effect.

[Christian König:] As I said, it completely breaks scanout with A+I systems. Over all, it looks like the change causes more problems than it solves. So I'm going to block upstreaming it until we have found a way to make it work for everybody. Regards, Christian.

[Christian König, from the earlier mail:] So the end result is that we probably need to revert it and find a different solution. I'm already working on this for a couple of weeks now and should have something ready after I'm done with the PASID handling. Regards, Christian.
Re: Documentation about AMD's HSA implementation?
Thanks for all the inputs. Very helpful! I think I have a general understanding of the queue scheduling now and it's time for me to read more code and materials and do some experiments. I'll come back with more questions hopefully. :-) Hi David, please don't hesitate to share more documents. I might find helpful information from them eventually. People like me may benefit from them in some way in the future. Best, Ming (Mark) On Tue, Feb 13, 2018 at 7:14 PM, Panariti, David wrote: > I found a bunch of doc whilst spelunking info for another project. > I'm not sure what's up-to-date, correct, useful, etc. > I've attached one. > Let me know if you want any more. > > davep > >> -Original Message- >> From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf >> Of Bridgman, John >> Sent: Tuesday, February 13, 2018 6:45 PM >> To: Bridgman, John ; Ming Yang >> ; Kuehling, Felix >> Cc: Deucher, Alexander ; amd- >> g...@lists.freedesktop.org >> Subject: RE: Documentation about AMD's HSA implementation? >> >> >> >-Original Message- >> >From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf >> >Of Bridgman, John >> >Sent: Tuesday, February 13, 2018 6:42 PM >> >To: Ming Yang; Kuehling, Felix >> >Cc: Deucher, Alexander; amd-gfx@lists.freedesktop.org >> >Subject: RE: Documentation about AMD's HSA implementation? >> > >> > >> > >> >>-Original Message- >> >>From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On >> Behalf >> >>Of Ming Yang >> >>Sent: Tuesday, February 13, 2018 4:59 PM >> >>To: Kuehling, Felix >> >>Cc: Deucher, Alexander; amd-gfx@lists.freedesktop.org >> >>Subject: Re: Documentation about AMD's HSA implementation? >> >> >> >>That's very helpful, thanks! >> >> >> >>On Tue, Feb 13, 2018 at 4:17 PM, Felix Kuehling >> >> >> >>wrote: >> >>> On 2018-02-13 04:06 PM, Ming Yang wrote: >> Thanks for the suggestions!
But I might ask several specific >> questions, as I can't find the answer in those documents, to give >> myself a quick start if that's okay. Pointing me to the >> files/functions would be good enough. Any explanations are >> appreciated. My purpose is to hack it with different scheduling >> policy with real-time and predictability consideration. >> >> - Where/How is the packet scheduler implemented? How are packets >> from multiple queues scheduled? What about scheduling packets from >> queues in different address spaces? >> >>> >> >>> This is done mostly in firmware. The CP engine supports up to 32 >> queues. >> >>> We share those between KFD and AMDGPU. KFD gets 24 queues to use. >> >>> Usually that is 6 queues times 4 pipes. Pipes are threads in the CP >> >>> micro engine. Within each pipe the queues are time-multiplexed. >> >> >> >>Please correct me if I'm wrong. CP is computing processor, like the >> >>Execution Engine in NVIDIA GPU. Pipe is like wavefront (warp) >> >>scheduler multiplexing queues, in order to hide memory latency. >> > >> >CP is one step back from that - it's a "command processor" which reads >> >command packets from driver (PM4 format) or application (AQL format) >> >then manages the execution of each command on the GPU. A typical >> packet >> >might be "dispatch", which initiates a compute operation on an >> >N-dimensional array, or "draw" which initiates the rendering of an >> >array of triangles. Those compute and render commands then generate a >> >(typically) large number of wavefronts which are multiplexed on the >> >shader core (by SQ IIRC). Most of our recent GPUs have one micro engine >> >for graphics ("ME") and two for compute ("MEC"). Marketing refers to each >> pipe on an MEC block as an "ACE". >> >> I missed one important point - "CP" refers to the combination of ME, MEC(s) >> and a few other related blocks. 
>> >> >> >> >>> >> >>> If we need more than 24 queues, or if we have more than 8 processes, >> >>> the hardware scheduler (HWS) adds another layer of scheduling, >> >>> basically round-robin between batches of 24 queues or 8 processes. >> >>> Once you get into such an over-subscribed scenario your performance >> >>> and GPU utilization can suffer quite badly. >> >> >> >>HWS is also implemented in the firmware that's closed-source? >> > >> >Correct - HWS is implemented in the MEC microcode. We also include a >> >simple SW scheduler in the open source driver code, however. >> >> >> >>> >> >> - I noticed the new support of concurrency of multi-processes in >> the archive of this mailing list. Could you point me to the code >> that implements this? >> >>> >> >>> That's basically just a switch that tells the firmware that it is >> >>> allowed to schedule queues from different processes at the same time. >> >>> The upper limit is the number of VMIDs that HWS can work with. It >> >>>
RE: Documentation about AMD's HSA implementation?
>-Original Message- >From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf Of >Bridgman, John >Sent: Tuesday, February 13, 2018 6:42 PM >To: Ming Yang; Kuehling, Felix >Cc: Deucher, Alexander; amd-gfx@lists.freedesktop.org >Subject: RE: Documentation about AMD's HSA implementation? > > > >>-Original Message- >>From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf >>Of Ming Yang >>Sent: Tuesday, February 13, 2018 4:59 PM >>To: Kuehling, Felix >>Cc: Deucher, Alexander; amd-gfx@lists.freedesktop.org >>Subject: Re: Documentation about AMD's HSA implementation? >> >>That's very helpful, thanks! >> >>On Tue, Feb 13, 2018 at 4:17 PM, Felix Kuehling wrote: >>> On 2018-02-13 04:06 PM, Ming Yang wrote: Thanks for the suggestions! But I might ask several specific questions, as I can't find the answer in those documents, to give myself a quick start if that's okay. Pointing me to the files/functions would be good enough. Any explanations are appreciated. My purpose is to hack it with different scheduling policy with real-time and predictability consideration. - Where/How is the packet scheduler implemented? How are packets from multiple queues scheduled? What about scheduling packets from queues in different address spaces? >>> >>> This is done mostly in firmware. The CP engine supports up to 32 queues. >>> We share those between KFD and AMDGPU. KFD gets 24 queues to use. >>> Usually that is 6 queues times 4 pipes. Pipes are threads in the CP >>> micro engine. Within each pipe the queues are time-multiplexed. >> >>Please correct me if I'm wrong. CP is computing processor, like the >>Execution Engine in NVIDIA GPU. Pipe is like wavefront (warp) scheduler >>multiplexing queues, in order to hide memory latency. > >CP is one step back from that - it's a "command processor" which reads >command packets from driver (PM4 format) or application (AQL format) then >manages the execution of each command on the GPU.
A typical packet might >be "dispatch", which initiates a compute operation on an N-dimensional array, >or "draw" which initiates the rendering of an array of triangles. Those >compute and render commands then generate a (typically) large number of >wavefronts which are multiplexed on the shader core (by SQ IIRC). Most of >our recent GPUs have one micro engine for graphics ("ME") and two for >compute ("MEC"). Marketing refers to each pipe on an MEC block as an "ACE". I missed one important point - "CP" refers to the combination of ME, MEC(s) and a few other related blocks. >> >>> >>> If we need more than 24 queues, or if we have more than 8 processes, >>> the hardware scheduler (HWS) adds another layer of scheduling, basically >>> round-robin between batches of 24 queues or 8 processes. Once you get >>> into such an over-subscribed scenario your performance and GPU >>> utilization can suffer quite badly. >> >>HWS is also implemented in the firmware that's closed-source? > >Correct - HWS is implemented in the MEC microcode. We also include a simple >SW scheduler in the open source driver code, however. >> >>> - I noticed the new support of concurrency of multi-processes in the archive of this mailing list. Could you point me to the code that implements this? >>> >>> That's basically just a switch that tells the firmware that it is >>> allowed to schedule queues from different processes at the same time. >>> The upper limit is the number of VMIDs that HWS can work with. It >>> needs to assign a unique VMID to each process (each VMID representing >>> a separate address space, page table, etc.). If there are more >>> processes than VMIDs, the HWS has to time-multiplex. >> >>HWS dispatch packets in their order of becoming the head of the queue, >>i.e., being pointed by the read_index? So in this way it's FIFO. Or >>round-robin between queues? You mentioned round-robin over batches in >>the over- subscribed scenario. > >Round robin between sets of queues.
The HWS logic generates sets as >follows: > >1. "set resources" packet from driver tells scheduler how many VMIDs and >HW queues it can use > >2. "runlist" packet from driver provides list of processes and list of queues >for >each process > >3. if multi-process switch not set, HWS schedules as many queues from the >first process in the runlist as it has HW queues (see #1) > >4. at the end of process quantum (set by driver) either switch to next process >(if all queues from first process have been scheduled) or schedule next set of >queues from the same process > >5. when all queues from all processes have been scheduled and run for a >process quantum, go back to the start of the runlist and repeat > >If the multi-process switch is set, and the number of queues for a process is >less than the number of HW queues available, then in step #3 above HWS will >start scheduling queues for additional processes, using a
RE: Documentation about AMD's HSA implementation?
>-Original Message- >From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf Of >Ming Yang >Sent: Tuesday, February 13, 2018 4:59 PM >To: Kuehling, Felix >Cc: Deucher, Alexander; amd-gfx@lists.freedesktop.org >Subject: Re: Documentation about AMD's HSA implementation? > >That's very helpful, thanks! > >On Tue, Feb 13, 2018 at 4:17 PM, Felix Kuehling wrote: >> On 2018-02-13 04:06 PM, Ming Yang wrote: >>> Thanks for the suggestions! But I might ask several specific >>> questions, as I can't find the answer in those documents, to give >>> myself a quick start if that's okay. Pointing me to the >>> files/functions would be good enough. Any explanations are >>> appreciated. My purpose is to hack it with different scheduling >>> policy with real-time and predictability consideration. >>> >>> - Where/How is the packet scheduler implemented? How are packets >>> from multiple queues scheduled? What about scheduling packets from >>> queues in different address spaces? >> >> This is done mostly in firmware. The CP engine supports up to 32 queues. >> We share those between KFD and AMDGPU. KFD gets 24 queues to use. >> Usually that is 6 queues times 4 pipes. Pipes are threads in the CP >> micro engine. Within each pipe the queues are time-multiplexed. > >Please correct me if I'm wrong. CP is computing processor, like the Execution >Engine in NVIDIA GPU. Pipe is like wavefront (warp) scheduler multiplexing >queues, in order to hide memory latency. CP is one step back from that - it's a "command processor" which reads command packets from driver (PM4 format) or application (AQL format) then manages the execution of each command on the GPU. A typical packet might be "dispatch", which initiates a compute operation on an N-dimensional array, or "draw" which initiates the rendering of an array of triangles. Those compute and render commands then generate a (typically) large number of wavefronts which are multiplexed on the shader core (by SQ IIRC).
Most of our recent GPUs have one micro engine for graphics ("ME") and two for compute ("MEC"). Marketing refers to each pipe on an MEC block as an "ACE". > >> >> If we need more than 24 queues, or if we have more than 8 processes, >> the hardware scheduler (HWS) adds another layer of scheduling, basically >> round-robin between batches of 24 queues or 8 processes. Once you get >> into such an over-subscribed scenario your performance and GPU >> utilization can suffer quite badly. > >HWS is also implemented in the firmware that's closed-source? Correct - HWS is implemented in the MEC microcode. We also include a simple SW scheduler in the open source driver code, however. > >> >>> >>> - I noticed the new support of concurrency of multi-processes in the >>> archive of this mailing list. Could you point me to the code that >>> implements this? >> >> That's basically just a switch that tells the firmware that it is >> allowed to schedule queues from different processes at the same time. >> The upper limit is the number of VMIDs that HWS can work with. It >> needs to assign a unique VMID to each process (each VMID representing >> a separate address space, page table, etc.). If there are more >> processes than VMIDs, the HWS has to time-multiplex. >> >>HWS dispatch packets in their order of becoming the head of the queue, >>i.e., being pointed by the read_index? So in this way it's FIFO. Or >>round-robin between queues? You mentioned round-robin over batches in >>the over- >subscribed scenario. Round robin between sets of queues. The HWS logic generates sets as follows: 1. "set resources" packet from driver tells scheduler how many VMIDs and HW queues it can use 2. "runlist" packet from driver provides list of processes and list of queues for each process 3. if multi-process switch not set, HWS schedules as many queues from the first process in the runlist as it has HW queues (see #1) 4.
at the end of process quantum (set by driver) either switch to next process (if all queues from first process have been scheduled) or schedule next set of queues from the same process 5. when all queues from all processes have been scheduled and run for a process quantum, go back to the start of the runlist and repeat If the multi-process switch is set, and the number of queues for a process is less than the number of HW queues available, then in step #3 above HWS will start scheduling queues for additional processes, using a different VMID for each process, and continue until it either runs out of VMIDs or HW queues (or reaches the end of the runlist). All of the queues and processes would then run together for a process quantum before switching to the next queue set. > >This might not be a big deal for performance, but it matters for predictability >and real-time analysis. Agreed. In general you would not want to overcommit either VMIDs or HW queues in a real-time scenario, and for hard real time you would probably
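The five-step set-building procedure John describes can be sketched as a small simulation (illustrative Python, not the MEC microcode; the function name and data layout are invented): walk the runlist, fill a set up to the HW-queue and VMID limits, and emit one set per process quantum.

```python
# Rough simulation of HWS runlist set generation as described above.
# Simplifications: one VMID per process per set, no preemption timing.

def build_sets(runlist, num_hw_queues, num_vmids, multi_process):
    """runlist: list of (process, [queues]); returns the scheduled sets,
    each set being what runs together for one process quantum."""
    runlist = [(p, list(qs)) for p, qs in runlist]   # work on a copy
    sets = []
    i = 0
    while i < len(runlist):
        current = []
        queues_free = num_hw_queues
        vmids_used = 0
        while i < len(runlist) and queues_free > 0 and vmids_used < num_vmids:
            proc, queues = runlist[i]
            take, rest = queues[:queues_free], queues[queues_free:]
            current.append((proc, take))
            queues_free -= len(take)
            vmids_used += 1
            if rest:
                # Step 4: same process continues in the next set.
                runlist[i] = (proc, rest)
                break
            i += 1
            if not multi_process:
                break   # one process per set unless multi-process is enabled
        sets.append(current)
    return sets

# Two small processes: with the multi-process switch they share one set,
# without it each process gets its own quantum.
rl = [("P0", ["q0", "q1"]), ("P1", ["q2"])]
assert len(build_sets(rl, num_hw_queues=24, num_vmids=8, multi_process=True)) == 1
assert len(build_sets(rl, num_hw_queues=24, num_vmids=8, multi_process=False)) == 2
```

For real-time analysis, the number of sets produced here is effectively the round-robin cycle length, which is why oversubscribing queues or VMIDs directly hurts predictability.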
Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor
On 2018-02-13 02:18 PM, Felix Kuehling wrote: > On 2018-02-13 01:15 PM, Christian König wrote: >> Am 13.02.2018 um 18:18 schrieb Felix Kuehling: >>> On 2018-02-13 12:06 PM, Christian König wrote: [SNIP] Ah, yeah that is also a point I wanted to talk about with you. The approach of using the same buffer object with multiple amdgpu devices doesn't work in general. We need a separate TTM object for each BO in each device or otherwise we break A+A laptops. >>> I think it broke for VRAM BOs because we enabled P2P on systems that >>> didn't support it properly. But at least system memory BOs can be shared >>> quite well between devices and we do it all the time. >> Sharing VRAM BOs is one issue, but the problem goes deeper than just >> that. >> >> Starting with Carrizo we can scanout from system memory to avoid the >> extra copy on A+A laptops. For this to work we need the BO mapped to >> GART (and I mean a real VMID0 mapping, not just in the GTT domain). >> And for this to work in turn we need a TTM object per device and not a >> global one. > I still don't understand. I think what you're talking about applies only > to BOs used for scan-out. Every BO is allocated from a specific device > and can only be GART-mapped on that device. What we do is map the same > BO in VMs on other devices. It has no effect on GART mappings. > >>> I don't see how you can have separate TTM objects referring to the >>> same memory. >> Well that is trivial, we do this all the time with prime and I+A laptops. > As I understand it, you use DMABuf to export/import buffers on multiple > devices. I believe all devices share a single amdgpu_bo, which contains > the ttm_buffer_object. The only way you can have a new TTM buffer object > per device is by using SG tables and pinning the BOs. But I think we > want to avoid pinning BOs. > > What we do does not involve pinning of BOs, even when they're shared > between multiple devices' VMs.
> That is also the reason we had to disable this feature again in the hybrid branches. >>> What you disabled on hybrid branches was P2P, which only affects >>> large-BAR systems. Sharing of system memory BOs between devices still >>> works fine. >> No, it doesn't. It completely breaks any scanout on Carrizo, Stoney >> and Raven. In addition to that we found that it breaks some aspects of >> the user space interface. > Let me check that with my current patch series. The patches I submitted > here shouldn't include anything that breaks the use cases you describe. > But I'm quite sure it will support sharing BOs between multiple devices' > VMs. Confirmed with the current patch series. I can allocate buffers on GPU 1 and map them into a VM on GPU 2. BTW, this is the normal mode of operation for system memory BOs on multi-GPU systems. Logically system memory BOs belong to the CPU node in the KFD topology. So the Thunk API isn't told which GPU to allocate the system memory BO from. It just picks the first one. Then we can map those BOs on any GPUs we want. If we want to use them on GPU2, that's where they get mapped. So I just put two GPUs in a system and ran a test on GPU2. The system memory buffers are allocated from the GPU1 device, but mapped into the GPU2 VM. The tests work normally. If this is enabled by any changes that break existing buffer sharing for A+A or A+I systems, please point it out to me. I'm not aware that this patch series does anything to that effect. Regards, Felix > > Regards, > Felix > >> So the end result is that we probably need to revert it and find a >> different solution. I'm already working on this for a couple of weeks >> now and should have something ready after I'm done with the PASID >> handling. >> >> Regards, >> Christian.
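The mode of operation Felix describes, a system-memory BO allocated from one device but mapped into another device's VM, can be modeled in a few lines (illustrative Python; the class and method names are invented, not the amdgpu/Thunk API):

```python
# Minimal model of allocate-on-one-device, map-into-any-VM, the point
# of disagreement being whether this needs per-device TTM objects.

class BufferObject:
    def __init__(self, owner, size):
        self.owner = owner      # device the BO was allocated from
        self.size = size

class Device:
    def __init__(self, name):
        self.name = name
        self.vm = {}            # GPU virtual address -> BO (per-device page table)

    def alloc_bo(self, size):
        return BufferObject(self, size)

    def vm_map(self, va, bo):
        # Per Felix: VM mapping works regardless of which device owns
        # the BO, and is independent of GART.
        self.vm[va] = bo

gpu1, gpu2 = Device("GPU1"), Device("GPU2")
bo = gpu1.alloc_bo(4096)        # allocated from GPU1 ...
gpu2.vm_map(0x1000, bo)         # ... but mapped into GPU2's VM
assert gpu2.vm[0x1000].owner is gpu1
```

Christian's objection is not captured by this model: it concerns per-device state (GART/VMID0 mappings, placement, busy/idle status) that a single shared object cannot represent, which is why he argues for one TTM object per device.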
Re: Documentation about AMD's HSA implementation?
On 2018-02-13 04:58 PM, Ming Yang wrote: > That's very helpful, thanks! > > On Tue, Feb 13, 2018 at 4:17 PM, Felix Kuehling wrote: >> On 2018-02-13 04:06 PM, Ming Yang wrote: >>> Thanks for the suggestions! But I might ask several specific >>> questions, as I can't find the answer in those documents, to give >>> myself a quick start if that's okay. Pointing me to the >>> files/functions would be good enough. Any explanations are >>> appreciated. My purpose is to hack it with different scheduling >>> policy with real-time and predictability consideration. >>> >>> - Where/How is the packet scheduler implemented? How are packets from >>> multiple queues scheduled? What about scheduling packets from queues >>> in different address spaces? >> This is done mostly in firmware. The CP engine supports up to 32 queues. >> We share those between KFD and AMDGPU. KFD gets 24 queues to use. >> Usually that is 6 queues times 4 pipes. Pipes are threads in the CP >> micro engine. Within each pipe the queues are time-multiplexed. > Please correct me if I'm wrong. CP is computing processor, like the > Execution Engine in NVIDIA GPU. Pipe is like wavefront (warp) > scheduler multiplexing queues, in order to hide memory latency. CP stands for "command processor". There are multiple CP micro engines. CPG for graphics, CPC for compute. This is not related to warps or wave fronts. Those execute on the CUs (compute units). There are many CUs, but only one CPC. The scheduling or dispatching of wavefronts to CUs is yet another level of scheduling that I didn't talk about. Applications submit AQL dispatch packets to user mode queues. The CP processes those dispatch packets and schedules the resulting wavefronts on CUs. > >> If we need more than 24 queues, or if we have more than 8 processes, the >> hardware scheduler (HWS) adds another layer of scheduling, basically >> round-robin between batches of 24 queues or 8 processes.
Once you get >> into such an over-subscribed scenario your performance and GPU >> utilization can suffer quite badly. > HWS is also implemented in the firmware that's closed-source? Yes. > >>> - I noticed the new support of concurrency of multi-processes in the >>> archive of this mailing list. Could you point me to the code that >>> implements this? >> That's basically just a switch that tells the firmware that it is >> allowed to schedule queues from different processes at the same time. >> The upper limit is the number of VMIDs that HWS can work with. It needs >> to assign a unique VMID to each process (each VMID representing a >> separate address space, page table, etc.). If there are more processes >> than VMIDs, the HWS has to time-multiplex. > HWS dispatch packets in their order of becoming the head of the queue, > i.e., being pointed by the read_index? So in this way it's FIFO. Or > round-robin between queues? You mentioned round-robin over batches in > the over-subscribed scenario. Commands within a queue are handled in FIFO order. Commands on different queues are not ordered with respect to each other. When I talk about round robin of batches of queues, it means that the pipes of the CP are executing up to 24 user mode queues. After the time slice is up, the HWS preempts those queues, and loads another batch of queues to the CP pipes. This goes on until all the queues in the runlist have had some time on the GPU. Then the whole process starts over. > > This might not be a big deal for performance, but it matters for > predictability and real-time analysis. > >>> - Also another related question -- where/how is the preemption/context >>> switch between packets/queues implemented? >> As long as you don't oversubscribe the available VMIDs, there is no real >> context switching. Everything can run concurrently. When you start >> oversubscribing HW queues or VMIDs, the HWS firmware will start >> multiplexing.
This is all handled inside the firmware and is quite >> transparent even to KFD. > I see. So the preemption in at least AMD's implementation is not > switching out the executing kernel, but just letting new kernels > run concurrently with the existing ones. This means the performance > is degraded when too many workloads are submitted. The running > kernels leave the GPU only when they are done. No, that's not what I meant. As long as nothing is oversubscribed, you don't have preemptions. But as soon as hardware queues or VMIDs are oversubscribed, the HWS will need to preempt queues in order to let other queues have some time on the hardware. Preempting a queue includes preempting all the wavefronts that were dispatched by that queue. The state of all the CUs is saved and later restored. We call this CWSR (compute wave save/restore). > > Is there any reason for not preempting/switching out the existing > kernel, besides context switch overheads? NVIDIA is not providing > this option either. Non-preemption hurts the real-time property in > terms of
Re: Documentation about AMD's HSA implementation?
That's very helpful, thanks! On Tue, Feb 13, 2018 at 4:17 PM, Felix Kuehling wrote: > On 2018-02-13 04:06 PM, Ming Yang wrote: >> Thanks for the suggestions! But I might ask several specific >> questions, as I can't find the answer in those documents, to give >> myself a quick start if that's okay. Pointing me to the >> files/functions would be good enough. Any explanations are >> appreciated. My purpose is to hack it with different scheduling >> policy with real-time and predictability consideration. >> >> - Where/How is the packet scheduler implemented? How are packets from >> multiple queues scheduled? What about scheduling packets from queues >> in different address spaces? > > This is done mostly in firmware. The CP engine supports up to 32 queues. > We share those between KFD and AMDGPU. KFD gets 24 queues to use. > Usually that is 6 queues times 4 pipes. Pipes are threads in the CP > micro engine. Within each pipe the queues are time-multiplexed. Please correct me if I'm wrong. CP is computing processor, like the Execution Engine in NVIDIA GPU. Pipe is like wavefront (warp) scheduler multiplexing queues, in order to hide memory latency. > > If we need more than 24 queues, or if we have more than 8 processes, the > hardware scheduler (HWS) adds another layer of scheduling, basically > round-robin between batches of 24 queues or 8 processes. Once you get > into such an over-subscribed scenario your performance and GPU > utilization can suffer quite badly. HWS is also implemented in the firmware that's closed-source? > >> >> - I noticed the new support of concurrency of multi-processes in the >> archive of this mailing list. Could you point me to the code that >> implements this? > > That's basically just a switch that tells the firmware that it is > allowed to schedule queues from different processes at the same time. > The upper limit is the number of VMIDs that HWS can work with.
It needs > to assign a unique VMID to each process (each VMID representing a > separate address space, page table, etc.). If there are more processes > than VMIDs, the HWS has to time-multiplex. Does the HWS dispatch packets in the order they become the head of the queue, i.e., the packet pointed to by the read_index? So in this way it's FIFO. Or round-robin between queues? You mentioned round-robin over batches in the over-subscribed scenario. This might not be a big deal for performance, but it matters for predictability and real-time analysis. > >> >> - Also another related question -- where/how is the preemption/context >> switch between packets/queues implemented? > > As long as you don't oversubscribe the available VMIDs, there is no real > context switching. Everything can run concurrently. When you start > oversubscribing HW queues or VMIDs, the HWS firmware will start > multiplexing. This is all handled inside the firmware and is quite > transparent even to KFD. I see. So the preemption in at least AMD's implementation is not switching out the executing kernel, but just letting new kernels run concurrently with the existing ones. This means the performance is degraded when too many workloads are submitted. The running kernels leave the GPU only when they are done. Is there any reason for not preempting/switching out the existing kernel, besides context switch overheads? NVIDIA is not providing this option either. Non-preemption hurts the real-time property in terms of priority inversion. I understand preemption should not be massively used but having such an option may help a lot for real-time systems. > > KFD interacts with the HWS firmware through the HIQ (HSA interface > queue). It supports packets for unmapping queues, and we can send it a new > runlist (basically a bunch of map-process and map-queue packets). The > interesting files to look at are kfd_packet_manager.c, > kfd_kernel_queue_.c and kfd_device_queue_manager.c. 
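Felix's description of the over-subscribed case — the HWS round-robins between batches of queues/processes once the VMID or HW-queue limit is exceeded — can be sketched as a toy user-space model. The limit of 8 comes from the discussion above; the function name and batching math are illustrative, not actual KFD or firmware code:

```c
#include <stddef.h>

#define MAX_VMIDS 8  /* per the discussion: HWS can keep up to 8 processes resident */

/* Return the index of the first process resident on the hardware during
 * time slice `slice`, given `nproc` processes.  Below the VMID limit
 * everything stays resident; above it, the HWS round-robins batches. */
size_t hws_batch_start(size_t nproc, size_t slice)
{
    if (nproc <= MAX_VMIDS)
        return 0;  /* no over-subscription: no multiplexing at all */

    size_t nbatches = (nproc + MAX_VMIDS - 1) / MAX_VMIDS;  /* ceil division */
    return (slice % nbatches) * MAX_VMIDS;
}
```

This also illustrates why Ming's predictability concern is real: which batch a process lands in, and how long each slice lasts, is decided inside the firmware and is not visible to KFD.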
> So in this way, if we want to implement a different scheduling policy, we should control the submission of packets to the queues in the runtime/KFD, before they get to the firmware, because they are out of our control once submitted to the HWS in the firmware. Best, Mark > Regards, > Felix > >> >> Thanks in advance! >> >> Best, >> Mark >> >>> On 13 Feb 2018, at 2:56 PM, Felix Kuehling wrote: >>> There is also this: https://gpuopen.com/professional-compute/, which >>> gives pointers to several libraries and tools built on top of ROCm. >>> >>> Another thing to keep in mind is that ROCm is diverging from the strict >>> HSA standard in some important ways. For example the HSA standard >>> includes HSAIL as an intermediate representation that gets finalized on >>> the target system, whereas ROCm compiles directly to native GPU ISA. >>> >>> Regards, >>> Felix >>> >>> On Tue, Feb 13, 2018 at 9:40 AM, Deucher, Alexander >>> wrote: The ROCm documentation is probably a good place to start:
Re: Documentation about AMD's HSA implementation?
On 2018-02-13 04:06 PM, Ming Yang wrote: > Thanks for the suggestions! But I might ask several specific > questions, as I can't find the answer in those documents, to give > myself a quick start if that's okay. Pointing me to the > files/functions would be good enough. Any explanations are > appreciated. My purpose is to hack it with a different scheduling > policy with real-time and predictability considerations. > > - Where/How is the packet scheduler implemented? How are packets from > multiple queues scheduled? What about scheduling packets from queues > in different address spaces? This is done mostly in firmware. The CP engine supports up to 32 queues. We share those between KFD and AMDGPU. KFD gets 24 queues to use. Usually that is 6 queues times 4 pipes. Pipes are threads in the CP micro engine. Within each pipe the queues are time-multiplexed. If we need more than 24 queues, or if we have more than 8 processes, the hardware scheduler (HWS) adds another layer of scheduling, basically round-robin between batches of 24 queues or 8 processes. Once you get into such an over-subscribed scenario your performance and GPU utilization can suffer quite badly. > > - I noticed the new support of concurrency of multi-processes in the > archive of this mailing list. Could you point me to the code that > implements this? That's basically just a switch that tells the firmware that it is allowed to schedule queues from different processes at the same time. The upper limit is the number of VMIDs that HWS can work with. It needs to assign a unique VMID to each process (each VMID representing a separate address space, page table, etc.). If there are more processes than VMIDs, the HWS has to time-multiplex. > > - Also another related question -- where/how is the preemption/context > switch between packets/queues implemented? As long as you don't oversubscribe the available VMIDs, there is no real context switching. Everything can run concurrently. 
When you start oversubscribing HW queues or VMIDs, the HWS firmware will start multiplexing. This is all handled inside the firmware and is quite transparent even to KFD. KFD interacts with the HWS firmware through the HIQ (HSA interface queue). It supports packets for unmapping queues, we can send it a new runlist (basically a bunch of map-process and map-queue packets). The interesting files to look at are kfd_packet_manager.c, kfd_kernel_queue_.c and kfd_device_queue_manager.c. Regards, Felix > > Thanks in advance! > > Best, > Mark > >> On 13 Feb 2018, at 2:56 PM, Felix Kuehlingwrote: >> There is also this: https://gpuopen.com/professional-compute/, which >> give pointer to several libraries and tools that built on top of ROCm. >> >> Another thing to keep in mind is, that ROCm is diverging from the strict >> HSA standard in some important ways. For example the HSA standard >> includes HSAIL as an intermediate representation that gets finalized on >> the target system, whereas ROCm compiles directly to native GPU ISA. >> >> Regards, >> Felix >> >> On Tue, Feb 13, 2018 at 9:40 AM, Deucher, Alexander >> wrote: >>> The ROCm documentation is probably a good place to start: >>> >>> https://rocm.github.io/documentation.html >>> >>> >>> Alex >>> >>> >>> From: amd-gfx on behalf of Ming Yang >>> >>> Sent: Tuesday, February 13, 2018 12:00 AM >>> To: amd-gfx@lists.freedesktop.org >>> Subject: Documentation about AMD's HSA implementation? >>> >>> Hi, >>> >>> I'm interested in HSA and excited when I found AMD's fully open-stack ROCm >>> supporting it. Before digging into the code, I wonder if there's any >>> documentation available about AMD's HSA implementation, either book, >>> whitepaper, paper, or documentation. >>> >>> I did find helpful materials about HSA, including HSA standards on this page >>> (http://www.hsafoundation.com/standards/) and a nice book about HSA >>> (Heterogeneous System Architecture A New Compute Platform Infrastructure). 
>>> But regarding the documentation about AMD's implementation, I haven't found >>> anything yet. >>> >>> Please let me know if there are ones publicly accessible. If no, any >>> suggestions on learning the implementation of specific system components, >>> e.g., queue scheduling. >>> >>> Best, >>> Mark ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: Documentation about AMD's HSA implementation?
Thanks for the suggestions! But I might ask several specific questions, as I can't find the answer in those documents, to give myself a quick start if that's okay. Pointing me to the files/functions would be good enough. Any explanations are appreciated. My purpose is to hack it with different scheduling policy with real-time and predictability consideration. - Where/How is the packet scheduler implemented? How are packets from multiple queues scheduled? What about scheduling packets from queues in different address spaces? - I noticed the new support of concurrency of multi-processes in the archive of this mailing list. Could you point me to the code that implements this? - Also another related question -- where/how is the preemption/context switch between packets/queues implemented? Thanks in advance! Best, Mark > On 13 Feb 2018, at 2:56 PM, Felix Kuehlingwrote: > There is also this: https://gpuopen.com/professional-compute/, which > give pointer to several libraries and tools that built on top of ROCm. > > Another thing to keep in mind is, that ROCm is diverging from the strict > HSA standard in some important ways. For example the HSA standard > includes HSAIL as an intermediate representation that gets finalized on > the target system, whereas ROCm compiles directly to native GPU ISA. > > Regards, > Felix > > On Tue, Feb 13, 2018 at 9:40 AM, Deucher, Alexander > wrote: > > The ROCm documentation is probably a good place to start: > > > > https://rocm.github.io/documentation.html > > > > > > Alex > > > > > > From: amd-gfx on behalf of Ming Yang > > > > Sent: Tuesday, February 13, 2018 12:00 AM > > To: amd-gfx@lists.freedesktop.org > > Subject: Documentation about AMD's HSA implementation? > > > > Hi, > > > > I'm interested in HSA and excited when I found AMD's fully open-stack ROCm > > supporting it. 
Before digging into the code, I wonder if there's any > > documentation available about AMD's HSA implementation, either book, > > whitepaper, paper, or documentation. > > > > I did find helpful materials about HSA, including HSA standards on this page > > (http://www.hsafoundation.com/standards/) and a nice book about HSA > > (Heterogeneous System Architecture A New Compute Platform Infrastructure). > > But regarding the documentation about AMD's implementation, I haven't found > > anything yet. > > > > Please let me know if there are ones publicly accessible. If no, any > > suggestions on learning the implementation of specific system components, > > e.g., queue scheduling. > > > > Best, > > Mark ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: Documentation about AMD's HSA implementation?
+ Dave Roberts Do you still have links to the HSA doc collected during NMI? From: amd-gfxon behalf of Felix Kuehling Sent: Tuesday, February 13, 2018 2:56:47 PM To: amd-gfx@lists.freedesktop.org Subject: Re: Documentation about AMD's HSA implementation? There is also this: https://gpuopen.com/professional-compute/, which give pointer to several libraries and tools that built on top of ROCm. Another thing to keep in mind is, that ROCm is diverging from the strict HSA standard in some important ways. For example the HSA standard includes HSAIL as an intermediate representation that gets finalized on the target system, whereas ROCm compiles directly to native GPU ISA. Regards, Felix On 2018-02-13 09:40 AM, Deucher, Alexander wrote: > > The ROCm documentation is probably a good place to start: > > https://rocm.github.io/documentation.html > > > Alex > > > *From:* amd-gfx on behalf of > Ming Yang > *Sent:* Tuesday, February 13, 2018 12:00 AM > *To:* amd-gfx@lists.freedesktop.org > *Subject:* Documentation about AMD's HSA implementation? > > Hi, > > I'm interested in HSA and excited when I found AMD's fully open-stack > ROCm supporting it. Before digging into the code, I wonder if there's > any documentation available about AMD's HSA implementation, either > book, whitepaper, paper, or documentation. > > I did find helpful materials about HSA, including HSA standards on > this page (http://www.hsafoundation.com/standards/) and a nice book > about HSA (Heterogeneous System Architecture A New Compute Platform > Infrastructure). But regarding the documentation about AMD's > implementation, I haven't found anything yet. > > Please let me know if there are ones publicly accessible. If no, any > suggestions on learning the implementation of specific system > components, e.g., queue scheduling. 
> > Best, > Mark > > > ___ > amd-gfx mailing list > amd-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: Documentation about AMD's HSA implementation?
There is also this: https://gpuopen.com/professional-compute/, which gives pointers to several libraries and tools built on top of ROCm. Another thing to keep in mind is that ROCm is diverging from the strict HSA standard in some important ways. For example the HSA standard includes HSAIL as an intermediate representation that gets finalized on the target system, whereas ROCm compiles directly to native GPU ISA. Regards, Felix On 2018-02-13 09:40 AM, Deucher, Alexander wrote: > > The ROCm documentation is probably a good place to start: > > https://rocm.github.io/documentation.html > > > Alex > > > *From:* amd-gfx on behalf of > Ming Yang > *Sent:* Tuesday, February 13, 2018 12:00 AM > *To:* amd-gfx@lists.freedesktop.org > *Subject:* Documentation about AMD's HSA implementation? > > Hi, > > I'm interested in HSA and excited when I found AMD's fully open-stack > ROCm supporting it. Before digging into the code, I wonder if there's > any documentation available about AMD's HSA implementation, either > book, whitepaper, paper, or documentation. > > I did find helpful materials about HSA, including HSA standards on > this page (http://www.hsafoundation.com/standards/) and a nice book > about HSA (Heterogeneous System Architecture A New Compute Platform > Infrastructure). But regarding the documentation about AMD's > implementation, I haven't found anything yet. > > Please let me know if there are ones publicly accessible. If no, any > suggestions on learning the implementation of specific system > components, e.g., queue scheduling. > > Best, > Mark > > > ___ > amd-gfx mailing list > amd-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH 4/5] drm/amdgpu/cgs: add refresh rate checking to non-DC display code
Clamp the vblank period to 0 if the refresh rate is larger than 120 hz for non-DC. This allows us to remove the refresh rate checks from powerplay for mclk switching.

Signed-off-by: Alex Deucher
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cgs.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cgs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cgs.c
index 71b4aec7f650..dc3360b16bda 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cgs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cgs.c
@@ -953,6 +953,11 @@ static int amdgpu_cgs_get_active_displays_info(struct cgs_device *cgs_device,
 				(amdgpu_crtc->v_border * 2);
 			mode_info->vblank_time_us = vblank_lines * line_time_us;
 			mode_info->refresh_rate = drm_mode_vrefresh(&amdgpu_crtc->hw_mode);
+			/* we have issues with mclk switching with refresh rates
+			 * over 120 hz on the non-DC code.
+			 */
+			if (mode_info->refresh_rate > 120)
+				mode_info->vblank_time_us = 0;
 			mode_info = NULL;
 		}
 	}
-- 
2.13.6
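The context lines of the hunk show the vblank time being computed as blanking lines times line time. A stand-alone sketch of that arithmetic (generic video-timing math, not the driver's actual code) shows the magnitudes involved:

```c
#include <stdint.h>

/* vblank duration = (vtotal - vdisplay) blanking lines x line time,
 * where line time = htotal / pixel clock.  Integer math as a kernel
 * driver would do it; parameter names are illustrative. */
uint32_t vblank_time_us(uint32_t htotal, uint32_t vtotal, uint32_t vdisplay,
                        uint32_t pixel_clock_khz)
{
    uint32_t line_time_us = (htotal * 1000) / pixel_clock_khz;
    return (vtotal - vdisplay) * line_time_us;
}
```

For a standard 1080p60 CEA timing (htotal 2200, vtotal 1125, vdisplay 1080, pixel clock 148500 kHz) this yields roughly 630 us of vblank — plenty of time for an mclk switch; clamping the value to 0 for >120 Hz modes is what tells powerplay there is no usable window.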
[PATCH 2/5] drm/amd/powerplay/vega10: allow mclk switching with no displays
If there are no displays attached, there is no reason to disable mclk switching.

Signed-off-by: Alex Deucher
---
 drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c
index 6a153ad1b942..1d442a498bf6 100644
--- a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c
+++ b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c
@@ -3171,10 +3171,13 @@ static int vega10_apply_state_adjust_rules(struct pp_hwmgr *hwmgr,
 		PP_CAP(PHM_PlatformCaps_DisableMclkSwitchForVR);
 	force_mclk_high = PP_CAP(PHM_PlatformCaps_ForceMclkHigh);
 
-	disable_mclk_switching = (info.display_count > 1) ||
-				    disable_mclk_switching_for_frame_lock ||
-				    disable_mclk_switching_for_vr ||
-				    force_mclk_high;
+	if (info.display_count == 0)
+		disable_mclk_switching = false;
+	else
+		disable_mclk_switching = (info.display_count > 1) ||
+			disable_mclk_switching_for_frame_lock ||
+			disable_mclk_switching_for_vr ||
+			force_mclk_high;
 
 	sclk = vega10_ps->performance_levels[0].gfx_clock;
 	mclk = vega10_ps->performance_levels[0].mem_clock;
-- 
2.13.6
[PATCH 5/5] drm/amdgpu/powerplay/smu7: drop refresh rate checks for mclk switching
The logic has moved to cgs. mclk switching with DC at higher refresh rates should work.

Signed-off-by: Alex Deucher
Cc: Harry Wentland
---
 drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c b/drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c
index 84600ff6f4de..0202841ae639 100644
--- a/drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c
+++ b/drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c
@@ -2909,8 +2909,7 @@ static int smu7_apply_state_adjust_rules(struct pp_hwmgr *hwmgr,
 	else
 		disable_mclk_switching = ((1 < info.display_count) ||
 				      disable_mclk_switching_for_frame_lock ||
-				      smu7_vblank_too_short(hwmgr, mode_info.vblank_time_us) ||
-				      (mode_info.refresh_rate > 120));
+				      smu7_vblank_too_short(hwmgr, mode_info.vblank_time_us));
 
 	sclk = smu7_ps->performance_levels[0].engine_clock;
 	mclk = smu7_ps->performance_levels[0].memory_clock;
-- 
2.13.6
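How the series fits together can be sketched in user space: the cgs side (patch 4/5) clamps the reported vblank time to 0 above 120 Hz, so the powerplay vblank-too-short test keeps mclk switching disabled for those modes even after the explicit refresh-rate check is dropped here. The switch-latency constant and all names below are illustrative, not the driver's:

```c
#include <stdbool.h>
#include <stdint.h>

#define MCLK_SWITCH_TIME_US 300  /* assumed worst-case mclk switch latency */

/* cgs side (patch 4/5): report zero vblank time above 120 Hz */
static uint32_t cgs_vblank_time_us(uint32_t vblank_time_us, uint32_t refresh_rate)
{
    return (refresh_rate > 120) ? 0 : vblank_time_us;
}

/* powerplay side: no window long enough to hide the switch */
static bool vblank_too_short(uint32_t vblank_time_us)
{
    return vblank_time_us < MCLK_SWITCH_TIME_US;
}

bool disable_mclk_switching(uint32_t display_count, uint32_t vblank_time_us,
                            uint32_t refresh_rate)
{
    if (display_count == 0)
        return false;  /* patches 2/5 and 3/5: no display, nothing to disturb */
    return display_count > 1 ||
           vblank_too_short(cgs_vblank_time_us(vblank_time_us, refresh_rate));
}
```

Note the end result is unchanged for non-DC high-refresh modes, while DC (which does not go through this cgs path) is no longer penalized.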
[PATCH 1/5] drm/amd/powerplay: use PP_CAP macro for disable_mclk_switching_for_frame_lock
Rather than open coding it.

Signed-off-by: Alex Deucher
---
 drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c
index 03a93b0eff38..6a153ad1b942 100644
--- a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c
+++ b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c
@@ -3165,10 +3165,10 @@ static int vega10_apply_state_adjust_rules(struct pp_hwmgr *hwmgr,
 		minimum_clocks.memoryClock = stable_pstate_mclk;
 	}
 
-	disable_mclk_switching_for_frame_lock = phm_cap_enabled(
-			hwmgr->platform_descriptor.platformCaps,
-			PHM_PlatformCaps_DisableMclkSwitchingForFrameLock);
-	disable_mclk_switching_for_vr = PP_CAP(PHM_PlatformCaps_DisableMclkSwitchForVR);
+	disable_mclk_switching_for_frame_lock =
+		PP_CAP(PHM_PlatformCaps_DisableMclkSwitchingForFrameLock);
+	disable_mclk_switching_for_vr =
+		PP_CAP(PHM_PlatformCaps_DisableMclkSwitchForVR);
 	force_mclk_high = PP_CAP(PHM_PlatformCaps_ForceMclkHigh);
 
 	disable_mclk_switching = (info.display_count > 1) ||
-- 
2.13.6
[PATCH 3/5] drm/amd/powerplay/smu7: allow mclk switching with no displays
If there are no displays attached, there is no reason to disable mclk switching.

Signed-off-by: Alex Deucher
---
 drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c b/drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c
index 937971361b65..84600ff6f4de 100644
--- a/drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c
+++ b/drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c
@@ -2904,10 +2904,13 @@ static int smu7_apply_state_adjust_rules(struct pp_hwmgr *hwmgr,
 			PHM_PlatformCaps_DisableMclkSwitchingForFrameLock);
 
-	disable_mclk_switching = ((1 < info.display_count) ||
-				  disable_mclk_switching_for_frame_lock ||
-				  smu7_vblank_too_short(hwmgr, mode_info.vblank_time_us) ||
-				  (mode_info.refresh_rate > 120));
+	if (info.display_count == 0)
+		disable_mclk_switching = false;
+	else
+		disable_mclk_switching = ((1 < info.display_count) ||
+					  disable_mclk_switching_for_frame_lock ||
+					  smu7_vblank_too_short(hwmgr, mode_info.vblank_time_us) ||
+					  (mode_info.refresh_rate > 120));
 
 	sclk = smu7_ps->performance_levels[0].engine_clock;
 	mclk = smu7_ps->performance_levels[0].memory_clock;
-- 
2.13.6
Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor
On 2018-02-13 01:45 PM, Christian König wrote: > Am 13.02.2018 um 17:56 schrieb Felix Kuehling: >> [SNIP] >> Each process gets a whole page of the doorbell aperture assigned to it. >> The assumption is that amdgpu only uses the first page of the doorbell >> aperture, so KFD uses all the rest. On GFX8 and before, the queue ID is >> used as the offset into the doorbell page. On GFX9 the hardware does >> some engine-specific doorbell routing, so we added another layer of >> doorbell management that's decoupled from the queue ID. >> >> Either way, an entire doorbell page gets mapped into user mode and user >> mode knows the offset of the doorbells for specific queues. The mapping >> is currently handled by kfd_mmap in kfd_chardev.c. > > Ok, wait a second. Taking a look at kfd_doorbell_mmap() it almost > looks like you map different doorbells with the same offset depending > on which process is calling this. > > Is that correct? If yes then that would be illegal and a problem if > I'm not completely mistaken. Why is that a problem? Each process has its own file descriptor. The mapping is done using io_remap_pfn_range in kfd_doorbell_mmap. This is nothing new. It's been done like this forever even on Kaveri and Carrizo. > >>> Do you simply assume that after evicting a process it always needs to >>> be restarted without checking if it actually does something? Or how >>> does that work? >> Exactly. > > Ok, understood. Well that limits the usefulness of the whole eviction > drastically. > >> With later addition of GPU self-dispatch a page-fault based >> mechanism wouldn't work any more. We have to restart the queues blindly >> with a timer. See evict_process_worker, which schedules the restore with >> a delayed worker. >> >> The user mode queue ABI specifies that user mode updates both the >> doorbell and a WPTR in memory. 
When we restart queues we (or the CP >> firmware) use the WPTR to make sure we catch up with any work that was >> submitted while the queues were unmapped. > > Putting cross process work dispatch aside for a moment GPU > self-dispatch works only when there is work on the GPU running. > > So you can still check if there is some work pending after you > unmapped everything and only restart the queues when there is new work > based on the page fault. > > In other words either there is work pending and it doesn't matter if > it was sent by the GPU or by the CPU or there is no work pending and > we can delay restarting everything until there is. That sounds like a useful optimization. Regards, Felix > > Regards, > Christian. > >> >> Regards, >> Felix >> >>> Regards, >>> Christian.
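The doorbell-plus-WPTR protocol Felix describes can be modeled in a few lines of user-space C. The point is that the WPTR update in memory survives an eviction even though doorbell writes while the queue is unmapped go nowhere, so the restore catches up automatically. All names here are illustrative, not the real ABI:

```c
#include <stdbool.h>
#include <stdint.h>

struct user_queue {
    uint64_t wptr;    /* write pointer, updated by user mode in memory */
    uint64_t rptr;    /* read pointer, advanced by the "firmware" */
    bool     mapped;  /* queue currently mapped on the hardware */
};

/* User mode: submit one packet.  The WPTR update in memory always
 * sticks; the doorbell write is simply lost while unmapped. */
void submit_packet(struct user_queue *q)
{
    q->wptr++;
    /* doorbell MMIO write would go here; harmless no-op if unmapped */
}

/* "Firmware": on (re)map after an eviction, reload the WPTR from
 * memory and process everything submitted in the meantime. */
void map_and_catch_up(struct user_queue *q)
{
    q->mapped = true;
    q->rptr = q->wptr;
}
```

Christian's suggested optimization corresponds to checking `wptr != rptr` before bothering to remap, instead of restarting blindly on a timer.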
Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor
On 2018-02-13 01:15 PM, Christian König wrote: > Am 13.02.2018 um 18:18 schrieb Felix Kuehling: >> On 2018-02-13 12:06 PM, Christian König wrote: >>> [SNIP] >>> Ah, yeah that is also a point I wanted to talk about with you. >>> >>> The approach of using the same buffer object with multiple amdgpu >>> devices doesn't work in general. >>> >>> We need a separate TTM object for each BO in each device or otherwise we >>> break A+A laptops. >> I think it broke for VRAM BOs because we enabled P2P on systems that >> didn't support it properly. But at least system memory BOs can be shared >> quite well between devices and we do it all the time. > > Sharing VRAM BOs is one issue, but the problem goes deeper than just > that. > > Starting with Carrizo we can scan out from system memory to avoid the > extra copy on A+A laptops. For this to work we need the BO mapped to > GART (and I mean a real VMID0 mapping, not just in the GTT domain). > And for this to work in turn we need a TTM object per device and not a > global one. I still don't understand. I think what you're talking about applies only to BOs used for scan-out. Every BO is allocated from a specific device and can only be GART-mapped on that device. What we do is map the same BO in VMs on other devices. It has no effect on GART mappings. > >> I don't see how you can have separate TTM objects referring to the >> same memory. > > Well that is trivial, we do this all the time with prime and I+A laptops. As I understand it, you use DMABuf to export/import buffers on multiple devices. I believe all devices share a single amdgpu_bo, which contains the ttm_buffer_object. The only way you can have a new TTM buffer object per device is by using SG tables and pinning the BOs. But I think we want to avoid pinning BOs. What we do does not involve pinning of BOs, even when they're shared between multiple devices' VMs. > >>> That is also the reason we had to disable this feature again in the >>> hybrid branches. 
>> What you disabled on hybrid branches was P2P, which only affects >> large-BAR systems. Sharing of system memory BOs between devices still >> works fine. > > No, it doesn't. It completely breaks any scanout on Carrizo, Stoney > and Raven. In addition to that we found that it breaks some aspects of > the user space interface. Let me check that with my current patch series. The patches I submitted here shouldn't include anything that breaks the use cases you describe. But I'm quite sure it will support sharing BOs between multiple devices' VMs. Regards, Felix > > So the end result is that we probably need to revert it and find a > different solution. I'm already working on this for a couple of weeks > now and should have something ready after I'm done with the PASID > handling. > > Regards, > Christian.
Re: [PATCH xf86-video-amdgpu 2/2] Don't call AMDGPUFreeRec from AMDGPUPreInit_KMS
On Tue, Feb 13, 2018 at 1:12 PM, Michel Dänzerwrote: > From: Michel Dänzer > > If the latter fails, Xorg will call AMDGPUFreeScreen_KMS, which calls > the former. > > Signed-off-by: Michel Dänzer Series is: Reviewed-by: Alex Deucher > --- > src/amdgpu_kms.c | 22 +- > 1 file changed, 9 insertions(+), 13 deletions(-) > > diff --git a/src/amdgpu_kms.c b/src/amdgpu_kms.c > index 43c18d426..f5874d3af 100644 > --- a/src/amdgpu_kms.c > +++ b/src/amdgpu_kms.c > @@ -1352,7 +1352,7 @@ Bool AMDGPUPreInit_KMS(ScrnInfoPtr pScrn, int flags) > && info->pEnt->location.type != BUS_PLATFORM > #endif > ) > - goto fail; > + return FALSE; > > pPriv = xf86GetEntityPrivate(pScrn->entityList[0], > getAMDGPUEntityIndex()); > @@ -1375,23 +1375,23 @@ Bool AMDGPUPreInit_KMS(ScrnInfoPtr pScrn, int flags) > pScrn->monitor = pScrn->confScreen->monitor; > > if (!AMDGPUPreInitVisual(pScrn)) > - goto fail; > + return FALSE; > > xf86CollectOptions(pScrn, NULL); > if (!(info->Options = malloc(sizeof(AMDGPUOptions_KMS > - goto fail; > + return FALSE; > > memcpy(info->Options, AMDGPUOptions_KMS, sizeof(AMDGPUOptions_KMS)); > xf86ProcessOptions(pScrn->scrnIndex, pScrn->options, info->Options); > > if (!AMDGPUPreInitWeight(pScrn)) > - goto fail; > + return FALSE; > > memset(_info, 0, sizeof(gpu_info)); > amdgpu_query_gpu_info(pAMDGPUEnt->pDev, _info); > > if (!AMDGPUPreInitChipType_KMS(pScrn, _info)) > - goto fail; > + return FALSE; > > info->dri2.available = FALSE; > info->dri2.enabled = FALSE; > @@ -1399,7 +1399,7 @@ Bool AMDGPUPreInit_KMS(ScrnInfoPtr pScrn, int flags) > if (info->dri2.pKernelDRMVersion == NULL) { > xf86DrvMsg(pScrn->scrnIndex, X_ERROR, >"AMDGPUDRIGetVersion failed to get the DRM > version\n"); > - goto fail; > + return FALSE; > } > > /* Get ScreenInit function */ > @@ -1407,7 +1407,7 @@ Bool AMDGPUPreInit_KMS(ScrnInfoPtr pScrn, int flags) > return FALSE; > > if (!AMDGPUPreInitAccel_KMS(pScrn)) > - goto fail; > + return FALSE; > > amdgpu_drm_queue_init(); > > @@ -1466,7 +1466,7 @@ Bool 
AMDGPUPreInit_KMS(ScrnInfoPtr pScrn, int flags) > FALSE) { > xf86DrvMsg(pScrn->scrnIndex, X_ERROR, >"Kernel modesetting setup failed\n"); > - goto fail; > + return FALSE; > } > > AMDGPUSetupCapabilities(pScrn); > @@ -1518,14 +1518,10 @@ Bool AMDGPUPreInit_KMS(ScrnInfoPtr pScrn, int flags) > #endif > ) { > xf86DrvMsg(pScrn->scrnIndex, X_ERROR, "No modes.\n"); > - goto fail; > + return FALSE; > } > > return TRUE; > -fail: > - AMDGPUFreeRec(pScrn); > - return FALSE; > - > } > > static Bool AMDGPUCursorInit_KMS(ScreenPtr pScreen) > -- > 2.16.1 > > ___ > amd-gfx mailing list > amd-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH xf86-video-amdgpu 4/4] Always use screen depth/bpp for KMS framebuffers
On Tue, Feb 13, 2018 at 12:53 PM, Michel Dänzerwrote: > From: Michel Dänzer > > DRI clients can use depth 32 pixmaps while the screen is depth 24, in > which case page flipping would fail. > > Reported-by: Mario Kleiner > (Ported from radeon commit 733f606dd6ca8350e6e7f0858bfff5454ddc98ed) > > Signed-off-by: Michel Dänzer Series is: Reviewed-by: Alex Deucher > --- > src/amdgpu_pixmap.h | 13 ++--- > src/drmmode_display.c | 5 ++--- > 2 files changed, 8 insertions(+), 10 deletions(-) > > diff --git a/src/amdgpu_pixmap.h b/src/amdgpu_pixmap.h > index eded17037..d744ca573 100644 > --- a/src/amdgpu_pixmap.h > +++ b/src/amdgpu_pixmap.h > @@ -104,8 +104,8 @@ static inline struct amdgpu_buffer > *amdgpu_get_pixmap_bo(PixmapPtr pPix) > } > > static inline struct drmmode_fb* > -amdgpu_fb_create(int drm_fd, uint32_t width, uint32_t height, uint8_t depth, > -uint8_t bpp, uint32_t pitch, uint32_t handle) > +amdgpu_fb_create(ScrnInfoPtr scrn, int drm_fd, uint32_t width, uint32_t > height, > +uint32_t pitch, uint32_t handle) > { > struct drmmode_fb *fb = malloc(sizeof(*fb)); > > @@ -113,8 +113,8 @@ amdgpu_fb_create(int drm_fd, uint32_t width, uint32_t > height, uint8_t depth, > return NULL; > > fb->refcnt = 1; > - if (drmModeAddFB(drm_fd, width, height, depth, bpp, pitch, handle, > ->handle) == 0) > + if (drmModeAddFB(drm_fd, width, height, scrn->depth, > scrn->bitsPerPixel, > +pitch, handle, >handle) == 0) > return fb; > > free(fb); > @@ -154,9 +154,8 @@ amdgpu_pixmap_get_fb(PixmapPtr pix) > ScrnInfoPtr scrn = > xf86ScreenToScrn(pix->drawable.pScreen); > AMDGPUEntPtr pAMDGPUEnt = AMDGPUEntPriv(scrn); > > - *fb_ptr = amdgpu_fb_create(pAMDGPUEnt->fd, > pix->drawable.width, > - pix->drawable.height, > pix->drawable.depth, > - > pix->drawable.bitsPerPixel, pix->devKind, > + *fb_ptr = amdgpu_fb_create(scrn, pAMDGPUEnt->fd, > pix->drawable.width, > + pix->drawable.height, > pix->devKind, >handle); > } > } > diff --git a/src/drmmode_display.c b/src/drmmode_display.c > index 
e52b7a355..8ccbf735d 100644 > --- a/src/drmmode_display.c > +++ b/src/drmmode_display.c > @@ -885,9 +885,8 @@ drmmode_set_mode_major(xf86CrtcPtr crtc, DisplayModePtr > mode, > union gbm_bo_handle bo_handle; > > bo_handle = > gbm_bo_get_handle(info->front_buffer->bo.gbm); > - fb = amdgpu_fb_create(pAMDGPUEnt->fd, pScrn->virtualX, > - pScrn->virtualY, pScrn->depth, > - pScrn->bitsPerPixel, > + fb = amdgpu_fb_create(pScrn, pAMDGPUEnt->fd, > + pScrn->virtualX, > pScrn->virtualY, > pScrn->displayWidth * > info->pixel_bytes, > bo_handle.u32); > /* Prevent refcnt of ad-hoc FBs from reaching 2 */ > -- > 2.16.1 > > ___ > amd-gfx mailing list > amd-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor
Am 13.02.2018 um 17:56 schrieb Felix Kuehling: [SNIP] Each process gets a whole page of the doorbell aperture assigned to it. The assumption is that amdgpu only uses the first page of the doorbell aperture, so KFD uses all the rest. On GFX8 and before, the queue ID is used as the offset into the doorbell page. On GFX9 the hardware does some engine-specific doorbell routing, so we added another layer of doorbell management that's decoupled from the queue ID. Either way, an entire doorbell page gets mapped into user mode and user mode knows the offset of the doorbells for specific queues. The mapping is currently handled by kfd_mmap in kfd_chardev.c. Ok, wait a second. Taking a look at kfd_doorbell_mmap() it almost looks like you map different doorbells with the same offset depending on which process is calling this. Is that correct? If yes then that would be illegal and a problem if I'm not completely mistaken. Do you simply assume that after evicting a process it always needs to be restarted without checking if it actually does something? Or how does that work? Exactly. Ok, understood. Well that limits the usefulness of the whole eviction drastically. With later addition of GPU self-dispatch a page-fault based mechanism wouldn't work any more. We have to restart the queues blindly with a timer. See evict_process_worker, which schedules the restore with a delayed worker. The user mode queue ABI specifies that user mode updates both the doorbell and a WPTR in memory. When we restart queues we (or the CP firmware) use the WPTR to make sure we catch up with any work that was submitted while the queues were unmapped. Putting cross-process work dispatch aside for a moment, GPU self-dispatch works only when there is work running on the GPU. So you can still check if there is some work pending after you unmapped everything and only restart the queues when there is new work based on the page fault.
In other words, either there is work pending and it doesn't matter whether it was sent by the GPU or by the CPU, or there is no work pending and we can delay restarting everything until there is. Regards, Christian. Regards, Felix Regards, Christian.
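Christian's suggestion above — only restart queues that actually have pending work — essentially comes down to comparing each queue's write pointer against its read pointer. A minimal sketch of that check, with illustrative struct and field names rather than the real KFD data structures:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch of the "only restart when work is pending" idea:
 * a user mode queue is idle when the read pointer has caught up with
 * the write pointer.  Names and fields are illustrative only. */
struct user_queue {
    uint64_t wptr;  /* written by user mode (CPU or GPU self-dispatch) */
    uint64_t rptr;  /* last packet consumed by the CP firmware */
};

static bool queue_has_pending_work(const struct user_queue *q)
{
    /* With monotonic 64-bit pointers, work is pending iff wptr is ahead. */
    return q->wptr != q->rptr;
}

/* Count queues that actually need a restart; idle ones could stay
 * unmapped until a doorbell write or page fault signals new work. */
static int count_queues_to_restart(const struct user_queue *qs, int n)
{
    int pending = 0;
    for (int i = 0; i < n; ++i)
        if (queue_has_pending_work(&qs[i]))
            ++pending;
    return pending;
}
```

Note this matches the WPTR-in-memory part of the queue ABI mentioned above: because user mode mirrors every doorbell write into a WPTR in memory, the driver can inspect that mirror without touching the doorbell hardware.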
Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor
Am 13.02.2018 um 18:18 schrieb Felix Kuehling: On 2018-02-13 12:06 PM, Christian König wrote: [SNIP] Ah, yeah that is also a point I wanted to talk about with you. The approach of using the same buffer object with multiple amdgpu devices doesn't work in general. We need a separate TTM object for each BO in each device or otherwise we break A+A laptops. I think it broke for VRAM BOs because we enabled P2P on systems that didn't support it properly. But at least system memory BOs can be shared quite well between devices and we do it all the time. Sharing VRAM BOs is one issue, but the problem goes deeper than just that. Starting with Carrizo we can scan out from system memory to avoid the extra copy on A+A laptops. For this to work we need the BO mapped to GART (and I mean a real VMID0 mapping, not just in the GTT domain). And for this to work in turn we need a TTM object per device and not a global one. I don't see how you can have separate TTM objects referring to the same memory. Well that is trivial, we do this all the time with prime and I+A laptops. That is also the reason we had to disable this feature again in the hybrid branches. What you disabled on hybrid branches was P2P, which only affects large-BAR systems. Sharing of system memory BOs between devices still works fine. No, it doesn't. It completely breaks any scanout on Carrizo, Stoney and Raven. In addition we found that it breaks some aspects of the user space interface. So the end result is that we probably need to revert it and find a different solution. I'm already working on this for a couple of weeks now and should have something ready after I'm done with the PASID handling. Regards, Christian.
[PATCH xf86-video-amdgpu 2/2] Don't call AMDGPUFreeRec from AMDGPUPreInit_KMS
From: Michel Dänzer If the latter fails, Xorg will call AMDGPUFreeScreen_KMS, which calls the former. Signed-off-by: Michel Dänzer --- src/amdgpu_kms.c | 22 +- 1 file changed, 9 insertions(+), 13 deletions(-) diff --git a/src/amdgpu_kms.c b/src/amdgpu_kms.c index 43c18d426..f5874d3af 100644 --- a/src/amdgpu_kms.c +++ b/src/amdgpu_kms.c @@ -1352,7 +1352,7 @@ Bool AMDGPUPreInit_KMS(ScrnInfoPtr pScrn, int flags) && info->pEnt->location.type != BUS_PLATFORM #endif ) - goto fail; + return FALSE; pPriv = xf86GetEntityPrivate(pScrn->entityList[0], getAMDGPUEntityIndex()); @@ -1375,23 +1375,23 @@ Bool AMDGPUPreInit_KMS(ScrnInfoPtr pScrn, int flags) pScrn->monitor = pScrn->confScreen->monitor; if (!AMDGPUPreInitVisual(pScrn)) - goto fail; + return FALSE; xf86CollectOptions(pScrn, NULL); if (!(info->Options = malloc(sizeof(AMDGPUOptions_KMS)))) - goto fail; + return FALSE; memcpy(info->Options, AMDGPUOptions_KMS, sizeof(AMDGPUOptions_KMS)); xf86ProcessOptions(pScrn->scrnIndex, pScrn->options, info->Options); if (!AMDGPUPreInitWeight(pScrn)) - goto fail; + return FALSE; memset(&gpu_info, 0, sizeof(gpu_info)); amdgpu_query_gpu_info(pAMDGPUEnt->pDev, &gpu_info); if (!AMDGPUPreInitChipType_KMS(pScrn, &gpu_info)) - goto fail; + return FALSE; info->dri2.available = FALSE; info->dri2.enabled = FALSE; @@ -1399,7 +1399,7 @@ if (info->dri2.pKernelDRMVersion == NULL) { xf86DrvMsg(pScrn->scrnIndex, X_ERROR, "AMDGPUDRIGetVersion failed to get the DRM version\n"); - goto fail; + return FALSE; } /* Get ScreenInit function */ @@ -1407,7 +1407,7 @@ return FALSE; if (!AMDGPUPreInitAccel_KMS(pScrn)) - goto fail; + return FALSE; amdgpu_drm_queue_init(); @@ -1466,7 +1466,7 @@ FALSE) { xf86DrvMsg(pScrn->scrnIndex, X_ERROR, "Kernel modesetting setup failed\n"); - goto fail; + return FALSE; } AMDGPUSetupCapabilities(pScrn); @@ -1518,14 +1518,10 @@ Bool
AMDGPUPreInit_KMS(ScrnInfoPtr pScrn, int flags) #endif ) { xf86DrvMsg(pScrn->scrnIndex, X_ERROR, "No modes.\n"); - goto fail; + return FALSE; } return TRUE; -fail: - AMDGPUFreeRec(pScrn); - return FALSE; - } static Bool AMDGPUCursorInit_KMS(ScreenPtr pScreen) -- 2.16.1
[PATCH xf86-video-amdgpu 1/2] Guard against pAMDGPUEnt == NULL in AMDGPUFreeRec
From: Michel Dänzer This can happen if PreInit fails early. Signed-off-by: Michel Dänzer --- src/amdgpu_kms.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/amdgpu_kms.c b/src/amdgpu_kms.c index 7dc9e22a9..43c18d426 100644 --- a/src/amdgpu_kms.c +++ b/src/amdgpu_kms.c @@ -137,7 +137,7 @@ static void AMDGPUFreeRec(ScrnInfoPtr pScrn) pPriv = xf86GetEntityPrivate(pEnt->index, gAMDGPUEntityIndex); pAMDGPUEnt = pPriv->ptr; - if (pAMDGPUEnt->fd > 0) { + if (pAMDGPUEnt && pAMDGPUEnt->fd > 0) { DevUnion *pPriv; AMDGPUEntPtr pAMDGPUEnt; pPriv = xf86GetEntityPrivate(pScrn->entityList[0], -- 2.16.1
[PATCH xf86-video-amdgpu 4/4] Always use screen depth/bpp for KMS framebuffers
From: Michel Dänzer DRI clients can use depth 32 pixmaps while the screen is depth 24, in which case page flipping would fail. Reported-by: Mario Kleiner (Ported from radeon commit 733f606dd6ca8350e6e7f0858bfff5454ddc98ed) Signed-off-by: Michel Dänzer --- src/amdgpu_pixmap.h | 13 ++--- src/drmmode_display.c | 5 ++--- 2 files changed, 8 insertions(+), 10 deletions(-) diff --git a/src/amdgpu_pixmap.h b/src/amdgpu_pixmap.h index eded17037..d744ca573 100644 --- a/src/amdgpu_pixmap.h +++ b/src/amdgpu_pixmap.h @@ -104,8 +104,8 @@ static inline struct amdgpu_buffer *amdgpu_get_pixmap_bo(PixmapPtr pPix) } static inline struct drmmode_fb* -amdgpu_fb_create(int drm_fd, uint32_t width, uint32_t height, uint8_t depth, -uint8_t bpp, uint32_t pitch, uint32_t handle) +amdgpu_fb_create(ScrnInfoPtr scrn, int drm_fd, uint32_t width, uint32_t height, +uint32_t pitch, uint32_t handle) { struct drmmode_fb *fb = malloc(sizeof(*fb)); @@ -113,8 +113,8 @@ amdgpu_fb_create(int drm_fd, uint32_t width, uint32_t height, uint8_t depth, return NULL; fb->refcnt = 1; - if (drmModeAddFB(drm_fd, width, height, depth, bpp, pitch, handle, -&fb->handle) == 0) + if (drmModeAddFB(drm_fd, width, height, scrn->depth, scrn->bitsPerPixel, +pitch, handle, &fb->handle) == 0) return fb; free(fb); @@ -154,9 +154,8 @@ amdgpu_pixmap_get_fb(PixmapPtr pix) ScrnInfoPtr scrn = xf86ScreenToScrn(pix->drawable.pScreen); AMDGPUEntPtr pAMDGPUEnt = AMDGPUEntPriv(scrn); - *fb_ptr = amdgpu_fb_create(pAMDGPUEnt->fd, pix->drawable.width, - pix->drawable.height, pix->drawable.depth, - pix->drawable.bitsPerPixel, pix->devKind, + *fb_ptr = amdgpu_fb_create(scrn, pAMDGPUEnt->fd, pix->drawable.width, + pix->drawable.height, pix->devKind, handle); } } diff --git a/src/drmmode_display.c b/src/drmmode_display.c index e52b7a355..8ccbf735d 100644 --- a/src/drmmode_display.c +++ b/src/drmmode_display.c @@ -885,9 +885,8 @@ drmmode_set_mode_major(xf86CrtcPtr crtc, DisplayModePtr mode, union gbm_bo_handle bo_handle; bo_handle =
gbm_bo_get_handle(info->front_buffer->bo.gbm); - fb = amdgpu_fb_create(pAMDGPUEnt->fd, pScrn->virtualX, - pScrn->virtualY, pScrn->depth, - pScrn->bitsPerPixel, + fb = amdgpu_fb_create(pScrn, pAMDGPUEnt->fd, + pScrn->virtualX, pScrn->virtualY, pScrn->displayWidth * info->pixel_bytes, bo_handle.u32); /* Prevent refcnt of ad-hoc FBs from reaching 2 */ -- 2.16.1
[PATCH xf86-video-amdgpu 1/4] Define per x-screen individual drmmode_crtc_funcs
From: Mario Kleiner This allows en-/disabling some functions depending on individual screen settings. Prep work for more efficient depth 30 support. Suggested-by: Michel Dänzer Signed-off-by: Mario Kleiner (Ported from radeon commit 21f6753462464acfd3c452393328c977a375ce26) Signed-off-by: Michel Dänzer --- src/amdgpu_drv.h | 1 + src/drmmode_display.c | 14 ++ 2 files changed, 11 insertions(+), 4 deletions(-) diff --git a/src/amdgpu_drv.h b/src/amdgpu_drv.h index f60b192da..40157262e 100644 --- a/src/amdgpu_drv.h +++ b/src/amdgpu_drv.h @@ -340,6 +340,7 @@ typedef struct { SetSharedPixmapBackingProcPtr SavedSetSharedPixmapBacking; } glamor; + xf86CrtcFuncsRec drmmode_crtc_funcs; } AMDGPUInfoRec, *AMDGPUInfoPtr; diff --git a/src/drmmode_display.c b/src/drmmode_display.c index 047179449..be3deef20 100644 --- a/src/drmmode_display.c +++ b/src/drmmode_display.c @@ -1344,8 +1344,9 @@ drmmode_crtc_init(ScrnInfoPtr pScrn, drmmode_ptr drmmode, drmModeResPtr mode_res xf86CrtcPtr crtc; drmmode_crtc_private_ptr drmmode_crtc; AMDGPUEntPtr pAMDGPUEnt = AMDGPUEntPriv(pScrn); + AMDGPUInfoPtr info = AMDGPUPTR(pScrn); - crtc = xf86CrtcCreate(pScrn, &drmmode_crtc_funcs); + crtc = xf86CrtcCreate(pScrn, &info->drmmode_crtc_funcs); if (crtc == NULL) return 0; @@ -2348,11 +2349,16 @@ Bool drmmode_pre_init(ScrnInfoPtr pScrn, drmmode_ptr drmmode, int cpp) xf86DrvMsgVerb(pScrn->scrnIndex, X_INFO, AMDGPU_LOGLEVEL_DEBUG, "%d crtcs needed for screen.\n", crtcs_needed); + /* Need per-screen drmmode_crtc_funcs, based on our global template,
+* so we can disable some functions, depending on screen settings. +*/ + info->drmmode_crtc_funcs = drmmode_crtc_funcs; + if (!info->use_glamor) { /* Rotation requires hardware acceleration */ - drmmode_crtc_funcs.shadow_allocate = NULL; - drmmode_crtc_funcs.shadow_create = NULL; - drmmode_crtc_funcs.shadow_destroy = NULL; + info->drmmode_crtc_funcs.shadow_allocate = NULL; + info->drmmode_crtc_funcs.shadow_create = NULL; + info->drmmode_crtc_funcs.shadow_destroy = NULL; } for (i = 0; i < mode_res->count_crtcs; i++) -- 2.16.1
[PATCH xf86-video-amdgpu 3/4] Add 30bit RGB color format support
From: Hawking Zhang Signed-off-by: Hawking Zhang [ Michel Dänzer: * Require Xorg >= 1.19.99.1 for depth 30, otherwise it can't work with glamor * Update manpage, per radeon commit 574bfab4bf1fcd95163a8f33cea2889189429d30 ] Signed-off-by: Michel Dänzer --- man/amdgpu.man | 2 +- src/amdgpu_bo_helper.c | 2 ++ src/amdgpu_dri2.c | 1 + src/amdgpu_kms.c | 8 4 files changed, 12 insertions(+), 1 deletion(-) diff --git a/man/amdgpu.man b/man/amdgpu.man index 0d9dd93d9..390f4837b 100644 --- a/man/amdgpu.man +++ b/man/amdgpu.man @@ -18,7 +18,7 @@ following features: .PD 0 .TP 2 \(bu -Support for 24-bit pixel depth; +Support for 24- and 30-bit pixel depths; .TP \(bu RandR support up to version 1.4; diff --git a/src/amdgpu_bo_helper.c b/src/amdgpu_bo_helper.c index ba2212228..34880ff12 100644 --- a/src/amdgpu_bo_helper.c +++ b/src/amdgpu_bo_helper.c @@ -42,6 +42,8 @@ amdgpu_get_gbm_format(int depth, int bitsPerPixel) return GBM_FORMAT_RGB565; case 32: return GBM_FORMAT_ARGB8888; + case 30: + return GBM_FORMAT_XRGB2101010; case 24: if (bitsPerPixel == 32) return GBM_FORMAT_XRGB8888; diff --git a/src/amdgpu_dri2.c b/src/amdgpu_dri2.c index 4ffa34676..4a0f8bf24 100644 --- a/src/amdgpu_dri2.c +++ b/src/amdgpu_dri2.c @@ -120,6 +120,7 @@ amdgpu_dri2_create_buffer2(ScreenPtr pScreen, cpp = 2; break; case 24: + case 30: cpp = 4; break; default: diff --git a/src/amdgpu_kms.c b/src/amdgpu_kms.c index 49044f514..7dc9e22a9 100644 --- a/src/amdgpu_kms.c +++ b/src/amdgpu_kms.c @@ -1085,6 +1085,14 @@ static Bool AMDGPUPreInitVisual(ScrnInfoPtr pScrn) case 24: break; + case 30: + if (xorgGetVersion() < XORG_VERSION_NUMERIC(1,19,99,1,0)) { + xf86DrvMsg(pScrn->scrnIndex, X_ERROR, + "Depth 30 requires Xorg >= 1.19.99.1\n"); + return FALSE; + } + break; + default: xf86DrvMsg(pScrn->scrnIndex, X_ERROR, "Given depth (%d) is not supported by %s driver\n", -- 2.16.1
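The depth-to-format switch this patch extends in amdgpu_get_gbm_format() can be illustrated stand-alone. The following sketch mirrors that mapping without depending on gbm.h; the fourcc codes are spelled out by hand, and the depth 24 / non-32-bpp behavior (returning 0) is an assumption of this sketch rather than the driver's exact fallback:

```c
#include <stdint.h>

/* GBM/DRM formats are fourcc codes: four ASCII bytes, little-endian. */
static uint32_t fourcc(char a, char b, char c, char d)
{
    return (uint32_t)a | ((uint32_t)b << 8) |
           ((uint32_t)c << 16) | ((uint32_t)d << 24);
}

/* Illustrative version of the depth -> format mapping; returns 0 for
 * combinations this sketch does not cover. */
static uint32_t depth_to_format(int depth, int bpp)
{
    switch (depth) {
    case 16:
        return fourcc('R', 'G', '1', '6');     /* GBM_FORMAT_RGB565 */
    case 32:
        return fourcc('A', 'R', '2', '4');     /* GBM_FORMAT_ARGB8888 */
    case 30:
        return fourcc('X', 'R', '3', '0');     /* GBM_FORMAT_XRGB2101010 */
    case 24:
        if (bpp == 32)
            return fourcc('X', 'R', '2', '4'); /* GBM_FORMAT_XRGB8888 */
        return 0;
    default:
        return 0;
    }
}
```

Note how depth 30 still uses 32 bits per pixel (2 spare bits), which is why the amdgpu_dri2.c hunk above adds `case 30:` to the `cpp = 4` branch.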
[PATCH xf86-video-amdgpu 2/4] Disable gamma set when deep color
From: Qiang Yu Gamma set is disabled in the kernel driver for deep color. Enabling it would confuse the user. Signed-off-by: Qiang Yu [ Michel Dänzer: Align drmmode_pre_init change with radeon commit 1f1d4b1fa7d4b22dd8553f7e71251bf17ca7a7b1 ] Signed-off-by: Michel Dänzer --- src/drmmode_display.c | 18 ++ 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/src/drmmode_display.c b/src/drmmode_display.c index be3deef20..e52b7a355 100644 --- a/src/drmmode_display.c +++ b/src/drmmode_display.c @@ -858,8 +858,11 @@ drmmode_set_mode_major(xf86CrtcPtr crtc, DisplayModePtr mode, if (drmmode_crtc->tear_free) scanout_id = drmmode_crtc->scanout_id; - drmmode_crtc_gamma_do_set(crtc, crtc->gamma_red, crtc->gamma_green, - crtc->gamma_blue, crtc->gamma_size); + /* gamma is disabled in kernel driver for deep color */ + if (pScrn->depth != 30) + drmmode_crtc_gamma_do_set( + crtc, crtc->gamma_red, crtc->gamma_green, + crtc->gamma_blue, crtc->gamma_size); if (drmmode_crtc->prime_scanout_pixmap) { drmmode_crtc_prime_scanout_update(crtc, mode, scanout_id, @@ -2361,6 +2364,12 @@ Bool drmmode_pre_init(ScrnInfoPtr pScrn, drmmode_ptr drmmode, int cpp) info->drmmode_crtc_funcs.shadow_destroy = NULL; } + /* Hw gamma lut's are currently bypassed by the hw at color depth 30, +* so spare the server the effort to compute and update the cluts. +*/ + if (pScrn->depth == 30) + info->drmmode_crtc_funcs.gamma_set = NULL; + for (i = 0; i < mode_res->count_crtcs; i++) if (!xf86IsEntityShared(pScrn->entityList[0]) || (crtcs_needed && !(pAMDGPUEnt->assigned_crtcs & (1 << i)))) @@ -2591,8 +2600,9 @@ Bool drmmode_setup_colormap(ScreenPtr pScreen, ScrnInfoPtr pScrn) "Initializing kms color map\n"); if (!miCreateDefColormap(pScreen)) return FALSE; - /* all amdgpus support 10 bit CLUTs */ - if (!xf86HandleColormaps(pScreen, 256, 10, + /* All radeons support 10 bit CLUTs. They get bypassed at depth 30.
+*/ + if (pScrn->depth != 30 && + !xf86HandleColormaps(pScreen, 256, 10, NULL, NULL, CMAP_PALETTED_TRUECOLOR #if 0 /* This option messes up text mode! (e...@suse.de) */ -- 2.16.1
Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor
On 2018-02-13 12:06 PM, Christian König wrote: > Am 13.02.2018 um 17:42 schrieb Felix Kuehling: >> On 2018-02-13 05:25 AM, Christian König wrote: >>> Am 13.02.2018 um 00:23 schrieb Felix Kuehling: On 2018-02-12 11:57 AM, Felix Kuehling wrote: >>> I just thought of a slightly different approach I would consider >>> more >>> realistic, without having thought through all the details: Adding >>> KFD-specific memory management ioctls to the amdgpu device. >>> Basically >>> call amdgpu_amdkfd_gpuvm functions from amdgpu ioctl functions >>> instead >>> of KFD ioctl functions. But we'd still have KFD ioctls for other >>> things, >>> and the new amdgpu ioctls would be KFD-specific and not useful for >>> graphics applications. It's probably still several weeks of >>> work, but >>> shouldn't have major teething issues because the internal logic and >>> functionality would be basically unchanged. It would just move the >>> ioctls from one device to another. >> My thinking went into a similar direction. But instead of >> exposing the >> KFD IOCTLs through the DRM FD, I would let the KFD import a DRM FD. >> >> And then use the DRM FD in the KFD for things like the buffer >> provider >> of a device, e.g. no separate IDR for BOs in the KFD but rather a >> reference to the DRM FD. > I'm not sure I understand this. With "DRM FD" I think you mean the > device file descriptor. Then have KFD call file operations on the > FD to > call AMDGPU ioctls? Is there any precedent for this that I can > look at > for an example? OK, I think this could work for finding a VM from a DRM file descriptor: struct file *filp = fget(fd); struct drm_file *drm_priv = filp->private_data; struct amdgpu_fpriv *drv_priv = drm_priv->driver_priv; struct amdgpu_vm *vm = &drv_priv->vm; That would let us use the DRM VM instead of creating our own and would avoid wasting a perfectly good VM that gets created by opening the DRM device. We'd need a new ioctl to import VMs into KFD.
But that's about as far as I would take it. We'd still need to add KFD-specific data to the VM. If we use an existing VM, we can't wrap it in our own structure any more. We'd need to come up with a different solution or extend struct amdgpu_vm with what we need. >>> Well feel free to extend the VM structure with a pointer for KFD data. >> Some more thoughts about that: Currently the lifetime of the VM is tied >> to the file descriptor. If we import it into KFD, we either have to make >> struct amdgpu_vm reference counted, or we have to keep a reference to >> the file descriptor in KFD just to keep the VM alive until we drop our >> reference to it. >> And we still need our own BO creation ioctl. Importing DMABufs adds extra overhead (more system calls, file descriptors, GEM objects) and doesn't cover some of our use cases (userptr). We also need our own idr for managing buffer handles that are used with all our memory and VM management functions. >>> Yeah, well that is the second part of that idea. >>> >>> When you have the drm_file for the VM, you can also use it to call >>> drm_gem_object_lookup(). >> KFD only has one device, so we need a buffer ID that's globally unique, >> not per-device. Or we'd need to identify buffers by a pair of a GEM >> handle and a device ID. Our BO handles are 64-bit, so we could pack both >> into a single handle. > > Ah, yeah that is also a point I wanted to talk about with you. > > The approach of using the same buffer object with multiple amdgpu > devices doesn't work in general. > > We need a separate TTM object for each BO in each device or otherwise we > break A+A laptops. I think it broke for VRAM BOs because we enabled P2P on systems that didn't support it properly. But at least system memory BOs can be shared quite well between devices and we do it all the time. I don't see how you can have separate TTM objects referring to the same memory.
> > That is also the reason we had to disable this feature again in the > hybrid branches. What you disabled on hybrid branches was P2P, which only affects large-BAR systems. Sharing of system memory BOs between devices still works fine. > > So we need BO handles per device here or some other solution. > >> Either way, that still requires a GEM object, which adds not just a >> minor bit of overhead. It includes a struct file * which points to a >> shared memory file that gets created for each BO in drm_gem_object_init. > > HUI WHAT? Why the heck do we still do this? > > Sorry something went wrong here and yes that shmem file is completely > superfluous even for the gfx side. > >>> And when you need to iterate over all the BOs you can just use the >>> object_idr or the VM structure as well. >> object_idr only gives us BOs allocated from a particular device.
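The 64-bit handle packing Felix floats above — identifying a buffer by a pair of a GEM handle and a device ID inside one KFD buffer handle — could look like the sketch below. The 32/32 split and the field names are assumptions for illustration, not an existing ABI:

```c
#include <stdint.h>

/* Hypothetical packing: GPU/device id in the upper 32 bits, per-device
 * GEM handle in the lower 32 bits of a 64-bit KFD buffer handle. */
static inline uint64_t kfd_pack_handle(uint32_t gpu_id, uint32_t gem_handle)
{
    return ((uint64_t)gpu_id << 32) | gem_handle;
}

static inline uint32_t kfd_handle_gpu_id(uint64_t handle)
{
    return (uint32_t)(handle >> 32);
}

static inline uint32_t kfd_handle_gem(uint64_t handle)
{
    return (uint32_t)(handle & 0xffffffffu);
}
```

This would keep handles globally unique across the single KFD device while still mapping back to a per-device GEM handle, addressing Christian's "BO handles per device" point.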
Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor
Am 13.02.2018 um 17:42 schrieb Felix Kuehling: On 2018-02-13 05:25 AM, Christian König wrote: Am 13.02.2018 um 00:23 schrieb Felix Kuehling: On 2018-02-12 11:57 AM, Felix Kuehling wrote: I just thought of a slightly different approach I would consider more realistic, without having thought through all the details: Adding KFD-specific memory management ioctls to the amdgpu device. Basically call amdgpu_amdkfd_gpuvm functions from amdgpu ioctl functions instead of KFD ioctl functions. But we'd still have KFD ioctls for other things, and the new amdgpu ioctls would be KFD-specific and not useful for graphics applications. It's probably still several weeks of work, but shouldn't have major teething issues because the internal logic and functionality would be basically unchanged. It would just move the ioctls from one device to another. My thinking went into a similar direction. But instead of exposing the KFD IOCTLs through the DRM FD, I would let the KFD import a DRM FD. And then use the DRM FD in the KFD for things like the buffer provider of a device, e.g. no separate IDR for BOs in the KFD but rather a reference to the DRM FD. I'm not sure I understand this. With "DRM FD" I think you mean the device file descriptor. Then have KFD call file operations on the FD to call AMDGPU ioctls? Is there any precedent for this that I can look at for an example? OK, I think this could work for finding a VM from a DRM file descriptor: struct file *filp = fget(fd); struct drm_file *drm_priv = filp->private_data; struct amdgpu_fpriv *drv_priv = drm_priv->driver_priv; struct amdgpu_vm *vm = &drv_priv->vm; That would let us use the DRM VM instead of creating our own and would avoid wasting a perfectly good VM that gets created by opening the DRM device. We'd need a new ioctl to import VMs into KFD. But that's about as far as I would take it. We'd still need to add KFD-specific data to the VM. If we use an existing VM, we can't wrap it in our own structure any more.
We'd need to come up with a different solution or extend struct amdgpu_vm with what we need. Well feel free to extend the VM structure with a pointer for KFD data. Some more thoughts about that: Currently the lifetime of the VM is tied to the file descriptor. If we import it into KFD, we either have to make struct amdgpu_vm reference counted, or we have to keep a reference to the file descriptor in KFD just to keep the VM alive until we drop our reference to it. And we still need our own BO creation ioctl. Importing DMABufs adds extra overhead (more system calls, file descriptors, GEM objects) and doesn't cover some of our use cases (userptr). We also need our own idr for managing buffer handles that are used with all our memory and VM management functions. Yeah, well that is the second part of that idea. When you have the drm_file for the VM, you can also use it to call drm_gem_object_lookup(). KFD only has one device, so we need a buffer ID that's globally unique, not per-device. Or we'd need to identify buffers by a pair of a GEM handle and a device ID. Our BO handles are 64-bit, so we could pack both into a single handle. Ah, yeah that is also a point I wanted to talk about with you. The approach of using the same buffer object with multiple amdgpu devices doesn't work in general. We need a separate TTM object for each BO in each device or otherwise we break A+A laptops. That is also the reason we had to disable this feature again in the hybrid branches. So we need BO handles per device here or some other solution. Either way, that still requires a GEM object, which adds not just a minor bit of overhead. It includes a struct file * which points to a shared memory file that gets created for each BO in drm_gem_object_init. HUI WHAT? Why the heck do we still do this? Sorry, something went wrong here and yes, that shmem file is completely superfluous even for the gfx side.
And when you need to iterate over all the BOs you can just use the object_idr or the VM structure as well. object_idr only gives us BOs allocated from a particular device. The VM structure has various BO lists, but they only contain BOs that were added to the VM. The same BO can be added multiple times or to multiple VMs, or it may not be in any VM at all. Ok, yeah that probably isn't sufficient for the BO handling like you need it for eviction. In later patch series we also need to track additional information for KFD buffers in KFD itself. At that point we change the KFD IDR to point to our own structure, which in turn points to the BO. For example our BOs get their VA assigned at allocation time, and we later do reverse lookups from the address to the BO using an interval tree. You can do this with the standard VM structure as well; that is needed for UVD and VCE anyway. See amdgpu_vm_bo_lookup_mapping. Regards, Christian.
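The reverse VA-to-BO lookup discussed here is done with an interval tree in the kernel (see amdgpu_vm_bo_lookup_mapping). As a simplified, self-contained illustration, a sorted array of non-overlapping mappings can answer the same "which BO covers this address?" query with a binary search; the struct below is a stand-in, not the real amdgpu_bo_va_mapping:

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for an interval-tree node: an inclusive VA range plus a
 * token identifying the BO it maps (a pointer in the real driver). */
struct va_mapping {
    uint64_t start, last;   /* inclusive range, like amdgpu mappings */
    int bo_id;
};

/* Binary search over mappings sorted by start address; returns the
 * mapping containing addr, or NULL if the address is unmapped. */
static const struct va_mapping *
lookup_mapping(const struct va_mapping *m, size_t n, uint64_t addr)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (addr < m[mid].start)
            hi = mid;
        else if (addr > m[mid].last)
            lo = mid + 1;
        else
            return &m[mid];
    }
    return NULL;
}
```

An interval tree does the same containment query but stays efficient under insertion and removal, which is why the kernel prefers it over a flat sorted array.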
Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor
On 2018-02-13 05:46 AM, Christian König wrote: > Am 12.02.2018 um 17:57 schrieb Felix Kuehling: >>> [SNIP] >>> I see that as an advantage rather than a problem, cause it fixes a >>> couple of problems with the KFD where the address space of the inode >>> is not managed correctly as far as I can see. >> I don't think GEM is involved in the management of address space. That's >> handled inside TTM and drm_vma_manager, both of which we are using. We >> have tested CPU mappings with evictions and this is working correctly >> now. > > Ok, correct. I was thinking about that as one large functionality. > When you see it like this GEM basically only implements looking up BOs > and reference counting them, but both are still useful features. > >> We had problems with this before, when we were CPU-mapping our buffers >> through the KFD device FD. >> >>> That implementation issues never caused problems right now because you >>> never tried to unmap doorbells. But with the new eviction code that is >>> about to change, isn't it? >> I don't understand this comment. What does this have to do with >> doorbells? > > I was assuming that doorbells are now unmapped to figure out when to > start a process again. Restart is done based on a timer. > > See what I'm missing is how does userspace figures out which address > to use to map the doorbell? E.g. I don't see a call to > drm_vma_offset_manager_init() or something like that. Each process gets a whole page of the doorbell aperture assigned to it. The assumption is that amdgpu only uses the first page of the doorbell aperture, so KFD uses all the rest. On GFX8 and before, the queue ID is used as the offset into the doorbell page. On GFX9 the hardware does some engine-specific doorbell routing, so we added another layer of doorbell management that's decoupled from the queue ID. Either way, an entire doorbell page gets mapped into user mode and user mode knows the offset of the doorbells for specific queues. 
The mapping is currently handled by kfd_mmap in kfd_chardev.c. A later patch will add the ability to map doorbells into GPUVM address space so that GPUs can dispatch work to themselves (or each other) by creating a special doorbell BO that can be mapped both to the GPU and the CPU. It works by creating an amdgpu_bo with an SG describing the doorbell page of the process. > >> Doorbells are never unmapped. When queues are evicted doorbells remain >> mapped to the process. User mode can continue writing to doorbells, >> though they won't have any immediate effect while the corresponding >> queues are unmapped from the HQDs. > > Do you simply assume that after evicting a process it always needs to > be restarted without checking if it actually does something? Or how > does that work? Exactly. With later addition of GPU self-dispatch a page-fault based mechanism wouldn't work any more. We have to restart the queues blindly with a timer. See evict_process_worker, which schedules the restore with a delayed worker. The user mode queue ABI specifies that user mode updates both the doorbell and a WPTR in memory. When we restart queues we (or the CP firmware) use the WPTR to make sure we catch up with any work that was submitted while the queues were unmapped. Regards, Felix > > Regards, > Christian.
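The doorbell layout Felix describes — amdgpu keeps the first page of the aperture, every KFD process gets one whole page, and on pre-GFX9 hardware the queue ID indexes into that page — can be sketched as simple offset arithmetic. The page and doorbell sizes here are illustrative assumptions, not values taken from the driver:

```c
#include <stdint.h>

/* Assumed sizes for illustration only. */
#define DOORBELL_PAGE_SIZE 4096u
#define DOORBELL_SIZE      4u   /* one 32-bit doorbell slot per queue */

/* Byte offset of a process's doorbell page within the aperture;
 * page 0 is reserved for amdgpu itself, per the description above. */
static uint64_t process_doorbell_base(uint32_t process_index)
{
    return (uint64_t)(process_index + 1) * DOORBELL_PAGE_SIZE;
}

/* Pre-GFX9 style: the queue ID selects the doorbell slot inside the
 * process page (GFX9 adds an extra routing/allocation layer instead). */
static uint32_t queue_doorbell_offset(uint32_t queue_id)
{
    return queue_id * DOORBELL_SIZE;
}
```

This also illustrates the point debated above: two processes mmap the same file offset range, but each gets its own doorbell page, which is what makes the mapping per-process rather than per-inode.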
Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor
On 2018-02-13 05:25 AM, Christian König wrote: > Am 13.02.2018 um 00:23 schrieb Felix Kuehling: >> On 2018-02-12 11:57 AM, Felix Kuehling wrote: > I just thought of a slightly different approach I would consider more > realistic, without having thought through all the details: Adding > KFD-specific memory management ioctls to the amdgpu device. Basically > call amdgpu_amdkfd_gpuvm functions from amdgpu ioctl functions > instead > of KFD ioctl functions. But we'd still have KFD ioctls for other > things, > and the new amdgpu ioctls would be KFD-specific and not useful for > graphics applications. It's probably still several weeks of work, but > shouldn't have major teething issues because the internal logic and > functionality would be basically unchanged. It would just move the > ioctls from one device to another. My thinking went into a similar direction. But instead of exposing the KFD IOCTLs through the DRM FD, I would let the KFD import a DRM FD. And then use the DRM FD in the KFD for things like the buffer provider of a device, e.g. no separate IDR for BOs in the KFD but rather a reference to the DRM FD. >>> I'm not sure I understand this. With "DRM FD" I think you mean the >>> device file descriptor. Then have KFD call file operations on the FD to >>> call AMDGPU ioctls? Is there any precedent for this that I can look at >>> for an example? >> OK, I think this could work for finding a VM from a DRM file descriptor: >> >> struct file *filp = fget(fd); >> struct drm_file *drm_priv = filp->private_data; >> struct amdgpu_fpriv *drv_priv = drm_priv->driver_priv; >> struct amdgpu_vm *vm = &drv_priv->vm; >> >> That would let us use the DRM VM instead of creating our own and would >> avoid wasting a perfectly good VM that gets created by opening the DRM >> device. We'd need a new ioctl to import VMs into KFD. But that's about >> as far as I would take it. >> >> We'd still need to add KFD-specific data to the VM. 
If we use an >> existing VM, we can't wrap it in our own structure any more. We'd need >> to come up with a different solution or extend struct amdgpu_vm with >> what we need. > > Well feel free to extend the VM structure with a pointer for KFD data. Some more thoughts about that: Currently the lifetime of the VM is tied to the file descriptor. If we import it into KFD, we either have to make struct amdgpu_vm reference counted, or we have to keep a reference to the file descriptor in KFD just to keep the VM alive until we drop our reference to it. > >> And we still need our own BO creation ioctl. Importing DMABufs adds >> extra overhead (more system calls, file descriptors, GEM objects) and >> doesn't cover some of our use cases (userptr). We also need our own idr >> for managing buffer handles that are used with all our memory and VM >> management functions. > > Yeah, well that is the second part of that idea. > > When you have the drm_file for the VM, you can also use it to call > drm_gem_object_lookup(). KFD only has one device, so we need a buffer ID that's globally unique, not per-device. Or we'd need to identify buffers by a pair of a GEM handle and a device ID. Our BO handles are 64-bit, so we could pack both into a single handle. Either way, that still requires a GEM object, which adds not just a minor bit of overhead. It includes a struct file * which points to a shared memory file that gets created for each BO in drm_gem_object_init. > > And when you need to iterate over all the BOs you can just use the > object_idr or the VM structure as well. object_idr only gives us BOs allocated from a particular device. The VM structure has various BO lists, but they only contain BOs that were added to the VM. The same BO can be added multiple times or to multiple VMs, or it may not be in any VM at all. In later patch series we also need to track additional information for KFD buffers in KFD itself. 
At that point we change the KFD IDR to point to our own structure, which
in turn points to the BO. For example our BOs get their VA assigned at
allocation time, and we later do reverse lookups from the address to the
BO using an interval tree.

Regards,
  Felix

> Regards,
>   Christian.
>
>> Regards,
>> Felix
>
>>> Or do you mean a DMABuf FD exported by GEM? After importing the buffer
>>> and closing the FD, we still need a way to refer to the buffer for
>>> managing the GPU mappings and for releasing the reference. So we still
>>> need handles that are currently provided by our own IDR. I think we'll
>>> still need that.
>>>
>>> Finally, the kfd_ioctl_alloc_memory_of_gpu will later also be used for
>>> creating userptr BOs. Those can't be exported as DMABufs, so we still
>>> need our own ioctl for creating that type of BO. That was going to be
>>> my next patch series.
>>>
>>> Regards,
>>>   Felix
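The handle-packing idea Felix mentions (a 64-bit KFD handle carrying both a device ID and a per-device GEM handle) can be sketched in a few lines. The helper names below are hypothetical, for illustration only, not actual KFD code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: identify a buffer by a (device ID, GEM handle)
 * pair packed into one 64-bit KFD buffer handle, as discussed above.
 * These helpers are illustrative, not actual KFD code. */
static inline uint64_t kfd_pack_handle(uint32_t gpu_id, uint32_t gem_handle)
{
	return ((uint64_t)gpu_id << 32) | gem_handle;
}

static inline uint32_t kfd_handle_to_gpu_id(uint64_t handle)
{
	return (uint32_t)(handle >> 32);
}

static inline uint32_t kfd_handle_to_gem_handle(uint64_t handle)
{
	return (uint32_t)(handle & 0xffffffffu);
}
```

This would make a handle globally unique across devices while still letting KFD recover the per-device GEM handle for drm_gem_object_lookup().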
Re: [PATCH 31/34] drm/amd/display: Add logging for aux DPCD access
On 2018-02-12 04:31 PM, Andrey Grodzovsky wrote:
> With this logger you should probably remove the Linux specific logger in
> amdgpu_dm_mst_types.c, check log_dpcd function.

This currently only logs the first byte of the response. Once that's
fixed you're right, we should rip out the one from amdgpu_dm_mst_types.c.

Harry

> Andrey
>
> On 02/12/2018 12:16 PM, Harry Wentland wrote:
>> From: Eric Yang
>>
>> Add basic logging for DPCD access. Does not print
>> by default.
>>
>> Currently only prints first byte of the data accessed.
>>
>> Technical debt: Need to make it so that the entire
>> data block accessed is printed. Also need to log
>> address space that's not DPCD.
>>
>> Change-Id: I10ef7042c14d70508845ef827ebec2432d8d8176
>> Signed-off-by: Eric Yang
>> Reviewed-by: Tony Cheng
>> Acked-by: Harry Wentland
>> ---
>>  drivers/gpu/drm/amd/display/dc/i2caux/aux_engine.c | 16 ++++++++++++++++
>>  1 file changed, 16 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/display/dc/i2caux/aux_engine.c b/drivers/gpu/drm/amd/display/dc/i2caux/aux_engine.c
>> index fc7a7d4ebca5..0b1db48fef36 100644
>> --- a/drivers/gpu/drm/amd/display/dc/i2caux/aux_engine.c
>> +++ b/drivers/gpu/drm/amd/display/dc/i2caux/aux_engine.c
>> @@ -284,6 +284,14 @@ static bool read_command(
>>  		msleep(engine->delay);
>>  	} while (ctx.operation_succeeded && !ctx.transaction_complete);
>>  
>> +	if (request->payload.address_space ==
>> +		I2CAUX_TRANSACTION_ADDRESS_SPACE_DPCD) {
>> +		dm_logger_write(engine->base.ctx->logger, LOG_I2C_AUX, "READ: addr:0x%x value:0x%x Result:%d",
>> +			request->payload.address,
>> +			request->payload.data[0],
>> +			ctx.operation_succeeded);
>> +	}
>> +
>>  	return ctx.operation_succeeded;
>>  }
>>
>> @@ -484,6 +492,14 @@ static bool write_command(
>>  		msleep(engine->delay);
>>  	} while (ctx.operation_succeeded && !ctx.transaction_complete);
>>  
>> +	if (request->payload.address_space ==
>> +		I2CAUX_TRANSACTION_ADDRESS_SPACE_DPCD) {
>> +		dm_logger_write(engine->base.ctx->logger, LOG_I2C_AUX, "WRITE: addr:0x%x value:0x%x Result:%d",
>> +			request->payload.address,
>> +			request->payload.data[0],
>> +			ctx.operation_succeeded);
>> +	}
>> +
>>  	return ctx.operation_succeeded;
>>  }
>>
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
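The technical debt noted in the patch (only data[0] is printed) could be addressed by formatting the whole payload into one string before logging it. A rough, self-contained sketch of such a formatter; the helper name is hypothetical and this is plain C, not DC code:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: format an entire AUX payload as hex so the whole
 * data block can be logged in one line, instead of only data[0]. */
static void format_payload_hex(const uint8_t *data, size_t len,
			       char *buf, size_t buf_size)
{
	size_t i, pos = 0;

	buf[0] = '\0';
	/* each byte needs "xx " (3 chars) plus the terminating NUL */
	for (i = 0; i < len && pos + 3 < buf_size; i++)
		pos += snprintf(buf + pos, buf_size - pos, "%02x ", data[i]);
	if (pos > 0)
		buf[pos - 1] = '\0'; /* drop the trailing space */
}
```

The resulting buffer could then be passed to dm_logger_write() with a single "%s" in place of the current "value:0x%x" format.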
Re: Documentation about AMD's HSA implementation?
The ROCm documentation is probably a good place to start:

https://rocm.github.io/documentation.html

Alex

From: amd-gfx on behalf of Ming Yang
Sent: Tuesday, February 13, 2018 12:00 AM
To: amd-gfx@lists.freedesktop.org
Subject: Documentation about AMD's HSA implementation?

Hi,

I'm interested in HSA and was excited when I found AMD's fully
open-stack ROCm supporting it. Before digging into the code, I wonder if
there's any documentation available about AMD's HSA implementation,
either book, whitepaper, paper, or documentation.

I did find helpful materials about HSA, including the HSA standards on
this page (http://www.hsafoundation.com/standards/) and a nice book
about HSA (Heterogeneous System Architecture: A New Compute Platform
Infrastructure). But regarding the documentation about AMD's
implementation, I haven't found anything yet. Please let me know if
there are any that are publicly accessible. If not, any suggestions on
learning the implementation of specific system components, e.g., queue
scheduling, would be appreciated.

Best,
Mark
Re: [PATCH 1/2] drm/amdgpu: fix and cleanup UVD IB generation
Reviewed-by: Leo Liu

On 02/07/2018 02:48 PM, Christian König wrote:

We didn't sync the BO after validating it. Also start to use
amdgpu_bo_create_reserved to simplify things.

Signed-off-by: Christian König
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c | 106
 1 file changed, 38 insertions(+), 68 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
index 7cdbe0c14496..9cd5517a4fa9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
@@ -952,37 +952,28 @@ int amdgpu_uvd_ring_parse_cs(struct amdgpu_cs_parser *parser, uint32_t ib_idx)
 static int amdgpu_uvd_send_msg(struct amdgpu_ring *ring, struct amdgpu_bo *bo,
 			       bool direct, struct dma_fence **fence)
 {
-	struct ttm_operation_ctx ctx = { true, false };
-	struct ttm_validate_buffer tv;
-	struct ww_acquire_ctx ticket;
-	struct list_head head;
+	struct amdgpu_device *adev = ring->adev;
+	struct dma_fence *f = NULL;
 	struct amdgpu_job *job;
 	struct amdgpu_ib *ib;
-	struct dma_fence *f = NULL;
-	struct amdgpu_device *adev = ring->adev;
-	uint64_t addr;
 	uint32_t data[4];
-	int i, r;
-
-	memset(&tv, 0, sizeof(tv));
-	tv.bo = &bo->tbo;
-
-	INIT_LIST_HEAD(&head);
-	list_add(&tv.head, &head);
+	uint64_t addr;
+	long r;
+	int i;
 
-	r = ttm_eu_reserve_buffers(&ticket, &head, true, NULL);
-	if (r)
-		return r;
+	amdgpu_bo_kunmap(bo);
+	amdgpu_bo_unpin(bo);
 
 	if (!ring->adev->uvd.address_64_bit) {
+		struct ttm_operation_ctx ctx = { true, false };
+
 		amdgpu_ttm_placement_from_domain(bo, AMDGPU_GEM_DOMAIN_VRAM);
 		amdgpu_uvd_force_into_uvd_segment(bo);
+		r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
+		if (r)
+			goto err;
 	}
 
-	r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
-	if (r)
-		goto err;
-
 	r = amdgpu_job_alloc_with_ib(adev, 64, &job);
 	if (r)
 		goto err;
@@ -1014,6 +1005,14 @@ static int amdgpu_uvd_send_msg(struct amdgpu_ring *ring, struct amdgpu_bo *bo,
 	ib->length_dw = 16;
 
 	if (direct) {
+		r = reservation_object_wait_timeout_rcu(bo->tbo.resv,
+							true, false,
+							msecs_to_jiffies(10));
+		if (r == 0)
+			r = -ETIMEDOUT;
+		if (r < 0)
+			goto err_free;
+
 		r = amdgpu_ib_schedule(ring, 1, ib, NULL, &f);
 		job->fence = dma_fence_get(f);
 		if (r)
@@ -1021,17 +1020,23 @@ static int amdgpu_uvd_send_msg(struct amdgpu_ring *ring, struct amdgpu_bo *bo,
 			goto err_free;
 		amdgpu_job_free(job);
 	} else {
+		r = amdgpu_sync_resv(adev, &job->sync, bo->tbo.resv,
+				     AMDGPU_FENCE_OWNER_UNDEFINED, false);
+		if (r)
+			goto err_free;
+
 		r = amdgpu_job_submit(job, ring, &adev->uvd.entity,
 				      AMDGPU_FENCE_OWNER_UNDEFINED, &f);
 		if (r)
 			goto err_free;
 	}
 
-	ttm_eu_fence_buffer_objects(&ticket, &head, f);
+	amdgpu_bo_fence(bo, f, false);
+	amdgpu_bo_unreserve(bo);
+	amdgpu_bo_unref(&bo);
 
 	if (fence)
 		*fence = dma_fence_get(f);
-	amdgpu_bo_unref(&bo);
 	dma_fence_put(f);
 
 	return 0;
@@ -1040,7 +1045,8 @@ static int amdgpu_uvd_send_msg(struct amdgpu_ring *ring, struct amdgpu_bo *bo,
 	amdgpu_job_free(job);
 
 err:
-	ttm_eu_backoff_reservation(&ticket, &head);
+	amdgpu_bo_unreserve(bo);
+	amdgpu_bo_unref(&bo);
 	return r;
 }
 
@@ -1051,31 +1057,16 @@ int amdgpu_uvd_get_create_msg(struct amdgpu_ring *ring, uint32_t handle,
 			      struct dma_fence **fence)
 {
 	struct amdgpu_device *adev = ring->adev;
-	struct amdgpu_bo *bo;
+	struct amdgpu_bo *bo = NULL;
 	uint32_t *msg;
 	int r, i;
 
-	r = amdgpu_bo_create(adev, 1024, PAGE_SIZE, true,
-			     AMDGPU_GEM_DOMAIN_VRAM,
-			     AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED |
-			     AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS,
-			     NULL, NULL, &bo);
+	r = amdgpu_bo_create_reserved(adev, 1024, PAGE_SIZE,
+				      AMDGPU_GEM_DOMAIN_VRAM,
+				      &bo, NULL, (void **)&msg);
 	if (r)
 		return r;
 
-	r = amdgpu_bo_reserve(bo, false);
-	if (r) {
-		amdgpu_bo_unref(&bo);
-		return r;
-	}
-
-	r = amdgpu_bo_kmap(bo, (void **)&msg);
-	if (r) {
-		amdgpu_bo_unreserve(bo);
-
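The `r == 0` check in the direct path above follows the return convention of reservation_object_wait_timeout_rcu(): a positive value means the fences signaled with that much timeout remaining, zero means the wait timed out (and must be mapped to an error by the caller), and a negative value is an error. A tiny stand-alone sketch of that mapping; the helper name and error constant are illustrative, not kernel code:

```c
#include <assert.h>

/* Illustrative only: mirrors how the patch turns a zero return
 * ("timed out") into -ETIMEDOUT while passing other values through. */
#define SKETCH_ETIMEDOUT (-110)	/* numeric value of -ETIMEDOUT on Linux */

static long normalize_wait_result(long r)
{
	if (r == 0)		/* timed out: map to an error code */
		return SKETCH_ETIMEDOUT;
	return r;		/* error (< 0) or jiffies remaining (> 0) */
}
```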
Re: [PATCH 2/2] drm/amdgpu: cleanup VCN IB generation
On 02/13/2018 08:57 AM, Christian König wrote:
It's always the obvious. Leo any more comments on this? If not can we get your rb on this patch as well?

Tested-and-Reviewed-by: Leo Liu

Andrey if Leo gives his ok can you commit both? I'm on vacation and don't want to mess with this at the moment.

Thanks,
Christian.

Am 13.02.2018 um 14:50 schrieb Andrey Grodzovsky:
Found the issue, amdgpu_vcn_dec_get_destroy_msg was missing "struct amdgpu_bo *bo = NULL;" and so amdgpu_bo_create_reserved would not call amdgpu_bo_create. Attached updated patch.

Thanks,
Andrey

On 02/12/2018 02:46 PM, Andrey Grodzovsky wrote:
Tested with latest amd-staging-drm-next + VCN patch on top. VCN dec tests pass but when modprobing amdgpu I get a few warnings and BUG_ONs - log attached. UVD is not enabled so can't test the UVD patch.

Thanks,
Andrey

On 02/09/2018 07:31 AM, Christian König wrote:
Question is rather can somebody please test this on Raven?

Thanks,
Christian.
Re: [PATCH 2/2] drm/amdgpu: cleanup VCN IB generation
Sure

Andrey

On 02/13/2018 08:57 AM, Christian König wrote:
It's always the obvious. Leo any more comments on this? If not can we get your rb on this patch as well?

Andrey if Leo gives his ok can you commit both? I'm on vacation and don't want to mess with this at the moment.

Thanks,
Christian.

Am 13.02.2018 um 14:50 schrieb Andrey Grodzovsky:
Found the issue, amdgpu_vcn_dec_get_destroy_msg was missing "struct amdgpu_bo *bo = NULL;" and so amdgpu_bo_create_reserved would not call amdgpu_bo_create. Attached updated patch.

Thanks,
Andrey

On 02/12/2018 02:46 PM, Andrey Grodzovsky wrote:
Tested with latest amd-staging-drm-next + VCN patch on top. VCN dec tests pass but when modprobing amdgpu I get a few warnings and BUG_ONs - log attached. UVD is not enabled so can't test the UVD patch.

Thanks,
Andrey

On 02/09/2018 07:31 AM, Christian König wrote:
Question is rather can somebody please test this on Raven?

Thanks,
Christian.
Re: [PATCH 2/2] drm/amdgpu: cleanup VCN IB generation
It's always the obvious. Leo any more comments on this? If not can we get your rb on this patch as well?

Andrey if Leo gives his ok can you commit both? I'm on vacation and don't want to mess with this at the moment.

Thanks,
Christian.

Am 13.02.2018 um 14:50 schrieb Andrey Grodzovsky:
Found the issue, amdgpu_vcn_dec_get_destroy_msg was missing "struct amdgpu_bo *bo = NULL;" and so amdgpu_bo_create_reserved would not call amdgpu_bo_create. Attached updated patch.

Thanks,
Andrey

On 02/12/2018 02:46 PM, Andrey Grodzovsky wrote:
Tested with latest amd-staging-drm-next + VCN patch on top. VCN dec tests pass but when modprobing amdgpu I get a few warnings and BUG_ONs - log attached. UVD is not enabled so can't test the UVD patch.

Thanks,
Andrey

On 02/09/2018 07:31 AM, Christian König wrote:
Question is rather can somebody please test this on Raven?

Thanks,
Christian.
Re: [PATCH 2/2] drm/amdgpu: cleanup VCN IB generation
Found the issue, amdgpu_vcn_dec_get_destroy_msg was missing "struct amdgpu_bo *bo = NULL;" and so amdgpu_bo_create_reserved would not call amdgpu_bo_create. Attached updated patch.

Thanks,
Andrey

On 02/12/2018 02:46 PM, Andrey Grodzovsky wrote:
Tested with latest amd-staging-drm-next + VCN patch on top. VCN dec tests pass but when modprobing amdgpu I get a few warnings and BUG_ONs - log attached. UVD is not enabled so can't test the UVD patch.

Thanks,
Andrey

On 02/09/2018 07:31 AM, Christian König wrote:
Question is rather can somebody please test this on Raven?

Thanks,
Christian.

From 03d3d0195b4e672cefb2851e66f0fe03bd9089a3 Mon Sep 17 00:00:00 2001
From: Christian König
Date: Wed, 7 Feb 2018 20:48:22 +0100
Subject: drm/amdgpu: cleanup VCN IB generation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Start to use amdgpu_bo_create_reserved.

v2: Fix missing pointer init to NULL.
    Remove extra new lines.

Signed-off-by: Christian König
Signed-off-by: Andrey Grodzovsky
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 91
 1 file changed, 20 insertions(+), 71 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
index e86d0b2..6c5e25f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
@@ -270,43 +270,26 @@ int amdgpu_vcn_dec_ring_test_ring(struct amdgpu_ring *ring)
 	return r;
 }
 
-static int amdgpu_vcn_dec_send_msg(struct amdgpu_ring *ring, struct amdgpu_bo *bo,
-				   bool direct, struct dma_fence **fence)
+static int amdgpu_vcn_dec_send_msg(struct amdgpu_ring *ring,
+				   struct amdgpu_bo *bo, bool direct,
+				   struct dma_fence **fence)
 {
-	struct ttm_operation_ctx ctx = { true, false };
-	struct ttm_validate_buffer tv;
-	struct ww_acquire_ctx ticket;
-	struct list_head head;
+	struct amdgpu_device *adev = ring->adev;
+	struct dma_fence *f = NULL;
 	struct amdgpu_job *job;
 	struct amdgpu_ib *ib;
-	struct dma_fence *f = NULL;
-	struct amdgpu_device *adev = ring->adev;
 	uint64_t addr;
 	int i, r;
 
-	memset(&tv, 0, sizeof(tv));
-	tv.bo = &bo->tbo;
-
-	INIT_LIST_HEAD(&head);
-	list_add(&tv.head, &head);
-
-	r = ttm_eu_reserve_buffers(&ticket, &head, true, NULL);
-	if (r)
-		return r;
-
-	r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
-	if (r)
-		goto err;
-
 	r = amdgpu_job_alloc_with_ib(adev, 64, &job);
 	if (r)
 		goto err;
 
 	ib = &job->ibs[0];
 	addr = amdgpu_bo_gpu_offset(bo);
-	ib->ptr[0] = PACKET0(SOC15_REG_OFFSET(UVD, 0, mmUVD_GPCOM_VCPU_DATA0), 0);
+	ib->ptr[0] = PACKET0(SOC15_REG_OFFSET(UVD, 0, mmUVD_GPCOM_VCPU_DATA0),0);
 	ib->ptr[1] = addr;
-	ib->ptr[2] = PACKET0(SOC15_REG_OFFSET(UVD, 0, mmUVD_GPCOM_VCPU_DATA1), 0);
+	ib->ptr[2] = PACKET0(SOC15_REG_OFFSET(UVD, 0, mmUVD_GPCOM_VCPU_DATA1),0);
 	ib->ptr[3] = addr >> 32;
 	ib->ptr[4] = PACKET0(SOC15_REG_OFFSET(UVD, 0, mmUVD_GPCOM_VCPU_CMD), 0);
 	ib->ptr[5] = 0;
@@ -330,11 +313,12 @@ static int amdgpu_vcn_dec_send_msg(struct amdgpu_ring *ring, struct amdgpu_bo *b
 		goto err_free;
 	}
 
-	ttm_eu_fence_buffer_objects(&ticket, &head, f);
+	amdgpu_bo_fence(bo, f, false);
+	amdgpu_bo_unreserve(bo);
+	amdgpu_bo_unref(&bo);
 
 	if (fence)
 		*fence = dma_fence_get(f);
-	amdgpu_bo_unref(&bo);
 	dma_fence_put(f);
 
 	return 0;
@@ -343,7 +327,8 @@ static int amdgpu_vcn_dec_send_msg(struct amdgpu_ring *ring, struct amdgpu_bo *b
 	amdgpu_job_free(job);
 
 err:
-	ttm_eu_backoff_reservation(&ticket, &head);
+	amdgpu_bo_unreserve(bo);
+	amdgpu_bo_unref(&bo);
 	return r;
 }
 
@@ -351,31 +336,16 @@ static int amdgpu_vcn_dec_get_create_msg(struct amdgpu_ring *ring, uint32_t hand
 					 struct dma_fence **fence)
 {
 	struct amdgpu_device *adev = ring->adev;
-	struct amdgpu_bo *bo;
+	struct amdgpu_bo *bo = NULL;
 	uint32_t *msg;
 	int r, i;
 
-	r = amdgpu_bo_create(adev, 1024, PAGE_SIZE, true,
-			     AMDGPU_GEM_DOMAIN_VRAM,
-			     AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED |
-			     AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS,
-			     NULL, NULL, &bo);
+	r = amdgpu_bo_create_reserved(adev, 1024, PAGE_SIZE,
+				      AMDGPU_GEM_DOMAIN_VRAM,
+				      &bo, NULL, (void **)&msg);
 	if (r)
 		return r;
 
-	r = amdgpu_bo_reserve(bo, false);
-	if (r) {
-		amdgpu_bo_unref(&bo);
-		return r;
-	}
-
-	r = amdgpu_bo_kmap(bo, (void **)&msg);
-	if (r) {
-		amdgpu_bo_unreserve(bo);
-		amdgpu_bo_unref(&bo);
-		return r;
-	}
-
 	msg[0] = cpu_to_le32(0x0028);
 	msg[1] = cpu_to_le32(0x0038);
 	msg[2] = cpu_to_le32(0x0001);
@@ -393,9 +363,6 @@ static int amdgpu_vcn_dec_get_create_msg(struct amdgpu_ring *ring, uint32_t hand
 	for (i = 14; i < 1024; ++i)
 		msg[i] = cpu_to_le32(0x0);
 
-	amdgpu_bo_kunmap(bo);
-	amdgpu_bo_unreserve(bo);
-
 	return amdgpu_vcn_dec_send_msg(ring, bo, true, fence);
 }
 
@@ -403,31 +370,16 @@ static int amdgpu_vcn_dec_get_destroy_msg(struct amdgpu_ring *ring, uint32_t han
 					  bool direct, struct dma_fence **fence)
 {
 	struct
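The bug Andrey describes (a missing "= NULL" initializer making amdgpu_bo_create_reserved skip the allocation) comes from the helper's create-if-NULL semantics: it only allocates when the caller passes in a NULL BO pointer. A small stand-alone sketch of that contract; the names here loosely mirror the kernel API and are illustrative only:

```c
#include <assert.h>
#include <stdlib.h>

struct fake_bo { int created; };

/* Loose stand-in for amdgpu_bo_create_reserved(): like the real helper,
 * it only allocates a new BO when *bo_ptr is NULL, otherwise it assumes
 * the caller supplied an existing BO.  An uninitialized (garbage)
 * pointer therefore silently skips creation -- the bug in v1 of the
 * patch.  Illustrative only, not driver code. */
static int fake_bo_create_reserved(struct fake_bo **bo_ptr)
{
	if (*bo_ptr)		/* caller-provided BO: nothing to create */
		return 0;
	*bo_ptr = malloc(sizeof(**bo_ptr));
	if (!*bo_ptr)
		return -12;	/* numeric value of -ENOMEM */
	(*bo_ptr)->created = 1;
	return 0;
}
```

With a stack variable left uninitialized, *bo_ptr is garbage and almost never NULL, so the allocation path is never taken; initializing to NULL restores the intended behavior.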
Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor
Am 12.02.2018 um 17:57 schrieb Felix Kuehling:
[SNIP]

I see that as an advantage rather than a problem, cause it fixes a couple of problems with the KFD where the address space of the inode is not managed correctly as far as I can see.

I don't think GEM is involved in the management of address space. That's handled inside TTM and drm_vma_manager, both of which we are using. We have tested CPU mappings with evictions and this is working correctly now.

Ok, correct. I was thinking about that as one large functionality. When you see it like this GEM basically only implements looking up BOs and reference counting them, but both are still useful features.

We had problems with this before, when we were CPU-mapping our buffers through the KFD device FD.

Those implementation issues never caused problems right now because you never tried to unmap doorbells. But with the new eviction code that is about to change, isn't it?

I don't understand this comment. What does this have to do with doorbells?

I was assuming that doorbells are now unmapped to figure out when to start a process again. See what I'm missing is how does userspace figure out which address to use to map the doorbell? E.g. I don't see a call to drm_vma_offset_manager_init() or something like that.

Doorbells are never unmapped. When queues are evicted doorbells remain mapped to the process. User mode can continue writing to doorbells, though they won't have any immediate effect while the corresponding queues are unmapped from the HQDs.

Do you simply assume that after evicting a process it always needs to be restarted without checking if it actually does something? Or how does that work?

Regards,
Christian.
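The doorbell layout Felix describes elsewhere in this thread (each process gets one page of the doorbell aperture; on pre-GFX9 hardware the queue ID indexes 32-bit doorbell slots within that page) can be sketched as follows. The constants and helper name are assumptions for illustration, not driver code:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch of the pre-GFX9 doorbell addressing discussed in
 * this thread: one aperture page per process, one 32-bit doorbell slot
 * per queue within that page.  Names and sizes are assumed for the
 * example, not taken from the driver. */
#define DOORBELL_PAGE_SIZE	4096u
#define DOORBELL_SLOT_SIZE	((uint32_t)sizeof(uint32_t))

static uint64_t doorbell_offset(uint64_t process_page_base, uint32_t queue_id)
{
	/* the queue ID selects a 32-bit slot inside the process's page */
	return process_page_base + (uint64_t)queue_id * DOORBELL_SLOT_SIZE;
}
```

Since the whole page stays mapped across evictions, user mode keeps writing to these offsets; the writes simply have no effect until the queues are mapped back onto HQDs.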
Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor
Am 13.02.2018 um 00:23 schrieb Felix Kuehling:
On 2018-02-12 11:57 AM, Felix Kuehling wrote:
I just thought of a slightly different approach I would consider more realistic, without having thought through all the details: Adding KFD-specific memory management ioctls to the amdgpu device. Basically call amdgpu_amdkfd_gpuvm functions from amdgpu ioctl functions instead of KFD ioctl functions. But we'd still have KFD ioctls for other things, and the new amdgpu ioctls would be KFD-specific and not useful for graphics applications. It's probably still several weeks of work, but shouldn't have major teething issues because the internal logic and functionality would be basically unchanged. It would just move the ioctls from one device to another.

My thinking went into a similar direction. But instead of exposing the KFD IOCTLs through the DRM FD, I would let the KFD import a DRM FD. And then use the DRM FD in the KFD for things like the buffer provider of a device, e.g. no separate IDR for BOs in the KFD but rather a reference to the DRM FD.

I'm not sure I understand this. With "DRM FD" I think you mean the device file descriptor. Then have KFD call file operations on the FD to call AMDGPU ioctls? Is there any precedent for this that I can look at for an example?

OK, I think this could work for finding a VM from a DRM file descriptor:

struct file *filp = fget(fd);
struct drm_file *drm_priv = filp->private_data;
struct amdgpu_fpriv *drv_priv = drm_priv->driver_priv;
struct amdgpu_vm *vm = &drv_priv->vm;

That would let us use the DRM VM instead of creating our own and would avoid wasting a perfectly good VM that gets created by opening the DRM device. We'd need a new ioctl to import VMs into KFD. But that's about as far as I would take it.

We'd still need to add KFD-specific data to the VM. If we use an existing VM, we can't wrap it in our own structure any more. We'd need to come up with a different solution or extend struct amdgpu_vm with what we need.
Well feel free to extend the VM structure with a pointer for KFD data.

And we still need our own BO creation ioctl. Importing DMABufs adds extra overhead (more system calls, file descriptors, GEM objects) and doesn't cover some of our use cases (userptr). We also need our own idr for managing buffer handles that are used with all our memory and VM management functions.

Yeah, well that is the second part of that idea.

When you have the drm_file for the VM, you can also use it to call drm_gem_object_lookup(). And when you need to iterate over all the BOs you can just use the object_idr or the VM structure as well.

Regards,
Christian.

Regards,
Felix

Or do you mean a DMABuf FD exported by GEM? After importing the buffer and closing the FD, we still need a way to refer to the buffer for managing the GPU mappings and for releasing the reference. So we still need handles that are currently provided by our own IDR. I think we'll still need that.

Finally, the kfd_ioctl_alloc_memory_of_gpu will later also be used for creating userptr BOs. Those can't be exported as DMABufs, so we still need our own ioctl for creating that type of BO. That was going to be my next patch series.

Regards,
Felix
Documentation about AMD's HSA implementation?
Hi,

I'm interested in HSA and was excited when I found AMD's fully open-stack ROCm supporting it. Before digging into the code, I wonder if there's any documentation available about AMD's HSA implementation, either book, whitepaper, paper, or documentation.

I did find helpful materials about HSA, including the HSA standards on this page (http://www.hsafoundation.com/standards/) and a nice book about HSA (Heterogeneous System Architecture: A New Compute Platform Infrastructure). But regarding the documentation about AMD's implementation, I haven't found anything yet. Please let me know if there are any that are publicly accessible. If not, any suggestions on learning the implementation of specific system components, e.g., queue scheduling, would be appreciated.

Best,
Mark