Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Christian König

On 2018-02-13 08:22 PM, Felix Kuehling wrote:

On 2018-02-13 01:45 PM, Christian König wrote:

On 2018-02-13 05:56 PM, Felix Kuehling wrote:

[SNIP]
Each process gets a whole page of the doorbell aperture assigned to it.
The assumption is that amdgpu only uses the first page of the doorbell
aperture, so KFD uses all the rest. On GFX8 and before, the queue ID is
used as the offset into the doorbell page. On GFX9 the hardware does
some engine-specific doorbell routing, so we added another layer of
doorbell management that's decoupled from the queue ID.

Either way, an entire doorbell page gets mapped into user mode and user
mode knows the offset of the doorbells for specific queues. The mapping
is currently handled by kfd_mmap in kfd_chardev.c.

Ok, wait a second. Taking a look at kfd_doorbell_mmap() it almost
looks like you map different doorbells with the same offset depending
on which process is calling this.

Is that correct? If yes then that would be illegal and a problem if
I'm not completely mistaken.

Why is that a problem? Each process has its own file descriptor. The
mapping is done using io_remap_pfn_range in kfd_doorbell_mmap.
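
For illustration, a minimal sketch of such a per-process mapping (assumed
names throughout; not the actual kfd_doorbell_mmap()):

static int doorbell_mmap_sketch(struct kfd_process *p,
                                struct vm_area_struct *vma)
{
        /* Same file offset for every process, but the pfn is derived
         * from the calling process, so each process gets its own
         * doorbell page. */
        unsigned long pfn = (doorbell_aperture_base >> PAGE_SHIFT) +
                            p->doorbell_page_index;

        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
        /* Map exactly one page of the doorbell BAR into user space. */
        return io_remap_pfn_range(vma, vma->vm_start, pfn, PAGE_SIZE,
                                  vma->vm_page_prot);
}

doorbell_aperture_base and doorbell_page_index are hypothetical stand-ins
for however the driver tracks the aperture and the per-process page.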


Yeah, but they all share the same file address space from the inode, don't 
they?


E.g. imagine that you map an offset from a normal file into process A 
and the same offset from the same file into process B, what do you get? 
Exactly, the same page mapped into both processes.


Now take a look at the KFD interface: you map the same offset from the 
same file (or rather inode) into two different processes and get two 
different things.



This is nothing new. It's been done like this forever even on Kaveri and 
Carrizo.


It most likely doesn't cause any problems offhand, or otherwise we would 
have noticed already, but at a bare minimum it totally confuses the 
reverse mapping code.


I need to think about whether that is really an issue or not, and if yes, 
how to fix/mitigate it.


Regards,
Christian.




Do you simply assume that after evicting a process it always needs to
be restarted without checking if it actually does something? Or how
does that work?

Exactly.

Ok, understood. Well that limits the usefulness of the whole eviction
drastically.


With the later addition of GPU self-dispatch, a page-fault-based
mechanism wouldn't work any more. We have to restart the queues blindly
with a timer. See evict_process_worker, which schedules the restore with
a delayed worker.
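
A sketch of that timer-based restart, assuming a delayed-work item
comparable to what evict_process_worker uses (all names hypothetical):

static void evict_and_schedule_restore_sketch(struct kfd_process *p)
{
        quiesce_process_queues(p);      /* hypothetical helper */
        /* No page fault will tell us when to come back, so restore
         * blindly after a fixed delay. */
        schedule_delayed_work(&p->restore_work,
                              msecs_to_jiffies(RESTORE_DELAY_MS));
}
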
The user mode queue ABI specifies that user mode updates both the
doorbell and a WPTR in memory. When we restart queues we (or the CP
firmware) use the WPTR to make sure we catch up with any work that was
submitted while the queues were unmapped.
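
As a sketch, the submission sequence that ABI implies on the user-mode
side (assumed field names):

/* Write the packet, publish the new write pointer in memory, then ring
 * the doorbell. The in-memory wptr is what lets the CP catch up after a
 * restart, even for work submitted while the queue was unmapped. */
queue->ring[wptr % queue->size] = packet;
wptr++;
*queue->wptr_addr = wptr;       /* memory copy, read on queue restore */
*queue->doorbell = wptr;        /* MMIO write notifying the hardware */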

Putting cross-process work dispatch aside for a moment, GPU
self-dispatch works only when there is work running on the GPU.

So you can still check if there is any work pending after you
unmapped everything, and only restart the queues when there is new work,
based on the page fault.

In other words, either there is work pending and it doesn't matter if
it was sent by the GPU or by the CPU, or there is no work pending and
we can delay restarting everything until there is.

That sounds like a useful optimization.

Regards,
   Felix


Regards,
Christian.


Regards,
    Felix


Regards,
Christian.



Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Christian König

On 2018-02-14 12:17 AM, Felix Kuehling wrote:

On 2018-02-13 02:18 PM, Felix Kuehling wrote:

On 2018-02-13 01:15 PM, Christian König wrote:

On 2018-02-13 06:18 PM, Felix Kuehling wrote:

On 2018-02-13 12:06 PM, Christian König wrote:

[SNIP]
Ah, yeah that is also a point I wanted to talk about with you.

The approach of using the same buffer object with multiple amdgpu
devices doesn't work in general.

We need a separate TTM object for each BO in each device or otherwise we
break A+A laptops.

I think it broke for VRAM BOs because we enabled P2P on systems that
didn't support it properly. But at least system memory BOs can be shared
quite well between devices and we do it all the time.

Sharing VRAM BOs is one issue, but the problem goes deeper than just
that.

Starting with Carrizo we can scanout from system memory to avoid the
extra copy on A+A laptops. For this to work we need the BO mapped to
GART (and I mean a real VMID0 mapping, not just in the GTT domain).
And for this to work in turn we need a TTM object per device and not a
global one.

I still don't understand. I think what you're talking about applies only
to BOs used for scan-out. Every BO is allocated from a specific device
and can only be GART-mapped on that device.


Exactly that assumption is incorrect. BOs can be GART mapped into any 
device.



  What we do is map the same
BO in VMs on other devices. It has no effect on GART mappings.


Correct, VM mapping is unaffected here.


I don't see how you can have separate TTM objects referring to the
same memory.

Well that is trivial, we do this all the time with prime and I+A laptops.

As I understand it, you use DMABuf to export/import buffers on multiple
devices. I believe all devices share a single amdgpu_bo, which contains
the ttm_buffer_object.


That's incorrect as well. Normally multiple devices have multiple 
ttm_buffer_objects, one for each device.


Going a bit higher, that actually makes sense because the status of each 
BO is different for each device. E.g. one device could have the BO in 
use while it is idle on another device.


The same is true for the placement of the BO. E.g. a VRAM placement on one 
device is actually a GTT placement for another.



  The only way you can have a new TTM buffer object
per device is by using SG tables and pinning the BOs. But I think we
want to avoid pinning BOs.

What we do does not involve pinning of BOs, even when they're shared
between multiple devices' VMs.


That is also the reason we had to disable this feature again in the
hybrid branches.

What you disabled on hybrid branches was P2P, which only affects
large-BAR systems. Sharing of system memory BOs between devices still
works fine.

No, it doesn't. It completely breaks any scanout on Carrizo, Stoney
and Raven. In addition to that, we found that it breaks some aspects of
the user space interface.

Let me check that with my current patch series. The patches I submitted
here shouldn't include anything that breaks the use cases you describe.
But I'm quite sure it will support sharing BOs between multiple devices'
VMs.

Confirmed with the current patch series. I can allocate buffers on GPU 1
and map them into a VM on GPU 2.


As I said, VM mapping is not the problem here.


BTW, this is the normal mode of operation for system memory BOs on
multi-GPU systems. Logically system memory BOs belong to the CPU node in
the KFD topology. So the Thunk API isn't told which GPU to allocate the
system memory BO from. It just picks the first one. Then we can map
those BOs on any GPUs we want. If we want to use them on GPU2, that's
where they get mapped.


Well, keeping NUMA in mind, that actually sounds like a design problem to me.

On NUMA systems some parts of the system memory can be "closer" to a GPU 
than other parts.



So I just put two GPUs in a system and ran a test on GPU2. The system
memory buffers are allocated from the GPU1 device, but mapped into the
GPU2 VM. The tests work normally.

If this is enabled by any changes that break existing buffer sharing for
A+A or A+I systems, please point it out to me. I'm not aware that this
patch series does anything to that effect.


As I said, it completely breaks scanout on A+I systems.

Overall it looks like the change causes more problems than it solves. 
So I'm going to block upstreaming it until we have found a way to make 
it work for everybody.


Regards,
Christian.



Regards,
   Felix


Regards,
   Felix


So the end result is that we probably need to revert it and find a
different solution. I've been working on this for a couple of weeks
now and should have something ready after I'm done with the PASID
handling.

Regards,
Christian.




Re: Documentation about AMD's HSA implementation?

2018-02-13 Thread Ming Yang
Thanks for all the inputs.  Very helpful!  I think I have a general
understanding of the queue scheduling now and it's time for me to read
more code and materials and do some experiments.

I'll come back with more questions hopefully. :-)

Hi David, please don't hesitate to share more documents.  I might find
helpful information from them eventually.  People like me may benefit
from them in some way in the future.


Best,
Ming (Mark)

On Tue, Feb 13, 2018 at 7:14 PM, Panariti, David  wrote:
> I found a bunch of docs whilst spelunking for info for another project.
> I'm not sure what's up-to-date, correct, useful, etc.
> I've attached one.
> Let me know if you want any more.
>
> davep
>
>> -Original Message-
>> From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf
>> Of Bridgman, John
>> Sent: Tuesday, February 13, 2018 6:45 PM
>> To: Bridgman, John ; Ming Yang
>> ; Kuehling, Felix 
>> Cc: Deucher, Alexander ; amd-g...@lists.freedesktop.org
>> Subject: RE: Documentation about AMD's HSA implementation?
>>
>>
>> >-Original Message-
>> >From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf
>> >Of Bridgman, John
>> >Sent: Tuesday, February 13, 2018 6:42 PM
>> >To: Ming Yang; Kuehling, Felix
>> >Cc: Deucher, Alexander; amd-gfx@lists.freedesktop.org
>> >Subject: RE: Documentation about AMD's HSA implementation?
>> >
>> >
>> >
>> >>-Original Message-
>> >>From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf
>> >>Of Ming Yang
>> >>Sent: Tuesday, February 13, 2018 4:59 PM
>> >>To: Kuehling, Felix
>> >>Cc: Deucher, Alexander; amd-gfx@lists.freedesktop.org
>> >>Subject: Re: Documentation about AMD's HSA implementation?
>> >>
>> >>That's very helpful, thanks!
>> >>
>> >>On Tue, Feb 13, 2018 at 4:17 PM, Felix Kuehling wrote:
>> >>> On 2018-02-13 04:06 PM, Ming Yang wrote:
>>  Thanks for the suggestions!  But I might ask several specific
>>  questions, as I can't find the answer in those documents, to give
>>  myself a quick start if that's okay. Pointing me to the
>>  files/functions would be good enough.  Any explanations are
>>  appreciated.   My purpose is to hack it with different scheduling
>>  policies with real-time and predictability considerations.
>> 
>>  - Where/How is the packet scheduler implemented?  How are packets
>>  from multiple queues scheduled?  What about scheduling packets from
>>  queues in different address spaces?
>> >>>
>> >>> This is done mostly in firmware. The CP engine supports up to 32 queues.
>> >>> We share those between KFD and AMDGPU. KFD gets 24 queues to use.
>> >>> Usually that is 6 queues times 4 pipes. Pipes are threads in the CP
>> >>> micro engine. Within each pipe the queues are time-multiplexed.
>> >>
>> >>Please correct me if I'm wrong.  CP is a computing processor, like the
>> >>Execution Engine in an NVIDIA GPU. A pipe is like a wavefront (warp)
>> >>scheduler multiplexing queues, in order to hide memory latency.
>> >
>> >CP is one step back from that - it's a "command processor" which reads
>> >command packets from driver (PM4 format) or application (AQL format)
>> >then manages the execution of each command on the GPU. A typical packet
>> >might be "dispatch", which initiates a compute operation on an
>> >N-dimensional array, or "draw" which initiates the rendering of an
>> >array of triangles. Those compute and render commands then generate a
>> >(typically) large number of wavefronts which are multiplexed on the
>> >shader core (by SQ IIRC). Most of our recent GPUs have one micro engine
>> >for graphics ("ME") and two for compute ("MEC"). Marketing refers to each
>> >pipe on an MEC block as an "ACE".
>>
>> I missed one important point - "CP" refers to the combination of ME, MEC(s)
>> and a few other related blocks.
>>
>> >>
>> >>>
>> >>> If we need more than 24 queues, or if we have more than 8 processes,
>> >>> the hardware scheduler (HWS) adds another layer of scheduling,
>> >>> basically round-robin between batches of 24 queues or 8 processes.
>> >>> Once you get into such an over-subscribed scenario your performance
>> >>> and GPU utilization can suffer quite badly.
>> >>
>> >>HWS is also implemented in the firmware that's closed-source?
>> >
>> >Correct - HWS is implemented in the MEC microcode. We also include a
>> >simple SW scheduler in the open source driver code, however.
>> >>
>> >>>
>> 
>>  - I noticed the new support of concurrency of multi-processes in
>>  the archive of this mailing list.  Could you point me to the code
>>  that implements this?
>> >>>
>> >>> That's basically just a switch that tells the firmware that it is
>> >>> allowed to schedule queues from different processes at the same time.
>> >>> The upper limit is the number of VMIDs that HWS can work with. It
>> >>> needs to assign a unique VMID to each process (each VMID representing
>> >>> a separate address space, page table, etc.). If there are more
>> >>> processes than VMIDs, the HWS has to time-multiplex.

RE: Documentation about AMD's HSA implementation?

2018-02-13 Thread Bridgman, John

>-Original Message-
>From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf Of
>Bridgman, John
>Sent: Tuesday, February 13, 2018 6:42 PM
>To: Ming Yang; Kuehling, Felix
>Cc: Deucher, Alexander; amd-gfx@lists.freedesktop.org
>Subject: RE: Documentation about AMD's HSA implementation?
>
>
>
>>-Original Message-
>>From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf
>>Of Ming Yang
>>Sent: Tuesday, February 13, 2018 4:59 PM
>>To: Kuehling, Felix
>>Cc: Deucher, Alexander; amd-gfx@lists.freedesktop.org
>>Subject: Re: Documentation about AMD's HSA implementation?
>>
>>That's very helpful, thanks!
>>
>>On Tue, Feb 13, 2018 at 4:17 PM, Felix Kuehling wrote:
>>> On 2018-02-13 04:06 PM, Ming Yang wrote:
 Thanks for the suggestions!  But I might ask several specific
 questions, as I can't find the answer in those documents, to give
 myself a quick start if that's okay. Pointing me to the
 files/functions would be good enough.  Any explanations are
 appreciated.   My purpose is to hack it with different scheduling
 policies with real-time and predictability considerations.

 - Where/How is the packet scheduler implemented?  How are packets
 from multiple queues scheduled?  What about scheduling packets from
 queues in different address spaces?
>>>
>>> This is done mostly in firmware. The CP engine supports up to 32 queues.
>>> We share those between KFD and AMDGPU. KFD gets 24 queues to use.
>>> Usually that is 6 queues times 4 pipes. Pipes are threads in the CP
>>> micro engine. Within each pipe the queues are time-multiplexed.
>>
>>Please correct me if I'm wrong.  CP is a computing processor, like the
>>Execution Engine in an NVIDIA GPU. A pipe is like a wavefront (warp) scheduler
>>multiplexing queues, in order to hide memory latency.
>
>CP is one step back from that - it's a "command processor" which reads
>command packets from driver (PM4 format) or application (AQL format) then
>manages the execution of each command on the GPU. A typical packet might
>be "dispatch", which initiates a compute operation on an N-dimensional array,
>or "draw" which initiates the rendering of an array of triangles. Those
>compute and render commands then generate a (typically) large number of
>wavefronts which are multiplexed on the shader core (by SQ IIRC). Most of
>our recent GPUs have one micro engine for graphics ("ME") and two for
>compute ("MEC"). Marketing refers to each pipe on an MEC block as an "ACE".

I missed one important point - "CP" refers to the combination of ME, MEC(s) and 
a few other related blocks.

>>
>>>
>>> If we need more than 24 queues, or if we have more than 8 processes,
>>> the hardware scheduler (HWS) adds another layer of scheduling,
>>> basically round-robin between batches of 24 queues or 8 processes.
>>> Once you get into such an over-subscribed scenario your performance
>>> and GPU utilization can suffer quite badly.
>>
>>HWS is also implemented in the firmware that's closed-source?
>
>Correct - HWS is implemented in the MEC microcode. We also include a simple
>SW scheduler in the open source driver code, however.
>>
>>>

 - I noticed the new support of concurrency of multi-processes in the
 archive of this mailing list.  Could you point me to the code that
 implements this?
>>>
>>> That's basically just a switch that tells the firmware that it is
>>> allowed to schedule queues from different processes at the same time.
>>> The upper limit is the number of VMIDs that HWS can work with. It
>>> needs to assign a unique VMID to each process (each VMID representing
>>> a separate address space, page table, etc.). If there are more
>>> processes than VMIDs, the HWS has to time-multiplex.
>>
>>HWS dispatches packets in their order of becoming the head of the queue,
>>i.e., being pointed to by the read_index? So in this way it's FIFO.  Or
>>round-robin between queues? You mentioned round-robin over batches in
>>the over-subscribed scenario.
>
>Round robin between sets of queues. The HWS logic generates sets as
>follows:
>
>1. "set resources" packet from driver tells scheduler how many VMIDs and
>HW queues it can use
>
>2. "runlist" packet from driver provides list of processes and list of
>queues for each process
>
>3. if multi-process switch not set, HWS schedules as many queues from the
>first process in the runlist as it has HW queues (see #1)
>
>4. at the end of process quantum (set by driver) either switch to next process
>(if all queues from first process have been scheduled) or schedule next set of
>queues from the same process
>
>5. when all queues from all processes have been scheduled and run for a
>process quantum, go back to the start of the runlist and repeat
>
>If the multi-process switch is set, and the number of queues for a process is
>less than the number of HW queues available, then in step #3 above HWS will
>start scheduling queues for additional processes, using a different VMID for
>each process, and continue until it either runs out of VMIDs or HW queues (or
>reaches the end of the runlist). All of the queues and processes would then
>run together for a process quantum before switching to the next queue set.

RE: Documentation about AMD's HSA implementation?

2018-02-13 Thread Bridgman, John


>-Original Message-
>From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf Of
>Ming Yang
>Sent: Tuesday, February 13, 2018 4:59 PM
>To: Kuehling, Felix
>Cc: Deucher, Alexander; amd-gfx@lists.freedesktop.org
>Subject: Re: Documentation about AMD's HSA implementation?
>
>That's very helpful, thanks!
>
>On Tue, Feb 13, 2018 at 4:17 PM, Felix Kuehling wrote:
>> On 2018-02-13 04:06 PM, Ming Yang wrote:
>>> Thanks for the suggestions!  But I might ask several specific
>>> questions, as I can't find the answer in those documents, to give
>>> myself a quick start if that's okay. Pointing me to the
>>> files/functions would be good enough.  Any explanations are
>>> appreciated.   My purpose is to hack it with different scheduling
>>> policies with real-time and predictability considerations.
>>>
>>> - Where/How is the packet scheduler implemented?  How are packets
>>> from multiple queues scheduled?  What about scheduling packets from
>>> queues in different address spaces?
>>
>> This is done mostly in firmware. The CP engine supports up to 32 queues.
>> We share those between KFD and AMDGPU. KFD gets 24 queues to use.
>> Usually that is 6 queues times 4 pipes. Pipes are threads in the CP
>> micro engine. Within each pipe the queues are time-multiplexed.
>
>Please correct me if I'm wrong.  CP is a computing processor, like the Execution
>Engine in an NVIDIA GPU. A pipe is like a wavefront (warp) scheduler multiplexing
>queues, in order to hide memory latency.

CP is one step back from that - it's a "command processor" which reads command 
packets from driver (PM4 format) or application (AQL format) then manages the 
execution of each command on the GPU. A typical packet might be "dispatch", 
which initiates a compute operation on an N-dimensional array, or "draw" which 
initiates the rendering of an array of triangles. Those compute and render 
commands then generate a (typically) large number of wavefronts which are 
multiplexed on the shader core (by SQ IIRC). Most of our recent GPUs have one 
micro engine for graphics ("ME") and two for compute ("MEC"). Marketing refers 
to each pipe on an MEC block as an "ACE".
>
>>
>> If we need more than 24 queues, or if we have more than 8 processes,
>> the hardware scheduler (HWS) adds another layer of scheduling, basically
>> round-robin between batches of 24 queues or 8 processes. Once you get
>> into such an over-subscribed scenario your performance and GPU
>> utilization can suffer quite badly.
>
>HWS is also implemented in the firmware that's closed-source?

Correct - HWS is implemented in the MEC microcode. We also include a simple SW 
scheduler in the open source driver code, however. 
>
>>
>>>
>>> - I noticed the new support of concurrency of multi-processes in the
>>> archive of this mailing list.  Could you point me to the code that
>>> implements this?
>>
>> That's basically just a switch that tells the firmware that it is
>> allowed to schedule queues from different processes at the same time.
>> The upper limit is the number of VMIDs that HWS can work with. It
>> needs to assign a unique VMID to each process (each VMID representing
>> a separate address space, page table, etc.). If there are more
>> processes than VMIDs, the HWS has to time-multiplex.
>
>HWS dispatches packets in their order of becoming the head of the queue, i.e.,
>being pointed to by the read_index? So in this way it's FIFO.  Or round-robin
>between queues? You mentioned round-robin over batches in the
>over-subscribed scenario.

Round robin between sets of queues. The HWS logic generates sets as follows:

1. "set resources" packet from driver tells scheduler how many VMIDs and HW 
queues it can use

2. "runlist" packet from driver provides list of processes and list of queues 
for each process

3. if multi-process switch not set, HWS schedules as many queues from the first 
process in the runlist as it has HW queues (see #1)

4. at the end of process quantum (set by driver) either switch to next process 
(if all queues from first process have been scheduled) or schedule next set of 
queues from the same process

5. when all queues from all processes have been scheduled and run for a process 
quantum, go back to the start of the runlist and repeat

If the multi-process switch is set, and the number of queues for a process is 
less than the number of HW queues available, then in step #3 above HWS will 
start scheduling queues for additional processes, using a different VMID for 
each process, and continue until it either runs out of VMIDs or HW queues (or 
reaches the end of the runlist). All of the queues and processes would then run 
together for a process quantum before switching to the next queue set.
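
As a toy model only, one way to picture the set-building in steps 1-5
(plain C; HW_QUEUES/VMIDS and everything else here are assumptions, not
the MEC firmware):

#include <stdio.h>

#define HW_QUEUES 24    /* from the "set resources" packet (step 1) */
#define VMIDS      8

struct proc { int queues; };

/* One pass over the runlist (steps 3-5): build queue sets and run each
 * for one process quantum. With multi_process set, a set may mix
 * processes, one VMID each; without it, each set holds queues of a
 * single process. */
static void runlist_pass(struct proc *rl, int n, int multi_process)
{
        int i = 0, left = rl[0].queues;

        while (i < n) {
                int q = 0, v = 0;

                while (i < n && q < HW_QUEUES && v < VMIDS) {
                        int take = left < HW_QUEUES - q ? left : HW_QUEUES - q;

                        q += take;
                        left -= take;
                        v++;
                        if (left == 0 && ++i < n)
                                left = rl[i].queues;
                        if (!multi_process)
                                break;
                }
                printf("set: %2d queues from %d process(es), one quantum\n",
                       q, v);
        }
}

int main(void)
{
        struct proc rl[] = { { 6 }, { 6 }, { 20 }, { 4 } };

        runlist_pass(rl, 4, 1); /* multi-process switch set */
        runlist_pass(rl, 4, 0); /* one process per set */
        return 0;
}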

>
>This might not be a big deal for performance, but it matters for predictability
>and real-time analysis.

Agreed. In general you would not want to overcommit either VMIDs or HW queues 
in a real-time scenario, and for hard real time you would probably 

Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Felix Kuehling
On 2018-02-13 02:18 PM, Felix Kuehling wrote:
> On 2018-02-13 01:15 PM, Christian König wrote:
>> On 2018-02-13 06:18 PM, Felix Kuehling wrote:
>>> On 2018-02-13 12:06 PM, Christian König wrote:
 [SNIP]
 Ah, yeah that is also a point I wanted to talk about with you.

 The approach of using the same buffer object with multiple amdgpu
 devices doesn't work in general.

 We need a separate TTM object for each BO in each device or otherwise we
 break A+A laptops.
>>> I think it broke for VRAM BOs because we enabled P2P on systems that
>>> didn't support it properly. But at least system memory BOs can be shared
>>> quite well between devices and we do it all the time.
>> Sharing VRAM BOs is one issue, but the problem goes deeper than just
>> that.
>>
>> Starting with Carrizo we can scanout from system memory to avoid the
>> extra copy on A+A laptops. For this to work we need the BO mapped to
>> GART (and I mean a real VMID0 mapping, not just in the GTT domain).
>> And for this to work in turn we need a TTM object per device and not a
>> global one.
> I still don't understand. I think what you're talking about applies only
> to BOs used for scan-out. Every BO is allocated from a specific device
> and can only be GART-mapped on that device. What we do is map the same
> BO in VMs on other devices. It has no effect on GART mappings.
>
>>> I don't see how you can have separate TTM objects referring to the
>>> same memory.
>> Well that is trivial, we do this all the time with prime and I+A laptops.
> As I understand it, you use DMABuf to export/import buffers on multiple
> devices. I believe all devices share a single amdgpu_bo, which contains
> the ttm_buffer_object. The only way you can have a new TTM buffer object
> per device is by using SG tables and pinning the BOs. But I think we
> want to avoid pinning BOs.
>
> What we do does not involve pinning of BOs, even when they're shared
> between multiple devices' VMs.
>
 That is also the reason we had to disable this feature again in the
 hybrid branches.
>>> What you disabled on hybrid branches was P2P, which only affects
>>> large-BAR systems. Sharing of system memory BOs between devices still
>>> works fine.
>> No, it doesn't. It completely breaks any scanout on Carrizo, Stoney
>> and Raven. In addition to that we found that it breaks some aspects of
>> the user space interface.
> Let me check that with my current patch series. The patches I submitted
> here shouldn't include anything that breaks the use cases you describe.
> But I'm quite sure it will support sharing BOs between multiple devices'
> VMs.

Confirmed with the current patch series. I can allocate buffers on GPU 1
and map them into a VM on GPU 2.

BTW, this is the normal mode of operation for system memory BOs on
multi-GPU systems. Logically system memory BOs belong to the CPU node in
the KFD topology. So the Thunk API isn't told which GPU to allocate the
system memory BO from. It just picks the first one. Then we can map
those BOs on any GPUs we want. If we want to use them on GPU2, that's
where they get mapped.

So I just put two GPUs in a system and ran a test on GPU2. The system
memory buffers are allocated from the GPU1 device, but mapped into the
GPU2 VM. The tests work normally.

If this is enabled by any changes that break existing buffer sharing for
A+A or A+I systems, please point it out to me. I'm not aware that this
patch series does anything to that effect.

Regards,
  Felix

>
> Regards,
>   Felix
>
>> So the end result is that we probably need to revert it and find a
>> different solution. I've been working on this for a couple of weeks
>> now and should have something ready after I'm done with the PASID
>> handling.
>>
>> Regards,
>> Christian.



Re: Documentation about AMD's HSA implementation?

2018-02-13 Thread Felix Kuehling
On 2018-02-13 04:58 PM, Ming Yang wrote:
> That's very helpful, thanks!
>
> On Tue, Feb 13, 2018 at 4:17 PM, Felix Kuehling wrote:
>> On 2018-02-13 04:06 PM, Ming Yang wrote:
>>> Thanks for the suggestions!  But I might ask several specific
>>> questions, as I can't find the answer in those documents, to give
>>> myself a quick start if that's okay. Pointing me to the
>>> files/functions would be good enough.  Any explanations are
>>> appreciated.   My purpose is to hack it with different scheduling
>>> policies with real-time and predictability considerations.
>>>
>>> - Where/How is the packet scheduler implemented?  How are packets from
>>> multiple queues scheduled?  What about scheduling packets from queues
>>> in different address spaces?
>> This is done mostly in firmware. The CP engine supports up to 32 queues.
>> We share those between KFD and AMDGPU. KFD gets 24 queues to use.
>> Usually that is 6 queues times 4 pipes. Pipes are threads in the CP
>> micro engine. Within each pipe the queues are time-multiplexed.
> Please correct me if I'm wrong.  CP is a computing processor, like the
> Execution Engine in an NVIDIA GPU. A pipe is like a wavefront (warp)
> scheduler multiplexing queues, in order to hide memory latency.

CP stands for "command processor". There are multiple CP micro engines.
CPG for graphics, CPC for compute. This is not related to warps or wave
fronts. Those execute on the CUs (compute units). There are many CUs,
but only one CPC. The scheduling or dispatching of wavefronts to CUs is
yet another level of scheduling that I didn't talk about. Applications
submit AQL dispatch packets to user mode queues. The CP processes those
dispatch packets and schedules the resulting wavefronts on CUs.
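
For reference, this is what one of those AQL dispatch packets looks like
when user mode fills it in, using the public HSA runtime types from hsa.h
(a sketch of a 1D dispatch; sizes are example values, and setup of the
code object, kernarg buffer and signal is omitted):

#include <hsa/hsa.h>
#include <string.h>

void submit_dispatch_sketch(hsa_queue_t *q, uint64_t kernel_object,
                            void *kernarg, hsa_signal_t done)
{
        uint64_t idx = hsa_queue_add_write_index_relaxed(q, 1);
        hsa_kernel_dispatch_packet_t *pkt =
                (hsa_kernel_dispatch_packet_t *)q->base_address +
                (idx & (q->size - 1));

        memset(pkt, 0, sizeof(*pkt));
        pkt->setup = 1 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
        pkt->workgroup_size_x = 256;
        pkt->workgroup_size_y = 1;
        pkt->workgroup_size_z = 1;
        pkt->grid_size_x = 256 * 1024;
        pkt->grid_size_y = 1;
        pkt->grid_size_z = 1;
        pkt->kernel_object = kernel_object;
        pkt->kernarg_address = kernarg;
        pkt->completion_signal = done;

        /* Publish the packet type last, then ring the doorbell. */
        __atomic_store_n(&pkt->header,
                         HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE,
                         __ATOMIC_RELEASE);
        hsa_signal_store_relaxed(q->doorbell_signal, idx);
}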

>
>> If we need more than 24 queues, or if we have more than 8 processes, the
>> hardware scheduler (HWS) adds another layer of scheduling, basically
>> round-robin between batches of 24 queues or 8 processes. Once you get
>> into such an over-subscribed scenario your performance and GPU
>> utilization can suffer quite badly.
> HWS is also implemented in the firmware that's closed-source?

Yes.

>
>>> - I noticed the new support of concurrency of multi-processes in the
>>> archive of this mailing list.  Could you point me to the code that
>>> implements this?
>> That's basically just a switch that tells the firmware that it is
>> allowed to schedule queues from different processes at the same time.
>> The upper limit is the number of VMIDs that HWS can work with. It needs
>> to assign a unique VMID to each process (each VMID representing a
>> separate address space, page table, etc.). If there are more processes
>> than VMIDs, the HWS has to time-multiplex.
> HWS dispatches packets in their order of becoming the head of the queue,
> i.e., being pointed to by the read_index? So in this way it's FIFO.  Or
> round-robin between queues? You mentioned round-robin over batches in
> the over-subscribed scenario.

Commands within a queue are handled in FIFO order. Commands on different
queues are not ordered with respect to each other.

When I talk about round robin of batches of queues, it means that the
pipes of the CP are executing up to 24 user mode queues. After the time
slice is up, the HWS preempts those queues, and loads another batch of
queues to the CP pipes. This goes on until all the queues in the runlist
have had some time on the GPU. Then the whole process starts over.

>
> This might not be a big deal for performance, but it matters for
> predictability and real-time analysis.
>
>>> - Also another related question -- where/how is the preemption/context
>>> switch between packets/queues implemented?
>> As long as you don't oversubscribe the available VMIDs, there is no real
>> context switching. Everything can run concurrently. When you start
>> oversubscribing HW queues or VMIDs, the HWS firmware will start
>> multiplexing. This is all handled inside the firmware and is quite
>> transparent even to KFD.
> I see.  So the preemption in at least AMD's implementation is not
> switching out the executing kernel, but just letting new kernels
> run concurrently with the existing ones.  This means the performance
> is degraded when too many workloads are submitted.  The running
> kernels leave the GPU only when they are done.

No, that's not what I meant. As long as nothing is oversubscribed, you
don't have preemptions. But as soon as hardware queues or VMIDs are
oversubscribed, the HWS will need to preempt queues in order to let
other queues have some time on the hardware. Preempting a queue includes
preempting all the wavefronts that were dispatched by that queue. The
state of all the CUs is saved and later restored. We call this CWSR
(compute wave save/restore).

>
> Is there any reason for not preempting/switching out the existing
> kernel, besides context switch overheads?  NVIDIA is not providing
> this option either.  Non-preemption hurts the real-time property in
> terms of priority inversion.  I understand preemption should not be
> massively used but having such an option may help a lot for real-time
> systems.

Re: Documentation about AMD's HSA implementation?

2018-02-13 Thread Ming Yang
That's very helpful, thanks!

On Tue, Feb 13, 2018 at 4:17 PM, Felix Kuehling  wrote:
> On 2018-02-13 04:06 PM, Ming Yang wrote:
>> Thanks for the suggestions!  But I might ask several specific
>> questions, as I can't find the answer in those documents, to give
>> myself a quick start if that's okay. Pointing me to the
>> files/functions would be good enough.  Any explanations are
>> appreciated.   My purpose is to hack it with different scheduling
>> policies with real-time and predictability considerations.
>>
>> - Where/How is the packet scheduler implemented?  How are packets from
>> multiple queues scheduled?  What about scheduling packets from queues
>> in different address spaces?
>
> This is done mostly in firmware. The CP engine supports up to 32 queues.
> We share those between KFD and AMDGPU. KFD gets 24 queues to use.
> Usually that is 6 queues times 4 pipes. Pipes are threads in the CP
> micro engine. Within each pipe the queues are time-multiplexed.

Please correct me if I'm wrong.  CP is a computing processor, like the
Execution Engine in an NVIDIA GPU. A pipe is like a wavefront (warp)
scheduler multiplexing queues, in order to hide memory latency.

>
> If we need more than 24 queues, or if we have more than 8 processes, the
> hardware scheduler (HWS) adds another layer of scheduling, basically
> round-robin between batches of 24 queues or 8 processes. Once you get
> into such an over-subscribed scenario your performance and GPU
> utilization can suffer quite badly.

HWS is also implemented in the firmware that's closed-source?

>
>>
>> - I noticed the new support of concurrency of multi-processes in the
>> archive of this mailing list.  Could you point me to the code that
>> implements this?
>
> That's basically just a switch that tells the firmware that it is
> allowed to schedule queues from different processes at the same time.
> The upper limit is the number of VMIDs that HWS can work with. It needs
> to assign a unique VMID to each process (each VMID representing a
> separate address space, page table, etc.). If there are more processes
> than VMIDs, the HWS has to time-multiplex.

HWS dispatches packets in their order of becoming the head of the queue,
i.e., being pointed to by the read_index? So in this way it's FIFO.  Or
round-robin between queues? You mentioned round-robin over batches in
the over-subscribed scenario.

This might not be a big deal for performance, but it matters for
predictability and real-time analysis.

>
>>
>> - Also another related question -- where/how is the preemption/context
>> switch between packets/queues implemented?
>
> As long as you don't oversubscribe the available VMIDs, there is no real
> context switching. Everything can run concurrently. When you start
> oversubscribing HW queues or VMIDs, the HWS firmware will start
> multiplexing. This is all handled inside the firmware and is quite
> transparent even to KFD.

I see.  So the preemption in at least AMD's implementation is not
switching out the executing kernel, but just letting new kernels
run concurrently with the existing ones.  This means the performance
is degraded when too many workloads are submitted.  The running
kernels leave the GPU only when they are done.

Is there any reason for not preempting/switching out the existing
kernel, besides context switch overheads?  NVIDIA is not providing
this option either.  Non-preemption hurts the real-time property in
terms of priority inversion.  I understand preemption should not be
massively used but having such an option may help a lot for real-time
systems.

>
> KFD interacts with the HWS firmware through the HIQ (HSA interface
> queue). It supports packets for unmapping queues; we can send it a new
> runlist (basically a bunch of map-process and map-queue packets). The
> interesting files to look at are kfd_packet_manager.c,
> kfd_kernel_queue_.c and kfd_device_queue_manager.c.
>

So in this way, if we want to implement a different scheduling policy,
we should control the submission of packets to the queues in the
runtime/KFD, before they get to the firmware, because it's out of
our control once it's submitted to the HWS in the firmware.

Best,
Mark

> Regards,
>   Felix
>
>>
>> Thanks in advance!
>>
>> Best,
>> Mark
>>
>>> On 13 Feb 2018, at 2:56 PM, Felix Kuehling  wrote:
>>> There is also this: https://gpuopen.com/professional-compute/, which
>>> gives pointers to several libraries and tools built on top of ROCm.
>>>
>>> Another thing to keep in mind is that ROCm is diverging from the strict
>>> HSA standard in some important ways. For example the HSA standard
>>> includes HSAIL as an intermediate representation that gets finalized on
>>> the target system, whereas ROCm compiles directly to native GPU ISA.
>>>
>>> Regards,
>>>   Felix
>>>
>>> On Tue, Feb 13, 2018 at 9:40 AM, Deucher, Alexander wrote:
 The ROCm documentation is probably a good place to start:

 

Re: Documentation about AMD's HSA implementation?

2018-02-13 Thread Felix Kuehling
On 2018-02-13 04:06 PM, Ming Yang wrote:
> Thanks for the suggestions!  But I might ask several specific
> questions, as I can't find the answer in those documents, to give
> myself a quick start if that's okay. Pointing me to the
> files/functions would be good enough.  Any explanations are
> appreciated.   My purpose is to hack it with different scheduling
> policies with real-time and predictability considerations.
>
> - Where/How is the packet scheduler implemented?  How are packets from
> multiple queues scheduled?  What about scheduling packets from queues
> in different address spaces?

This is done mostly in firmware. The CP engine supports up to 32 queues.
We share those between KFD and AMDGPU. KFD gets 24 queues to use.
Usually that is 6 queues times 4 pipes. Pipes are threads in the CP
micro engine. Within each pipe the queues are time-multiplexed.

If we need more than 24 queues, or if we have more than 8 processes, the
hardware scheduler (HWS) adds another layer of scheduling, basically
round-robin between batches of 24 queues or 8 processes. Once you get
into such an over-subscribed scenario your performance and GPU
utilization can suffer quite badly.

>
> - I noticed the new support of concurrency of multi-processes in the
> archive of this mailing list.  Could you point me to the code that
> implements this?

That's basically just a switch that tells the firmware that it is
allowed to schedule queues from different processes at the same time.
The upper limit is the number of VMIDs that HWS can work with. It needs
to assign a unique VMID to each process (each VMID representing a
separate address space, page table, etc.). If there are more processes
than VMIDs, the HWS has to time-multiplex.

>
> - Also another related question -- where/how is the preemption/context
> switch between packets/queues implemented?

As long as you don't oversubscribe the available VMIDs, there is no real
context switching. Everything can run concurrently. When you start
oversubscribing HW queues or VMIDs, the HWS firmware will start
multiplexing. This is all handled inside the firmware and is quite
transparent even to KFD.

KFD interacts with the HWS firmware through the HIQ (HSA interface
queue). It supports packets for unmapping queues; we can send it a new
runlist (basically a bunch of map-process and map-queue packets). The
interesting files to look at are kfd_packet_manager.c,
kfd_kernel_queue_.c and kfd_device_queue_manager.c.
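
Schematically, building a runlist amounts to something like the following
(hypothetical helpers; the real PM4 packet writers live in
kfd_packet_manager.c):

/* All names here are stand-ins, not the actual KFD functions. */
static void build_runlist_sketch(struct runlist_buf *rl,
                                 struct process_list *procs)
{
        struct kfd_proc *p;
        struct kfd_queue *q;

        for_each_process(p, procs) {
                /* one MAP_PROCESS packet: PASID, page table base, ... */
                emit_map_process(rl, p);
                for_each_queue(q, p)
                        /* one MAP_QUEUES packet: MQD address, doorbell, ... */
                        emit_map_queue(rl, q);
        }
        submit_runlist_on_hiq(rl);      /* hand it to the HWS firmware */
}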

Regards,
  Felix

>
> Thanks in advance!
>
> Best,
> Mark
>
>> On 13 Feb 2018, at 2:56 PM, Felix Kuehling  wrote:
>> There is also this: https://gpuopen.com/professional-compute/, which
>> gives pointers to several libraries and tools built on top of ROCm.
>>
>> Another thing to keep in mind is that ROCm is diverging from the strict
>> HSA standard in some important ways. For example the HSA standard
>> includes HSAIL as an intermediate representation that gets finalized on
>> the target system, whereas ROCm compiles directly to native GPU ISA.
>>
>> Regards,
>>   Felix
>>
>> On Tue, Feb 13, 2018 at 9:40 AM, Deucher, Alexander wrote:
>>> The ROCm documentation is probably a good place to start:
>>>
>>> https://rocm.github.io/documentation.html
>>>
>>>
>>> Alex
>>>
>>> 
>>> From: amd-gfx  on behalf of Ming Yang
>>> 
>>> Sent: Tuesday, February 13, 2018 12:00 AM
>>> To: amd-gfx@lists.freedesktop.org
>>> Subject: Documentation about AMD's HSA implementation?
>>>
>>> Hi,
>>>
>>> I'm interested in HSA and was excited when I found AMD's fully open-stack ROCm
>>> supporting it. Before digging into the code, I wonder if there's any
>>> documentation available about AMD's HSA implementation, either book,
>>> whitepaper, paper, or documentation.
>>>
>>> I did find helpful materials about HSA, including HSA standards on this page
>>> (http://www.hsafoundation.com/standards/) and a nice book about HSA
>>> (Heterogeneous System Architecture A New Compute Platform Infrastructure).
>>> But regarding the documentation about AMD's implementation, I haven't found
>>> anything yet.
>>>
>>> Please let me know if there are ones publicly accessible. If not, any
>>> suggestions on learning the implementation of specific system components,
>>> e.g., queue scheduling.
>>>
>>> Best,
>>> Mark



Re: Documentation about AMD's HSA implementation?

2018-02-13 Thread Ming Yang
Thanks for the suggestions!  But I might ask several specific
questions, as I can't find the answer in those documents, to give
myself a quick start if that's okay. Pointing me to the
files/functions would be good enough.  Any explanations are
appreciated.   My purpose is to hack it with different scheduling
policies with real-time and predictability considerations.

- Where/How is the packet scheduler implemented?  How are packets from
multiple queues scheduled?  What about scheduling packets from queues
in different address spaces?

- I noticed the new support of concurrency of multi-processes in the
archive of this mailing list.  Could you point me to the code that
implements this?

- Also another related question -- where/how is the preemption/context
switch between packets/queues implemented?

Thanks in advance!

Best,
Mark

> On 13 Feb 2018, at 2:56 PM, Felix Kuehling  wrote:
> There is also this: https://gpuopen.com/professional-compute/, which
> gives pointers to several libraries and tools built on top of ROCm.
>
> Another thing to keep in mind is that ROCm is diverging from the strict
> HSA standard in some important ways. For example the HSA standard
> includes HSAIL as an intermediate representation that gets finalized on
> the target system, whereas ROCm compiles directly to native GPU ISA.
>
> Regards,
>   Felix
>
> On Tue, Feb 13, 2018 at 9:40 AM, Deucher, Alexander wrote:
> > The ROCm documentation is probably a good place to start:
> >
> > https://rocm.github.io/documentation.html
> >
> >
> > Alex
> >
> > 
> > From: amd-gfx  on behalf of Ming Yang
> > 
> > Sent: Tuesday, February 13, 2018 12:00 AM
> > To: amd-gfx@lists.freedesktop.org
> > Subject: Documentation about AMD's HSA implementation?
> >
> > Hi,
> >
> > I'm interested in HSA and was excited when I found AMD's fully open-stack ROCm
> > supporting it. Before digging into the code, I wonder if there's any
> > documentation available about AMD's HSA implementation, either book,
> > whitepaper, paper, or documentation.
> >
> > I did find helpful materials about HSA, including HSA standards on this page
> > (http://www.hsafoundation.com/standards/) and a nice book about HSA
> > (Heterogeneous System Architecture A New Compute Platform Infrastructure).
> > But regarding the documentation about AMD's implementation, I haven't found
> > anything yet.
> >
> > Please let me know if there are ones publicly accessible. If not, any
> > suggestions on learning the implementation of specific system components,
> > e.g., queue scheduling.
> >
> > Best,
> > Mark


Re: Documentation about AMD's HSA implementation?

2018-02-13 Thread Panariti, David
+ Dave Roberts

Do you still have links to the HSA doc collected during NMI?


From: amd-gfx  on behalf of Felix 
Kuehling 
Sent: Tuesday, February 13, 2018 2:56:47 PM
To: amd-gfx@lists.freedesktop.org
Subject: Re: Documentation about AMD's HSA implementation?

There is also this: https://gpuopen.com/professional-compute/, which
gives pointers to several libraries and tools built on top of ROCm.


Another thing to keep in mind is that ROCm is diverging from the strict
HSA standard in some important ways. For example the HSA standard
includes HSAIL as an intermediate representation that gets finalized on
the target system, whereas ROCm compiles directly to native GPU ISA.


Regards,
  Felix


On 2018-02-13 09:40 AM, Deucher, Alexander wrote:
>
> The ROCm documentation is probably a good place to start:
>
> https://rocm.github.io/documentation.html
>
>
> Alex
>
> 
> *From:* amd-gfx  on behalf of
> Ming Yang 
> *Sent:* Tuesday, February 13, 2018 12:00 AM
> *To:* amd-gfx@lists.freedesktop.org
> *Subject:* Documentation about AMD's HSA implementation?
>
> Hi,
>
> I'm interested in HSA and was excited when I found AMD's fully open-stack
> ROCm supporting it. Before digging into the code, I wonder if there's
> any documentation available about AMD's HSA implementation, either
> book, whitepaper, paper, or documentation.
>
> I did find helpful materials about HSA, including HSA standards on
> this page (http://www.hsafoundation.com/standards/) and a nice book
> about HSA (Heterogeneous System Architecture A New Compute Platform
> Infrastructure). But regarding the documentation about AMD's
> implementation, I haven't found anything yet.
>
> Please let me know if there are ones publicly accessible. If not, any
> suggestions on learning the implementation of specific system
> components, e.g., queue scheduling.
>
> Best,
> Mark
>
>



Re: Documentation about AMD's HSA implementation?

2018-02-13 Thread Felix Kuehling
There is also this: https://gpuopen.com/professional-compute/, which
gives pointers to several libraries and tools built on top of ROCm.


Another thing to keep in mind is that ROCm is diverging from the strict
HSA standard in some important ways. For example the HSA standard
includes HSAIL as an intermediate representation that gets finalized on
the target system, whereas ROCm compiles directly to native GPU ISA.


Regards,
  Felix


On 2018-02-13 09:40 AM, Deucher, Alexander wrote:
>
> The ROCm documentation is probably a good place to start:
>
> https://rocm.github.io/documentation.html
>
>
> Alex
>
> 
> *From:* amd-gfx  on behalf of
> Ming Yang 
> *Sent:* Tuesday, February 13, 2018 12:00 AM
> *To:* amd-gfx@lists.freedesktop.org
> *Subject:* Documentation about AMD's HSA implementation?
>  
> Hi,
>
> I'm interested in HSA and was excited when I found AMD's fully open-stack
> ROCm supporting it. Before digging into the code, I wonder if there's
> any documentation available about AMD's HSA implementation, either
> book, whitepaper, paper, or documentation.
>
> I did find helpful materials about HSA, including HSA standards on
> this page (http://www.hsafoundation.com/standards/) and a nice book
> about HSA (Heterogeneous System Architecture A New Compute Platform
> Infrastructure). But regarding the documentation about AMD's
> implementation, I haven't found anything yet.
>
> Please let me know if there are ones publicly accessible. If not, any
> suggestions on learning the implementation of specific system
> components, e.g., queue scheduling.
>
> Best,
> Mark
>
>



[PATCH 4/5] drm/amdgpu/cgs: add refresh rate checking to non-DC display code

2018-02-13 Thread Alex Deucher
Clamp the vblank period to 0 if the refresh rate is larger than
120 hz for non-DC.  This allows us to remove the refresh rate
checks from powerplay for mclk switching.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cgs.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cgs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cgs.c
index 71b4aec7f650..dc3360b16bda 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cgs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cgs.c
@@ -953,6 +953,11 @@ static int amdgpu_cgs_get_active_displays_info(struct cgs_device *cgs_device,
 							(amdgpu_crtc->v_border * 2);
 				mode_info->vblank_time_us = vblank_lines * line_time_us;
 				mode_info->refresh_rate = drm_mode_vrefresh(&amdgpu_crtc->hw_mode);
+				/* we have issues with mclk switching with refresh rates
+				 * over 120 hz on the non-DC code.
+				 */
+				if (mode_info->refresh_rate > 120)
+					mode_info->vblank_time_us = 0;
 				mode_info = NULL;
 			}
 		}
-- 
2.13.6
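
How the clamp takes effect downstream is worth spelling out: powerplay
compares the reported vblank time against the time needed to switch memory
clocks, roughly as in this sketch (the real check is smu7_vblank_too_short();
switch_us stands in for the ASIC-dependent latency):

static bool vblank_too_short_sketch(u32 vblank_time_us, u32 switch_us)
{
        /* A clamped vblank_time_us of 0 is always "too short", so high
         * refresh rates keep mclk switching disabled, as before, without
         * a separate refresh-rate check (see patch 5/5). */
        return vblank_time_us < switch_us;
}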



[PATCH 2/5] drm/amd/powerplay/vega10: allow mclk switching with no displays

2018-02-13 Thread Alex Deucher
If there are no displays attached, there is no reason to disable
mclk switching.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c
index 6a153ad1b942..1d442a498bf6 100644
--- a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c
+++ b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c
@@ -3171,10 +3171,13 @@ static int vega10_apply_state_adjust_rules(struct pp_hwmgr *hwmgr,
 		PP_CAP(PHM_PlatformCaps_DisableMclkSwitchForVR);
 	force_mclk_high = PP_CAP(PHM_PlatformCaps_ForceMclkHigh);
 
-	disable_mclk_switching = (info.display_count > 1) ||
-				    disable_mclk_switching_for_frame_lock ||
-				    disable_mclk_switching_for_vr ||
-				    force_mclk_high;
+	if (info.display_count == 0)
+		disable_mclk_switching = false;
+	else
+		disable_mclk_switching = (info.display_count > 1) ||
+			disable_mclk_switching_for_frame_lock ||
+			disable_mclk_switching_for_vr ||
+			force_mclk_high;
 
 	sclk = vega10_ps->performance_levels[0].gfx_clock;
 	mclk = vega10_ps->performance_levels[0].mem_clock;
-- 
2.13.6



[PATCH 5/5] drm/amdgpu/powerplay/smu7: drop refresh rate checks for mclk switching

2018-02-13 Thread Alex Deucher
The logic has moved to cgs.  mclk switching with DC at higher refresh
rates should work.

Signed-off-by: Alex Deucher 
Cc: Harry Wentland 
---
 drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c b/drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c
index 84600ff6f4de..0202841ae639 100644
--- a/drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c
+++ b/drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c
@@ -2909,8 +2909,7 @@ static int smu7_apply_state_adjust_rules(struct pp_hwmgr *hwmgr,
 	else
 		disable_mclk_switching = ((1 < info.display_count) ||
 					  disable_mclk_switching_for_frame_lock ||
-					  smu7_vblank_too_short(hwmgr, mode_info.vblank_time_us) ||
-					  (mode_info.refresh_rate > 120));
+					  smu7_vblank_too_short(hwmgr, mode_info.vblank_time_us));
 
 	sclk = smu7_ps->performance_levels[0].engine_clock;
 	mclk = smu7_ps->performance_levels[0].memory_clock;
-- 
2.13.6



[PATCH 1/5] drm/amd/powerplay: use PP_CAP macro for disable_mclk_switching_for_frame_lock

2018-02-13 Thread Alex Deucher
Rather than open coding it.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c
index 03a93b0eff38..6a153ad1b942 100644
--- a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c
+++ b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c
@@ -3165,10 +3165,10 @@ static int vega10_apply_state_adjust_rules(struct pp_hwmgr *hwmgr,
 		minimum_clocks.memoryClock = stable_pstate_mclk;
 	}
 
-	disable_mclk_switching_for_frame_lock = phm_cap_enabled(
-			hwmgr->platform_descriptor.platformCaps,
-			PHM_PlatformCaps_DisableMclkSwitchingForFrameLock);
-	disable_mclk_switching_for_vr = PP_CAP(PHM_PlatformCaps_DisableMclkSwitchForVR);
+	disable_mclk_switching_for_frame_lock =
+		PP_CAP(PHM_PlatformCaps_DisableMclkSwitchingForFrameLock);
+	disable_mclk_switching_for_vr =
+		PP_CAP(PHM_PlatformCaps_DisableMclkSwitchForVR);
 	force_mclk_high = PP_CAP(PHM_PlatformCaps_ForceMclkHigh);
 
 	disable_mclk_switching = (info.display_count > 1) ||
-- 
2.13.6



[PATCH 3/5] drm/amd/powerplay/smu7: allow mclk switching with no displays

2018-02-13 Thread Alex Deucher
If there are no displays attached, there is no reason to disable
mclk switching.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c b/drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c
index 937971361b65..84600ff6f4de 100644
--- a/drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c
+++ b/drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c
@@ -2904,10 +2904,13 @@ static int smu7_apply_state_adjust_rules(struct pp_hwmgr *hwmgr,
 			PHM_PlatformCaps_DisableMclkSwitchingForFrameLock);
 
 
-	disable_mclk_switching = ((1 < info.display_count) ||
-				  disable_mclk_switching_for_frame_lock ||
-				  smu7_vblank_too_short(hwmgr, mode_info.vblank_time_us) ||
-				  (mode_info.refresh_rate > 120));
+	if (info.display_count == 0)
+		disable_mclk_switching = false;
+	else
+		disable_mclk_switching = ((1 < info.display_count) ||
+					  disable_mclk_switching_for_frame_lock ||
+					  smu7_vblank_too_short(hwmgr, mode_info.vblank_time_us) ||
+					  (mode_info.refresh_rate > 120));
 
 	sclk = smu7_ps->performance_levels[0].engine_clock;
 	mclk = smu7_ps->performance_levels[0].memory_clock;
-- 
2.13.6



Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Felix Kuehling
On 2018-02-13 01:45 PM, Christian König wrote:
> On 2018-02-13 05:56 PM, Felix Kuehling wrote:
>> [SNIP]
>> Each process gets a whole page of the doorbell aperture assigned to it.
>> The assumption is that amdgpu only uses the first page of the doorbell
>> aperture, so KFD uses all the rest. On GFX8 and before, the queue ID is
>> used as the offset into the doorbell page. On GFX9 the hardware does
>> some engine-specific doorbell routing, so we added another layer of
>> doorbell management that's decoupled from the queue ID.
>>
>> Either way, an entire doorbell page gets mapped into user mode and user
>> mode knows the offset of the doorbells for specific queues. The mapping
>> is currently handled by kfd_mmap in kfd_chardev.c.
>
> Ok, wait a second. Taking a look at kfd_doorbell_mmap() it almost
> looks like you map different doorbells with the same offset depending
> on which process is calling this.
>
> Is that correct? If yes then that would be illegal and a problem if
> I'm not completely mistaken.

Why is that a problem? Each process has its own file descriptor. The
mapping is done using io_remap_pfn_range in kfd_doorbell_mmap. This is
nothing new. It's been done like this forever even on Kaveri and Carrizo.

>
>>> Do you simply assume that after evicting a process it always needs to
>>> be restarted without checking if it actually does something? Or how
>>> does that work?
>> Exactly.
>
> Ok, understood. Well that limits the usefulness of the whole eviction
> drastically.
>
>> With later addition of GPU self-dispatch a page-fault based
>> mechanism wouldn't work any more. We have to restart the queues blindly
>> with a timer. See evict_process_worker, which schedules the restore with
>> a delayed worker.
>> The user mode queue ABI specifies that user mode updates both the
>> doorbell and a WPTR in memory. When we restart queues we (or the CP
>> firmware) use the WPTR to make sure we catch up with any work that was
>> submitted while the queues were unmapped.
>
> Putting cross process work dispatch aside for a moment GPU
> self-dispatch works only when there is work on the GPU running.
>
> So you can still check if there are some work pending after you
> unmapped everything and only restart the queues when there is new work
> based on the page fault.
>
> In other words either there is work pending and it doesn't matter if
> it was sent by the GPU or by the CPU or there is no work pending and
> we can delay restarting everything until there is.

That sounds like a useful optimization.

Regards,
  Felix

>
> Regards,
> Christian.
>
>>
>> Regards,
>>    Felix
>>
>>> Regards,
>>> Christian.
>

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Felix Kuehling
On 2018-02-13 01:15 PM, Christian König wrote:
> On 13.02.2018 at 18:18, Felix Kuehling wrote:
>> On 2018-02-13 12:06 PM, Christian König wrote:
>>> [SNIP]
>>> Ah, yeah that is also a point I wanted to talk about with you.
>>>
>>> The approach of using the same buffer object with multiple amdgpu
>>> devices doesn't work in general.
>>>
>>> We need a separate TTM object for each BO in each device or otherwise we
>>> break A+A laptops.
>> I think it broke for VRAM BOs because we enabled P2P on systems that
>> didn't support it properly. But at least system memory BOs can be shared
>> quite well between devices and we do it all the time.
>
> Sharing VRAM BOs is one issue, but the problem goes deeper than just
> that.
>
> Starting with Carrizo we can scanout from system memory to avoid the
> extra copy on A+A laptops. For this to work we need the BO mapped to
> GART (and I mean a real VMID0 mapping, not just in the GTT domain).
> And for this to work in turn we need a TTM object per device and not a
> global one.

I still don't understand. I think what you're talking about applies only
to BOs used for scan-out. Every BO is allocated from a specific device
and can only be GART-mapped on that device. What we do is map the same
BO in VMs on other devices. It has no effect on GART mappings.

>
>> I don't see how you can have separate TTM objects referring to the
>> same memory.
>
> Well that is trivial, we do this all the time with prime and I+A laptops.

As I understand it, you use DMABuf to export/import buffers on multiple
devices. I believe all devices share a single amdgpu_bo, which contains
the ttm_buffer_object. The only way you can have a new TTM buffer object
per device is by using SG tables and pinning the BOs. But I think we
want to avoid pinning BOs.

What we do does not involve pinning of BOs, even when they're shared
between multiple devices' VMs.
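
For readers following along, the prime import path being contrasted here
gives the importing device its own BO backed by an sg_table from the
exporter. A rough sketch, with error handling omitted and
amdgpu_bo_create_from_sg as a hypothetical stand-in for the driver's
import helper:

	/* Sketch: dma-buf import creates a second, per-device BO whose
	 * pages are described by an sg_table, instead of sharing one TTM
	 * object across devices. */
	struct drm_gem_object *prime_import_sketch(struct drm_device *dev,
						   struct dma_buf *dma_buf)
	{
		struct dma_buf_attachment *attach;
		struct sg_table *sgt;
		struct amdgpu_bo *bo;

		attach = dma_buf_attach(dma_buf, dev->dev);
		sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);

		/* hypothetical helper: new BO on this device, backed by sgt */
		bo = amdgpu_bo_create_from_sg(dev, dma_buf->size, sgt);
		return &bo->gem_base;
	}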

>
>>> That is also the reason we had to disable this feature again in the
>>> hybrid branches.
>> What you disabled on hybrid branches was P2P, which only affects
>> large-BAR systems. Sharing of system memory BOs between devices still
>> works fine.
>
> No, it doesn't. It completely breaks any scanout on Carrizo, Stoney
> and Raven. In addition to that we found that it breaks some aspects of
> the user space interface.

Let me check that with my current patch series. The patches I submitted
here shouldn't include anything that breaks the use cases you describe.
But I'm quite sure it will support sharing BOs between multiple devices'
VMs.

Regards,
  Felix

>
> So end result is that we probably need to revert it and find a
> different solution. I'm already working on this for a couple of weeks
> now and should have something ready after I'm done with the PASID
> handling.
>
> Regards,
> Christian.

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH xf86-video-amdgpu 2/2] Don't call AMDGPUFreeRec from AMDGPUPreInit_KMS

2018-02-13 Thread Alex Deucher
On Tue, Feb 13, 2018 at 1:12 PM, Michel Dänzer  wrote:
> From: Michel Dänzer 
>
> If the latter fails, Xorg will call AMDGPUFreeScreen_KMS, which calls
> the former.
>
> Signed-off-by: Michel Dänzer 

Series is:
Reviewed-by: Alex Deucher 

> ---
>  src/amdgpu_kms.c | 22 +-
>  1 file changed, 9 insertions(+), 13 deletions(-)
>
> diff --git a/src/amdgpu_kms.c b/src/amdgpu_kms.c
> index 43c18d426..f5874d3af 100644
> --- a/src/amdgpu_kms.c
> +++ b/src/amdgpu_kms.c
> @@ -1352,7 +1352,7 @@ Bool AMDGPUPreInit_KMS(ScrnInfoPtr pScrn, int flags)
> && info->pEnt->location.type != BUS_PLATFORM
>  #endif
> )
> -   goto fail;
> +   return FALSE;
>
> pPriv = xf86GetEntityPrivate(pScrn->entityList[0],
>  getAMDGPUEntityIndex());
> @@ -1375,23 +1375,23 @@ Bool AMDGPUPreInit_KMS(ScrnInfoPtr pScrn, int flags)
> pScrn->monitor = pScrn->confScreen->monitor;
>
> if (!AMDGPUPreInitVisual(pScrn))
> -   goto fail;
> +   return FALSE;
>
> xf86CollectOptions(pScrn, NULL);
> if (!(info->Options = malloc(sizeof(AMDGPUOptions_KMS))))
> -   goto fail;
> +   return FALSE;
>
> memcpy(info->Options, AMDGPUOptions_KMS, sizeof(AMDGPUOptions_KMS));
> xf86ProcessOptions(pScrn->scrnIndex, pScrn->options, info->Options);
>
> if (!AMDGPUPreInitWeight(pScrn))
> -   goto fail;
> +   return FALSE;
>
> memset(&gpu_info, 0, sizeof(gpu_info));
> amdgpu_query_gpu_info(pAMDGPUEnt->pDev, &gpu_info);
>
> if (!AMDGPUPreInitChipType_KMS(pScrn, &gpu_info))
> -   goto fail;
> +   return FALSE;
>
> info->dri2.available = FALSE;
> info->dri2.enabled = FALSE;
> @@ -1399,7 +1399,7 @@ Bool AMDGPUPreInit_KMS(ScrnInfoPtr pScrn, int flags)
> if (info->dri2.pKernelDRMVersion == NULL) {
> xf86DrvMsg(pScrn->scrnIndex, X_ERROR,
>"AMDGPUDRIGetVersion failed to get the DRM 
> version\n");
> -   goto fail;
> +   return FALSE;
> }
>
> /* Get ScreenInit function */
> @@ -1407,7 +1407,7 @@ Bool AMDGPUPreInit_KMS(ScrnInfoPtr pScrn, int flags)
> return FALSE;
>
> if (!AMDGPUPreInitAccel_KMS(pScrn))
> -   goto fail;
> +   return FALSE;
>
> amdgpu_drm_queue_init();
>
> @@ -1466,7 +1466,7 @@ Bool AMDGPUPreInit_KMS(ScrnInfoPtr pScrn, int flags)
> if (drmmode_pre_init(pScrn, &info->drmmode, pScrn->bitsPerPixel / 8) ==
> FALSE) {
> xf86DrvMsg(pScrn->scrnIndex, X_ERROR,
>"Kernel modesetting setup failed\n");
> -   goto fail;
> +   return FALSE;
> }
>
> AMDGPUSetupCapabilities(pScrn);
> @@ -1518,14 +1518,10 @@ Bool AMDGPUPreInit_KMS(ScrnInfoPtr pScrn, int flags)
>  #endif
> ) {
> xf86DrvMsg(pScrn->scrnIndex, X_ERROR, "No modes.\n");
> -   goto fail;
> +   return FALSE;
> }
>
> return TRUE;
> -fail:
> -   AMDGPUFreeRec(pScrn);
> -   return FALSE;
> -
>  }
>
>  static Bool AMDGPUCursorInit_KMS(ScreenPtr pScreen)
> --
> 2.16.1
>
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH xf86-video-amdgpu 4/4] Always use screen depth/bpp for KMS framebuffers

2018-02-13 Thread Alex Deucher
On Tue, Feb 13, 2018 at 12:53 PM, Michel Dänzer  wrote:
> From: Michel Dänzer 
>
> DRI clients can use depth 32 pixmaps while the screen is depth 24, in
> which case page flipping would fail.
>
> Reported-by: Mario Kleiner 
> (Ported from radeon commit 733f606dd6ca8350e6e7f0858bfff5454ddc98ed)
>
> Signed-off-by: Michel Dänzer 

Series is:
Reviewed-by: Alex Deucher 

> ---
>  src/amdgpu_pixmap.h   | 13 ++---
>  src/drmmode_display.c |  5 ++---
>  2 files changed, 8 insertions(+), 10 deletions(-)
>
> diff --git a/src/amdgpu_pixmap.h b/src/amdgpu_pixmap.h
> index eded17037..d744ca573 100644
> --- a/src/amdgpu_pixmap.h
> +++ b/src/amdgpu_pixmap.h
> @@ -104,8 +104,8 @@ static inline struct amdgpu_buffer *amdgpu_get_pixmap_bo(PixmapPtr pPix)
>  }
>
>  static inline struct drmmode_fb*
> -amdgpu_fb_create(int drm_fd, uint32_t width, uint32_t height, uint8_t depth,
> -uint8_t bpp, uint32_t pitch, uint32_t handle)
> +amdgpu_fb_create(ScrnInfoPtr scrn, int drm_fd, uint32_t width, uint32_t height,
> +uint32_t pitch, uint32_t handle)
>  {
> struct drmmode_fb *fb  = malloc(sizeof(*fb));
>
> @@ -113,8 +113,8 @@ amdgpu_fb_create(int drm_fd, uint32_t width, uint32_t height, uint8_t depth,
> return NULL;
>
> fb->refcnt = 1;
> -   if (drmModeAddFB(drm_fd, width, height, depth, bpp, pitch, handle,
> -&fb->handle) == 0)
> +   if (drmModeAddFB(drm_fd, width, height, scrn->depth, scrn->bitsPerPixel,
> +pitch, handle, &fb->handle) == 0)
> return fb;
>
> free(fb);
> @@ -154,9 +154,8 @@ amdgpu_pixmap_get_fb(PixmapPtr pix)
> ScrnInfoPtr scrn = xf86ScreenToScrn(pix->drawable.pScreen);
> AMDGPUEntPtr pAMDGPUEnt = AMDGPUEntPriv(scrn);
>
> -   *fb_ptr = amdgpu_fb_create(pAMDGPUEnt->fd, pix->drawable.width,
> -  pix->drawable.height, pix->drawable.depth,
> -  pix->drawable.bitsPerPixel, pix->devKind,
> +   *fb_ptr = amdgpu_fb_create(scrn, pAMDGPUEnt->fd, pix->drawable.width,
> +  pix->drawable.height, pix->devKind,
>    handle);
> }
> }
> diff --git a/src/drmmode_display.c b/src/drmmode_display.c
> index e52b7a355..8ccbf735d 100644
> --- a/src/drmmode_display.c
> +++ b/src/drmmode_display.c
> @@ -885,9 +885,8 @@ drmmode_set_mode_major(xf86CrtcPtr crtc, DisplayModePtr mode,
> union gbm_bo_handle bo_handle;
>
> bo_handle = gbm_bo_get_handle(info->front_buffer->bo.gbm);
> -   fb = amdgpu_fb_create(pAMDGPUEnt->fd, pScrn->virtualX,
> - pScrn->virtualY, pScrn->depth,
> - pScrn->bitsPerPixel,
> +   fb = amdgpu_fb_create(pScrn, pAMDGPUEnt->fd,
> + pScrn->virtualX, pScrn->virtualY,
>   pScrn->displayWidth * info->pixel_bytes,
>   bo_handle.u32);
> /* Prevent refcnt of ad-hoc FBs from reaching 2 */
> --
> 2.16.1
>
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Christian König

On 13.02.2018 at 17:56, Felix Kuehling wrote:

[SNIP]
Each process gets a whole page of the doorbell aperture assigned to it.
The assumption is that amdgpu only uses the first page of the doorbell
aperture, so KFD uses all the rest. On GFX8 and before, the queue ID is
used as the offset into the doorbell page. On GFX9 the hardware does
some engine-specific doorbell routing, so we added another layer of
doorbell management that's decoupled from the queue ID.

Either way, an entire doorbell page gets mapped into user mode and user
mode knows the offset of the doorbells for specific queues. The mapping
is currently handled by kfd_mmap in kfd_chardev.c.


Ok, wait a second. Taking a look at kfd_doorbell_mmap() it almost looks 
like you map different doorbells with the same offset depending on which 
process is calling this.


Is that correct? If yes then that would be illegal and a problem if I'm 
not completely mistaken.



Do you simply assume that after evicting a process it always needs to
be restarted without checking if it actually does something? Or how
does that work?

Exactly.


Ok, understood. Well that limits the usefulness of the whole eviction 
drastically.



With later addition of GPU self-dispatch a page-fault based
mechanism wouldn't work any more. We have to restart the queues blindly
with a timer. See evict_process_worker, which schedules the restore with
a delayed worker.
The user mode queue ABI specifies that user mode updates both the
doorbell and a WPTR in memory. When we restart queues we (or the CP
firmware) use the WPTR to make sure we catch up with any work that was
submitted while the queues were unmapped.


Putting cross process work dispatch aside for a moment GPU self-dispatch 
works only when there is work on the GPU running.


So you can still check if there are some work pending after you unmapped 
everything and only restart the queues when there is new work based on 
the page fault.


In other words either there is work pending and it doesn't matter if it 
was sent by the GPU or by the CPU or there is no work pending and we can 
delay restarting everything until there is.


Regards,
Christian.



Regards,
   Felix


Regards,
Christian.


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Christian König

On 13.02.2018 at 18:18, Felix Kuehling wrote:

On 2018-02-13 12:06 PM, Christian König wrote:

[SNIP]
Ah, yeah that is also a point I wanted to talk about with you.

The approach of using the same buffer object with multiple amdgpu
devices doesn't work in general.

We need a separate TTM object for each BO in each device or otherwise we
break A+A laptops.

I think it broke for VRAM BOs because we enabled P2P on systems that
didn't support it properly. But at least system memory BOs can be shared
quite well between devices and we do it all the time.


Sharing VRAM BOs is one issue, but the problem goes deeper than just that.

Starting with Carrizo we can scanout from system memory to avoid the 
extra copy on A+A laptops. For this to work we need the BO mapped to 
GART (and I mean a real VMID0 mapping, not just in the GTT domain). And 
for this to work in turn we need a TTM object per device and not a 
global one.



I don't see how you can have separate TTM objects referring to the same memory.


Well that is trivial, we do this all the time with prime and I+A laptops.


That is also the reason we had to disable this feature again in the
hybrid branches.

What you disabled on hybrid branches was P2P, which only affects
large-BAR systems. Sharing of system memory BOs between devices still
works fine.


No, it doesn't. It completely breaks any scanout on Carrizo, Stoney and 
Raven. In addition to that we found that it breaks some aspects of the 
user space interface.


So end result is that we probably need to revert it and find a different 
solution. I'm already working on this for a couple of weeks now and 
should have something ready after I'm done with the PASID handling.


Regards,
Christian.
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH xf86-video-amdgpu 2/2] Don't call AMDGPUFreeRec from AMDGPUPreInit_KMS

2018-02-13 Thread Michel Dänzer
From: Michel Dänzer 

If the latter fails, Xorg will call AMDGPUFreeScreen_KMS, which calls
the former.

Signed-off-by: Michel Dänzer 
---
 src/amdgpu_kms.c | 22 +-
 1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/src/amdgpu_kms.c b/src/amdgpu_kms.c
index 43c18d426..f5874d3af 100644
--- a/src/amdgpu_kms.c
+++ b/src/amdgpu_kms.c
@@ -1352,7 +1352,7 @@ Bool AMDGPUPreInit_KMS(ScrnInfoPtr pScrn, int flags)
&& info->pEnt->location.type != BUS_PLATFORM
 #endif
)
-   goto fail;
+   return FALSE;
 
pPriv = xf86GetEntityPrivate(pScrn->entityList[0],
 getAMDGPUEntityIndex());
@@ -1375,23 +1375,23 @@ Bool AMDGPUPreInit_KMS(ScrnInfoPtr pScrn, int flags)
pScrn->monitor = pScrn->confScreen->monitor;
 
if (!AMDGPUPreInitVisual(pScrn))
-   goto fail;
+   return FALSE;
 
xf86CollectOptions(pScrn, NULL);
if (!(info->Options = malloc(sizeof(AMDGPUOptions_KMS))))
-   goto fail;
+   return FALSE;
 
memcpy(info->Options, AMDGPUOptions_KMS, sizeof(AMDGPUOptions_KMS));
xf86ProcessOptions(pScrn->scrnIndex, pScrn->options, info->Options);
 
if (!AMDGPUPreInitWeight(pScrn))
-   goto fail;
+   return FALSE;
 
memset(&gpu_info, 0, sizeof(gpu_info));
amdgpu_query_gpu_info(pAMDGPUEnt->pDev, &gpu_info);
 
if (!AMDGPUPreInitChipType_KMS(pScrn, &gpu_info))
-   goto fail;
+   return FALSE;
 
info->dri2.available = FALSE;
info->dri2.enabled = FALSE;
@@ -1399,7 +1399,7 @@ Bool AMDGPUPreInit_KMS(ScrnInfoPtr pScrn, int flags)
if (info->dri2.pKernelDRMVersion == NULL) {
xf86DrvMsg(pScrn->scrnIndex, X_ERROR,
   "AMDGPUDRIGetVersion failed to get the DRM 
version\n");
-   goto fail;
+   return FALSE;
}
 
/* Get ScreenInit function */
@@ -1407,7 +1407,7 @@ Bool AMDGPUPreInit_KMS(ScrnInfoPtr pScrn, int flags)
return FALSE;
 
if (!AMDGPUPreInitAccel_KMS(pScrn))
-   goto fail;
+   return FALSE;
 
amdgpu_drm_queue_init();
 
@@ -1466,7 +1466,7 @@ Bool AMDGPUPreInit_KMS(ScrnInfoPtr pScrn, int flags)
if (drmmode_pre_init(pScrn, &info->drmmode, pScrn->bitsPerPixel / 8) ==
FALSE) {
xf86DrvMsg(pScrn->scrnIndex, X_ERROR,
   "Kernel modesetting setup failed\n");
-   goto fail;
+   return FALSE;
}
 
AMDGPUSetupCapabilities(pScrn);
@@ -1518,14 +1518,10 @@ Bool AMDGPUPreInit_KMS(ScrnInfoPtr pScrn, int flags)
 #endif
) {
xf86DrvMsg(pScrn->scrnIndex, X_ERROR, "No modes.\n");
-   goto fail;
+   return FALSE;
}
 
return TRUE;
-fail:
-   AMDGPUFreeRec(pScrn);
-   return FALSE;
-
 }
 
 static Bool AMDGPUCursorInit_KMS(ScreenPtr pScreen)
-- 
2.16.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH xf86-video-amdgpu 1/2] Guard against pAMDGPUEnt == NULL in AMDGPUFreeRec

2018-02-13 Thread Michel Dänzer
From: Michel Dänzer 

This can happen if PreInit fails early.

Signed-off-by: Michel Dänzer 
---
 src/amdgpu_kms.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/amdgpu_kms.c b/src/amdgpu_kms.c
index 7dc9e22a9..43c18d426 100644
--- a/src/amdgpu_kms.c
+++ b/src/amdgpu_kms.c
@@ -137,7 +137,7 @@ static void AMDGPUFreeRec(ScrnInfoPtr pScrn)
 
pPriv = xf86GetEntityPrivate(pEnt->index, gAMDGPUEntityIndex);
pAMDGPUEnt = pPriv->ptr;
-   if (pAMDGPUEnt->fd > 0) {
+   if (pAMDGPUEnt && pAMDGPUEnt->fd > 0) {
DevUnion *pPriv;
AMDGPUEntPtr pAMDGPUEnt;
pPriv = xf86GetEntityPrivate(pScrn->entityList[0],
-- 
2.16.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH xf86-video-amdgpu 4/4] Always use screen depth/bpp for KMS framebuffers

2018-02-13 Thread Michel Dänzer
From: Michel Dänzer 

DRI clients can use depth 32 pixmaps while the screen is depth 24, in
which case page flipping would fail.

Reported-by: Mario Kleiner 
(Ported from radeon commit 733f606dd6ca8350e6e7f0858bfff5454ddc98ed)

Signed-off-by: Michel Dänzer 
---
 src/amdgpu_pixmap.h   | 13 ++---
 src/drmmode_display.c |  5 ++---
 2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/src/amdgpu_pixmap.h b/src/amdgpu_pixmap.h
index eded17037..d744ca573 100644
--- a/src/amdgpu_pixmap.h
+++ b/src/amdgpu_pixmap.h
@@ -104,8 +104,8 @@ static inline struct amdgpu_buffer *amdgpu_get_pixmap_bo(PixmapPtr pPix)
 }
 
 static inline struct drmmode_fb*
-amdgpu_fb_create(int drm_fd, uint32_t width, uint32_t height, uint8_t depth,
-uint8_t bpp, uint32_t pitch, uint32_t handle)
+amdgpu_fb_create(ScrnInfoPtr scrn, int drm_fd, uint32_t width, uint32_t height,
+uint32_t pitch, uint32_t handle)
 {
struct drmmode_fb *fb  = malloc(sizeof(*fb));
 
@@ -113,8 +113,8 @@ amdgpu_fb_create(int drm_fd, uint32_t width, uint32_t height, uint8_t depth,
return NULL;
 
fb->refcnt = 1;
-   if (drmModeAddFB(drm_fd, width, height, depth, bpp, pitch, handle,
-&fb->handle) == 0)
+   if (drmModeAddFB(drm_fd, width, height, scrn->depth, scrn->bitsPerPixel,
+pitch, handle, &fb->handle) == 0)
return fb;
 
free(fb);
@@ -154,9 +154,8 @@ amdgpu_pixmap_get_fb(PixmapPtr pix)
ScrnInfoPtr scrn = xf86ScreenToScrn(pix->drawable.pScreen);
AMDGPUEntPtr pAMDGPUEnt = AMDGPUEntPriv(scrn);
 
-   *fb_ptr = amdgpu_fb_create(pAMDGPUEnt->fd, pix->drawable.width,
-  pix->drawable.height, pix->drawable.depth,
-  pix->drawable.bitsPerPixel, pix->devKind,
+   *fb_ptr = amdgpu_fb_create(scrn, pAMDGPUEnt->fd, pix->drawable.width,
+  pix->drawable.height, pix->devKind,
   handle);
}
}
diff --git a/src/drmmode_display.c b/src/drmmode_display.c
index e52b7a355..8ccbf735d 100644
--- a/src/drmmode_display.c
+++ b/src/drmmode_display.c
@@ -885,9 +885,8 @@ drmmode_set_mode_major(xf86CrtcPtr crtc, DisplayModePtr mode,
union gbm_bo_handle bo_handle;
 
bo_handle = gbm_bo_get_handle(info->front_buffer->bo.gbm);
-   fb = amdgpu_fb_create(pAMDGPUEnt->fd, pScrn->virtualX,
- pScrn->virtualY, pScrn->depth,
- pScrn->bitsPerPixel,
+   fb = amdgpu_fb_create(pScrn, pAMDGPUEnt->fd,
+ pScrn->virtualX, pScrn->virtualY,
  pScrn->displayWidth * info->pixel_bytes,
  bo_handle.u32);
/* Prevent refcnt of ad-hoc FBs from reaching 2 */
-- 
2.16.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH xf86-video-amdgpu 1/4] Define per x-screen individual drmmode_crtc_funcs

2018-02-13 Thread Michel Dänzer
From: Mario Kleiner 

This allows en-/disabling some functions depending on individual screen
settings.

Prep work for more efficient depth 30 support.

Suggested-by: Michel Dänzer 
Signed-off-by: Mario Kleiner 
(Ported from radeon commit 21f6753462464acfd3c452393328c977a375ce26)
Signed-off-by: Michel Dänzer 
---
 src/amdgpu_drv.h  |  1 +
 src/drmmode_display.c | 14 ++
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/src/amdgpu_drv.h b/src/amdgpu_drv.h
index f60b192da..40157262e 100644
--- a/src/amdgpu_drv.h
+++ b/src/amdgpu_drv.h
@@ -340,6 +340,7 @@ typedef struct {
SetSharedPixmapBackingProcPtr SavedSetSharedPixmapBacking;
} glamor;
 
+   xf86CrtcFuncsRec drmmode_crtc_funcs;
 } AMDGPUInfoRec, *AMDGPUInfoPtr;
 
 
diff --git a/src/drmmode_display.c b/src/drmmode_display.c
index 047179449..be3deef20 100644
--- a/src/drmmode_display.c
+++ b/src/drmmode_display.c
@@ -1344,8 +1344,9 @@ drmmode_crtc_init(ScrnInfoPtr pScrn, drmmode_ptr drmmode, drmModeResPtr mode_res
xf86CrtcPtr crtc;
drmmode_crtc_private_ptr drmmode_crtc;
AMDGPUEntPtr pAMDGPUEnt = AMDGPUEntPriv(pScrn);
+   AMDGPUInfoPtr info = AMDGPUPTR(pScrn);
 
-   crtc = xf86CrtcCreate(pScrn, &drmmode_crtc_funcs);
+   crtc = xf86CrtcCreate(pScrn, &info->drmmode_crtc_funcs);
if (crtc == NULL)
return 0;
 
@@ -2348,11 +2349,16 @@ Bool drmmode_pre_init(ScrnInfoPtr pScrn, drmmode_ptr drmmode, int cpp)
xf86DrvMsgVerb(pScrn->scrnIndex, X_INFO, AMDGPU_LOGLEVEL_DEBUG,
   "%d crtcs needed for screen.\n", crtcs_needed);
 
+   /* Need per-screen drmmode_crtc_funcs, based on our global template,
+* so we can disable some functions, depending on screen settings.
+*/
+   info->drmmode_crtc_funcs = drmmode_crtc_funcs;
+
if (!info->use_glamor) {
/* Rotation requires hardware acceleration */
-   drmmode_crtc_funcs.shadow_allocate = NULL;
-   drmmode_crtc_funcs.shadow_create = NULL;
-   drmmode_crtc_funcs.shadow_destroy = NULL;
+   info->drmmode_crtc_funcs.shadow_allocate = NULL;
+   info->drmmode_crtc_funcs.shadow_create = NULL;
+   info->drmmode_crtc_funcs.shadow_destroy = NULL;
}
 
for (i = 0; i < mode_res->count_crtcs; i++)
-- 
2.16.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH xf86-video-amdgpu 3/4] Add 30bit RGB color format support

2018-02-13 Thread Michel Dänzer
From: Hawking Zhang 

Signed-off-by: Hawking Zhang 

[ Michel Dänzer:
* Require Xorg >= 1.19.99.1 for depth 30, otherwise it can't work with glamor
* Update manpage, per radeon commit
  574bfab4bf1fcd95163a8f33cea2889189429d30 ]

Signed-off-by: Michel Dänzer 
---
 man/amdgpu.man | 2 +-
 src/amdgpu_bo_helper.c | 2 ++
 src/amdgpu_dri2.c  | 1 +
 src/amdgpu_kms.c   | 8 
 4 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/man/amdgpu.man b/man/amdgpu.man
index 0d9dd93d9..390f4837b 100644
--- a/man/amdgpu.man
+++ b/man/amdgpu.man
@@ -18,7 +18,7 @@ following features:
 .PD 0
 .TP 2
 \(bu
-Support for 24-bit pixel depth;
+Support for 24- and 30-bit pixel depths;
 .TP
 \(bu
 RandR support up to version 1.4;
diff --git a/src/amdgpu_bo_helper.c b/src/amdgpu_bo_helper.c
index ba2212228..34880ff12 100644
--- a/src/amdgpu_bo_helper.c
+++ b/src/amdgpu_bo_helper.c
@@ -42,6 +42,8 @@ amdgpu_get_gbm_format(int depth, int bitsPerPixel)
return GBM_FORMAT_RGB565;
case 32:
return GBM_FORMAT_ARGB8888;
+   case 30:
+   return GBM_FORMAT_XRGB2101010;
case 24:
if (bitsPerPixel == 32)
return GBM_FORMAT_XRGB8888;
diff --git a/src/amdgpu_dri2.c b/src/amdgpu_dri2.c
index 4ffa34676..4a0f8bf24 100644
--- a/src/amdgpu_dri2.c
+++ b/src/amdgpu_dri2.c
@@ -120,6 +120,7 @@ amdgpu_dri2_create_buffer2(ScreenPtr pScreen,
cpp = 2;
break;
case 24:
+   case 30:
cpp = 4;
break;
default:
diff --git a/src/amdgpu_kms.c b/src/amdgpu_kms.c
index 49044f514..7dc9e22a9 100644
--- a/src/amdgpu_kms.c
+++ b/src/amdgpu_kms.c
@@ -1085,6 +1085,14 @@ static Bool AMDGPUPreInitVisual(ScrnInfoPtr pScrn)
case 24:
break;
 
+   case 30:
+   if (xorgGetVersion() < XORG_VERSION_NUMERIC(1,19,99,1,0)) {
+   xf86DrvMsg(pScrn->scrnIndex, X_ERROR,
+  "Depth 30 requires Xorg >= 1.19.99.1\n");
+   return FALSE;
+   }
+   break;
+
default:
xf86DrvMsg(pScrn->scrnIndex, X_ERROR,
   "Given depth (%d) is not supported by %s driver\n",
-- 
2.16.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH xf86-video-amdgpu 2/4] Disable gamma set when deep color

2018-02-13 Thread Michel Dänzer
From: Qiang Yu 

Gamma set is disabled in the kernel driver when deep color is in use.
Enabling it would confuse the user.

Signed-off-by: Qiang Yu 

[ Michel Dänzer: Align drmmode_pre_init change with radeon commit
  1f1d4b1fa7d4b22dd8553f7e71251bf17ca7a7b1 ]

Signed-off-by: Michel Dänzer 
---
 src/drmmode_display.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/src/drmmode_display.c b/src/drmmode_display.c
index be3deef20..e52b7a355 100644
--- a/src/drmmode_display.c
+++ b/src/drmmode_display.c
@@ -858,8 +858,11 @@ drmmode_set_mode_major(xf86CrtcPtr crtc, DisplayModePtr mode,
if (drmmode_crtc->tear_free)
scanout_id = drmmode_crtc->scanout_id;
 
-   drmmode_crtc_gamma_do_set(crtc, crtc->gamma_red, crtc->gamma_green,
- crtc->gamma_blue, crtc->gamma_size);
+   /* gamma is disabled in kernel driver for deep color */
+   if (pScrn->depth != 30)
+   drmmode_crtc_gamma_do_set(
+   crtc, crtc->gamma_red, crtc->gamma_green,
+   crtc->gamma_blue, crtc->gamma_size);
 
if (drmmode_crtc->prime_scanout_pixmap) {
drmmode_crtc_prime_scanout_update(crtc, mode, scanout_id,
@@ -2361,6 +2364,12 @@ Bool drmmode_pre_init(ScrnInfoPtr pScrn, drmmode_ptr drmmode, int cpp)
info->drmmode_crtc_funcs.shadow_destroy = NULL;
}
 
+   /* Hw gamma lut's are currently bypassed by the hw at color depth 30,
+* so spare the server the effort to compute and update the cluts.
+*/
+   if (pScrn->depth == 30)
+   info->drmmode_crtc_funcs.gamma_set = NULL;
+
for (i = 0; i < mode_res->count_crtcs; i++)
if (!xf86IsEntityShared(pScrn->entityList[0]) ||
(crtcs_needed && !(pAMDGPUEnt->assigned_crtcs & (1 << i
@@ -2591,8 +2600,9 @@ Bool drmmode_setup_colormap(ScreenPtr pScreen, ScrnInfoPtr pScrn)
   "Initializing kms color map\n");
if (!miCreateDefColormap(pScreen))
return FALSE;
-   /* all amdgpus support 10 bit CLUTs */
-   if (!xf86HandleColormaps(pScreen, 256, 10,
+   /* All radeons support 10 bit CLUTs. They get bypassed at depth 30. */
+   if (pScrn->depth != 30 &&
+   !xf86HandleColormaps(pScreen, 256, 10,
 NULL, NULL,
 CMAP_PALETTED_TRUECOLOR
#if 0  /* This option messes up text mode! (e...@suse.de) */
-- 
2.16.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Felix Kuehling
On 2018-02-13 12:06 PM, Christian König wrote:
> On 13.02.2018 at 17:42, Felix Kuehling wrote:
>> On 2018-02-13 05:25 AM, Christian König wrote:
>>> On 13.02.2018 at 00:23, Felix Kuehling wrote:
 On 2018-02-12 11:57 AM, Felix Kuehling wrote:
>>> I just thought of a slightly different approach I would consider
>>> more
>>> realistic, without having thought through all the details: Adding
>>> KFD-specific memory management ioctls to the amdgpu device.
>>> Basically
>>> call amdgpu_amdkfd_gpuvm functions from amdgpu ioctl functions
>>> instead
>>> of KFD ioctl functions. But we'd still have KFD ioctls for other
>>> things,
>>> and the new amdgpu ioctls would be KFD-specific and not useful for
>>> graphics applications. It's probably still several weeks of
>>> work, but
>>> shouldn't have major teething issues because the internal logic and
>>> functionality would be basically unchanged. It would just move the
>>> ioctls from one device to another.
>> My thinking went into a similar direction. But instead of
>> exposing the
>> KFD IOCTLs through the DRM FD, I would let the KFD import a DRM FD.
>>
>> And then use the DRM FD in the KFD for things like the buffer
>> provider
>> of a device, e.g. no separate IDR for BOs in the KFD but rather a
>> reference to the DRM FD.
> I'm not sure I understand this. With "DRM FD" I think you mean the
> device file descriptor. Then have KFD call file operation on the
> FD to
> call AMDGPU ioctls? Is there any precedent for this that I can
> look at
> for an example?
 OK, I think this could work for finding a VM from a DRM file
 descriptor:

   struct file *filp = fget(fd);
   struct drm_file *drm_priv = filp->private_data;
   struct amdgpu_fpriv *drv_priv = drm_priv->driver_priv;
   struct amdgpu_vm *vm = &drv_priv->vm;

 That would let us use the DRM VM instead of creating our own and would
 avoid wasting a perfectly good VM that gets created by opening the DRM
 device. We'd need a new ioctl to import VMs into KFD. But that's about
 as far as I would take it.

 We'd still need to add KFD-specific data to the VM. If we use an
 existing VM, we can't wrap it in our own structure any more. We'd need
 to come up with a different solution or extend struct amdgpu_vm with
 what we need.
>>> Well feel free to extend the VM structure with a pointer for KFD data.
>> Some more thoughts about that: Currently the lifetime of the VM is tied
>> to the file descriptor. If we import it into KFD, we either have to make
>> struct amdgpu_vm reference counted, or we have to keep a reference to
>> the file descriptor in KFD just to keep the VM alive until we drop our
>> reference to it.
>>
 And we still need our own BO creation ioctl. Importing DMABufs adds
 extra overhead (more system calls, file descriptors, GEM objects) and
 doesn't cover some of our use cases (userptr). We also need our own
 idr
 for managing buffer handles that are used with all our memory and VM
 management functions.
>>> Yeah, well that is the second part of that idea.
>>>
>>> When you have the drm_file for the VM, you can also use it to call
>>> drm_gem_object_lookup().
>> KFD only has one device, so we need a buffer ID that's globally unique,
>> not per-device. Or we'd need to identify buffers by a pair of a GEM
>> handle and a device ID. Our BO handles are 64-bit, so we could pack both
>> into a single handle.
>
> Ah, yeah that is also a point I wanted to talk about with you.
>
> The approach of using the same buffer object with multiple amdgpu
> devices doesn't work in general.
>
> We need a separate TTM object for each BO in each device or otherwise we
> break A+A laptops.

I think it broke for VRAM BOs because we enabled P2P on systems that
didn't support it properly. But at least system memory BOs can be shared
quite well between devices and we do it all the time. I don't see how
you can have separate TTM objects referring to the same memory.

>
> That is also the reason we had to disable this feature again in the
> hybrid branches.

What you disabled on hybrid branches was P2P, which only affects
large-BAR systems. Sharing of system memory BOs between devices still
works fine.

>
> So we need BO handles per device here or some other solution.
>
>> Either way, that still requires a GEM object, which adds not just a
>> minor bit of overhead. It includes a struct file * which points to a
>> shared memory file that gets created for each BO in drm_gem_object_init.
>
> HUI WHAT? Why the heck do we still do this?
>
> Sorry something went wrong here and yes that shmem file is completely
> superfluous even for the gfx side.
>
>>> And when you need to iterate over all the BOs you can just use the
>>> object_idr or the VM structure as well.
>> object_idr only gives us BOs allocated from a particular device. 

Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Christian König

On 13.02.2018 at 17:42, Felix Kuehling wrote:

On 2018-02-13 05:25 AM, Christian König wrote:

On 13.02.2018 at 00:23, Felix Kuehling wrote:

On 2018-02-12 11:57 AM, Felix Kuehling wrote:

I just thought of a slightly different approach I would consider more
realistic, without having thought through all the details: Adding
KFD-specific memory management ioctls to the amdgpu device. Basically
call amdgpu_amdkfd_gpuvm functions from amdgpu ioctl functions
instead
of KFD ioctl functions. But we'd still have KFD ioctls for other
things,
and the new amdgpu ioctls would be KFD-specific and not useful for
graphics applications. It's probably still several weeks of work, but
shouldn't have major teething issues because the internal logic and
functionality would be basically unchanged. It would just move the
ioctls from one device to another.

My thinking went into a similar direction. But instead of exposing the
KFD IOCTLs through the DRM FD, I would let the KFD import a DRM FD.

And then use the DRM FD in the KFD for things like the buffer provider
of a device, e.g. no separate IDR for BOs in the KFD but rather a
reference to the DRM FD.

I'm not sure I understand this. With "DRM FD" I think you mean the
device file descriptor. Then have KFD call file operation on the FD to
call AMDGPU ioctls? Is there any precedent for this that I can look at
for an example?

OK, I think this could work for finding a VM from a DRM file descriptor:

  struct file *filp = fget(fd);
  struct drm_file *drm_priv = filp->private_data;
  struct amdgpu_fpriv *drv_priv = drm_priv->driver_priv;
  struct amdgpu_vm *vm = &drv_priv->vm;

That would let us use the DRM VM instead of creating our own and would
avoid wasting a perfectly good VM that gets created by opening the DRM
device. We'd need a new ioctl to import VMs into KFD. But that's about
as far as I would take it.

We'd still need to add KFD-specific data to the VM. If we use an
existing VM, we can't wrap it in our own structure any more. We'd need
to come up with a different solution or extend struct amdgpu_vm with
what we need.

Well feel free to extend the VM structure with a pointer for KFD data.

Some more thoughts about that: Currently the lifetime of the VM is tied
to the file descriptor. If we import it into KFD, we either have to make
struct amdgpu_vm reference counted, or we have to keep a reference to
the file descriptor in KFD just to keep the VM alive until we drop our
reference to it.


And we still need our own BO creation ioctl. Importing DMABufs adds
extra overhead (more system calls, file descriptors, GEM objects) and
doesn't cover some of our use cases (userptr). We also need our own idr
for managing buffer handles that are used with all our memory and VM
management functions.

Yeah, well that is the second part of that idea.

When you have the drm_file for the VM, you can also use it to call
drm_gem_object_lookup().

KFD only has one device, so we need a buffer ID that's globally unique,
not per-device. Or we'd need to identify buffers by a pair of a GEM
handle and a device ID. Our BO handles are 64-bit, so we could pack both
into a single handle.


Ah, yeah that is also a point I wanted to talk about with you.

The approach of using the same buffer object with multiple amdgpu 
devices doesn't work in general.


We need a separate TTM object for each BO in each device or otherwise we 
break A+A laptops.


That is also the reason we had to disable this feature again in the 
hybrid branches.


So we need BO handles per device here or some other solution.


Either way, that still requires a GEM object, which adds not just a
minor bit of overhead. It includes a struct file * which points to a
shared memory file that gets created for each BO in drm_gem_object_init.


HUI WHAT? Why the heck do we still do this?

Sorry something went wrong here and yes that shmem file is completely 
superfluous even for the gfx side.



And when you need to iterate over all the BOs you can just use the
object_idr or the VM structure as well.

object_idr only gives us BOs allocated from a particular device. The VM
structure has various BO lists, but they only contain BOs that were
added to the VM. The same BO can be added multiple times or to multiple
VMs, or it may not be in any VM at all.


Ok, yeah that probably isn't sufficient for the BO handling like you 
need it for eviction.



In later patch series we also need to track additional information for
KFD buffers in KFD itself. At that point we change the KFD IDR to point
to our own structure, which in turn points to the BO. For example our
BOs get their VA assigned at allocation time, and we later do reverse
lookups from the address to the BO using an interval tree.


You can do this with the standard VM structure as well, that is needed 
for UVD and VCE anyway. See amdgpu_vm_bo_lookup_mapping.


Regards,
Christian.
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org

Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Felix Kuehling

On 2018-02-13 05:46 AM, Christian König wrote:
> On 12.02.2018 at 17:57, Felix Kuehling wrote:
>>> [SNIP]
>>> I see that as an advantage rather than a problem, cause it fixes a
>>> couple of problems with the KFD where the address space of the inode
>>> is not managed correctly as far as I can see.
>> I don't think GEM is involved in the management of address space. That's
>> handled inside TTM and drm_vma_manager, both of which we are using. We
>> have tested CPU mappings with evictions and this is working correctly
>> now.
>
> Ok, correct. I was thinking about that as one large functionality.
> When you see it like this GEM basically only implements looking up BOs
> and reference counting them, but both are still useful features.
>
>> We had problems with this before, when we were CPU-mapping our buffers
>> through the KFD device FD.
>>
>>> That implementation issues never caused problems right now because you
>>> never tried to unmap doorbells. But with the new eviction code that is
>>> about to change, isn't it?
>> I don't understand this comment. What does this have to do with
>> doorbells?
>
> I was assuming that doorbells are now unmapped to figure out when to
> start a process again.

Restart is done based on a timer.

>
> See what I'm missing is how does userspace figures out which address
> to use to map the doorbell? E.g. I don't see a call to
> drm_vma_offset_manager_init() or something like that.

Each process gets a whole page of the doorbell aperture assigned to it.
The assumption is that amdgpu only uses the first page of the doorbell
aperture, so KFD uses all the rest. On GFX8 and before, the queue ID is
used as the offset into the doorbell page. On GFX9 the hardware does
some engine-specific doorbell routing, so we added another layer of
doorbell management that's decoupled from the queue ID.

Either way, an entire doorbell page gets mapped into user mode and user
mode knows the offset of the doorbells for specific queues. The mapping
is currently handled by kfd_mmap in kfd_chardev.c. A later patch will
add the ability to map doorbells into GPUVM address space so that GPUs
can dispatch work to themselves (or each other) by creating a special
doorbell BO that can be mapped both to the GPU and the CPU. It works by
creating an amdgpu_bo with an SG describing the doorbell page of the
process.
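
To make that concrete, the per-process mapping boils down to roughly the
following sketch (simplified; the doorbell_index field and the helper
name are illustrative, not the exact kfd_doorbell.c ones):

	/* Sketch: map this process's doorbell page to user space. The user
	 * offset is the same for every process, but the physical page is
	 * selected by a per-process doorbell index. */
	static int doorbell_mmap_sketch(struct kfd_dev *dev,
					struct kfd_process *process,
					struct vm_area_struct *vma)
	{
		phys_addr_t address;

		if (vma->vm_end - vma->vm_start > PAGE_SIZE)
			return -EINVAL;

		address = dev->doorbell_base + process->doorbell_index * PAGE_SIZE;

		vma->vm_flags |= VM_IO | VM_DONTCOPY | VM_DONTEXPAND | VM_NORESERVE;
		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

		return io_remap_pfn_range(vma, vma->vm_start,
					  address >> PAGE_SHIFT,
					  vma->vm_end - vma->vm_start,
					  vma->vm_page_prot);
	}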

>
>> Doorbells are never unmapped. When queues are evicted doorbells remain
>> mapped to the process. User mode can continue writing to doorbells,
>> though they won't have any immediate effect while the corresponding
>> queues are unmapped from the HQDs.
>
> Do you simply assume that after evicting a process it always needs to
> be restarted without checking if it actually does something? Or how
> does that work?

Exactly. With later addition of GPU self-dispatch a page-fault based
mechanism wouldn't work any more. We have to restart the queues blindly
with a timer. See evict_process_worker, which schedules the restore with
a delayed worker.

The user mode queue ABI specifies that user mode updates both the
doorbell and a WPTR in memory. When we restart queues we (or the CP
firmware) use the WPTR to make sure we catch up with any work that was
submitted while the queues were unmapped.
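
The user mode half of that ABI is essentially the following (illustrative
sketch, not the actual Thunk code; the barrier is reduced to a comment):

	/* Sketch of the user mode queue ABI: the WPTR shadow in memory is
	 * what lets the CP firmware catch up after a blind restart, even
	 * if doorbell writes happened while the queue was unmapped. */
	struct user_queue {
		uint64_t *ring_buf;             /* ring buffer of packets */
		uint64_t ring_size;             /* entries in ring_buf */
		volatile uint64_t *wptr_shadow; /* WPTR copy in memory */
		volatile uint64_t *doorbell;    /* mapped doorbell */
		uint64_t wptr;
	};

	static void submit_packet(struct user_queue *q, uint64_t pkt)
	{
		q->ring_buf[q->wptr % q->ring_size] = pkt;
		q->wptr++;

		*q->wptr_shadow = q->wptr; /* 1. update the WPTR shadow */
		/* write barrier here in real code */
		*q->doorbell = q->wptr;    /* 2. then ring the doorbell */
	}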

Regards,
  Felix

>
> Regards,
> Christian.

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Felix Kuehling

On 2018-02-13 05:25 AM, Christian König wrote:
> On 13.02.2018 at 00:23, Felix Kuehling wrote:
>> On 2018-02-12 11:57 AM, Felix Kuehling wrote:
> I just thought of a slightly different approach I would consider more
> realistic, without having thought through all the details: Adding
> KFD-specific memory management ioctls to the amdgpu device. Basically
> call amdgpu_amdkfd_gpuvm functions from amdgpu ioctl functions
> instead
> of KFD ioctl functions. But we'd still have KFD ioctls for other
> things,
> and the new amdgpu ioctls would be KFD-specific and not useful for
> graphics applications. It's probably still several weeks of work, but
> shouldn't have major teething issues because the internal logic and
> functionality would be basically unchanged. It would just move the
> ioctls from one device to another.
 My thinking went into a similar direction. But instead of exposing the
 KFD IOCTLs through the DRM FD, I would let the KFD import a DRM FD.

 And then use the DRM FD in the KFD for things like the buffer provider
 of a device, e.g. no separate IDR for BOs in the KFD but rather a
 reference to the DRM FD.
>>> I'm not sure I understand this. With "DRM FD" I think you mean the
>>> device file descriptor. Then have KFD call file operation on the FD to
>>> call AMDGPU ioctls? Is there any precedent for this that I can look at
>>> for an example?
>> OK, I think this could work for finding a VM from a DRM file descriptor:
>>
>>  struct file *filp = fget(fd);
>>  struct drm_file *drm_priv = filp->private_data;
>>  struct amdgpu_fpriv *drv_priv = drm_priv->driver_priv;
>>  struct amdgpu_vm *vm = &drv_priv->vm;
>>
>> That would let us use the DRM VM instead of creating our own and would
>> avoid wasting a perfectly good VM that gets created by opening the DRM
>> device. We'd need a new ioctl to import VMs into KFD. But that's about
>> as far as I would take it.
>>
>> We'd still need to add KFD-specific data to the VM. If we use an
>> existing VM, we can't wrap it in our own structure any more. We'd need
>> to come up with a different solution or extend struct amdgpu_vm with
>> what we need.
>
> Well feel free to extend the VM structure with a pointer for KFD data.

Some more thoughts about that: Currently the lifetime of the VM is tied
to the file descriptor. If we import it into KFD, we either have to make
struct amdgpu_vm reference counted, or we have to keep a reference to
the file descriptor in KFD just to keep the VM alive until we drop our
reference to it.
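
A sketch of the second option, keeping the file reference from KFD
(struct kfd_vm_ref is a made-up name for wherever KFD would stash it):

	/* Sketch: pin the DRM file, and thereby the VM, for as long as KFD
	 * uses it; fput() may drop the last reference and destroy the VM. */
	struct kfd_vm_ref {
		struct file *drm_file; /* holds a reference via fget(fd) */
		struct amdgpu_vm *vm;  /* borrowed from drm_file's driver_priv */
	};

	static void kfd_put_vm(struct kfd_vm_ref *ref)
	{
		ref->vm = NULL;
		fput(ref->drm_file);
	}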

>
>> And we still need our own BO creation ioctl. Importing DMABufs adds
>> extra overhead (more system calls, file descriptors, GEM objects) and
>> doesn't cover some of our use cases (userptr). We also need our own idr
>> for managing buffer handles that are used with all our memory and VM
>> management functions.
>
> Yeah, well that is the second part of that idea.
>
> When you have the drm_file for the VM, you can also use it to call
> drm_gem_object_lookup().

KFD only has one device, so we need a buffer ID that's globally unique,
not per-device. Or we'd need to identify buffers by a pair of a GEM
handle and a device ID. Our BO handles are 64-bit, so we could pack both
into a single handle.
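
Packing that pair into the existing 64-bit handle could look like this
(sketch; the 32/32 split is an assumption):

	/* Sketch: 64-bit KFD buffer handle = (gpu_id << 32) | gem_handle */
	static inline uint64_t kfd_make_handle(uint32_t gpu_id, uint32_t gem_handle)
	{
		return ((uint64_t)gpu_id << 32) | gem_handle;
	}

	static inline uint32_t kfd_handle_gpu_id(uint64_t handle)
	{
		return handle >> 32;
	}

	static inline uint32_t kfd_handle_gem(uint64_t handle)
	{
		return (uint32_t)handle;
	}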

Either way, that still requires a GEM object, which adds not just a
minor bit of overhead. It includes a struct file * which points to a
shared memory file that gets created for each BO in drm_gem_object_init.
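
For context, that shmem file comes straight out of drm_gem_object_init(),
which is roughly:

	/* Roughly what drm_gem_object_init() does: set up the object, then
	 * attach a shmem file as swap-backed storage. */
	int drm_gem_object_init(struct drm_device *dev,
				struct drm_gem_object *obj, size_t size)
	{
		struct file *filp;

		drm_gem_private_object_init(dev, obj, size);

		filp = shmem_file_setup("drm mm object", size, VM_NORESERVE);
		if (IS_ERR(filp))
			return PTR_ERR(filp);

		obj->filp = filp;
		return 0;
	}

	/* drm_gem_private_object_init() alone skips the shmem file, which
	 * is what a TTM-backed driver actually wants. */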

>
> And when you need to iterate over all the BOs you can just use the
> object_idr or the VM structure as well.

object_idr only gives us BOs allocated from a particular device. The VM
structure has various BO lists, but they only contain BOs that were
added to the VM. The same BO can be added multiple times or to multiple
VMs, or it may not be in any VM at all.

In later patch series we also need to track additional information for
KFD buffers in KFD itself. At that point we change the KFD IDR to point
to our own structure, which in turn points to the BO. For example our
BOs get their VA assigned at allocation time, and we later do reverse
lookups from the address to the BO using an interval tree.
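
A minimal sketch of that reverse lookup with the kernel's generic
interval tree (struct kfd_bo is a made-up wrapper name here):

	#include <linux/interval_tree.h>

	/* Sketch: map a GPU VA back to the buffer covering it. */
	struct kfd_bo {
		struct interval_tree_node it; /* it.start/it.last = VA range */
		struct amdgpu_bo *bo;
	};

	static struct kfd_bo *kfd_bo_from_va(struct rb_root_cached *root,
					     unsigned long va)
	{
		struct interval_tree_node *node;

		node = interval_tree_iter_first(root, va, va);
		return node ? container_of(node, struct kfd_bo, it) : NULL;
	}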

Regards,
  Felix

>
> Regards,
> Christian.
>
>>
>> Regards,
>>    Felix
>>
>>
>>> Or do you mean a DMABuf FD exported by GEM? After importing the buffer
>>> and closing the FD, we still need a way to refer to the buffer for
>>> managing the GPU mappings and for releasing the reference. So we still
>>> need handles that are currently provided by our own IDR. I think we'll
>>> still need that.
>>>
>>> Finally, the kfd_ioctl_alloc_memory_of_gpu will later also be used for
>>> creating userptr BOs. Those can't be exported as DMABufs, so we still
>>> need our own ioctl for creating that type of BO. That was going to
>>> be my
>>> next patch series.
>>>
>>> Regards,
>>>    Felix
>>>
>
> ___
> amd-gfx mailing list
> 

Re: [PATCH 31/34] drm/amd/display: Add logging for aux DPCD access

2018-02-13 Thread Harry Wentland
On 2018-02-12 04:31 PM, Andrey Grodzovsky wrote:
> With this logger you should probably remove the Linux specific logger in 
> amdgpu_dm_mst_types.c, check log_dpcd function.
> 

This currently only logs the first byte of the response. Once that's fixed 
you're right, we should rip out the one from amdgpu_dm_mst_types.c.
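
For the technical-debt item of printing the whole block, something along
these lines in read_command()/write_command() would do it (untested
sketch; assumes request->payload.length holds the transfer size):

	/* Sketch: hex-dump up to 16 bytes of the payload instead of data[0] */
	char buf[3 * 16 + 1] = "";
	uint32_t i, n = request->payload.length < 16 ? request->payload.length : 16;

	for (i = 0; i < n; i++)
		snprintf(buf + 3 * i, 4, "%02x ", request->payload.data[i]);

	dm_logger_write(engine->base.ctx->logger, LOG_I2C_AUX,
			"READ: addr:0x%x values:[%s] Result:%d",
			request->payload.address, buf, ctx.operation_succeeded);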

Harry

> Andrey
> 
> 
> On 02/12/2018 12:16 PM, Harry Wentland wrote:
>> From: Eric Yang 
>>
>> Add basic logging for DPCD access. Does not print
>> by default.
>>
>> Currently only prints first byte of the data accessed.
>>
>> Technical debt: Need to make it so that the entire
>> data block accessed is printed. Also need to log
>> address space that's not DPCD.
>>
>> Change-Id: I10ef7042c14d70508845ef827ebec2432d8d8176
>> Signed-off-by: Eric Yang 
>> Reviewed-by: Tony Cheng 
>> Acked-by: Harry Wentland 
>> ---
>>   drivers/gpu/drm/amd/display/dc/i2caux/aux_engine.c | 16 
>>   1 file changed, 16 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/display/dc/i2caux/aux_engine.c 
>> b/drivers/gpu/drm/amd/display/dc/i2caux/aux_engine.c
>> index fc7a7d4ebca5..0b1db48fef36 100644
>> --- a/drivers/gpu/drm/amd/display/dc/i2caux/aux_engine.c
>> +++ b/drivers/gpu/drm/amd/display/dc/i2caux/aux_engine.c
>> @@ -284,6 +284,14 @@ static bool read_command(
>>   msleep(engine->delay);
>>   } while (ctx.operation_succeeded && !ctx.transaction_complete);
>>   +    if (request->payload.address_space ==
>> +    I2CAUX_TRANSACTION_ADDRESS_SPACE_DPCD) {
>> +    dm_logger_write(engine->base.ctx->logger, LOG_I2C_AUX, "READ: addr:0x%x  value:0x%x Result:%d",
>> +    request->payload.address,
>> +    request->payload.data[0],
>> +    ctx.operation_succeeded);
>> +    }
>> +
>>   return ctx.operation_succeeded;
>>   }
>>   @@ -484,6 +492,14 @@ static bool write_command(
>>   msleep(engine->delay);
>>   } while (ctx.operation_succeeded && !ctx.transaction_complete);
>>   +    if (request->payload.address_space ==
>> +    I2CAUX_TRANSACTION_ADDRESS_SPACE_DPCD) {
>> +    dm_logger_write(engine->base.ctx->logger, LOG_I2C_AUX, "WRITE: addr:0x%x  value:0x%x Result:%d",
>> +    request->payload.address,
>> +    request->payload.data[0],
>> +    ctx.operation_succeeded);
>> +    }
>> +
>>   return ctx.operation_succeeded;
>>   }
>>   
> 
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Documentation about AMD's HSA implementation?

2018-02-13 Thread Deucher, Alexander
The ROCm documentation is probably a good place to start:

https://rocm.github.io/documentation.html

Alex


From: amd-gfx  on behalf of Ming Yang 

Sent: Tuesday, February 13, 2018 12:00 AM
To: amd-gfx@lists.freedesktop.org
Subject: Documentation about AMD's HSA implementation?

Hi,

I'm interested in HSA and excited when I found AMD's fully open-stack ROCm 
supporting it. Before digging into the code, I wonder if there's any 
documentation available about AMD's HSA implementation, either book, 
whitepaper, paper, or documentation.

I did find helpful materials about HSA, including HSA standards on this page 
(http://www.hsafoundation.com/standards/) and a nice book about HSA 
(Heterogeneous System Architecture A New Compute Platform Infrastructure). But 
regarding the documentation about AMD's implementation, I haven't found 
anything yet.

Please let me know if there are any publicly accessible. If not, any 
suggestions on learning the implementation of specific system components, e.g., 
queue scheduling, would be appreciated.

Best,
Mark
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 1/2] drm/amdgpu: fix and cleanup UVD IB generation

2018-02-13 Thread Leo Liu

Reviewed-by: Leo Liu 


On 02/07/2018 02:48 PM, Christian König wrote:

We didn't sync the BO after validating it. Also start to use
amdgpu_bo_create_reserved to simplify things.

Signed-off-by: Christian König 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c | 106 
  1 file changed, 38 insertions(+), 68 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
index 7cdbe0c14496..9cd5517a4fa9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
@@ -952,37 +952,28 @@ int amdgpu_uvd_ring_parse_cs(struct amdgpu_cs_parser 
*parser, uint32_t ib_idx)
  static int amdgpu_uvd_send_msg(struct amdgpu_ring *ring, struct amdgpu_bo *bo,
   bool direct, struct dma_fence **fence)
  {
-   struct ttm_operation_ctx ctx = { true, false };
-   struct ttm_validate_buffer tv;
-   struct ww_acquire_ctx ticket;
-   struct list_head head;
+   struct amdgpu_device *adev = ring->adev;
+   struct dma_fence *f = NULL;
struct amdgpu_job *job;
struct amdgpu_ib *ib;
-   struct dma_fence *f = NULL;
-   struct amdgpu_device *adev = ring->adev;
-   uint64_t addr;
uint32_t data[4];
-   int i, r;
-
-   memset(&tv, 0, sizeof(tv));
-   tv.bo = &bo->tbo;
-
-   INIT_LIST_HEAD(&head);
-   list_add(&tv.head, &head);
+   uint64_t addr;
+   long r;
+   int i;
  
-	r = ttm_eu_reserve_buffers(&ticket, &head, true, NULL);

-   if (r)
-   return r;
+   amdgpu_bo_kunmap(bo);
+   amdgpu_bo_unpin(bo);
  
  	if (!ring->adev->uvd.address_64_bit) {

+   struct ttm_operation_ctx ctx = { true, false };
+
amdgpu_ttm_placement_from_domain(bo, AMDGPU_GEM_DOMAIN_VRAM);
amdgpu_uvd_force_into_uvd_segment(bo);
+   r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
+   if (r)
+   goto err;
}
  
-	r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);

-   if (r)
-   goto err;
-
	r = amdgpu_job_alloc_with_ib(adev, 64, &job);
if (r)
goto err;
@@ -1014,6 +1005,14 @@ static int amdgpu_uvd_send_msg(struct amdgpu_ring *ring, 
struct amdgpu_bo *bo,
ib->length_dw = 16;
  
  	if (direct) {

+   r = reservation_object_wait_timeout_rcu(bo->tbo.resv,
+   true, false,
+   msecs_to_jiffies(10));
+   if (r == 0)
+   r = -ETIMEDOUT;
+   if (r < 0)
+   goto err_free;
+
+   r = amdgpu_ib_schedule(ring, 1, ib, NULL, &f);
job->fence = dma_fence_get(f);
if (r)
@@ -1021,17 +1020,23 @@ static int amdgpu_uvd_send_msg(struct amdgpu_ring 
*ring, struct amdgpu_bo *bo,
  
  		amdgpu_job_free(job);

} else {
+   r = amdgpu_sync_resv(adev, &job->sync, bo->tbo.resv,
+AMDGPU_FENCE_OWNER_UNDEFINED, false);
+   if (r)
+   goto err_free;
+
	r = amdgpu_job_submit(job, ring, &adev->uvd.entity,
  AMDGPU_FENCE_OWNER_UNDEFINED, &f);
if (r)
goto err_free;
}
  
-	ttm_eu_fence_buffer_objects(&ticket, &head, f);

+   amdgpu_bo_fence(bo, f, false);
+   amdgpu_bo_unreserve(bo);
+   amdgpu_bo_unref(&bo);
  
  	if (fence)

*fence = dma_fence_get(f);
-   amdgpu_bo_unref(&bo);
dma_fence_put(f);
  
  	return 0;

@@ -1040,7 +1045,8 @@ static int amdgpu_uvd_send_msg(struct amdgpu_ring *ring, 
struct amdgpu_bo *bo,
amdgpu_job_free(job);
  
  err:

-   ttm_eu_backoff_reservation(&ticket, &head);
+   amdgpu_bo_unreserve(bo);
+   amdgpu_bo_unref(&bo);
return r;
  }
  
@@ -1051,31 +1057,16 @@ int amdgpu_uvd_get_create_msg(struct amdgpu_ring *ring, uint32_t handle,

  struct dma_fence **fence)
  {
struct amdgpu_device *adev = ring->adev;
-   struct amdgpu_bo *bo;
+   struct amdgpu_bo *bo = NULL;
uint32_t *msg;
int r, i;
  
-	r = amdgpu_bo_create(adev, 1024, PAGE_SIZE, true,

-AMDGPU_GEM_DOMAIN_VRAM,
-AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED |
-AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS,
-NULL, NULL, &bo);
+   r = amdgpu_bo_create_reserved(adev, 1024, PAGE_SIZE,
+ AMDGPU_GEM_DOMAIN_VRAM,
+ &bo, NULL, (void **)&msg);
if (r)
return r;
  
-	r = amdgpu_bo_reserve(bo, false);

-   if (r) {
-   amdgpu_bo_unref(&bo);
-   return r;
-   }
-
-   r = amdgpu_bo_kmap(bo, (void **)&msg);
-   if (r) {
-   amdgpu_bo_unreserve(bo);
-

Re: [PATCH 2/2] drm/amdgpu: cleanup VCN IB generation

2018-02-13 Thread Leo Liu



On 02/13/2018 08:57 AM, Christian König wrote:

It's always the obvious. Leo any more comments on this?

If not can we get your rb on this patch as well?


Tested-and-Reviewed-by: Leo Liu 




Andrey if Leo gives his ok can you commit both? I'm on vacation and 
don't want to mess with this at the moment.


Thanks,
Christian.

On 13.02.2018 at 14:50, Andrey Grodzovsky wrote:


Found the issue, amdgpu_vcn_dec_get_destroy_msg was missing "struct 
amdgpu_bo *bo = NULL;" and so amdgpu_bo_create_reserved would not 
call amdgpu_bo_create.


Attached updated patch.

Thanks,

Andrey


On 02/12/2018 02:46 PM, Andrey Grodzovsky wrote:
Tested with latest amd-staging-drm-next + VCN patch on top. VCN dec 
tests pass but when modprobing amdgpu I get a few warnings and 
BUG_ONs - log attached. UVD is not enabled so can't test the UVD patch.



Thanks,

Andrey


On 02/09/2018 07:31 AM, Christian König wrote:

Question is rather can somebody please test this on Raven?

Thanks,
Christian. 






___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx




___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 2/2] drm/amdgpu: cleanup VCN IB generation

2018-02-13 Thread Andrey Grodzovsky

Sure

Andrey


On 02/13/2018 08:57 AM, Christian König wrote:

It's always the obvious. Leo any more comments on this?

If not can we get your rb on this patch as well?

Andrey if Leo gives his ok can you commit both? I'm on vacation and 
don't want to mess with this at the moment.


Thanks,
Christian.

On 13.02.2018 at 14:50, Andrey Grodzovsky wrote:


Found the issue, amdgpu_vcn_dec_get_destroy_msg was missing "struct 
amdgpu_bo *bo = NULL;" and so amdgpu_bo_create_reserved would not 
call amdgpu_bo_create.


Attached updated patch.

Thanks,

Andrey


On 02/12/2018 02:46 PM, Andrey Grodzovsky wrote:
Tested with latest amd-staging-drm-next + VCN patch on top. VCN dec 
tests pass but when modprobing amdgpu I get a few warnings and 
BUG_ONs - log attached. UVD is not enabled so can't test the UVD patch.



Thanks,

Andrey


On 02/09/2018 07:31 AM, Christian König wrote:

Question is rather can somebody please test this on Raven?

Thanks,
Christian. 






___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx




___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 2/2] drm/amdgpu: cleanup VCN IB generation

2018-02-13 Thread Christian König

It's always the obvious. Leo any more comments on this?

If not can we get your rb on this patch as well?

Andrey if Leo gives his ok can you commit both? I'm on vacation and 
don't want to mess with this at the moment.


Thanks,
Christian.

On 13.02.2018 at 14:50, Andrey Grodzovsky wrote:


Found the issue, amdgpu_vcn_dec_get_destroy_msg was missing "struct 
amdgpu_bo *bo = NULL;" and so amdgpu_bo_create_reserved would not 
call amdgpu_bo_create.


Attached updated patch.

Thanks,

Andrey


On 02/12/2018 02:46 PM, Andrey Grodzovsky wrote:
Tested with latest amd-staging-drm-next + VCN patch on top. VCN dec 
tests pass but when modprobing amdgpu I get a few warnings and 
BUG_ONs - log attached. UVD is not enabled so can't test the UVD patch.



Thanks,

Andrey


On 02/09/2018 07:31 AM, Christian König wrote:

Question is rather can somebody please test this on Raven?

Thanks,
Christian. 






___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 2/2] drm/amdgpu: cleanup VCN IB generation

2018-02-13 Thread Andrey Grodzovsky
Found the issue, amdgpu_vcn_dec_get_destroy_msg was missing "struct 
amdgpu_bo *bo = NULL;" and so amdgpu_bo_create_reserved would not call 
amdgpu_bo_create.
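
A minimal sketch of the failure mode, assuming the 4.16-era semantics
where amdgpu_bo_create_reserved() only calls amdgpu_bo_create() when
*bo_ptr is NULL (the function name here is illustrative, not kernel code):

static int example_create_msg_bo(struct amdgpu_device *adev,
				 struct amdgpu_bo **bo_out,
				 uint32_t **msg_out)
{
	/* Without this NULL init, stack garbage in bo would be taken
	 * for an already-created BO and the allocation inside
	 * amdgpu_bo_create_reserved() would silently be skipped. */
	struct amdgpu_bo *bo = NULL;
	uint32_t *msg;
	int r;

	r = amdgpu_bo_create_reserved(adev, 1024, PAGE_SIZE,
				      AMDGPU_GEM_DOMAIN_VRAM,
				      &bo, NULL, (void **)&msg);
	if (r)
		return r;

	*bo_out = bo;
	*msg_out = msg;
	return 0;
}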


Attached updated patch.

Thanks,

Andrey


On 02/12/2018 02:46 PM, Andrey Grodzovsky wrote:
Tested with latest amd-staging-drm-next + VCN patch on top. VCN dec 
tests pass but when modprobing amdgpu I get a few warnings and BUG_ONs 
- log attached. UVD is not enabled so can't test the UVD patch.



Thanks,

Andrey


On 02/09/2018 07:31 AM, Christian König wrote:

Question is rather can somebody please test this on Raven?

Thanks,
Christian. 




From 03d3d0195b4e672cefb2851e66f0fe03bd9089a3 Mon Sep 17 00:00:00 2001
From: Christian König 
Date: Wed, 7 Feb 2018 20:48:22 +0100
Subject: drm/amdgpu: cleanup VCN IB generation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Start to use amdgpu_bo_create_reserved v2.

v2:
Fix missing pointer init to NULL.
Remove extra new lines.

Signed-off-by: Christian König 
Signed-off-by: Andrey Grodzovsky 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 91 ++++++++++++++++++++-----------------------------------------------------------------------
 1 file changed, 20 insertions(+), 71 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
index e86d0b2..6c5e25f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
@@ -270,43 +270,26 @@ int amdgpu_vcn_dec_ring_test_ring(struct amdgpu_ring *ring)
 	return r;
 }
 
-static int amdgpu_vcn_dec_send_msg(struct amdgpu_ring *ring, struct amdgpu_bo *bo,
-			   bool direct, struct dma_fence **fence)
+static int amdgpu_vcn_dec_send_msg(struct amdgpu_ring *ring,
+				   struct amdgpu_bo *bo, bool direct,
+				   struct dma_fence **fence)
 {
-	struct ttm_operation_ctx ctx = { true, false };
-	struct ttm_validate_buffer tv;
-	struct ww_acquire_ctx ticket;
-	struct list_head head;
+	struct amdgpu_device *adev = ring->adev;
+	struct dma_fence *f = NULL;
 	struct amdgpu_job *job;
 	struct amdgpu_ib *ib;
-	struct dma_fence *f = NULL;
-	struct amdgpu_device *adev = ring->adev;
 	uint64_t addr;
 	int i, r;
 
-	memset(&tv, 0, sizeof(tv));
-	tv.bo = &bo->tbo;
-
-	INIT_LIST_HEAD(&head);
-	list_add(&tv.head, &head);
-
-	r = ttm_eu_reserve_buffers(&ticket, &head, true, NULL);
-	if (r)
-		return r;
-
-	r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
-	if (r)
-		goto err;
-
 	r = amdgpu_job_alloc_with_ib(adev, 64, &job);
 	if (r)
 		goto err;
 
 	ib = &job->ibs[0];
 	addr = amdgpu_bo_gpu_offset(bo);
-	ib->ptr[0] = PACKET0(SOC15_REG_OFFSET(UVD, 0, mmUVD_GPCOM_VCPU_DATA0), 0);
+	ib->ptr[0] = PACKET0(SOC15_REG_OFFSET(UVD, 0, mmUVD_GPCOM_VCPU_DATA0),0);
 	ib->ptr[1] = addr;
-	ib->ptr[2] = PACKET0(SOC15_REG_OFFSET(UVD, 0, mmUVD_GPCOM_VCPU_DATA1), 0);
+	ib->ptr[2] = PACKET0(SOC15_REG_OFFSET(UVD, 0, mmUVD_GPCOM_VCPU_DATA1),0);
 	ib->ptr[3] = addr >> 32;
 	ib->ptr[4] = PACKET0(SOC15_REG_OFFSET(UVD, 0, mmUVD_GPCOM_VCPU_CMD), 0);
 	ib->ptr[5] = 0;
@@ -330,11 +313,12 @@ static int amdgpu_vcn_dec_send_msg(struct amdgpu_ring *ring, struct amdgpu_bo *b
 			goto err_free;
 	}
 
-	ttm_eu_fence_buffer_objects(&ticket, &head, f);
+	amdgpu_bo_fence(bo, f, false);
+	amdgpu_bo_unreserve(bo);
+	amdgpu_bo_unref(&bo);
 
 	if (fence)
 		*fence = dma_fence_get(f);
-	amdgpu_bo_unref(&bo);
 	dma_fence_put(f);
 
 	return 0;
@@ -343,7 +327,8 @@ static int amdgpu_vcn_dec_send_msg(struct amdgpu_ring *ring, struct amdgpu_bo *b
 	amdgpu_job_free(job);
 
 err:
-	ttm_eu_backoff_reservation(&ticket, &head);
+	amdgpu_bo_unreserve(bo);
+	amdgpu_bo_unref(&bo);
 	return r;
 }
 
@@ -351,31 +336,16 @@ static int amdgpu_vcn_dec_get_create_msg(struct amdgpu_ring *ring, uint32_t hand
 			  struct dma_fence **fence)
 {
 	struct amdgpu_device *adev = ring->adev;
-	struct amdgpu_bo *bo;
+	struct amdgpu_bo *bo = NULL;
 	uint32_t *msg;
 	int r, i;
 
-	r = amdgpu_bo_create(adev, 1024, PAGE_SIZE, true,
-			 AMDGPU_GEM_DOMAIN_VRAM,
-			 AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED |
-			 AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS,
-			 NULL, NULL, &bo);
+	r = amdgpu_bo_create_reserved(adev, 1024, PAGE_SIZE,
+				      AMDGPU_GEM_DOMAIN_VRAM,
+				      &bo, NULL, (void **)&msg);
 	if (r)
 		return r;
 
-	r = amdgpu_bo_reserve(bo, false);
-	if (r) {
-		amdgpu_bo_unref(&bo);
-		return r;
-	}
-
-	r = amdgpu_bo_kmap(bo, (void **)&msg);
-	if (r) {
-		amdgpu_bo_unreserve(bo);
-		amdgpu_bo_unref(&bo);
-		return r;
-	}
-
 	msg[0] = cpu_to_le32(0x00000028);
 	msg[1] = cpu_to_le32(0x00000038);
 	msg[2] = cpu_to_le32(0x00000001);
@@ -393,9 +363,6 @@ static int amdgpu_vcn_dec_get_create_msg(struct amdgpu_ring *ring, uint32_t hand
 	for (i = 14; i < 1024; ++i)
 		msg[i] = cpu_to_le32(0x0);
 
-	amdgpu_bo_kunmap(bo);
-	amdgpu_bo_unreserve(bo);
-
 	return amdgpu_vcn_dec_send_msg(ring, bo, true, fence);
 }
 
@@ -403,31 +370,16 @@ static int amdgpu_vcn_dec_get_destroy_msg(struct amdgpu_ring *ring, uint32_t han
 			   bool direct, struct dma_fence **fence)
 {
 	struct 

Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Christian König

Am 12.02.2018 um 17:57 schrieb Felix Kuehling:

[SNIP]
I see that as an advantage rather than a problem, because it fixes a
couple of problems with the KFD where the address space of the inode
is not managed correctly as far as I can see.

I don't think GEM is involved in the management of address space. That's
handled inside TTM and drm_vma_manager, both of which we are using. We
have tested CPU mappings with evictions and this is working correctly now.


Ok, correct. I was thinking of that as one large piece of functionality. 
Seen like this, GEM basically only implements looking up BOs and 
reference counting them, but both are still useful features.



We had problems with this before, when we were CPU-mapping our buffers
through the KFD device FD.


Those implementation issues never caused problems until now because you
never tried to unmap doorbells. But with the new eviction code that is
about to change, isn't it?

I don't understand this comment. What does this have to do with doorbells?


I was assuming that doorbells are now unmapped to figure out when to 
start a process again.


See, what I'm missing is how userspace figures out which address to 
use to map the doorbell. E.g. I don't see a call to 
drm_vma_offset_manager_init() or something like that.



Doorbells are never unmapped. When queues are evicted doorbells remain
mapped to the process. User mode can continue writing to doorbells,
though they won't have any immediate effect while the corresponding
queues are unmapped from the HQDs.
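
For reference, a minimal sketch of the kind of per-process doorbell mmap
handler being discussed; the pfn helper and function name are assumptions
for illustration, not the exact KFD code:

static int example_doorbell_mmap(struct kfd_process *process,
				 struct vm_area_struct *vma)
{
	/* Hypothetical helper: base pfn of this process's doorbell page. */
	unsigned long pfn = example_process_doorbell_pfn(process);

	if (vma->vm_end - vma->vm_start != PAGE_SIZE)
		return -EINVAL;

	/* Doorbells are MMIO, so the mapping must be uncached. */
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

	return io_remap_pfn_range(vma, vma->vm_start, pfn, PAGE_SIZE,
				  vma->vm_page_prot);
}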


Do you simply assume that after evicting a process it always needs to be 
restarted without checking if it actually does something? Or how does 
that work?


Regards,
Christian.
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Christian König

Am 13.02.2018 um 00:23 schrieb Felix Kuehling:

On 2018-02-12 11:57 AM, Felix Kuehling wrote:

I just thought of a slightly different approach I would consider more
realistic, without having thought through all the details: Adding
KFD-specific memory management ioctls to the amdgpu device. Basically
call amdgpu_amdkfd_gpuvm functions from amdgpu ioctl functions instead
of KFD ioctl functions. But we'd still have KFD ioctls for other things,
and the new amdgpu ioctls would be KFD-specific and not useful for
graphics applications. It's probably still several weeks of work, but
shouldn't have major teething issues because the internal logic and
functionality would be basically unchanged. It would just move the
ioctls from one device to another.

My thinking went in a similar direction. But instead of exposing the
KFD IOCTLs through the DRM FD, I would let the KFD import a DRM FD.

And then use the DRM FD in the KFD for things like the buffer provider
of a device, e.g. no separate IDR for BOs in the KFD but rather a
reference to the DRM FD.

I'm not sure I understand this. With "DRM FD" I think you mean the
device file descriptor. Then have KFD call file operations on the FD to
call AMDGPU ioctls? Is there any precedent for this that I can look at
for an example?

OK, I think this could work for finding a VM from a DRM file descriptor:

 struct file *filp = fget(fd);
 struct drm_file *drm_priv = filp->private_data;
 struct amdgpu_fpriv *drv_priv = drm_priv->driver_priv;
 struct amdgpu_vm *vm = &drv_priv->vm;

That would let us use the DRM VM instead of creating our own and would
avoid wasting a perfectly good VM that gets created by opening the DRM
device. We'd need a new ioctl to import VMs into KFD. But that's about
as far as I would take it.

We'd still need to add KFD-specific data to the VM. If we use an
existing VM, we can't wrap it in our own structure any more. We'd need
to come up with a different solution or extend struct amdgpu_vm with
what we need.


Well feel free to extend the VM structure with a pointer for KFD data.
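
Something like the following, where the names are placeholders rather
than an agreed interface:

/* Sketch only: hang KFD-private per-VM state off the amdgpu VM. */
struct kfd_vm_info;			/* opaque to amdgpu, owned by amdkfd */

struct amdgpu_vm_example {		/* stand-in for struct amdgpu_vm */
	/* ... existing amdgpu_vm members ... */

	/* NULL unless this VM has been imported into KFD */
	struct kfd_vm_info *kfd_data;
};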


And we still need our own BO creation ioctl. Importing DMABufs adds
extra overhead (more system calls, file descriptors, GEM objects) and
doesn't cover some of our use cases (userptr). We also need our own idr
for managing buffer handles that are used with all our memory and VM
management functions.


Yeah, well that is the second part of that idea.

When you have the drm_file for the VM, you can also use it to call 
drm_gem_object_lookup().
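
A rough sketch of that lookup path, assuming KFD keeps the imported
struct drm_file around (the function name is illustrative):

static struct amdgpu_bo *example_kfd_lookup_bo(struct drm_file *drm_priv,
					       u32 handle)
{
	struct drm_gem_object *gobj;

	/* Takes a GEM reference on success; the caller must drop it
	 * with drm_gem_object_put_unlocked() when done. */
	gobj = drm_gem_object_lookup(drm_priv, handle);
	if (!gobj)
		return NULL;

	return gem_to_amdgpu_bo(gobj);
}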


And when you need to iterate over all the BOs you can just use the 
object_idr or the VM structure as well.


Regards,
Christian.



Regards,
   Felix



Or do you mean a DMABuf FD exported by GEM? After importing the buffer
and closing the FD, we still need a way to refer to the buffer for
managing the GPU mappings and for releasing the reference. So we still
need handles that are currently provided by our own IDR. I think we'll
still need that.

Finally, the kfd_ioctl_alloc_memory_of_gpu will later also be used for
creating userptr BOs. Those can't be exported as DMABufs, so we still
need our own ioctl for creating that type of BO. That was going to be my
next patch series.

Regards,
   Felix



___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Documentation about AMD's HSA implementation?

2018-02-13 Thread Ming Yang
Hi,

I'm interested in HSA, and I was excited when I found AMD's fully open
ROCm stack supporting it. Before digging into the code, I wonder if there's
any documentation available about AMD's HSA implementation, whether a book,
whitepaper, or paper.

I did find helpful materials about HSA, including HSA standards on this
page (http://www.hsafoundation.com/standards/) and a nice book about HSA
(Heterogeneous System Architecture: A New Compute Platform Infrastructure).
But regarding documentation about AMD's implementation, I haven't found
anything yet.

Please let me know if any are publicly accessible. If not, I'd welcome any
suggestions on learning the implementation of specific system components,
e.g., queue scheduling.

Best,
Mark
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx