Re: [amdgpu] deadlock

2021-02-03 Thread Bridgman, John
>>Uh, that doesn't work. If you want infinite compute queues you need the
amdkfd model with preempt-ctx dma_fence. If you allow normal cs ioctl to
run forever, you just hang the kernel whenever userspace feels like it. Not
just the gpu, the kernel (anything that allocates memory, irrespective of
process, can hang). That's no good.

We have moved from using gfx paths to using kfd paths as of the 20.45 release a 
couple of months ago. Not sure if that applies to APUs yet, but if not I would 
expect it to just be a matter of time.

Thanks,
John
  Original Message
From: Daniel Vetter
Sent: Wednesday, February 3, 2021 9:27 AM
To: Alex Deucher
Cc: Linux Kernel Mailing List; dri-devel; amd-gfx list; Deucher, Alexander; 
Daniel Gomez; Koenig, Christian
Subject: Re: [amdgpu] deadlock


On Wed, Feb 03, 2021 at 08:56:17AM -0500, Alex Deucher wrote:
> On Wed, Feb 3, 2021 at 7:30 AM Christian König  
> wrote:
> >
> > Am 03.02.21 um 13:24 schrieb Daniel Vetter:
> > > On Wed, Feb 03, 2021 at 01:21:20PM +0100, Christian König wrote:
> > >> Am 03.02.21 um 12:45 schrieb Daniel Gomez:
> > >>> On Wed, 3 Feb 2021 at 10:47, Daniel Gomez  wrote:
> >  On Wed, 3 Feb 2021 at 10:17, Daniel Vetter  wrote:
> > > On Wed, Feb 3, 2021 at 9:51 AM Christian König 
> > >  wrote:
> > >> Am 03.02.21 um 09:48 schrieb Daniel Vetter:
> > >>> On Wed, Feb 3, 2021 at 9:36 AM Christian König 
> > >>>  wrote:
> >  Hi Daniel,
> > 
> >  this is not a deadlock, but rather a hardware lockup.
> > >>> Are you sure? Ime getting stuck in dma_fence_wait has generally good
> > >>> chance of being a dma_fence deadlock. GPU hang should never result 
> > >>> in
> > >>> a forever stuck dma_fence.
> > >> Yes, I'm pretty sure. Otherwise the hardware clocks wouldn't go up 
> > >> like
> > >> this.
> > > Maybe clarifying, could be both. TDR should notice and get us out of
> > > this, but if there's a dma_fence deadlock and we can't re-emit or
> > > force complete the pending things, then we're stuck for good.
> > > -Daniel
> > >
> > >> Question is rather why we end up in the userptr handling for GFX? Our
> > >> ROCm OpenCL stack shouldn't use this.
> > >>
> > >>> Daniel, can you pls re-hang your machine and then dump backtraces of
> > >>> all tasks into dmesg with sysrq-t, and then attach that? Without all
> > >>> the backtraces it's tricky to construct the full dependency chain of
> > >>> what's going on. Also is this plain -rc6, not some more patches on
> > >>> top?
> > >> Yeah, that's still a good idea to have.
> >  Here the full backtrace dmesg logs after the hang:
> >  https://pastebin.com/raw/kzivm2L3
> > 
> >  This is another dmesg log with the backtraces after SIGKILL the matrix 
> >  process:
> >  (I didn't have the sysrq enable at the time):
> >  https://pastebin.com/raw/pRBwGcj1
> > >>> I've now removed all our v4l2 patches and did the same test with the 
> > >>> 'plain'
> > >>> mainline version (-rc6).
> > >>>
> > >>> Reference: 3aaf0a27ffc29b19a62314edd684b9bc6346f9a8
> > >>>
> > >>> Same error, same behaviour. Full dmesg log attached:
> > >>> https://pastebin.com/raw/KgaEf7Y1
> > >>> Note:
> > >>> dmesg with sysrq-t before running the test starts in [  122.016502]
> > >>> sysrq: Show State
> > >>> dmesg with sysrq-t after the test starts in: [  495.587671] sysrq: 
> > >>> Show State
> > >> There is nothing amdgpu related in there except for waiting for the
> > >> hardware.
> > > Yeah, but there's also no other driver that could cause a stuck dma_fence,
> > > so why is reset not cleaning up the mess here? Irrespective of why the gpu
> > > is stuck, the kernel should at least complete all the dma_fences even if
> > > the gpu for some reason is terminally ill ...
> >
> > That's a good question as well. I'm digging into 
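
(For reference, a minimal way to capture the all-task backtraces requested above, assuming a typical distro kernel where sysrq is restricted by default:

    # enable all sysrq functions, then dump every task's backtrace into dmesg
    echo 1 | sudo tee /proc/sys/kernel/sysrq
    echo t | sudo tee /proc/sysrq-trigger
    sudo dmesg > backtraces.txt

The dump can be large, so the kernel log buffer may need log_buf_len= bumped on the command line.)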

Re: [bug] Radeon 3900XT not switch to graphic mode on kernel 5.10

2020-12-27 Thread Bridgman, John
[AMD Official Use Only - Internal Distribution Only]

If you want to pick up the firmware directly it is maintained at...

https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amdgpu


-rw-r--r-- sienna_cichlid_ce.bin     263296
-rw-r--r-- sienna_cichlid_dmcub.bin   80244
-rw-r--r-- sienna_cichlid_me.bin     263424
-rw-r--r-- sienna_cichlid_mec.bin    268592
-rw-r--r-- sienna_cichlid_mec2.bin   268592
-rw-r--r-- sienna_cichlid_pfp.bin    263424
-rw-r--r-- sienna_cichlid_rlc.bin    128592
-rw-r--r-- sienna_cichlid_sdma.bin    34048
-rw-r--r-- sienna_cichlid_smc.bin    247396
-rw-r--r-- sienna_cichlid_sos.bin    215152
-rw-r--r-- sienna_cichlid_ta.bin     333568
-rw-r--r-- sienna_cichlid_vcn.bin    504224

My understanding was that the firmware was also added to Fedora back in 
November but I'm having a tough time finding confirmation of that.
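
If you do need to drop the files in by hand, a minimal sketch (standard linux-firmware layout; adjust the initramfs step to your distro, e.g. dracut on Fedora) would be:

    git clone git://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git
    sudo cp linux-firmware/amdgpu/sienna_cichlid_*.bin /lib/firmware/amdgpu/
    sudo dracut --force    # regenerate the initramfs so the files are available at early boot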



From: amd-gfx  on behalf of Mikhail 
Gavrilov 
Sent: December 27, 2020 11:39 AM
To: amd-gfx list ; Linux List Kernel Mailing 
; dri-devel 
Subject: [bug] Radeon 3900XT not switch to graphic mode on kernel 5.10

Hi folks.
I bought myself a gift: a new AMD 6900 XT graphics card to replace the
AMD Radeon VII.
But all the joy was overshadowed by the fact that this video card did not work in Linux.
Output on my boot screen ended with the message "fb0: switching to
amdgpudrmfb from EFI VGA" and the video card did not switch to graphic mode.
https://photos.app.goo.gl/zwpErNrusq9CNyES7

I suppose the root cause of my problem is here:

[3.961326] amdgpu :0b:00.0: Direct firmware load for
amdgpu/sienna_cichlid_sos.bin failed with error -2
[3.961359] amdgpu :0b:00.0: amdgpu: failed to init sos firmware
[3.961433] [drm:psp_sw_init [amdgpu]] *ERROR* Failed to load psp firmware!
[3.961529] [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* sw_init
of IP block  failed -2
[3.961549] amdgpu :0b:00.0: amdgpu: amdgpu_device_ip_init failed
[3.961569] amdgpu :0b:00.0: amdgpu: Fatal error during GPU init
[3.961911] amdgpu: probe of :0b:00.0 failed with error -2

Can anybody here help me get firmware?
my distro: Fedora Rawhide
kernel: 5.10 rc6
mesa: from git 21.0.0 devel

Sorry for the disturbance, and merry Xmas.


--
Best Regards,
Mike Gavrilov.
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [amdkfd] where are AQL packets processed after writing the userspace doorbell

2020-12-21 Thread Bridgman, John
[AMD Official Use Only - Internal Distribution Only]

>Hi! I noticed that the AQL packets are more concise compared with PM4 packets. 
>It seemed that AQL packets need more post-processing than PM4 packets.
>I was wondering where the AQL packets are processed, such as calculating the 
>code address using code_entry_offset, resetting packets' headers to INVALID, 
>and writing values to the completion signal when finished.
>Are all these operations done by the firmware?

Yes, these operations are performed entirely by MEC firmware.

There were special cases (eg certain debug scenarios) where we used a "soft 
AQL" layer in the ROC runtime which interpreted AQL packets and translated them 
into a series of PM4 packets, but I don't believe that mechanism is used any 
more. It was never used during normal processing anyways.
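
For anyone curious what the firmware is actually parsing, here is a simplified sketch of the AQL kernel dispatch packet layout (based on the public hsa.h definition of hsa_kernel_dispatch_packet_t; treat exact field names as approximate):

    /* Simplified sketch of the 64-byte AQL kernel dispatch packet; MEC firmware
     * reads these fields straight out of the user-mode queue ring buffer. */
    #include <stdint.h>

    struct aql_kernel_dispatch_packet {
        uint16_t header;             /* packet type, barrier bit, acquire/release scopes */
        uint16_t setup;              /* number of grid dimensions */
        uint16_t workgroup_size_x, workgroup_size_y, workgroup_size_z;
        uint16_t reserved0;
        uint32_t grid_size_x, grid_size_y, grid_size_z;
        uint32_t private_segment_size;
        uint32_t group_segment_size;
        uint64_t kernel_object;      /* kernel descriptor address; the code entry offset lives there */
        uint64_t kernarg_address;    /* kernel argument buffer */
        uint64_t reserved2;
        uint64_t completion_signal;  /* hsa_signal_t handle the firmware signals on completion */
    };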
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Enabling AMDGPU by default for SI & CIK

2020-08-04 Thread Bridgman, John
[AMD Official Use Only - Internal Distribution Only]

At the risk of asking a dumb question, does amdgpu default to using DC on SI 
and CI ?

I'm asking because a lot of people seem to be using amdgpu successfully with 
analog outputs today on SI/CI... which suggests that they are not using DC ?

If so then would enabling HDMI/DP audio support without DC be sufficient to 
flip the switch assuming we felt that other risks were manageable ?

Thanks,
John


From: amd-gfx  on behalf of Alex Deucher 

Sent: August 4, 2020 1:35 PM
To: Michel Dänzer 
Cc: Deucher, Alexander ; Koenig, Christian 
; amd-gfx mailing list 
; Bas Nieuwenhuizen 
Subject: Re: Enabling AMDGPU by default for SI & CIK

On Tue, Aug 4, 2020 at 4:38 AM Michel Dänzer  wrote:
>
> On 2020-08-03 1:45 a.m., Bas Nieuwenhuizen wrote:
> > Hi all,
> >
> > Now that we have recently made some progress on getting feature parity
> > with the Radeon driver for SI, I'm wondering what it would take to
> > make AMDGPU the default driver for these generations.
> >
> > As far as I understand AMDGPU has had these features for CIK for a
> > while already but it is still not the default driver. What would it
> > take to make it the default? What is missing and/or broken?
>
> The main blockers I'm aware of for CIK are:
>
> 1) Lack of analogue connector support with DC
> 2) Lack of HDMI/DP audio support without DC
>
>
> 1) may apply to SI as well.

Also, IIRC, there are suspend and resume problems with some CIK parts
using amdgpu.

Alex
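
(For anyone who wants to try amdgpu on SI/CIK today without waiting for the default to change, the usual opt-in, assuming the kernel was built with CONFIG_DRM_AMDGPU_SI/CONFIG_DRM_AMDGPU_CIK, is on the kernel command line:

    radeon.si_support=0 radeon.cik_support=0 amdgpu.si_support=1 amdgpu.cik_support=1 )
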
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: slow rx 5600 xt fps

2020-05-19 Thread Bridgman, John
[AMD Official Use Only - Internal Distribution Only]

Suggest you use something more demanding than glxgears as a test - part of the 
problem is that glxgears runs so fast normally (30x faster than your display) 
that even a small amount of overhead copying a frame from one place to another 
makes a huge difference in FPS.

If you use a test program that normally runs at 90 FPS you'll probably find 
that the "slow" speed is something like 85 FPS, rather than the 6:1 difference 
you see with glxgears.
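
(Rough numbers to illustrate: if glxgears normally takes ~0.5 ms per frame (~2000 FPS), adding even ~2.5 ms of copy per frame pushes it to ~3 ms, i.e. ~330 FPS - about the 6:1 drop described. The same ~2.5 ms added to a game already taking ~11 ms per frame (90 FPS) only moves it to ~13.5 ms, around 74 FPS, a far smaller relative hit.)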


From: amd-gfx  on behalf of Javad Karabi 

Sent: May 19, 2020 9:16 PM
To: Alex Deucher 
Cc: amd-gfx list 
Subject: Re: slow rx 5600 xt fps

thanks for the answers alex.

so, i went ahead and got a displayport cable to see if that changes
anything. and now, when i run monitor only, and the monitor connected
to the card, it has no issues like before! so i am thinking that
somethings up with either the hdmi cable, or some hdmi related setting
in my system? who knows, but im just gonna roll with only using
displayport cables now.
the previous hdmi cable was actually pretty long, because i was
extending it with an hdmi extension cable, so maybe the signal was
really bad or something :/

but yea, i guess the only real issue now is maybe something simple
related to some sysfs entry about enabling some powermode, voltage,
clock frequency, or something, so that glxgears will give me more than
300 fps. but at least now i can use a single monitor configuration with
the monitor displayported up to the card.

also, one other thing i think you might be interested in, that was
happening before.

so, previously, with laptop -tb3-> egpu -hdmi-> monitor, there was a
funny thing happening which i never could figure out.
when i would look at the X logs, i would see that "modesetting" (for
the intel integrated graphics) was reporting that MonitorA was used
with "eDP-1",  which is correct and what i expected.
when i scrolled further down, i then saw that "HDMI-A-1-2" was being
used for another MonitorB, which also is what i expected (albeit i
have no idea why its saying A-1-2)
but amdgpu was _also_ saying that DisplayPort-1-2 (a port on the
radeon card) was being used for MonitorA, which is the same Monitor
that the modesetting driver had claimed to be using with eDP-1!

so the point is that amdgpu was "using" Monitor0 with DisplayPort-1-2,
although that is what modesetting was using for eDP-1.

anyway, thats a little aside, i doubt it was related to the terrible
hdmi experience i was getting, since its about display port and stuff,
but i thought id let you know about that.

if you think that is a possible issue, im more than happy to plug the
hdmi setup back in and create an issue on gitlab with the logs and
everything

On Tue, May 19, 2020 at 4:42 PM Alex Deucher  wrote:
>
> On Tue, May 19, 2020 at 5:22 PM Javad Karabi  wrote:
> >
> > lol youre quick!
> >
> > "Windows has supported peer to peer DMA for years so it already has a
> > numbers of optimizations that are only now becoming possible on Linux"
> >
> > whoa, i figured linux would be ahead of windows when it comes to
> > things like that. but peer-to-peer dma is something that is only
> > recently possible on linux, but has been possible on windows? what
> > changed recently that allows for peer to peer dma in linux?
> >
>
> A few things that made this more complicated on Linux:
> 1. Linux uses IOMMUs more extensively than windows so you can't just
> pass around physical bus addresses.
> 2. Linux supports lots of strange architectures that have a lot of
> limitations with respect to peer to peer transactions
>
> It just took years to get all the necessary bits in place in Linux and
> make everyone happy.
>
> > also, in the context of a game running opengl on some gpu, is the
> > "peer-to-peer" dma transfer something like: the game draw's to some
> > memory it has allocated, then a DMA transfer gets that and moves it
> > into the graphics card output?
>
> Peer to peer DMA just lets devices access another device's local memory
> directly.  So if you have a buffer in vram on one device, you can
> share that directly with another device rather than having to copy it
> to system memory first.  For example, if you have two GPUs, you can
> have one of them copy its contents directly to a buffer in the other
> GPU's vram rather than having to go through system memory first.
>
> >
> > also, i know it can be super annoying trying to debug an issue like
> > this, with someone like me who has all types of differences from a
> > normal setup (e.g. using it via egpu, using a kernel with custom
> > configs and stuff) so as a token of my appreciation i donated 50$ to
> > the red cross' corona virus outbreak charity thing, on behalf of
> > amd-gfx.
>
> Thanks,
>
> Alex
>
> >
> > On Tue, May 19, 2020 at 4:13 PM Alex Deucher  wrote:
> > >
> > > On Tue, May 19, 2020 at 3:44 PM Javad Karabi  
> > > wrote:
> > > >
> > > > just a couple more questions:
> > > >
> > > 

Re: Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?

2020-03-09 Thread Bridgman, John
[AMD Official Use Only - Internal Distribution Only]

>I know RX570 (polaris) should stay at PCI3 as far as I know.

Yep... thought I remembered you mentioning having a 5700XT though... is that in 
a different system ?


From: Clemens Eisserer 
Sent: March 9, 2020 2:30 AM
To: Bridgman, John ; amd-gfx@lists.freedesktop.org 

Subject: Re: Possibility of RX570 responsible for spontaneous reboots (MCE) 
with Ryzen 3700x?

Hi John,

Thanks a lot for taking the time to look at this, even if it doesn't
seem to be GPU related at first.

> OK, that's a bit strange... I found mcelog and MCE-Ryzen-Decoder as options 
> for decoding.
Sorry for omitting that information - indeed I was using
MCE-Ryzen-Decoder, thanks for pointing to mcelog.
The mcelog output definitely makes more sense, I'll try to
experiment a bit with RAM.

Thanks also for the link to the forum; it seems that of all the affected users,
no one reported success in that thread.

> For something as simple as the GPU bus interface not responding to an access
> by the CPU I think you would get a different error (bus error) but not 100% 
> sure about that.
>
> My first thought would be to see if your mobo BIOS has an option to force PCIE
> gen3 instead of 4 and see if that makes a difference. There are some amdgpu 
> module parms
> related to PCIE as well but I'm not sure which ones to recommend.

I'll give it a try and have a look at the pcie options - but as far as
I know the RX570 (Polaris) should stay at PCIe gen3.
Disabling IOMMU didn't help as far as I recall.

Thanks & best regards, Clemens
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?

2020-03-08 Thread Bridgman, John
[AMD Public Use]

Fixing the security tag...


From: amd-gfx  on behalf of Bridgman, 
John 
Sent: March 8, 2020 3:10 PM
To: Clemens Eisserer ; amd-gfx@lists.freedesktop.org 

Subject: Re: Possibility of RX570 responsible for spontaneous reboots (MCE) 
with Ryzen 3700x?

OK, that's a bit strange... I found mcelog and MCE-Ryzen-Decoder as options 
for decoding.

In MCE-Ryzen-Decoder docco the example is exactly the error you are seeing, 
with the same output, so guessing that is what you are using:

https://github.com/DimitriFourny/MCE-Ryzen-Decoder

On the other hand I found a report on AMD forums where the same error is 
decoded by mcelog as a generic error in a memory transaction, which seems to 
make more sense.

https://community.amd.com/thread/216084

For something as simple as the GPU bus interface not responding to an access by 
the CPU I think you would get a different error (bus error) but not 100% sure 
about that.

My first thought would be to see if your mobo BIOS has an option to force PCIE 
gen3 instead of 4 and see if that makes a difference. There are some amdgpu 
module parms related to PCIE as well but I'm not sure which ones to recommend.
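
For reference, a rough sketch of pulling the architecturally defined MCA_STATUS bits out of the raw value (this uses the generic x86 MCA layout; the SMCA banks on Ryzen add further fields on top, so treat it as a starting point only):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Print the common, architecturally defined MCA_STATUS bits. */
    static void decode_mca_status(uint64_t s)
    {
        printf("VAL=%d OVER=%d UC=%d EN=%d MISCV=%d ADDRV=%d PCC=%d\n",
               (int)((s >> 63) & 1), (int)((s >> 62) & 1), (int)((s >> 61) & 1),
               (int)((s >> 60) & 1), (int)((s >> 59) & 1), (int)((s >> 58) & 1),
               (int)((s >> 57) & 1));
        printf("MCA error code = 0x%04x\n", (unsigned)(s & 0xffff));
    }

    int main(int argc, char **argv)
    {
        if (argc > 1)   /* pass the full 64-bit status value reported by mcelog/dmesg */
            decode_mca_status(strtoull(argv[1], NULL, 16));
        return 0;
    }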


From: amd-gfx  on behalf of Bridgman, 
John 
Sent: March 8, 2020 2:45 PM
To: Clemens Eisserer ; amd-gfx@lists.freedesktop.org 

Subject: Re: Possibility of RX570 responsible for spontaneous reboots (MCE) 
with Ryzen 3700x?


[AMD Official Use Only - Internal Distribution Only]

The decoded MCE info doesn't look right... if the last bit is a zero I believe 
that means the watchdog timer is not enabled.

That said, I'm not sure how the decoder you found works, but it seems like a 
bit more information would be required than what you passed in. Can you point 
me to the program you used ?

Thanks,
John


From: amd-gfx  on behalf of Clemens 
Eisserer 
Sent: March 8, 2020 9:06 AM
To: amd-gfx@lists.freedesktop.org 
Subject: Possibility of RX570 responsible for spontaneous reboots (MCE) with 
Ryzen 3700x?

Hi there,

Right after Ryzen3xxx was available I built a new system consisting of:
- Asrock Phantom Gaming 4 X570 (latest BIOS 2.3)
- Ryzen 3700x (not overclocked)
- MSI RX570 4GB
- Larger CPU cooler, high quality PSU, etc...

The system runs stable with Windows-10 (no reboot BSOD in months) and
runs memtest86 (single/multicore) as well as various load-tests for
hours without errors. However running Linux I get a spontaneous reboot
every now and then (2-3x a week), with always the same machine check
exception logged:

[0.105003]  node  #0, CPUs:#1  #2
[0.107022] mce: [Hardware Error]: Machine check events logged
[0.107023] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5:
bea00108
[0.107092] mce: [Hardware Error]: TSC 0 ADDR 7f80a0c0181a MISC
d0120001 SYND 4d00 IPID 500b0
[0.107167] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
1580717835 SOCKET 0 APIC 4 microcode 8701013

I've tried a lot of different CPU-related things, like disabling C6,
disabling MWAIT use for task switching, etc without success.
I tried two times to contact AMD support only asking them to please
decode the MCE hex value - but as soon as they read over the term
"linux" the basically abort any communication. And to be honest, I had
the impression that they did not actually know what an MCE is in the
first place.

Luckily I found a decoder on github which prints:
Bank: Execution Unit (EX)
Error: Watchdog Timeout error (WDT 0x0)

I was rather hopeless until I found the following reddit thread:
https://www.reddit.com/r/archlinux/comments/e33nyg/hard_reboots_with_ryzen_3600x/
what the decoder logic is
The users there claim to

Re: Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?

2020-03-08 Thread Bridgman, John
OK, that's a bit strange... I found mcelog and MCE-Ryzen-Decoder as options 
for decoding.

In MCE-Ryzen-Decoder docco the example is exactly the error you are seeing, 
with the same output, so guessing that is what you are using:

https://github.com/DimitriFourny/MCE-Ryzen-Decoder

On the other hand I found a report on AMD forums where the same error is 
decoded by mcelog as a generic error in a memory transaction, which seems to 
make more sense.

https://community.amd.com/thread/216084

For something as simple as the GPU bus interface not responding to an access by 
the CPU I think you would get a different error (bus error) but not 100% sure 
about that.

My first thought would be to see if your mobo BIOS has an option to force PCIE 
gen3 instead of 4 and see if that makes a difference. There are some amdgpu 
module parms related to PCIE as well but I'm not sure which ones to recommend.


From: amd-gfx  on behalf of Bridgman, 
John 
Sent: March 8, 2020 2:45 PM
To: Clemens Eisserer ; amd-gfx@lists.freedesktop.org 

Subject: Re: Possibility of RX570 responsible for spontaneous reboots (MCE) 
with Ryzen 3700x?


[AMD Official Use Only - Internal Distribution Only]

The decoded MCE info doesn't look right... if the last bit is a zero I believe 
that means the watchdog timer is not enabled.

That said, I'm not sure how the decoder you found works, but it seems like a 
bit more information would be required than what you passed in. Can you point 
me to the program you used ?

Thanks,
John


From: amd-gfx  on behalf of Clemens 
Eisserer 
Sent: March 8, 2020 9:06 AM
To: amd-gfx@lists.freedesktop.org 
Subject: Possibility of RX570 responsible for spontaneous reboots (MCE) with 
Ryzen 3700x?

Hi there,

Right after Ryzen3xxx was available I built a new system consisting of:
- Asrock Phantom Gaming 4 X570 (latest BIOS 2.3)
- Ryzen 3700x (not overclocked)
- MSI RX570 4GB
- Larger CPU cooler, high quality PSU, etc...

The system runs stable with Windows-10 (no reboot BSOD in months) and
runs memtest86 (single/multicore) as well as various load-tests for
hours without errors. However running Linux I get a spontaneous reboot
every now and then (2-3x a week), with always the same machine check
exception logged:

[0.105003]  node  #0, CPUs:#1  #2
[0.107022] mce: [Hardware Error]: Machine check events logged
[0.107023] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5:
bea00108
[0.107092] mce: [Hardware Error]: TSC 0 ADDR 7f80a0c0181a MISC
d0120001 SYND 4d00 IPID 500b0
[0.107167] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
1580717835 SOCKET 0 APIC 4 microcode 8701013

I've tried a lot of different CPU-related things, like disabling C6,
disabling MWAIT use for task switching, etc without success.
I tried two times to contact AMD support only asking them to please
decode the MCE hex value - but as soon as they read over the term
"linux" the basically abort any communication. And to be honest, I had
the impression that they did not actually know what an MCE is in the
first place.

Luckily I found a decoder on github which prints:
Bank: Execution Unit (EX)
Error: Watchdog Timeout error (WDT 0x0)

I was rather hopeless until I found the following reddit thread:
https://www.reddit.com/r/archlinux/comments/e33nyg/hard_reboots_with_ryzen_3600x/
what the decoder logic is
The users there claim to experience exactly the same problem (even
with the same MCE-Code logged) but where using R600 based graphics
cards - he is even using the same mainboard. When he swapped his
R600-card with a new RX5700 the problems vanished.

I don't have the luxury to simply try another GPU (my RX5700 is the
only one properly driving my 4k@60Hz panel), however the whole
observation makes me wonder. How can a GPU be responsible for
low-level errors such as the machine check exception in the execution
units like the one mentioned above.
Could DMA transfers gone bad be the culprit?
Are there any "safe mode" options available I could try regarding
amdgpu (I tried disabling low-power states but this didn't help and
only made my GPU fans spin up)?

Any help is highly appreciated.

Thanks, Clemens
___
amd-gfx mailing list
amd-gfx@lists.fre

Re: Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?

2020-03-08 Thread Bridgman, John
[AMD Official Use Only - Internal Distribution Only]

The decoded MCE info doesn't look right... if the last bit is a zero I believe 
that means the watchdog timer is not enabled.

That said, I'm not sure how the decoder you found works, but it seems like a 
bit more information would be required than what you passed in. Can you point 
me to the program you used ?

Thanks,
John


From: amd-gfx  on behalf of Clemens 
Eisserer 
Sent: March 8, 2020 9:06 AM
To: amd-gfx@lists.freedesktop.org 
Subject: Possibility of RX570 responsible for spontaneous reboots (MCE) with 
Ryzen 3700x?

Hi there,

Right after Ryzen3xxx was available I built a new system consisting of:
- Asrock Phantom Gaming 4 X570 (latest BIOS 2.3)
- Ryzen 3700x (not overclocked)
- MSI RX570 4GB
- Larger CPU cooler, high quality PSU, etc...

The system runs stable with Windows-10 (no reboot BSOD in months) and
runs memtest86 (single/multicore) as well as various load-tests for
hours without errors. However running Linux I get a spontaneous reboot
every now and then (2-3x a week), with always the same machine check
exception logged:

[0.105003]  node  #0, CPUs:#1  #2
[0.107022] mce: [Hardware Error]: Machine check events logged
[0.107023] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5:
bea00108
[0.107092] mce: [Hardware Error]: TSC 0 ADDR 7f80a0c0181a MISC
d0120001 SYND 4d00 IPID 500b0
[0.107167] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
1580717835 SOCKET 0 APIC 4 microcode 8701013

I've tried a lot of different CPU-related things, like disabling C6,
disabling MWAIT use for task switching, etc without success.
I tried two times to contact AMD support only asking them to please
decode the MCE hex value - but as soon as they read over the term
"linux" the basically abort any communication. And to be honest, I had
the impression that they did not actually know what an MCE is in the
first place.

Luckily I found a decoder on github which prints:
Bank: Execution Unit (EX)
Error: Watchdog Timeout error (WDT 0x0)

I was rather hopeless until I found the following reddit thread:
https://www.reddit.com/r/archlinux/comments/e33nyg/hard_reboots_with_ryzen_3600x/
what the decoder logic is
The users there claim to experience exactly the same problem (even
with the same MCE-Code logged) but where using R600 based graphics
cards - he is even using the same mainboard. When he swapped his
R600-card with a new RX5700 the problems vanished.

I don't have the luxury to simply try another GPU (my RX5700 is the
only one properly driving my 4k@60Hz panel), however the whole
observation makes me wonder. How can a GPU be responsible for
low-level errors such as the machine check exception in the execution
units like the one mentioned above.
Could DMA transfers gone bad be the culprit?
Are there any "safe mode" options available I could try regarding
amdgpu (I tried disabling low-power states but this didn't help and
only made my GPU fans spin up)?

Any help is highly appreciated.

Thanks, Clemens
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [Intel-gfx] [Mesa-dev] gitlab.fd.o financial situation and impact on services

2020-03-01 Thread Bridgman, John
[AMD Official Use Only - Internal Distribution Only]

The one suggestion I saw that definitely seemed worth looking at was adding 
download caches if the larger CI systems didn't already have them.

Then again do we know that CI traffic is generating the bulk of the costs ? My 
guess would have been that individual developers and users would be generating 
as much traffic as the CI rigs.


From: amd-gfx  on behalf of Jason 
Ekstrand 
Sent: March 1, 2020 3:18 PM
To: Jacob Lifshay ; Nicolas Dufresne 

Cc: Erik Faye-Lund ; Daniel Vetter 
; Michel Dänzer ; X.Org development 
; amd-gfx list ; wayland 
; X.Org Foundation Board 
; Xorg Members List ; dri-devel 
; Mesa Dev ; 
intel-gfx ; Discussion of the development of 
and with GStreamer 
Subject: Re: [Intel-gfx] [Mesa-dev] gitlab.fd.o financial situation and impact 
on services

I don't think we need to worry so much about the cost of CI that we need to 
micro-optimize to get the minimal number of CI runs. We especially shouldn't 
if it begins to impact coffee quality, people's ability to merge patches in a 
timely manner, or visibility into what went wrong when CI fails. I've seen a 
number of suggestions which will do one or both of those things including:

 - Batching merge requests
 - Not running CI on the master branch
 - Shutting off CI
 - Preventing CI on other non-MR branches
 - Disabling CI on WIP MRs
 - I'm sure there are more...

I think there are things we can do to make CI runs more efficient with some 
sort of end-point caching and we can probably find some truly wasteful CI to 
remove. Most of the things in the list above, I've seen presented by people who 
are only lightly involved in the project, to my knowledge (no offense to anyone 
intended).  Developers depend on the CI system for their day-to-day work and 
hampering it will only slow down development, reduce code quality, and 
ultimately hurt our customers and community. If we're so desperate as to be 
considering painful solutions which will have a negative impact on development, 
we're better off trying to find more money.

--Jason


On March 1, 2020 13:51:32 Jacob Lifshay  wrote:

One idea for Marge-bot (don't know if you already do this):
Rust-lang has their bot (bors) automatically group together a few merge 
requests into a single merge commit, which it then tests; then, when the tests 
pass, it merges. This could help reduce CI runs to once a day (or some other 
rate). If the tests fail, then it could automatically deduce which one failed, 
by recursive subdivision or similar. There's also a mechanism to adjust 
priority and grouping behavior when the defaults aren't sufficient.

Jacob
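
(A toy sketch of the recursive-subdivision idea, assuming a single failing merge request in the batch; ci_passes() is a hypothetical stand-in for a real CI invocation:

    #include <stdbool.h>
    #include <stddef.h>

    extern bool ci_passes(const int *mrs, size_t n);   /* hypothetical CI oracle over a set of MRs */

    /* Bisect a failed batch to find a failing MR with O(log n) extra CI runs
     * instead of one run per MR. */
    static int find_failing_mr(const int *mrs, size_t n)
    {
        if (n == 1)
            return mrs[0];
        size_t half = n / 2;
        if (!ci_passes(mrs, half))
            return find_failing_mr(mrs, half);
        return find_failing_mr(mrs + half, n - half);
    }
)
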
___
Intel-gfx mailing list
intel-...@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: writing custom driver for VGA emulation ?

2020-02-18 Thread Bridgman, John
[AMD Official Use Only - Internal Distribution Only]

>And we already checked, 256MB is unfortunately the minimum you can resize the 
>VRAM BAR on the E9171 to.

Ahh, OK... I didn't realize we had already looked into that. I guess that 
approach isn't going to work.

Yusuf, guessing you are using a 32-bit CPU ? Is it possible to talk to whoever 
does SBIOS for your platform to see if you could maybe reduce address space 
allocated to RAM and bump up the MMIO space ?
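
(For reference, a quick way to see how much MMIO the card's BARs are requesting versus what the platform hands out, assuming a standard Linux userland:

    lspci -vv -s <bus:dev.fn> | grep -i region   # per-BAR sizes for the E9171
    grep -i "pci bus" /proc/iomem                # MMIO windows assigned to each root bus
)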


From: Christian König 
Sent: February 18, 2020 9:19 AM
To: Bridgman, John ; Alex Deucher 
; Yusuf Altıparmak 
Cc: amd-gfx list 
Subject: Re: writing custom driver for VGA emulation ?

The problem Yusuf runs into is that his platform has multiple PCIe root hubs, 
but only 512MB of MMIO address space. That is not enough to fit all the BARs of 
an E9171 into.

But without the BARs neither the VGA emulation nor amdgpu not anything else 
will work correctly.

And we already checked, 256MB is unfortunately the minimum you can resize the 
VRAM BAR on the E9171 to.

What could maybe work is to trick the upstream bridge of the VGA device into 
not routing all the addresses to the BARs and actually use only a smaller 
portion of visible VRAM. But that would be highly experimental and requires a 
rather big hack into the PCI(e) subsystem in the Linux kernel.

Regards,
Christian.

Am 18.02.20 um 15:08 schrieb Bridgman, John:

[AMD Official Use Only - Internal Distribution Only]

Does the VBIOS come up with something like a splash screen, ie is VBIOS able to 
initialize and drive the card ?

If so then another option might be to use a VESA driver rather than VGA.



From: amd-gfx  on behalf of Alex Deucher 
Sent: February 18, 2020 8:50 AM
To: Yusuf Altıparmak 
Cc: amd-gfx list 
Subject: Re: writing custom driver for VGA emulation ?

On Tue, Feb 18, 2020 at 2:56 AM Yusuf Altıparmak  wrote:
>
> Hello AMD team;
>
> I have an E9171 GPU and want to use it on an embedded system which has limited 
> MMIO space on the PCIe bus (max 512 MB).
>
> I received feedback that I can only use VGA emulation with this memory 
> space. I was unable to get the 'amdgpu' driver working with Xorg because I hit 
> many errors (firmware not loading) at each step and got tired of solving them 
> one by one.
>
> I want to write a simple custom driver for this GPU with kernel version 4.19.
> Is it possible to print some colors on screen with a custom driver over PCIe 
> communication, or to write some words on screen as VGA ?
>
> If the answer is yes, then which code pieces (in the amdgpu driver folder) or 
> reference documentation should I use? I have Register Reference Guide.pdf.
>
> I would appreciate your guidance.

That is not going to do what you want on your platform.  The VGA
emulation requires that you set up the card first to enable it, which
in turn requires MMIO access and thus you are back to square one.

Alex
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: writing custom driver for VGA emulation ?

2020-02-18 Thread Bridgman, John
[AMD Official Use Only - Internal Distribution Only]

Does the VBIOS come up with something like a splash screen, ie is VBIOS able to 
initialize and drive the card ?

If so then another option might be to use a VESA driver rather than VGA.



From: amd-gfx  on behalf of Alex Deucher 

Sent: February 18, 2020 8:50 AM
To: Yusuf Altıparmak 
Cc: amd-gfx list 
Subject: Re: writing custom driver for VGA emulation ?

On Tue, Feb 18, 2020 at 2:56 AM Yusuf Altıparmak
 wrote:
>
> Hello AMD team;
>
> I have an E9171 GPU and want to use it on an embedded system which has limited 
> MMIO space on the PCIe bus (max 512 MB).
>
> I received feedback that I can only use VGA emulation with this memory 
> space. I was unable to get the 'amdgpu' driver working with Xorg because I hit 
> many errors (firmware not loading) at each step and got tired of solving them 
> one by one.
>
> I want to write a simple custom driver for this GPU with kernel version 4.19.
> Is it possible to print some colors on screen with a custom driver over PCIe 
> communication, or to write some words on screen as VGA ?
>
> If the answer is yes, then which code pieces (in the amdgpu driver folder) or 
> reference documentation should I use? I have Register Reference Guide.pdf.
>
> I would appreciate your guidance.

That is not going to do what you want on your platform.  The VGA
emulation requires that you set up the card first to enable it, which
in turn requires MMIO access and thus you are back to square one.

Alex
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Irc and Other Things

2019-03-15 Thread Bridgman, John
Hi Nick,


I can see your username on the radeon IRC channel, so that much is working. 
When I try to chat it just hangs up at "offering DCC CHAT to haweh" though.


What kind of error message are you getting ?


Thanks,

John


From: amd-gfx  on behalf of nick 

Sent: March 15, 2019 1:12 PM
To: amd-gfx@lists.freedesktop.org
Subject: Irc and Other Things

Greetings All,

I was trying to ask a question on IRC but after registering it just complains 
and I have no idea why,
if you want my nick to check it out, it was haweh. Anyhow I'm a student with 
interest in helping out with
your mesa stack.

I've already got a few very minor patches in gcc and was working on it for 
a while before other things,
including optimizing the STL for rvalues in move/copy constructors, which 
should have been noexcept.
One of the maintainers there continued my idea it seems with the new 
filesystem class and other
STL libraries.

Sorry if this is the wrong list and just point me to the correct list,

Nick
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] drm/amdgpu: Set VM_L2_CNTL.PDE_FAULT_CLASSIFICATION to 0

2019-02-25 Thread Bridgman, John
Or is the idea that we should never see a PDE fault unless something goes 
wrong, and that we would set up an entry corresponding to an unmapped subtree 
as an invalid PTE for a very large page rather than an invalid PDE?

Thanks,
John
  Original Message
From: Bridgman, John
Sent: Monday, February 25, 2019 22:46
To: Alex Deucher; Zhao, Yong
Cc: amd-gfx list
Subject: Re: [PATCH] drm/amdgpu: Set VM_L2_CNTL.PDE_FAULT_CLASSIFICATION to 0


Don't we want PDE faults to be treated the same way as page faults? Or am I 
misinterpreting the commit message?

Thanks,
John
  Original Message
From: Alex Deucher
Sent: Monday, February 25, 2019 21:53
To: Zhao, Yong
Cc: amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: Set VM_L2_CNTL.PDE_FAULT_CLASSIFICATION to 0


On Mon, Feb 25, 2019 at 6:03 PM Zhao, Yong  wrote:
>
> This is recommended by HW designers. Previously when it was set to 1,
> the PDE walk error in VM fault will be treated as
> PERMISSION_OR_INVALID_PAGE_FAULT rather than usually expected OTHER_FAULT.
> As a result, the retry control in VM_CONTEXT*_CNTL will change accordingly.
>
> The above behavior is kind of abnormal. Furthermore, the
> PDE_FAULT_CLASSIFICATION == 1 feature was targeted for very old ASICs
> and it never made its way to production. Therefore, we should set it to 0.
>
> Signed-off-by: Yong Zhao 

Acked-by: Alex Deucher 

> ---
>  drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c | 2 +-
>  drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c  | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
> index f5edddf3b29d..c10ed568ca6c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
> @@ -143,7 +143,7 @@ static void gfxhub_v1_0_init_cache_regs(struct 
> amdgpu_device *adev)
> /* XXX for emulation, Refer to closed source code.*/
> tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, 
> L2_PDE0_CACHE_TAG_GENERATION_MODE,
> 0);
> -   tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, PDE_FAULT_CLASSIFICATION, 1);
> +   tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, PDE_FAULT_CLASSIFICATION, 0);
> tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, CONTEXT1_IDENTITY_ACCESS_MODE, 
> 1);
> tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, IDENTITY_MODE_FRAGMENT_SIZE, 0);
> WREG32_SOC15(GC, 0, mmVM_L2_CNTL, tmp);
> diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c 
> b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c
> index d0d966d6080a..2a039946a549 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c
> @@ -163,7 +163,7 @@ static void mmhub_v1_0_init_cache_regs(struct 
> amdgpu_device *adev)
> /* XXX for emulation, Refer to closed source code.*/
> tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, 
> L2_PDE0_CACHE_TAG_GENERATION_MODE,
> 0);
> -   tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, PDE_FAULT_CLASSIFICATION, 1);
> +   tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, PDE_FAULT_CLASSIFICATION, 0);
> tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, CONTEXT1_IDENTITY_ACCESS_MODE, 
> 1);
> tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, IDENTITY_MODE_FRAGMENT_SIZE, 0);
> WREG32_SOC15(MMHUB, 0, mmVM_L2_CNTL, tmp);
> --
> 2.17.1
>
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] drm/amdgpu: Set VM_L2_CNTL.PDE_FAULT_CLASSIFICATION to 0

2019-02-25 Thread Bridgman, John
Don't we want PDE faults to be treated the same way as page faults? Or am I 
misinterpreting the commit message?

Thanks,
John
  Original Message
From: Alex Deucher
Sent: Monday, February 25, 2019 21:53
To: Zhao, Yong
Cc: amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: Set VM_L2_CNTL.PDE_FAULT_CLASSIFICATION to 0


On Mon, Feb 25, 2019 at 6:03 PM Zhao, Yong  wrote:
>
> This is recommended by HW designers. Previously when it was set to 1,
> the PDE walk error in VM fault will be treated as
> PERMISSION_OR_INVALID_PAGE_FAULT rather than usually expected OTHER_FAULT.
> As a result, the retry control in VM_CONTEXT*_CNTL will change accordingly.
>
> The above behavior is kind of abnormal. Furthermore, the
> PDE_FAULT_CLASSIFICATION == 1 feature was targeted for very old ASICs
> and it never made its way to production. Therefore, we should set it to 0.
>
> Signed-off-by: Yong Zhao 

Acked-by: Alex Deucher 

> ---
>  drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c | 2 +-
>  drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c  | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
> index f5edddf3b29d..c10ed568ca6c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
> @@ -143,7 +143,7 @@ static void gfxhub_v1_0_init_cache_regs(struct 
> amdgpu_device *adev)
> /* XXX for emulation, Refer to closed source code.*/
> tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, 
> L2_PDE0_CACHE_TAG_GENERATION_MODE,
> 0);
> -   tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, PDE_FAULT_CLASSIFICATION, 1);
> +   tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, PDE_FAULT_CLASSIFICATION, 0);
> tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, CONTEXT1_IDENTITY_ACCESS_MODE, 
> 1);
> tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, IDENTITY_MODE_FRAGMENT_SIZE, 0);
> WREG32_SOC15(GC, 0, mmVM_L2_CNTL, tmp);
> diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c 
> b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c
> index d0d966d6080a..2a039946a549 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c
> @@ -163,7 +163,7 @@ static void mmhub_v1_0_init_cache_regs(struct 
> amdgpu_device *adev)
> /* XXX for emulation, Refer to closed source code.*/
> tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, 
> L2_PDE0_CACHE_TAG_GENERATION_MODE,
> 0);
> -   tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, PDE_FAULT_CLASSIFICATION, 1);
> +   tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, PDE_FAULT_CLASSIFICATION, 0);
> tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, CONTEXT1_IDENTITY_ACCESS_MODE, 
> 1);
> tmp = REG_SET_FIELD(tmp, VM_L2_CNTL, IDENTITY_MODE_FRAGMENT_SIZE, 0);
> WREG32_SOC15(MMHUB, 0, mmVM_L2_CNTL, tmp);
> --
> 2.17.1
>
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: amdkfd build regression if HSA_AMD is disabled

2018-12-10 Thread Bridgman, John
Do we still need the HSA_AMD option ?


Seems to me that KFD stopped being "something we only sometimes include" a long 
time ago.


Thanks,
John


From: amd-gfx  on behalf of StDenis, Tom 

Sent: December 10, 2018 10:02 AM
To: Kuehling, Felix
Cc: Huang, JinHuiEric; Deucher, Alexander; amd-gfx mailing list
Subject: amdkfd build regression if HSA_AMD is disabled

Hi All,

The commit:

commit 62f65d3cb34a8300bf1e07fc478e03c3c02634d4
Refs: v4.20-rc3-524-g62f65d3cb34a
Author: Felix Kuehling 
AuthorDate: Mon Nov 19 20:05:54 2018 -0500
Commit: Felix Kuehling 
CommitDate: Fri Dec 7 17:17:11 2018 -0500

 drm/amdgpu: Add KFD VRAM limit checking

 We don't want KFD processes evicting each other over VRAM usage.
 Therefore prevent overcommitting VRAM among KFD applications with
 a per-GPU limit. Also leave enough room for page tables on top
 of the application memory usage.

 Signed-off-by: Felix Kuehling 
 Reviewed-by: Eric Huang 
 Acked-by: Alex Deucher 

Breaks the build if HSA_AMD is not enabled:

scripts/kconfig/conf  --syncconfig Kconfig
   DESCEND  objtool
   CALLscripts/checksyscalls.sh
   CHK include/generated/compile.h
   Building modules, stage 2.
   MODPOST 63 modules
Kernel: arch/x86/boot/bzImage is ready  (#58)
ERROR: "amdgpu_amdkfd_unreserve_memory_limit"
[drivers/gpu/drm/amd/amdgpu/amdgpu.ko] undefined!
make[1]: *** [scripts/Makefile.modpost:92: __modpost] Error 1
make: *** [Makefile:1271: modules] Error 2

This is because the function being used is not included in the build
(the previous function called was part of amdgpu_amdkfd.c which is
unconditionally built).

Cheers,
Tom
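
The usual pattern for this kind of breakage (a sketch of the general approach, not necessarily the patch that actually landed) is to provide a static inline stub when the option is off, so callers compile either way:

    /* hypothetical placement in amdgpu_amdkfd.h */
    #ifdef CONFIG_HSA_AMD
    void amdgpu_amdkfd_unreserve_memory_limit(struct amdgpu_bo *bo);
    #else
    static inline void amdgpu_amdkfd_unreserve_memory_limit(struct amdgpu_bo *bo)
    {
    }
    #endif
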
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: SOC15 DC support warning

2018-07-03 Thread Bridgman, John
I have seen a couple of reports that booting Raven desktop parts requires 
disabling DC, although I'm not sure if that actually makes sense (I didn't 
think we implemented non-DC display paths for Vega/Raven).


Might be specific to people with a dGPU plugged in, I guess ?
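
(For reference, "disabling DC" in those reports normally means either building without CONFIG_DRM_AMD_DC or booting with the standard module parameter on the kernel command line:

    amdgpu.dc=0 )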


From: amd-gfx  on behalf of Alex Deucher 

Sent: July 3, 2018 12:45 PM
To: Michel Dänzer
Cc: StDenis, Tom; Kuehling, Felix; amd-gfx@lists.freedesktop.org
Subject: Re: SOC15 DC support warning

On Tue, Jul 3, 2018 at 12:36 PM, Michel Dänzer  wrote:
> On 2018-07-03 06:13 PM, Felix Kuehling wrote:
>> On 2018-07-03 10:19 AM, Tom St Denis wrote:
>>> Hi all,
>>>
>>> This block
>>>
>>> #if defined(CONFIG_DRM_AMD_DC)
>>> else if (amdgpu_device_has_dc_support(adev))
>>> amdgpu_device_ip_block_add(adev, &dm_ip_block);
>>> #else
>>> #warning "Enable CONFIG_DRM_AMD_DC for display support on SOC15."
>>> #endif
>>>
>>>
>>> in soc15_set_ip_blocks() should probably be ported to a runtime
>>> drm_warn() no?  This means a kernel with no DC support will just
>>> silently fail to light up on SOC15.
>>>
>>> I can write a patch for this if nobody objects.
>>
>> Maybe do both. So someone building a kernel without DC gets a warning as
>> well as someone running it.
>
> Yep, we already do both for CONFIG_MTRR & CONFIG_X86_PAT.
>
> BTW, the build warning should also be guarded by #ifndef
> CONFIG_COMPILE_TEST, per commit 31bb90f1cd084 "drm/amdgpu: shut up
> #warning for compile testing".
>
>
> OTOH, do we even need the capability to build the driver without DC anymore?

Yeah, we can probably go ahead and remove CONFIG_DRM_AMD_DC.

Alex
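
(A runtime version of the warning along the lines Tom suggests could look roughly like this - sketch only, not a tested patch:

    #if defined(CONFIG_DRM_AMD_DC)
    	else if (amdgpu_device_has_dc_support(adev))
    		amdgpu_device_ip_block_add(adev, &dm_ip_block);
    #else
    	else if (amdgpu_device_has_dc_support(adev))
    		DRM_WARN("CONFIG_DRM_AMD_DC disabled: no display support on SOC15\n");
    #endif
)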

>
>
> --
> Earthling Michel Dänzer   |   http://www.amd.com
> Libre software enthusiast | Mesa and X developer
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: Documentation about AMD's HSA implementation?

2018-03-17 Thread Bridgman, John

>-Original Message-
>From: Ming Yang [mailto:minos.fut...@gmail.com]
>Sent: Saturday, March 17, 2018 12:35 PM
>To: Kuehling, Felix; Bridgman, John
>Cc: amd-gfx@lists.freedesktop.org
>Subject: Re: Documentation about AMD's HSA implementation?
>
>Hi,
>
>After digging into documents and code, our previous discussion about GPU
>workload scheduling (mainly HWS and ACE scheduling) makes a lot more
>sense to me now.  Thanks a lot!  I'm writing this email to ask more questions.
>Before asking, I first share a few links to the documents that are most helpful
>to me.
>
>GCN (1st gen.?) architecture whitepaper
>https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf
>Notes: ACE scheduling.
>
>Polaris architecture whitepaper (4th gen. GCN)
>http://radeon.com/_downloads/polaris-whitepaper-4.8.16.pdf
>Notes: ACE scheduling; HWS; quick response queue (priority assignment);
>compute units reservation.
>
>AMDKFD patch cover letters:
>v5: https://lwn.net/Articles/619581/
>v1: https://lwn.net/Articles/605153/
>
>A comprehensive performance analysis of HSA and OpenCL 2.0:
>http://ieeexplore.ieee.org/document/7482093/
>
>Partitioning resources of a processor (AMD patent)
>https://patents.google.com/patent/US8933942B2/
>Notes: Compute resources are allocated according to the resource
>requirement percentage of the command.
>
>Here come my questions about ACE scheduling:
>Many of my questions are about ACE scheduling because the firmware is
>closed-source and how ACE schedules commands (queues) is not detailed
>enough in these documents.  I'm not able to run experiments on Raven Ridge
>yet.
>
>1. Wavefronts of one command scheduled by an ACE can be spread out to
>multiple compute engines (shader arrays)?  This is quite confirmed by the
>cu_mask setting, as cu_mask for one queue can cover CUs over multiple
>compute engines.

Correct, assuming the work associated with the command is not trivially small
and so generates enough wavefronts to require multiple CUs. 
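
(For reference, the cu_mask mentioned above is settable per queue from user space; a minimal sketch using the ROCr extension API, assuming hsa_ext_amd.h from the ROCm runtime:

    #include <hsa/hsa.h>
    #include <hsa/hsa_ext_amd.h>

    /* Restrict a queue to CUs 0-31 (sketch; error handling omitted).
     * The second argument is the number of bits in the mask and must be a
     * multiple of 32. */
    static hsa_status_t limit_queue_cus(hsa_queue_t *queue)
    {
        uint32_t cu_mask[2] = { 0xffffffffu, 0x0u };
        return hsa_amd_queue_cu_set_mask(queue, 64, cu_mask);
    }
)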

>
>2.  If so, how is the competition resolved between commands scheduled by
>ACEs?  What's the scheduling scheme?  For example, when each ACE has a
>command ready to occupy 50% compute resources, are these 4 commands
>each occupies 25%, or they execute in the round-robin with 50% resources at
>a time?  Or just the first two scheduled commands execute and the later two
>wait?

Depends on how you measure compute resources, since each SIMD in a CU can
have up to 10 separate wavefronts running on it as long as total register usage
for all the threads does not exceed the number available in HW. 

If each ACE (let's say pipe for clarity) has enough work to put a single 
wavefront
on 50% of the SIMDs then all of the work would get scheduled to the SIMDs (4
SIMDs per CU) and run in a round-robin-ish manner as each wavefront was 
blocked waiting for memory access.

If each pipe has enough work to fill 50% of the CUs and all pipes/queues were
assigned the same priority (see below) then the behaviour would be more like
"each one would get 25% and each time a wavefront finished another one would
be started". 
 
>
>3. If the barrier bit of the AQL packet is not set, does ACE schedule the
>following command using the same scheduling scheme in #2?

Not sure, barrier behaviour has paged so far out of my head that I'll have to 
skip
this one.

>
>4. ACE takes 3 pipe priorities: low, medium, and high, even though AQL queue
>has 7 priority levels, right?

Yes-ish. Remember that there are multiple levels of scheduling going on here. At
any given time a pipe is only processing work from one of the queues; queue 
priorities affect the pipe's round-robin-ing between queues in a way that I have
managed to forget (but will try to find). There is a separate pipe priority, 
which
IIRC is actually programmed per queue and takes effect when the pipe is active
on that queue. There is also a global (IIRC) setting which adjusts how compute
work and graphics work are prioritized against each other, giving options like
making all compute lower priority than graphics or making only high priority
compute get ahead of graphics.

I believe the pipe priority is also referred to as SPI priority, since it 
affects
the way SPI decides which pipe (graphics/compute) to accept work from 
next.
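
Purely to illustrate the two levels being described (the names and the mapping
below are hypothetical, not the real register interface): the per-queue
priority influences how the pipe round-robins between its own queues, while
the coarser pipe/SPI priority is what competes against graphics:

    /* Hypothetical sketch of collapsing a 7-level queue priority onto the
     * three pipe/SPI priority levels mentioned above. */
    enum pipe_priority { PIPE_PRIO_LOW, PIPE_PRIO_MEDIUM, PIPE_PRIO_HIGH };

    static enum pipe_priority pipe_prio_from_queue_prio(unsigned queue_prio /* 0..6 */)
    {
            if (queue_prio <= 1)
                    return PIPE_PRIO_LOW;
            if (queue_prio <= 4)
                    return PIPE_PRIO_MEDIUM;
            return PIPE_PRIO_HIGH;
    }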

This is all a bit complicated by a separate (global IIRC) option which 
randomizes
priority settings in order to avoid deadlock in certain conditions. We used to 
have that enabled by default (believe it was needed for specific OpenCL 
programs) but not sure if it is still enabled - if so then most of the above 
gets
murky because of the randomization.

At first glance we do not enable randomization for Polaris or Vega but do for
all of the older parts. Haven't looked at Raven yet.

>
>5. Is this pate

Re: ROCm installation from source

2018-03-10 Thread Bridgman, John
If you look about half way down this page under "The latest ROCm platform - 
ROCm 1.7" you should see a list of links to component repos. Each of those 
component repos has information on building:


https://github.com/RadeonOpenCompute/ROCm


If you have questions or problems building/installing you can use the "Issues" 
button at the top level repo or the component level repo to contact the ROC 
development team.



From: amd-gfx  on behalf of Joseph Wang 

Sent: March 10, 2018 5:30 AM
To: amd-gfx@lists.freedesktop.org
Subject: ROCm installation from source

Is there a document that describes the steps necessary to run an ROCm setup 
from source?

Thanks.
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: Documentation about AMD's HSA implementation?

2018-02-13 Thread Bridgman, John

>-Original Message-
>From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf Of
>Bridgman, John
>Sent: Tuesday, February 13, 2018 6:42 PM
>To: Ming Yang; Kuehling, Felix
>Cc: Deucher, Alexander; amd-gfx@lists.freedesktop.org
>Subject: RE: Documentation about AMD's HSA implementation?
>
>
>
>>-Original Message-
>>From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf
>>Of Ming Yang
>>Sent: Tuesday, February 13, 2018 4:59 PM
>>To: Kuehling, Felix
>>Cc: Deucher, Alexander; amd-gfx@lists.freedesktop.org
>>Subject: Re: Documentation about AMD's HSA implementation?
>>
>>That's very helpful, thanks!
>>
>>On Tue, Feb 13, 2018 at 4:17 PM, Felix Kuehling
>><felix.kuehl...@amd.com>
>>wrote:
>>> On 2018-02-13 04:06 PM, Ming Yang wrote:
>>>> Thanks for the suggestions!  But I might ask several specific
>>>> questions, as I can't find the answer in those documents, to give
>>>> myself a quick start if that's okay. Pointing me to the
>>>> files/functions would be good enough.  Any explanations are
>>>> appreciated.   My purpose is to hack it with different scheduling
>>>> policy with real-time and predictability consideration.
>>>>
>>>> - Where/How is the packet scheduler implemented?  How are packets
>>>> from multiple queues scheduled?  What about scheduling packets from
>>>> queues in different address spaces?
>>>
>>> This is done mostly in firmware. The CP engine supports up to 32 queues.
>>> We share those between KFD and AMDGPU. KFD gets 24 queues to use.
>>> Usually that is 6 queues times 4 pipes. Pipes are threads in the CP
>>> micro engine. Within each pipe the queues are time-multiplexed.
>>
>>Please correct me if I'm wrong.  CP is computing processor, like the
>>Execution Engine in NVIDIA GPU. Pipe is like wavefront (warp) scheduler
>>multiplexing queues, in order to hide memory latency.
>
>CP is one step back from that - it's a "command processor" which reads
>command packets from driver (PM4 format) or application (AQL format) then
>manages the execution of each command on the GPU. A typical packet might
>be "dispatch", which initiates a compute operation on an N-dimensional array,
>or "draw" which initiates the rendering of an array of triangles. Those
>compute and render commands then generate a (typically) large number of
>wavefronts which are multiplexed on the shader core (by SQ IIRC). Most of
>our recent GPUs have one micro engine for graphics ("ME") and two for
>compute ("MEC"). Marketing refers to each pipe on an MEC block as an "ACE".

I missed one important point - "CP" refers to the combination of ME, MEC(s) and 
a few other related blocks.

>>
>>>
>>> If we need more than 24 queues, or if we have more than 8 processes,
>>> the hardware scheduler (HWS) adds another layer of scheduling, basically
>>> round-robin between batches of 24 queues or 8 processes. Once you get
>>> into such an over-subscribed scenario your performance and GPU
>>> utilization can suffer quite badly.
>>
>>HWS is also implemented in the firmware that's closed-source?
>
>Correct - HWS is implemented in the MEC microcode. We also include a simple
>SW scheduler in the open source driver code, however.
>>
>>>
>>>>
>>>> - I noticed the new support of concurrency of multi-processes in the
>>>> archive of this mailing list.  Could you point me to the code that
>>>> implements this?
>>>
>>> That's basically just a switch that tells the firmware that it is
>>> allowed to schedule queues from different processes at the same time.
>>> The upper limit is the number of VMIDs that HWS can work with. It
>>> needs to assign a unique VMID to each process (each VMID representing
>>> a separate address space, page table, etc.). If there are more
>>> processes than VMIDs, the HWS has to time-multiplex.
>>
>>HWS dispatch packets in their order of becoming the head of the queue,
>>i.e., being pointed by the read_index? So in this way it's FIFO.  Or
>>round-robin between queues? You mentioned round-robin over batches in
>>the over- subscribed scenario.
>
>Round robin between sets of queues. The HWS logic generates sets as
>follows:
>
>1. "set resources" packet from driver tells scheduler how many VMIDs and
>HW queues it can use
>
>2. "runlist" packet from driver provides list of processes and list of queues 
>for

RE: Documentation about AMD's HSA implementation?

2018-02-13 Thread Bridgman, John


>-Original Message-
>From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf Of
>Ming Yang
>Sent: Tuesday, February 13, 2018 4:59 PM
>To: Kuehling, Felix
>Cc: Deucher, Alexander; amd-gfx@lists.freedesktop.org
>Subject: Re: Documentation about AMD's HSA implementation?
>
>That's very helpful, thanks!
>
>On Tue, Feb 13, 2018 at 4:17 PM, Felix Kuehling 
>wrote:
>> On 2018-02-13 04:06 PM, Ming Yang wrote:
>>> Thanks for the suggestions!  But I might ask several specific
>>> questions, as I can't find the answer in those documents, to give
>>> myself a quick start if that's okay. Pointing me to the
>>> files/functions would be good enough.  Any explanations are
>>> appreciated.   My purpose is to hack it with different scheduling
>>> policy with real-time and predictability consideration.
>>>
>>> - Where/How is the packet scheduler implemented?  How are packets
>>> from multiple queues scheduled?  What about scheduling packets from
>>> queues in different address spaces?
>>
>> This is done mostly in firmware. The CP engine supports up to 32 queues.
>> We share those between KFD and AMDGPU. KFD gets 24 queues to use.
>> Usually that is 6 queues times 4 pipes. Pipes are threads in the CP
>> micro engine. Within each pipe the queues are time-multiplexed.
>
>Please correct me if I'm wrong.  CP is computing processor, like the Execution
>Engine in NVIDIA GPU. Pipe is like wavefront (warp) scheduler multiplexing
>queues, in order to hide memory latency.

CP is one step back from that - it's a "command processor" which reads command 
packets from driver (PM4 format) or application (AQL format) then manages the 
execution of each command on the GPU. A typical packet might be "dispatch", 
which initiates a compute operation on an N-dimensional array, or "draw" which 
initiates the rendering of an array of triangles. Those compute and render 
commands then generate a (typically) large number of wavefronts which are 
multiplexed on the shader core (by SQ IIRC). Most of our recent GPUs have one 
micro engine for graphics ("ME") and two for compute ("MEC"). Marketing refers 
to each pipe on an MEC block as an "ACE".
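
Just to tie the numbers in Felix's reply together (values are the ones quoted
in this thread, not pulled from the code):

    /* Queue bookkeeping as described above: the CP exposes 32 HW queue slots,
     * 24 of which go to KFD as 4 pipes x 6 queues, the rest staying with amdgpu. */
    #define CP_TOTAL_QUEUES        32
    #define KFD_PIPES               4
    #define KFD_QUEUES_PER_PIPE     6
    #define KFD_QUEUES             (KFD_PIPES * KFD_QUEUES_PER_PIPE)  /* = 24 */
    #define AMDGPU_GFX_QUEUES      (CP_TOTAL_QUEUES - KFD_QUEUES)     /* =  8 */
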
>
>>
>> If we need more than 24 queues, or if we have more than 8 processes,
>> the hardware scheduler (HWS) adds another layer of scheduling, basically
>> round-robin between batches of 24 queues or 8 processes. Once you get
>> into such an over-subscribed scenario your performance and GPU
>> utilization can suffer quite badly.
>
>HWS is also implemented in the firmware that's closed-source?

Correct - HWS is implemented in the MEC microcode. We also include a simple SW 
scheduler in the open source driver code, however. 
>
>>
>>>
>>> - I noticed the new support of concurrency of multi-processes in the
>>> archive of this mailing list.  Could you point me to the code that
>>> implements this?
>>
>> That's basically just a switch that tells the firmware that it is
>> allowed to schedule queues from different processes at the same time.
>> The upper limit is the number of VMIDs that HWS can work with. It
>> needs to assign a unique VMID to each process (each VMID representing
>> a separate address space, page table, etc.). If there are more
>> processes than VMIDs, the HWS has to time-multiplex.
>
>HWS dispatch packets in their order of becoming the head of the queue, i.e.,
>being pointed by the read_index? So in this way it's FIFO.  Or round-robin
>between queues? You mentioned round-robin over batches in the over-
>subscribed scenario.

Round robin between sets of queues. The HWS logic generates sets as follows:

1. "set resources" packet from driver tells scheduler how many VMIDs and HW 
queues it can use

2. "runlist" packet from driver provides list of processes and list of queues 
for each process

3. if multi-process switch not set, HWS schedules as many queues from the first 
process in the runlist as it has HW queues (see #1)

4. at the end of process quantum (set by driver) either switch to next process 
(if all queues from first process have been scheduled) or schedule next set of 
queues from the same process

5. when all queues from all processes have been scheduled and run for a process 
quantum, go back to the start of the runlist and repeat

If the multi-process switch is set, and the number of queues for a process is 
less than the number of HW queues available, then in step #3 above HWS will 
start scheduling queues for additional processes, using a different VMID for 
each process, and continue until it either runs out of VMIDs or HW queues (or 
reaches the end of the runlist). All of the queues and processes would then run 
together for a process quantum before switching to the next queue set.
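
A self-contained toy model of that walk, for readers who prefer code (this only
paraphrases the behaviour described above for the case where the multi-process
switch is not set; names, numbers and output are made up and have nothing to do
with the actual runlist packet format or MEC microcode):

    #include <stdio.h>

    #define HW_QUEUES     24   /* from the "set resources" packet */
    #define NUM_PROCESSES  3   /* processes in the runlist */

    /* how many queues each process put in the runlist */
    static const int runlist_queues[NUM_PROCESSES] = { 30, 4, 4 };

    int main(void)
    {
            int pass, p;

            for (pass = 0; pass < 2; pass++) {            /* step 5: wrap around */
                    for (p = 0; p < NUM_PROCESSES; p++) {
                            int remaining = runlist_queues[p];

                            /* steps 3/4: keep mapping sets of up to HW_QUEUES
                             * queues from this process, one quantum per set,
                             * before moving to the next process */
                            while (remaining > 0) {
                                    int mapped = remaining > HW_QUEUES ?
                                                 HW_QUEUES : remaining;
                                    printf("pass %d: process %d runs %d queue(s) for one quantum\n",
                                           pass, p, mapped);
                                    remaining -= mapped;
                            }
                    }
            }
            return 0;
    }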

>
>This might not be a big deal for performance, but it matters for predictability
>and real-time analysis.

Agreed. In general you would not want to overcommit either VMIDs or HW queues 
in a real-time scenario, and for hard real time you would probably 

Re: [PATCH] drm/amdgpu: Add place holder for soc15 asic init on emulation

2018-02-06 Thread Bridgman, John
Yes, I made that suggestion in order to make it easy for us to replace the 
massive function we use in emulation with a stub function for upstream. I 
figured if we had that function in a separate file it would be cleaner, but 
maybe that is not the case.


I was thinking that we could replace the upstream stub file with an 
NPI-specific file for most of our emulator work, then go back to the stub file 
once VBIOS was enabled and we no longer needed the SOC init sequence from HW.


I guess from a CM perspective it doesn't make a lot of difference whether we 
add 20,000 lines to a file and delete them later or replace a short file with a 
20,000 line file and then replace it again later... so maybe we don't need to 
move the function out after all.



From: Liu, Shaoyun
Sent: February 6, 2018 4:55 PM
To: Alex Deucher; Bridgman, John
Cc: amd-gfx list
Subject: RE: [PATCH] drm/amdgpu: Add place holder for soc15 asic init on 
emulation

Yes, I tried to put the asic-specific register sequences for emulation in a 
separate emu_soc.c file; I think it was suggested by John or you?  The 
asic-specific code will be kept in a separate bring-up branch, but a common 
place to add them seems not bad to me.

Shaoyun.liu

-Original Message-
From: Alex Deucher [mailto:alexdeuc...@gmail.com]
Sent: Tuesday, February 06, 2018 4:46 PM
To: Liu, Shaoyun
Cc: amd-gfx list
Subject: Re: [PATCH] drm/amdgpu: Add place holder for soc15 asic init on 
emulation

On Tue, Feb 6, 2018 at 4:41 PM, Shaoyun Liu <shaoyun@amd.com> wrote:
> Change-Id: I6ff04e1199d1ebdbdb31d0e7e8ca3c240c61ab3a
> Signed-off-by: Shaoyun Liu <shaoyun@amd.com>

What is the purpose of this?  If it's just a hint to the driver writer as to 
where to add the emulation register sequences from the IP teams, I don't see 
much value in it.  Just add a comment.

Alex


> ---
>  drivers/gpu/drm/amd/amdgpu/Makefile  |  2 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu.h  |  2 ++
> drivers/gpu/drm/amd/amdgpu/emu_soc.c | 33 +
>  drivers/gpu/drm/amd/amdgpu/soc15.c   |  4 
>  4 files changed, 40 insertions(+), 1 deletion(-)  create mode 100644
> drivers/gpu/drm/amd/amdgpu/emu_soc.c
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile
> b/drivers/gpu/drm/amd/amdgpu/Makefile
> index 1f6d43e..6f5db5e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/Makefile
> +++ b/drivers/gpu/drm/amd/amdgpu/Makefile
> @@ -41,7 +41,7 @@ amdgpu-$(CONFIG_DRM_AMDGPU_CIK)+= cik.o cik_ih.o
> kv_smc.o kv_dpm.o \  amdgpu-$(CONFIG_DRM_AMDGPU_SI)+= si.o gmc_v6_0.o
> gfx_v6_0.o si_ih.o si_dma.o dce_v6_0.o si_dpm.o si_smc.o
>
>  amdgpu-y += \
> -   vi.o mxgpu_vi.o nbio_v6_1.o soc15.o mxgpu_ai.o nbio_v7_0.o 
> vega10_reg_init.o vega20_reg_init.o nbio_v7_4.o
> +   vi.o mxgpu_vi.o nbio_v6_1.o soc15.o emu_soc.o mxgpu_ai.o
> + nbio_v7_0.o vega10_reg_init.o vega20_reg_init.o nbio_v7_4.o
>
>  # add GMC block
>  amdgpu-y += \
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index d417cfb..13aa8a8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1740,6 +1740,8 @@ void amdgpu_mm_wreg(struct amdgpu_device *adev,
> uint32_t reg, uint32_t v,  bool amdgpu_device_asic_has_dc_support(enum
> amd_asic_type asic_type);  bool amdgpu_device_has_dc_support(struct
> amdgpu_device *adev);
>
> +int emu_soc_asic_init(struct amdgpu_device *adev);
> +
>  /*
>   * Registers read & write functions.
>   */
> diff --git a/drivers/gpu/drm/amd/amdgpu/emu_soc.c
> b/drivers/gpu/drm/amd/amdgpu/emu_soc.c
> new file mode 100644
> index 000..d72c25c
> --- /dev/null
> +++ b/drivers/gpu/drm/amd/amdgpu/emu_soc.c
> @@ -0,0 +1,33 @@
> +/*
> + * Copyright 2018 Advanced Micro Devices, Inc.
> + *
> + * Permission is hereby granted, free of charge, to any person
> +obtaining a
> + * copy of this software and associated documentation files (the
> +"Software"),
> + * to deal in the Software without restriction, including without
> +limitation
> + * the rights to use, copy, modify, merge, publish, distribute,
> +sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom
> +the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be
> +included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> +EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF


Is there an expectation that MAM block features will become available under 
virtualization at some point in the future, eg Navi10 ?

> +MERCHANTABILITY,
> + * FITNE

RE: [PATCH] drm/amdgpu: Basic emulation support

2018-02-05 Thread Bridgman, John


>-Original Message-
>From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf Of
>Christian König
>Sent: Monday, February 05, 2018 11:49 AM
>To: Alex Deucher; Liu, Shaoyun
>Cc: amd-gfx list
>Subject: Re: [PATCH] drm/amdgpu: Basic emulation support
>
>Am 05.02.2018 um 17:45 schrieb Alex Deucher:
>> On Thu, Feb 1, 2018 at 6:16 PM, Shaoyun Liu 
>wrote:
>>> Add amdgpu_emu_mode module parameter to control the emulation
>mode
>>> Avoid vbios operation on emulation since there is no vbios post
>>> during emulation, use the common hw_init to simulate the post
>>>
>>> Change-Id: Iba32fa16e735490e7401e471219797b83c6c2a58
>>> Signed-off-by: Shaoyun Liu 
>> Acked-by: Alex Deucher 
>
>Acked-by: Christian König  as well.

Maybe add a comment to the following change indicating that we might have done 
early HW init either due to emulation or early init of GMC during normal 
operation ? 
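
Something along these lines, for example (wording is just a suggestion for the
hunk below):

    /* hw_init may already have been done for this block above: the COMMON
     * block when amdgpu_emu_mode == 1, or GMC during normal early init. */
    if (adev->ip_blocks[i].status.hw)
            continue;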

Otherwise change is Reviewed-by: John Bridgman 

>
>>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h|  1 +
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26
>+++---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  4 
>>>   3 files changed, 28 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> index ab10295..4c9c320 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> @@ -129,6 +129,7 @@
>>>   extern int amdgpu_lbpw;
>>>   extern int amdgpu_compute_multipipe;
>>>   extern int amdgpu_gpu_recovery;
>>> +extern int amdgpu_emu_mode;
>>>
>>>   #ifdef CONFIG_DRM_AMDGPU_SI
>>>   extern int amdgpu_si_support;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 6adb6e8..fe7a941 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -1339,6 +1339,20 @@ static int amdgpu_device_ip_init(struct
>amdgpu_device *adev)
>>>  return r;
>>>  }
>>>  adev->ip_blocks[i].status.sw = true;
>>> +
>>> +   if (amdgpu_emu_mode == 1) {
>>> +   /* Need to do common hw init first on emulation  */
>>> +   if (adev->ip_blocks[i].version->type ==
>AMD_IP_BLOCK_TYPE_COMMON) {
>>> +   r = 
>>> adev->ip_blocks[i].version->funcs->hw_init((void
>*)adev);
>>> +   if (r) {
>>> +   DRM_ERROR("hw_init of IP block <%s> 
>>> failed %d\n",
>>> +   
>>> adev->ip_blocks[i].version->funcs->name, r);
>>> +   return r;
>>> +   }
>>> +   adev->ip_blocks[i].status.hw = true;
>>> +   }
>>> +   }
>>> +
>>>  /* need to do gmc hw init early so we can allocate gpu mem 
>>> */
>>>  if (adev->ip_blocks[i].version->type ==
>AMD_IP_BLOCK_TYPE_GMC) {
>>>  r = amdgpu_device_vram_scratch_init(adev);
>>> @@ -1372,8 +1386,7 @@ static int amdgpu_device_ip_init(struct
>amdgpu_device *adev)
>>>  for (i = 0; i < adev->num_ip_blocks; i++) {
>>>  if (!adev->ip_blocks[i].status.sw)
>>>  continue;
>>> -   /* gmc hw init is done early */
>>> -   if (adev->ip_blocks[i].version->type ==
>AMD_IP_BLOCK_TYPE_GMC)
>>> +   if (adev->ip_blocks[i].status.hw)
>>>  continue;
>>>  r = adev->ip_blocks[i].version->funcs->hw_init((void 
>>> *)adev);
>>>  if (r) {
>>> @@ -1914,6 +1927,9 @@ int amdgpu_device_init(struct amdgpu_device
>*adev,
>>>  if (runtime)
>>>  vga_switcheroo_init_domain_pm_ops(adev->dev,
>>> &adev->vga_pm_domain);
>>>
>>> +   if (amdgpu_emu_mode == 1)
>>> +   goto fence_driver_init;
>>> +
>>>  /* Read BIOS */
>>>  if (!amdgpu_get_bios(adev)) {
>>>  r = -EINVAL;
>>> @@ -1966,6 +1982,7 @@ int amdgpu_device_init(struct amdgpu_device
>*adev,
>>>  amdgpu_atombios_i2c_init(adev);
>>>  }
>>>
>>> +fence_driver_init:
>>>  /* Fence driver */
>>>  r = amdgpu_fence_driver_init(adev);
>>>  if (r) {
>>> @@ -2108,7 +2125,10 @@ void amdgpu_device_fini(struct amdgpu_device
>*adev)
>>>  /* free i2c buses */
>>>  if (!amdgpu_device_has_dc_support(adev))
>>>  amdgpu_i2c_fini(adev);
>>> -   amdgpu_atombios_fini(adev);
>>> +
>>> +   if (amdgpu_emu_mode != 1)
>>> +   amdgpu_atombios_fini(adev);
>>> +
>>>  kfree(adev->bios);
>>>  adev->bios = NULL;
>>>  

RE: [PATCH 1/2] drm/amdgpu: Set module parameter for emulation

2018-02-01 Thread Bridgman, John
If it helps, my recollection was that Intel was also pushing some pre-silicon 
support code upstream.

Agree that if the changes get big/messy/invasive we should rethink this, but my 
impression is that the changes can be fairly small. There will be one Big 
Honkin' function that programs ~10,000 registers with some readback and delay 
logic specific to the emulator, but we can replace that with a stub and maybe 
move it to a separate file.
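
For upstream that stub really can be tiny - something like this sketch (minus
the usual license header), with the emulator-only init sequence staying in the
bring-up branch:

    /* emu_soc.c: placeholder for the SOC init sequence used on the emulator. */
    int emu_soc_asic_init(struct amdgpu_device *adev)
    {
            return 0;
    }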

Christian, are you OK with upstreaming the ZFB patches ? We will be using those 
on both emulator and real silicon.

Thanks,
John

>-Original Message-
>From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf Of
>Alex Deucher
>Sent: Thursday, February 01, 2018 3:31 PM
>To: Koenig, Christian
>Cc: amd-gfx list; Liu, Shaoyun
>Subject: Re: [PATCH 1/2] drm/amdgpu: Set module parameter for emulation
>
>On Thu, Feb 1, 2018 at 3:19 PM, Christian König
> wrote:
>> I don't think we should push any emulation specific code upstream.
>>
>> Nobody outside of AMD can test anything of that not actually make any
>> use of it.
>
>It makes it much easier to maintain the code however and debug things on
>the emulator in the future if we encounter an issue, even after we get silicon
>back.  Some emulation features can even be used on real silicon, although
>there is not much value in doing so.
>
>
>>
>> Regards,
>> Christian.
>>
>>
>> Am 01.02.2018 um 21:15 schrieb Shaoyun Liu:
>>>
>>> During the emulation period, use direct firmware load and only
>>> enable the GFX, SDMA and necessary common, gmc, ih IP blocks
>>>
>>> Signed-off-by: Shaoyun Liu 
>>>
>>> Change-Id: I325910fa06be4060725f404e471cc79daaf343c3
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 10 +-
>>>   1 file changed, 9 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> index 5a5ed47..7a1c670 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> @@ -93,10 +93,18 @@
>>>   int amdgpu_msi = -1;
>>>   int amdgpu_lockup_timeout = 1;
>>>   int amdgpu_dpm = -1;
>>> -int amdgpu_fw_load_type = -1;
>>>   int amdgpu_aspm = -1;
>>>   int amdgpu_runtime_pm = -1;
>>> +#ifndef AMDGPU_EMULATOR_BUILD
>>>   uint amdgpu_ip_block_mask = 0x;
>>> +int amdgpu_fw_load_type = -1;
>>> +#else
>>> +/* Only enable GFX and  SDMA + common, gmc, ih IP  block for
>>> +emulation */ uint amdgpu_ip_block_mask = 0xc7;
>
>I'm not sure it's a good idea to hardcode the block mask in this case.
>We'll be changing it as we test additional blocks on the emulator.
>
>Alex
>
>>> +/* Normally, only direct load is support durign emulation time */
>>> +int amdgpu_fw_load_type = 0; #endif
>>> +
>>>   int amdgpu_bapm = -1;
>>>   int amdgpu_deep_color = 0;
>>>   int amdgpu_vm_size = -1;
>>
>>
>> ___
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>___
>amd-gfx mailing list
>amd-gfx@lists.freedesktop.org
>https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: Strange issue on Vega 8 Mobile (HP Envy x360 Laptop)

2018-01-29 Thread Bridgman, John
Yes, please file a bugzilla ticket and attach dmesg output.

From: Min Xu [mailto:min.xu.pub...@gmail.com]
Sent: Monday, January 29, 2018 3:04 PM
To: Bridgman, John
Cc: amd-gfx@lists.freedesktop.org
Subject: Re: Strange issue on Vega 8 Mobile (HP Envy x360 Laptop)

Thanks for explaining that firmware is not permanently loaded. Then, I have no 
ideas. Do you need my dmesg output?

On Mon, Jan 29, 2018 at 12:02 PM, Min Xu 
<min.xu.pub...@gmail.com> wrote:
I am running archlinux, the firmware version is very new:

linux-firmware-20180119.2a713be-1


On Mon, Jan 29, 2018 at 12:01 PM, Bridgman, John 
<john.bridg...@amd.com> wrote:
Microcode for the GPU hardware blocks is not permanently updated in the chip, 
but rather is loaded at power-up. Usually the files will be distributed via a 
package with a name like linux-firmware.

I didn't see a mention of which distro/version you are using but along with new 
kernel you will need a relatively new version of linux-firmware.

From: amd-gfx 
[mailto:amd-gfx-boun...@lists.freedesktop.org]
 On Behalf Of Min Xu
Sent: Monday, January 29, 2018 2:57 PM
To: amd-gfx@lists.freedesktop.org
Subject: Strange issue on Vega 8 Mobile (HP Envy x360 Laptop)

Dear AMD GFX developers,

I just got a HP Envy x360 laptop and I am trying to run linux on it. I want to 
first thank you all for the great work on the amdgpu driver. Without it, people 
like me wants to run Linux would be stuck with windows.

I suspect that my issue is a new issue that hasn't been reported before, 
therefore, I am writing to you to see if there indeed is a new issue and 
whether there is a workaround.

I have compiled the latest kernel from the amd-staging-drm-next branch last 
night. I think the kernel "works" with my GPU. The previous two kernels I tried 
(4.14 and 4.15 final release) either doesn't support this card or just simply 
hang the system most of the time.

The issue I have is that the graphic card seems to never switch to the high 
resolution mode of the monitor. The kernel would boot with the default 800x600 
VGA graphics and then get stuck. The monitor continues to display the content 
written to the 800x600 console (some kernel booting messages) after amdgpu 
takes over. I can see from kernel dmesg the amdgpu driver found my card and 
initialized it and seems to be all happy about it. Yet, nothing new is 
displayed on the monitor. The monitor just stuck at the content of the 800x600 
graphics.

The keyboard works in this situation. Kernel is alive and I can reboot it by 
pressing Ctrl+Alt+Del, and I saw from the log file that the system restarts 
just fine.

I have tried different knobs of the amdgpu driver, like 
amdgpu.exp_hw_support=1, si_support=0, etc. Nothing seems to work. I am just 
stuck, unable to switch to 1920x1080.

I suspect that this is related to my firmware version. I confirmed the latest 
firmware is installed on my /usr/lib/firmware/ dir. The reason I suspect that 
it is a firmware issue is that I got the machine just 2 days ago and I have 
updated windows 10 to build 1709, which is very new. I suspect that windows 
have updated the GPU's firmware and the linux driver isn't working with it. If 
so, is there a way to force a firmware load from the linux side (i.e. a 
downgrade).

Other users on the internet has report success with this particular machine 
with 4.15 kernel. Given that 4.15 doesn't work for me, the only thing that I 
could think of is the firmware version.

Any other ideas?


Thanks a lot,
Min



___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: Strange issue on Vega 8 Mobile (HP Envy x360 Laptop)

2018-01-29 Thread Bridgman, John
Microcode for the GPU hardware blocks is not permanently updated in the chip, 
but rather is loaded at power-up. Usually the files will be distributed via a 
package with a name like linux-firmware.

I didn't see a mention of which distro/version you are using but along with new 
kernel you will need a relatively new version of linux-firmware.
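
For anyone curious what "loaded at power-up" means on the driver side, here is
a simplified sketch (the file name is only an example, and the real per-IP
loaders do more bookkeeping): at init the driver asks the kernel firmware
loader for each microcode image shipped in linux-firmware and then uploads it
to the corresponding hardware block.

    #include <linux/firmware.h>

    /* Minimal sketch: fetch one microcode image from /lib/firmware at init. */
    static int load_gpu_ucode(struct device *dev, const struct firmware **fw)
    {
            int r = request_firmware(fw, "amdgpu/example_mec.bin", dev);

            if (r)
                    dev_err(dev, "microcode missing - is linux-firmware up to date?\n");
            return r; /* caller uploads (*fw)->data, then calls release_firmware(*fw) */
    }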

From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf Of Min Xu
Sent: Monday, January 29, 2018 2:57 PM
To: amd-gfx@lists.freedesktop.org
Subject: Strange issue on Vega 8 Mobile (HP Envy x360 Laptop)

Dear AMD GFX developers,

I just got a HP Envy x360 laptop and I am trying to run linux on it. I want to 
first thank you all for the great work on the amdgpu driver. Without it, people 
like me wants to run Linux would be stuck with windows.

I suspect that my issue is a new issue that hasn't been reported before, 
therefore, I am writing to you to see if there indeed is a new issue and 
whether there is a workaround.

I have compiled the latest kernel from the amd-staging-drm-next branch last 
night. I think the kernel "works" with my GPU. The previous two kernels I tried 
(4.14 and 4.15 final release) either doesn't support this card or just simply 
hang the system most of the time.

The issue I have is that the graphic card seems to never switch to the high 
resolution mode of the monitor. The kernel would boot with the default 800x600 
VGA graphics and then get stuck. The monitor continues to display the content 
written to the 800x600 console (some kernel booting messages) after amdgpu 
takes over. I can see from kernel dmesg the amdgpu driver found my card and 
initialized it and seems to be all happy about it. Yet, nothing new is 
displayed on the monitor. The monitor just stuck at the content of the 800x600 
graphics.

The keyboard works in this situation. Kernel is alive and I can reboot it by 
pressing Ctrl+Alt+Del, and I saw from the log file that the system restarts 
just fine.

I have tried different knobs of the amdgpu driver, like 
amdgpu.exp_hw_support=1, si_support=0, etc. Nothing seems to work. I am just 
stuck, unable to switch to 1920x1080.

I suspect that this is related to my firmware version. I confirmed the latest 
firmware is installed on my /usr/lib/firmware/ dir. The reason I suspect that 
it is a firmware issue is that I got the machine just 2 days ago and I have 
updated windows 10 to build 1709, which is very new. I suspect that windows 
have updated the GPU's firmware and the linux driver isn't working with it. If 
so, is there a way to force a firmware load from the linux side (i.e. a 
downgrade).

Other users on the internet has report success with this particular machine 
with 4.15 kernel. Given that 4.15 doesn't work for me, the only thing that I 
could think of is the firmware version.

Any other ideas?


Thanks a lot,
Min

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: PCIe3 atomics requirement for amdkfd

2018-01-03 Thread Bridgman, John
Agreed - MEC microcode uses atomics when the queue type is set to AQL (rather 
than PM4).
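
For reference, more recent kernels expose a generic PCI core helper a driver
can use to confirm that the path to the root complex supports the required
AtomicOp completer sizes; a minimal sketch (not necessarily the exact call
site amdkfd uses):

    #include <linux/pci.h>

    /* Ask the PCI core to enable AtomicOp requests routed to the root port;
     * fails if anything on the path lacks 32/64-bit completer support. */
    static int check_pcie_atomics(struct pci_dev *pdev)
    {
            return pci_enable_atomic_ops_to_root(pdev,
                            PCI_EXP_DEVCAP2_ATOMIC_COMP32 |
                            PCI_EXP_DEVCAP2_ATOMIC_COMP64);
    }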

>-Original Message-
>From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf Of
>Liu, Shaoyun
>Sent: Wednesday, January 03, 2018 11:24 AM
>To: tstel...@redhat.com; Felix Kühling; amd-gfx@lists.freedesktop.org
>Cc: Kuehling, Felix
>Subject: RE: PCIe3 atomics requirement for amdkfd
>
>I think currently atomic  Ops are only used in AQL package which is only
>available for ROCm , graphics workload will not use AQL package.
>
>Regards
>Shaoyun.liu
>
>-Original Message-
>From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf Of
>Tom Stellard
>Sent: Wednesday, January 03, 2018 9:57 AM
>To: Felix Kühling; amd-gfx@lists.freedesktop.org
>Cc: Kuehling, Felix
>Subject: Re: PCIe3 atomics requirement for amdkfd
>
>On 12/23/2017 07:40 AM, Felix Kühling wrote:
>> As I understand it, it would require changes in the ROCr Runtime and
>> in the firmware (MEC microcode). It also changes the programming
>> model, so it may affect certain applications or higher level language
>> runtimes that rely on atomic operations.
>>
>
>How does the MEC microcode know that it is running a ROCm workload as
>opposed to a graphics workload that doesn't require PCIe3 atomics.  Is there a
>specific configuration bit that is set to indicate the ROCm programming model
>is needed?
>
>-Tom
>
>> Regards,
>>   Felix
>>
>>
>> Am 19.12.2017 um 16:04 schrieb Tom Stellard:
>>> Hi,
>>>
>>> How hard of a requirement is PCIe3 atomics for dGPUs with the amdkfd
>>> kernel driver?  Is it possible to make modifications to the
>>> runtime/kernel driver to drop this requirement?
>>>
>>> -Tom
>>> ___
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>>
>
>___
>amd-gfx mailing list
>amd-gfx@lists.freedesktop.org
>https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>___
>amd-gfx mailing list
>amd-gfx@lists.freedesktop.org
>https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: PCIe3 atomics requirement for amdkfd

2017-12-25 Thread Bridgman, John
Let's separate out OpenCL from HCC/HIP and the rest of the ROCm stack. 

We are working on a solution to deliver OpenCL without requiring atomics, but 
not the rest of the ROCm stack.

>-Original Message-
>From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf Of
>Luke A. Guest
>Sent: Monday, December 25, 2017 9:55 AM
>To: amd-gfx@lists.freedesktop.org
>Subject: Re: PCIe3 atomics requirement for amdkfd
>
>Hi,
>
>I have to agree here. At least there should be a non-PCIe-3.0 pathway which
>implements them on the CPU, I mean, they're fairly simple atomics, CAS,
>SWAP, FetchAdd.
>
>What AMD have actually done is royally screwed over anyone with an FX
>chipset, i.e. no OpenCL - the open source AMD one requires ROCm, they
>abandoned Clover, maybe PoCL will work? Who knows.
>
>
>On 19/12/17 15:04, Tom Stellard wrote:
>> Hi,
>>
>> How hard of a requirement is PCIe3 atomics for dGPUs with the amdkfd
>> kernel driver?  Is it possible to make modifications to the
>> runtime/kernel driver to drop this requirement?
>>
>> -Tom
>> ___
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>___
>amd-gfx mailing list
>amd-gfx@lists.freedesktop.org
>https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: Did my graphics card get damaged?

2017-11-05 Thread Bridgman, John
For clarity, are you saying that when you go back to whatever distro you had 
installed on the machine previously it is still  not booting correctly ?

The only problems I see in the xorg log are the following lines at the end, but 
not sure if they are serious or not. 

[   205.542] (WW) RADEON(0): flip queue failed: Invalid argument
[   205.542] (WW) RADEON(0): Page flip failed: Invalid argument
[   205.544] (WW) RADEON(0): flip queue failed: Invalid argument
[   205.544] (WW) RADEON(0): Page flip failed: Invalid argument

If it makes you feel any better I just ran into a similar problem installing 
latest Ubuntu 17.10 on a newer laptop - no POST, no display at all. Haven't had 
time to mess with it to figure out what the problem is but will try to make 
time.

>-Original Message-
>From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf Of
>Vladimir Klebanov
>Sent: Sunday, November 05, 2017 4:08 PM
>To: amd-gfx@lists.freedesktop.org
>Subject: Did my graphics card get damaged?
>
>Hello,
>
>I have an older T40p Thinkpad with an ATI Mobility FireGL 9000 / ATI RV250
>(M9).
>
>I recently tried to boot the latest OpenSuSE Tumblewewd live USB distribution
>on it. The text part went well but the process of starting X / KDE did not go
>beyond a black screen.
>
>Since then, the laptop refuses to boot. It powers up, but the screen remains
>completely black and there are no beeps (i.e., no POST).
>
>I attach the Xorg.0.log. Any ideas on what happened or what could bring the
>laptop back to life?
>
>Thanks,
>
>Vladimir
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] drm/amdgpu: use 2MB fragment size for GFX6,7 and 8

2017-09-18 Thread Bridgman, John
Acked-by: John Bridgman 

>-Original Message-
>From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf Of
>Christian König
>Sent: Monday, September 18, 2017 8:34 AM
>To: amd-gfx@lists.freedesktop.org
>Subject: [PATCH] drm/amdgpu: use 2MB fragment size for GFX6,7 and 8
>
>From: Christian König 
>
>Use 2MB fragment size by default for older hardware generations as well.
>
>Signed-off-by: Christian König 
>---
> drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c | 2 +-
>drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c | 2 +-
>drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 2 +-
> 3 files changed, 3 insertions(+), 3 deletions(-)
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
>b/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
>index 5be9c83..2d1f3f6 100644
>--- a/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
>+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
>@@ -831,7 +831,7 @@ static int gmc_v6_0_sw_init(void *handle)
>   if (r)
>   return r;
>
>-  amdgpu_vm_adjust_size(adev, 64, 4);
>+  amdgpu_vm_adjust_size(adev, 64, 9);
>   adev->vm_manager.max_pfn = adev->vm_manager.vm_size << 18;
>
>   adev->mc.mc_mask = 0xffULL;
>diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
>b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
>index eace9e7..2256277 100644
>--- a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
>+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
>@@ -970,7 +970,7 @@ static int gmc_v7_0_sw_init(void *handle)
>* Currently set to 4GB ((1 << 20) 4k pages).
>* Max GPUVM size for cayman and SI is 40 bits.
>*/
>-  amdgpu_vm_adjust_size(adev, 64, 4);
>+  amdgpu_vm_adjust_size(adev, 64, 9);
>   adev->vm_manager.max_pfn = adev->vm_manager.vm_size << 18;
>
>   /* Set the internal MC address mask
>diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>index 3b3326d..114671b 100644
>--- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>@@ -1067,7 +1067,7 @@ static int gmc_v8_0_sw_init(void *handle)
>* Currently set to 4GB ((1 << 20) 4k pages).
>* Max GPUVM size for cayman and SI is 40 bits.
>*/
>-  amdgpu_vm_adjust_size(adev, 64, 4);
>+  amdgpu_vm_adjust_size(adev, 64, 9);
>   adev->vm_manager.max_pfn = adev->vm_manager.vm_size << 18;
>
>   /* Set the internal MC address mask
>--
>2.7.4
>
>___
>amd-gfx mailing list
>amd-gfx@lists.freedesktop.org
>https://lists.freedesktop.org/mailman/listinfo/amd-gfx
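
For anyone wondering how 9 maps to 2MB: the third argument to
amdgpu_vm_adjust_size() is the fragment size expressed as a power-of-two
number of 4 KiB pages (my reading of the parameter, based on the patch
itself), so the old value 4 gave 2^4 * 4 KiB = 64 KiB fragments and the new
value 9 gives 2^9 * 4 KiB = 2 MiB, matching the subject line.
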
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: What is the relationship between amd-gfx and kernel?

2017-09-14 Thread Bridgman, John
amd-gfx is the mailing list for the amdgpu kernel driver, which is part of the 
drm subsystem in the Linux kernel.

It lives in the drivers/gpu/drm/amd/amdgpu portion of the Linux kernel tree.

We also release modified versions of the amdgpu driver in the AMDGPU-PRO and 
ROCm driver stacks.

From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf Of 
maok...@126.com
Sent: Thursday, September 14, 2017 5:41 AM
To: amd-gfx
Subject: What is the relationship between amd-gfx and kernel?

hi:
What is the relationship between amd-gfx and kernel and how is the version 
connected?


maok...@126.com
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 16/19] drm/amdkfd: Update PM4 packet headers

2017-08-12 Thread Bridgman, John
IIRC the amdgpu devs had been holding back on publishing the updated MEC 
microcode (with scratch support) because that WOULD have broken Kaveri. With 
this change from Felix we should be able to publish the newest microcode for 
both amdgpu and amdkfd WITHOUT breaking Kaveri.

IOW this is the "scratch fix for Kaveri KFD" you have wanted for a couple of 
years :)

>-Original Message-
>From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf
>Of Kuehling, Felix
>Sent: Saturday, August 12, 2017 2:16 PM
>To: Oded Gabbay
>Cc: amd-gfx list
>Subject: Re: [PATCH 16/19] drm/amdkfd: Update PM4 packet headers
>
>> Do you mean that it won't work with Kaveri anymore ?
>
>Kaveri got the same firmware changes, mostly for scratch memory support.
>The Kaveri firmware headers name the structures and fields a bit differently
>but they should be binary compatible. So we simplified the code to use only
>one set of headers. I'll grab a Kaveri system to confirm that it works.
>
>Regards,
>  Felix
>
>From: Oded Gabbay 
>Sent: Saturday, August 12, 2017 11:10 AM
>To: Kuehling, Felix
>Cc: amd-gfx list
>Subject: Re: [PATCH 16/19] drm/amdkfd: Update PM4 packet headers
>
>On Sat, Aug 12, 2017 at 12:56 AM, Felix Kuehling 
>wrote:
>> To match current firmware. The map process packet has been extended to
>> support scratch. This is a non-backwards compatible change and it's
>> about two years old. So no point keeping the old version around
>> conditionally.
>
>Do you mean that it won't work with Kaveri anymore ?
>I believe we aren't allowed to break older H/W support without some
>serious justification.
>
>Oded
>
>>
>> Signed-off-by: Felix Kuehling 
>> ---
>>  drivers/gpu/drm/amd/amdkfd/kfd_device.c |   8 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c | 161 
>>  drivers/gpu/drm/amd/amdkfd/kfd_pm4_headers.h    | 314
>>+++-
>>  drivers/gpu/drm/amd/amdkfd/kfd_pm4_headers_vi.h | 130 +-
>>  4 files changed, 199 insertions(+), 414 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> index e1c2ad2..e790e7f 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> @@ -26,7 +26,7 @@
>>  #include 
>>  #include "kfd_priv.h"
>>  #include "kfd_device_queue_manager.h"
>> -#include "kfd_pm4_headers.h"
>> +#include "kfd_pm4_headers_vi.h"
>>
>>  #define MQD_SIZE_ALIGNED 768
>>
>> @@ -238,9 +238,9 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>  * calculate max size of runlist packet.
>>  * There can be only 2 packets at once
>>  */
>> -   size += (KFD_MAX_NUM_OF_PROCESSES * sizeof(struct
>>pm4_map_process) +
>> -   max_num_of_queues_per_device *
>> -   sizeof(struct pm4_map_queues) + sizeof(struct
>>pm4_runlist)) * 2;
>> +   size += (KFD_MAX_NUM_OF_PROCESSES * sizeof(struct
>> +pm4_mes_map_process) +
>> +   max_num_of_queues_per_device * sizeof(struct
>> +pm4_mes_map_queues)
>> +   + sizeof(struct pm4_mes_runlist)) * 2;
>>
>> /* Add size of HIQ & DIQ */
>> size += KFD_KERNEL_QUEUE_SIZE * 2;  diff --git
>>a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
>>b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
>> index 77a6f2b..3141e05 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
>> @@ -26,7 +26,6 @@
>>  #include "kfd_device_queue_manager.h"
>>  #include "kfd_kernel_queue.h"
>>  #include "kfd_priv.h"
>> -#include "kfd_pm4_headers.h"
>>  #include "kfd_pm4_headers_vi.h"
>>  #include "kfd_pm4_opcodes.h"
>>
>> @@ -44,12 +43,12 @@ static unsigned int build_pm4_header(unsigned int
>>opcode, size_t packet_size)
>>  {
>> union PM4_MES_TYPE_3_HEADER header;
>>
>> -   header.u32all = 0;
>> +   header.u32All = 0;
>> header.opcode = opcode;
>> header.count = packet_size/sizeof(uint32_t) - 2;
>> header.type = PM4_TYPE_3;
>>
>> -   return header.u32all;
>> +   return header.u32All;
>>  }
>>
>>  static void pm_calc_rlib_size(struct packet_manager *pm,  @@ -69,12
>>+68,9 @@ static void pm_calc_rlib_size(struct packet_manager *pm,
>> pr_debug("Over subscribed runlist\n");
>> }
>>
>> -   map_queue_size =
>> -   (pm->dqm->dev->device_info->asic_family == CHIP_CARRIZO) ?
>> -   sizeof(struct pm4_mes_map_queues) :
>> -   sizeof(struct pm4_map_queues);
>> +   map_queue_size = sizeof(struct pm4_mes_map_queues);
>> /* calculate run list ib allocation size */
>> -   *rlib_size = process_count * sizeof(struct pm4_map_process) +
>> +   *rlib_size = process_count * sizeof(struct
>> +pm4_mes_map_process) +
>>  queue_count * map_queue_size;
>>
>> /*
>> @@ -82,7 +78,7 @@ static void 

RE: [PATCH 05/12] drm/amdgpu: Send no-retry XNACK for all fault types

2017-07-12 Thread Bridgman, John
Agreed... I thought we had already made this change but if not then... 

Reviewed-by: John Bridgman 

>-Original Message-
>From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf
>Of Felix Kuehling
>Sent: Wednesday, July 12, 2017 1:41 AM
>To: amd-gfx@lists.freedesktop.org
>Subject: Re: [PATCH 05/12] drm/amdgpu: Send no-retry XNACK for all fault
>types
>
>Any comments?
>
>I believe this is a nice stability improvement. In case of VM faults they don't
>take down the whole GPU with an interrupt storm. With KFD we can recover
>without a GPU reset in many cases just by unmapping the offending process'
>queues.
>
>Regards,
>  Felix
>
>
>On 17-07-03 05:11 PM, Felix Kuehling wrote:
>> From: Jay Cornwall 
>>
>> A subset of VM fault types currently send retry XNACK to the client.
>> This causes a storm of interrupts from the VM to the host.
>>
>> Until the storm is throttled by other means send no-retry XNACK for
>> all fault types instead. No change in behavior to the client which
>> will stall indefinitely with the current configuration in any case.
>> Improves system stability under GC or MMHUB faults.
>>
>> Signed-off-by: Jay Cornwall 
>> Reviewed-by: Felix Kuehling 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c | 3 +++
>> drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c  | 3 +++
>>  2 files changed, 6 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
>> b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
>> index a42f483..f957b18 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
>> @@ -206,6 +206,9 @@ static void gfxhub_v1_0_setup_vmid_config(struct
>amdgpu_device *adev)
>>  tmp = REG_SET_FIELD(tmp, VM_CONTEXT1_CNTL,
>>  PAGE_TABLE_BLOCK_SIZE,
>>  adev->vm_manager.block_size - 9);
>> +/* Send no-retry XNACK on fault to suppress VM fault storm.
>*/
>> +tmp = REG_SET_FIELD(tmp, VM_CONTEXT1_CNTL,
>> +
>RETRY_PERMISSION_OR_INVALID_PAGE_FAULT, 0);
>>  WREG32_SOC15_OFFSET(GC, 0, mmVM_CONTEXT1_CNTL, i,
>tmp);
>>  WREG32_SOC15_OFFSET(GC, 0,
>mmVM_CONTEXT1_PAGE_TABLE_START_ADDR_LO32, i*2, 0);
>>  WREG32_SOC15_OFFSET(GC, 0,
>> mmVM_CONTEXT1_PAGE_TABLE_START_ADDR_HI32, i*2, 0); diff --git
>> a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c
>> b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c
>> index 01918dc..b760018 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c
>> @@ -222,6 +222,9 @@ static void mmhub_v1_0_setup_vmid_config(struct
>amdgpu_device *adev)
>>  tmp = REG_SET_FIELD(tmp, VM_CONTEXT1_CNTL,
>>  PAGE_TABLE_BLOCK_SIZE,
>>  adev->vm_manager.block_size - 9);
>> +/* Send no-retry XNACK on fault to suppress VM fault storm.
>*/
>> +tmp = REG_SET_FIELD(tmp, VM_CONTEXT1_CNTL,
>> +
>RETRY_PERMISSION_OR_INVALID_PAGE_FAULT, 0);
>>  WREG32_SOC15_OFFSET(MMHUB, 0,
>mmVM_CONTEXT1_CNTL, i, tmp);
>>  WREG32_SOC15_OFFSET(MMHUB, 0,
>mmVM_CONTEXT1_PAGE_TABLE_START_ADDR_LO32, i*2, 0);
>>  WREG32_SOC15_OFFSET(MMHUB, 0,
>> mmVM_CONTEXT1_PAGE_TABLE_START_ADDR_HI32, i*2, 0);
>
>___
>amd-gfx mailing list
>amd-gfx@lists.freedesktop.org
>https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 05/12] drm/amdgpu: Send no-retry XNACK for all fault types

2017-07-12 Thread Bridgman, John

>-Original Message-
>From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf
>Of Alex Deucher
>Sent: Wednesday, July 12, 2017 11:59 AM
>To: Kuehling, Felix
>Cc: amd-gfx list
>Subject: Re: [PATCH 05/12] drm/amdgpu: Send no-retry XNACK for all fault
>types
>
>On Wed, Jul 12, 2017 at 1:40 AM, Felix Kuehling 
>wrote:
>> Any comments?
>>
>> I believe this is a nice stability improvement. In case of VM faults
>> they don't take down the whole GPU with an interrupt storm. With KFD
>> we can recover without a GPU reset in many cases just by unmapping the
>> offending process' queues.
>
>Will this cause any problems with enabling recoverable page faults later?  If
>not,
>Acked-by: Alex Deucher 

We will need to back this out in order to enable recoverable page faults later, 
but probably still worth doing in the short term IMO.
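
For readers unfamiliar with the helper used in the hunks below: REG_SET_FIELD()
is the usual read-modify-write macro, roughly equivalent to this sketch (the
real definition lives in the amdgpu headers and uses the generated
*_MASK/*__SHIFT constants):

    /* Roughly: clear the field in 'orig', then OR in the new value, shifted. */
    #define REG_SET_FIELD_SKETCH(orig, reg, field, val)                     \
            (((orig) & ~reg##__##field##_MASK) |                            \
             (((val) << reg##__##field##__SHIFT) & reg##__##field##_MASK))

so setting RETRY_PERMISSION_OR_INVALID_PAGE_FAULT to 0 simply clears that
field while leaving the rest of VM_CONTEXT1_CNTL untouched.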

>
>>
>> Regards,
>>   Felix
>>
>>
>> On 17-07-03 05:11 PM, Felix Kuehling wrote:
>>> From: Jay Cornwall 
>>>
>>> A subset of VM fault types currently send retry XNACK to the client.
>>> This causes a storm of interrupts from the VM to the host.
>>>
>>> Until the storm is throttled by other means send no-retry XNACK for
>>> all fault types instead. No change in behavior to the client which
>>> will stall indefinitely with the current configuration in any case.
>>> Improves system stability under GC or MMHUB faults.
>>>
>>> Signed-off-by: Jay Cornwall 
>>> Reviewed-by: Felix Kuehling 
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c | 3 +++
>>> drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c  | 3 +++
>>>  2 files changed, 6 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
>>> b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
>>> index a42f483..f957b18 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
>>> @@ -206,6 +206,9 @@ static void gfxhub_v1_0_setup_vmid_config(struct
>amdgpu_device *adev)
>>>   tmp = REG_SET_FIELD(tmp, VM_CONTEXT1_CNTL,
>>>   PAGE_TABLE_BLOCK_SIZE,
>>>   adev->vm_manager.block_size - 9);
>>> + /* Send no-retry XNACK on fault to suppress VM fault storm. */
>>> + tmp = REG_SET_FIELD(tmp, VM_CONTEXT1_CNTL,
>>> +
>>> + RETRY_PERMISSION_OR_INVALID_PAGE_FAULT, 0);
>>>   WREG32_SOC15_OFFSET(GC, 0, mmVM_CONTEXT1_CNTL, i, tmp);
>>>   WREG32_SOC15_OFFSET(GC, 0,
>mmVM_CONTEXT1_PAGE_TABLE_START_ADDR_LO32, i*2, 0);
>>>   WREG32_SOC15_OFFSET(GC, 0,
>>> mmVM_CONTEXT1_PAGE_TABLE_START_ADDR_HI32, i*2, 0); diff --git
>>> a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c
>>> b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c
>>> index 01918dc..b760018 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c
>>> @@ -222,6 +222,9 @@ static void mmhub_v1_0_setup_vmid_config(struct
>amdgpu_device *adev)
>>>   tmp = REG_SET_FIELD(tmp, VM_CONTEXT1_CNTL,
>>>   PAGE_TABLE_BLOCK_SIZE,
>>>   adev->vm_manager.block_size - 9);
>>> + /* Send no-retry XNACK on fault to suppress VM fault storm. */
>>> + tmp = REG_SET_FIELD(tmp, VM_CONTEXT1_CNTL,
>>> +
>>> + RETRY_PERMISSION_OR_INVALID_PAGE_FAULT, 0);
>>>   WREG32_SOC15_OFFSET(MMHUB, 0, mmVM_CONTEXT1_CNTL, i,
>tmp);
>>>   WREG32_SOC15_OFFSET(MMHUB, 0,
>mmVM_CONTEXT1_PAGE_TABLE_START_ADDR_LO32, i*2, 0);
>>>   WREG32_SOC15_OFFSET(MMHUB, 0,
>>> mmVM_CONTEXT1_PAGE_TABLE_START_ADDR_HI32, i*2, 0);
>>
>> ___
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>___
>amd-gfx mailing list
>amd-gfx@lists.freedesktop.org
>https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control use of ECC/EDC.

2017-06-26 Thread Bridgman, John
Agreed... one person's "best" is another person's "OMG I didn't want that". IMO 
we should have bits correspond to specific options as much as possible, modulo 
HW capabilities.

>-Original Message-
>From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf Of
>Xie, AlexBin
>Sent: Monday, June 26, 2017 2:12 PM
>To: Panariti, David; Deucher, Alexander; amd-gfx@lists.freedesktop.org
>Subject: RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control use
>of ECC/EDC.
>
>Hi,
>
>I have not checked the background of this discussion very closely yet. And you
>might have known the following.
>
>Customers may not want the default setting to change meaning. This is like an
>API.
>Example: The application and its environment is already set up and tested.
>Then if customer updates driver, suddenly driver has some new behavior?
>Certain serious application definitely does not accept this.
>
>IMHO, it is better to avoid vague concepts like "best". It will become rather
>difficult to define what is best when there are multiple customers with
>different requirements. Driver is to provide a feature or mechanism. "Best"
>sounds like a policy or a preference from driver side.
>
>In my past work, I generally use default for two cases:
>1. The default is the most conservative option, which must work. Then the
>application can choose advanced features by choosing other parameter
>value/option.
>2. The default parameter is the compatible behavior before introducing this
>parameter/option.
>
>Regards,
>Alex Bin
>
>
>-Original Message-
>From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf Of
>Panariti, David
>Sent: Monday, June 26, 2017 12:06 PM
>To: Deucher, Alexander ; amd-
>g...@lists.freedesktop.org
>Subject: RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control use
>of ECC/EDC.
>
>>> > I'd suggest setting amdgpu_ecc_flags to AMD_ECC_SUPPORT_BEST by
>>> > default.  That can be our asic specific default setting.  In the
>>> > case of CZ, that will be disabled until we decide to enable EDC by 
>>> > default.
>>> [davep] I'm confused.  ECC...BEST will cause EDC to be enabled.
>>> I used ECC as the generic term for ECC and EDC, since ECC seems more
>>> basic (EDC is built on top of ECC).
>>> If I understand you, we can't do what you want with the current setup.
>
>>I'm saying we make ECC_BEST the default config (feel free to re-name if
>>ECC_DEFAULT).  Each asic can have a different default depending on what
>>features are ready.  So for CZ, we'd make ECC_BEST equivalent to
>>disabling ECC for now.  If a user wants to force it on, they can set ECC_EDC.
>Once EDC is stable on CZ, we can make ECC_BEST be equivalent to ECC_EDC.
>The way the default (ECC_BEST) always maps to the best available
>combination in that version of the driver.
>
>That's not how I meant it to work WRT BEST.
>Each asic will have a DEFAULT, but that isn't what BEST means.
>CZ is a good example (when fully implemented).  DEFAULT for CZ is everything
>except HALT, since, in my opinion, most people do not want to hang or reboot.
>BEST for CZ would be everything a person most interested in reliability would
>want, which IMO, includes HALT/reboot.
>Similar is if something like performance degradation is really bad, DEFAULT
>would be OFF. BEST would be ON, e.g., if the user's app doesn't trigger the
>performance problem.
>The BEST bit is in a fixed position, so that customers don't need to worry what
>bits are needed for the most reliable performance (in our opinion) on a given
>asic.
>And if a customer (or developer) wants some arbitrary set of features, they
>can set bits as they want.
>
>I think DEFAULT will make most people happy.
>BEST allows people who are interested in everything they can get, regardless
>of any issues that brings with it. It is requested simply by using a fixed 
>param
>value (0x01) for any asic.
>This probably should not include features that have any kind of fatal flaw such
>as the Vega10 HBM ECC issue.  When fixed, it can be added to DEFAULT.
>And allowing per-feature control allows anyone to do precisely what they
>want.
>"Effort" increases as the number of interested users decreases.
>
>Using defines in the init code will be a problem if there is more than one kind
>of asic involved or a single asic that the user wants to use with different
>parameters.  However, this doesn't seem to be a high priority.
>If we do want to worry about it, then we'll need to put the values into the
>amdgpu_gfx struct.
>
>regards,
>davep
>
>> -Original Message-
>> From: Deucher, Alexander
>> Sent: Tuesday, June 06, 2017 6:16 PM
>> To: Panariti, David ; amd-
>> g...@lists.freedesktop.org
>> Subject: RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control
>> use of ECC/EDC.
>>
>> > -Original Message-
>> > From: Panariti, David
>> > Sent: Tuesday, June 06, 2017 5:50 PM
>> > To: Deucher, Alexander; amd-gfx@lists.freedesktop.org
>> > 

Re: [pull] radeon drm-fixes-4.11

2017-03-29 Thread Bridgman, John
This is a request for Dave to pull changes from Alex's tree into Dave's 
"drm-fixes" tree, which is the last step before it gets sent to Linus.


Dave is the drm subsystem maintainer, and drm-next / drm-fixes branches are 
where code from multiple GPU driver maintainers comes together. Dave would get 
similar requests from Intel, Nouveau developers etc...



From: amd-gfx  on behalf of Panariti, 
David 
Sent: March 29, 2017 1:53 PM
To: Alex Deucher; amd-gfx@lists.freedesktop.org; 
dri-de...@lists.freedesktop.org; airl...@gmail.com
Cc: Deucher, Alexander
Subject: RE: [pull] radeon drm-fixes-4.11

Hi,

I'm still new to this stuff.
Is this informational or some action items?

thanks,
davep

> -Original Message-
> From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf
> Of Alex Deucher
> Sent: Wednesday, March 29, 2017 12:55 PM
> To: amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org;
> airl...@gmail.com
> Cc: Deucher, Alexander 
> Subject: [pull] radeon drm-fixes-4.11
>
> Hi Dave,
>
> One small fix for radeon.
>
> The following changes since commit
> d64a04720b0e64c1cd0726a3a27b360822fbee22:
>
>   Merge branch 'drm-fixes-4.11' of git://people.freedesktop.org/~agd5f/linux
> into drm-fixes (2017-03-24 11:05:06 +1000)
>
> are available in the git repository at:
>
>   git://people.freedesktop.org/~agd5f/linux drm-fixes-4.11
>
> for you to fetch changes up to
> ce4b4f228e51219b0b79588caf73225b08b5b779:
>
>   drm/radeon: Override fpfn for all VRAM placements in radeon_evict_flags
> (2017-03-27 16:17:30 -0400)
>
> 
> Michel Dänzer (1):
>   drm/radeon: Override fpfn for all VRAM placements in radeon_evict_flags
>
>  drivers/gpu/drm/radeon/radeon_ttm.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: Add support for high priority scheduling in amdgpu

2017-03-01 Thread Bridgman, John
In patch "drm/amdgpu: implement ring set_priority for gfx_v8 compute" can you 
remind me why you are only passing pipe and not queue to vi_srbm_select() ?

+static void gfx_v8_0_ring_set_priority_compute(struct amdgpu_ring *ring,  
+  int priority)  
+{  
+   struct amdgpu_device *adev = ring->adev;  
+  
+   if (ring->hw_ip != AMDGPU_HW_IP_COMPUTE)  
+   return;  
+  
+   mutex_lock(&adev->srbm_mutex);
+   vi_srbm_select(adev, ring->me, ring->pipe, 0, 0);  
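
For reference, the usual SRBM indexing pattern in gfx_v8-era code looks roughly
like the sketch below (register writes elided); the question above is whether
the queue argument should also be passed through here. The helper name is made
up for illustration.

static void sketch_program_queue_regs(struct amdgpu_device *adev,
				      u32 me, u32 pipe, u32 queue, u32 vmid)
{
	mutex_lock(&adev->srbm_mutex);
	vi_srbm_select(adev, me, pipe, queue, vmid);
	/* ... per-queue register writes (e.g. priority) ... */
	vi_srbm_select(adev, 0, 0, 0, 0);	/* restore default selection */
	mutex_unlock(&adev->srbm_mutex);
}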


>-Original Message-
>From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf Of
>Andres Rodriguez
>Sent: Tuesday, February 28, 2017 5:14 PM
>To: amd-gfx@lists.freedesktop.org
>Subject: Add support for high priority scheduling in amdgpu
>
>This patch series introduces a mechanism that allows users with sufficient
>privileges to categorize their work as "high priority". A userspace app can
>create a high priority amdgpu context, where any work submitted to this
>context will receive preferential treatment over any other work.
>
>High priority contexts will be scheduled ahead of other contexts by the sw gpu
>scheduler. This functionality is generic for all HW blocks.
>
>Optionally, a ring can implement a set_priority() function that allows
>programming HW specific features to elevate a ring's priority.
>
>This patch series implements set_priority() for gfx8 compute rings. It takes
>advantage of SPI scheduling and CU reservation to provide improved frame
>latencies for high priority contexts.
>
>For compute + compute scenarios we get near perfect scheduling latency. E.g.
>one high priority ComputeParticles + one low priority ComputeParticles:
>- High priority ComputeParticles: 2.0-2.6 ms/frame
>- Regular ComputeParticles: 35.2-68.5 ms/frame
>
>For compute + gfx scenarios the high priority compute application does
>experience some latency variance. However, the variance has smaller bounds
>and a smaller deviation than without high priority scheduling.
>
>Following is a graph of the frame time experienced by a high priority compute
>app in 4 different scenarios to exemplify the compute + gfx latency variance:
>- ComputeParticles: this scenario involves running the compute particles
>  sample on its own.
>- +SSAO: Previous scenario with the addition of running the ssao sample
>  application that clogs the GFX ring with constant work.
>- +SPI Priority: Previous scenario with the addition of SPI priority
>  programming for compute rings.
>- +CU Reserve: Previous scenario with the addition of dynamic CU
>  reservation for compute rings.
>
>Graph link:
>https://plot.ly/~lostgoat/9/
>
>As seen above, high priority contexts for compute allow us to schedule work
>with enhanced confidence of completion latency under high GPU loads. This
>property will be important for VR reprojection workloads.
>
>Note: The first part of this series is a resend of "Change queue/pipe split
>between amdkfd and amdgpu" with the following changes:
>- Fixed kfdtest on Kaveri due to shift overflow. Refer to: "drm/amdkfd: allow
>  split HQD on per-queue granularity v3"
>- Used Felix's suggestions for a simplified HQD programming sequence
>- Added a workaround for a Tonga HW bug during HQD programming
>
>This series is also available at:
>https://github.com/lostgoat/linux/tree/wip-high-priority
>
>___
>amd-gfx mailing list
>amd-gfx@lists.freedesktop.org
>https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
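
To make the scheduler-level part of the cover letter above concrete, a minimal
sketch of priority-aware run-queue selection might look like the code below.
The names (sched_priority, prio_rq, sched_pick_job) are invented for the
illustration and do not reflect the actual gpu scheduler code.

#include <linux/list.h>

enum sched_priority { PRIO_HIGH, PRIO_NORMAL, PRIO_LOW, NUM_PRIORITIES };

struct prio_rq {
	struct list_head jobs;		/* pending jobs at this priority */
};

struct sched_job {
	struct list_head node;
};

/*
 * Always drain higher-priority run queues first, so jobs from a high
 * priority context are picked ahead of anything queued at normal or low.
 */
static struct sched_job *sched_pick_job(struct prio_rq rq[NUM_PRIORITIES])
{
	int i;

	for (i = PRIO_HIGH; i < NUM_PRIORITIES; i++) {
		if (!list_empty(&rq[i].jobs))
			return list_first_entry(&rq[i].jobs,
						struct sched_job, node);
	}
	return NULL;	/* nothing queued */
}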


RE: Change queue/pipe split between amdkfd and amdgpu

2017-02-15 Thread Bridgman, John
Any objections to authorizing Oded to post the kfdtest binary he is using to 
some public place (if not there already) so others (like Andres) can test 
changes which touch on amdkfd ? 

We should check it for embarrassing symbols but otherwise it should be OK.

That said, since we are getting perilously close to actually sending dGPU 
support changes upstream we will need (IMO) to maintain a sanitized source repo 
for kfdtest as well... sharing the binary just gets us started.

Thanks,
John

>-Original Message-
>From: Oded Gabbay [mailto:oded.gab...@gmail.com]
>Sent: Friday, February 10, 2017 12:57 PM
>To: Andres Rodriguez
>Cc: Kuehling, Felix; Bridgman, John; amd-gfx@lists.freedesktop.org;
>Deucher, Alexander; Jay Cornwall
>Subject: Re: Change queue/pipe split between amdkfd and amdgpu
>
>I don't have a repo, nor do I have the source code.
>It is a tool that we developed inside AMD (when I was working there), and
>after I left AMD I got permission to use the binary for regressions testing.
>
>Oded
>
>On Fri, Feb 10, 2017 at 6:33 PM, Andres Rodriguez <andre...@gmail.com>
>wrote:
>> Hey Oded,
>>
>> Where can I find a repo with kfdtest?
>>
>> I tried looking here but couldn't find it:
>>
>> https://cgit.freedesktop.org/~gabbayo/
>>
>> -Andres
>>
>>
>>
>> On 2017-02-10 05:35 AM, Oded Gabbay wrote:
>>>
>>> So the warning in dmesg is gone of course, but the test (that I
>>> mentioned in previous email) still fails, and this time it caused the
>>> kernel to crash. In addition, now other tests fail as well, e.g.
>>> KFDEventTest.SignalEvent
>>>
>>> I honestly suggest to take some time to debug this patch-set on an
>>> actual Kaveri machine and then re-send the patches.
>>>
>>> Thanks,
>>> Oded
>>>
>>> log of crash from KFDQMTest.CreateMultipleCpQueues:
>>>
>>> [  160.900137] kfd: qcm fence wait loop timeout expired
>>> [  160.900143] kfd: the cp might be in an unrecoverable state due to an unsuccessful queues preemption
>>> [  160.916765] show_signal_msg: 36 callbacks suppressed
>>> [  160.916771] kfdtest[2498]: segfault at 17f8a ip 7f8ae932ee5d sp 7ffc52219cd0 error 4 in libhsakmt-1.so.0.0.1[7f8ae932b000+8000]
>>> [  163.152229] kfd: qcm fence wait loop timeout expired
>>> [  163.152250] BUG: unable to handle kernel NULL pointer dereference at 005a
>>> [  163.152299] IP: kfd_get_process_device_data+0x6/0x30 [amdkfd]
>>> [  163.152323] PGD 2333aa067
>>> [  163.152323] PUD 230f64067
>>> [  163.152335] PMD 0
>>>
>>> [  163.152364] Oops:  [#1] SMP
>>> [  163.152379] Modules linked in: joydev edac_mce_amd edac_core input_leds kvm_amd snd_hda_codec_realtek kvm irqbypass snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_core snd_hwdep pcbc snd_pcm aesni_intel snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq aes_x86_64 crypto_simd snd_seq_device glue_helper cryptd snd_timer snd fam15h_power k10temp soundcore i2c_piix4 shpchp tpm_infineon mac_hid parport_pc ppdev nfsd auth_rpcgss nfs_acl lockd lp grace sunrpc parport autofs4 hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid uas usb_storage amdkfd amd_iommu_v2 radeon i2c_algo_bit ttm drm_kms_helper syscopyarea ahci sysfillrect sysimgblt libahci fb_sys_fops drm r8169 mii fjes video
>>> [  163.152668] CPU: 3 PID: 2498 Comm: kfdtest Not tainted 4.10.0-rc5+ #3
>>> [  163.152695] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A88XM-D3H, BIOS F5 01/09/2014
>>> [  163.152735] task: 995e73d16580 task.stack: b41144458000
>>> [  163.152764] RIP: 0010:kfd_get_process_device_data+0x6/0x30 [amdkfd]
>>> [  163.152790] RSP: 0018:b4114445bab0 EFLAGS: 00010246
>>> [  163.152812] RAX: ffea RBX: 995e75909c00 RCX:
>>> [  163.152841] RDX:  RSI: ffea RDI: 995e75909600
>>> [  163.152869] RBP: b4114445bae0 R08: 000252a5 R09: 0414
>>> [  163.152898] R10:  R11: b412d38d R12: ffc2
>>> [  163.152926] R13:  R14: 995e75909ca8 R15: 995e75909c00
>>> [  163.152956] FS:  7f8ae975e740() GS:995e7ed8() knlGS:
>>> [  163.152988] CS:

Re: [RFC] Mechanism for high priority scheduling in amdgpu

2016-12-29 Thread Bridgman, John
One question I just remembered - the amdgpu driver includes some scheduler 
logic which maintains per-process queues and therefore avoids loading up the 
primary ring with a ton of work.


Has there been any experimentation with injecting priorities at that level 
rather than jumping straight to HW-level changes ?


From: amd-gfx  on behalf of Andres 
Rodriguez 
Sent: December 23, 2016 11:13 AM
To: Koenig, Christian
Cc: Zhou, David(ChunMing); Huan, Alvin; Mao, David; Sagalovitch, Serguei; 
amd-gfx@lists.freedesktop.org; Andres Rodriguez; Pierre-Loup A. Griffais; 
Zhang, Hawking
Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu

Hey Christian,

But yes, in general you don't want another compositor in the way, so we'll be 
acquiring the HMD display directly, separate from any desktop or display server.

Assuming that the HMD is attached to the rendering device in some way you 
have the X server and the Compositor which both try to be DRM master at the 
same time.

Please correct me if that was fixed in the meantime, but that sounds like it 
will simply not work. Or is this what Andres mentioned below that Dave is working on?

You are correct on both statements. We can't have two DRM_MASTERs, so the 
current DRM+X does not support this use case. And this what Dave and 
Pierre-Loup are currently working on.

Additional to that a compositor in combination with X is a bit counter 
productive when you want to keep the latency low.

One thing I'd like to correct is that our main goal is to get latency 
_predictable_, secondary goal is to make it low.

The high priority queue feature addresses our main source of unpredictability: 
the scheduling latency when the hardware is already full of work from the game 
engine.

The DirectMode feature addresses one of the latency sources: multiple 
(unnecessary) context switches to submit a surface to the DRM driver.

Targeting something like Wayland and when you need X compatibility XWayland 
sounds like the much better idea.

We are pretty enthusiastic about Wayland (and really glad to see Fedora 25 use 
Wayland by default). Once we have everything working nicely under X (where most 
of the users are currently), I'm sure Pierre-Loup will be pushing us to get 
everything optimized under Wayland as well (which should be a lot simpler!).

Ever since working with SurfaceFlinger on Android with explicit fencing I've 
been waiting for the day I can finally ditch X altogether :)

Regards,
Andres


On Fri, Dec 23, 2016 at 5:54 AM, Christian König 
> wrote:
But yes, in general you don't want another compositor in the way, so we'll be 
acquiring the HMD display directly, separate from any desktop or display server.
Assuming that the HMD is attached to the rendering device in some way you 
have the X server and the Compositor which both try to be DRM master at the 
same time.

Please correct me if that was fixed in the meantime, but that sounds like it 
will simply not work. Or is this what Andres mentioned below that Dave is working on?

Additional to that a compositor in combination with X is a bit counter 
productive when you want to keep the latency low.

E.g. the "normal" flow of a GL or Vulkan surface filled with rendered data to 
be displayed is from the Application -> X server -> compositor -> X server.

The extra step between X server and compositor just means extra latency and for 
this use case you probably don't want that.

Targeting something like Wayland and when you need X compatibility XWayland 
sounds like the much better idea.

Regards,
Christian.


Am 22.12.2016 um 20:54 schrieb Pierre-Loup A. Griffais:
Display concerns are a separate issue, and as Andres said we have other plans 
to address. But yes, in general you don't want another compositor in the way, 
so we'll be acquiring the HMD display directly, separate from any desktop or 
display server. Same with security, we can have a separate conversation about 
that when the time comes.

On 12/22/2016 08:41 AM, Serguei Sagalovitch wrote:
Andres,

Did you measure  latency, etc. impact of __any__ compositor?

My understanding is that VR has pretty strict requirements related to QoS.

Sincerely yours,
Serguei Sagalovitch


On 2016-12-22 11:35 AM, Andres Rodriguez wrote:
Hey Christian,

We are currently interested in X, but with some distros switching to
other compositors by default, we also need to consider those.

We agree, running the full vrcompositor in root isn't something that
we want to do. Too many security concerns. Having a small root helper
that does the privilege escalation for us is the initial idea.

For a long term approach, Pierre-Loup and Dave are working on dealing
with the "two compositors" scenario a little better in DRM+X.
Fullscreen isn't really a sufficient approach, since we don't want the
HMD to be used as part of the Desktop 

Re: [RFC] Mechanism for high priority scheduling in amdgpu

2016-12-29 Thread Bridgman, John
Excellent, thanks. Agree that it is not a complete solution, just a good start.


I do think we will need to get to setting priorities at HW level fairly quickly 
(we want it for ROCM as well as for VR) but we'll need to eliminate the current 
requirement for randomization at SQ as part of a HW approach and I don't think 
we know how long that will take at the moment.


IIRC randomization was required to avoid deadlock problems with certain OpenCL 
programs - what I don't know is whether the problem is inherent to the OpenCL 
API spec or just a function of how specific OpenCL programs were written. I'll 
try to dig up some history for that and ask around internally as well.

From: Andres Rodriguez <andre...@gmail.com>
Sent: December 23, 2016 11:30 AM
To: Bridgman, John
Cc: Koenig, Christian; Zhou, David(ChunMing); Huan, Alvin; Mao, David; 
Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org; Andres Rodriguez; 
Pierre-Loup A. Griffais; Zhang, Hawking
Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu

I'm actually testing that out that today.

Example prototype patch:
https://github.com/lostgoat/linux/commit/c9d88d409d8655d63aa8386edc66687a2ba64a12

My goal is to first implement this approach, then slowly work my way towards 
the HW level optimizations.

The problem I expect to see with this approach is that there will still be 
unpredictably long latencies depending on what has been committed to the HW 
rings.

But it is definitely a good start.

Regards,
Andres

On Fri, Dec 23, 2016 at 11:20 AM, Bridgman, John 
<john.bridg...@amd.com<mailto:john.bridg...@amd.com>> wrote:

One question I just remembered - the amdgpu driver includes some scheduler 
logic which maintains per-process queues and therefore avoids loading up the 
primary ring with a ton of work.


Has there been any experimentation with injecting priorities at that level 
rather than jumping straight to HW-level changes ?


From: amd-gfx 
<amd-gfx-boun...@lists.freedesktop.org<mailto:amd-gfx-boun...@lists.freedesktop.org>>
 on behalf of Andres Rodriguez <andre...@gmail.com<mailto:andre...@gmail.com>>
Sent: December 23, 2016 11:13 AM
To: Koenig, Christian
Cc: Zhou, David(ChunMing); Huan, Alvin; Mao, David; Sagalovitch, Serguei; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>; Andres 
Rodriguez; Pierre-Loup A. Griffais; Zhang, Hawking

Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu

Hey Christian,

But yes, in general you don't want another compositor in the way, so we'll be 
acquiring the HMD display directly, separate from any desktop or display server.

Assuming that the HMD is attached to the rendering device in some way you 
have the X server and the Compositor which both try to be DRM master at the 
same time.

Please correct me if that was fixed in the meantime, but that sounds like it 
will simply not work. Or is this what Andres mentioned below that Dave is working on?

You are correct on both statements. We can't have two DRM_MASTERs, so the 
current DRM+X does not support this use case. And this what Dave and 
Pierre-Loup are currently working on.

Additional to that a compositor in combination with X is a bit counter 
productive when you want to keep the latency low.

One thing I'd like to correct is that our main goal is to get latency 
_predictable_, secondary goal is to make it low.

The high priority queue feature addresses our main source of unpredictability: 
the scheduling latency when the hardware is already full of work from the game 
engine.

The DirectMode feature addresses one of the latency sources: multiple 
(unnecessary) context switches to submit a surface to the DRM driver.

Targeting something like Wayland and when you need X compatibility XWayland 
sounds like the much better idea.

We are pretty enthusiastic about Wayland (and really glad to see Fedora 25 use 
Wayland by default). Once we have everything working nicely under X (where most 
of the users are currently), I'm sure Pierre-Loup will be pushing us to get 
everything optimized under Wayland as well (which should be a lot simpler!).

Ever since working with SurfaceFlinger on Android with explicit fencing I've 
been waiting for the day I can finally ditch X altogether :)

Regards,
Andres


On Fri, Dec 23, 2016 at 5:54 AM, Christian König 
<christian.koe...@amd.com<mailto:christian.koe...@amd.com>> wrote:
But yes, in general you don't want another compositor in the way, so we'll be 
acquiring the HMD display directly, separate from any desktop or display server.
Assuming that the HMD is attached to the rendering device in some way you 
have the X server and the Compositor which both try to be DRM master at the 
same time.

Please correct me if that was fixed in the meantime, but that sounds like it 
will simply not work. Or is this what Andres mentioned below that Dave is working on?

Additional to that a compos

Re: [RFC] Using DC in amdgpu for upcoming GPU

2016-12-13 Thread Bridgman, John
>>If the Linux community contributes to DC, I guess those contributions
can generally be assumed to be GPLv2 licensed.  Yet a future version
of the macOS driver would incorporate those contributions in the same
binary as their closed source OS-specific portion.


My understanding of the "general rule" was that contributions are normally 
assumed to be made under the "local license", i.e. GPLv2 for kernel changes in 
general, but the appropriate lower-level license when made to a specific 
subsystem with a more permissive license (e.g. the X11 license aka MIT aka "GPL 
plus additional rights" license we use for almost all of the graphics 
subsystem). If DC is not X11 licensed today it should be (but I'm pretty sure it 
already is).


We need to keep the graphics subsystem permissively licensed in general to 
allow uptake by other free OS projects such as *BSD, not just closed source.


Either way, driver-level maintainers are going to have to make sure that 
contributions have clear licensing.


Thanks,

John


From: dri-devel  on behalf of Lukas 
Wunner 
Sent: December 13, 2016 4:40 AM
To: Cheng, Tony
Cc: Grodzovsky, Andrey; dri-devel; amd-gfx mailing list; Deucher, Alexander
Subject: Re: [RFC] Using DC in amdgpu for upcoming GPU

On Mon, Dec 12, 2016 at 09:52:08PM -0500, Cheng, Tony wrote:
> With DC the display hardware programming, resource optimization, power
> management and interaction with rest of system will be fully validated
> across multiple OSs.

Do I understand DAL3.jpg correctly that the macOS driver builds on top
of DAL Core?  I'm asking because the graphics drivers shipping with
macOS as well as on Apple's EFI Firmware Volume are closed source.

If the Linux community contributes to DC, I guess those contributions
can generally be assumed to be GPLv2 licensed.  Yet a future version
of the macOS driver would incorporate those contributions in the same
binary as their closed source OS-specific portion.

I don't quite see how that would be legal but maybe I'm missing
something.

Presumably the situation with the Windows driver is the same.

I guess you could maintain a separate branch sans community contributions
which would serve as a basis for closed source drivers, but not sure if
that is feasible given your resource constraints.

Thanks,

Lukas
___
dri-devel mailing list
dri-de...@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [RFC] Using DC in amdgpu for upcoming GPU

2016-12-11 Thread Bridgman, John
Yep, good point. We have tended to stay a bit behind bleeding edge because our 
primary tasks so far have been:


1. Support enterprise distros (with old kernels) via the hybrid driver 
(AMDGPU-PRO), where the closer to upstream we get the more of a gap we have to 
paper over with KCL code


2. Push architecturally simple code (new GPU support) upstream, where being 
closer to upstream makes the up-streaming task simpler but not by that much


So 4.7 isn't as bad a compromise as it might seem.


That said, in the case of DAL/DC it's a different story as you say... 
architecturally complex code needing to be woven into a fast-moving subsystem 
of the kernel. So for DAL/DC anything other than upstream is going to be a big 
pain.


OK, need to think that through.


Thanks !


From: dri-devel  on behalf of Daniel 
Vetter 
Sent: December 12, 2016 2:22 AM
To: Wentland, Harry
Cc: Grodzovsky, Andrey; amd-gfx@lists.freedesktop.org; 
dri-de...@lists.freedesktop.org; Deucher, Alexander; Cheng, Tony
Subject: Re: [RFC] Using DC in amdgpu for upcoming GPU

On Wed, Dec 07, 2016 at 09:02:13PM -0500, Harry Wentland wrote:
> Current version of DC:
>
>  * 
> https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/display?h=amd-staging-4.7
>
> Once Alex pulls in the latest patches:
>
>  * 
> https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/display?h=amd-staging-4.7

One more: That 4.7 here is going to be unbelievable amounts of pain for
you. Yes it's a totally sensible idea to just freeze your baseline kernel
because then linux looks a lot more like Windows where the driver abi is
frozen. But it makes following upstream entirely impossible, because
rebasing is always a pain and hence postponed. Which means you can't just
use the latest stuff in upstream drm, which means collaboration with
others and sharing bugfixes in core is a lot more pain, which then means
you do more than necessary in your own code and results in HALs like DAL,
perpetuating the entire mess.

So I think you don't just need to demidlayer DAL/DC, you also need to
demidlayer your development process. In our experience here at Intel that
needs continuous integration testing (in drm-tip), because even 1 month of
not resyncing with drm-next is sometimes way too long. See e.g. the
controlD regression we just had. And DAL is stuck on a 1 year old kernel,
so pretty much only of historical significance and otherwise dead code.

And then for any stuff which isn't upstream yet (like your internal
enabling, or DAL here, or our own internal enabling) you need continuous
rebasing When we started doing this years ago it was still
manually, but we still rebased like every few days to keep the pain down
and adjust continuously to upstream evolution. But then going to a
continous rebase bot that sends you mail when something goes wrong was
again a massive improvement.

I guess in the end Conway's law that your software architecture
necessarily reflects how you organize your teams applies again. Fix your
process and it'll become glaringly obvious to everyone involved that
DC-the-design as-is is entirely unworkeable and how it needs to be fixed.

From my own experience over the past few years: Doing that is a fun
journey ;-)

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
dri-devel mailing list
dri-de...@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [RFC] Using DC in amdgpu for upcoming GPU

2016-12-11 Thread Bridgman, John
Thanks Dave. Apologies in advance for top posting but I'm stuck on a mail 
client that makes a big mess when I try...


>If DC was ready for the next-gen GPU it would be ready for the current
>GPU, it's not the specific ASIC code that is the problem, it's the
>huge midlayer sitting in the middle.


We realize that (a) we are getting into the high-risk-of-breakage part of the 
rework and (b) no matter how much we change the code structure there's a good 
chance that a month after it goes upstream one of us is going to find that more 
structural changes are required.


I was kinda thinking that if we are doing high-risk activities (risk of subtle 
breakage rather than obvious regression, and/or risk of making structural changes that 
turn out to be a bad idea even though we all thought they were correct last 
week) there's an argument for doing it in code which only supports cards that 
people can't buy yet.


From: Dave Airlie <airl...@gmail.com>
Sent: December 11, 2016 9:57 PM
To: Wentland, Harry
Cc: dri-devel; amd-gfx mailing list; Bridgman, John; Deucher, Alexander; 
Lazare, Jordan; Cheng, Tony; Cyr, Aric; Grodzovsky, Andrey
Subject: Re: [RFC] Using DC in amdgpu for upcoming GPU

On 8 December 2016 at 12:02, Harry Wentland <harry.wentl...@amd.com> wrote:
> We propose to use the Display Core (DC) driver for display support on
> AMD's upcoming GPU (referred to by uGPU in the rest of the doc). In order to
> avoid a flag day the plan is to only support uGPU initially and transition
> to older ASICs gradually.

[FAQ: from past few days]

1) Hey you replied to Daniel, you never addressed the points of the RFC!
I've read it being said that I hadn't addressed the RFC, and you know
I've realised I actually had, because the RFC is great but it
presupposes the codebase as designed can get upstream eventually, and
I don't think it can. The code is too littered with midlayering and
other problems, that actually addressing the individual points of the
RFC would be missing the main point I'm trying to make.

This code needs rewriting, not cleaning, not polishing, it needs to be
split into its constituent parts, and reintegrated in a form more
Linux process friendly.

I feel that if I reply to the individual points Harry has raised in
this RFC, that it means the code would then be suitable for merging,
which it still won't, and I don't want people wasting another 6
months.

If DC was ready for the next-gen GPU it would be ready for the current
GPU, it's not the specific ASIC code that is the problem, it's the
huge midlayer sitting in the middle.

2) We really need to share all of this code between OSes, why does
Linux not want it?

Sharing code is a laudable goal and I appreciate the resourcing
constraints that led us to the point at which we find ourselves, but
the way forward involves finding resources to upstream this code,
dedicated people (even one person) who can spend time on a day by day
basis talking to people in the open and working upstream, improving
other pieces of the drm as they go, reading atomic patches and
reviewing them, and can incrementally build the DC experience on top
of the Linux kernel infrastructure. Then having the corresponding
changes in the DC codebase happen internally to correspond to how the
kernel code ends up looking. Lots of this code overlaps with stuff the
drm already does, lots of is stuff the drm should be doing, so patches
to the drm should be sent instead.

3) Then how do we upstream it?
Resource(s) need(s) to start concentrating at splitting this thing up
and using portions of it in the upstream kernel. We don't land fully
formed code in the kernel if we can avoid it. Because you can't review
the ideas and structure as easy as when someone builds up code in
chunks and actually develops in the Linux kernel. This has always
produced better more maintainable code. Maybe the result will end up
improving the AMD codebase as well.

4) Why can't we put this in staging?
People have also mentioned staging, Daniel has called it a dead end,
I'd have considered staging for this code base, and I still might.
However staging has rules, and the main one is code in staging needs a
TODO list, and agreed criteria for exiting staging, I don't think we'd
be able to get an agreement on what the TODO list should contain and
how we'd ever get all things on it done. If this code ended up in
staging, it would most likely require someone dedicated to recreating
it in the mainline driver in an incremental fashion, and I don't see
that resource being available.

5) Why is a midlayer bad?
I'm not going to go into specifics on the DC midlayer, but we abhor
midlayers for a fair few reasons. The main reason I find causes the
most issues is locking. When you have breaks in code flow between
multiple layers, with layers calling back into previous layers, it
becomes near impossible to track who owns the locking and what the
current locking state is.
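
A contrived sketch of that locking problem: the core takes a lock and calls
into a midlayer, which calls back up into the core, so neither side can say
for sure whether the lock is held at any given point. All names here are
invented for the illustration.

#include <linux/mutex.h>

static DEFINE_MUTEX(core_lock);

static void core_update_state(void);

/* Midlayer "helper" that calls back up into the layer that called it. */
static void midlayer_program_hw(void)
{
	/* ... hardware programming ... */
	core_update_state();		/* re-enters core code */
}

static void core_update_state(void)
{
	/*
	 * Does this need core_lock?  If reached via core_commit() the lock
	 * is already held; if reached from another path it is not.  That
	 * ownership question is exactly what becomes near impossible to
	 * track once it crosses the layer boundary.
	 */
}

static void core_commit(void)
{
	mutex_lock(&core_lock);
	midlayer_program_hw();		/* lock held across the midlayer */
	mutex_unlock(&core_lock);
}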

RE: AMD and free and open source software

2016-08-31 Thread Bridgman, John
Right... the microcode is part of the HW design; some vendors build the 
microcode images into the chip, while others have the BIOS or driver load them 
at start-up. 

The industry is generally moving to driver-loaded microcode, but I don't 
believe any vendor is planning to start opening up their hardware designs.

Thanks,
John

>-Original Message-
>From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf
>Of Huang Rui
>Sent: Wednesday, August 31, 2016 10:15 PM
>To: Frederique
>Cc: amd-gfx@lists.freedesktop.org
>Subject: Re: AMD and free and open source software
>
>We don't have the plan to open up firmware source.
>
>Thanks,
>Rui
>
>On Thu, Sep 01, 2016 at 04:16:59AM +0800, Frederique wrote:
>> Dear Huang Rui,
>>
>> I recently swapped my NVIDIA Geforce 980 Ti for an AMD R9 Fury because
>> of the devoted efforts that are being made towards a free and open
>> source software driver.
>>
>> I will be sticking with AMD for as long as this effort continues and
>> extends.
>>
>> I have one question however. I use Debian, and right now I am only one
>> non-free package away from being free, this is the AMD Graphics
>> Firmware package. Will AMD make an effort to open up the firmware bits
>> too? If not, is there any particular reason why this is being held back?
>>
>> Thank you for your time.
>>
>> Sincerely yours,
>> Frederique
>___
>amd-gfx mailing list
>amd-gfx@lists.freedesktop.org
>https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: AMD and free and open source software

2016-08-31 Thread Bridgman, John
One suggestion that was made recently was for AMD to supply packages containing 
only HW microcode images for AMD GPUs... you would still be using a non-free 
package but you would not have to enable the entire non-free Debian package 
that contains both real "firmware" and HW microcode for a number of different 
vendors. 

Some people felt that would be a big help while others felt it would not make 
any difference - how do you feel about that ?

>-Original Message-
>From: Bridgman, John
>Sent: Wednesday, August 31, 2016 10:22 PM
>To: Huang, Ray; Frederique
>Cc: amd-gfx@lists.freedesktop.org
>Subject: RE: AMD and free and open source software
>
>Right... the microcode is part of the HW design; some vendors build the
>microcode images into the chip, while others have the BIOS or driver load
>them at start-up.
>
>The industry is generally moving to driver-loaded microcode, but I don't
>believe any vendor is planning to start opening up their hardware designs.
>
>Thanks,
>John
>
>>-Original Message-
>>From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf
>>Of Huang Rui
>>Sent: Wednesday, August 31, 2016 10:15 PM
>>To: Frederique
>>Cc: amd-gfx@lists.freedesktop.org
>>Subject: Re: AMD and free and open source software
>>
>>We don't have the plan to open up firmware source.
>>
>>Thanks,
>>Rui
>>
>>On Thu, Sep 01, 2016 at 04:16:59AM +0800, Frederique wrote:
>>> Dear Huang Rui,
>>>
>>> I recently swapped my NVIDIA Geforce 980 Ti for an AMD R9 Fury
>>> because of the devoted efforts that are being made towards a free and
>>> open source software driver.
>>>
>>> I will be sticking with AMD for as long as this effort continues and
>>> extends.
>>>
>>> I have one question however. I use Debian, and right now I am only
>>> one non-free package away from being free, this is the AMD Graphics
>>> Firmware package. Will AMD make an effort to open up the firmware
>>> bits too? If not, is there any particular reason why this is being held 
>>> back?
>>>
>>> Thank you for your time.
>>>
>>> Sincerely yours,
>>> Frederique
>>___
>>amd-gfx mailing list
>>amd-gfx@lists.freedesktop.org
>>https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx