Same bridge number?

2018-11-07 Thread Panariti, David

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.

2018-04-24 Thread Panariti, David
> Kind of dma_fence_wait_killable, except that we don't have such API
> (maybe worth adding ?)
Depends on how many places it would be called, or think it might be called.  
Can always factor on the 2nd time it's needed.
Factoring, IMO, rarely hurts.  The factored function can easily be visited 
using `M-.' ;->

Also, if the wait could be very long, would a log message, something like "xxx 
has run for Y seconds."  help?
I personally hate hanging w/no info.

regards,
davep


From: Grodzovsky, Andrey
Sent: Tuesday, April 24, 2018 11:58:19 AM
To: Panariti, David; linux-ker...@vger.kernel.org; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander; Koenig, Christian; o...@redhat.com; 
a...@linux-foundation.org; ebied...@xmission.com
Subject: Re: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from 
ring hang.



On 04/24/2018 11:52 AM, Panariti, David wrote:
> Hi,
>
> It looks like there can be an infinite loop if neither of the if()'s become 
> true.
> Is that an impossible condition?

That intended, we want to wait until either the fence signals or fatal
signal received, we don't want to terminate the wait if fence is not
signaled  even  when interrupted by non fatal signal.
Kind of dma_fence_wait_killable, except that we don't have such API
(maybe worth adding ?)

Andrey

>
> -Original Message-
> From: Andrey Grodzovsky <andrey.grodzov...@amd.com>
> Sent: Tuesday, April 24, 2018 11:31 AM
> To: linux-ker...@vger.kernel.org; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <alexander.deuc...@amd.com>; Koenig, Christian 
> <christian.koe...@amd.com>; Panariti, David <david.panar...@amd.com>; 
> o...@redhat.com; a...@linux-foundation.org; ebied...@xmission.com; 
> Grodzovsky, Andrey <andrey.grodzov...@amd.com>
> Subject: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from 
> ring hang.
>
> If the ring is hanging for some reason allow to recover the waiting by 
> sending fatal signal.
>
> Originally-by: David Panariti <david.panar...@amd.com>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 14 ++
>   1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index eb80edf..37a36af 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -421,10 +421,16 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx *ctx, 
> unsigned ring_id)
>
>   if (other) {
>   signed long r;
> - r = dma_fence_wait_timeout(other, false, MAX_SCHEDULE_TIMEOUT);
> - if (r < 0) {
> - DRM_ERROR("Error (%ld) waiting for fence!\n", r);
> - return r;
> +
> + while (true) {
> + if ((r = dma_fence_wait_timeout(other, true,
> + MAX_SCHEDULE_TIMEOUT)) >= 0)
> + return 0;
> +
> + if (fatal_signal_pending(current)) {
> + DRM_ERROR("Error (%ld) waiting for fence!\n", 
> r);
> + return r;
> + }
>   }
>   }
>
> --
> 2.7.4
>

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.

2018-04-24 Thread Panariti, David
Hi,

It looks like there can be an infinite loop if neither of the if()'s become 
true.
Is that an impossible condition?

-Original Message-
From: Andrey Grodzovsky <andrey.grodzov...@amd.com> 
Sent: Tuesday, April 24, 2018 11:31 AM
To: linux-ker...@vger.kernel.org; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <alexander.deuc...@amd.com>; Koenig, Christian 
<christian.koe...@amd.com>; Panariti, David <david.panar...@amd.com>; 
o...@redhat.com; a...@linux-foundation.org; ebied...@xmission.com; Grodzovsky, 
Andrey <andrey.grodzov...@amd.com>
Subject: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from 
ring hang.

If the ring is hanging for some reason allow to recover the waiting by sending 
fatal signal.

Originally-by: David Panariti <david.panar...@amd.com>
Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index eb80edf..37a36af 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -421,10 +421,16 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx *ctx, 
unsigned ring_id)
 
if (other) {
signed long r;
-   r = dma_fence_wait_timeout(other, false, MAX_SCHEDULE_TIMEOUT);
-   if (r < 0) {
-   DRM_ERROR("Error (%ld) waiting for fence!\n", r);
-   return r;
+
+   while (true) {
+   if ((r = dma_fence_wait_timeout(other, true,
+   MAX_SCHEDULE_TIMEOUT)) >= 0)
+   return 0;
+
+   if (fatal_signal_pending(current)) {
+   DRM_ERROR("Error (%ld) waiting for fence!\n", 
r);
+   return r;
+   }
}
}
 
--
2.7.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Documentation about AMD's HSA implementation?

2018-02-13 Thread Panariti, David
+ Dave Roberts

Do you still have links to the HSA doc collected during NMI?


From: amd-gfx  on behalf of Felix 
Kuehling 
Sent: Tuesday, February 13, 2018 2:56:47 PM
To: amd-gfx@lists.freedesktop.org
Subject: Re: Documentation about AMD's HSA implementation?

There is also this: https://gpuopen.com/professional-compute/, which
give pointer to several libraries and tools that built on top of ROCm.


Another thing to keep in mind is, that ROCm is diverging from the strict
HSA standard in some important ways. For example the HSA standard
includes HSAIL as an intermediate representation that gets finalized on
the target system, whereas ROCm compiles directly to native GPU ISA.


Regards,
  Felix


On 2018-02-13 09:40 AM, Deucher, Alexander wrote:
>
> The ROCm documentation is probably a good place to start:
>
> https://rocm.github.io/documentation.html
>
>
> Alex
>
> 
> *From:* amd-gfx  on behalf of
> Ming Yang 
> *Sent:* Tuesday, February 13, 2018 12:00 AM
> *To:* amd-gfx@lists.freedesktop.org
> *Subject:* Documentation about AMD's HSA implementation?
>
> Hi,
>
> I'm interested in HSA and excited when I found AMD's fully open-stack
> ROCm supporting it. Before digging into the code, I wonder if there's
> any documentation available about AMD's HSA implementation, either
> book, whitepaper, paper, or documentation.
>
> I did find helpful materials about HSA, including HSA standards on
> this page (http://www.hsafoundation.com/standards/) and a nice book
> about HSA (Heterogeneous System Architecture A New Compute Platform
> Infrastructure). But regarding the documentation about AMD's
> implementation, I haven't found anything yet.
>
> Please let me know if there are ones publicly accessible. If no, any
> suggestions on learning the implementation of specific system
> components, e.g., queue scheduling.
>
> Best,
> Mark
>
>
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control use of ECC/EDC.

2017-06-26 Thread Panariti, David


> -Original Message-
> From: Bridgman, John
> Sent: Monday, June 26, 2017 3:12 PM
> To: Xie, AlexBin <alexbin@amd.com>; Panariti, David
> <david.panar...@amd.com>; Deucher, Alexander
> <alexander.deuc...@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control use
> of ECC/EDC.
> 
> Agreed... one person's "best" is another person's "OMG I didn't want that".
[davep] However, I've questioned the sanity of many default choices.
> IMO we should have bits correspond to specific options as much as possible,
> modulo HW capabilities.
[davep] Yes, we can all agree that the word BEST, wasn't.
You can throw tomatoes at me when we're all in Markham.
But it's specifically the "modulo HW capabilities" that makes me want to use a 
single value to specify BEST (ADVANCED, RELIABLEST, SAFEST, AAGGRESSIVE, 
PURPLE, COMPUTE, PROFESSIONAL, etc.) 
Please, suggestions for better than BEST.

To summarize, I propose:

DEFAULT: User need do nothing, no parameter.  As AlexBin suggested, this can be 
a conservative choice of options. Options defined per asic.
ADVANCED: More features than DEFAULT, but nothing known to break.  Things we're 
sure not everyone would want. Selected by a unique value e.g. (1 << 3).  
Options defined per asic.
ALL: All compatible features, mod mutually exclusive or otherwise incompatible 
to the HW.  Could be specified as 0x.
BITMASK: The ultimate catchall.  I'd even say no masking out of any bits, 
caveat usor.  Don't forget, we're users of this interface and who knows what 
will be useful for dev or testing.
NONE: ras_param=0

DEFAULT, ADVANCED, ALL and NONE could go into a 2 bit field since they're 
mutually exclusive.

Examples:
CZ and eCZ:
DEFAULT: everything except PROP_FED (halt ip/reboot)
ADVANCED: DEFAULT + PROP_FED.
ALL: Same as ADVANCED.
BITMASK: PROP_FED, no counters (who knows why, but we can do it)

Vega10:
DEFAULT: No ECC
ADVANCED: No ECC.
ALL: ECC
BITMASK: ECC, don't count on your results.

To me, the benefit of ADVANCED and ALL is that I, as a user, would want them to 
change with new drivers.  I want that logical behavior and don't want to check 
ChangeLogs to see what new bits I need.
I think, for example, an HPC customer would want to use ADVANCED because that 
is what we think is the most reliable. 
Basically they're just macros so in the most common cases, users don't need to 
worry about the bits.

Providing a bitmask argument allows anything to be overridden, and as an 
advanced user (such as a developer), system evaluator, etc, absolute 
flexibility is essential.

From AlexD earlier (I may have redundancies):
> BEST could be defined as
> 0x since each asic would only look at the masked bits for the 
> features it
> supports. 
[davep] This could cause problems if there are mutually exclusive or otherwise 
incompatible bits.  Then there would need to be code choosing one over the 
other which is what we'd need to do to define AGGRESSIVE anyway.
Also I've seen fields where 1 enables a feature and others where 1 disables a 
feature.
This is why I chose a single bit.  The driver will know what to do with BEST, 
especially if features are added for fixed. It's no more effort than specifying 
0xff...f
> I would prefer not to call it BEST though.  Maybe ALL.
[davep] See 0xfff..f for why ALL may not be BEST. Pun intended.

BTW:  I like the RAS_ prefix over ECC, since it is really the overarching 
concept.
> 
> RAS_NONE 0
> RAS_DEFAULT (1 << 0)
> RAS_VRAM_ECC (1 << 1)
> RAS_SRAM_EDC (1 << 2)
> ...
> RAS_ALL 0x

[davep] So I see this with a change to RAS_ALL to basically be equivalent to 
what I had considered BEST and should be a fixed bit such as (1 << 3), and we 
choose the options based on what we know at the time of a release.
0xff...f as a number as opposed to a bitmask could be used instead of (1 << 3)
> 
> >-Original Message-
> >From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf
> >Of Xie, AlexBin
> >Sent: Monday, June 26, 2017 2:12 PM
> >To: Panariti, David; Deucher, Alexander; amd-gfx@lists.freedesktop.org
> >Subject: RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control
> >use of ECC/EDC.
> >
> >Hi,
> >
> >I have not checked the background of this discussion very closely yet.
> >And you might have known the following.
> >
> >Customers may not want the default setting to change meaning. This is
> >like an API.
> >Example: The application and its environment is already set up and tested.
> >Then if customer updates driver, suddenly driver has some new behavior?
> >Certain serious application definitely does not accept this.
> >
> >IMHO, it is better to avoid vague concepts like &quo

RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control use of ECC/EDC.

2017-06-26 Thread Panariti, David
>> > I'd suggest setting amdgpu_ecc_flags to AMD_ECC_SUPPORT_BEST by 
>> > default.  That can be our asic specific default setting.  In the 
>> > case of CZ, that will be disabled until we decide to enable EDC by default.
>> [davep] I'm confused.  ECC...BEST will cause EDC to be enabled.
>> I used ECC as the generic term for ECC and EDC, since ECC seems more 
>> basic (EDC is built on top of ECC).
>> If I understand you, we can't do what you want with the current setup.

>I'm saying we make ECC_BEST the default config (feel free to re-name if 
>ECC_DEFAULT).  Each asic can have a different default depending 
>on what features are ready.  So for CZ, we'd make ECC_BEST equivalent to 
>disabling ECC for now.  If a user wants to force it on, they can 
>set ECC_EDC.  Once EDC is stable on CZ, we can make ECC_BEST be equivalent to 
>ECC_EDC.  The way the default (ECC_BEST) always maps 
>to the best available combination in that version of the driver.

That's not how I meant it to work WRT BEST.
Each asic will have a DEFAULT, but that isn't what BEST means.
CZ is a good example (when fully implemented).  DEFAULT for CZ is everything 
except HALT, since, IMO opinion, most people do not want to hang or reboot.
BEST for CZ would be everything a person most interested in reliability would 
want, which IMO, includes HALT/reboot.
Similar is if something like performance degradation is really bad, DEFAULT 
would be OFF. BEST would be ON, e.g., if the user's app doesn't trigger the 
performance problem.
The BEST bit is in a fixed position, so that customers don't need to worry what 
bits are needed for the most reliable performance (in our opinion) on a given 
asic.
And if a customer (or developer) wants some arbitrary set of features, they can 
set bits as they want.

I think DEFAULT will make most people happy.
BEST allows people who are interested in everything they can get, regardless of 
any issues that brings with it. It is requested simply by using a fixed param 
value (0x01) for any asic.
This probably should not include features that have any kind of fatal flaw such 
as the Vega10 HBM ECC issue.  When fixed, it can be added to DEFAULT.
And allowing per-feature control allows anyone to do precisely what they want.
"Effort" increases as the number of interested users decreases.

Using defines in the init code will be a problem if there is more than one kind 
of asic involved or a single asic that the user wants to use with different 
parameters.  However, this doesn't seem to be a high priority.
If we do want to worry about it, then we'll need to put the values into the 
amdgpu_gfx struct.

regards,
davep

> -Original Message-----
> From: Deucher, Alexander
> Sent: Tuesday, June 06, 2017 6:16 PM
> To: Panariti, David <david.panar...@amd.com>; amd-
> g...@lists.freedesktop.org
> Subject: RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control use
> of ECC/EDC.
> 
> > -Original Message-
> > From: Panariti, David
> > Sent: Tuesday, June 06, 2017 5:50 PM
> > To: Deucher, Alexander; amd-gfx@lists.freedesktop.org
> > Subject: RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control
> > use of ECC/EDC.
> >
> > Rather than inlining this in a number of places, Re verbosity:
> > I've worked in embedded environments and when dealing with
> > intermittent problems it's nice to have all of the information ASAP
> > rather than waiting for a problem to reoccur, especially if it's very
> intermittent.
> > I would've preferred more.
> > Since it only shows up happens on CZ, it adds little to the output.
> > I like to show the reasons why EDC didn't happen, hence the backwards
> > looking messages.
> > In this particular case,  without the "... not requested..." we can't
> > tell if it was the flags or the ring being unready that made us bail 
> > earlier.
> 
> I'm fine with a message about EDC either being enabled or disabled, but a
> bunch of random debug statements along the way are too much.  They tend
> to just cause confusion and clutter up the logs.
> 
> >
> > > -Original Message-
> > > From: Deucher, Alexander
> > > Sent: Tuesday, June 06, 2017 5:22 PM
> > > To: Panariti, David <david.panar...@amd.com>; amd-
> > > g...@lists.freedesktop.org
> > > Cc: Panariti, David <david.panar...@amd.com>
> > > Subject: RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control
> > use
> > > of ECC/EDC.
> > >
> > > > -Original Message-
> > > > From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On
> > Behalf
> > > > Of David Panariti
> > > > Sent: Tuesday, June 06, 2017 4:33 PM
> > &

RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control use of ECC/EDC.

2017-06-06 Thread Panariti, David
Rather than inlining this in a number of places, Re verbosity:
I've worked in embedded environments and when dealing with intermittent 
problems it's nice to have all of the information ASAP rather than waiting for 
a problem to reoccur, especially if it's very intermittent.
I would've preferred more.
Since it only shows up happens on CZ, it adds little to the output.
I like to show the reasons why EDC didn't happen, hence the backwards looking 
messages.
In this particular case,  without the "... not requested..." we can't tell if 
it was the flags or the ring being unready that made us bail earlier.

> -Original Message-
> From: Deucher, Alexander
> Sent: Tuesday, June 06, 2017 5:22 PM
> To: Panariti, David <david.panar...@amd.com>; amd-
> g...@lists.freedesktop.org
> Cc: Panariti, David <david.panar...@amd.com>
> Subject: RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control use
> of ECC/EDC.
> 
> > -Original Message-
> > From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf
> > Of David Panariti
> > Sent: Tuesday, June 06, 2017 4:33 PM
> > To: amd-gfx@lists.freedesktop.org
> > Cc: Panariti, David
> > Subject: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control use
> > of ECC/EDC.
> >
> > Allow various kinds of memory integrity methods (e.g. ECC/EDC) to be
> > enabled or disabled.  By default, all features are disabled.
> >
> > EDC is Error Detection and Correction.  It can detect ECC errors and
> > do 0 or more of: count SEC (single error corrected) and DED (double
> > error detected, i.e. uncorrected ECC error), halt the affected block,
> interrupt the CPU.
> > Currently, only counting errors is supported.
> >
> > Fixed a whitespace error.
> >
> > Signed-off-by: David Panariti <david.panar...@amd.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu.h  |  1 +
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c  |  4 
> >  drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c| 28
> > ++--
> >  drivers/gpu/drm/amd/include/amd_shared.h | 14 ++
> >  4 files changed, 45 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index a6f51eb..3e930ee 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -116,6 +116,7 @@ extern int amdgpu_cntl_sb_buf_per_se;  extern int
> > amdgpu_param_buf_per_se;  extern int amdgpu_job_hang_limit;  extern
> > int amdgpu_lbpw;
> > +extern unsigned amdgpu_ecc_flags;
> >
> >  #define AMDGPU_DEFAULT_GTT_SIZE_MB 3072ULL /* 3GB by
> > default */
> >  #define AMDGPU_WAIT_IDLE_TIMEOUT_IN_MS 3000
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index c4825ff..972660d 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -115,6 +115,7 @@ int amdgpu_cntl_sb_buf_per_se = 0;  int
> > amdgpu_param_buf_per_se = 0;  int amdgpu_job_hang_limit = 0;  int
> > amdgpu_lbpw = -1;
> > +unsigned amdgpu_ecc_flags = 0;
> 
> I'd suggest setting amdgpu_ecc_flags to AMD_ECC_SUPPORT_BEST by
> default.  That can be our asic specific default setting.  In the case of CZ, 
> that
> will be disabled until we decide to enable EDC by default.
[davep] I'm confused.  ECC...BEST will cause EDC to be enabled.
I used ECC as the generic term for ECC and EDC, since ECC seems more basic (EDC 
is built on top of ECC).
If I understand you, we can't do what you want with the current setup.

> 
> >
> >  MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in
> > megabytes");  module_param_named(vramlimit, amdgpu_vram_limit, int,
> > 0600); @@ -795,6 +796,9 @@ static struct pci_driver
> > amdgpu_kms_pci_driver = {
> > .driver.pm = _pm_ops,
> >  };
> >
> > +MODULE_PARM_DESC(ecc_flags, "ECC/EDC enable flags (0 = disable
> > ECC/EDC (default))");
> > +module_param_named(ecc_flags, amdgpu_ecc_flags, uint, 0444);
> > +
> >
> >
> >  static int __init amdgpu_init(void)
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
> > b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
> > index 46e766e..3b5685c 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
> > @@ -1824,7 +1824,17 @@ static int
> > gfx_v8_0_do_edc_gpr_workarounds(struct amdgpu_device *adev)
> > if (dis_bit) {
> > /* On Carrizo, EDC may be perm

RE: CZ EDC param and support

2017-05-01 Thread Panariti, David
Thanks.
I’ve been using my Windwoes box way too much.  I need to set up email on my 
linux boxen.

davep

From: Christian König [mailto:deathsim...@vodafone.de]
Sent: Monday, May 01, 2017 10:40 AM
To: Panariti, David <david.panar...@amd.com>; Koenig, Christian 
<christian.koe...@amd.com>; gpudriverdevsupport <gpudriverdevsupp...@amd.com>; 
amd-gfx@lists.freedesktop.org
Subject: Re: CZ EDC param and support

I’ve gotten a comment that inline patches are preferred.
Well, they are. But you should send them with "git send-email" and not squashed 
together all in one mail :)

Otherwise I can't see how we should be able to apply them.

Additional to that at least I'm perfectly fine with attached patches as well.

Christian.

Am 28.04.2017 um 16:18 schrieb Panariti, David:
Actually, the attachment was an oversight.
It’s easier for me to attach, open the attachment and then delete the 
attachment.
I got only 2/3 this time.
I’ve gotten a comment that inline patches are preferred.

Sorry for the inconvenience.

davep

From: Koenig, Christian
Sent: Friday, April 28, 2017 4:06 AM
To: Panariti, David <david.panar...@amd.com><mailto:david.panar...@amd.com>; 
gpudriverdevsupport 
<gpudriverdevsupp...@amd.com><mailto:gpudriverdevsupp...@amd.com>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: CZ EDC param and support

You somehow messed up the attachment.

Instead of individual files everything is squashed together as all-edc.patch.

Please fix that otherwise proper review won't be possible.

Christian.

Am 28.04.2017 um 00:13 schrieb Panariti, David:
The changes in the workarounds function use DRM_INFO rather than DRM_DEBUG 
because CZs with EDC are often used in embedded environments and any info can 
be useful especially in the case of an intermittent problem.

From e1ce383592c275b58ad95bd80b5479af8c1f9dae Mon Sep 17 00:00:00 2001
From: David Panariti <david.panar...@amd.com><mailto:david.panar...@amd.com>
Date: Fri, 14 Apr 2017 13:41:52 -0400
Subject: [PATCH 1/3] drm/amdgpu: Moved gfx_v8_0_select_se_sh() in lieu of
re-redundant prototype.

Will be needed for the rest of the EDC workarounds patch.

Change-Id: Ie586ab38a69e98a91c6cb5747e285ce8bfdd1c86
Signed-off-by: David Panariti 
<david.panar...@amd.com><mailto:david.panar...@amd.com>
---
drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c | 46 +--
1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
index 2ff5f19..27b57cb 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
@@ -1500,6 +1500,29 @@ static int gfx_v8_0_kiq_init(struct amdgpu_device *adev)
   return 0;
}
+static void gfx_v8_0_select_se_sh(struct amdgpu_device *adev,
+   u32 se_num, u32 
sh_num, u32 instance)
+{
+ u32 data;
+
+ if (instance == 0x)
+ data = REG_SET_FIELD(0, GRBM_GFX_INDEX, 
INSTANCE_BROADCAST_WRITES, 1);
+ else
+ data = REG_SET_FIELD(0, GRBM_GFX_INDEX, 
INSTANCE_INDEX, instance);
+
+ if (se_num == 0x)
+ data = REG_SET_FIELD(data, GRBM_GFX_INDEX, 
SE_BROADCAST_WRITES, 1);
+ else
+ data = REG_SET_FIELD(data, GRBM_GFX_INDEX, 
SE_INDEX, se_num);
+
+ if (sh_num == 0x)
+ data = REG_SET_FIELD(data, GRBM_GFX_INDEX, 
SH_BROADCAST_WRITES, 1);
+ else
+ data = REG_SET_FIELD(data, GRBM_GFX_INDEX, 
SH_INDEX, sh_num);
+
+ WREG32(mmGRBM_GFX_INDEX, data);
+}
+
static const u32 vgpr_init_compute_shader[] =
{
   0x7e000209, 0x7e020208,
@@ -3556,29 +3579,6 @@ static void gfx_v8_0_tiling_mode_table_init(struct 
amdgpu_device *adev)
   }
}
-static void gfx_v8_0_select_se_sh(struct amdgpu_device *adev,
-u32 se_num, 
u32 sh_num, u32 instance)
-{
-  u32 data;
-
-  if (instance == 0x)
-  data = REG_SET_FIELD(0, GRBM_GFX_INDEX, 
INSTANCE_BROADCAST_WRITES, 1);
-  else
-  data = REG_SET_FIELD(0, GRBM_GFX_INDEX, 
INSTANCE_INDEX, instance);
-
-  if (se_num == 0x)
-  data = REG_SET_FIELD(data, GRBM_GFX_INDEX, 
SE_BROADCAST_WRITES, 1);
-  else
-  data = REG_SET_FIELD(data, GRBM_GFX_INDEX, 
SE_INDEX, se_num);
-
-  if (sh_num == 0x)
-  data = REG_SET_FIELD(data, GRBM_GFX_INDEX, 
SH_BROADCAST_WRITES, 1);
-  else
-  data = REG_SET_FIELD(data, 

RE: CZ EDC param and support

2017-04-28 Thread Panariti, David
Actually, the attachment was an oversight.
It's easier for me to attach, open the attachment and then delete the 
attachment.
I got only 2/3 this time.
I've gotten a comment that inline patches are preferred.

Sorry for the inconvenience.

davep

From: Koenig, Christian
Sent: Friday, April 28, 2017 4:06 AM
To: Panariti, David <david.panar...@amd.com>; gpudriverdevsupport 
<gpudriverdevsupp...@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: CZ EDC param and support

You somehow messed up the attachment.

Instead of individual files everything is squashed together as all-edc.patch.

Please fix that otherwise proper review won't be possible.

Christian.

Am 28.04.2017 um 00:13 schrieb Panariti, David:
The changes in the workarounds function use DRM_INFO rather than DRM_DEBUG 
because CZs with EDC are often used in embedded environments and any info can 
be useful especially in the case of an intermittent problem.

>From e1ce383592c275b58ad95bd80b5479af8c1f9dae Mon Sep 17 00:00:00 2001
From: David Panariti <david.panar...@amd.com><mailto:david.panar...@amd.com>
Date: Fri, 14 Apr 2017 13:41:52 -0400
Subject: [PATCH 1/3] drm/amdgpu: Moved gfx_v8_0_select_se_sh() in lieu of
re-redundant prototype.

Will be needed for the rest of the EDC workarounds patch.

Change-Id: Ie586ab38a69e98a91c6cb5747e285ce8bfdd1c86
Signed-off-by: David Panariti 
<david.panar...@amd.com><mailto:david.panar...@amd.com>
---
drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c | 46 +--
1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
index 2ff5f19..27b57cb 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
@@ -1500,6 +1500,29 @@ static int gfx_v8_0_kiq_init(struct amdgpu_device *adev)
   return 0;
}
+static void gfx_v8_0_select_se_sh(struct amdgpu_device *adev,
+   u32 se_num, u32 
sh_num, u32 instance)
+{
+ u32 data;
+
+ if (instance == 0x)
+ data = REG_SET_FIELD(0, GRBM_GFX_INDEX, 
INSTANCE_BROADCAST_WRITES, 1);
+ else
+ data = REG_SET_FIELD(0, GRBM_GFX_INDEX, 
INSTANCE_INDEX, instance);
+
+ if (se_num == 0x)
+ data = REG_SET_FIELD(data, GRBM_GFX_INDEX, 
SE_BROADCAST_WRITES, 1);
+ else
+ data = REG_SET_FIELD(data, GRBM_GFX_INDEX, 
SE_INDEX, se_num);
+
+ if (sh_num == 0x)
+ data = REG_SET_FIELD(data, GRBM_GFX_INDEX, 
SH_BROADCAST_WRITES, 1);
+ else
+ data = REG_SET_FIELD(data, GRBM_GFX_INDEX, 
SH_INDEX, sh_num);
+
+ WREG32(mmGRBM_GFX_INDEX, data);
+}
+
static const u32 vgpr_init_compute_shader[] =
{
   0x7e000209, 0x7e020208,
@@ -3556,29 +3579,6 @@ static void gfx_v8_0_tiling_mode_table_init(struct 
amdgpu_device *adev)
   }
}
-static void gfx_v8_0_select_se_sh(struct amdgpu_device *adev,
-u32 se_num, 
u32 sh_num, u32 instance)
-{
-  u32 data;
-
-  if (instance == 0x)
-  data = REG_SET_FIELD(0, GRBM_GFX_INDEX, 
INSTANCE_BROADCAST_WRITES, 1);
-  else
-  data = REG_SET_FIELD(0, GRBM_GFX_INDEX, 
INSTANCE_INDEX, instance);
-
-  if (se_num == 0x)
-  data = REG_SET_FIELD(data, GRBM_GFX_INDEX, 
SE_BROADCAST_WRITES, 1);
-  else
-  data = REG_SET_FIELD(data, GRBM_GFX_INDEX, 
SE_INDEX, se_num);
-
-  if (sh_num == 0x)
-  data = REG_SET_FIELD(data, GRBM_GFX_INDEX, 
SH_BROADCAST_WRITES, 1);
-  else
-  data = REG_SET_FIELD(data, GRBM_GFX_INDEX, 
SH_INDEX, sh_num);
-
-  WREG32(mmGRBM_GFX_INDEX, data);
-}
-
static u32 gfx_v8_0_create_bitmask(u32 bit_width)
{
   return (u32)((1ULL << bit_width) - 1);
--
2.7.4


>From 38fac8cab73dbc07e0ee7599b52106bc09dd32ea Mon Sep 17 00:00:00 2001
From: David Panariti <david.panar...@amd.com><mailto:david.panar...@amd.com>
Date: Mon, 24 Apr 2017 11:05:45 -0400
Subject: [PATCH 2/3] drm/amdgpu: Complete Carrizo EDC (Error Detection and
Correction) workarounds.

The workarounds are unconditionally performed on CZs with EDC enabled.
EDC detects uncorrected ECC errors and uses data poisoning to prevent
corrupted compute results from being used (read).
EDC enabled CZs are often used in embedded environments.

Change-Id: I84c261785329beeb797f11efbe0ec35790f2996c
Signed-off-by: David Panariti 
<david.panar...@amd.com><mailto:david.panar...@amd.com>
---
drivers/gpu/d

RE: [PATCH] drm/amdgpu: Add kernel parameter to manage memory error handling.

2017-04-12 Thread Panariti, David


> -Original Message-
> From: Michel Dänzer [mailto:mic...@daenzer.net]
> Sent: Tuesday, April 11, 2017 10:12 PM
> To: Panariti, David <david.panar...@amd.com>
> Cc: amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: Add kernel parameter to manage
> memory error handling.
> 
> 
> Hi Dave,
> 
> 
> Please send patches inline instead of as attachments, ideally using git send-
> email.
> 
> 
> > @@ -212,6 +213,9 @@ module_param_named(cg_mask,
> amdgpu_cg_mask, uint,
> > 0444);  MODULE_PARM_DESC(pg_mask, "Powergating flags mask (0 =
> disable
> > power gating)");  module_param_named(pg_mask, amdgpu_pg_mask,
> uint,
> > 0444);
> >
> > +MODULE_PARM_DESC(ecc_mask, "ECC/EDC flags mask (0 = disable
> > +ECC/EDC)");
> 
> "0 = disable ECC/EDC" implies that they're enabled by default? Was that
> already the case before this patch?

[davep] Yes it was, and there was actually a problem in some cases where the CZ 
would hang which is why I added the param.
I was wondering if it would be better to default to them being off, but I 
wasn't sure how important maintaining original behavior is considered.
Actually, there are some bugs in the workaround function as it is, so it really 
should default to off.
I did some work implementing EDC for Vilas in research, but it was side tracked 
for a while.
I fixed up the workaround, but it really makes no sense to use it at all until 
the rest of the EDC code is in place, and isn't currently scheduled to happen.
Given that, I actually think it would be best to remove the workaround and 
leave EDC disabled until (if) it is scheduled to be completed.

> 
> 
> > @@ -1664,6 +1664,24 @@ static int
> gfx_v8_0_do_edc_gpr_workarounds(struct amdgpu_device *adev)
> > if (adev->asic_type != CHIP_CARRIZO)
> > return 0;
> >
> > +   DRM_INFO("gfx_v8_0_do_edc_gpr_workarounds(): ecc_flags:
> 0x%08x\n",
> > +adev->ecc_flags);
> > +
> > +   /*
> > +* Check if EDC has been requested.
> > +* For Carrizo, EDC is the best/safest mode WRT error handling.
> > +*/
> > +   if (!(adev->ecc_flags
> > + & (AMD_ECC_SUPPORT_BEST | AMD_ECC_SUPPORT_EDC))) {
> > +   DRM_INFO("gfx_v8_0_do_edc_gpr_workarounds(): "
> > +"skipping workarounds and not enabling EDC.\n");
> > +
> > +   return 0;
> > +   }
> > +
> > +   DRM_INFO("gfx_v8_0_do_edc_gpr_workarounds(): "
> > +"running workarounds and enabling EDC.\n");
> 
> These DRM_INFOs are too chatty, maybe make them e.g.
> DRM_DEBUG_DRIVER.
> 
> 
> --
> Earthling Michel Dänzer   |   http://www.amd.com
> Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amdgpu: Add kernel parameter to manage memory error handling.

2017-04-11 Thread Panariti, David
Currently allows Carrizo EDC to be turned off.

davep


0001-drm-amdgpu-Add-kernel-parameter-to-manage-memory-err.patch
Description: 0001-drm-amdgpu-Add-kernel-parameter-to-manage-memory-err.patch
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [pull] radeon drm-fixes-4.11

2017-03-29 Thread Panariti, David
Hi,

I'm still new to this stuff.
Is this informational or some action items?

thanks,
davep

> -Original Message-
> From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf
> Of Alex Deucher
> Sent: Wednesday, March 29, 2017 12:55 PM
> To: amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org;
> airl...@gmail.com
> Cc: Deucher, Alexander 
> Subject: [pull] radeon drm-fixes-4.11
> 
> Hi Dave,
> 
> One small fix for radeon.
> 
> The following changes since commit
> d64a04720b0e64c1cd0726a3a27b360822fbee22:
> 
>   Merge branch 'drm-fixes-4.11' of git://people.freedesktop.org/~agd5f/linux
> into drm-fixes (2017-03-24 11:05:06 +1000)
> 
> are available in the git repository at:
> 
>   git://people.freedesktop.org/~agd5f/linux drm-fixes-4.11
> 
> for you to fetch changes up to
> ce4b4f228e51219b0b79588caf73225b08b5b779:
> 
>   drm/radeon: Override fpfn for all VRAM placements in radeon_evict_flags
> (2017-03-27 16:17:30 -0400)
> 
> 
> Michel Dänzer (1):
>   drm/radeon: Override fpfn for all VRAM placements in radeon_evict_flags
> 
>  drivers/gpu/drm/radeon/radeon_ttm.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx