Re: [PATCH] PR target/104345: Use nvptx "set" instruction for cond ? -1 : 0.

2022-02-10 Thread Tom de Vries via Gcc-patches

On 2/3/22 22:00, Roger Sayle wrote:


This patch addresses the "increased register pressure" regression on
nvptx-none caused by my change to transition the backend to a
STORE_FLAG_VALUE = 1 target.  This improved code generation for the
more common case of producing 0/1 Boolean values, but unfortunately
made things marginally worse when a 0/-1 mask value is desired.
Unfortunately, nvptx kernels are extremely sensitive to changes in
register usage, which was observable in the reported PR.

This patch provides optimizations for -(cond ? 1 : 0), effectively
simplify this into cond ? -1 : 0, where these ternary operators are
provided by nvptx's selp instruction, and for the specific case of
SImode, using (restoring) nvptx's "set" instruction (which avoids
the need for a predicate register).

This patch has been tested on nvptx-none hosted on x86_64-pc-linux-gnu
with a "make" and "make -k check" with no new failures.  Unfortunately,
the exact register usage of a nvptx kernel depends upon the version of
the Cuda drivers being used (and the hardware), but I believe this
change should resolve the PR (for Thomas) by improving code generation
for the cases that regressed.  Ok for mainline?




LGTM, applied.

Thanks,
- Tom


2022-02-03  Roger Sayle  

gcc/ChangeLog
PR target/104345
* config/nvptx/nvptx.md (sel_true): Fix indentation.
(sel_false): Likewise.
(define_code_iterator eqne): New code iterator for EQ and NE.
(*selp_neg_): New define_insn_and_split to optimize
the negation of a selp instruction.
(*selp_not_): New define_insn_and_split to optimize
the bitwise not of a selp instruction.
(*setcc_int): Use set instruction for neg:SI of a selp.

gcc/testsuite/ChangeLog
PR target/104345
* gcc.target/nvptx/neg-selp.c: New test case.


Thanks in advance,
Roger
--



RE: [PATCH] PR target/104345: Use nvptx "set" instruction for cond ? -1 : 0.

2022-02-04 Thread Roger Sayle


Hi Thomas,

Very many thanks for your help investigating this problem.

> > This patch addresses the "increased register pressure" regression
> > on nvptx-none caused by my change to transition the backend to
> > a STORE_FLAG_VALUE = 1 target.
> 
> Yes, "addresses", but unfortunately doesn't "resolve".  ;-|

Doh!

> I'm confirming the improved code generation (less registers used, less
> instructions emitted) in cases where it triggers -- but unfortunately it
> doesn't in the PR104345 'libgomp.oacc-c-c++-common/reduction-cplx-dbl.c'
> scenario.

Looking over the nvptx code currently generated for reduction-cplx-dbl.c,
it appears nearly optimal and it's difficult to see what could have regressed.
[It makes almost no uses of Boolean types, so is relatively unaffected
by a STORE_FLAG_VALUE change].  One remaining possibility is that
the "register usage" regression is not in reduction-cplx-dbl.c itself but
in __muldc3 in libgcc.a.  [I believe kernel resource usage is computed
including all called functions].

Might this be easy to test on your configuration, moving libgcc.a from
one build to another?

If it is __muldc3 regressing, then the other nvptx patches mentioned
previously, and perhaps even improvements to isnan and isinf, may help.

Again apologies that the "using nvptx set.?32 for "cond ? -1: 0" patch, that
catches many of the issues observed in your initial PR analysis, isn't actually
the root cause of this particular case.

Thanks again,
Roger
--




Re: [PATCH] PR target/104345: Use nvptx "set" instruction for cond ? -1 : 0.

2022-02-04 Thread Thomas Schwinge
Hi Roger!

On 2022-02-03T21:00:50+, "Roger Sayle"  wrote:
> This patch

Thanks!

> addresses the "increased register pressure" regression on
> nvptx-none caused by my change to transition the backend to a
> STORE_FLAG_VALUE = 1 target.

Yes, "addresses", but unfortunately doesn't "resolve".  ;-|

> This improved code generation for the
> more common case of producing 0/1 Boolean values, but unfortunately
> made things marginally worse when a 0/-1 mask value is desired.
> Unfortunately, nvptx kernels are extremely sensitive to changes in
> register usage, which was observable in the reported PR.
>
> This patch provides optimizations for -(cond ? 1 : 0), effectively
> simplify this into cond ? -1 : 0, where these ternary operators are
> provided by nvptx's selp instruction, and for the specific case of
> SImode, using (restoring) nvptx's "set" instruction (which avoids
> the need for a predicate register).

I'm confirming the improved code generation (less registers used, less
instructions emitted) in cases where it triggers -- but unfortunately it
doesn't in the PR104345 'libgomp.oacc-c-c++-common/reduction-cplx-dbl.c'
scenario.

> This patch has been tested on nvptx-none hosted on x86_64-pc-linux-gnu
> with a "make" and "make -k check" with no new failures.  Unfortunately,
> the exact register usage of a nvptx kernel depends upon the version of
> the Cuda drivers being used (and the hardware), but I believe this
> change should resolve the PR (for Thomas) by improving code generation
> for the cases that regressed.  Ok for mainline?

So, testing your patch in isolation, it does *not* resolve PR104345,
unfortunately.  I'll next test in combination with your other pending
patches:

  - "nvptx: Expand QI mode operations using SI mode instructions".
  - "nvptx: Fix and use BI mode logic instructions (e.g. and.pred)"


Grüße
 Thomas


> gcc/ChangeLog
>   PR target/104345
>   * config/nvptx/nvptx.md (sel_true): Fix indentation.
>   (sel_false): Likewise.
>   (define_code_iterator eqne): New code iterator for EQ and NE.
>   (*selp_neg_): New define_insn_and_split to optimize
>   the negation of a selp instruction.
>   (*selp_not_): New define_insn_and_split to optimize
>   the bitwise not of a selp instruction.
>   (*setcc_int): Use set instruction for neg:SI of a selp.
>
> gcc/testsuite/ChangeLog
>   PR target/104345
>   * gcc.target/nvptx/neg-selp.c: New test case.
>
>
> Thanks in advance,
> Roger
> --
>
> diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
> index 92768dd..651ba20 100644
> --- a/gcc/config/nvptx/nvptx.md
> +++ b/gcc/config/nvptx/nvptx.md
> @@ -892,7 +892,7 @@
>
>  (define_insn "sel_true"
>[(set (match_operand:HSDIM 0 "nvptx_register_operand" "=R")
> -(if_then_else:HSDIM
> + (if_then_else:HSDIM
> (ne (match_operand:BI 1 "nvptx_register_operand" "R") (const_int 0))
> (match_operand:HSDIM 2 "nvptx_nonmemory_operand" "Ri")
> (match_operand:HSDIM 3 "nvptx_nonmemory_operand" "Ri")))]
> @@ -901,7 +901,7 @@
>
>  (define_insn "sel_true"
>[(set (match_operand:SDFM 0 "nvptx_register_operand" "=R")
> -(if_then_else:SDFM
> + (if_then_else:SDFM
> (ne (match_operand:BI 1 "nvptx_register_operand" "R") (const_int 0))
> (match_operand:SDFM 2 "nvptx_nonmemory_operand" "RF")
> (match_operand:SDFM 3 "nvptx_nonmemory_operand" "RF")))]
> @@ -910,7 +910,7 @@
>
>  (define_insn "sel_false"
>[(set (match_operand:HSDIM 0 "nvptx_register_operand" "=R")
> -(if_then_else:HSDIM
> + (if_then_else:HSDIM
> (eq (match_operand:BI 1 "nvptx_register_operand" "R") (const_int 0))
> (match_operand:HSDIM 2 "nvptx_nonmemory_operand" "Ri")
> (match_operand:HSDIM 3 "nvptx_nonmemory_operand" "Ri")))]
> @@ -919,13 +919,63 @@
>
>  (define_insn "sel_false"
>[(set (match_operand:SDFM 0 "nvptx_register_operand" "=R")
> -(if_then_else:SDFM
> + (if_then_else:SDFM
> (eq (match_operand:BI 1 "nvptx_register_operand" "R") (const_int 0))
> (match_operand:SDFM 2 "nvptx_nonmemory_operand" "RF")
> (match_operand:SDFM 3 "nvptx_nonmemory_operand" "RF")))]
>""
>"%.\\tselp%t0\\t%0, %3, %2, %1;")
>
> +(define_code_iterator eqne [eq ne])
> +
> +;; Split negation of a predicate into a conditional move.
> +(define_insn_and_split "*selp_neg_"
> +  [(set (match_operand:HSDIM 0 "nvptx_register_operand" "=R")
> + (neg:HSDIM (eqne:HSDIM
> +  (match_operand:BI 1 "nvptx_register_operand" "R")
> +  (const_int 0]
> +  ""
> +  "#"
> +  "&& 1"
> +  [(set (match_dup 0)
> + (if_then_else:HSDIM
> +   (eqne (match_dup 1) (const_int 0))
> +   (const_int -1)
> +   (const_int 0)))])
> +
> +;; Split bitwise not of a predicate into a conditional move.
> +(define_insn_and_split "*selp_not_"
> +  [(set (match_operand:HSDIM 0 "nvptx_register_operand" "=R")
> + (not:HSDIM (eqne:HSDIM
> +