[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction

2023-05-30 Thread olegendo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #50 from Oleg Endo  ---
Actually, let's take any further discussion of shift patterns to PR 54089.

[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction

2023-05-30 Thread olegendo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #49 from Oleg Endo  ---
(In reply to Alexander Klepikov from comment #48)
> I made tests (including *.c files from the GCC testsuite) and everything
> looks fine for now. But I'm still afraid that the pattern for
> 'ashrsi3_libcall_expand' is too wide. It is possible to narrow it down as
> much as possible by adding a distinct attribute, setting it when emitting
> 'ashrsi3_libcall_collapsed', and then checking it and failing if not set:
> 

For this kind of change, the whole GCC test suite needs to be run for at least
the big/little endian -m2 and -m4 variants.


+(define_insn_and_split "ashrsi3_libcall_expand"
+  [(parallel [(set (match_operand:SI 0 "arith_reg_dest")
+   (ashiftrt:SI (match_operand:SI 1 "arith_reg_operand")
+   (match_operand:SI 2 "const_int_operand"))
+   )(clobber (reg:SI T_REG))
+   (clobber (reg:SI PR_REG))
+  ])]

The 'parallel' construct looks strange.

[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction

2023-05-30 Thread klepikov.alex+bugs at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #48 from Alexander Klepikov ---
I made tests (including *.c files from the GCC testsuite) and everything looks
fine for now. But I'm still afraid that the pattern for 'ashrsi3_libcall_expand'
is too wide. It is possible to narrow it down as much as possible by adding a
distinct attribute, setting it when emitting 'ashrsi3_libcall_collapsed', and
then checking it and failing if not set:

(define_attr "libcall_collapsed"
 "ashrsi3,nil"
 (const_string "nil"))

(define_insn "ashrsi3_libcall_collapsed"
  [(set (match_operand:SI 0 "arith_reg_dest" "=r")
(ashiftrt:SI (match_operand:SI 1 "arith_reg_operand" "0")
 (match_operand:SI 2 "const_int_operand")))
   (clobber (reg:SI T_REG))
   (clobber (reg:SI PR_REG))]
  "TARGET_SH1"
  "OOPS"
  [(set_attr "type" "dyn_shift")
   (set_attr "libcall_collapsed" "ashrsi3")
   (set_attr "needs_delay_slot" "yes")])

  if (get_attr_libcall_collapsed (insn) != LIBCALL_COLLAPSED_ASHRSI3)
    return false;

It would be super safe then, but a little bit ugly.

[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction

2023-05-29 Thread olegendo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #47 from Oleg Endo  ---
(In reply to Eric Gallager from comment #46)
> Reminder that patches go to the gcc-patches mailing list

It's OK.  We're just throwing around ideas, nothing final.

[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction

2023-05-29 Thread egallager at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #46 from Eric Gallager  ---
Reminder that patches go to the gcc-patches mailing list

[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction

2023-05-29 Thread klepikov.alex+bugs at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #45 from Alexander Klepikov ---
> I have an idea. If it's impossible to defer initial optimization,
> maybe it's possible to emit some intermediate insn and catch it and optimize
> later?


Good news. I've made a proof of concept. It works at least sometimes - on
simple tests.

$ cat f.c
#define A 0x
#define P ((unsigned char *)A)
#define F 64
#define S 8

unsigned char f_non_zero(unsigned char v){
return (v & F) != 0;
}

unsigned f_sym_non_zero(void){
return (*P & F) != 0;
}

unsigned f_sym_mask(void){
return (*P & F) == F;
}

int f_rshift(char v){
return v >> S;
}

$ /usr/local/sh-toolchain/bin/sh-elf-gcc -O2 -mb -m2e -da -S f.c

$ cat f.s
.file   "f.c"
.text
.text
.align 1
.align 2
.global _f_non_zero
.type   _f_non_zero, @function
_f_non_zero:
mov r4,r0
sts.l   pr,@-r15
tst #64,r0
mov #-1,r0
negc r0,r0
lds.l   @r15+,pr
rts
nop
.size   _f_non_zero, .-_f_non_zero
.align 1
.align 2
.global _f_sym_non_zero
.type   _f_sym_non_zero, @function
_f_sym_non_zero:
mov.l   .L6,r1
sts.l   pr,@-r15
mov.b   @r1,r0
tst #64,r0
mov #-1,r0
negc r0,r0
lds.l   @r15+,pr
rts
nop
.L7:
.align 2
.L6:
.long   -65536
.size   _f_sym_non_zero, .-_f_sym_non_zero
.align 1
.align 2
.global _f_sym_mask
.type   _f_sym_mask, @function
_f_sym_mask:
mov.l   .L10,r1
sts.l   pr,@-r15
mov.b   @r1,r0
tst #64,r0
mov #-1,r0
negc r0,r0
lds.l   @r15+,pr
rts
nop
.L11:
.align 2
.L10:
.long   -65536
.size   _f_sym_mask, .-_f_sym_mask
.align 1
.align 2
.global _f_rshift
.type   _f_rshift, @function
_f_rshift:
mov.l   .L14,r1
sts.l   pr,@-r15
jsr @r1
exts.b  r4,r4
mov r4,r0
lds.l   @r15+,pr
rts
nop
.L15:
.align 2
.L14:
.long   ___ashiftrt_r4_8
.size   _f_rshift, .-_f_rshift
.ident  "GCC: (GNU) 13.1.0"

$ /usr/local/sh-toolchain/bin/sh-elf-gcc -O2 -ml -m2e -da -S f.c

$ cat f.s
.file   "f.c"
.text
.little
.text
.align 1
.align 2
.global _f_non_zero
.type   _f_non_zero, @function
_f_non_zero:
mov r4,r0
sts.l   pr,@-r15
tst #64,r0
mov #-1,r0
negc r0,r0
lds.l   @r15+,pr
rts
nop
.size   _f_non_zero, .-_f_non_zero
.align 1
.align 2
.global _f_sym_non_zero
.type   _f_sym_non_zero, @function
_f_sym_non_zero:
mov.l   .L6,r1
sts.l   pr,@-r15
mov.b   @r1,r0
tst #64,r0
mov #-1,r0
negc r0,r0
lds.l   @r15+,pr
rts
nop
.L7:
.align 2
.L6:
.long   -65536
.size   _f_sym_non_zero, .-_f_sym_non_zero
.align 1
.align 2
.global _f_sym_mask
.type   _f_sym_mask, @function
_f_sym_mask:
mov.l   .L10,r1
sts.l   pr,@-r15
mov.b   @r1,r0
tst #64,r0
mov #-1,r0
negc r0,r0
lds.l   @r15+,pr
rts
nop
.L11:
.align 2
.L10:
.long   -65536
.size   _f_sym_mask, .-_f_sym_mask
.align 1
.align 2
.global _f_rshift
.type   _f_rshift, @function
_f_rshift:
mov.l   .L14,r1
sts.l   pr,@-r15
jsr @r1
exts.b  r4,r4
mov r4,r0
lds.l   @r15+,pr
rts
nop
.L15:
.align 2
.L14:
.long   ___ashiftrt_r4_8
.size   _f_rshift, .-_f_rshift
.ident  "GCC: (GNU) 13.1.0"

Splitting takes place in the split1 pass, as expected. Here is the patch itself.

$ cat gcc-13.1.0-ashrsi3_libcall.patch
diff -ur gcc-13.1.0.orig/gcc/config/sh/sh-protos.h gcc-13.1.0/gcc/config/sh/sh-protos.h
--- gcc-13.1.0.orig/gcc/config/sh/sh-protos.h   2023-04-26 10:09:39.0 +0300
+++ gcc-13.1.0/gcc/config/sh/sh-protos.h    2023-05-29 11:45:05.134723435 +0300
@@ -78,6 +78,7 @@
 extern void gen_shifty_op (int, rtx *);
 extern void gen_shifty_hi_op (int, rtx *);
 extern bool expand_ashiftrt (rtx *);
+extern bool expand_ashrsi3_libcall (rtx *);//delete
 extern bool sh_dynamicalize_shift_p (rtx);
 extern int shl_and_kind (rtx, rtx, int *);
 extern int shl_and_length (rtx);
diff -ur gcc-13.1.0.orig/gcc/config/sh/sh.cc gcc-13.1.0/gcc/config/sh/sh.cc
--- gcc-13.1.0.orig/gcc/config/sh/sh.cc 2023-04-26 10:09:39.0 +0300
+++ gcc-13.1.0/gcc/config/sh/sh.cc  2023-05-29 17:09:54.602787537 +0300
@@ -3875,11 +3877,37 @@
   wrk = gen_reg_rtx (Pmode);

   /* Load 

[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction

2023-05-28 Thread olegendo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #44 from Oleg Endo  ---
(In reply to Alexander Klepikov from comment #43)
> 
> Well, not really. Look at what happens during the expand pass when 'ashrsi3'
> is expanded. The function 'expand_ashiftrt' is called, and what it does at
> the end is explicitly emit 3 insns:
> [...]

> 
> By the way, a right shift of integers expands to only one 'lshiftrt' insn,
> and that's why it can be caught and converted to 'tst'.
> 

Yeah, I might have dropped the ball on the right shift patterns back then and
only reworked the left shift patterns to do that. 


> 
> As far as I understand, these insns could be caught later by a peephole and
> converted to a 'tstsi_t' insn, as is done for other, much simpler insn
> sequences.

It's the combine RTL pass and the split1 RTL pass that do most of this work
here.  The peephole pass in GCC is too simplistic for this.


> 
> Thank you for your time and detailed explanations! I agree with you on all
> points. Software cannot be perfect and it's OK for GCC not to be super
> optimized, so this part had better be left intact.

We can't have it perfect, but we can try ;)

> 
> By the way, I tried to link the library into my project and found that the
> linker is smart enough to pull in only the necessary library functions even
> without LTO. So the increase in size is about 100 or 200 bytes, which is
> acceptable. Thank you very much for the help!

You're welcome.

Yes, stripping out unused library functions doesn't need LTO.  But using LTO
for embedded/MCU firmware, I find it can reduce the code size by about 20%.
For example, it can also inline small library functions into your code (if the
library was also compiled with LTO).
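A possible build recipe for that (a sketch, assuming an sh-elf cross
toolchain; the flags themselves are the standard GCC/binutils spellings):

```shell
# Compile with LTO and per-function/per-data sections, then let the
# linker garbage-collect unreferenced sections at link time.
sh-elf-gcc -O2 -m2e -mb -flto -ffunction-sections -fdata-sections -c f.c -o f.o
sh-elf-gcc -O2 -m2e -mb -flto -Wl,--gc-sections f.o -o f.elf
```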

[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction

2023-05-28 Thread klepikov.alex+bugs at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #43 from Alexander Klepikov ---
> > Thank you! I have an idea. If it's impossible to defer initial optimization,
> > maybe it's possible to emit some intermediate insn and catch it and optimize
> > later?
> 
> This is basically what is supposed to be happening there already.

Well, not really. Look at what happens during the expand pass when 'ashrsi3'
is expanded. The function 'expand_ashiftrt' is called, and what it does at the
end is explicitly emit 3 insns:

wrk = gen_reg_rtx (Pmode);

  //This one
  emit_move_insn (gen_rtx_REG (SImode, 4), operands[1]);

  sprintf (func, "__ashiftrt_r4_%d", value);
  rtx lab = function_symbol (wrk, func, SFUNC_STATIC).lab;

  //This one
  emit_insn (gen_ashrsi3_n (GEN_INT (value), wrk, lab));

  //And this one
  emit_move_insn (operands[0], gen_rtx_REG (SImode, 4));

As far as I understand, these insns could be caught later by a peephole and
converted to a 'tstsi_t' insn, as is done for other, much simpler insn
sequences.

What I'm thinking about is emitting only one, say 'compound', insn, which
could then be split later, somewhere in the split pass, into the function
call, i.e. into those 3 insns.

I wrote test code that emits only one bogus insn. This insn expands to pure asm
code. Of course, that asm code is invalid, because it is impossible to place a
libcall label at the end of a function with pure asm code injection. But then
everything that can be converted to 'tst' is converted to 'tst'. And all real
right shifts are converted to invalid asm code, of course.

That's why I am thinking about the possibility of emitting some intermediate
insn at the expand pass that will defer its real expansion. But I still don't
know how to do it right, or even whether it is possible.

By the way, a right shift of integers expands to only one 'lshiftrt' insn, and
that's why it can be caught and converted to 'tst'.

> 
> However, it's a bit of a dilemma.
> 
> 1) If we don't have a dynamic shift insn and we smash the constant shift
> into individual 
> stitching shifts early, it might open some new optimization opportunities,
> e.g. by sharing intermediate shift results.  Not sure if that actually
> happens in practice, though.
> 
> 2) Whether to use the dynamic shift insn or emit a function call or use
> stitching shifts sequence, it all has an impact on register allocation and
> other instruction use.  This can be problematic during the course of RTL
> optimization passes.
> 
> 3) Even if we have a dynamic shift, sometimes it's more beneficial to emit a
> shorter stitching shift sequence.  Which one is better depends on the
> surrounding code.  I don't think there is anything good there to make the
> proper choice.
> 
> Some other shift related PRs: PR 54089, PR 65317, PR 67691, PR 67869, PR
> 52628, PR 58017

Thank you for your time and detailed explanations! I agree with you on all
points. Software cannot be perfect and it's OK for GCC not to be super
optimized, so this part had better be left intact.

By the way, I tried to link the library into my project and found that the
linker is smart enough to pull in only the necessary library functions even
without LTO. So the increase in size is about 100 or 200 bytes, which is
acceptable. Thank you very much for the help!

[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction

2023-05-26 Thread olegendo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #42 from Oleg Endo  ---
(In reply to Alexander Klepikov from comment #41)
> 
> Thank you! I have an idea. If it's impossible to defer initial optimization,
> maybe it's possible to emit some intermediate insn and catch it and optimize
> later?

This is basically what is supposed to be happening there already.

However, it's a bit of a dilemma.

1) If we don't have a dynamic shift insn and we smash the constant shift into
individual 
stitching shifts early, it might open some new optimization opportunities, e.g.
by sharing intermediate shift results.  Not sure if that actually
happens in practice, though.

2) Whether to use the dynamic shift insn or emit a function call or use
stitching shifts sequence, it all has an impact on register allocation and
other instruction use.  This can be problematic during the course of RTL
optimization passes.

3) Even if we have a dynamic shift, sometimes it's more beneficial to emit a
shorter stitching shift sequence.  Which one is better depends on the
surrounding code.  I don't think there is anything good there to make the
proper choice.

Some other shift related PRs: PR 54089, PR 65317, PR 67691, PR 67869, PR 52628,
PR 58017


> > BTW, have you tried it on a more recent GCC?  There have also been some
> > optimizations in the middle-end (a bit more backend independent) for this
> > kind of thing.
> 
> I tried 13.1 about a week or two ago with the same result.
> 

Good to know.  Thanks for checking it!

[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction

2023-05-26 Thread klepikov.alex+bugs at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #41 from Alexander Klepikov ---
> > It looks like with optimization enabled it converts the bitwise AND to a
> > right shift and then optimizes again. But SH4 has 'shad', and 'shad' can
> > be optimized to 'tst'. SH2E has a libcall instead of a dynamic shift, and
> > the libcall cannot be converted. It seems that the very first optimization
> > spoils things.
> > 
> > But when we have numerous 'shar' instructions, optimization joins the game
> > again and converts them to 'tst'.
> 
> Yes, something like this is what I remember happening there.  I'll try to
> look into the issue with your test cases and see if it's possible to add
> some patterns to catch those.

Thank you! I have an idea. If it's impossible to defer initial optimization,
maybe it's possible to emit some intermediate insn and catch it and optimize
later?

> BTW, have you tried it on a more recent GCC?  There have also been some
> optimizations in the middle-end (a bit more backend independent) for this
> kind of thing.

I tried 13.1 about a week or two ago with the same result.

> Have you tried to use whole program optimization via -flto and
> -ffunction-sections? It  should be able to strip out all unnecessary library
> functions.

Thank you for the advice, I'll give it a try.

[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction

2023-05-25 Thread olegendo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #40 from Oleg Endo  ---
(In reply to Alexander Klepikov from comment #39)
> 
> I'm sorry, but .md lang is too complicated for me.

Yeah, it looks alien at first sight.  But it's where a lot of things happen
w.r.t. instruction selection.

> It looks like with optimization enabled it converts the bitwise AND to a
> right shift and then optimizes again. But SH4 has 'shad', and 'shad' can be
> optimized to 'tst'. SH2E has a libcall instead of a dynamic shift, and the
> libcall cannot be converted. It seems that the very first optimization
> spoils things.
> 
> But when we have numerous 'shar' instructions, optimization joins the game
> again and converts them to 'tst'.

Yes, something like this is what I remember happening there.  I'll try to look
into the issue with your test cases and see if it's possible to add some
patterns to catch those.

BTW, have you tried it on a more recent GCC?  There have also been some
optimizations in the middle-end (a bit more backend independent) for this kind
of thing.

> You are absolutely right, the code will be larger when we do right shifts.
> But there are situations when you can't use the library. For example, the
> SH7055 engine control unit can barely contain the program. The library just
> won't fit.

Have you tried to use whole program optimization via -flto and
-ffunction-sections? It  should be able to strip out all unnecessary library
functions.

[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction

2023-05-25 Thread klepikov.alex+bugs at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #39 from Alexander Klepikov ---
> The tst insn is mainly formed by the combine pass, which relies on certain
> insn patterns and combinations thereof.  See also sh.md, around line 530.

I'm sorry, but .md lang is too complicated for me.

> You can look at the debug output with the -fdump-rtl-all option to see
> what's happening in the RTL passes.

It looks like with optimization enabled it converts the bitwise AND to a right
shift and then optimizes again. But SH4 has 'shad', and 'shad' can be optimized
to 'tst'. SH2E has a libcall instead of a dynamic shift, and the libcall cannot
be converted. It seems that the very first optimization spoils things.

But when we have numerous 'shar' instructions, optimization joins the game
again and converts them to 'tst'.

> What your patch is doing is to make it not emit the ashrsi3_n insn for
> constant shifts altogether?  I guess it will make code that actually needs
> those real shifts larger, as it will always emit the whole shift stitching
> sequence.  That might be a good thing or not.

You are absolutely right, the code will be larger when we do right shifts. But
there are situations when you can't use the library. For example, the SH7055
engine control unit can barely contain the program. The library just won't fit.

[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction

2023-05-24 Thread olegendo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #38 from Oleg Endo  ---
(In reply to Alexander Klepikov from comment #37)
> 
> As far as I understand from GCC sources, function I patched
> 'expand_ashiftrt' process only constant values of shift. As you can see
> earlier, I added your and other examples to tests.

OK, thanks for the additional test cases.  It really looks like the way the
constant shift is expanded (via ashrsi3_n insn) on SH1/SH2 is getting in the
way.

The tst insn is mainly formed by the combine pass, which relies on certain insn
patterns and combinations thereof.  See also sh.md, around line 530.

You can look at the debug output with the -fdump-rtl-all option to see what's
happening in the RTL passes.
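For example (a hypothetical invocation, reusing the toolchain path from
earlier in this thread; the dump-file pass numbering varies between GCC
versions):

```shell
# Writes one dump file per RTL pass next to the source,
# e.g. f.c.*r.combine and f.c.*r.split1.
/usr/local/sh-toolchain/bin/sh-elf-gcc -O2 -m2e -mb -fdump-rtl-all -S f.c
```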

What your patch is doing is to make it not emit the ashrsi3_n insn for constant
shifts altogether?  I guess it will make code that actually needs those real
shifts larger, as it will always emit the whole shift stitching sequence.  That
might be a good thing or not.


> It looks like really
> dynamic shifts translate to library calls.

So the option name '-mdisable-dynshift-libcall' doesn't make sense.
What it actually does is more like '-mdisable-constshift-libcall'.

> 
> Should I test more exotic situations? If so, could you please help me with
> really exotic or weird examples?

Have you had a look at the existing test cases for this in
gcc/testsuite/gcc.target/sh ?

[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction

2023-05-24 Thread klepikov.alex+bugs at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #37 from Alexander Klepikov ---
> Can you also compile for little endian, and most of all, use -O2
> optimization level.  Some optimizations are not done below -O2.

Here's source file, I added functions with non-constant shifts

$ cat f.c
#define ADDR 0x
#define P ((unsigned char *)ADDR)
#define FLAG 0x40
#define S 7

unsigned char f_char_var(char v){
return (v & FLAG) == FLAG;
}

unsigned char f_unsigned_char_var(unsigned char v){
return (v & FLAG) == FLAG;
}

unsigned char f_symbol(void){
return (*P & FLAG) == FLAG;
}

unsigned char f_symbol_zero(void){
return (*P & FLAG) == 0;
}

unsigned char f_symbol_non_zero(void){
return (*P & FLAG) != 0;
}

unsigned int dyn_lshift (unsigned int x, unsigned int y)
{
  return x << (y & 31);
}

unsigned int dyn_rshift (unsigned int x, unsigned int y)
{
  return x >> (y & 31);
}

unsigned int really_dyn_lshift (unsigned int x, unsigned int y)
{
  return x << y;
}

unsigned int really_dyn_rshift (unsigned int x, unsigned int y)
{
  return x >> y;
}

With patch disabled, -O2 -mb:

$ cat f.s
.file   "f.c"
.text
.text
.align 1
.align 2
.global _f_char_var
.type   _f_char_var, @function
_f_char_var:
mov.l   .L4,r1
sts.l   pr,@-r15
jsr @r1
exts.b  r4,r4
mov r4,r0
and #1,r0
lds.l   @r15+,pr
rts
nop
.L5:
.align 2
.L4:
.long   ___ashiftrt_r4_6
.size   _f_char_var, .-_f_char_var
.align 1
.align 2
.global _f_unsigned_char_var
.type   _f_unsigned_char_var, @function
_f_unsigned_char_var:
mov.l   .L8,r1
sts.l   pr,@-r15
jsr @r1
exts.b  r4,r4
mov r4,r0
and #1,r0
lds.l   @r15+,pr
rts
nop
.L9:
.align 2
.L8:
.long   ___ashiftrt_r4_6
.size   _f_unsigned_char_var, .-_f_unsigned_char_var
.align 1
.align 2
.global _f_symbol
.type   _f_symbol, @function
_f_symbol:
mov.l   .L12,r1
sts.l   pr,@-r15
mov.b   @r1,r4
mov.l   .L13,r1
jsr @r1
nop
mov r4,r0
and #1,r0
lds.l   @r15+,pr
rts
nop
.L14:
.align 2
.L12:
.long   -65536
.L13:
.long   ___ashiftrt_r4_6
.size   _f_symbol, .-_f_symbol
.align 1
.align 2
.global _f_symbol_zero
.type   _f_symbol_zero, @function
_f_symbol_zero:
mov.l   .L16,r1
mov.b   @r1,r0
tst #64,r0
rts
movt r0
.L17:
.align 2
.L16:
.long   -65536
.size   _f_symbol_zero, .-_f_symbol_zero
.align 1
.align 2
.global _f_symbol_non_zero
.type   _f_symbol_non_zero, @function
_f_symbol_non_zero:
mov.l   .L20,r1
sts.l   pr,@-r15
mov.b   @r1,r4
mov.l   .L21,r1
jsr @r1
nop
mov r4,r0
and #1,r0
lds.l   @r15+,pr
rts
nop
.L22:
.align 2
.L20:
.long   -65536
.L21:
.long   ___ashiftrt_r4_6
.size   _f_symbol_non_zero, .-_f_symbol_non_zero
.align 1
.align 2
.global _dyn_lshift
.type   _dyn_lshift, @function
_dyn_lshift:
mov.l   .L25,r1
sts.l   pr,@-r15
jsr @r1
mov r5,r0
lds.l   @r15+,pr
rts
nop
.L26:
.align 2
.L25:
.long   ___ashlsi3_r0
.size   _dyn_lshift, .-_dyn_lshift
.align 1
.align 2
.global _dyn_rshift
.type   _dyn_rshift, @function
_dyn_rshift:
mov.l   .L29,r1
sts.l   pr,@-r15
jsr @r1
mov r5,r0
lds.l   @r15+,pr
rts
nop
.L30:
.align 2
.L29:
.long   ___lshrsi3_r0
.size   _dyn_rshift, .-_dyn_rshift
.align 1
.align 2
.global _really_dyn_lshift
.type   _really_dyn_lshift, @function
_really_dyn_lshift:
mov.l   .L33,r1
sts.l   pr,@-r15
jsr @r1
mov r5,r0
lds.l   @r15+,pr
rts
nop
.L34:
.align 2
.L33:
.long   ___ashlsi3_r0
.size   _really_dyn_lshift, .-_really_dyn_lshift
.align 1
.align 2
.global _really_dyn_rshift
.type   _really_dyn_rshift, @function
_really_dyn_rshift:
mov.l   .L37,r1
sts.l   pr,@-r15
jsr @r1
mov r5,r0
lds.l   @r15+,pr
rts
nop
.L38:
.align 2
.L37:
.long   ___lshrsi3_r0
.size   _really_dyn_rshift, .-_really_dyn_rshift
.ident  "GCC: (GNU) 12.3.0"

With patch disabled, -O2 -ml

$ cat f.s
.file   "f.c"
.text
.little
.text
.align 1
.align 2

[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction

2023-05-24 Thread olegendo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #36 from Oleg Endo  ---
(In reply to Alexander Klepikov from comment #35)
> 
> As I understand, you meant the following (I added new functions at the end
> of file):
> 
> $ cat f.c
> #define ADDR 0x
> #define P ((unsigned char *)ADDR)
> #define FLAG 0x40
> #define S 7
> 

Yes, that's what I meant, thanks.

Can you also compile for little endian, and most of all, use -O2 optimization
level.  Some optimizations are not done below -O2.

> 
> I chose that name because I wanted to disable dynamic shift instructions
> for all CPUs. I did not expect that it would affect SH-2E code in such a way.
> 
> I can rewrite the patch so that it only affects CPUs that do not support
> dynamic shifts and disables the library call for dynamic shifts. I'll do it
> anyway because I need it badly. What do you think, which option name would
> be better: '-mdisable-dynshift-libcall' or '-mhw-shift'? Or, if you want,
> please suggest another one. Thank you!

'-mdisable-dynshift-libcall' would be more appropriate for what it tries to do,
I think.  Although that is a whole different issue ... but what is it going to
do for real dynamic shifts on SH2?

What kind of code is it supposed to emit for things like

unsigned int dyn_shift (unsigned int x, unsigned int y)
{
  return x << (y & 31);
}

[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction

2023-05-24 Thread klepikov.alex+bugs at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #35 from Alexander Klepikov ---
(In reply to Oleg Endo from comment #34)
> Bit-tests of char and unsigned char should be covered by the test-suite and
> should work -- at least originally.  However, what might be triggering this
> problem is the '== FLAG' comparison.  When I was working on this issue I
> only used '== 0' or '!= 0' comparison.  I can imagine that your test code
> triggers some other middle end optimizations and hence we get this.

Yes, I am sure that the problem is the '== FLAG' comparison. Before I reported
that bug, I tried to bypass it and this macro does not produce shift
instructions even on GCC 4.7:

#define BIT_MASK_IS_SET_(VALUE, BITMASK)\
({int _value = VALUE & BITMASK,\
_result;\
if (_value == BITMASK){\
_result = 1;\
}\
else {\
_result = 0;\
}\
_result;})

So this is definitely the comparison.

> 
> Can you try to rewrite your test code to something like this?
> 
> unsigned int f(char v){
> return (v & FLAG) != 0;
> }
> 
> ... and see if it generates the tst instruction as expected?
> 

As I understand, you meant the following (I added new functions at the end of
file):

$ cat f.c
#define ADDR 0x
#define P ((unsigned char *)ADDR)
#define FLAG 0x40
#define S 7

unsigned char f_char_var(char v){
return (v & FLAG) == FLAG;
}

unsigned char f_unsigned_char_var(unsigned char v){
return (v & FLAG) == FLAG;
}

unsigned char f_symbol(void){
return (*P & FLAG) == FLAG;
}

unsigned char f_symbol_zero(void){
return (*P & FLAG) == 0;
}

unsigned char f_symbol_non_zero(void){
return (*P & FLAG) != 0;
}

Compiler flags: -c -mrenesas -m2e -mb -O -fno-toplevel-reorder

With patch disabled:

$ cat f_clean.s
.file   "f.c"
.text
.text
.align 1
.global _f_char_var
.type   _f_char_var, @function
_f_char_var:
sts.l   pr,@-r15
mov.l   .L3,r1
jsr @r1
exts.b  r4,r4
mov r4,r0
and #1,r0
lds.l   @r15+,pr
rts
nop
.L4:
.align 2
.L3:
.long   ___ashiftrt_r4_6
.size   _f_char_var, .-_f_char_var
.align 1
.global _f_unsigned_char_var
.type   _f_unsigned_char_var, @function
_f_unsigned_char_var:
sts.l   pr,@-r15
mov.l   .L7,r1
jsr @r1
exts.b  r4,r4
mov r4,r0
and #1,r0
lds.l   @r15+,pr
rts
nop
.L8:
.align 2
.L7:
.long   ___ashiftrt_r4_6
.size   _f_unsigned_char_var, .-_f_unsigned_char_var
.align 1
.global _f_symbol
.type   _f_symbol, @function
_f_symbol:
sts.l   pr,@-r15
mov.l   .L11,r1
mov.b   @r1,r4
mov.l   .L12,r1
jsr @r1
nop
mov r4,r0
and #1,r0
lds.l   @r15+,pr
rts
nop
.L13:
.align 2
.L11:
.long   -65536
.L12:
.long   ___ashiftrt_r4_6
.size   _f_symbol, .-_f_symbol
.align 1
.global _f_symbol_zero
.type   _f_symbol_zero, @function
_f_symbol_zero:
mov.l   .L15,r1
mov.b   @r1,r0
tst #64,r0
rts
movt r0
.L16:
.align 2
.L15:
.long   -65536
.size   _f_symbol_zero, .-_f_symbol_zero
.align 1
.global _f_symbol_non_zero
.type   _f_symbol_non_zero, @function
_f_symbol_non_zero:
sts.l   pr,@-r15
mov.l   .L19,r1
mov.b   @r1,r4
mov.l   .L20,r1
jsr @r1
nop
mov r4,r0
and #1,r0
lds.l   @r15+,pr
rts
nop
.L21:
.align 2
.L19:
.long   -65536
.L20:
.long   ___ashiftrt_r4_6
.size   _f_symbol_non_zero, .-_f_symbol_non_zero
.ident  "GCC: (GNU) 12.3.0"

With patch enabled:

$ cat f.s
.file   "f.c"
.text
.text
.align 1
.global _f_char_var
.type   _f_char_var, @function
_f_char_var:
mov r4,r0
tst #64,r0
mov #-1,r0
rts
negc r0,r0
.size   _f_char_var, .-_f_char_var
.align 1
.global _f_unsigned_char_var
.type   _f_unsigned_char_var, @function
_f_unsigned_char_var:
mov r4,r0
tst #64,r0
mov #-1,r0
rts
negc r0,r0
.size   _f_unsigned_char_var, .-_f_unsigned_char_var
.align 1
.global _f_symbol
.type   _f_symbol, @function
_f_symbol:
mov.l   .L4,r1
mov.b   @r1,r0
tst #64,r0
mov #-1,r0
rts
negc r0,r0
.L5:
.align 2
.L4:
.long   -65536
.size   _f_symbol, .-_f_symbol
.align 1
.global _f_symbol_zero
.type   _f_symbol_zero, @function
_f_symbol_zero:

[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction

2023-05-23 Thread olegendo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #34 from Oleg Endo  ---
(In reply to Alexander Klepikov from comment #33)
> Created attachment 55142 [details]
> Disable dynamic shift instructions patch

First of all, thanks for digging into this.  This issue has been a can of
worms, due to all sorts of reasons.

As you have discovered, some code patterns take the shift instruction route,
which is basically decided earlier by the various middle-end optimizations. 
There have also been some changes to those parts recently, but I haven't been
watching what it does for SH.

> unsigned int f(char v){
> return (v & FLAG) == FLAG;
> }

Bit-tests of char and unsigned char should be covered by the test-suite and
should work -- at least originally.  However, what might be triggering this
problem is the '== FLAG' comparison.  When I was working on this issue I only
used '== 0' or '!= 0' comparison.  I can imagine that your test code triggers
some other middle end optimizations and hence we get this.

Can you try to rewrite your test code to something like this?

unsigned int f(char v){
return (v & FLAG) != 0;
}

... and see if it generates the tst instruction as expected?


> I also compiled my project with '-m2e' and the new '-mdisable-dynshift'
> options and tested it in SH-2E mode on Renesas's emulator that comes
> with High-performance Embedded Workshop, and all unit tests ran as expected.

I'm not sure what the purpose of the '-mdisable-dynshift' option would be here
though.  For '-m2e' TARGET_DYNSHIFT is already 'false'.  So the option seems
misnamed.

[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction

2023-05-23 Thread klepikov.alex+bugs at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #33 from Alexander Klepikov ---
Created attachment 55142
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55142&action=edit
Disable dynamic shift instructions patch

[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction

2023-05-23 Thread klepikov.alex+bugs at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #32 from Alexander Klepikov ---
I'm not sure whether I should write here or open a new discussion, but these
topics are very closely related. I've been writing a patch to completely
eliminate the generation of the dynamic shift instructions 'shad' and 'shld',
at least for the SH4 CPU. And then I got a surprising result - in all the
examples I gave earlier, the library call is converted to 'tst' instructions!

Here is the patch itself (I also will attach a file):

--- ../gcc-12.3.0.orig/gcc/config/sh/sh.cc  2023-05-08 15:14:39.681161695
+0300
+++ ./gcc/config/sh/sh.cc   2023-05-23 12:23:25.964375731 +0300
@@ -3061,7 +3061,7 @@
   else
 insn_count = ashl_lshr_seq[shift_amount_i].insn_count;

-  return TARGET_DYNSHIFT && (insn_count > 1 + SH_DYNAMIC_SHIFT_COST);
+  return TARGET_DYNSHIFT && (insn_count > 1 + SH_DYNAMIC_SHIFT_COST) && !
disable_dynshift;
 }

 /* Assuming we have a value that has been sign-extended by at least one bit,
@@ -3812,8 +3812,10 @@
   rtx wrk;
   char func[18];
   int value;
+  int long_shift  = disable_dynshift ? 30 : 19;
+  int short_shift = disable_dynshift ? 15 : 5;

-  if (TARGET_DYNSHIFT)
+  if (TARGET_DYNSHIFT && ! disable_dynshift)
 {
   if (!CONST_INT_P (operands[2]))
{
@@ -3851,7 +3853,7 @@
   emit_insn (gen_ashrsi2_31 (operands[0], operands[1]));
   return true;
 }
-  else if (value >= 16 && value <= 19)
+  else if (value >= 16 && value <= long_shift)
 {
   wrk = gen_reg_rtx (SImode);
   emit_insn (gen_ashrsi2_16 (wrk, operands[1]));
@@ -3862,7 +3864,7 @@
   return true;
 }
   /* Expand a short sequence inline, longer call a magic routine.  */
-  else if (value <= 5)
+  else if (value <= short_shift)
 {
   wrk = gen_reg_rtx (SImode);
   emit_move_insn (wrk, operands[1]);
diff -ur ../gcc-12.3.0.orig/gcc/config/sh/sh.opt ./gcc/config/sh/sh.opt
--- ../gcc-12.3.0.orig/gcc/config/sh/sh.opt 2023-05-08 15:14:39.689161810
+0300
+++ ./gcc/config/sh/sh.opt  2023-05-23 10:45:36.814371159 +0300
@@ -301,3 +301,7 @@
 mlra
 Target Var(sh_lra_flag) Init(0) Save
 Use LRA instead of reload (transitional).
+
+mdisable-dynshift
+Target Var(disable_dynshift) Init(0)
+Disable dynamic shift 'shad' and 'shld' instructions

And here are my tests:
$ cat f.c
#define ADDR 0x
#define P ((unsigned char *)ADDR)
#define FLAG 0x40
#define S 7

unsigned char f(char v){
return (v & FLAG) == FLAG;
}

unsigned char f_(unsigned char v){
return (v & FLAG) == FLAG;
}

unsigned char f1(void){
return (*P & FLAG) == FLAG;
}

int f_signed_rshift(int v){
return v >> S;
}

int f_signed_lshift(int v){
return v << S;
}

unsigned int f_unsigned_rshift(unsigned int v){
return v >> S;
}

unsigned int f_unsigned_lshift(unsigned int v){
return v << S;
}

$ /usr/local/sh-toolchain/bin/sh-elf-gcc -c -mrenesas -m2e -mb -O
-fno-toplevel-reorder -mdisable-dynshift -S f.c
$ cat f.s
.file   "f.c"
.text
.text
.align 1
.global _f
.type   _f, @function
_f:
mov r4,r0
tst #64,r0
mov #-1,r0
rts
negc    r0,r0
.size   _f, .-_f
.align 1
.global _f_
.type   _f_, @function
_f_:
mov r4,r0
tst #64,r0
mov #-1,r0
rts
negc    r0,r0
.size   _f_, .-_f_
.align 1
.global _f1
.type   _f1, @function
_f1:
mov.l   .L4,r1
mov.b   @r1,r0
tst #64,r0
mov #-1,r0
rts
negc    r0,r0
.L5:
.align 2
.L4:
.long   -65536
.size   _f1, .-_f1
.align 1
.global _f_signed_rshift
.type   _f_signed_rshift, @function
_f_signed_rshift:
mov r4,r0
shar    r0
shar    r0
shar    r0
shar    r0
shar    r0
shar    r0
rts
shar    r0
.size   _f_signed_rshift, .-_f_signed_rshift
.align 1
.global _f_signed_lshift
.type   _f_signed_lshift, @function
_f_signed_lshift:
mov r4,r0
shll2   r0
shll2   r0
add r0,r0
rts
shll2   r0
.size   _f_signed_lshift, .-_f_signed_lshift
.align 1
.global _f_unsigned_rshift
.type   _f_unsigned_rshift, @function
_f_unsigned_rshift:
mov r4,r0
shlr2   r0
shlr2   r0
shlr    r0
rts
shlr2   r0
.size   _f_unsigned_rshift, .-_f_unsigned_rshift
.align 1
.global _f_unsigned_lshift
.type   _f_unsigned_lshift, @function
_f_unsigned_lshift:
mov r4,r0
shll2   r0
shll2   r0
add r0,r0
rts
shll2   r0
.size   _f_unsigned_lshift, .-_f_unsigned_lshift
.ident  "GCC: (GNU) 12.3.0"

I also compiled my project with '-m2e' and new '-mdisable-dynshift' options and
tested it in SH-2E mode on Renesas's emulator that comes with High-performance
Embedded Workshop, and all unit tests run as expected.

[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction

2023-05-12 Thread klepikov.alex+bugs at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

Alexander Klepikov  changed:

   What|Removed |Added

 CC||klepikov.alex+bugs at gmail dot com

--- Comment #31 from Alexander Klepikov  
---
I've found new cases for SH2 and SH2E CPUs only:

#define FLAG 0x40

unsigned int f(char v){
return (v & FLAG) == FLAG;
}

For both big and little endian this translates to a dynamic shift library call:

-O -m2 (or -m2e)

_f:
sts.l   pr,@-r15
mov.l   .L3,r1
jsr @r1
exts.b  r4,r4
mov r4,r0
and #1,r0
lds.l   @r15+,pr
rts 
nop
.L3:
.long   ___ashiftrt_r4_6

And

#define FLAG 0x40
#define ADDR 0x
#define P ((unsigned char *)ADDR)

unsigned int f(void){
return (*P & FLAG) == FLAG;
}

Translates to

_f:
sts.l   pr,@-r15
mov.l   .L3,r1
mov.b   @r1,r4
mov.l   .L4,r1
jsr @r1
nop
mov r4,r0
and #1,r0
lds.l   @r15+,pr
rts 
nop
.L3:
.long   -65536
.L4:
.long   ___ashiftrt_r4_6

Assembler output does not depend on the ADDR value, but it does depend on the
variable (or pointer) type. When the type is integer, the generated code uses
'tst' for all of the options '-m4', '-m2', '-m2e':

#define FLAG 0x40

unsigned int f(unsigned int v){
return (v & FLAG) == FLAG;
}

translates to 

_f:
mov r4,r0
tst #64,r0
mov #-1,r0
rts 
negcr0,r0

and

#define FLAG 0x40
#define ADDR 0x
#define P ((unsigned int *)ADDR)

unsigned int f(void){
return (*P & FLAG) == FLAG;
}

translates to 

_f:
mov.l   .L2,r1
mov.l   @r1,r0
tst #64,r0
mov #-1,r0
rts 
negcr0,r0
.L2:
.long   -65536

Interestingly, when the '-m4' flag is specified, later GCC versions always
generate code with the 'tst' instruction. I played with godbolt and here is
what I found. GCC 4 uses the dynamic shift 'shad' with the '-m4' option and a
library call with '-m2' or '-m2e'. GCC 9.5 and later uses 'tst' with '-m4', a
library call with both '-m2' and '-m2e' when FLAG==0x40, and a 'shll'
instruction with both '-m2' and '-m2e' when FLAG==0x80. I remind you that this
happens only when the char type is used.

Maybe SH4 solution can be extended to support SH2/SH2E?
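For reference, what the middle end appears to be doing in the char case is rewriting the single-bit test as a shift-and-mask extract, which is where the ___ashiftrt_r4_6 libcall in the listing above comes from. A host-side sketch of the two forms (helper names are mine; this assumes GCC's arithmetic right shift of negative signed values, which is implementation-defined in ISO C):

```c
#include <assert.h>

#define FLAG 0x40  /* bit 6 */

/* the reported form, which ends up calling ___ashiftrt_r4_6 on SH2/SH2E */
static unsigned int by_compare(char v) { return (v & FLAG) == FLAG; }

/* the shift-and-mask extract of bit 6 the compiler rewrites it into */
static unsigned int by_extract(char v) { return (v >> 6) & 1; }

/* both forms agree for every char value */
static int extract_matches_compare(void)
{
    for (int i = -128; i <= 127; ++i)
        if (by_compare((char)i) != by_extract((char)i))
            return 0;
    return 1;
}
```

The final `& 1` discards whatever the shift fills the high bits with, so the extract matches the compare regardless of whether the right shift is arithmetic or logical.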

[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction

2018-10-09 Thread olegendo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #30 from Oleg Endo  ---
(In reply to Eric Gallager from comment #29)
> 
> So maybe it's worth splitting up into sub-issues?

It'd be better to, yes.  But at the moment I don't have a lot of time to go
through all the cases and factor out the individual cases.  Please leave this
open.  It will be useful if I (or others) get back to active SH development in
the future.

[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction

2018-10-08 Thread egallager at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #29 from Eric Gallager  ---
(In reply to Oleg Endo from comment #28)
> (In reply to Eric Gallager from comment #27)
> > (In reply to Oleg Endo from comment #26)
> > > Author: olegendo
> > > Date: Mon Jan 26 23:56:05 2015
> > > New Revision: 220144
> 
> Well, it fixed some of the cases mentioned in this PR, but not all.  It's
> quite a  broad issue actually.

So maybe it's worth splitting up into sub-issues?

[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction

2018-07-09 Thread olegendo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #28 from Oleg Endo  ---
(In reply to Eric Gallager from comment #27)
> (In reply to Oleg Endo from comment #26)
> > Author: olegendo
> > Date: Mon Jan 26 23:56:05 2015
> > New Revision: 220144

Well, it fixed some of the cases mentioned in this PR, but not all.  It's quite
a  broad issue actually.

[Bug target/49263] SH Target: underutilized "TST #imm, R0" instruction

2018-07-08 Thread egallager at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

Eric Gallager  changed:

   What|Removed |Added

 CC||egallager at gcc dot gnu.org

--- Comment #27 from Eric Gallager  ---
(In reply to Oleg Endo from comment #26)
> Author: olegendo
> Date: Mon Jan 26 23:56:05 2015
> New Revision: 220144
> 
> URL: https://gcc.gnu.org/viewcvs?rev=220144&root=gcc&view=rev
> Log:
> gcc/
>   PR target/49263
>   * config/sh/sh.c (sh_split_treg_set_expr): Invoke emit_insn before
>   remove_insn.
>   * config/sh/sh.md (tstsi_t): Don't try to optimize constant with right
>   shifts if it already fits into K08.
> 
> gcc/testsuite/
>   PR target/49263
>   * gcc.target/sh/pr49263-4.c: New.
> 
> 
> Added:
> trunk/gcc/testsuite/gcc.target/sh/pr49263-4.c
> Modified:
> trunk/gcc/ChangeLog
> trunk/gcc/config/sh/sh.c
> trunk/gcc/config/sh/sh.md
> trunk/gcc/testsuite/ChangeLog

Did this fix it?

[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2015-01-26 Thread olegendo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #26 from Oleg Endo olegendo at gcc dot gnu.org ---
Author: olegendo
Date: Mon Jan 26 23:56:05 2015
New Revision: 220144

URL: https://gcc.gnu.org/viewcvs?rev=220144&root=gcc&view=rev
Log:
gcc/
PR target/49263
* config/sh/sh.c (sh_split_treg_set_expr): Invoke emit_insn before
remove_insn.
* config/sh/sh.md (tstsi_t): Don't try to optimize constant with right
shifts if it already fits into K08.

gcc/testsuite/
PR target/49263
* gcc.target/sh/pr49263-4.c: New.


Added:
trunk/gcc/testsuite/gcc.target/sh/pr49263-4.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/sh/sh.c
trunk/gcc/config/sh/sh.md
trunk/gcc/testsuite/ChangeLog


[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2015-01-24 Thread olegendo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #25 from Oleg Endo olegendo at gcc dot gnu.org ---
Author: olegendo
Date: Sat Jan 24 13:04:53 2015
New Revision: 220081

URL: https://gcc.gnu.org/viewcvs?rev=220081&root=gcc&view=rev
Log:
gcc/
PR target/49263
PR target/53987
PR target/64345
PR target/59533
PR target/52933
PR target/54236
PR target/51244
* config/sh/sh-protos.h
(sh_extending_set_of_reg::can_use_as_unextended_reg,
sh_extending_set_of_reg::use_as_unextended_reg,
sh_is_nott_insn, sh_movt_set_dest, sh_movrt_set_dest, sh_is_movt_insn,
sh_is_movrt_insn, sh_insn_operands_modified_between_p,
sh_reg_dead_or_unused_after_insn, sh_in_recog_treg_set_expr,
sh_recog_treg_set_expr, sh_split_treg_set_expr): New functions.
(sh_treg_insns): New class.
* config/sh/sh.c (TARGET_LEGITIMATE_COMBINED_INSN): Define target hook.
(scope_counter): New class.
(sh_legitimate_combined_insn, sh_is_nott_insn, sh_movt_set_dest,
sh_movrt_set_dest, sh_reg_dead_or_unused_after_insn,
sh_extending_set_of_reg::can_use_as_unextended_reg,
sh_extending_set_of_reg::use_as_unextended_reg, sh_recog_treg_set_expr,
sh_in_recog_treg_set_expr, sh_try_split_insn_simple,
sh_split_treg_set_expr): New functions.
(addsubcosts): Handle treg_set_expr.
(sh_rtx_costs): Handle IF_THEN_ELSE and ZERO_EXTRACT.
(sh_rtx_costs): Use arith_reg_operand in SIGN_EXTEND and ZERO_EXTEND.
(sh_rtx_costs): Handle additional bit test patterns in EQ and AND cases.
(sh_insn_operands_modified_between_p): Make non-static.
* config/sh/predicates.md (zero_extend_movu_operand): Allow
simple_mem_operand in addition to displacement_mem_operand.
(zero_extend_operand): Don't allow zero_extend_movu_operand.
(treg_set_expr, treg_set_expr_not_const01,
arith_reg_or_treg_set_expr): New predicates.
* config/sh/sh.md (tstsi_t): Use arith_reg_operand and
arith_or_int_operand instead of logical_operand.  Convert to
insn_and_split.  Try to optimize constant operand in splitter.
(tsthi_t, tstqi_t): Fold into *tstmode_t.  Convert to insn_and_split.
(*tstqi_t_zero): Delete.
(*tstmode_t_subregs): Add !sh_in_recog_treg_set_expr split condition.
(tstsi_t_and_not): Delete.
(tstmode_t_zero_extract_eq): Rename to *tstmode_t_zero_extract.
Convert to insn_and_split.
(unnamed split, tstsi_t_zero_extract_xor,
tstsi_t_zero_extract_subreg_xor_little,
tstsi_t_zero_extract_subreg_xor_big): Delete.
(*tstsi_t_shift_mask): New insn_and_split.
(cmpeqsi_t, cmpgesi_t): Add new split for const_int 0 operands and try
to recombine with surrounding insns when splitting.
(*negtstsi): Add !sh_in_recog_treg_set_expr condition.
(cmp_div0s_0, cmp_div0s_1, *cmp_div0s_0, *cmp_div0s_1): Rewrite as ...
(cmp_div0s, *cmp_div0s_1, *cmp_div0s_2, *cmp_div0s_3, *cmp_div0s_4,
*cmp_div0s_5, *cmp_div0s_6): ... these new insn_and_split patterns.
(*cbranch_div0s): Delete.
(*addc): Convert to insn_and_split.  Use treg_set_expr as 3rd operand.
Try to recombine with surrounding insns when splitting.  Add operand
order variants.
(*addc_t_r, *addc_r_t): Use treg_set_expr_not_const01.
(*addc_r_r_1, *addc_r_lsb, *addc_r_r_lsb, *addc_r_lsb_r, *addc_r_msb,
*addc_r_r_msb, *addc_2r_msb): Delete.
(*addc_2r_lsb): Rename to *addc_2r_t.  Use treg_set_expr.  Add operand
order variant.
(*addc_negreg_t): New insn_and_split.
(*subc): Convert to insn_and_split.  Use treg_set_expr as 3rd operand.
Try to recombine with surrounding insns when splitting.
Add operand order variants.  
(*subc_negt_reg, *subc_negreg_t, *reg_lsb_t, *reg_msb_t): New
insn_and_split patterns.
(*rotcr): Use arith_reg_or_treg_set_expr.  Try to recombine with
surrounding insns when splitting.
(unnamed rotcr split): Use arith_reg_or_treg_set_expr.
(*rotcl): Likewise.  Add zero_extract variant.
(*ashrsi2_31): New insn_and_split.
(*negc): Convert to insn_and_split.  Use treg_set_expr.
(*zero_extendmodesi2_disp_mem): Update comment.
(movrt_negc, *movrt_negc, nott): Add !sh_in_recog_treg_set_expr split
condition.
(*mov_t_msb_neg, mov_neg_si_t): Use treg_set_expr.  Try to recombine
with surrounding insns when splitting.
(any_treg_expr_to_reg): New insn_and_split.
(*neg_zero_extract_0, *neg_zero_extract_1, *neg_zero_extract_2,
*neg_zero_extract_3, *neg_zero_extract_4, *neg_zero_extract_5,
*neg_zero_extract_6, *zero_extract_0, *zero_extract_1,
*zero_extract_2): New single bit zero extract patterns.
(bld_reg, *bld_regqi): Fold into bldmode_reg.
(*get_thread_pointersi, store_gbr, *movmode_gbr_load,
*movmode_gbr_load, *movmode_gbr_load, *movmode_gbr_load,
*movdi_gbr_load): Use arith_reg_dest instead of register_operand for
set destination.
(set_thread_pointersi, load_gbr): Use arith_reg_operand instead of
register_operand for 

[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2014-12-30 Thread olegendo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #23 from Oleg Endo olegendo at gcc dot gnu.org ---
Author: olegendo
Date: Tue Dec 30 18:44:27 2014
New Revision: 219111

URL: https://gcc.gnu.org/viewcvs?rev=219111&root=gcc&view=rev
Log:
gcc/testsuite/
PR target/49263
* gcc.target/sh/pr49263-1.c: New.
* gcc.target/sh/pr49263-2.c: New.

Added:
trunk/gcc/testsuite/gcc.target/sh/pr49263-1.c
trunk/gcc/testsuite/gcc.target/sh/pr49263-2.c
Modified:
trunk/gcc/testsuite/ChangeLog


[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2014-12-30 Thread olegendo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #24 from Oleg Endo olegendo at gcc dot gnu.org ---
Author: olegendo
Date: Tue Dec 30 19:11:42 2014
New Revision: 219113

URL: https://gcc.gnu.org/viewcvs?rev=219113&root=gcc&view=rev
Log:
gcc/testsuite/
PR target/49263
* gcc.target/sh/sh.exp (check_effective_target_sh2a): New.
* gcc.target/sh/pr49263-3.c: New.

Added:
trunk/gcc/testsuite/gcc.target/sh/pr49263-3.c
Modified:
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/gcc.target/sh/sh.exp


[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2013-12-17 Thread olegendo at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #22 from Oleg Endo olegendo at gcc dot gnu.org ---
(In reply to Oleg Endo from comment #21)
 What happens is that the sequence is expanded to RTL as follows:
 
 (insn 7 4 8 2 (set (reg:SI 163 [ D.1856 ])
 (and:SI (reg/v:SI 162 [ xb ])
 (const_int 33 [0x21]))) sh_tmp.cpp:17 -1
  (nil))
 (insn 8 7 9 2 (set (reg:SI 147 t)
 (eq:SI (reg:SI 163 [ D.1856 ])
 (const_int 0 [0]))) sh_tmp.cpp:17 -1
  (nil))
 (jump_insn 9 8 10 2 (set (pc)
 (if_then_else (eq (reg:SI 147 t)
 (const_int 0 [0]))
 (label_ref:SI 15)
 (pc))) sh_tmp.cpp:17 301 {*cbranch_t}
  (int_list:REG_BR_PROB 3900 (nil))
 -> 15)
 (note 10 9 11 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
 (insn 11 10 12 4 (set (reg:SI 164)
 (const_int 0 [0])) sh_tmp.cpp:18 -1
  (nil))
 (insn 12 11 15 4 (set (mem:SI (reg/v/f:SI 161 [ x ]) [2 *x_5(D)+0 S4 A32])
 (reg:SI 164)) sh_tmp.cpp:18 -1
  (nil))
 
 
 and insn 11 becomes dead code and is eliminated.
 All of that happens long time before combine, so the tst combine patterns
 have no chance to reconstruct the original code.
 

Adding an early peephole pass as described in
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59533#c2 and then adding the
following peephole:

;; Peephole after initial expansion.
(define_peephole2
  [(set (match_operand:SI 0 "arith_reg_dest")
        (and:SI (match_operand:SI 1 "arith_reg_operand")
                (match_operand:SI 2 "logical_operand")))
   (set (reg:SI T_REG) (eq:SI (match_dup 0) (const_int 0)))]
  "TARGET_SH1 && can_create_pseudo_p ()"
  [(set (reg:SI T_REG) (eq:SI (and:SI (match_dup 1) (match_dup 2))
                              (const_int 0)))
   (set (match_dup 0) (and:SI (match_dup 1) (match_dup 2)))])

... fixes the problem and results in more uses of the tst #imm,r0 insn
according to the CSiBE set.  On the other hand there is a total code size
increase of 792 bytes on the whole set.  Below are some things that get worse
in the Linux source (mm/filemap.c):

mov.b   @(15,r1),r0      -   mov.b   @(15,r1),r0
cmp/pz  r0                   tst     #128,r0      // cmp/pz has less
bf      .L1016               bf      .L1001       // pressure on r0


mov.b   @(15,r0),r0      -   mov.b   @(15,r0),r0
tst     #4,r0                shar    r0
bf      .L107                shar    r0
                             tst     #1,r0


add     #16,r0           -   add     #16,r0
mov.b   @(15,r0),r0          mov.b   @(15,r0),r0
tst     #16,r0               mov     #-4,r1
bf/s    .L509                shad    r1,r0
                             tst     #1,r0
                             bf/s    .L509


[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2013-12-08 Thread olegendo at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #21 from Oleg Endo olegendo at gcc dot gnu.org ---
(In reply to Oleg Endo from comment #18)
 
 It seems that combine is trying to look for the following patterns:
 
 Failed to match this instruction:
 (set (pc)
 (if_then_else (ne (and:SI (reg:SI 5 r5 [ xb ])
 (const_int 85 [0x55]))
 (const_int 0 [0]))
 (label_ref:SI 15)
 (pc)))

Implementing such a combine pattern like ...
(define_insn_and_split "*tst_cbranch"
  [(set (pc)
        (if_then_else (ne (and:SI (match_operand:SI 0 "logical_operand")
                                  (match_operand:SI 1 "const_int_operand"))
                          (const_int 0))
                      (label_ref (match_operand 2))
                      (pc)))
   (clobber (reg:SI T_REG))]
  "TARGET_SH1"
  "#"
  "&& 1"
  [(set (reg:SI T_REG) (eq:SI (and:SI (match_dup 0) (match_dup 1))
                              (const_int 0)))
   (set (pc) (if_then_else (eq (reg:SI T_REG) (const_int 0))
                           (label_ref (match_dup 2))
                           (pc)))])


results in code such as the following:
mov #33,r1
mov r5,r0
tst #33,r0
bf/s    .L3
and r5,r1
mov.l   r1,@r4
.L3:
rts
nop

which is worse.
What happens is that the sequence is expanded to RTL as follows:

(insn 7 4 8 2 (set (reg:SI 163 [ D.1856 ])
(and:SI (reg/v:SI 162 [ xb ])
(const_int 33 [0x21]))) sh_tmp.cpp:17 -1
 (nil))
(insn 8 7 9 2 (set (reg:SI 147 t)
(eq:SI (reg:SI 163 [ D.1856 ])
(const_int 0 [0]))) sh_tmp.cpp:17 -1
 (nil))
(jump_insn 9 8 10 2 (set (pc)
(if_then_else (eq (reg:SI 147 t)
(const_int 0 [0]))
(label_ref:SI 15)
(pc))) sh_tmp.cpp:17 301 {*cbranch_t}
 (int_list:REG_BR_PROB 3900 (nil))
 -> 15)
(note 10 9 11 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
(insn 11 10 12 4 (set (reg:SI 164)
(const_int 0 [0])) sh_tmp.cpp:18 -1
 (nil))
(insn 12 11 15 4 (set (mem:SI (reg/v/f:SI 161 [ x ]) [2 *x_5(D)+0 S4 A32])
(reg:SI 164)) sh_tmp.cpp:18 -1
 (nil))


and the cse1 pass decides that the result of the and operation can be shared
and replaces the operand in insn 12 with reg:SI 163:

(insn 12 11 15 3 (set (mem:SI (reg/v/f:SI 161 [ x ]) [2 *x_5(D)+0 S4 A32])
(reg:SI 163 [ D.1856 ])) sh_tmp.cpp:18 258 {movsi_ie}
 (expr_list:REG_DEAD (reg:SI 164)
(expr_list:REG_DEAD (reg/v/f:SI 161 [ x ])
(nil))))

and insn 11 becomes dead code and is eliminated.
All of that happens long time before combine, so the tst combine patterns have
no chance to reconstruct the original code.

A sequence such as

mov r5,r0
mov #0,r1
tst #33,r0
bf  .L3
mov.l   r1,@r4
.L3:
rts
nop

could probably be achieved by combining insn 7 and insn 8 shortly after RTL
expansion, or even during the expansion of insn 8 (by looking at previous
already expanded insns and emitting a tst insn directly).
The idea would be to reduce dependencies on the tested register which allows
better scheduling.  In addition to that, on SH4A mov #imm8,Rn is an MT group
instruction which has a higher probability of being executed in parallel.


[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2012-10-31 Thread olegendo at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263



--- Comment #20 from Oleg Endo olegendo at gcc dot gnu.org 2012-10-31 
13:47:07 UTC ---

(In reply to comment #19)
 Another thing I've noticed...
 Cases such as:

 mov.l   r0,@r2    ! LS
 mov     r13,r0    ! MT
 and     #7,r0     ! EX
 tst     r0,r0     ! MT
 bt/s    .L8       ! BR
 mov.l   r0,@(16,r1)

 where the result of the and op is re-used would be slightly better as:

 mov.l   r0,@r2    ! LS
 mov     r13,r0    ! MT
 tst     #7,r0     ! MT
 and     #7,r0     ! EX
 bt/s    .L8       ! BR
 mov.l   r0,@(16,r1)

 because it reduces dependency on the result of the and op and thus has a
 higher chance to be executed in parallel.

Other similar cases where hoisting the tst insn would make sense:

mov.b   @(13,r2),r0
extu.b  r0,r0
tst     #1,r0
bt      .L53

mov.b   @(13,r1),r0
extu.b  r0,r0
tst     r0,r0
bt/s    .L37


[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2012-10-28 Thread olegendo at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263



--- Comment #19 from Oleg Endo olegendo at gcc dot gnu.org 2012-10-28 
22:01:47 UTC ---

Another thing I've noticed...

Cases such as:

mov.l   r0,@r2    ! LS
mov     r13,r0    ! MT
and     #7,r0     ! EX
tst     r0,r0     ! MT
bt/s    .L8       ! BR
mov.l   r0,@(16,r1)

where the result of the and op is re-used would be slightly better as:

mov.l   r0,@r2    ! LS
mov     r13,r0    ! MT
tst     #7,r0     ! MT
and     #7,r0     ! EX
bt/s    .L8       ! BR
mov.l   r0,@(16,r1)

because it reduces dependency on the result of the and op and thus has a higher
chance to be executed in parallel.


[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2012-08-27 Thread olegendo at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

Oleg Endo olegendo at gcc dot gnu.org changed:

   What|Removed |Added

 Status|RESOLVED|REOPENED
   Last reconfirmed||2012-08-27
 Resolution|FIXED   |
 AssignedTo|unassigned at gcc dot gnu.org |olegendo at gcc dot gnu.org
 Ever Confirmed|0   |1

--- Comment #18 from Oleg Endo olegendo at gcc dot gnu.org 2012-08-27 
19:51:42 UTC ---
Not quite so done, actually.  The following example case does not work
properly:


void test00 (int* x, int xb)
{
  if (! (xb & 128))
x[0] = 0;
}

-O2 -m4:
mov r5,r0
and #128,r0
tst r0,r0
bf  .L3
mov.l   r0,@r4
.L3:
rts
nop



void test01 (int* x, int xb)
{
  if (! (xb & 0x55))
x[0] = 0;
}

-O2 -m4:
mov r5,r0
and #85,r0
tst r0,r0
bf  .L7
mov.l   r0,@r4
.L7:
rts
nop


It seems that combine is trying to look for the following patterns:

Failed to match this instruction:
(set (pc)
(if_then_else (ne (zero_extract:SI (reg:SI 5 r5 [ xb ])
(const_int 1 [0x1])
(const_int 7 [0x7]))
(const_int 0 [0]))
(label_ref:SI 15)
(pc)))


Failed to match this instruction:
(set (pc)
(if_then_else (ne (and:SI (reg:SI 5 r5 [ xb ])
(const_int 85 [0x55]))
(const_int 0 [0]))
(label_ref:SI 15)
(pc)))



Another case that could see some improvement ...

bool test04 (int* x, int xb)
{
  return ! (xb & 0x55);
}

-O2 -m4 (OK):
mov r5,r0
tst #85,r0
rts
movtr0


bool test02 (int* x, int xb)
{
  return ! (xb & (1 << 6));
}

-O2 -m4 (NG):
mov r5,r0
mov #-6,r1
shldr1,r0
xor #1,r0
rts
and #1,r0
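The NG sequence above is the expanded form of a bit-6 extract followed by a flip: shift right by 6 (shld with a negative count is a logical right shift), xor #1, and #1. A host-side sketch of the equivalence (function names are mine, for illustration only):

```c
#include <assert.h>

/* the source form: test that bit 6 is clear */
static int bit6_clear(int xb) { return !(xb & (1 << 6)); }

/* what the emitted NG sequence computes, step by step */
static int bit6_clear_expanded(int xb)
{
    unsigned int r0 = (unsigned int)xb;
    r0 >>= 6;          /* shld r1,r0 with r1 = -6: logical shift right */
    r0 ^= 1;           /* xor #1,r0: flip the extracted bit */
    return r0 & 1;     /* and #1,r0: keep only the low bit */
}
```

Both compute "1 if bit 6 is clear, else 0"; the point of the comment is that a single `tst #64,r0` plus `movt r0` would do the same in fewer instructions.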


[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2012-02-26 Thread olegendo at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #16 from olegendo at gcc dot gnu.org 2012-02-26 13:31:36 UTC ---
Author: olegendo
Date: Sun Feb 26 13:31:32 2012
New Revision: 184585

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=184585
Log:
PR target/49263
* gcc.target/sh/pr49263.c: New.


Added:
trunk/gcc/testsuite/gcc.target/sh/pr49263.c
Modified:
trunk/gcc/testsuite/ChangeLog


[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2012-02-26 Thread olegendo at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

olegendo at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 CC||olegendo at gcc dot gnu.org
 Resolution||FIXED

--- Comment #17 from olegendo at gcc dot gnu.org 2012-02-26 23:26:28 UTC ---
I guess this one is done now.


[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2011-12-28 Thread oleg.e...@t-online.de
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #15 from Oleg Endo oleg.e...@t-online.de 2011-12-29 00:34:53 UTC 
---
(In reply to comment #14)
 With trunk rev 181517 I have observed the following problem, which happens 
 when
 compiling for -m2*, -m3*, -m4* and -Os:
 

This is still present as of rev 182713 and seems to be a different issue.
I've created PR51697 for it.


[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2011-11-20 Thread oleg.e...@t-online.de
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #14 from Oleg Endo oleg.e...@t-online.de 2011-11-20 13:04:39 UTC 
---
With trunk rev 181517 I have observed the following problem, which happens when
compiling for -m2*, -m3*, -m4* and -Os:

The function compiled with -m2 -Os

int test_int64_lowword (long long* x)
{
  return *x  0x ? -20 : -40;
}

results in the following code:

mov     #0,r2           ! 42    movsi_i/3       [length = 2]
tst     r2,r2           ! 44    cmpeqsi_t/1     [length = 2]
bf.s    .L4             ! 45    branch_false    [length = 2]
mov.l   @(4,r4),r3      ! 12    movsi_i/5       [length = 2]
tst     r3,r3           ! 46    cmpeqsi_t/1     [length = 2]
.L4:
bt      .L3             ! 14    branch_true     [length = 2]
mov     #-20,r0         ! 4     movsi_i/3       [length = 2]
rts
nop                     ! 52    *return_i       [length = 4]
.L3:
rts                     ! 54    *return_i       [length = 2]
mov     #-40,r0         ! 5     movsi_i/3       [length = 2]

When compiled with -O1 or -O2 the high word test is completely eliminated
correctly:

mov.l   @(4,r4),r1      ! 10    movsi_i/5       [length = 2]
tst     r1,r1           ! 17    cmpeqsi_t/1     [length = 2]
bt      .L4             ! 18    branch_true     [length = 2]
mov     #-20,r0         ! 4     movsi_i/3       [length = 2]
rts
nop                     ! 50    *return_i       [length = 4]
.L4:
rts                     ! 52    *return_i       [length = 2]
mov     #-40,r0         ! 5     movsi_i/3       [length = 2]

I'm not sure whether this is actually related to this PR, but have noticed it
with the test cases of this PR.  It seems only to happen for comparison-like
insns.  If I remember correctly, this problem did not exist when I started
working on this PR.


[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2011-10-14 Thread kkojima at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #12 from Kazumoto Kojima kkojima at gcc dot gnu.org 2011-10-14 
23:06:06 UTC ---
(In reply to comment #11)
 Created attachment 25491 [details]
 Proposed patch including test case

Looks fine.  A very minor style nits:

 +  if (GET_CODE (XEXP (x, 0)) == AND  /* tst instruction.  */

This comment looks a bit bogus.  A full sentence comment would
be better.

 +
 +

There are some extra empty lines.  GNU/GCC coding style says
that only one empty line is needed.  I know that there are
extra empty lines already, but we should not add new ones :-)


[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2011-10-14 Thread kkojima at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #13 from Kazumoto Kojima kkojima at gcc dot gnu.org 2011-10-15 
02:32:56 UTC ---
Author: kkojima
Date: Sat Oct 15 02:32:53 2011
New Revision: 180020

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=180020
Log:
PR target/49263
* config/sh/sh.h (ZERO_EXTRACT_ANDMASK): New macro.
* config/sh/sh.c (sh_rtx_costs): Add test instruction case.
* config/sh/sh.md (tstsi_t): Name existing insn.  Make inner
and instruction commutative.
(tsthi_t, tstqi_t, tstqi_t_zero, tstsi_t_and_not,
tstsi_t_zero_extract_eq, tstsi_t_zero_extract_xor,
tstsi_t_zero_extract_subreg_xor_little,
tstsi_t_zero_extract_subreg_xor_big): New insns.
(*movsicc_t_false, *movsicc_t_true): Replace space with tab in
asm output.
(*andsi_compact): Reorder alternatives so that K08 is considered
first.
* gcc.target/sh/pr49263.c: New.


Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/sh/sh.c
trunk/gcc/config/sh/sh.h
trunk/gcc/config/sh/sh.md
trunk/gcc/testsuite/ChangeLog


[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2011-10-13 Thread oleg.e...@t-online.de
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

Oleg Endo oleg.e...@t-online.de changed:

   What|Removed |Added

  Attachment #24412|0   |1
is obsolete||

--- Comment #11 from Oleg Endo oleg.e...@t-online.de 2011-10-13 22:54:54 UTC 
---
Created attachment 25491
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25491
Proposed patch including test case


  3) only zero_extract special cases
 
 looks to be dominant.

Yes.  I've briefly looked through the test sources.  A popular use case
for bit test instructions seem to be single bit tests, which the patch
is basically adding. 

 I see.  I also expect that the experts have some idea for
 this issue.

Hm .. http://gcc.gnu.org/ml/gcc/2011-10/msg00189.html
Eric pointed me to the i386 back-end.  Unfortunately, what I found there is
where I originally started...

;; Combine likes to form bit extractions for some tests.  Humor it.

I.e. it is also coded against the behavior of the combine pass with a bunch
of pattern variations.  I guess that's the way it's supposed to be done then :T

 I don't think that it's too much.  Those numbers can be easily
 collected for CSiBE.  If your patterns are named, you could
 simply add -dap -save-temps to the compiler option which is
 specified when running CSiBE's create-config and then get
 the occurrences of testsi_6, for example, with something like
   grep testsi_6 `find . -name *.s -print` | wc -l
 after running the CSiBE size test.

Ah, right!  With the attached latest patch applied to trunk rev 179778
the numbers for 
  -ml -m4-single -Os -mnomacsave -mpretend-cmove -mfused-madd
-freg-struct-return
look something like that:

tstsi_t: 1391
tsthi_t: 4
tstqi_t: 23
tstqi_t_zero: 667
tstsi_t_and_not: 598
tstsi_t_zero_extract_eq: 70
tstsi_t_zero_extract_xor: 923

Notice that the split contributes to the tstsi_t number.
Also, the 3 patterns 
  tstsi_t_zero_extract_xor
  tstsi_t_zero_extract_subreg_xor_little
  tstsi_t_zero_extract_subreg_xor_big

are basically one and the same. On SH4A the subreg variants are required,
because tstsi_t_zero_extract_xor will never match.

I've also added a special case to sh_rtx_costs to detect at least the tstsi_t
pattern. However, the other patterns are not really covered by that and the 
combine pass calculates the cost as a sum of all the operations of the pattern.
I guess the selection of the test instruction could be stimulated a bit more 
by a more accurate costs calculation, but my feeling is that it won't do a lot.


Cheers,
Oleg


[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2011-10-10 Thread oleg.e...@t-online.de
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #9 from Oleg Endo oleg.e...@t-online.de 2011-10-10 23:48:17 UTC 
---
Created attachment 25461
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25461
CSiBE comparisons

(In reply to comment #8)
 
 Another combine pass to reduce size less than 0.3% on one target
 would be not acceptable, I guess.  

I'm sorry, I forgot to mention that it was just a proof of concept hack
of mine, just to see whether it has any chance to work at all.
I think it would be better to change/fix the behavior of the combine pass
in this regard, so that it tries matching combined patterns without
sophisticated transformations. I will try asking on the gcc list about that.

 ~10 new patterns would be
 overkill for that result, though I'm still expecting that a few
 patterns of them were dominant. 

Yep, even if it turned out to be actually only 8 patterns in total, but
still.. I haven't looked at the issue with SH4A but most likely it would add
another one or two patterns... so basically ~10 :)

 Could you get numbers which pattern
 was used in the former option?

I think it would be a bit too much checking out each individual pattern.
Instead I grouped them into what they are effectively doing.
While I was at it, I also added your peephole idea, and a top 10 listing of
the individual files.


[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2011-10-10 Thread kkojima at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #10 from Kazumoto Kojima kkojima at gcc dot gnu.org 2011-10-11 
01:47:03 UTC ---
(In reply to comment #9)
 3) only zero_extract special cases

looks to be dominant.

 I'm sorry, I forgot to mention that it was just a proof of concept hack
 of mine, just to see whether it has any chance to work at all.
 I think it would be better to change/fix the behavior of the combine pass
 in this regard, so that it tries matching combined patterns without
 sophisticated transformations. I will try asking on the gcc list about that.

I see.  I also expect that the experts have some idea for
this issue.

 I think it would be a bit too much checking out each individual pattern.

I don't think that it's too much.  Those numbers can be easily
collected for CSiBE.  If your patterns are named, you could
simply add -dap -save-temps to the compiler option which is
specified when ruining CSiBE's create-config and then get
the occurrences of testsi_6, for example, with something like
  grep testsi_6 `find . -name *.s -print` | wc -l
after running the CSiBE size test.


[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2011-10-09 Thread oleg.e...@t-online.de
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #7 from Oleg Endo oleg.e...@t-online.de 2011-10-09 23:34:45 UTC 
---
(In reply to comment #6)

 Yep, maintenance burden but I don't mean ack/nak for anything.
 If it's enough fruitful, we should take that route.  When it
 gives 5% improvement in the usual working set like as CSiBE,
 hundreds lines would be OK, but if it's ~0.5% or less, it doesn't
 look worth to add many patterns for that.
 
  Isn't there a way to tell the combine pass not to do so, but instead first 
  look
  deeper at what is in the MD?
 
 I don't know how to do it cleanly.
 

I've tried out a couple of things and got some CSiBE numbers based on 
trunk rev 179430. Unfortunately only code size comparisons, no run time 
performance numbers. The tests were compiled with 
-ml -m4-single -Os -mnomacsave -mpretend-cmove -mfused-madd -freg-struct-return

Option 1)
  Use many (~10) patterns in the MD and some cost calculation tuning.
  The last patch required some adaptation, because the combine pass 
  started trying to match things slightly differently. I've also 
  noticed that it requires a special case for one pattern on SH4A...

  size of all modules: 2916390 -> 2909026    -7364 / -0.252504 %
  avg difference over all modules: -409.11 / -0.273887 %
  max: compiler       22808 -> 22812       +4 / +0.017538 %
  min: libpng-1.2.5   99120 -> 97804    -1316 / -1.327684 %

Option 2)
  I've added another combine pass which has the make_compound_operation
  function turned off. The make_compound_operation function is used to
  produce zero_extract patterns. If the resulting simplified pattern does
  not match anything in the MD, combine reverts the transformation and 
  proceeds with the next insn. That way, it never tries to match the 
  tst #imm pattern in the MD. 
  With this option only ~5 patterns seem to be required and a small
  extension of the costs calculation.

  size of all modules:  2916390 -> 2909170    -7220 / -0.247566 %
  avg difference over all modules: -401.11 / -0.254423 %
  max: compiler       22808 -> 22812       +4 / +0.017538 %
  min: libpng-1.2.5   99120 -> 97836    -1284 / -1.295400 %

Not so spectacular on average. It highly depends on the type of SW being
compiled, but it hits quite a lot of files in CSiBE.

Option 2 seems more robust even if it seems less effective, what do you think?


[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2011-10-09 Thread kkojima at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #8 from Kazumoto Kojima kkojima at gcc dot gnu.org 2011-10-10 
01:31:42 UTC ---
(In reply to comment #7)
 Option 2 seems more robust even if it seems less effective, what do you think?

Another combine pass to reduce size less than 0.3% on one target
would be not acceptable, I guess.  ~10 new patterns would be
overkill for that result, though I'm still expecting that a few
patterns of them were dominant.  Could you get numbers which pattern
was used in the former option?


[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2011-06-26 Thread oleg.e...@t-online.de
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #5 from Oleg Endo oleg.e...@t-online.de 2011-06-26 22:30:05 UTC 
---
 Yes, that peephole doesn't catch all the patterns which could
 make tst #imm8,r0 use.  Perhaps it would be a good idea to get
 numbers for the test like CSiBE test with the vanilla and new
 insns/peepholes patched compilers.  Something covers 80% of
 the possible cases in the usual working set, it would be enough
 successful for such a micro-optimization, I guess.

I'd also like to see some numbers on those micro-improvements. I'll have a look
at CSiBE.
Anyway, why not just add all the currently known-to-work cases? What are your
concerns regarding that? I can imagine that it is a maintenance burden to keep
all those definitions and special cases in the MD up-to-date (bit rot etc). Do
you have anything other than that in mind? 

It seems that others have been trying to solve the same problem in a very
similar way: http://patchwork.ozlabs.org/patch/58832/ ;)

I've figured that the following pattern works quite well for this particular
case. It seems to catch all the bit patterns. (sh_tst_const_ok and
sh_zero_extract_val are added functions in sh.c)

(define_insn_and_split "*tstsi3"
  [(set (reg:SI T_REG)
        (zero_extract:SI (match_operand:SI 0 "arith_reg_operand")
                         (match_operand:SI 1 "const_int_operand")
                         (match_operand:SI 2 "const_int_operand")))]
  "TARGET_SH1 && sh_tst_const_ok (sh_zero_extract_val (operands[1],
                                                       operands[2]))"
  "#"
  "&& 1"
  [(const_int 0)]
  {
    int tstval = sh_zero_extract_val (operands[1], operands[2]);
    emit_insn (gen_tstsi (operands[0], GEN_INT (tstval)));
    DONE;
  }
)

...however, the problem with that one is that the T bit is inverted afterwards,
and thus the following branch logic (or whatever) gets inverted as well. One
option could be to emit some T bit inversion insn after gen_tstsi and then
optimize it away later on with e.g. a peephole (?), but I haven't tried that
yet.

The actual problem is that the combine pass first sees the andsi, eq, ...
insns and correctly matches them to those in the MD. But instead of continuing
to match patterns from the MD it expands the and insn into something like
zero_extract, which in turn will never be combined back into the tst insn.
Isn't there a way to tell the combine pass not to do so, but instead first look
deeper at what is in the MD?


Regarding the peephole:

  && (satisfies_constraint_I08 (operands[1])
      || satisfies_constraint_K08 (operands[1]))

I guess this might generate wrong code for e.g. if (x & -2). When x has any
bits[31:1] set this must return true. The code after the peephole optimization
will look only at the lower 8 bits and would possibly return false for x =
0xFF00, which is wrong. So it should be satisfies_constraint_K08 only,
shouldn't it?


 
 Cost patch looks fine to me.  Could you propose it as a separate
 patch on gcc-patches list with an appropriate ChangeLog entry?
 When proposing it, please refer how you've tested it.  Also
 the numbers got with the patch are highly welcome.
 
 BTW, do you have FSF copyright assignment for your GCC work?
 Although the cost patch itself is essentially several lines which
 doesn't require copyright assignment, the other changes you've
 proposed clearly require the paper work, I think.

I'll contact you directly regarding that.


[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2011-06-26 Thread kkojima at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #6 from Kazumoto Kojima kkojima at gcc dot gnu.org 2011-06-27 
05:14:36 UTC ---
(In reply to comment #5)
 Anyway, why not just add all the currently known-to-work cases? What are your
 concerns regarding that? I can imagine that it is a maintenance burden to keep
 all those definitions and special cases in the MD up-to-date (bit rot etc). Do
 you have anything other than that in mind? 

Yep, maintenance burden but I don't mean ack/nak for anything.
If it's enough fruitful, we should take that route.  When it
gives 5% improvement in the usual working set like as CSiBE,
hundreds lines would be OK, but if it's ~0.5% or less, it doesn't
look worth to add many patterns for that.

 Isn't there a way to tell the combine pass not to do so, but instead first 
 look
 deeper at what is in the MD?

I don't know how to do it cleanly.

 I guess this might generate wrong code for e.g. if (x & -2). When x has any
 bits[31:1] set this must return true. The code after the peephole optimization
 will look only at the lower 8 bits and would possibly return false for x =
 0xFF00, which is wrong. So it should be satisfies_constraint_K08 only,
 shouldn't it?

You are right.  That peephole was simply 'something like this'.


[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2011-06-22 Thread kkojima at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #4 from Kazumoto Kojima kkojima at gcc dot gnu.org 2011-06-22 
22:34:04 UTC ---
Yes, that peephole doesn't catch all the patterns which could
make tst #imm8,r0 use.  Perhaps it would be a good idea to get
numbers for the test like CSiBE test with the vanilla and new
insns/peepholes patched compilers.  Something covers 80% of
the possible cases in the usual working set, it would be enough
successful for such a micro-optimization, I guess.

Cost patch looks fine to me.  Could you propose it as a separate
patch on gcc-patches list with an appropriate ChangeLog entry?
When proposing it, please refer how you've tested it.  Also
the numbers got with the patch are highly welcome.

BTW, do you have FSF copyright assignment for your GCC work?
Although the cost patch itself is essentially several lines which
doesn't require copyright assignment, the other changes you've
proposed clearly require the paper work, I think.


[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2011-06-19 Thread oleg.e...@t-online.de
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #3 from Oleg Endo oleg.e...@t-online.de 2011-06-19 16:42:01 UTC 
---
Thanks for having a look at it Kaz.
Yes, the fiddling with the combine pass is fragile. I've tried out your
peephole idea. It works in most cases but not all the time. E.g. it doesn't
catch the following example:

int test_imm (int i)
{
  return (i & 3) ? 20 : 40;
}


        mov     #3,r1
        and     r1,r4
        tst     r4,r4
        bt/s    .L5
        mov     #40,r0
        mov     #20,r0
.L5:
        rts
        nop


Also the following... 

int test_imm (int* i)
{
   return (*i & 255) ? 20 : 40;
}

becomes on little endian (OK):
        mov.b   @r4,r1
        tst     r1,r1
        bt/s    .L10
        mov     #40,r0
        mov     #20,r0
.L10:
        rts
        nop

but on big endian (NG):
        mov.l   @r4,r1
        extu.b  r1,r1
        tst     r1,r1
        bt/s    .L10
        mov     #40,r0
        mov     #20,r0
.L10:
        rts
        nop


I'll give the thing another try.

Regarding the change to the andcosts function, the following should be better:

--- sh.c        (revision 175188)
+++ sh.c        (working copy)
@@ -243,7 +243,7 @@
 static int flow_dependent_p (rtx, rtx);
 static void flow_dependent_p_1 (rtx, const_rtx, void *);
 static int shiftcosts (rtx);
-static int andcosts (rtx);
+static int and_xor_ior_costs (rtx, int code);
 static int addsubcosts (rtx);
 static int multcosts (rtx);
 static bool unspec_caller_rtx_p (rtx);
@@ -2841,14 +2841,14 @@
 return shift_insns[value];
 }

-/* Return the cost of an AND operation.  */
+/* Return the cost of an AND/XOR/IOR operation.  */

 static inline int
-andcosts (rtx x)
+and_xor_ior_costs (rtx x, int code)
 {
   int i;

-  /* Anding with a register is a single cycle and instruction.  */
+  /* register x register is a single cycle instruction.  */
   if (!CONST_INT_P (XEXP (x, 1)))
 return 1;

@@ -2864,17 +2864,17 @@
 }

   /* These constants are single cycle extu.[bw] instructions.  */
-  if (i == 0xff || i == 0xffff)
+  if ((i == 0xff || i == 0xffff) && code == AND)
     return 1;
-  /* Constants that can be used in an and immediate instruction in a single
+  /* Constants that can be used in an instruction with an immediate are a single
      cycle, but this requires r0, so make it a little more expensive.  */
   if (CONST_OK_FOR_K08 (i))
 return 2;
-  /* Constants that can be loaded with a mov immediate and an and.
+  /* Constants that can be loaded with a mov immediate need one more cycle.
  This case is probably unnecessary.  */
   if (CONST_OK_FOR_I08 (i))
 return 2;
-  /* Any other constants requires a 2 cycle pc-relative load plus an and.
+  /* Any other constant requires an additional 2 cycle pc-relative load.
  This case is probably unnecessary.  */
   return 3;
 }
@@ -3043,7 +3043,9 @@
       return true;

     case AND:
-      *total = COSTS_N_INSNS (andcosts (x));
+    case XOR:
+    case IOR:
+      *total = COSTS_N_INSNS (and_xor_ior_costs (x, code));
       return true;

 case MULT:


[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2011-06-12 Thread kkojima at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

Kazumoto Kojima kkojima at gcc dot gnu.org changed:

   What|Removed |Added

 Target||sh*-*-*

--- Comment #2 from Kazumoto Kojima kkojima at gcc dot gnu.org 2011-06-12 
23:11:19 UTC ---
It looks that playing with the internal behavior of the combine
pass is a bit fragile.  Perhaps an alternative way is to define
a peephole for tst #imm8,r0, something like:

;; A peephole for the TST immediate instruction.

(define_peephole2
  [(set (match_operand:SI 2 "arith_reg_operand" "r")
        (and:SI (match_operand:SI 0 "arith_reg_operand" "z")
                (match_operand:SI 1 "const_int_operand" "i")))
   (set (reg:SI T_REG) (eq:SI (match_dup 2) (const_int 0)))]
  "TARGET_SH1
   && peep2_reg_dead_p (2, operands[2])
   && (satisfies_constraint_I08 (operands[1])
       || satisfies_constraint_K08 (operands[1]))"
  [(set (reg:SI T_REG)
        (eq:SI (and:SI (match_dup 0) (match_dup 1)) (const_int 0)))]
{
  operands[1] = GEN_INT (INTVAL (operands[1]) & 0xff);
})

which will work at -O2 or -fpeephole2, though there are pros
and cons of peephole approach.


[Bug target/49263] SH Target: underutilized TST #imm, R0 instruction

2011-06-01 Thread oleg.e...@t-online.de
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49263

--- Comment #1 from Oleg Endo oleg.e...@t-online.de 2011-06-01 20:42:00 UTC 
---
Created attachment 24412
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24412
Proposed Patch

Although the patch gets the job done, programmer's sense tells me it is fishy,
or at least pretty much brute forced cure of the symptoms, not the cause. It's
my first GCC patch, so any feedback is highly appreciated.

What I did was looking at the RTL, in particular the combine pass, identifying
patterns it failed to find a shortcut (tst insn) for and adding them to the
insn descriptions. 

I also had to expand the costs calculation of the AND instruction to cover AND,
OR and XOR (on SH they are the same anyways), or else the cost of a matched
replacement insn would result in a rejection in the combine pass.