[Bug target/55295] [SH] Add support for fipr instruction

2023-03-21 Thread olegendo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295

--- Comment #17 from Oleg Endo  ---
(In reply to Luke Benstead from comment #16)
> OK so perhaps adding __builtin_sh_fipr is a good first step?

Yeah, you can try and see if it produces any useful results for you.

[Bug target/55295] [SH] Add support for fipr instruction

2023-03-21 Thread kazade at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295

--- Comment #16 from Luke Benstead  ---
OK so perhaps adding __builtin_sh_fipr is a good first step?

[Bug target/55295] [SH] Add support for fipr instruction

2023-03-21 Thread olegendo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295

--- Comment #15 from Oleg Endo  ---
It's been too long since I last looked into it.  Maybe some middle-end parts
have become more suitable over time, but it was difficult to make it generate
the fipr instruction automatically for the reasons stated above.

[Bug target/55295] [SH] Add support for fipr instruction

2023-03-21 Thread kazade at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295

Luke Benstead  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kazade at gmail dot com

--- Comment #14 from Luke Benstead  ---
Was there a particular reason why this patch wasn't merged? It would be really
cool to see GCC generate fipr like it does fsrra etc. 

Is there anything I can do to help?

[Bug target/55295] [SH] Add support for fipr instruction

2015-03-02 Thread olegendo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295

--- Comment #13 from Oleg Endo olegendo at gcc dot gnu.org ---
(In reply to Manu Evans from comment #12)
> Hey, I'm still following this with great interest.
>
> Is it possible to make an intrinsic for this instruction so it can be issued
> at will?

Yes, that's what I wanted to do.  Automatic detection of the fipr insn would be
restricted to relaxed FP math (e.g. -ffast-math), because it has reduced FP
precision.  So it makes sense to add a __builtin_sh_fipr.


> What I'm still more interested in at this point, would be some support for
> passing vectors in registers, making it possible to eliminate so much of
> that fmov noise.

See also PR 56592, PR 13423, PR 64305.

Unfortunately those are not so trivial to solve and I have little time at the
moment.


[Bug target/55295] [SH] Add support for fipr instruction

2015-03-01 Thread turkeyman at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295

--- Comment #12 from Manu Evans turkeyman at gmail dot com ---
Hey, I'm still following this with great interest.

Is it possible to make an intrinsic for this instruction so it can be issued at
will?

What I'm still more interested in at this point, would be some support for
passing vectors in registers, making it possible to eliminate so much of that
fmov noise.


[Bug target/55295] [SH] Add support for fipr instruction

2015-03-01 Thread olegendo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295

--- Comment #11 from Oleg Endo olegendo at gcc dot gnu.org ---
A note on the side...
As mentioned above, fipr can also be used to do a 3D dot product.  However,
GCC's vector extensions do not allow specifying vectors of length 3.  To
support that I guess the easiest way is to do it with a bunch of combine
patterns which canonicalize/split into the vector version.  Last time I
tried, register allocation was pretty bad for that as mentioned in comment #10.
Probably it will require some specific pre-allocation before RA.
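
A rough sketch of the 3D case in portable C, using GCC's vector extension: a
3D dot product can be written as a 4D one with the fourth lane padded with
zero, which is the shape a fipr-style 4-element inner product consumes.  The
names v4sf, dot4 and dot3 are made up for illustration, and nothing here
implies that GCC currently turns this code into fipr.

```c
/* GCC/Clang vector extension: four packed floats (16 bytes).  */
typedef float v4sf __attribute__ ((vector_size (16)));

/* Plain 4-element inner product, written with an explicit pairing of
   the additions.  */
static float dot4 (v4sf a, v4sf b)
{
  v4sf p = a * b;                       /* element-wise multiply */
  return (p[0] + p[1]) + (p[2] + p[3]);
}

/* Hypothetical helper: a 3D dot product expressed as a 4D one by
   padding the fourth lane of both operands with zero.  */
static float dot3 (float ax, float ay, float az,
                   float bx, float by, float bz)
{
  v4sf a = { ax, ay, az, 0.0f };
  v4sf b = { bx, by, bz, 0.0f };
  return dot4 (a, b);
}
```

For example, dot3 (1, 2, 3, 4, 5, 6) evaluates to 32.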


[Bug target/55295] [SH] Add support for fipr instruction

2014-12-09 Thread olegendo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295

--- Comment #10 from Oleg Endo olegendo at gcc dot gnu.org ---
(In reply to Oleg Endo from comment #9)
> Created attachment 34213 [details]
> Combine patterns for matching fipr
>
> An updated patch for trunk.  As for the redundant fp moves and/or ferries
> through fpul, those seem to be caused by the lack of various vec_* patterns.
> See also PR 13423.

An alternative pattern for the core fipr insn could be:

(define_insn "fipr_compact"
  [(set (match_operand:V4SF 0 "fp_arith_reg_operand" "=f")
        (vec_concat:V4SF
          (vec_concat:V2SF
            (vec_select:SF (match_operand:V4SF 1 "fp_arith_reg_operand" "%0")
                           (parallel [(const_int 0)]))
            (vec_select:SF (match_dup 1) (parallel [(const_int 1)])))
          (vec_concat:V2SF
            (vec_select:SF (match_dup 1) (parallel [(const_int 2)]))
            (plus:SF
              (plus:SF (vec_select:SF (mult:V4SF (match_dup 1)
                                                 (match_operand:V4SF 2
                                                   "fp_arith_reg_operand" "f"))
                                      (parallel [(const_int 0)]))
                       (vec_select:SF (mult:V4SF (match_dup 1) (match_dup 2))
                                      (parallel [(const_int 1)])))
              (plus:SF (vec_select:SF (mult:V4SF (match_dup 1) (match_dup 2))
                                      (parallel [(const_int 2)]))
                       (vec_select:SF (mult:V4SF (match_dup 1) (match_dup 2))
                                      (parallel [(const_int 3)])))))))
   (clobber (reg:SI FPSCR_STAT_REG))
   (use (reg:SI FPSCR_MODES_REG))]
  "TARGET_SH4"
  "fipr	%2,%0"
  [(set_attr "type" "fp")
   (set_attr "fp_mode" "single")])

However, I'm not sure whether register allocation understands this properly. 
Matching fipr insn during combine has other issues, such as v4sf register
construction from individual sf values.  Before investigating the issue at
combine level, playing along with the vectorizer seems more promising.  For
that vector load/store patterns need to be added (PR 13423) first.
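
As a reading aid, the value that the pattern above assigns to operand 0 can be
modelled in plain C.  This is only a sketch of the RTL's meaning, not SH4
code: fipr_model is a hypothetical name, and the model uses ordinary float
arithmetic, whereas the real instruction has reduced precision.

```c
/* Reference model of the pattern: lanes 0..2 of the result are taken
   unchanged from operand 1 (which shares its register with operand 0),
   and lane 3 receives the 4-element inner product of operands 1 and 2.
   The additions are paired the same way as the plus:SF nesting.  */
static void fipr_model (float fvn[4], const float fvm[4])
{
  float sum = (fvn[0] * fvm[0] + fvn[1] * fvm[1])
            + (fvn[2] * fvm[2] + fvn[3] * fvm[3]);
  fvn[3] = sum;   /* lanes 0..2 of fvn are left untouched */
}
```

With fvn = {1, 2, 3, 4} and fvm = {5, 6, 7, 8}, the call leaves fvn as
{1, 2, 3, 70}.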


[Bug target/55295] [SH] Add support for fipr instruction

2014-12-07 Thread olegendo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295

Oleg Endo olegendo at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #28671|0                           |1
        is obsolete|                            |

--- Comment #9 from Oleg Endo olegendo at gcc dot gnu.org ---
Created attachment 34213
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=34213&action=edit
Combine patterns for matching fipr

An updated patch for trunk.  As for the redundant fp moves and/or ferries
through fpul, those seem to be caused by the lack of various vec_* patterns. 
See also PR 13423.


[Bug target/55295] [SH] Add support for fipr instruction

2013-03-13 Thread olegendo at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295

--- Comment #8 from Oleg Endo olegendo at gcc dot gnu.org 2013-03-13 18:21:37 UTC ---
(In reply to comment #5)
> This is another reason for adding a new ABI, BTW.

Just for the record, I've opened a new PR 56592 for this.


[Bug target/55295] [SH] Add support for fipr instruction

2013-03-05 Thread olegendo at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295

--- Comment #5 from Oleg Endo olegendo at gcc dot gnu.org 2013-03-05 12:28:22 UTC ---
(In reply to comment #4)
> Why is a new ABI important?

Because currently, there is no way to pass something like

struct { float x, y, z, w; };

as function arguments in registers, although the default SH ABI could allow
passing up to 3 of such vectors.  The same applies to

typedef float v4sf __attribute__ ((vector_size (16)));

or

std::array<float, 4>

However, code that does that will be incompatible with existing calling
conventions etc, thus a new (additional and optional) ABI.
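
To make the kind of code in question concrete, here is a minimal sketch of a
function that takes and returns such a vector by value.  struct vec4 and
vec4_add are hypothetical names.  Going by the explanation above, the current
ABI does not pass arguments like these in FP registers when the call is not
inlined.

```c
/* Hypothetical 4-float vector type, same layout as the struct above.  */
struct vec4 { float x, y, z, w; };

/* Both operands are passed by value and the result is returned by
   value; this is the calling pattern a vector ABI would target.  */
static struct vec4 vec4_add (struct vec4 a, struct vec4 b)
{
  struct vec4 r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
  return r;
}
```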



> 4.9? That sounds like it could be years off... :(

4.8 is about to be released soon.  4.9 should follow at around the same time
next year.  Of course you can still grab the current development version and
use it anytime.

> I'm not sure what you mean by 'inline-asm style intrinsics'?

Something like:

static inline void* get_gbr (void) throw ()
{
  void* retval;
  __asm__ volatile ("stc gbr, %0" : "=r" (retval) : );
  return retval;
}

> Last time I used inline-asm blocks in GCC it totally broke the optimisation.
> It wouldn't reorder across inline-asm blocks, and it couldn't eliminate any
> redundant load/stores appearing within the block in the event the value was
> already resident.
>
> Can you give me a small demonstration of what you mean?
> I found whenever I touch inline-asm, the block just grows and grows in scope
> upwards until my whole tight routine is written in asm... but that was some
> years back, GCC3 era.

Yes, there are some limits of what the compiler can do with an asm block.  It
won't analyze the contents of the asm block, only the placeholders.  Thus it
usually can't eliminate redundant loads/stores.
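
The point about placeholders can be shown with a target-independent sketch
(GCC or Clang assumed).  The asm template below is empty, yet the compiler
still has to assume the operand may have been changed, because it looks only
at the operand list.  opaque and twice_opaque are hypothetical names.

```c
/* Empty asm template with one read-write register operand.  The
   compiler does not inspect the template string; from the "+r"
   constraint it only knows that x is read and may be modified.  */
static inline int opaque (int x)
{
  __asm__ ("" : "+r" (x));
  return x;
}

/* At run time this returns 2 * x, but the compiler cannot derive that
   from the asm statements, so it cannot fold a call with a constant
   argument down to a constant.  */
int twice_opaque (int x)
{
  return opaque (x) + opaque (x);
}
```

The same holds for loads and stores written inside a template: they are
invisible to the optimizers, so values that are needed inside should be
passed in as operands rather than reloaded by hand.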





 

> I'll report examples here as I find compelling situations.
>
> But on a tangent, can you explain this behaviour? It's really ruining my code:
>
> float testfunc(float v, float v2)
> {
>     return v*v2 + v;
> }
>
> Compiled with: -O3 -mfused-madd
>
> testfunc:
> .LFB1:
>     .cfi_startproc
>     mov.l   .L3,r1  ;
>     lds.l   @r1+,fpscr  ; <- why does it mess with fpscr?
>     add     #-4,r1
>     fmov    fr5,fr0
>     add     #4,r1   ; <- +4 after -4... redundant?
>     fmac    fr0,fr4,fr0
>     rts
>     lds.l   @r1+,fpscr
> .L4:
>     .align 2
> .L3:
>     .long   __fpscr_values
>     .cfi_endproc
>
> There's a lot of rubbish in there... I expect:
>
> testfunc:
> .LFB1:
>     .cfi_startproc
>     fmov    fr5,fr0
>     fmac    fr0,fr4,fr0
>     rts
>     .cfi_endproc

The fpscr value is changed because its default setting is to operate on
double-precision float values.  This is the default configuration of the
compiler.  You can change it by using e.g. -m4-single, which will assume that
FPSCR setting is configured for single-precision at function entry/return.

The +4 -4 thing is a known problem and stems from the fact that the FPSCR
load/store insns are available only as post-inc/pre-dec.

> I'm also noticing that -ffast-math is inhibiting fmac emission in some cases:
>
> Compiled with: -O3 -mfused-madd -ffast-math
>
> testfunc:
> .LFB1:
>     .cfi_startproc
>     mov.l   .L3,r1
>     lds.l   @r1+,fpscr
>     fldi1   fr0 ; what is a 1.0 doing here?
>     add     #-4,r1
>     add     #4,r1
>     fadd    fr4,fr0 ; v+1 ??
>     fmul    fr5,fr0 ; (v+1)*v2 ?? That's not what the code does...
>     rts
>     lds.l   @r1+,fpscr
>
> What's going on there? That doesn't even look correct...

The transformation is legitimate, although unlucky, since using fmac would be
better in this case.

The original expression 'v*v2 + v' is converted to '(1 + v2)*v' and that's what
the code does.  Probably you compiled for little endian and got confused by the
floating point register ordering for arguments.  It goes like ...
fr5 = arg 0
fr4 = arg 1
fr7 = arg 2
fr6 = arg 3
...

This is another reason for adding a new ABI, BTW.
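
To see that the -ffast-math output discussed above is still the same
computation, the two forms can be written side by side.  This is only an
illustration with hypothetical function names.  The forms are algebraically
equal, but they can round differently for some inputs, which is why the
compiler may only swap one for the other under relaxed FP math.

```c
/* As written in the source: multiply, then add (an fmac candidate).  */
static float madd_form (float v, float v2)
{
  return v * v2 + v;
}

/* The factored form described above: one add of 1.0 (fldi1 + fadd)
   followed by one multiply (fmul).  */
static float factored_form (float v, float v2)
{
  return (1.0f + v2) * v;
}
```

For v = 2 and v2 = 3 both functions return 8.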


[Bug target/55295] [SH] Add support for fipr instruction

2013-03-05 Thread turkeyman at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295

--- Comment #6 from Manu Evans turkeyman at gmail dot com 2013-03-05 12:53:26 UTC ---
Awesome, thanks for the info and help!

Strange -m4-single won't work with my toolchain, it says 'not compatible with
this configuration' >_<

Looking forward to all these fixes! :)


[Bug target/55295] [SH] Add support for fipr instruction

2013-03-05 Thread olegendo at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295

--- Comment #7 from Oleg Endo olegendo at gcc dot gnu.org 2013-03-06 01:05:14 UTC ---
(In reply to comment #5)
> > I'm also noticing that -ffast-math is inhibiting fmac emission in some
> > cases:
> >
> > Compiled with: -O3 -mfused-madd -ffast-math
> >
> > testfunc:
> > .LFB1:
> >     .cfi_startproc
> >     mov.l   .L3,r1
> >     lds.l   @r1+,fpscr
> >     fldi1   fr0 ; what is a 1.0 doing here?
> >     add     #-4,r1
> >     add     #4,r1
> >     fadd    fr4,fr0 ; v+1 ??
> >     fmul    fr5,fr0 ; (v+1)*v2 ?? That's not what the code does...
> >     rts
> >     lds.l   @r1+,fpscr
> >
> > What's going on there? That doesn't even look correct...
>
> The transformation is legitimate, although unlucky, since using fmac would be
> better in this case.

I've opened a new PR 56547 for this issue.


[Bug target/55295] [SH] Add support for fipr instruction

2013-03-04 Thread olegendo at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295

--- Comment #3 from Oleg Endo olegendo at gcc dot gnu.org 2013-03-04 21:50:58 UTC ---
(In reply to comment #2)
> +1
>
> I'm seeing the same pattern.
> In fact, I'm noticing a lot of my maths code seems to be performing a lot of
> redundant moves.

Some examples would be great regarding this matter, although I can already
imagine what the code looks like.  One of the problems is the auto-inc-dec pass
(see PR 50749).  A long time ago the rule of thumb for SH4 programmers was
"read float values with post-inc addressing in your C code, and write float
values with pre-dec addressing".  This does not work anymore, since all memory
accesses are turned into array like index based addresses internally in the
compiler.  Then the auto-inc-dec RTL pass is supposed to find post-inc and
pre-dec addressing mode opportunities, but it fails to do so in most cases.
I have started writing a replacement RTL pass that would try to optimize
addressing mode selections.  I hope to get it in for GCC 4.9.
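
For readers unfamiliar with the old rule of thumb, here is a small sketch of
the two source styles (hypothetical function names).  According to the
explanation above, both end up in the same index-based form inside the
compiler, so the first style no longer steers it towards post-inc addressing.

```c
#include <stddef.h>

/* Old-style source: the pointer is walked with post-increment, which
   used to map directly onto post-inc loads.  */
static float sum_postinc (const float *p, size_t n)
{
  float s = 0.0f;
  while (n--)
    s += *p++;
  return s;
}

/* Index-based source: same result, different spelling.  */
static float sum_indexed (const float *p, size_t n)
{
  float s = 0.0f;
  for (size_t i = 0; i < n; i++)
    s += p[i];
  return s;
}
```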



Anyway, if you have some example code that you can share, it would be really
appreciated and helpful during development for testing purposes.

> Are there actually any builtins/intrinsics available for the SH4?
> How do I access the awesome vector operations without breaking out the inline
> asm?

There aren't that many HW vector ops on SH4, just fipr and ftrv.  At the
moment, there are no builtins for those, so you'd have to use inline asm
intrinsics.  Like I mentioned in comment #1, I'd rather make the compiler
figure out opportunities from portable generic code.  Although for ftrv the
patterns might be a bit complicated, also because the compiler then has to
manage the 2nd FPU regs bank...

> It would be nice to have some intrinsics that understand vectors as sequences
> of 4 float regs, and automate a sequential (vector) load.

That would be the job of the address-mode-selection RTL pass.  It would also
improve overall code quality on SH.  The fastest way to load 4 float vectors is
to use 2x fmov.d.  The compiler could also do that automatically, but this
requires FPSCR switching, which unfortunately also needs some rework (e.g. see
PR 53513, PR 6526).

And on top of that, we also have PR 13423.  It seems that the proper fix for
this is a new reworked (vector) ABI for SH.

> Also, the ftrv opcode doesn't seem to be accessible either.

True.  I really hope that I'll find enough time to brush up SH FPU code
generation for GCC 4.9.  Until then, I'd suggest to use inline-asm style
intrinsics.


[Bug target/55295] [SH] Add support for fipr instruction

2013-03-04 Thread turkeyman at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295

--- Comment #4 from Manu Evans turkeyman at gmail dot com 2013-03-05 01:55:08 UTC ---
(In reply to comment #3)
> (In reply to comment #2)
> > +1
> >
> > I'm seeing the same pattern.
> > In fact, I'm noticing a lot of my maths code seems to be performing a lot
> > of redundant moves.
>
> Some examples would be great regarding this matter, although I can already
> imagine what the code looks like.  One of the problems is the auto-inc-dec
> pass (see PR 50749).  A long time ago the rule of thumb for SH4 programmers
> was "read float values with post-inc addressing in your C code, and write
> float values with pre-dec addressing".  This does not work anymore, since all
> memory accesses are turned into array like index based addresses internally
> in the compiler.  Then the auto-inc-dec RTL pass is supposed to find post-inc
> and pre-dec addressing mode opportunities, but it fails to do so in most
> cases.  I have started writing a replacement RTL pass that would try to
> optimize addressing mode selections.  I hope to get it in for GCC 4.9.
>
> Anyway, if you have some example code that you can share, it would be really
> appreciated and helpful during development for testing purposes.
>
> > Are there actually any builtins/intrinsics available for the SH4?
> > How do I access the awesome vector operations without breaking out the
> > inline asm?
>
> There aren't that many HW vector ops on SH4, just fipr and ftrv.  At the
> moment, there are no builtins for those, so you'd have to use inline asm
> intrinsics.  Like I mentioned in comment #1, I'd rather make the compiler
> figure out opportunities from portable generic code.  Although for ftrv the
> patterns might be a bit complicated, also because the compiler then has to
> manage the 2nd FPU regs bank...
>
> > It would be nice to have some intrinsics that understand vectors as
> > sequences of 4 float regs, and automate a sequential (vector) load.
>
> That would be the job of the address-mode-selection RTL pass.  It would also
> improve overall code quality on SH.  The fastest way to load 4 float vectors
> is to use 2x fmov.d.  The compiler could also do that automatically, but this
> requires FPSCR switching, which unfortunately also needs some rework (e.g.
> see PR 53513, PR 6526).
>
> And on top of that, we also have PR 13423.  It seems that the proper fix for
> this is a new reworked (vector) ABI for SH.

Well I hope you find the time for all this, the (small) sh4 community will love
you! :)

Why is a new ABI important?

> > Also, the ftrv opcode doesn't seem to be accessible either.
>
> True.  I really hope that I'll find enough time to brush up SH FPU code
> generation for GCC 4.9.  Until then, I'd suggest to use inline-asm style
> intrinsics.

4.9? That sounds like it could be years off... :(

I'm not sure what you mean by 'inline-asm style intrinsics'?
Last time I used inline-asm blocks in GCC it totally broke the optimisation. It
wouldn't reorder across inline-asm blocks, and it couldn't eliminate any
redundant load/stores appearing within the block in the event the value was
already resident.

Can you give me a small demonstration of what you mean?
I found whenever I touch inline-asm, the block just grows and grows in scope
upwards until my whole tight routine is written in asm... but that was some
years back, GCC3 era.

I'll report examples here as I find compelling situations.

But on a tangent, can you explain this behaviour? It's really ruining my code:

float testfunc(float v, float v2)
{
    return v*v2 + v;
}

Compiled with: -O3 -mfused-madd

testfunc:
.LFB1:
    .cfi_startproc
    mov.l   .L3,r1  ;
    lds.l   @r1+,fpscr  ; <- why does it mess with fpscr?
    add     #-4,r1
    fmov    fr5,fr0
    add     #4,r1   ; <- +4 after -4... redundant?
    fmac    fr0,fr4,fr0
    rts
    lds.l   @r1+,fpscr
.L4:
    .align 2
.L3:
    .long   __fpscr_values
    .cfi_endproc

There's a lot of rubbish in there... I expect:

testfunc:
.LFB1:
    .cfi_startproc
    fmov    fr5,fr0
    fmac    fr0,fr4,fr0
    rts
    .cfi_endproc

I'm also noticing that -ffast-math is inhibiting fmac emission in some cases:

Compiled with: -O3 -mfused-madd -ffast-math

testfunc:
.LFB1:
    .cfi_startproc
    mov.l   .L3,r1
    lds.l   @r1+,fpscr
    fldi1   fr0 ; what is a 1.0 doing here?
    add     #-4,r1
    add     #4,r1
    fadd    fr4,fr0 ; v+1 ??
    fmul    fr5,fr0 ; (v+1)*v2 ?? That's not what the code does...
    rts
    lds.l   @r1+,fpscr

What's going on there? That doesn't even look correct...

Cheers!


[Bug target/55295] [SH] Add support for fipr instruction

2012-11-12 Thread olegendo at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295

--- Comment #1 from Oleg Endo olegendo at gcc dot gnu.org 2012-11-12 22:39:27 UTC ---
I forgot to mention that at least there should be a target specific built-in
function to generate the fipr insn.  There is already a SHmedia built-in for
that, so adding one for SH4* shouldn't be a big deal.  However, ideally the
compiler would discover fipr opportunities by itself (when compiling with
-ffast-math).