[Bug target/55295] [SH] Add support for fipr instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295 --- Comment #17 from Oleg Endo --- (In reply to Luke Benstead from comment #16) > OK so perhaps adding __builtin_sh_fipr is a good first step? Yeah, you can try and see if it produces any useful results for you.
[Bug target/55295] [SH] Add support for fipr instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295 --- Comment #16 from Luke Benstead --- OK so perhaps adding __builtin_sh_fipr is a good first step?
[Bug target/55295] [SH] Add support for fipr instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295 --- Comment #15 from Oleg Endo --- It's been too long since I've looked into it. Maybe some middle-end parts got more suitable over the time, but it was difficult to make it generate the fipr instruction automatically due to the reasons stated above.
[Bug target/55295] [SH] Add support for fipr instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295

Luke Benstead changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kazade at gmail dot com

--- Comment #14 from Luke Benstead ---
Was there a particular reason why this patch wasn't merged? It would be really cool to see GCC generate fipr like it does fsrra etc. Is there anything I can do to help?
[Bug target/55295] [SH] Add support for fipr instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295

--- Comment #13 from Oleg Endo olegendo at gcc dot gnu.org ---
(In reply to Manu Evans from comment #12)
> Hey, I'm still following this with great interest.
> Is it possible to make an intrinsic for this instruction so it can be issued at will?

Yes, that's what I wanted to do. Automatic detection of the fipr insn would be restricted to relaxed FP math (e.g. -ffast-math), because it has reduced FP precision. So it makes sense to add a __builtin_sh_fipr.

> What I'm still more interested in at this point, would be some support for passing vectors in registers, making it possible to eliminate so much of that fmov noise.

See also PR 56592, PR 13423, PR 64305. Unfortunately those are not so trivial to solve and I have little time at the moment.
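Editorial illustration: a minimal sketch of the kind of portable source that automatic fipr detection would have to recognize. The names dot4 and v4sf are chosen here for illustration (v4sf matches the typedef quoted in comment #5); this is not the proposed __builtin_sh_fipr, whose interface the thread does not specify.

```c
/* GCC/Clang vector extension: four packed single-precision floats. */
typedef float v4sf __attribute__ ((vector_size (16)));

/* Hypothetical helper, not a GCC built-in: a plain 4-element dot product.
   The additions are grouped pairwise, as in the pattern in comment #10. */
static inline float
dot4 (v4sf a, v4sf b)
{
  v4sf p = a * b;                        /* element-wise multiply */
  return (p[0] + p[1]) + (p[2] + p[3]);  /* pairwise sum of the products */
}
```

Because the hardware instruction has reduced precision, replacing this exact IEEE sequence with fipr would only be acceptable under relaxed FP math, which is the restriction described above.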
[Bug target/55295] [SH] Add support for fipr instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295 --- Comment #12 from Manu Evans turkeyman at gmail dot com --- Hey, I'm still following this with great interest. Is it possible to make an intrinsic for this instruction so it can be issued at will? What I'm still more interested in at this point, would be some support for passing vectors in registers, making it possible to eliminate so much of that fmov noise.
[Bug target/55295] [SH] Add support for fipr instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295 --- Comment #11 from Oleg Endo olegendo at gcc dot gnu.org --- A note on the side... As mentioned above, fipr can also be used to do a 3D dot product. However, GCC's vector extensions do not allow specifying vectors of length 3. To support that I guess the easiest way is to do it with a bunch of combine patterns which canonicalize/split into the vector version. Last time I've tried, register allocation was pretty bad for that as mentioned in comment #10. Probably it will require some specific pre-allocation before RA.
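Editorial illustration: one way to express a 3D dot product in terms of a 4-element one is to pad the fourth lane with zero. This is only a sketch of the idea at source level; struct vec3, vec3_to_v4sf and dot3 are hypothetical names, and the comment above is about doing the equivalent inside the compiler with combine patterns.

```c
typedef float v4sf __attribute__ ((vector_size (16)));

/* Hypothetical 3-component type; per the comment above, a vector of
   length 3 cannot be declared with GCC's vector extensions. */
struct vec3 { float x, y, z; };

/* Widen to four lanes, setting the unused lane to zero. */
static inline v4sf
vec3_to_v4sf (struct vec3 v)
{
  v4sf r = { v.x, v.y, v.z, 0.0f };
  return r;
}

/* 3D dot product computed as a 4-element dot product; the padded
   lane contributes 0 * 0 = 0 to the sum. */
static inline float
dot3 (struct vec3 a, struct vec3 b)
{
  v4sf p = vec3_to_v4sf (a) * vec3_to_v4sf (b);
  return (p[0] + p[1]) + (p[2] + p[3]);
}
```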
[Bug target/55295] [SH] Add support for fipr instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295

--- Comment #10 from Oleg Endo olegendo at gcc dot gnu.org ---
(In reply to Oleg Endo from comment #9)
> Created attachment 34213 [details]
> Combine patterns for matching fipr
>
> An updated patch for trunk.
> As for the redundant fp moves and/or ferries through fpul, those seem to be caused by the lack of various vec_* patterns. See also PR 13423.

An alternative pattern for the core fipr insn could be:

(define_insn "fipr_compact"
  [(set (match_operand:V4SF 0 "fp_arith_reg_operand" "=f")
        (vec_concat:V4SF
          (vec_concat:V2SF
            (vec_select:SF (match_operand:V4SF 1 "fp_arith_reg_operand" "%0")
                           (parallel [(const_int 0)]))
            (vec_select:SF (match_dup 1) (parallel [(const_int 1)])))
          (vec_concat:V2SF
            (vec_select:SF (match_dup 1) (parallel [(const_int 2)]))
            (plus:SF
              (plus:SF
                (vec_select:SF
                  (mult:V4SF (match_dup 1)
                             (match_operand:V4SF 2 "fp_arith_reg_operand" "f"))
                  (parallel [(const_int 0)]))
                (vec_select:SF (mult:V4SF (match_dup 1) (match_dup 2))
                               (parallel [(const_int 1)])))
              (plus:SF
                (vec_select:SF (mult:V4SF (match_dup 1) (match_dup 2))
                               (parallel [(const_int 2)]))
                (vec_select:SF (mult:V4SF (match_dup 1) (match_dup 2))
                               (parallel [(const_int 3)])))))))
   (clobber (reg:SI FPSCR_STAT_REG))
   (use (reg:SI FPSCR_MODES_REG))]
  "TARGET_SH4"
  "fipr	%2,%0"
  [(set_attr "type" "fp")
   (set_attr "fp_mode" "single")])

However, I'm not sure whether register allocation understands this properly.

Matching fipr insn during combine has other issues, such as v4sf register construction from individual sf values. Before investigating the issue at combine level, playing along with the vectorizer seems more promising. For that vector load/store patterns need to be added (PR 13423) first.
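Editorial illustration: read as plain C, the RTL pattern above says that elements 0 to 2 of the destination vector pass through unchanged and element 3 receives the 4-element dot product. The following reference model is a sketch under that reading; fipr_model is a hypothetical name, and the model computes in ordinary single precision, so it does not reproduce the reduced precision of the hardware instruction.

```c
typedef float v4sf __attribute__ ((vector_size (16)));

/* Reference model of the pattern: operand 1 (tied to the output) is 'a',
   operand 2 is 'b'.  Lanes 0..2 of 'a' are kept, lane 3 is replaced by
   the dot product, summed pairwise as in the RTL. */
static inline v4sf
fipr_model (v4sf a, v4sf b)
{
  v4sf p = a * b;                        /* (mult:V4SF op1 op2) */
  v4sf r = a;                            /* lanes 0..2 pass through */
  r[3] = (p[0] + p[1]) + (p[2] + p[3]);  /* (plus (plus ..) (plus ..)) */
  return r;
}
```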
[Bug target/55295] [SH] Add support for fipr instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295

Oleg Endo olegendo at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #28671|0                           |1
        is obsolete|                            |

--- Comment #9 from Oleg Endo olegendo at gcc dot gnu.org ---
Created attachment 34213
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=34213&action=edit
Combine patterns for matching fipr

An updated patch for trunk.
As for the redundant fp moves and/or ferries through fpul, those seem to be caused by the lack of various vec_* patterns. See also PR 13423.
[Bug target/55295] [SH] Add support for fipr instruction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295

--- Comment #8 from Oleg Endo olegendo at gcc dot gnu.org 2013-03-13 18:21:37 UTC ---
(In reply to comment #5)
> This is another reason for adding a new ABI, BTW.

Just for the record, I've opened a new PR 56592 for this.
[Bug target/55295] [SH] Add support for fipr instruction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295

--- Comment #5 from Oleg Endo olegendo at gcc dot gnu.org 2013-03-05 12:28:22 UTC ---
(In reply to comment #4)
> Why is a new ABI important?

Because currently, there is no way to pass something like

  struct { float x, y, z, w; };

as function arguments in registers, although the default SH ABI could allow passing up to 3 of such vectors. The same applies to

  typedef float v4sf __attribute__ ((vector_size (16)));

or

  std::array<float, 4>

However, code that does that will be incompatible with existing calling conventions etc, thus a new (additional and optional) ABI.

> 4.9? That sounds like it could be years off... :(

4.8 is about to be released soon. 4.9 should follow at around the same time next year. Of course you can still grab the current development version and use it anytime.

> I'm not sure what you mean by 'inline-asm style intrinsics'?

Something like:

  static inline void*
  get_gbr (void) throw ()
  {
    void* retval;
    __asm__ volatile ("stc gbr, %0" : "=r" (retval) : );
    return retval;
  }

> Last time I used inline-asm blocks in GCC it totally broke the optimisation. It wouldn't reorder across inline-asm blocks, and it couldn't eliminate any redundant load/stores appearing within the block in the event the value was already resident.
> Can you give me a small demonstration of what you mean?
> I found whenever I touch inline-asm, the block just grows and grows in scope upwards until my whole tight routine is written in asm... but that was some years back, GCC3 era.

Yes, there are some limits of what the compiler can do with an asm block. It won't analyze the contents of the asm block, only the placeholders. Thus it usually can't eliminate redundant loads/stores.

> I'll report examples here as I find compelling situations. But on a tangent, can you explain this behaviour? It's really ruining my code:
>
> float testfunc(float v, float v2)
> {
>   return v*v2 + v;
> }
>
> Compiled with: -O3 -mfused-madd
>
> testfunc:
> .LFB1:
>         .cfi_startproc
>         mov.l   .L3,r1          ;
>         lds.l   @r1+,fpscr      ; <- why does it mess with fpscr?
>         add     #-4,r1
>         fmov    fr5,fr0
>         add     #4,r1           ; <- +4 after -4... redundant?
>         fmac    fr0,fr4,fr0
>         rts
>         lds.l   @r1+,fpscr
> .L4:
>         .align 2
> .L3:
>         .long   __fpscr_values
>         .cfi_endproc
>
> There's a lot of rubbish in there... I expect:
>
> testfunc:
> .LFB1:
>         .cfi_startproc
>         fmov    fr5,fr0
>         fmac    fr0,fr4,fr0
>         rts
>         .cfi_endproc

The fpscr value is changed because its default setting is to operate on double-precision float values. This is the default configuration of the compiler. You can change it by using e.g. -m4-single, which will assume that FPSCR setting is configured for single-precision at function entry/return.
The +4 -4 thing is a known problem and stems from the fact that the FPSCR load/store insns are available only as post-inc/pre-dec.

> I'm also noticing that -ffast-math is inhibiting fmac emission in some cases:
>
> Compiled with: -O3 -mfused-madd -ffast-math
>
> testfunc:
> .LFB1:
>         .cfi_startproc
>         mov.l   .L3,r1
>         lds.l   @r1+,fpscr
>         fldi1   fr0             ; what is a 1.0 doing here?
>         add     #-4,r1
>         add     #4,r1
>         fadd    fr4,fr0         ; v+1 ??
>         fmul    fr5,fr0         ; (v+1)*v2 ?? That's not what the code does...
>         rts
>         lds.l   @r1+,fpscr
>
> What's going on there? That doesn't even look correct...

The transformation is legitimate, although unlucky, since using fmac would be better in this case. The original expression 'v*v2 + v' is converted to '(1 + v2)*v' and that's what the code does. Probably you compiled for little endian and got confused by the floating point register ordering for arguments. It goes like ...

  fr5 = arg 0
  fr4 = arg 1
  fr7 = arg 2
  fr6 = arg 3
  ...

This is another reason for adding a new ABI, BTW.
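Editorial illustration: the two forms of the expression discussed above, written out in C. The function names are hypothetical. The forms are algebraically equal, but floating-point arithmetic does not obey the distributive law exactly, so in general they may round differently; that is why the compiler only rewrites one into the other under relaxed FP math such as -ffast-math.

```c
/* As written in the source: a multiply followed by an add,
   which is the shape an fmac instruction can implement. */
static inline float
madd_form (float v, float v2)
{
  return v * v2 + v;
}

/* The factored shape seen in the -ffast-math listing:
   load 1.0, add v2, multiply by v. */
static inline float
factored_form (float v, float v2)
{
  return (1.0f + v2) * v;
}
```

For small exactly representable inputs such as v = 3 and v2 = 4, both functions return 15.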
[Bug target/55295] [SH] Add support for fipr instruction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295 --- Comment #6 from Manu Evans turkeyman at gmail dot com 2013-03-05 12:53:26 UTC --- Awesome, thanks for the info and help! Strange -m4-single won't work with my toolchain, it says 'not compatible with this configuration' _ Looking forward to all these fixes! :)
[Bug target/55295] [SH] Add support for fipr instruction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295

--- Comment #7 from Oleg Endo olegendo at gcc dot gnu.org 2013-03-06 01:05:14 UTC ---
(In reply to comment #5)
> > I'm also noticing that -ffast-math is inhibiting fmac emission in some cases:
> >
> > Compiled with: -O3 -mfused-madd -ffast-math
> >
> > testfunc:
> > .LFB1:
> >         .cfi_startproc
> >         mov.l   .L3,r1
> >         lds.l   @r1+,fpscr
> >         fldi1   fr0             ; what is a 1.0 doing here?
> >         add     #-4,r1
> >         add     #4,r1
> >         fadd    fr4,fr0         ; v+1 ??
> >         fmul    fr5,fr0         ; (v+1)*v2 ?? That's not what the code does...
> >         rts
> >         lds.l   @r1+,fpscr
> >
> > What's going on there? That doesn't even look correct...
>
> The transformation is legitimate, although unlucky, since using fmac would be better in this case.

I've opened a new PR 56547 for this issue.
[Bug target/55295] [SH] Add support for fipr instruction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295

--- Comment #3 from Oleg Endo olegendo at gcc dot gnu.org 2013-03-04 21:50:58 UTC ---
(In reply to comment #2)
> +1
> I'm seeing the same pattern.
> In fact, I'm noticing a lot of my maths code seems to be performing a lot of redundant moves.

Some examples would be great regarding this matter, although I can already imagine what the code looks like.
One of the problems is the auto-inc-dec pass (see PR 50749). A long time ago the rule of thumb for SH4 programmers was "read float values with post-inc addressing in your C code, and write float values with pre-dec addressing". This does not work anymore, since all memory accesses are turned into array like index based addresses internally in the compiler. Then the auto-inc-dec RTL pass is supposed to find post-inc and pre-dec addressing mode opportunities, but it fails to do so in most cases.
I have started writing a replacement RTL pass that would try to optimize addressing mode selections. I hope to get it in for GCC 4.9.
Anyway, if you have some example code that you can share, it would be really appreciated and helpful during development for testing purposes.

> Are there actually any builtins/intrinsics available for the SH4?
> How do I access the awesome vector operations without breaking out the inline asm?

There aren't that many HW vector ops on SH4, just fipr and ftrv. At the moment, there are no builtins for those, so you'd have to use inline asm intrinsics. Like I mentioned in comment #1, I'd rather make the compiler figure out opportunities from portable generic code. Although for ftrv the patterns might be a bit complicated, also because the compiler then has to manage the 2nd FPU regs bank...

> It would be nice to have some intrinsics that understand vectors as sequences of 4 float regs, and automate a sequential (vector) load.

That would be the job of the address-mode-selection RTL pass. It would also improve overall code quality on SH.
The fastest way to load 4 float vectors is to use 2x fmov.d. The compiler could also do that automatically, but this requires FPSCR switching, which unfortunately also needs some rework (e.g. see PR 53513, PR 6526).
And on top of that, we also have PR 13423. It seems that the proper fix for this is a new reworked (vector) ABI for SH.

> Also, the ftrv opcode doesn't seem to be accessible either.

True. I really hope that I'll find enough time to brush up SH FPU code generation for GCC 4.9. Until then, I'd suggest to use inline-asm style intrinsics.
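Editorial illustration: the old rule of thumb mentioned above, written as C source. This is a sketch with a hypothetical function name; it shows only the source-level idiom (reads through a post-incremented pointer, writes through a pre-decremented one). As the comment explains, current GCC rewrites such accesses internally, so this source shape does not guarantee post-inc/pre-dec addressing in the generated SH4 code.

```c
#include <stddef.h>

/* Hypothetical example: scale n floats and store them in reverse order.
   'dst_end' points one past the last element of the destination.
   Reads use *src++ (post-increment), writes use *--dst_end (pre-decrement). */
static inline void
scale_reversed (float *dst_end, const float *src, size_t n, float k)
{
  while (n--)
    *--dst_end = *src++ * k;
}
```

Note that writing with pre-decrement fills the destination from its end toward its start, so the output order is the reverse of the input order.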
[Bug target/55295] [SH] Add support for fipr instruction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295

--- Comment #4 from Manu Evans turkeyman at gmail dot com 2013-03-05 01:55:08 UTC ---
(In reply to comment #3)
> (In reply to comment #2)
> > +1
> > I'm seeing the same pattern.
> > In fact, I'm noticing a lot of my maths code seems to be performing a lot of redundant moves.
>
> Some examples would be great regarding this matter, although I can already imagine what the code looks like.
> One of the problems is the auto-inc-dec pass (see PR 50749). A long time ago the rule of thumb for SH4 programmers was "read float values with post-inc addressing in your C code, and write float values with pre-dec addressing". This does not work anymore, since all memory accesses are turned into array like index based addresses internally in the compiler. Then the auto-inc-dec RTL pass is supposed to find post-inc and pre-dec addressing mode opportunities, but it fails to do so in most cases.
> I have started writing a replacement RTL pass that would try to optimize addressing mode selections. I hope to get it in for GCC 4.9.
> Anyway, if you have some example code that you can share, it would be really appreciated and helpful during development for testing purposes.
>
> > Are there actually any builtins/intrinsics available for the SH4?
> > How do I access the awesome vector operations without breaking out the inline asm?
>
> There aren't that many HW vector ops on SH4, just fipr and ftrv. At the moment, there are no builtins for those, so you'd have to use inline asm intrinsics. Like I mentioned in comment #1, I'd rather make the compiler figure out opportunities from portable generic code. Although for ftrv the patterns might be a bit complicated, also because the compiler then has to manage the 2nd FPU regs bank...
>
> > It would be nice to have some intrinsics that understand vectors as sequences of 4 float regs, and automate a sequential (vector) load.
>
> That would be the job of the address-mode-selection RTL pass. It would also improve overall code quality on SH.
> The fastest way to load 4 float vectors is to use 2x fmov.d. The compiler could also do that automatically, but this requires FPSCR switching, which unfortunately also needs some rework (e.g. see PR 53513, PR 6526).
> And on top of that, we also have PR 13423. It seems that the proper fix for this is a new reworked (vector) ABI for SH.

Well I hope you find the time for all this, the (small) sh4 community will love you! :)
Why is a new ABI important?

> > Also, the ftrv opcode doesn't seem to be accessible either.
>
> True. I really hope that I'll find enough time to brush up SH FPU code generation for GCC 4.9. Until then, I'd suggest to use inline-asm style intrinsics.

4.9? That sounds like it could be years off... :(

I'm not sure what you mean by 'inline-asm style intrinsics'?
Last time I used inline-asm blocks in GCC it totally broke the optimisation. It wouldn't reorder across inline-asm blocks, and it couldn't eliminate any redundant load/stores appearing within the block in the event the value was already resident.
Can you give me a small demonstration of what you mean?
I found whenever I touch inline-asm, the block just grows and grows in scope upwards until my whole tight routine is written in asm... but that was some years back, GCC3 era.

I'll report examples here as I find compelling situations. But on a tangent, can you explain this behaviour? It's really ruining my code:

float testfunc(float v, float v2)
{
  return v*v2 + v;
}

Compiled with: -O3 -mfused-madd

testfunc:
.LFB1:
        .cfi_startproc
        mov.l   .L3,r1          ;
        lds.l   @r1+,fpscr      ; <- why does it mess with fpscr?
        add     #-4,r1
        fmov    fr5,fr0
        add     #4,r1           ; <- +4 after -4... redundant?
        fmac    fr0,fr4,fr0
        rts
        lds.l   @r1+,fpscr
.L4:
        .align 2
.L3:
        .long   __fpscr_values
        .cfi_endproc

There's a lot of rubbish in there... I expect:

testfunc:
.LFB1:
        .cfi_startproc
        fmov    fr5,fr0
        fmac    fr0,fr4,fr0
        rts
        .cfi_endproc

I'm also noticing that -ffast-math is inhibiting fmac emission in some cases:

Compiled with: -O3 -mfused-madd -ffast-math

testfunc:
.LFB1:
        .cfi_startproc
        mov.l   .L3,r1
        lds.l   @r1+,fpscr
        fldi1   fr0             ; what is a 1.0 doing here?
        add     #-4,r1
        add     #4,r1
        fadd    fr4,fr0         ; v+1 ??
        fmul    fr5,fr0         ; (v+1)*v2 ?? That's not what the code does...
        rts
        lds.l   @r1+,fpscr

What's going on there? That doesn't even look correct...

Cheers!
[Bug target/55295] [SH] Add support for fipr instruction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295 --- Comment #1 from Oleg Endo olegendo at gcc dot gnu.org 2012-11-12 22:39:27 UTC --- I forgot to mention that at least there should be a target specific built-in function to generate the fipr insn. There is already a SHmedia built-in for that, so adding one for SH4* shouldn't be a big deal. However, ideally the compiler would discover fipr opportunities by itself (when compiling with -ffast-math).