http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55295



--- Comment #4 from Manu Evans <turkeyman at gmail dot com> 2013-03-05 01:55:08 UTC ---

(In reply to comment #3)

> (In reply to comment #2)

> > +1

> > 

> > I'm seeing the same pattern.

> > In fact, I'm noticing a lot of my maths code seems to be performing a lot of

> > redundant moves.

> 

> Some examples would be great regarding this matter, although I can already

> imagine what the code looks like.  One of the problems is the auto-inc-dec pass

> (see PR 50749).  A long time ago the rule of thumb for SH4 programmers was

> "read float values with post-inc addressing in your C code, and write float

> values with pre-dec addressing".  This does not work anymore, since all memory

> accesses are turned into array like index based addresses internally in the

> compiler.  Then the auto-inc-dec RTL pass is supposed to find post-inc and

> pre-dec addressing mode opportunities, but it fails to do so in most cases.

> I have started writing a replacement RTL pass that would try to optimize

> addressing mode selections.  I hope to get it in for GCC 4.9.

> 

> Anyway, if you have some example code that you can share, it would be really

> appreciated and helpful during development for testing purposes.

> 

> > Are there actually any builtins/intrinsics available for the SH4?

> > How do I access the awesome vector operations without breaking out the inline

> > asm?

> 

> There aren't that many HW vector ops on SH4, just fipr and ftrv.  At the

> moment, there are no builtins for those, so you'd have to use inline asm

> intrinsics.  Like I mentioned in comment #1, I'd rather make the compiler

> figure out opportunities from portable generic code.  Although for ftrv the

> patterns might be a bit .... complicated, also because the compiler then has to

> manage the 2nd FPU regs bank...

>

> > It would be nice to have some intrinsics that understand vectors as sequences

> > of 4 float regs, and automate a sequential (vector) load.

> 

> That would be the job of the address-mode-selection RTL pass.  It would also

> improve overall code quality on SH.  The fastest way to load 4 float vectors is

> to use 2x fmov.d.  The compiler could also do that automatically, but this

> requires FPSCR switching, which unfortunately also needs some rework (e.g. see

> PR 53513, PR 6526).

> 

> And on top of that, we also have PR 13423.  It seems that the proper fix for

> this is a new reworked (vector) ABI for SH.



Well I hope you find the time for all this, the (small) sh4 community will love

you! :)



Why is a new ABI important?





> > Also, the ftrv opcode doesn't seem to be accessible either.

> 

> True.  I really hope that I'll find enough time to brush up SH FPU code

> generation for GCC 4.9.  Until then, I'd suggest to use inline-asm style

> intrinsics.



4.9? That sounds like it could be years off... :(



I'm not sure what you mean by 'inline-asm style intrinsics'?

Last time I used inline-asm blocks in GCC it totally broke the optimisation. It

wouldn't reorder across inline-asm blocks, and it couldn't eliminate any

redundant load/stores appearing within the block in the event the value was

already resident.



Can you give me a small demonstration of what you mean?

I found whenever I touch inline-asm, the block just grows and grows in scope

upwards until my whole tight routine is written in asm... but that was some

years back, GCC3 era.
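
For reference, a minimal sketch of what an "inline-asm style intrinsic" could look like: a small static inline wrapper around a single instruction, written as GCC extended asm with operand constraints and without volatile, so the compiler stays free to allocate registers and schedule around it. The name sh4_fmac is hypothetical, and the constraint letters ("f" for any FP register, "w" for fr0) as well as the __SH4_SINGLE__ / __SH4_SINGLE_ONLY__ guards are assumptions about the SH back end; this has not been verified on hardware. On any other target or FPU mode it falls back to plain C.

```c
/* Hypothetical wrapper: returns a * b + c, using one fmac where possible. */
static inline float sh4_fmac(float a, float b, float c)
{
#if defined(__GNUC__) && (defined(__SH4_SINGLE__) || defined(__SH4_SINGLE_ONLY__))
    /* fmac fr0,FRm,FRn computes FRn = fr0 * FRm + FRn.
       Assumed constraints: "w" pins a to fr0, "f" lets GCC pick any
       FP register, "+f" marks c as both input and output.
       No volatile and no "memory" clobber, so GCC may move or delete
       this statement like any other side-effect-free computation. */
    __asm__ ("fmac fr0,%2,%0" : "+f" (c) : "w" (a), "f" (b));
    return c;
#else
    /* Portable fallback for other targets or FPU modes. */
    return a * b + c;
#endif
}
```

With such a wrapper, the body of a multiply-add routine would simply be `return sh4_fmac(v, v2, v);`.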





I'll report examples here as I find compelling situations.



But on a tangent, can you explain this behaviour? It's really ruining my code:



float testfunc(float v, float v2)
{
    return v*v2 + v;
}



Compiled with: -O3 -mfused-madd



testfunc:
.LFB1:
    .cfi_startproc
    mov.l   .L3,r1       ;
    lds.l   @r1+,fpscr   ; <- why does it mess with fpscr?
    add     #-4,r1
    fmov    fr5,fr0
    add     #4,r1        ; <- +4 after -4... redundant?
    fmac    fr0,fr4,fr0
    rts
    lds.l   @r1+,fpscr
.L4:
    .align 2
.L3:
    .long   __fpscr_values
    .cfi_endproc



There's a lot of rubbish in there... I expect:



testfunc:
.LFB1:
    .cfi_startproc
    fmov    fr5,fr0
    fmac    fr0,fr4,fr0
    rts
    .cfi_endproc





I'm also noticing that -ffast-math is inhibiting fmac emission in some cases:



Compiled with: -O3 -mfused-madd -ffast-math



testfunc:
.LFB1:
    .cfi_startproc
    mov.l   .L3,r1
    lds.l   @r1+,fpscr
    fldi1   fr0          ; what is a 1.0 doing here?
    add     #-4,r1
    add     #4,r1
    fadd    fr4,fr0      ; v+1 ??
    fmul    fr5,fr0      ; (v+1)*v2 ?? That's not what the code does...
    rts
    lds.l   @r1+,fpscr



What's going on there? That doesn't even look correct...
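
One possible reading of the -ffast-math sequence, as a sketch: in the fmac listing above, fmov fr5,fr0 followed by fmac fr0,fr4,fr0 only yields v*v2 + v if fr5 holds v and fr4 holds v2. Assuming the same register assignment here, fadd fr4,fr0 would compute v2 + 1 and fmul fr5,fr0 would compute v*(v2 + 1), which is the factored form of the source expression. The two helper names below are hypothetical and exist only to compare the two forms.

```c
/* The expression as written in testfunc. */
static float madd_form(float v, float v2)
{
    return v * v2 + v;
}

/* The fldi1/fadd/fmul sequence, assuming fr5 = v and fr4 = v2. */
static float factored_form(float v, float v2)
{
    float t = 1.0f;   /* fldi1 fr0    : fr0 = 1.0      */
    t = v2 + t;       /* fadd fr4,fr0 : fr0 = v2 + 1   */
    t = v * t;        /* fmul fr5,fr0 : fr0 = v * fr0  */
    return t;
}
```

Algebraically both return the same value; the rounding can differ, which would explain why the factoring only shows up under -ffast-math, and it also leaves no multiply-add for fmac to match.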



Cheers!
