Hi!

On Mon, Oct 31, 2022 at 10:42:35AM +0800, Jiufu Guo wrote:
> #define FN 4
> typedef struct { double a[FN]; } A;
> 
> A foo (const A *a) { return *a; }
> A bar (const A a) { return a; }
> ///////
> 
> If FN<=2, the size of "A" fits into TImode, and this code can be optimized 
> (by subreg/cse/fwprop/cprop) as:
> -------
> foo:
> .LFB0:
>         .cfi_startproc
>         blr
> 
> bar:
> .LFB1:
>         .cfi_startproc
>         lfd 2,8(3)
>         lfd 1,0(3)
>         blr
> --------

I think you swapped foo and bar here?

> If the size of "A" is larger than any INT mode size, RTL insns would be 
> generated as:
>    13: r125:V2DI=[r112:DI+0x20]
>    14: r126:V2DI=[r112:DI+0x30]
>    15: [r112:DI]=r125:V2DI
>    16: [r112:DI+0x10]=r126:V2DI  /// memcpy for assignment: D.3338 = arg;
>    17: r127:DF=[r112:DI]
>    18: r128:DF=[r112:DI+0x8]
>    19: r129:DF=[r112:DI+0x10]
>    20: r130:DF=[r112:DI+0x18]
> ------------
> 
> I'm thinking about ways to improve this.
> Method1: One way may be to change the memory copy to follow the type 
> of the struct if the size of the struct is not too big, and generate insns 
> like the below:
>    13: r125:DF=[r112:DI+0x20]
>    15: r126:DF=[r112:DI+0x28]
>    17: r127:DF=[r112:DI+0x30]
>    19: r128:DF=[r112:DI+0x38]
>    14: [r112:DI]=r125:DF
>    16: [r112:DI+0x8]=r126:DF
>    18: [r112:DI+0x10]=r127:DF
>    20: [r112:DI+0x18]=r128:DF
>    21: r129:DF=[r112:DI]
>    22: r130:DF=[r112:DI+0x8]
>    23: r131:DF=[r112:DI+0x10]
>    24: r132:DF=[r112:DI+0x18]

This is much worse though?  The expansion with memcpy used V2DI, which
typically is close to 2x faster than DFmode accesses.

Or are you trying to avoid small reads of large stores here?  Those
aren't so bad; large reads of small stores are the nastiness we need to
avoid.
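
To make the two access patterns concrete, here is a minimal C sketch.
The type and function names are made up for illustration, a double is
assumed to be 8 bytes, and whether a compiler really emits one wide
access for each memcpy depends on the target and the options used:

```c
#include <string.h>

/* Hypothetical 16-byte helper type, not from the original example.  */
typedef struct { double a[2]; } Pair;

/* Small reads of a large store: the copy can be done as one 16-byte
   store, and each of the two 8-byte loads that follow lies entirely
   within the bytes that store wrote.  */
double small_reads_of_large_store (Pair *tmp, const Pair *src)
{
  memcpy (tmp, src, sizeof *tmp);   /* possibly one wide store */
  return tmp->a[0] + tmp->a[1];     /* two narrow loads */
}

/* Large read of small stores: two 8-byte stores, then a 16-byte read
   that needs bytes from both of them.  */
Pair large_read_of_small_stores (Pair *tmp, double x, double y)
{
  Pair out;
  tmp->a[0] = x;                    /* narrow store */
  tmp->a[1] = y;                    /* narrow store */
  memcpy (&out, tmp, sizeof out);   /* possibly one wide load */
  return out;
}
```

The first function has the shape of insns 15..18 in the current
expansion; the second has the shape to stay away from.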

The code we have now does

   15: [r112:DI]=r125:V2DI
...
   17: r127:DF=[r112:DI]
   18: r128:DF=[r112:DI+0x8]

Can you make this optimised to not use a memory temporary at all, just
immediately assign from r125 to r127 and r128?
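
As a rough C-level picture of that data flow (only a sketch: Chunk16,
split_chunk and sum_first_half are hypothetical names, a double is
assumed to be 8 bytes, and nothing here says how the RTL passes would
actually get there):

```c
#include <string.h>

typedef struct { double a[4]; } A;  /* same layout as the FN 4 example */

/* Hypothetical 16-byte chunk standing in for one V2DI register.  */
typedef struct { unsigned char bytes[16]; } Chunk16;

/* Take both doubles directly out of the chunk, in memory order,
   instead of storing the chunk to a temporary and reloading it.  */
static void split_chunk (Chunk16 c, double *first, double *second)
{
  memcpy (first, c.bytes, sizeof *first);
  memcpy (second, c.bytes + 8, sizeof *second);
}

double sum_first_half (const A *a)
{
  Chunk16 c;
  double first, second;
  memcpy (&c, a->a, sizeof c);      /* one wide load, like insn 13 */
  split_chunk (c, &first, &second); /* plays the role of r127 and r128 */
  return first + second;
}
```

The point is only that first and second are derived from the loaded
value itself; the stack temporary from the current expansion does not
appear anywhere.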

> Method2: One way may be enhancing CSE so that it can treat one large
> memory slot as two (or more) combined slots: 
>    13: r125:V2DI#0=[r112:DI+0x20]
>    13': r125:V2DI#8=[r112:DI+0x28]
>    15: [r112:DI]#0=r125:V2DI#0
>    15': [r112:DI]#8=r125:V2DI#8
> 
> This may seem more of a hack in CSE.

The current CSE pass is the pass most in need of a full rewrite, and has
been for many, many years.  It does a lot of things, important things
that we should not lose, but it does a pretty bad job of CSE.

> Method3: For some record types, use "PARALLEL:BLK" instead of "MEM:BLK".

:BLK can never be optimised well.  It always has to live in memory, by
definition.


Segher