Segher Boessenkool <seg...@kernel.crashing.org> writes:

> Hi!
>
> On Mon, Oct 31, 2022 at 10:42:35AM +0800, Jiufu Guo wrote:
>> #define FN 4
>> typedef struct { double a[FN]; } A;
>> 
>> A foo (const A *a) { return *a; }
>> A bar (const A a) { return a; }
>> ///////
>> 
>> If FN<=2, the size of "A" fits into TImode, and this code can be optimized 
>> (by subreg/cse/fwprop/cprop) as:
>> -------
>> foo:
>> .LFB0:
>>         .cfi_startproc
>>         blr
>> 
>> bar:
>> .LFB1:
>>         .cfi_startproc
>>         lfd 2,8(3)
>>         lfd 1,0(3)
>>         blr
>> --------
>
> I think you swapped foo and bar here?
Oh, thanks!
>
>> If the size of "A" is larger than any INT mode size, RTL insns would be 
>> generated as:
>>    13: r125:V2DI=[r112:DI+0x20]
>>    14: r126:V2DI=[r112:DI+0x30]
>>    15: [r112:DI]=r125:V2DI
>>    16: [r112:DI+0x10]=r126:V2DI  /// memcpy for assignment: D.3338 = arg;
>>    17: r127:DF=[r112:DI]
>>    18: r128:DF=[r112:DI+0x8]
>>    19: r129:DF=[r112:DI+0x10]
>>    20: r130:DF=[r112:DI+0x18]
>> ------------
>> 
>> I'm thinking about ways to improve this.
>> Method1: One way may be to change the memory copy to reference the type 
>> of the struct, if the size of the struct is not too big, and generate 
>> insns like the below:
>>    13: r125:DF=[r112:DI+0x20]
>>    15: r126:DF=[r112:DI+0x28]
>>    17: r127:DF=[r112:DI+0x30]
>>    19: r128:DF=[r112:DI+0x38]
>>    14: [r112:DI]=r125:DF
>>    16: [r112:DI+0x8]=r126:DF
>>    18: [r112:DI+0x10]=r127:DF
>>    20: [r112:DI+0x18]=r128:DF
>>    21: r129:DF=[r112:DI]
>>    22: r130:DF=[r112:DI+0x8]
>>    23: r131:DF=[r112:DI+0x10]
>>    24: r132:DF=[r112:DI+0x18]
>
> This is much worse though?  The expansion with memcpy used V2DI, which
> typically is close to 2x faster than DFmode accesses.
Using V2DI helps to access twice as many bytes at a time as DF/DI.
But since those DF/DI loads can be executed in parallel, using DF/DI
should not be too bad.

>
> Or are you trying to avoid small reads of large stores here?  Those
> aren't so bad, large reads of small stores are the nastiness we need to
> avoid.
Here, I want to use 2 DF reads (instead of one V2DI), because
optimizations like cse/fwprop/dse can then eliminate memory accesses of
the same size.
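For example, on the Method1 sequence above, the result I hope those passes
can reach is roughly the following (just a sketch of the intent; the exact
insn numbers and pseudos are only illustrative):
   13: r125:DF=[r112:DI+0x20]
   15: r126:DF=[r112:DI+0x28]
   17: r127:DF=[r112:DI+0x30]
   19: r128:DF=[r112:DI+0x38]
   ;; stores 14/16/18/20 become dead and dse can delete them; reads 21-24
   ;; turn into plain copies (r129=r125, ...) that cprop/cse can fold away.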
>
> The code we have now does
>
>    15: [r112:DI]=r125:V2DI
> ...
>    17: r127:DF=[r112:DI]
>    18: r128:DF=[r112:DI+0x8]
>
> Can you make this optimised to not use a memory temporary at all, just
> immediately assign from r125 to r127 and r128?
r125 is not directly assigned to r127/r128, since 'insn 15' and 'insn
17/18' are expanded from different gimple stmts:
  D.3331 = a;    ==> 'insn 15' is generated for the struct assignment here.
  return D.3331; ==> 'insn 17/18' are prepared for the return registers.

I'm trying to eliminate those memory temporaries, but have not found a
good way yet.  Methods 1-3 are the ideas I'm trying in order to delete
them.
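Ideally, for the original V2DI expansion, I would like to end up with the
direct assignments you describe, roughly like the following (only a sketch,
borrowing the subreg-style notation from Method2 below; I have not checked
whether such DF subregs of a V2DI pseudo are valid and cheap on the target):
   17: r127:DF=r125:V2DI#0
   18: r128:DF=r125:V2DI#8
   19: r129:DF=r126:V2DI#0
   20: r130:DF=r126:V2DI#8
   ;; the stores 15/16 into the temporary then become dead, and dse could
   ;; remove them together with the stack slot.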

>
>> Method2: One way may be to enhance CSE so it can treat one large memory
>> slot as two (or more) combined slots: 
>>    13: r125:V2DI#0=[r112:DI+0x20]
>>    13': r125:V2DI#8=[r112:DI+0x28]
>>    15: [r112:DI]#0=r125:V2DI#0
>>    15': [r112:DI]#8=r125:V2DI#8
>> 
>> This may seem like more of a hack in CSE.
>
> The current CSE pass is the pass most in need of a full rewrite, and has
> been for many, many years.  It does a lot of things, important things
> that we should not lose, but it does a pretty bad job of CSE.
>
>> Method3: For some record types, use "PARALLEL:BLK" instead of "MEM:BLK".
>
> :BLK can never be optimised well.  It always has to live in memory, by
> definition.

Thanks for your suggestions!

BR,
Jeff (Jiufu)
>
>
> Segher
