Segher Boessenkool <seg...@kernel.crashing.org> writes:

> Hi!
>
> On Mon, Oct 31, 2022 at 10:42:35AM +0800, Jiufu Guo wrote:
>> #define FN 4
>> typedef struct { double a[FN]; } A;
>>
>> A foo (const A *a) { return *a; }
>> A bar (const A a) { return a; }
>> ///////
>>
>> If FN<=2, the size of "A" fits into TImode, and this code can be optimized
>> (by subreg/cse/fwprop/cprop) as:
>> -------
>> foo:
>> .LFB0:
>>         .cfi_startproc
>>         blr
>>
>> bar:
>> .LFB1:
>>         .cfi_startproc
>>         lfd 2,8(3)
>>         lfd 1,0(3)
>>         blr
>> --------
>
> I think you swapped foo and bar here?

Oh, thanks!

>> If the size of "A" is larger than any INT mode size, RTL insns would be
>> generated as:
>> 13: r125:V2DI=[r112:DI+0x20]
>> 14: r126:V2DI=[r112:DI+0x30]
>> 15: [r112:DI]=r125:V2DI
>> 16: [r112:DI+0x10]=r126:V2DI   /// memcpy for assignment: D.3338 = arg;
>> 17: r127:DF=[r112:DI]
>> 18: r128:DF=[r112:DI+0x8]
>> 19: r129:DF=[r112:DI+0x10]
>> 20: r130:DF=[r112:DI+0x18]
>> ------------
>>
>> I'm thinking about ways to improve this.
>> Method1: One way may be changing the memory copy by referencing the type
>> of the struct, if the size of the struct is not too big, and generating
>> insns like the below:
>> 13: r125:DF=[r112:DI+0x20]
>> 15: r126:DF=[r112:DI+0x28]
>> 17: r127:DF=[r112:DI+0x30]
>> 19: r128:DF=[r112:DI+0x38]
>> 14: [r112:DI]=r125:DF
>> 16: [r112:DI+0x8]=r126:DF
>> 18: [r112:DI+0x10]=r127:DF
>> 20: [r112:DI+0x18]=r128:DF
>> 21: r129:DF=[r112:DI]
>> 22: r130:DF=[r112:DI+0x8]
>> 23: r131:DF=[r112:DI+0x10]
>> 24: r132:DF=[r112:DI+0x18]
>
> This is much worse though?  The expansion with memcpy used V2DI, which
> typically is close to 2x faster than DFmode accesses.

Using V2DI does access 2x as many bytes per instruction as DF/DI.  But
since those loads can execute in parallel, using DF/DI should not be too
bad.
> Or are you trying to avoid small reads of large stores here?  Those
> aren't so bad, large reads of small stores is the nastiness we need to
> avoid.

Here, I want to use two DF reads, because optimizations like cse/fwprop/dse
can eliminate memory accesses of the same size.

> The code we have now does
>
> 15: [r112:DI]=r125:V2DI
> ...
> 17: r127:DF=[r112:DI]
> 18: r128:DF=[r112:DI+0x8]
>
> Can you make this optimised to not use a memory temporary at all, just
> immediately assign from r125 to r127 and r128?

r125 is not directly assigned to r127/r128, since insn 15 and insns 17/18
are expanded from different gimple stmts:
D.3331 = a;     ==> insn 15 is generated for the struct assignment here.
return D.3331;  ==> insns 17/18 are prepared for the return registers.

I'm trying to eliminate those memory temporaries, but have not found a good
way yet.  Method1-3 are the ideas I'm trying to use to delete them.

>> Method2: One way may be enhancing CSE to make it able to treat one large
>> memory slot as two (or more) combined slots:
>> 13:  r125:V2DI#0=[r112:DI+0x20]
>> 13': r125:V2DI#8=[r112:DI+0x28]
>> 15:  [r112:DI]#0=r125:V2DI#0
>> 15': [r112:DI]#8=r125:V2DI#8
>>
>> This may seem more hacky in CSE.
>
> The current CSE pass we have is the pass most in need of a full rewrite
> we have, since many many years.  It does a lot of things, important
> things that we should not lose, but it does a pretty bad job of CSE.
>
>> Method3: For some record types, use "PARALLEL:BLK" instead of "MEM:BLK".
>
> :BLK can never be optimised well.  It always has to live in memory, by
> definition.

Thanks for your suggestions!

BR,
Jeff (Jiufu)

> Segher