On Thu, Aug 20, 2020 at 06:33:29PM -0500, Segher Boessenkool wrote:
> > These patches allow the load of the address to not be physically adjacent to
> > the actual load or store, which should allow for better code.
>
> Why is that?  That is not what it does anyway?  /confused

It does allow that.  Perhaps I'm not being clear.  Consider this example:

    extern int a, b, c;

    int sum (void)
    {
      return a + b + c;
    }

With my patches it generates:

    sum:
            pld 8,a@got@pcrel
    .Lpcrel1:
            pld 10,b@got@pcrel
    .Lpcrel2:
            pld 9,c@got@pcrel
    .Lpcrel3:
            .reloc .Lpcrel1-8,R_PPC64_PCREL_OPT,.-(.Lpcrel1-8)
            lwz 3,0(8)
            .reloc .Lpcrel2-8,R_PPC64_PCREL_OPT,.-(.Lpcrel2-8)
            lwz 10,0(10)
            .reloc .Lpcrel3-8,R_PPC64_PCREL_OPT,.-(.Lpcrel3-8)
            lwz 9,0(9)
            add 3,3,10
            add 3,3,9
            extsw 3,3
            blr

Thus it separates the loads of the 3 external addresses from the actual LWZ
instructions used to load the values.

For example, in a recent Spec 2017 build for power10, over all of the
benchmarks:

    Total PCREL_OPTs found for load/store:                    41,440
    Times the PLD/PLA was separated from the load/store:      17,893
    Times the PLD/PLA was adjacent to the load/store:         23,547

    Number of PCREL_OPT loads:                                38,657
    Number of PCREL_OPT loads separated from the PLD:         15,768
    Number of PCREL_OPT loads adjacent to the PLD:            22,889

    Number of PCREL_OPT stores:                                2,783
    Number of PCREL_OPT stores separated from the PLD:         2,125
    Number of PCREL_OPT stores adjacent to the PLD:              658

Where it wins is when the external variable is in a shared library.  There the
PLD is in fact a load, and having some separation from the dependent load/store
helps.

> > In order to do this, the pass that converts the load address and load/store
> > must occur late in the compilation cycle.
>
> That does not follow afaics.
>
> > In particular, the second scheduler
> > pass will duplicate and optimize some of the references and it will produce
> > an invalid program.  In the past, Segher has said that we should be able to
> > move it earlier.
>
> I said that you shouldn't require this to be the very last pass.  There
> is no reason for that, and that will not scale (what if a second pass
> shows up that also requires this!)

The patches I submitted don't require it to be the 'last' pass.
In fact, I put it after sched2 because earlier versions of the patch could not
be moved earlier.  There are 11 passes after sched2 before final.

However, it turns out that in the last spin of the patches, I added the
necessary clobbers and such, so it can now go anywhere after register
allocation.

I built 3 versions of the compiler:

    The first version had the pass after sched2 (the version in the patches);
    The second version had the pass before sched2; and
    The third version had the pass immediately after reload.

I built Spec 2017 with the two compilers.  Unlike before, there were no linker
failures.  I also wrote a perl script to verify that each PCREL_OPT relocation
targeted only one PLD/PLA with one load or store.

> It also makes it impossible to do normal late optimisations on code
> produced here (optimisations like peephole, cprop_hardreg, dce).

Now it can do those optimizations.

> I also said that you should use the DF framework, not parse all RTL by
> hand and getting it all wrong, as *everyone* does: this stuff is hard.

Bill has said he would look into helping convert it to DF format.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meiss...@linux.ibm.com, phone: +1 (978) 899-4797