On Thu, Aug 20, 2020 at 06:33:29PM -0500, Segher Boessenkool wrote:
> > These patches allow the load of the address to not be physically adjacent to
> > the actual load or store, which should allow for better code.
> 
> Why is that?  That is not what it does anyway?  /confused

It does allow that.  Perhaps I'm not being clear.

Consider this example:

        extern int a, b, c;

        int sum (void)
        {
          return a + b + c;
        }

With my patches it generates:

        sum:
                pld 8,a@got@pcrel
        .Lpcrel1:

                pld 10,b@got@pcrel
        .Lpcrel2:

                pld 9,c@got@pcrel
        .Lpcrel3:

                .reloc .Lpcrel1-8,R_PPC64_PCREL_OPT,.-(.Lpcrel1-8)
                lwz 3,0(8)

                .reloc .Lpcrel2-8,R_PPC64_PCREL_OPT,.-(.Lpcrel2-8)
                lwz 10,0(10)

                .reloc .Lpcrel3-8,R_PPC64_PCREL_OPT,.-(.Lpcrel3-8)
                lwz 9,0(9)

                add 3,3,10
                add 3,3,9
                extsw 3,3
                blr


Thus the three PLDs that load the external addresses are separated from the LWZ
instructions that actually load the values.
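
To make the `.reloc` expressions concrete, here is a rough sketch of the offsets they would evaluate to for the example above.  It assumes that `sum` starts at offset 0 and that the assembler inserts no alignment padding; both are assumptions for illustration only.  PLD is an 8-byte prefixed instruction, so `.LpcrelN-8` names the PLD itself, and `.` at the directive is the address of the load that follows.  The function name `pcrel_opt_relocs` is hypothetical.

```python
# Hypothetical layout: the function starts at offset 0 and no alignment
# padding is inserted (assumptions, not taken from real assembler output).
PLD_SIZE = 8   # pld is a prefixed (8-byte) instruction
LWZ_SIZE = 4   # lwz is an ordinary 4-byte instruction


def pcrel_opt_relocs(n_pairs):
    """Return (reloc_offset, addend) per pair, with all PLDs first, then all loads."""
    relocs = []
    loads_start = n_pairs * PLD_SIZE
    for i in range(n_pairs):
        label = (i + 1) * PLD_SIZE                # .LpcrelN sits right after its PLD
        pld_offset = label - PLD_SIZE             # ".LpcrelN-8" in the directive
        load_offset = loads_start + i * LWZ_SIZE  # "." where the .reloc appears
        relocs.append((pld_offset, load_offset - pld_offset))
    return relocs


print(pcrel_opt_relocs(3))  # [(0, 24), (8, 20), (16, 16)]
```

Under those assumptions each relocation sits on its PLD, and its addend is the distance from that PLD to the matching LWZ.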

For example, in a recent SPEC 2017 build for power10, summed over all of the
benchmarks:

    Total PCREL_OPTs found for load/store:                      41,440
    Times the PLD/PLA was separated from the load/store:        17,893
    Times the PLD/PLA was adjacent to the load/store:           23,547

    Number of PCREL_OPT loads:                                  38,657
    Number of PCREL_OPT loads separated from the PLD:           15,768
    Number of PCREL_OPT loads adjacent to the PLD:              22,889

    Number of PCREL_OPT stores:                                  2,783
    Number of PCREL_OPT stores separated from the PLD:           2,125
    Number of PCREL_OPT stores adjacent to the PLD:                658

The win comes when the external variable lives in a shared library.  There the
PLD really is a load (of the address from the GOT), and having some separation
from the dependent load/store helps.
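
As a quick cross-check of the counts in the table above, the following sketch recomputes the share of PCREL_OPT pairs in which the PLD/PLA was separated from its load or store.  The numbers are copied from the table; the dictionary layout and the helper name `separated_share` are only illustrative.

```python
# Counts copied from the table above.
counts = {
    "all":    {"separated": 17893, "adjacent": 23547, "total": 41440},
    "loads":  {"separated": 15768, "adjacent": 22889, "total": 38657},
    "stores": {"separated": 2125,  "adjacent": 658,   "total": 2783},
}


def separated_share(row):
    """Percentage of pairs where the PLD/PLA is not adjacent to the access."""
    return 100.0 * row["separated"] / row["total"]


for name, row in counts.items():
    # Each row should add up: separated + adjacent == total.
    assert row["separated"] + row["adjacent"] == row["total"]
    print(f"{name:7s} {separated_share(row):5.1f}% separated")
```

This gives roughly 43.2% separated overall, 40.8% for loads, and 76.4% for stores.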


> > In order to do this, the pass that converts the load address and load/store
> > must occur late in the compilation cycle.
> 
> That does not follow afaics.
> 
> > In particular, the second scheduler pass will duplicate and optimize some
> > of the references and it will produce an invalid program.  In the past,
> > Segher has said that we should be able to move it earlier.
> 
> I said that you shouldn't require this to be the very last pass.  There
> is no reason for that, and that will not scale (what if a second pass
> shows up that also requires this!)

The patches I submitted don't require it to be the 'last' pass.  In fact, I put
it after sched2 because earlier versions of the patch could not be moved
earlier.  There are 11 passes after sched2 before final.

However, it turns out that in the last spin of the patches, I added the
necessary clobbers and such, so it can now go anywhere after register
allocation.  I built 3 versions of the compiler:

    The first version had the pass after sched2 (the version in the patches);
    The second version had the pass before sched2; and
    The third version had the pass immediately after reload.

I built Spec 2017 with the two compilers.  Unlike before, there were no linker
failures.  I also wrote a Perl script to verify that each PCREL_OPT relocation
pairs exactly one PLD/PLA with exactly one load or store.
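
A check of this kind could look roughly like the sketch below.  It is written in Python rather than Perl, and it is not the script used for the results above.  It assumes the assembly has the shape of the example earlier in this mail (a `.LpcrelN:` label directly after each PLD/PLA, and a `.reloc` directive directly before the paired load or store).  The function name `check_pcrel_opt` and both regular expressions are hypothetical.

```python
import re

# Directive and label shapes copied from the example output earlier in this
# mail; real compiler output may differ, so treat both patterns as assumptions.
RELOC_RE = re.compile(
    r"^\.reloc\s+(\.Lpcrel\d+)-8,R_PPC64_PCREL_OPT,\.-\(\1-8\)$")
LABEL_RE = re.compile(r"^(\.Lpcrel\d+):$")


def check_pcrel_opt(asm_text):
    """Return a list of problems; an empty list means every check passed."""
    lines = [ln.strip() for ln in asm_text.splitlines() if ln.strip()]
    defs, uses, problems = {}, {}, []
    for i, line in enumerate(lines):
        label_m = LABEL_RE.match(line)
        reloc_m = RELOC_RE.match(line)
        if label_m:
            label = label_m.group(1)
            defs[label] = defs.get(label, 0) + 1
            # The label must directly follow the PLD/PLA it marks.
            prev = lines[i - 1].split() if i > 0 else []
            if not prev or prev[0] not in ("pld", "pla"):
                problems.append(f"{label}: label does not follow a pld/pla")
        elif reloc_m:
            label = reloc_m.group(1)
            uses[label] = uses.get(label, 0) + 1
            # The directive must directly precede an instruction, not a
            # label or another directive.
            nxt = lines[i + 1] if i + 1 < len(lines) else ""
            if not nxt or nxt.startswith(".") or nxt.endswith(":"):
                problems.append(
                    f"{label}: .reloc is not followed by an instruction")
    # Every label must be defined once and named by exactly one relocation.
    for label in sorted(set(defs) | set(uses)):
        if defs.get(label, 0) != 1 or uses.get(label, 0) != 1:
            problems.append(
                f"{label}: {defs.get(label, 0)} definitions, "
                f"{uses.get(label, 0)} relocations")
    return problems


sample = """
        pld 8,a@got@pcrel
.Lpcrel1:
        .reloc .Lpcrel1-8,R_PPC64_PCREL_OPT,.-(.Lpcrel1-8)
        lwz 3,0(8)
"""
print(check_pcrel_opt(sample) or "ok")
```

The sketch only checks that the line after each `.reloc` is an instruction; it does not verify that the instruction is actually a load or store, which a real check would need to do.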

> It also makes it impossible to do normal late optimisations on code
> produced here (optimisations like peephole, cprop_hardreg, dce).

Now that the pass can run earlier, those optimizations can be done on its output.

> I also said that you should use the DF framework, not parse all RTL by
> hand and getting it all wrong, as *everyone* does: this stuff is hard.

Bill has said he would look into helping convert it to use the DF framework.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meiss...@linux.ibm.com, phone: +1 (978) 899-4797
