[ I’ll describe the state of the code as I finished with it; Bin’s answers 
should be similar to mine, but if he improved things, they could be better. ]

On Oct 10, 2014, at 2:13 PM, Jeff Law <l...@redhat.com> wrote:
> So, some questions.  Let's assume I've got 3 kinds of insns.  A B & C.
> 
> I can fuse AB or AC, but not BC.  In fact, moving B & C together may 
> significantly harm performance.

We can only choose one ordering for fusion consideration, so you have to pick 
between AB and AC as the ordering.  Once you decide, the other is hidden from 
consideration.

The priorities are merely used to select an instruction ordering in which to 
consider pairwise (adjacent) opportunities.

They can chain and stack, so if you want A B C to fuse into a new ABC 
instruction, it will.
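
As a purely hypothetical illustration (the priority values are made up), 
suppose A, B and C all get fusion priority 10, with minor priorities 3, 2 and 1:

A    fusion_priority 10, priority 3
B    fusion_priority 10, priority 2
C    fusion_priority 10, priority 1

The chosen ordering then puts A, B and C next to each other, the pairwise 
(adjacent) consideration can combine A and B into AB, and a later pairwise look 
at AB and C combines that into ABC.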

> So my question is can a given insn have different fusion priorities depending 
> on its scheduling context?

That is an area for future improvement.  In theory you could feed any other 
state into the creation of the priority, but today only the single insn feeds 
into it.

> So perhaps an example.  Let's say I have an insn stream with the following 
> kinds of instructions, all ready at the same time.
> 
> AAAAAAAABBBBCCCC
> 
> Can I create 8 distinct fusion priorities such that I ultimately schedule
> AB(1) AB(2) AB(3) AB(4) AC(5) AC(6) AC(7) AC(8)
> 
> I guess another way to ask the question, are fusion priorities static based 
> on the insn/alternative, or can they vary?  And if they can vary, can they 
> vary each tick of the scheduler?

So, it wasn’t created to solve that problem.  If the scheduler would tend to 
put them next to each other because, structurally, they fit that way, then the 
peephole pass should already have a chance to see them that way.

> Now the next issue is I really don't want all those to fire 
> back-to-back-to-back.  I'd like some other insns to be inserted between each 
> fusion pair if they're in the ready list.

Scheduling figures out how to lay out the instructions, and that still happens.  
Fusion runs before scheduling and is separate; it doesn’t replace or change 
scheduling.

> I guess the easiest way to get that is to assign the same fusion priority to 
> other insns in the ready queue, even though they don't participate in fusion. 
>  So
> 
> ABX(1) ABY(2).....
> 
> Where X & Y are some other arbitrary insns that don't participate in the AB 
> fusion, but will issue in the same cycle as the AB fused insn.

You’re conflating fusion with scheduling; they are separate.  Fusion doesn’t 
deal with time or cycles.  It exists only to fuse two separate instructions 
together.  Those products will be scheduled later, and post scheduling, all the 
holes will be filled to the extent the scheduler is able to find work to do.

For example, given:

ld 0
inc
ld 1
inc
ld 2
inc
ld 3
inc

and we want to fuse ld <even> with ld <even>+1, fusion gives us:

ld 01
ld 23
inc
inc

then, post scheduling, we get:

ld 01
inc
ld 23
inc

as the scheduler fills the holes.

> Though I guess if we run fusion + peep2 between sched1 and sched2, that 
> problem would just resolve itself as we'd have fused AB together into a new 
> insn and we'd schedule normally with the fused insns and X, Y.

Yes, in my version I ran it really early, before sched.  I needed to run before 
ira, and before other passes changed the code so much that they obscured the 
instructions I wanted to fuse.  One thing I saw was that once we got too far 
along, the offsets used were loaded into registers, and instead of being r+n 
the address was r1+r2, and this blew the sorting.
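
As a made-up illustration of that: a pair of loads that starts out as

ld r3,[r1+8]
ld r4,[r1+12]

can, by the time later passes have run, look like

mov r2,#12
ld r3,[r1+8]
ld r4,[r1+r2]

and once the offset is hidden in a register there is no constant left to sort 
the second load by, so the fusion priorities can no longer put the pair in 
address order.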

>> So here comes this patch.  It adds a new sched_fusion pass just before
>> peephole2.

Hum…  In my version, that was just way too late.  I needed to generate pseudos 
and count on ira and others to clean up the register moves for me.  I’ve not 
tried his patch to see if it works as well as my version does on my problem 
(combining adjacent loads and stores).

>>  The methodology is like:
>> 1) The priority in scheduler is extended into [fusion_priority, priority]
>> pair, with fusion_priority as the major key and priority as the minor key.
>> 2) The back-end assigns priorities pair to each instruction, instructions
>> want to be fused together get same fusion_priority assigned.
> I think the bulk of my questions above are targetted at this part of the 
> change.  When are these assignments made and how much freedom does the 
> backend have to make/change those assignments.

The back end is free to do anything it wants with the numbers.  But since the 
port writer has carefully chosen the priorities to generate the best code, it 
is unlikely it would ever be wise to change the numbers afterwards.
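
To make the shape of that back-end work concrete, here is a minimal C sketch of 
the kind of hook that could assign the [fusion_priority, priority] pair.  The 
hook name, the max_pri convention and the split_const_base_offset helper are 
assumptions for illustration only; the real interface is whatever the patch 
defines.  The idea is simply that loads off the same base register share a 
fusion priority (the major key) and are ordered by their offsets (the minor 
key), with everything else lumped into one bucket:

/* A sketch only, written as if inside a GCC back end (rtl.h and
   friends already included).  */

/* Hypothetical helper: accept reg and reg+const addresses, giving up
   on reg+reg (the r1+r2 case mentioned above).  */
static bool
split_const_base_offset (rtx addr, rtx *base, HOST_WIDE_INT *offset)
{
  if (REG_P (addr))
    {
      *base = addr;
      *offset = 0;
      return true;
    }
  if (GET_CODE (addr) == PLUS
      && REG_P (XEXP (addr, 0))
      && CONST_INT_P (XEXP (addr, 1)))
    {
      *base = XEXP (addr, 0);
      *offset = INTVAL (XEXP (addr, 1));
      return true;
    }
  return false;
}

static void
example_sched_fusion_priority (rtx_insn *insn, int max_pri,
                               int *fusion_pri, int *pri)
{
  rtx set = single_set (insn);
  rtx base;
  HOST_WIDE_INT offset;

  /* Instructions that take no part in fusion all share one key, so
     they stay out of the way of the candidates.  */
  *fusion_pri = max_pri;
  *pri = max_pri;

  if (!set || !MEM_P (SET_SRC (set)))
    return;

  if (!split_const_base_offset (XEXP (SET_SRC (set), 0), &base, &offset))
    return;

  /* Major key: one bucket per base register, so loads off the same
     base become neighbors.  Minor key: the offset, so neighbors end
     up in address order and pairwise fusion can chain.  */
  *fusion_pri = max_pri - REGNO (base);
  *pri = max_pri - (int) offset;
}

The only point of the sketch is the division of labor: everything that should 
end up adjacent gets the same major key, and the minor key decides the order 
within that group.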

> So another question, given a fused pair, is there any way to guarantee 
> ordering within the fused pair.

Sure, ensure one instruction has a lower number than the other.  That will 
order them reliably.

> This is useful to cut down on the number of peep2 patterns.  I guess we could 
> twiddle the priority in those cases to force a particular ordering of the 
> fused pair, right?

Right, assuming that there is an ordering for the instructions that makes sense. 
 Given:

mul r1,#13
inc r3
mul r5,#13
inc r4

and a machine that can do a V2 mul on a register pair, we’d get:

mov r8,r1
mov r9,r5
mulv2 r8,#13
inc r3
inc r4
mov r1,r8
mov r5,r9

Here we can see all the extra moves needed to ensure a register pair (the n, 
n+1 operands of the mulv2 instruction) and the temporary created to hold the 
tuple (r8,r9).  In my peephole patterns I do this to ensure registers end up 
with the right register numbers.  I do it before register allocation, as the 
allocator can choose the registers best, ensure all the other instructions 
around it use those registers, and leave the moves in only if it can’t do 
better.  The moves tend to be able to fill scheduling holes, so even if they 
remain, the cost for them usually shouldn’t be too bad.

On the other hand, if you have

left
left
right
right

there is no way to sort them to get:

left
right
left
right

and then fuse:

left_right
left_right

This would be impossible.

> I wonder if we could use this to zap all the hair I added to caller-save back 
> in the early 90s to try and widen the save/restore modes.  So instead of st; 
> st; call; ld; ld, we'd generate std; call; ldd.  It was a huge win for 
> floating point on the sparc processors of that time.  I don't expect you to 
> do that investigation.  Just thinking out loud.

Well, curiously enough, that sort of thing is exactly why I wrote the code.  I 
was thinking of cases where the user loads lots of registers from memory, plays 
a bit with them, and then puts the data back into memory.  I wanted to put all 
the loads together, then try to find the loads that are next to each other and 
combine them into a single, larger load.  On a machine that can load n 
registers at once, you then save n-1 instructions.  Likewise for stores.  
Conceptually, there is nothing port-specific about the optimization I wanted 
to do.
