On 09/30/14 03:22, Bin Cheng wrote:
Hi,
many load/store pairs as my old patch.  Then I decided to take one step
forward to introduce a generic instruction fusion infrastructure in GCC,
because in essence, load/store pair is no different from any other
instruction fusion; all these optimizations want is to push instructions
together in the instruction stream.
Great generalization.  And yes, you're absolutely right: what you're doing is building a fairly generic mechanism to mark insns that might fuse together.

So, some questions.  Let's assume I've got 3 kinds of insns: A, B & C.

I can fuse AB or AC, but not BC. In fact, moving B & C together may significantly harm performance.

So my question is can a given insn have different fusion priorities depending on its scheduling context?

So perhaps an example. Let's say I have an insn stream with the following kinds of instructions, all ready at the same time.

AAAAAAAABBBBCCCC

Can I create 8 distinct fusion priorities such that I ultimately schedule
AB(1) AB(2) AB(3) AB(4) AC(5) AC(6) AC(7) AC(8)

I guess another way to ask the question: are fusion priorities static based on the insn/alternative, or can they vary?  And if they can vary, can they vary at each tick of the scheduler?
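
To make that concrete to myself, here's a throwaway toy (nothing from the
actual patch; all names and numbers are invented) showing what I mean: if
each intended pair can be handed its own fusion priority, then sorting the
ready list on that key is exactly what lines the pairs up.

#include <stdio.h>
#include <stdlib.h>

/* One entry in a pretend ready list: the insn "kind" plus the fusion
   priority I'd want the backend to hand it.  */
struct toy_insn { char kind; int fusion_pri; };

/* Sort on fusion priority as the major key; break ties so A lands before
   its partner, standing in for the minor priority.  */
static int
cmp (const void *a, const void *b)
{
  const struct toy_insn *x = a, *y = b;
  if (x->fusion_pri != y->fusion_pri)
    return x->fusion_pri - y->fusion_pri;
  return x->kind - y->kind;
}

int
main (void)
{
  /* Ready list AAAAAAAABBBBCCCC; each A shares a fusion priority with the
     B or C it should pair with.  */
  struct toy_insn ready[] = {
    {'A',1},{'A',2},{'A',3},{'A',4},{'A',5},{'A',6},{'A',7},{'A',8},
    {'B',1},{'B',2},{'B',3},{'B',4},
    {'C',5},{'C',6},{'C',7},{'C',8},
  };

  qsort (ready, sizeof ready / sizeof ready[0], sizeof ready[0], cmp);

  /* Prints A(1) B(1) A(2) B(2) ... A(8) C(8), i.e. the pairing asked
     about above.  */
  for (size_t i = 0; i < sizeof ready / sizeof ready[0]; i++)
    printf ("%c(%d) ", ready[i].kind, ready[i].fusion_pri);
  printf ("\n");
  return 0;
}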



Now the next issue is I really don't want all those to fire back-to-back-to-back. I'd like some other insns to be inserted between each fusion pair if they're in the ready list. I guess the easiest way to get that is to assign the same fusion priority to other insns in the ready queue, even though they don't participate in fusion. So

ABX(1) ABY(2).....

Where X & Y are some other arbitrary insns that don't participate in the AB fusion, but will issue in the same cycle as the AB fused insn.

Though I guess if we run fusion + peep2 between sched1 and sched2, that problem would just resolve itself as we'd have fused AB together into a new insn and we'd schedule normally with the fused insns and X, Y.




So here comes this patch.  It adds a new sched_fusion pass just before
peephole2.  The methodology is as follows:
1) The priority in the scheduler is extended into a [fusion_priority, priority]
pair, with fusion_priority as the major key and priority as the minor key.
2) The back end assigns a priority pair to each instruction; instructions
that should be fused together get the same fusion_priority.
I think the bulk of my questions above are targeted at this part of the change.  When are these assignments made, and how much freedom does the backend have to make/change those assignments?
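
For reference, here's the shape I'm imagining for the backend side.  This
is purely a sketch on my part: the hook name, its signature and both
helpers marked "hypothetical" are guesses, not the patch's actual
interface, and it assumes the usual GCC backend context (rtl.h and
friends).

static void
example_sched_fusion_priority (rtx_insn *insn, int max_pri,
                               int *fusion_pri, int *pri)
{
  rtx base, offset;

  /* Insns that can't fuse get a neutral major key, so only the normal
     (minor) priority orders them.  */
  if (!example_load_or_store_p (insn)                   /* hypothetical */
      || !example_split_address (insn, &base, &offset)) /* hypothetical */
    {
      *fusion_pri = max_pri;
      *pri = max_pri;
      return;
    }

  /* Candidate loads/stores off the same base register share a major key,
     so they gravitate together in the ready list.  */
  *fusion_pri = max_pri - REGNO (base);

  /* Leave the minor key alone for now; ordering inside a group is the
     question below.  */
  *pri = max_pri;
}

If that's roughly the model, then my question is really about when this
gets called and whether it's free to give a different answer each time.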

So another question: given a fused pair, is there any way to guarantee ordering within the fused pair?  This is useful to cut down on the number of peep2 patterns.  I guess we could twiddle the priority in those cases to force a particular ordering of the fused pair, right?
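
Continuing the made-up sketch above, I'd picture that twiddle as nothing
more than deriving the minor key from the address offset (again, every
name here is invented):

static void
example_order_within_group (rtx base, rtx offset, int max_pri,
                            int *fusion_pri, int *pri)
{
  /* Same major key as before: one fusion group per base register.  */
  *fusion_pri = max_pri - REGNO (base);

  /* Minor key tracks the offset, so the lower-address access always
     sorts first and the pair reaches peep2 as "access base+0; access
     base+8"; a single ascending-offset pattern per mode should then be
     enough.  */
  *pri = max_pri - (int) INTVAL (offset);
}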

I wonder if we could use this to zap all the hair I added to caller-save back in the early 90s to try and widen the save/restore modes. So instead of st; st; call; ld; ld, we'd generate std; call; ldd. It was a huge win for floating point on the sparc processors of that time. I don't expect you to do that investigation. Just thinking out loud.





I collected performance data for both cortex-a15 and cortex-a57 (with a
local peephole ldp/stp patch); the benchmarks show clear improvement on
arm/aarch64.  I also collected instrumentation data on how many load/store
pairs are found.  For the four versions of load/store pair patches:
0) Mike's original patch.
1) My original prototype patch.
2) A cleaned-up version of Mike's pass (with implementation bugs resolved).
3) This new prototype fusion pass.

The numbers of paired opportunities satisfy the relations below:
3 * N0 ~ N1 ~ N2 < N3
For example, for one benchmark suite, we have:
N0 ~= 1300
N1/N2 ~= 5000
N3 ~= 7500
Nice.  Very nice.

Overall it's a fairly simple change.  I'll look deeper into it next week.

jeff
