> Based on the previous discussions, I tried to implement a tree loop
> unroller for partial unrolling. I would like to queue these RFC patches
> for the next stage1 review.
This is a great plan - GCC urgently requires a good unroller!
> * Cost-model for selecting the loop uses the same params used
> elsewhere in related optimizations. I was told that keeping this same
> would allow better tuning for all the optimizations.
I'd advise against using the existing params as-is. Unrolling by 8x by default
is way too aggressive and counterproductive. It was perhaps OK for in-order
cores 20 years ago, but not today. The goal of unrolling is to create more ILP
in loops, not to generate huge blocks of repeated code which definitely won't
fit in micro-op caches and loop buffers...
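To make the ILP point concrete, here is a hedged sketch (not from the patch) of what a modest 2x unroll buys on a reduction: splitting the accumulator breaks the serial dependence chain so two adds can be in flight at once, whereas an 8x unroll of the same loop mostly just adds code size:

```c
#include <stddef.h>

/* Illustrative only: 2x unroll of a sum reduction with two
   independent accumulators, so the floating-point adds form two
   dependence chains instead of one. */
float sum_unrolled2(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];       /* chain 1 */
        s1 += a[i + 1];   /* chain 2, independent of chain 1 */
    }
    for (; i < n; i++)    /* scalar epilogue for odd n */
        s0 += a[i];
    return s0 + s1;
}
```

Note this changes the association order of the additions, so for FP types the transformation is only valid under -ffast-math-style reassociation.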
Also, we need to enable this by default, at least at -O3, and maybe even for
small (or rather tiny) loops at -O2, like LLVM does.
> * I have also implemented an option to limit loops based on memory
> streams, i.e., for micro-architectures where limiting the resulting
> number of memory streams is preferred and is used to cap the unrolling factor.
I'm not convinced this is needed once you tune the unrolling parameters.
If you have, say, 4 read streams, you must already have more than 10
instructions, so you may want to unroll this 2x at -O3, but definitely not 8x.
So I see the issue as a problem caused by too-aggressive unroll settings.
I think if you address that first, you're unlikely to have an issue with too
many memory streams.
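As a hypothetical example of the "4 read streams" case, consider a loop like the one below: the four loads, multiply-adds, store, and induction-variable update already give a body of more than ten instructions, so even a 2x unroll doubles a substantial block:

```c
#include <stddef.h>

/* Hypothetical loop with four read streams (a, b, c, d) and one
   write stream (out). Per iteration: 4 loads, 2 multiplies, 1 add,
   1 store, plus induction-variable and branch overhead. */
void fma4(float *out, const float *a, const float *b,
          const float *c, const float *d, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i] + c[i] * d[i];
}
```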
> * I expect that some cost-model changes might be needed to handle
> (or provide the ability to handle) various loop preferences of
> the micro-architectures. I am sending this patch for review early to
> get feedback on this.
Yes, it should be feasible to have settings based on backend preferences
and optimization level (so O3/Ofast will unroll more than O2).
> * Position of the pass in passes.def can also be changed. Example,
> unrolling before SLP.
As long as it runs before IVOpt so we get base+immediate addressing modes.
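The addressing-mode point can be sketched as follows (an illustrative example, not code from the patch): if the unrolled copies still share a single induction variable when IVOpt runs, the neighbouring accesses differ only by a small constant, which maps directly onto base+immediate addressing instead of separately incremented pointers.

```c
#include <stddef.h>

/* Sketch: after a 4x unroll, the four loads share one base pointer
   with constant offsets, i.e. loads of a[i], a[i+1], a[i+2], a[i+3]
   become base+0, base+4, base+8, base+12 with a single i += 4 update.
   For simplicity, n is assumed to be a multiple of 4. */
float sum4(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    return s;
}
```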