Based on the previous discussions, I have implemented a tree loop
unroller for partial unrolling. I would like to queue these RFC patches
for review in the next stage1.
* The cost model for selecting loops uses the same params that the
related optimizations already use. I was told that keeping these the
same would allow better tuning across all the optimizations.
* I have also implemented an option to limit unrolling based on memory
streams, i.e., on some micro-architectures it is preferable to limit the
resulting memory streams, and this is used to limit the unrolling factor.
* I have tested this on variants of aarch64 and the results are
promising. I am in the process of running benchmarks on x86. I will
update the results later.
* I expect that some cost-model changes might be needed
to handle (or provide the ability to handle) the various loop preferences
of the micro-architectures. I am sending this patch for review early to
get feedback on this.
* The position of the pass in passes.def can also be changed, for example
to unroll before SLP.
* I have bootstrapped and regression tested on aarch64-linux-gnu.
There are no execution errors or ICEs. There are some testsuite
differences, as expected. A few of them need further evaluation and I am
doing that now.
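To make the memory-stream limit above concrete, here is a minimal sketch
of how an unrolling factor could be clamped. It is not taken from the
patches: the helper name limit_unroll_factor, its parameters, the
"0 means no limit" convention, and the assumption that each unrolled copy
adds the same number of read streams are all hypothetical.

```c
#include <assert.h>

/* Hypothetical helper, not from the patches: reduce a power-of-two
   unroll FACTOR until the unrolled body needs at most MAX_STREAMS
   memory read streams.  STREAMS_PER_ITER is the number of read streams
   in one copy of the loop body; the sketch assumes every unrolled copy
   adds that many.  MAX_STREAMS == 0 is assumed to mean "no limit".  */
unsigned
limit_unroll_factor (unsigned factor, unsigned streams_per_iter,
                     unsigned max_streams)
{
  assert (factor >= 1);

  /* No limit requested, or the loop reads no memory: keep FACTOR.  */
  if (max_streams == 0 || streams_per_iter == 0)
    return factor;

  /* Halve the factor until the stream budget is met or we reach 1.  */
  while (factor > 1 && factor * streams_per_iter > max_streams)
    factor /= 2;

  return factor;
}
```

For example, with a requested factor of 8, two read streams per
iteration and a budget of 8 streams, this sketch settles on a factor
of 4. A real implementation would presumably take the budget from the
target hook rather than from a plain argument.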
Patches are organized as:
Patch1: Adds a target hook TARGET_HW_MAX_MEM_READ_STREAMS. If it is
defined, the loop unroller will try to limit the unrolling factor based on
Patch2: Implements tree loop unroller using the infrastructure
provided. The pass itself is very simple.
Patch3: Implements target hook TARGET_HW_MAX_MEM_READ_STREAMS for aarch64.
Patch4: Implements a machine reorg pass for aarch64/Falkor to handle
prefetcher tag collisions. This is not strictly part of the loop
unroller, but for Falkor, unrolling can make the h/w prefetcher perform
badly if there are too many tag collisions based on the discussions in