While loop unrolling helps keep the pipeline busy on modern processors, it can also increase the number of memory streams, causing collisions in the hardware prefetcher that can hurt performance. This patch series tries to detect this situation and limit loop unrolling accordingly.
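
As a rough illustration of the concern (a hand-written sketch, not code from the patches), consider an inner loop with a small constant trip count. The names sum_rolled, sum_unrolled, PLANES and plane_size are hypothetical and exist only for this example. Once the inner loop is completely unrolled, the outer loop body contains four separate strided loads, each advancing with the outer induction variable. If that count exceeds the number of streams the hardware prefetcher can track, unrolling may hurt rather than help.

```c
#include <stddef.h>

/* Hypothetical example: the inner loop has a small constant trip count. */
#define PLANES 4

/* Rolled form: one load instruction in the inner loop. */
long sum_rolled(const long *a, size_t n, size_t plane_size)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < PLANES; j++)
            sum += a[j * plane_size + i];
    return sum;
}

/* Roughly what complete unrolling of the inner loop produces:
   four distinct loads in the outer loop, each its own memory stream. */
long sum_unrolled(const long *a, size_t n, size_t plane_size)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += a[0 * plane_size + i];
        sum += a[1 * plane_size + i];
        sum += a[2 * plane_size + i];
        sum += a[3 * plane_size + i];
    }
    return sum;
}
```

Both functions compute the same result; only the number of distinct load streams visible in the outer loop body differs.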

Patch 1: Add separate params for the rtl unroller.
Patch 2: Add the number of hw prefetchers available to cpu_prefetch_tune so it can be used in loop unrolling decisions.
Patch 3: Prevent the tree unroller from completely unrolling inner loops if that results in excessive strided-loads in the outer loop.
Patch 4: Change iv_analyze_result to take const_rtx. This is just to make the next patch compile. No functional changes.
Patch 5: Add aarch64_loop_unroll_adjust to limit partial unrolling in rtl based on strided-loads in the loop.

Bootstrapped and tested on aarch64-linux-gnu (with -funroll-all-loops). Testing on x86_64-linux-gnu is ongoing.

Thanks,
Kugan