Hi Richard,

On 16 February 2018 at 22:56, Richard Biener <richard.guent...@gmail.com> wrote:
> On Thu, Feb 15, 2018 at 11:30 PM, Kugan Vivekanandarajah
> <kugan.vivekanandara...@linaro.org> wrote:
>> Hi Wilco,
>>
>> Thanks for your comments.
>>
>> On 14 February 2018 at 00:05, Wilco Dijkstra <wilco.dijks...@arm.com> wrote:
>>> Hi Kugan,
>>>
>>>> Based on the previous discussions, I tried to implement a tree loop
>>>> unroller for partial unrolling. I would like to queue these RFC
>>>> patches for next stage1 review.
>>>
>>> This is a great plan - GCC urgently requires a good unroller!
>
> How so?
>
>>>> * The cost model for selecting the loop uses the same params used
>>>> elsewhere in related optimizations. I was told that keeping this the
>>>> same would allow better tuning for all the optimizations.
>>>
>>> I'd advise against using the existing params as is. Unrolling by 8x by
>>> default is way too aggressive and counterproductive. It was perhaps OK
>>> for in-order cores 20 years ago, but not today. The goal of unrolling
>>> is to create more ILP in small loops, not to generate huge blocks of
>>> repeated code which definitely won't fit in micro-op caches and loop
>>> buffers...
>>>
>> OK, I will create separate params. It is possible that I misunderstood
>> it in the first place.
>
> To generate more ILP for modern out-of-order processors you need to be
> able to do followup transforms that remove dependences. So rather than
> inventing magic params we should look at those transforms and key
> unrolling on them. Like we do in predictive commoning or other passes
> that end up performing unrolling as part of their transform.
>
> Our measurements on x86 concluded that unrolling isn't worth it, in
> fact it very often hurts. That was of course with saner params than the
> defaults of the RTL unroller.
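To make sure I understand the dependence-removal point: a hand-written sketch of what I think you mean, for a simple reduction (the function names are mine, purely illustrative; this is not code from the patch):

```c
#include <stddef.h>

/* Naive reduction: every add depends on the previous one, so merely
   "pasting" the body twice (plain 2x unrolling) creates no extra ILP -
   the dependence chain through s is unchanged.  */
double sum_naive(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* 2x unrolling plus the followup transform: splitting the accumulator
   breaks the loop-carried dependence, so the two adds per iteration
   can execute in parallel on an out-of-order core.  */
double sum_unrolled2(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    for (; i < n; i++)      /* epilogue for odd n */
        s0 += a[i];
    return s0 + s1;
}
```

(For FP this reassociation is of course only valid under -ffast-math-style flags, which is exactly why the unroller would need to key on the legality of the followup transform.)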
My preliminary benchmarking on x86 with the default params shows no
overall gain: some gains and some regressions. I didn't play with the
parameters to see if that improves things. But for AArch64 - Falkor
(with follow-up tag collision avoidance for prefetching), we did see
gains (again, we could do better here):

SPECint_base2006         1.37%
SPECint_base2006        -0.73%
SPECspeed2017_int_base  -0.1%
SPECspeed2017_fp_base    0.89%
SPECrate2017_fp_base     1.72%

We also noticed that sometimes the gains from passes like
prefetch-loop-arrays come mainly from the unrolling rather than from
the software prefetches.

>
> Often you even have to fight with followup passes doing stuff that
> ends up increasing register pressure too much so we end up spilling.

If we can have an approximate register-pressure model that can be used
while deciding the unroll factor, it might help to some extent. I saw
Bin posting some patches for register pressure calculation. Do you
think using that here would be helpful?

In general, I agree that the cost model could be more accurate, but
getting the right information at an acceptable computation cost is the
tricky part. Do you have any preference on the cost model if we decide
to have a separate loop unroller pass? That is, what information from
the loop should we use other than the usual parameters we have?

>
>>
>>> Also we need to enable this by default, at least with -O3, maybe
>>> even for small (or rather tiny) loops in -O2 like LLVM does.
>>
>> It is enabled for -O3 and above now.
>
> So _please_ first get testcases we know unrolling will be beneficial
> on and _also_ have a thorough description _why_.

I will try to analyse the benchmarks whose performance improves and
create test cases from them.

>
>>>
>>>> * I have also implemented an option to limit loops based on memory
>>>> streams, i.e., for micro-architectures where limiting the resulting
>>>> memory streams is preferred, this is used to limit the unroll
>>>> factor.
>>>
>>> I'm not convinced this is needed once you tune the parameters for unrolling.
>>> If you have say 4 read streams you must have > 10 instructions
>>> already so you may want to unroll this 2x in -O3, but definitely not
>>> 8x. So I see the streams issue as a problem caused by too aggressive
>>> unroll settings. I think if you address that first, you're unlikely
>>> to have an issue with too many streams.
>>>
>>
>> I will experiment with some microbenchmarks. I still think that it
>> will be useful for some micro-architectures. That's why it is not
>> enabled by default. If a back-end thinks that it is useful, it can
>> enable limiting the unroll factor based on memory streams.
>
> Note that without doing scheduling at the same time (basically
> interleaving iterations rather than pasting them after each other) I
> have a hard time believing that maxing memory streams is any good on
> any microarchitecture.

Sorry, I didn't mean to say that. Rather, keep the memory streams below
what the prefetchers will be happy with.

>
> So transform-wise you'd end up with "vectorizing" without
> "vectorizing" and you can share dependence analysis.
>
>>>> * I expect that some cost-model changes might be needed to handle
>>>> (or provide the ability to handle) various loop preferences of the
>>>> micro-architectures. I am sending this patch for review early to
>>>> get feedback on this.
>>>
>>> Yes it should be feasible to have settings based on backend
>>> preference and optimization level (so O3/Ofast will unroll more than
>>> O2).
>>>
>>>> * The position of the pass in passes.def can also be changed, for
>>>> example, unrolling before SLP.
>>>
>>> As long as it runs before IVOpt so we get base+immediate addressing
>>> modes.
>>
>> That's what I am doing now.
>
> Note I believe that IVOPTs should be moved a bit later than it is
> placed right now.

OK, I will benchmark with this.

Thanks,
Kugan

>
> Richard.
>
>> Thanks,
>> Kugan
>>
>>>
>>> Wilco
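P.S. To check I read the "interleaving rather than pasting" remark correctly, here is a hand-written source-level sketch of the two shapes for a 2x unroll (names are mine, and of course the compiler would do this internally on GIMPLE, not in source):

```c
#include <stddef.h>

/* "Pasting" iterations: the 2x-unrolled body is just two copies of the
   original body, one after the other, so each iteration's chain of
   load -> multiply -> add -> store stays intact.  */
void axpy_pasted(double *c, const double *a, const double *b,
                 double k, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        c[i]     = a[i]     * b[i]     + k;
        c[i + 1] = a[i + 1] * b[i + 1] + k;
    }
    for (; i < n; i++)                      /* epilogue for odd n */
        c[i] = a[i] * b[i] + k;
}

/* "Interleaving" iterations: group the loads, then the multiplies,
   then the stores, the way a vectorizer would lay them out - the
   "vectorizing without vectorizing" shape.  */
void axpy_interleaved(double *c, const double *a, const double *b,
                      double k, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        double x0 = a[i], x1 = a[i + 1];    /* loads together */
        double y0 = b[i], y1 = b[i + 1];
        double m0 = x0 * y0, m1 = x1 * y1;  /* multiplies together */
        c[i]     = m0 + k;                  /* stores together */
        c[i + 1] = m1 + k;
    }
    for (; i < n; i++)
        c[i] = a[i] * b[i] + k;
}
```

Both compute the same values; the second shape is where sharing the vectorizer's dependence analysis would come in.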