On September 17, 2021 6:36:10 PM GMT+02:00, Richard Sandiford <richard.sandif...@arm.com> wrote:
>Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> On Fri, 17 Sep 2021, Richard Biener wrote:
>>
>>> On Fri, 17 Sep 2021, Richard Sandiford wrote:
>>>
>>> > Richard Biener <rguent...@suse.de> writes:
>>> > > This adds the capability to analyze the dependence of mixed
>>> > > pointer/array accesses.  The motivating example is one where
>>> > > using a masked load/store creates a pointer-based access while
>>> > > the otherwise unconditional access is array-based.  Other
>>> > > examples would include accesses to an array mixed with accesses
>>> > > from inlined helpers that work on pointers.
>>> > >
>>> > > The idea is quite simple and old - analyze the data-ref indices
>>> > > as if the reference were pointer-based.  The following change
>>> > > does this by changing dr_analyze_indices to work on the indices
>>> > > sub-structure and storing an alternate indices sub-structure in
>>> > > each data reference.  That alternate set of indices is analyzed
>>> > > lazily by initialize_data_dependence_relation when it fails to
>>> > > match up the main sets of indices of two data references.
>>> > > initialize_data_dependence_relation is refactored into a head
>>> > > and a tail worker and changed to work on one of the indices
>>> > > structures, and thus away from using the DR_* access macros,
>>> > > which continue to reference the main indices sub-structure.
>>> > >
>>> > > Quite a few vectorization and loop distribution opportunities
>>> > > are unleashed in SPEC CPU 2017: notably 520.omnetpp_r,
>>> > > 548.exchange2_r, 510.parest_r, 511.povray_r, 521.wrf_r,
>>> > > 526.blender_r, 527.cam4_r and 544.nab_r see changes in what
>>> > > they report with -fopt-info-loop, while the rest of the
>>> > > specrate set sees no changes there.  Measuring runtime for the
>>> > > set where changes were reported reveals nothing off-noise
>>> > > besides 511.povray_r, which seems to regress slightly for me
>>> > > (on a Zen2 machine with -Ofast -march=native).
>>> > >
>>> > > Changes from the [RFC] version are properly handling bitfields
>>> > > that we cannot take the address of, and optimizing refs that
>>> > > are already MEM_REFs and thus won't see any change.  I've also
>>> > > elided changing the set of vect_masked_stores targets in favor
>>> > > of explicitly listing avx (but I did not verify whether the
>>> > > testcase passes on aarch64-sve or amdgcn).
>>> > >
>>> > > This improves cases like the following from Povray:
>>> > >
>>> > >    for(i = 0; i < Sphere_Sweep->Num_Modeling_Spheres; i++)
>>> > >      {
>>> > >        VScaleEq(Sphere_Sweep->Modeling_Sphere[i].Center, Vector[X]);
>>> > >        Sphere_Sweep->Modeling_Sphere[i].Radius *= Vector[X];
>>> > >      }
>>> > >
>>> > > where a plain array access is mixed with abstraction using T[]
>>> > > or T* arguments.  That should be a not too uncommon situation
>>> > > in the wild.  The loop above is now vectorized, which it was
>>> > > not before the change.
>>> > >
>>> > > Bootstrapped and tested on x86_64-unknown-linux-gnu, and I've
>>> > > built and run SPEC CPU 2017 successfully.
>>> > >
>>> > > OK?
>>> >
>>> > Took a while to page this stuff back in :-/
>>> >
>>> > I guess if we're adding alt_indices to the main data_reference,
>>> > we'll need to free the access_fns in free_data_ref.  It looked
>>> > like the patch as posted might have a memory leak.
>>>
>>> Doh, yes - thanks for noticing.
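
FWIW the fix should be small: free_data_ref just needs to release the
alternate access_fns as well.  An untested sketch, assuming
alt_indices has the layout described above and that its base_object
being set means the lazy analysis ran:

    void
    free_data_ref (data_reference_p dr)
    {
      /* Release the main access functions as before.  */
      DR_ACCESS_FNS (dr).release ();
      /* The lazily filled alternate indices own a second access_fns
         vector; release it too if the analysis was ever done.  */
      if (dr->alt_indices.base_object)
        dr->alt_indices.access_fns.release ();
      free (dr);
    }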
>>>
>>> > Perhaps instead we could use local indices structures for the
>>> > alt_indices and pass them in to initialize_data_dependence_relation?
>>> > Not that that's very elegant…
>>>
>>> Yeah, I had that, but since for N data references we possibly call
>>> initialize_data_dependence_relation N^2 times, we'd do this
>>> alternate analysis N^2 times at worst instead of N, so it looked
>>> worth caching it in the data reference.  Since we have no idea why
>>> the first try fails, we're going to try this fallback in the
>>> majority of cases that we cannot figure out otherwise, so I didn't
>>> manage to hand-wave the quadraticness away ;)  OTOH one might argue
>>> it's just a constant factor on top of the number of
>>> initialize_data_dependence_relation invocations.
>>>
>>> So I can probably be convinced either way - guess I'll gather some
>>> statistics.
>>
>> I built SPEC CPU 2017 rate with -Ofast -march=znver2; overall there
>> are
>>
>>   4433976 calls to the first-stage initialize_data_dependence_relation
>>           (skipping the cases where dr_may_alias returned false)
>>    360248 (8%) ended up recursing with a set of alt_indices
>>     83512 times we computed alt_indices of a DR (that's with the cache)
>>     14905 (0.3%) times the recursive invocation ended with
>>           != chrec_dont_know
>>
>> thus when not doing the caching we'd compute alt_indices about 10
>> times more often.  I didn't collect the number of distinct DRs
>> (that's difficult at this point), but I'd estimate from the above
>> that we have 3 times more "unused" alt_indices than used ones.
>>
>> OK, so that didn't really help me avoid flipping a coin ;)
>
>Sounds like a good justification for keeping the caching to me :-)
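
To make the caching concrete: the fallback boils down to something
like the following untested sketch in the tail worker (the helper and
worker names here are made up; the actual refactoring may differ):

      /* The main indices of A and B failed to match up - lazily
         compute the pointer-based alternate indices, caching them in
         the data references, and retry the analysis with those.  */
      if (!a->alt_indices.base_object)
        dr_analyze_alt_indices (a);   /* wraps dr_analyze_indices  */
      if (!b->alt_indices.base_object)
        dr_analyze_alt_indices (b);
      return initialize_data_dependence_relation_1 (res, a, b,
                                                    &a->alt_indices,
                                                    &b->alt_indices,
                                                    loop_nest);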
It also occurred to me that we currently build a (GC-able) ADDR_EXPR
AND MEM_REF for each computation, which likely more than offsets any
win from not caching the result.

Richard.

>Richard
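
P.S.: For reference, the per-computation wrapper in question is
roughly the following (a sketch only; the exact code in the patch may
differ):

    /* View REF as the pointer-based access *(&REF + 0).  Both the
       ADDR_EXPR and the MEM_REF are freshly GC-allocated tree nodes,
       so without caching we build two trees per re-analysis.  */
    tree ref = DR_REF (dr);
    tree alt_ref = build2 (MEM_REF, TREE_TYPE (ref),
                           build_fold_addr_expr (ref),
                           build_int_cst
                             (reference_alias_ptr_type (ref), 0));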