On September 17, 2021 6:36:10 PM GMT+02:00, Richard Sandiford 
<richard.sandif...@arm.com> wrote:
>Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> On Fri, 17 Sep 2021, Richard Biener wrote:
>>
>>> On Fri, 17 Sep 2021, Richard Sandiford wrote:
>>> 
>>> > Richard Biener <rguent...@suse.de> writes:
>>> > > This adds the capability to analyze the dependence of mixed
>>> > > pointer/array accesses.  The motivating example is one where a
>>> > > masked load/store creates a pointer-based access while the
>>> > > otherwise unconditional access is array-based.  Other examples
>>> > > would include accesses to an array mixed with accesses from
>>> > > inlined helpers that work on pointers.
>>> > >
>>> > > The idea is quite simple and old - analyze the data-ref indices
>>> > > as if the reference was pointer-based.  The following change does
>>> > > this by changing dr_analyze_indices to work on the indices
>>> > > sub-structure and storing an alternate indices substructure in
>>> > > each data reference.  That alternate set of indices is analyzed
>>> > > lazily by initialize_data_dependence_relation when it fails to
>>> > > match up the main sets of indices of two data references.
>>> > > initialize_data_dependence_relation is refactored into a head
>>> > > and a tail worker and changed to work on one of the indices
>>> > > structures and thus away from using DR_* access macros which
>>> > > continue to reference the main indices substructure.
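>>> > >
>>> > > To make that concrete, here is a minimal self-contained sketch of
>>> > > the lazy caching scheme (illustrative C++ only; the real GCC
>>> > > structures, names and analysis differ):
>>> > >
>>> > >   #include <optional>
>>> > >   #include <vector>
>>> > >
>>> > >   /* Stand-in for a chain of access functions (GCC keeps a
>>> > >      vec<tree> inside the indices sub-structure).  */
>>> > >   using access_fns_t = std::vector<int>;
>>> > >
>>> > >   struct data_ref
>>> > >   {
>>> > >     access_fns_t indices;                    /* as-written view  */
>>> > >     std::optional<access_fns_t> alt_indices; /* lazy pointer view  */
>>> > >   };
>>> > >
>>> > >   /* Illustrative only: re-analyze DR as if it were pointer-based.  */
>>> > >   static access_fns_t
>>> > >   analyze_as_pointer (const data_ref &dr)
>>> > >   {
>>> > >     return dr.indices;
>>> > >   }
>>> > >
>>> > >   static bool
>>> > >   indices_match_p (const access_fns_t &a, const access_fns_t &b)
>>> > >   {
>>> > >     return a.size () == b.size ();
>>> > >   }
>>> > >
>>> > >   /* Head/tail split: try the main indices first, then fall back
>>> > >      to the lazily computed and cached alternate indices.  */
>>> > >   static bool
>>> > >   analyze_dependence (data_ref &a, data_ref &b)
>>> > >   {
>>> > >     if (indices_match_p (a.indices, b.indices))
>>> > >       return true;
>>> > >     if (!a.alt_indices)
>>> > >       a.alt_indices = analyze_as_pointer (a);  /* once per DR  */
>>> > >     if (!b.alt_indices)
>>> > >       b.alt_indices = analyze_as_pointer (b);
>>> > >     return indices_match_p (*a.alt_indices, *b.alt_indices);
>>> > >   }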
>>> > >
>>> > > There are quite a few vectorization and loop distribution
>>> > > opportunities unleashed in SPEC CPU 2017; notably 520.omnetpp_r,
>>> > > 548.exchange2_r, 510.parest_r, 511.povray_r, 521.wrf_r,
>>> > > 526.blender_r, 527.cam4_r and 544.nab_r see changes in what they
>>> > > report with -fopt-info-loop, while the rest of the specrate set
>>> > > sees no changes there.  Measuring runtime for the set where
>>> > > changes were reported reveals nothing off-noise besides
>>> > > 511.povray_r, which seems to regress slightly for me (on a Zen2
>>> > > machine with -Ofast -march=native).
>>> > >
>>> > > Changes from the [RFC] version are properly handling bitfields
>>> > > that we cannot take the address of, and an optimization for refs
>>> > > that already are MEM_REFs and thus won't see any change.  I've
>>> > > also elided changing the set of vect_masked_stores targets in
>>> > > favor of explicitly listing avx (but I did not verify whether the
>>> > > testcase passes on aarch64-sve or amdgcn).
>>> > >
>>> > > This improves cases like the following from Povray:
>>> > >
>>> > >    for(i = 0; i < Sphere_Sweep->Num_Modeling_Spheres; i++)
>>> > >      {
>>> > >         VScaleEq(Sphere_Sweep->Modeling_Sphere[i].Center, Vector[X]);
>>> > >         Sphere_Sweep->Modeling_Sphere[i].Radius *= Vector[X];
>>> > >      }
>>> > >
>>> > > where a plain array access is mixed with abstraction
>>> > > using T[] or T* arguments.  That should not be a too uncommon
>>> > > situation in the wild.  The loop above is now vectorized,
>>> > > whereas it was not before the change.
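>>> > >
>>> > > VScaleEq here is roughly of the following shape (illustrative
>>> > > sketch, not the actual Povray source): the helper sees the
>>> > > element through a pointer parameter while the caller indexes
>>> > > the array directly:
>>> > >
>>> > >    static void
>>> > >    VScaleEq (double *v, double k)
>>> > >    {
>>> > >      v[0] *= k;
>>> > >      v[1] *= k;
>>> > >      v[2] *= k;
>>> > >    }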
>>> > >
>>> > > Bootstrapped and tested on x86_64-unknown-linux-gnu and I've
>>> > > built and run SPEC CPU 2017 successfully.
>>> > >
>>> > > OK?
>>> > 
>>> > Took a while to page this stuff back in :-/
>>> > 
>>> > I guess if we're adding alt_indices to the main data_reference,
>>> > we'll need to free the access_fns in free_data_ref.  It looked like
>>> > the patch as posted might have a memory leak.
>>> 
>>> Doh, yes - thanks for noticing.
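>>>
>>> Something along these lines should do, assuming the alternate
>>> indices carry their own access_fns vec (sketch only, field names
>>> assumed):
>>>
>>>   void
>>>   free_data_ref (data_reference_p dr)
>>>   {
>>>     DR_ACCESS_FNS (dr).release ();
>>>     dr->alt_indices.access_fns.release ();  /* the new bit */
>>>     free (dr);
>>>   }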
>>> 
>>> > Perhaps instead we could use local indices structures for the
>>> > alt_indices and pass them in to initialize_data_dependence_relation?
>>> > Not that that's very elegant…
>>> 
>>> Yeah, I had that, but since for N data references we possibly
>>> call initialize_data_dependence_relation N^2 times, we'd do this
>>> alternate analysis N^2 times at worst instead of N, so it looked
>>> worth caching it in the data reference.  Since we have no idea why
>>> the first try fails, we're going to try this fallback in the
>>> majority of cases that we cannot figure out otherwise, so I didn't
>>> manage to hand-wave the quadraticness away ;)  OTOH one might argue
>>> it's just a constant factor on top of the number of
>>> initialize_data_dependence_relation invocations.
>>> 
>>> So I can probably be convinced either way - guess I'll gather
>>> some statistics.
>>
>> I built the SPEC CPU 2017 rate set with -Ofast -march=znver2;
>> overall there are
>>
>>  4433976     calls to the first stage initialize_data_dependence_relation
>>              (skipping the cases dr_may_alias returned false)
>>  360248 (8%) ended up recursing with a set of alt_indices
>>  83512       times we computed alt_indices of a DR (that's with the cache)
>>  14905 (0.3%) times the recursive invocation ended with != chrec_dont_know
>>
>> thus when not doing the caching we'd compute alt_indices about 10
>> times more often (each recursion analyzes both DRs, so without the
>> cache that's up to 2 * 360248 computations versus the 83512 above).
>> I didn't collect the number of distinct DRs (that's difficult at
>> this point), but I'd estimate from the above that we have 3 times
>> more "unused" alt_indices than used.
>>
>> OK, so that didn't really help me avoid flipping a coin ;)
>
>Sounds like a good justification for keeping the caching to me :-)

It also occurred to me that we currently build a (GCable) ADDR_EXPR and
MEM_REF for each computation, which likely more than offsets any win from
not caching the result.
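
Roughly, the uncached path rebuilds something like the following each
time (a sketch, not the exact patch code):

  /* Build a pointer-based view of the reference: take its address
     (a GC-allocated ADDR_EXPR), wrap it in a fresh MEM_REF and
     analyze the indices of that instead.  */
  tree ref = DR_REF (dr);
  tree memref = build2 (MEM_REF, TREE_TYPE (ref),
                        build_fold_addr_expr (ref),
                        build_int_cst (reference_alias_ptr_type (ref), 0));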

Richard. 

>Richard
