On Wed, 18 Oct 2017, Jan Hubicka wrote:

> > > According to Agner's tables, gathers range from 12 ops (vgatherdpd)
> > > to 66 ops (vpgatherdd).  I assume the CPU needs to do the following:
> > > 
> > > 1) transfer the offsets SSE->ALU unit for address generation (3 cycles
> > >    each, 2 ops)
> > > 2) do the address calculation (2 ops, probably 4 ops because it does
> > >    not map naturally to the AGU)
> > > 3) do the load (7 cycles each, 2 ops)
> > > 4) merge results (1 op)
> > > 
> > > so I get 7 ops; not sure what the remaining 5 do.
> > > 
> > > Agner does not give timings, but according to
> > > http://users.atw.hu/instlatx64/AuthenticAMD0800F11_K17_Zen_InstLatX64.txt
> > > the gather time ranges from 14 cycles (vgatherdpd) to 20 cycles.  Here
> > > I guess it is 3+1+7+1=12, so it seems to work out.
> > > 
> > > If you implement gather by hand, you save the SSE->address calculation
> > > path and thus can be faster.
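For concreteness, a rough and untested sketch of the two variants using
AVX2 intrinsics (function names made up for illustration):

  #include <immintrin.h>

  /* Hardware gather: a single vgatherdpd, but internally it still has
     to move the indices over to the address-generation side, form the
     addresses and issue the element loads.  */
  static __m256d
  gather_hw (const double *base, __m128i idx)
  {
    return _mm256_i32gather_pd (base, idx, 8);
  }

  /* Gather "by hand": the indices stay on the scalar side, so the
     SSE->address-generation transfer is avoided.  */
  static __m256d
  gather_sw (const double *base, const int *idx)
  {
    return _mm256_set_pd (base[idx[3]], base[idx[2]],
                          base[idx[1]], base[idx[0]]);
  }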
> > 
> > I see.  It looks to me like Zen should then disable gather/scatter
> > completely and we should implement manual gather/scatter code
> > generation in the vectorizer (or lower it in vector lowering).  It
> > sounds like they only implemented it to have "complete" AVX2 support
> > (ISTR scatter is only in AVX512f).
> 
> Those instructions seem similarly expensive in the Intel implementation.
> http://users.atw.hu/instlatx64/GenuineIntel0050654_SkylakeXeon9_InstLatX64.txt
> lists latencies ranging from 18 to 32 cycles.
> 
> Of course it may also be the case that the utility is measuring gathers
> incorrectly.  According to Agner's tables Skylake has optimized gathers:
> they used to be 12 to 34 uops on Haswell and are now 4 to 5.
> > 
> > > > Note the biggest source of imprecision in the cost model is
> > > > vec_perm, because we lack information about the permutation mask,
> > > > which means we can't distinguish between cross-lane and
> > > > intra-lane permutes.
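For reference, a minimal sketch of the two flavours in AVX/AVX2
intrinsics (illustrative only):

  #include <immintrin.h>

  /* In-lane permute (vpermilpd): each 128-bit lane is shuffled
     independently; cheap on essentially all implementations.  */
  static __m256d
  perm_in_lane (__m256d x)
  {
    return _mm256_permute_pd (x, 0x5);      /* swap within each lane */
  }

  /* Cross-lane permute (vpermpd): data crosses the 128-bit lanes;
     typically more expensive, especially on parts that split 256-bit
     operations into two halves.  */
  static __m256d
  perm_cross_lane (__m256d x)
  {
    return _mm256_permute4x64_pd (x, 0x4e); /* swap the two lanes */
  }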
> > > 
> > > Besides that, we lack information about which operation we are doing
> > > (addition or division?), which may be useful to pass down, especially
> > > because we have the relevant information handy in the x86_cost tables.
> > > So I am thinking of adding an extra parameter to the hook telling it
> > > the operation.
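Something along these lines, purely illustrative (not an actual patch);
the existing hook is TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST:

  /* Current signature (gcc/target.def):
       int builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
                                       tree vectype, int misalign);
     A possible extension passes the operation down so the backend can
     consult its per-operation cost tables:  */
  int
  ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
                                   tree vectype, int misalign,
                                   enum tree_code op /* hypothetical */);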
> > 
> > Not sure.  The costs are all supposed to be relative to scalar cost
> > and I fear we get nearer to a GIGO syndrome when adding more information
> > here ;)
> 
> Yep, however there is a setup cost (like loads/stores) which comes into
> play as well.  I will see how far I can get by making the x86 costs more
> "realistic".

I think it should always count the cost of n scalar loads plus an
overhead depending on the microarchitecture.  As you say, we're not
getting rid of any memory latencies (in the worst case).  From Agner
I read that Skylake optimized gathers down to the actual memory
access cost; the overhead is basically well hidden.
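Roughly like this (a sketch only; the parameter names are made up, not
actual cost-table fields):

  /* Cost a vector gather of NELTS elements as NELTS scalar loads plus
     a fixed, microarchitecture-specific overhead -- which, per Agner,
     would be close to zero on Skylake.  */
  static int
  gather_cost (int nelts, int scalar_load_cost, int uarch_gather_overhead)
  {
    return nelts * scalar_load_cost + uarch_gather_overhead;
  }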

Richard.

-- 
Richard Biener <rguent...@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 
21284 (AG Nuernberg)
