On Wed, 18 Oct 2017, Jan Hubicka wrote:

> > > According to Agner's tables, gathers range from 12 ops (vgatherdpd)
> > > to 66 ops (vpgatherdd). I assume the CPU needs to do the following:
> > >
> > > 1) transfer the offsets SSE->ALU unit for address generation (3 cycles
> > >    each, 2 ops)
> > > 2) do the address calculation (2 ops, probably 4 ops because it does
> > >    not map naturally to the AGU)
> > > 3) do the load (7 cycles each, 2 ops)
> > > 4) merge results (1 op)
> > >
> > > so I get 7 ops, not sure what the remaining 5 do.
> > >
> > > Agner does not account for time, but according to
> > > http://users.atw.hu/instlatx64/AuthenticAMD0800F11_K17_Zen_InstLatX64.txt
> > > the gather time ranges from 14 cycles (vgatherpd) to 20 cycles. Here I
> > > guess it is 3+1+7+1=12, so it seems to work.
> > >
> > > If you implement gather by hand, you save the SSE->address calculation
> > > path and thus you can get faster.
> >
> > I see. It looks to me Zen should disable gather/scatter then completely
> > and we should implement manual gather/scatter code-generation in the
> > vectorizer (or lower it in vector lowering). It sounds like they
> > only implemented it to have "complete" AVX2 support (ISTR scatter
> > is only in AVX512f).
>
> Those instructions seem similarly expensive in the Intel implementation.
> http://users.atw.hu/instlatx64/GenuineIntel0050654_SkylakeXeon9_InstLatX64.txt
> lists latencies ranging from 18 to 32 cycles.
>
> Of course it may also be the case that the utility is measuring gathers
> incorrectly. According to Agner's tables, Skylake has optimized gathers;
> they used to be 12 to 34 uops on Haswell and are now 4 to 5.
>
> > > > Note the most major source of imprecision in the cost model
> > > > is from vec_perm because we lack the information of the
> > > > permutation mask which means we can't distinguish between
> > > > cross-lane and intra-lane permutes.
> > >
> > > Besides that we lack information about what operation we do (addition
> > > or division?) which may be useful to pass down, especially because we
> > > do have relevant information handy in the x86_cost tables. So I am
> > > thinking of adding an extra parameter to the hook telling the operation.
> >
> > Not sure. The costs are all supposed to be relative to scalar cost
> > and I fear we get nearer to a GIGO syndrome when adding more information
> > here ;)
>
> Yep, however there is setup cost (like loads/stores) which comes into play
> as well. I will see how far I can get by making x86 costs more "realistic".
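As an illustration of the "by hand" variant discussed above, here is a
minimal sketch of open-coding a 4-element double gather with AVX
intrinsics, assuming the offsets are already available as scalar
integers (gather4_pd is a made-up name, not what the vectorizer emits
today).  The element accesses become ordinary scalar loads, so the
SSE->integer transfer for address generation disappears:

  #include <immintrin.h>

  /* Emulate a 4-element vgatherdpd: load each element with a scalar
     load and assemble the 256-bit result with inserts.  */
  static __m256d
  gather4_pd (const double *base, const int *idx)
  {
    __m128d lo = _mm_loadh_pd (_mm_load_sd (&base[idx[0]]), &base[idx[1]]);
    __m128d hi = _mm_loadh_pd (_mm_load_sd (&base[idx[2]]), &base[idx[3]]);
    return _mm256_insertf128_pd (_mm256_castpd128_pd256 (lo), hi, 1);
  }

On Zen that avoids the ~3-cycle offset transfer the uop breakdown above
attributes to address generation.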
I think it should always be counting the cost of n scalar loads plus an
overhead that depends on the microarchitecture.  As you say, we're not
getting rid of any memory latencies (in the worst case).  From Agner I
read that Skylake optimized gathers down to the actual memory access
cost; the overhead is basically well hidden.

Richard.

--
Richard Biener <rguent...@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nuernberg)
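[For concreteness, a rough sketch of the costing suggested above; the
helper name and parameters are illustrative, not an existing target
hook:

  /* Cost a gather of NELTS elements as NELTS scalar loads plus a fixed
     per-microarchitecture overhead.  */
  static int
  gather_cost (int nelts, int scalar_load_cost, int gather_overhead)
  {
    return nelts * scalar_load_cost + gather_overhead;
  }

With gather_overhead near zero this matches the Skylake behaviour
described above, while a large value models the extra transfer and
merge uops on Zen.]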