> > Those instructions seem similarly expensive in the Intel implementation.
> > http://users.atw.hu/instlatx64/GenuineIntel0050654_SkylakeXeon9_InstLatX64.txt
> > lists latencies ranging from 18 to 32 cycles.
> >
> > Of course it may also be the case that the utility is measuring gathers
> > incorrectly. According to Agner's table Skylake has optimized gathers;
> > they used to be 12 to 34 uops on Haswell and are now 4 to 5.
> >
> > > > > Note the major source of imprecision in the cost model
> > > > > is from vec_perm, because we lack the information of the
> > > > > permutation mask, which means we can't distinguish between
> > > > > cross-lane and intra-lane permutes.
> > > >
> > > > Besides that we lack information about what operation we do (addition
> > > > or division?) which may be useful to pass down, especially because we
> > > > do have relevant information handy in the x86_cost tables. So I am
> > > > thinking of adding an extra parameter to the hook telling it the
> > > > operation.
> > >
> > > Not sure. The costs are all supposed to be relative to scalar cost
> > > and I fear we get nearer to a GIGO syndrome when adding more information
> > > here ;)
> >
> > Yep, however there is setup cost (like loads/stores) which comes into play
> > as well. I will see how far I can get by making the x86 costs more
> > "realistic".
>
> I think it should always be counting the cost of n scalar loads plus
> an overhead depending on the microarchitecture. As you say, we're
> not getting rid of any memory latencies (in the worst case). From
> Agner I read that Skylake optimized gathers down to the actual memory
> access cost; the overhead is basically well hidden.
Where did you find it? It does not seem to quite match the instruction
latency table above.

Honza

> Richard.
>
> --
> Richard Biener <rguent...@suse.de>
> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB
> 21284 (AG Nuernberg)