> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Friday, 29 July 2022 18.06
> 
> On Fri, 29 Jul 2022 12:13:52 +0000
> Konstantin Ananyev <konstantin.anan...@huawei.com> wrote:
> 
> > Sorry, missed that part.
> >
> > >
> > > > Another question - who will do 'sfence' after the copying?
> > > > Would it be inside memcpy_nt (seems quite costly), or would
> > > > it be another API function for that: memcpy_nt_flush() or so?
> > >
> > > Outside. Only the developer knows when it is required, so it
> > > wouldn't make any sense to add the cost inside memcpy_nt().
> > >
> > > I don't think we should add a flush function; it would just be
> > > another name for an already existing function. Referring to the
> > > required operation in the memcpy_nt() function documentation
> > > should suffice.
> > >
> >
> > Ok, but again wouldn't it be arch specific?
> > AFAIK for x86 it needs to boil down to sfence, for other
> > architectures - I don't know.
> > If you think there already is some generic one (rte_wmb?) that would
> > always produce correct instructions - sure let's use it.
> >
> >
> 
> It makes sense in a few select places to use non-temporal copy.
> But it would add unnecessary complexity to DPDK if every function in
> DPDK that could cause a copy had a non-temporal variant.

Agree.

Packet capturing is one of those few places where it makes sense - the
improvement scales with the number of packets, not just with the number of
packet bursts.

> 
> Maybe just having rte_memcpy have a threshold (config value?) that if
> copy is larger than a certain size, then it would automatically be
> non-temporal.  Small copies wouldn't matter, the optimization is more
> about not stopping cache size issues with large streams of data.

Small copies matter too, if there are many of them. As shown in my previous
response, non-temporal copying saves 6.25 % of a 64 KB L1 data cache per
burst of 32 packets when copying 64 bytes or less from each packet. The
saving is per packet, so it quickly adds up.

Copying a burst of 32 packets of 1518 bytes each trashes
2 * 32 * 1536 = 98304 bytes (96 KiB) of data cache, i.e. more than the
entire L1 cache.
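
A rough sketch of the arithmetic behind these two figures, for anyone who wants to plug in other sizes. It assumes 64 byte cache lines, cache-line aligned buffers, and that an ordinary (temporal) copy pulls both the source and the destination into the data cache. The names CACHE_LINE, lines_bytes() and burst_footprint() are made up for this illustration and are not DPDK APIs.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64u /* assumed cache line size in bytes */

/* Round a copy length up to a whole number of cache lines. */
static size_t lines_bytes(size_t len)
{
    return (len + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
}

/*
 * Bytes of data cache touched when copying 'len' bytes from each of
 * 'pkts' packets with ordinary loads and stores: every line is counted
 * twice, once for the source and once for the destination.
 */
static size_t burst_footprint(size_t pkts, size_t len)
{
    assert(pkts > 0 && len > 0);
    return 2 * pkts * lines_bytes(len);
}
```

With these assumptions, burst_footprint(32, 64) is 4096 bytes, which is 6.25 % of 65536, and burst_footprint(32, 1518) is 98304 bytes.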

The threshold in glibc's memcpy() is much higher than 1536 bytes. I don't think
it will be possible to find a good threshold that works 99 % of the time. So we
have to let the application developer make the choice.
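
To illustrate what leaving the choice (and the fence) to the application could look like, here is a minimal x86-only sketch using SSE2 intrinsics. It is not the proposed memcpy_nt() implementation; copy_nt_sketch() is a hypothetical name, only the stores are non-temporal (the loads are ordinary), and it simply falls back to memcpy() when the destination is not 16 byte aligned.

```c
#include <immintrin.h> /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: copy 'len' bytes using non-temporal stores.
 * Deliberately does NOT issue sfence; the caller decides when the
 * data must become visible to other cores or devices.
 */
static void copy_nt_sketch(void *dst, const void *src, size_t len)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    /* _mm_stream_si128 requires a 16 byte aligned destination. */
    if (((uintptr_t)d & 15) != 0) {
        memcpy(d, s, len);
        return;
    }

    while (len >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v); /* store bypassing the cache */
        d += 16;
        s += 16;
        len -= 16;
    }

    /* Remaining tail of fewer than 16 bytes: ordinary copy. */
    if (len > 0)
        memcpy(d, s, len);
}
```

The application would call this where it knows the copied data will not be read again soon, e.g. once per captured packet, and issue a single _mm_sfence() after the whole burst, before handing the buffers to a consumer.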
