From: Linus Torvalds
> Sent: 10 July 2020 23:37
> On Tue, Jul 7, 2020 at 5:35 AM David Laight <[email protected]> wrote:
> >
> >
> > So separate copy and checksum passes should easily exceed 4 bytes/clock,
> > but I suspect that doing them together never does.
> > (Unless the buffer is too big for the L1 cache.)
> 
> It's the "touch the caches twice" that is the problem.
> 
> And it's not the "buffer is too big for L1", it's "the source, the
> destination and any incidentals are too big for L1" with the
> additional noise from replacement policies etc.

That's really what I meant.
L1D is actually (probably) only 32kB.
I guess that gives you 8k for the buffer.

It is a shame you can't use the AVX instructions in the kernel.
(Although saving them probably costs more than the gain.)
Then you could use something based on:
10:     load ymm,src+idx   // 32 bytes
        store ymm,tgt+idx
        addq sum0,ymm   // eight 32bit adds
        rotate ymm,16   // Pretty sure there is an instruction for this!
        addq sum1,ymm
        add idx,32
        jnz 10b
It is then possible to determine the correct result from sum0/sum1.
On very recent Intel CPUs that might even run at 1 iteration/clock!
(Probably needs an unroll and explicit interleave.)
At one iteration every 2 clocks it matches the ADDX[OC] loop
but includes the write.
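
One way the final checksum might be recovered from sum0/sum1 is sketched
below in C, using a single 32-bit lane as a scalar stand-in for the eight
lanes of a ymm register. The names fold16, csum_ref, copy_csum and
copy_csum_selfcheck are hypothetical, and the reconstruction step is an
assumption about what the loop above intends, not something stated in the
thread. It assumes at most 65536 words per lane, so that neither 16-bit
column sum overflows 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold a wide sum down to a 16-bit ones' complement sum. */
static uint16_t fold16(uint64_t s)
{
    while (s >> 16)
        s = (s & 0xffff) + (s >> 16);
    return (uint16_t)s;
}

/* Reference: add the buffer as native-endian 16-bit words. */
static uint16_t csum_ref(const uint8_t *p, size_t len)
{
    uint64_t s = 0;
    for (size_t i = 0; i < len; i += 2) {
        uint16_t w;
        memcpy(&w, p + i, 2);
        s += w;
    }
    return fold16(s);
}

/* Copy and checksum in one pass with wrapping 32-bit adds. */
static uint16_t copy_csum(uint8_t *dst, const uint8_t *src, size_t len)
{
    uint32_t sum0 = 0, sum1 = 0;

    /* Assumption: whole 32-bit words, few enough that H and L fit in 32 bits. */
    assert(len % 4 == 0 && len / 4 <= 65536);

    for (size_t i = 0; i < len; i += 4) {
        uint32_t w;
        memcpy(&w, src + i, 4);
        memcpy(dst + i, &w, 4);           /* the "store" half of the loop */
        sum0 += w;                        /* carries out of bit 31 are lost */
        sum1 += (w << 16) | (w >> 16);    /* same word rotated by 16 bits */
    }

    /*
     * With L = sum of low halves and H = sum of high halves:
     *   sum0 = (H << 16) + L  (mod 2^32)
     *   sum1 = (L << 16) + H  (mod 2^32)
     * The low 16 bits of each sum are exact, and subtracting them from
     * the other sum's top half leaves the carries that were mixed in.
     */
    uint32_t l_lo = sum0 & 0xffff;
    uint32_t h_lo = sum1 & 0xffff;
    uint32_t l_hi = ((sum0 >> 16) - h_lo) & 0xffff;
    uint32_t h_hi = ((sum1 >> 16) - l_lo) & 0xffff;

    uint64_t L = ((uint64_t)l_hi << 16) | l_lo;
    uint64_t H = ((uint64_t)h_hi << 16) | h_lo;
    return fold16(L + H);
}

/* Returns 1 if the one-pass result matches the reference on two buffers. */
int copy_csum_selfcheck(void)
{
    static uint8_t src[4096], dst[4096];
    int ok = 1;

    for (int pass = 0; pass < 2; pass++) {
        /* pass 0: mixed pattern; pass 1: all 0xff, which forces sum0/sum1 to wrap */
        for (size_t i = 0; i < sizeof(src); i++)
            src[i] = pass ? 0xff : (uint8_t)(i * 31 + 7);
        memset(dst, 0, sizeof(dst));
        uint16_t got = copy_csum(dst, src, sizeof(src));
        ok &= (got == csum_ref(src, sizeof(src)));
        ok &= (memcmp(dst, src, sizeof(src)) == 0);
    }
    return ok;
}
```

A vector version would keep eight such lane pairs and apply the same
reconstruction per lane before adding the results together.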

        David
