From: Torbjorn Granlund <t...@gmplib.org> Date: Tue, 26 Mar 2013 21:29:38 +0100
> David Miller <da...@davemloft.net> writes:
>
> The ASI is defined on 64-byte cache lines.
>
> Overlapping is not an issue, since my 64-bytes-at-a-time loop loads
> everything first before doing any stores. It behaves no differently
> from the generic 64-bit copyi/copyd we have on sparc64. Each
> loop iteration only writes to a full, aligned, 64-byte block of
> memory.
>
> I see. A bulky structure, but that's necessary in this case. Avoiding
> a load cache miss for write operations will save a lot of memory cycles.
> This helps both with latency and actual memory bandwidth.
>
> If one is really crazy, it might actually help to perform a dummy read
> operation of, say, 4 KiB of the source operand. This will keep DRAM
> busy, putting data in L1 cache. Then we do the copy, using the ASI you
> mentioned for writing; the loads will hit cache since we put them there.

We have a prefetch instruction on sparc which we could use for this.
In fact, that's what I do for memcpy on the various Niagara chips:
prefetch about 256 bytes ahead and use cache-line-initializing stores.

> The T3 popcount and hamdist timing numbers are awful.
>
> Is the C code perhaps faster?
>
> The C code won't be faster. It's slow on T3 because popc, like
> multiplies, simply isn't pipelined at all.
>
> The C code stays away from popc, and instead uses bit tricks.

I know, this was implied in this conversation. The bit tricks are
still more expensive than popc, at least on T3.

_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel
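For readers unfamiliar with the "bit tricks" mentioned in the popcount discussion above, here is a minimal sketch of one common branch-free approach in C. It is not taken from GMP's sources; the function names popcount64_swar and hamdist64_swar are hypothetical, chosen only for illustration. The final summation uses shifts and adds rather than a multiply, on the assumption (based on the remark above that multiplies are not pipelined on T3) that a multiply would be undesirable there.

```c
#include <stdint.h>

/* Illustrative sketch only; names are hypothetical, not GMP's. */

/* Count set bits in a 64-bit word without using a popc instruction. */
static unsigned popcount64_swar(uint64_t x)
{
    /* Each 2-bit field now holds the count of its two bits (0..2). */
    x -= (x >> 1) & 0x5555555555555555ULL;
    /* Each 4-bit field now holds the count of its four bits (0..4). */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    /* Each byte now holds the count of its eight bits (0..8). */
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    /* Fold the eight byte counts into the low byte with shifts and adds. */
    x += x >> 8;
    x += x >> 16;
    x += x >> 32;
    /* The total is at most 64, so seven bits are enough. */
    return (unsigned)(x & 0x7f);
}

/* Hamming distance of two words: popcount of their XOR. */
static unsigned hamdist64_swar(uint64_t a, uint64_t b)
{
    return popcount64_swar(a ^ b);
}
```

Even in this form each word costs roughly a dozen dependent ALU operations, which is consistent with the point above that the bit tricks can still lose to a single popc per word.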