On Tue, May 1, 2012 at 7:22 AM, Glenn Fowler <[email protected]> wrote:
> On Tue, 1 May 2012 02:28:52 +0200, Olga Kryzhanovskaya wrote:
>> Glenn, I attached an old patch for libsum to boost performance for 2
>> common hash algorithms (System5 sum and POSIX cksum). The patch was
>> originally developed by Roland to deal with a small performance loss
>> (<4%) when switching Solaris 11's cksum and sum utilities from their
>> old utility implementations over to the libcmd sum utility
>> implementation.
>
>> Performance is greatly improved by using memory prefetch instructions.
>> These instructions are used to prefetch the next block of memory while
>> we are hashing the current block, more or less hiding the memory
>> latency.
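(For the archive: the idea boils down to a loop like the sketch below.
It assumes GCC's/Clang's __builtin_prefetch() builtin, and hash_block()
is just a hypothetical stand-in for the real sys5/cksum inner loop, not
the code from the patch itself.)
-- snip --
#include <stddef.h>

/* hypothetical stand-in for the per-block hashing code */
extern void hash_block(const unsigned char *b, size_t n);

#define CBLOCK_SIZE 64	/* one cache line on most current CPUs */

static void hash_with_prefetch(const unsigned char *b, size_t n)
{
	while (n >= 2*CBLOCK_SIZE)
	{
		/* start loading the NEXT block into the cache... */
		__builtin_prefetch(b + CBLOCK_SIZE);
		/* ...while the CPU is still busy hashing the CURRENT one */
		hash_block(b, CBLOCK_SIZE);
		b += CBLOCK_SIZE;
		n -= CBLOCK_SIZE;
	}
	if (n > 0)
		hash_block(b, n);	/* tail, nothing left to prefetch */
}
-- snip --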
>
>> For a very fast AMD64 machine we use for benchmarking I got these results:
>> Without patch:
>> time ../snapshot20120430_plain/arch/linux.i386/bin/ksh -c 'builtin sum
>> ; for ((i=0 ; i < 10 ; i++ )) ; do sum -x sys5 /tmp/lf.tmp ; done
>> >/dev/null'
>> real 0m19.647s
>> user 0m15.870s
>> sys 0m3.722s
>
>> With patch applied:
>> time ../snapshot20120430_sumprefetch/arch/linux.i386/bin/ksh -c
>> 'builtin sum ; for ((i=0 ; i < 10 ; i++ )) ; do sum -x sys5
>> /tmp/lf.tmp ; done >/dev/null'
>> real 0m12.011s
>> user 0m8.447s
>> sys 0m3.527s
>
>> If the general idea is OK I'd like to extend it to other areas which
>> do time-consuming list or tree walks, like the vmalloc allocator in
>> libast or the array operations in libshell.
>
> I appreciate the effort you put into coding the prefetch example
> but there's no way we would sprinkle #ifdef'd code like that across ast
> it would be a maintenance headache
> and what will the next speedup du jour do to the code
> this feels really close to coding in asm
> (although we have done manual unrolling and asm in a few spots)
>
> we'd have to step back and look for patterns in the algorithms used across ast
> and design an api that can limit the tweaking and ifdefs to the private side
> and keep the usage side fairly clean
OK... what about a smaller patch which just boosts the performance of
the SystemV/AT&T sum algorithm without the prefetch && #ifdef stunts?
The cksum was nearly as good as Sun's original except for very small
files, so that one doesn't matter that much.
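For reference, sys5 sum itself is just one byte-summing loop plus a
16-bit fold, which is why that loop is the only hot spot worth
touching. A minimal sketch (my reading of the historic SysV
definition, not the libcmd code):
-- snip --
#include <stddef.h>
#include <stdint.h>

static unsigned int sysv_sum(const unsigned char *b, size_t n)
{
	uint32_t s = 0;
	unsigned int r;
	size_t i;

	/* the hot loop -- this is what the 8-accumulator patch speeds up */
	for (i = 0; i < n; i++)
		s += b[i];

	/* fold the 32-bit total back into 16 bits, carries included */
	r = (s & 0xffff) + (s >> 16);
	return (r & 0xffff) + (r >> 16);
}
-- snip --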
> also
>
> how did you know to do
> s0=s1=s2=s3=s4=s5=s6=s7=0U;
> why not s0..s3 or s0..s15?
The idea was to cover as many _modern_ CPU architectures as possible.
Most of them have at least 8 general-purpose registers, with the caveat
that libshell is a shared library, which means that, depending on
architecture and ABI, 1-3 registers need to be reserved to carry
pointers to the stack, global variables and other global resources
around (but almost all CPUs with <=15 general-purpose registers have
extra registers for stack && context stuff).
Another issue learned from experience is that compilers can do
incredibly dumb things if the number of active variables in an
algorithm exceeds the number of available registers.
In the case of the code below the main idea was to make sure the CPU
pipeline can be filled with independent instructions (i.e. no
interlocks are triggered) and process them in parallel. The
statically-sized for() loop allows the compiler to unroll it as it
sees fit... and if the CPU has more than 8 registers the compiler will
automatically allocate more of them, because as part of the unrolling
it is (hopefully) smart enough to check whether it can use
"interleaving", i.e. merge two consecutive loop iterations into one
sequence where the code is mixed in a 1-2-1-2-1-2-... manner, and do
even more work in parallel:
-- snip --
+	/* the compiler will unroll this for() loop per #pragma unroll */
+	for (i=0 ; i < (CBLOCK_SIZE/8) ; i++)
+	{
+		/*
+		 * use s0-s7 to decouple the calculations (this improves
+		 * pipelining) because each operation is completely
+		 * independent from its siblings
+		 */
+		s0+=b[0];
+		s1+=b[1];
+		s2+=b[2];
+		s3+=b[3];
+		s4+=b[4];
+		s5+=b[5];
+		s6+=b[6];
+		s7+=b[7];
+
+		b+=8;
+		n-=8;
+	}
-- snip --
The code could use only s0...s3 with more or less the same effect,
except that older or less advanced optimisers (which don't do
unrolling etc.) would then not see any improvement at all.
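To make the "interleaving" point concrete: after a 2x unroll the
optimiser is expected to generate code roughly like the following
(written out by hand here purely for illustration; the second
accumulator set t0-t7 is where the extra registers go):
-- snip --
#include <stdint.h>

#define CBLOCK_SIZE 64

static uint32_t sum_block_interleaved(const unsigned char *b)
{
	uint32_t s0=0, s1=0, s2=0, s3=0, s4=0, s5=0, s6=0, s7=0;
	uint32_t t0=0, t1=0, t2=0, t3=0, t4=0, t5=0, t6=0, t7=0;
	int i;

	for (i = 0; i < (CBLOCK_SIZE/8); i += 2)
	{
		/* the adds of the two merged iterations are mixed
		 * 1-2-1-2-... and stay completely independent */
		s0+=b[0]; t0+=b[8];
		s1+=b[1]; t1+=b[9];
		s2+=b[2]; t2+=b[10];
		s3+=b[3]; t3+=b[11];
		s4+=b[4]; t4+=b[12];
		s5+=b[5]; t5+=b[13];
		s6+=b[6]; t6+=b[14];
		s7+=b[7]; t7+=b[15];
		b+=16;
	}

	/* the second set of partial sums is folded in once, after the loop */
	return (s0+t0)+(s1+t1)+(s2+t2)+(s3+t3)
	     + (s4+t4)+(s5+t5)+(s6+t6)+(s7+t7);
}
-- snip --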
> does it only work for += or -=, or will *= /= %= show improvement too?
Erm... that does not matter. The point is that each line can do its
work independently and doesn't trigger interlocks which stall the
pipeline. In short: parallel processing at work... :-)
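A schematic two-liner to show the difference:
-- snip --
/* serial: every add must wait for the one before it
   (a single loop-carried dependency chain) */
s += b[0]; s += b[1]; s += b[2]; s += b[3];

/* parallel: four independent chains -- a superscalar CPU can
   issue all four adds in the same cycle */
s0 += b[0]; s1 += b[1]; s2 += b[2]; s3 += b[3];
-- snip --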
> are there alignment issues on the prefetch blocks?
Erm... yes and no. A prefetch always fills whole cache lines: even if
the prefetch hits the middle or the end of the next cache line, the
whole line will be fetched.
That's why I set CBLOCK_SIZE to 64... 64 bytes is (for modern CPUs)
the minimum cache line size (see below).
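(Schematically, the hardware always works on the line base address, so
a prefetch anywhere inside a line fetches exactly that line -- this
needs <stdint.h> for uintptr_t:
-- snip --
/* the 64-byte line actually fetched for an arbitrary address p */
uintptr_t line_base = (uintptr_t)p & ~(uintptr_t)(64 - 1);
-- snip --
which means unaligned buffers cost at most one extra line fetch at
each end.)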
> will the next iteration compilers and/or hardware cut into the gains
> of the specialized code?
Erm... this algorithm is more or less immune against such "gain cuts".
If the cache line size is 128 bytes every 2nd prefetch will be a
no-op... but that will definitely not hurt on a pipelined CPU because
the prefetch works in parallel with the other operations and doesn't
interfere with calculations or even load/store operations. The worst
case is that a single instruction slot is wasted (and AFAIK all modern
mainstream CPUs have either 64 or 128 byte cache lines[1]).
[1]=Yes, yes, I know... some machines have 32-byte cache lines for the
L1 cache, but AFAIK all of them also feature an L2 with 64-byte cache
lines. The prefetch will then load a full 64-byte L2 cache line,
followed by loading the first half of it into a 32-byte L1 cache line.
When we step past that L1 line into the next one we get a single L1
miss which is satisfied from the L2 (<sarcasm>What a horrible
penalty!</sarcasm>).
> some of these comments are tainted by a few months' bout with some really
> recalcitrant bugs
Grumpf... I can understand you. I have spent weeks of my life hunting
such issues, too... ;-(
> most of them came down to just a few lines of code
> some of the hardest ones butted heads with compiler/optimizer implementations
> many of them were in seemingly simple passages
> more twisted code will affect the debugging and testing process
Well... OK. But see my proposal for the AT&T algorithm above... and
maybe in the long term a separate AST header with prefetch macros
would be nice (for vmalloc etc.)... it might help a lot in some cases
(for example, one thing I'd like to investigate is whether the
prefetch in a prefetchX-CAS-loadX sequence has any benefits (atomic
CAS operations are very heavyweight... the question is whether CPUs
allow a prefetch during a CAS and how long the CAS really takes)).
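Something along these lines (names made up by me, no such AST header
exists yet) would keep the #ifdef mess on the private side, as
requested above:
-- snip --
/* ast_prefetch.h -- hypothetical sketch of a central prefetch header */
#ifndef _AST_PREFETCH_H
#define _AST_PREFETCH_H 1

#if defined(__GNUC__) || defined(__clang__)
#define AST_PREFETCH_READ(p)	__builtin_prefetch((p), 0)
#define AST_PREFETCH_WRITE(p)	__builtin_prefetch((p), 1)
#elif defined(__SUNPRO_C) && defined(__sparc)
#include <sun_prefetch.h>	/* Sun Studio's SPARC prefetch intrinsics */
#define AST_PREFETCH_READ(p)	sparc_prefetch_read_many((void *)(p))
#define AST_PREFETCH_WRITE(p)	sparc_prefetch_write_many((void *)(p))
#else
/* unknown compiler: prefetch degrades to a no-op, code stays correct */
#define AST_PREFETCH_READ(p)	((void)0)
#define AST_PREFETCH_WRITE(p)	((void)0)
#endif

#endif /* _AST_PREFETCH_H */
-- snip --
Callers would then just write AST_PREFETCH_READ(b + CBLOCK_SIZE) and
never see a single #ifdef.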
----
Bye,
Roland
--
__ . . __
(o.\ \/ /.o) [email protected]
\__\/\/__/ MPEG specialist, C&&JAVA&&Sun&&Unix programmer
/O /==\ O\ TEL +49 641 3992797
(;O/ \/ \O;)
_______________________________________________
ast-developers mailing list
[email protected]
https://mailman.research.att.com/mailman/listinfo/ast-developers