I appreciate the effort you put into coding the prefetch example
but there's no way we would sprinkle #ifdef'd code like that across ast
it would be a maintenance headache
and what will the next speedup du jour do to the code
this feels really close to coding in asm
(although we have done manual unrolling and asm in a few spots)
we'd have to step back and look for patterns in the algorithms used across ast
and design an api that can limit the tweaking and ifdefs to the private side
and keep the usage side fairly clean
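a rough sketch of what that split could look like, assuming gcc/clang's __builtin_prefetch
the names SUM_PREFETCH, SUM_PREFETCH_AHEAD and byte_sum are made up for illustration, they are not ast api
the 64-byte stride and 256-byte lookahead are guesses, not measured values

```c
#include <stddef.h>
#include <stdint.h>

/* private side: the only place that knows which compiler is in use */
#if defined(__GNUC__) || defined(__clang__)
#define SUM_PREFETCH(p) __builtin_prefetch((p), 0, 0) /* read access, low temporal locality */
#else
#define SUM_PREFETCH(p) ((void)0)                     /* no builtin: expands to nothing */
#endif

/* assumption: how far ahead to prefetch, in bytes; would need tuning */
#define SUM_PREFETCH_AHEAD 256

/* usage side: a plain loop with no #ifdef in sight (hypothetical function) */
static uint32_t byte_sum(const unsigned char* buf, size_t n)
{
    uint32_t s = 0U;
    size_t i;

    for (i = 0; i < n; i++) {
        /* hint once per 64 bytes, and only for addresses inside the buffer */
        if ((i & 63) == 0 && i + SUM_PREFETCH_AHEAD < n)
            SUM_PREFETCH(buf + i + SUM_PREFETCH_AHEAD);
        s += buf[i];
    }
    return s;
}
```

callers only see byte_sum(buf, n)
on a compiler without the builtin the macro expands to nothing and the loop is otherwise unchanged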
also
how did you know to do
s0=s1=s2=s3=s4=s5=s6=s7=0U;
why not s0..s3 or s0..s15?
does it only work for += or -=, or will *= /= %= show improvement too?
are there alignment issues on the prefetch blocks?
will the next iteration of compilers and/or hardware cut into the gains
of the specialized code?
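for reference, the pattern those questions are about looks roughly like this
it is a sketch with made-up function names (sum_plain, sum_unrolled8), not the code from the patch
splitting into independent accumulators is valid for += because unsigned addition is associative and commutative, so the partial sums can be combined at the end in any order
whether 8 is the right count depends on the machine

```c
#include <stddef.h>
#include <stdint.h>

/* one accumulator: every add depends on the result of the previous one */
static uint32_t sum_plain(const unsigned char* p, size_t n)
{
    uint32_t s = 0U;
    size_t i;

    for (i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* eight accumulators: the adds within one iteration do not depend on each other */
static uint32_t sum_unrolled8(const unsigned char* p, size_t n)
{
    uint32_t s0, s1, s2, s3, s4, s5, s6, s7;
    size_t i = 0;

    s0=s1=s2=s3=s4=s5=s6=s7=0U;
    for (; i + 8 <= n; i += 8) {
        s0 += p[i+0]; s1 += p[i+1]; s2 += p[i+2]; s3 += p[i+3];
        s4 += p[i+4]; s5 += p[i+5]; s6 += p[i+6]; s7 += p[i+7];
    }
    /* leftover bytes when n is not a multiple of 8 */
    for (; i < n; i++)
        s0 += p[i];
    /* combine the partial sums; the order does not matter for unsigned + */
    return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7;
}
```

both functions return the same value for any buffer, only the dependency chain between the adds differs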
some of these comments are tainted by a few months' bout with some really
recalcitrant bugs
most of them came down to just a few lines of code
some of the hardest ones butted heads with compiler/optimizer implementations
many of them were in seemingly simple passages
more twisted code will affect the debugging and testing process
On Tue, 1 May 2012 02:28:52 +0200 ольга крыжановская wrote:
> Glenn, I attached an old patch for libsum to boost performance for 2
> common hash algorithms (System5 sum and POSIX cksum). The patch was
> originally developed by Roland to deal with a small performance loss
> (<4%) when switching Solaris 11's cksum and sum utilities from their
> old utility implementations over to the libcmd sum utility
> implementation.
> Performance is greatly improved by using memory prefetch instructions.
> These instructions are used to fetch the next block of memory while we are
> hashing the current block in parallel, more or less canceling memory
> latency.
> For a very fast AMD64 machine we use for benchmarking I got these results:
> Without patch:
> time ../snapshot20120430_plain/arch/linux.i386/bin/ksh -c 'builtin sum
> ; for ((i=0 ; i < 10 ; i++ )) ; do sum -x sys5 /tmp/lf.tmp ; done
> >/dev/null'
> real 0m19.647s
> user 0m15.870s
> sys 0m3.722s
> With patch applied:
> time ../snapshot20120430_sumprefetch/arch/linux.i386/bin/ksh -c
> 'builtin sum ; for ((i=0 ; i < 10 ; i++ )) ; do sum -x sys5
> /tmp/lf.tmp ; done >/dev/null'
> real 0m12.011s
> user 0m8.447s
> sys 0m3.527s
> If the general idea is OK I would like to extend it to other areas which do
> time consuming list or tree walks, like the vmalloc allocator in
> libast or the array operations in libshell.