On Mon, 28 Sep 2009 16:27:10 +0200 Roland Mainz wrote:
> Glenn Fowler wrote:
> > it would be nice to see the improvements all of these ifdef sun* actually
> > produce
> For a test file with 24884706 bytes in /tmp (=tmpfs/ramdisk):
> - GNU "cksum" currently takes 181 seconds for 1000 iterations
> - AST "cksum" (called as external program) currently takes 244 seconds
>   for 1000 iterations (partially caused by dragging more shared libraries
>   around, startup time issue with libast-based applications and some other
>   things)
> - AST "cksum" called as ksh93 builtin takes 216 seconds for 1000
>   iterations (e.g. ~24 seconds are saved compared to the external
>   application)

thanks for the data

can you provide some base case data for a standalone ast app
that exits after the optget() loop
this will take into account the runtime shared lib loads
plus i18n/l10n initialization
that will provide a lower bound on optimizations for any ast app

> The basic idea of the patch is to prefetch memory block x+1, then cksum
> block x, then prefetch block x+2, then cksum block x+1 etc. This reduces
> the time the code has to spend waiting for data becoming "ready" (e.g.
> loaded in the L1 cache or similar).

I'm not a chip designer but aren't such optimizations supposed to be done
in the hw/fw?
and wouldn't any hard-coded prefetch optimizations be sensitive to the
L1 cache size?
and wouldn't that be sensitive to the data blocking done by the algorithms?
e.g., suppose sum(1) used sizeof(L1) as its blocking size
would that effectively disable the hard-coded L1 prefetch calls?
is there a relationship between sizeof(L1) and the optimal sizes for
{ mmap() read() write() } ?
if there are performance conflicts between these sizes
how do you decide which ones to hard code
among a range of hw configurations?

> BTW: I have a new patch queued for "cksum" which further improves the
> performance (primarily by using a static table for "cksum"'s CRC data
> and other stuff).

those kinds of changes will easily make it into the upstream

thanks