I think the point is that if the common usage is to sum many different files, or one file at a time over long spans of time, then the cost of getting the bytes from the filesystem to user space may outweigh any cache optimization gains.
the ast apps are already at a disadvantage because they pull in extra .so's over the base case(s) they are measured against. what I need is a big-picture analysis of at least a few more variables so that reasonable decisions can be made about ifdef'ing up the code, e.g.:

what is the startup cost of the extra .so's?
what are the effects, if any, of timing the apps repeatedly over the same file vs timing them over enough files to blow the fs cache(s)?
what are the interactions between the io/mmap block sizes and the L? cache block sizes being controlled by the prefetch calls?

my suspicion is that tweaking the user io/mmap block sizes (which can be done in a general way for all apps, possibly with an ifdef in one place) may change the timings and diminish the effects of the explicit prefetch calls -- a rough harness for testing that is sketched below. would it be enough to make them not worth it? I don't know without more data.

also, are there performance results for the unhacked gnu sum vs the hacked gnu sum? and for the hacked gnu sum vs the solaris sum?
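to make the block-size vs prefetch question concrete, something like the following would do: read a file with a caller-chosen buffer size, optionally issuing an advisory hint first, and time the loop. readbench, its arguments, and the posix_fadvise() call are my own placeholders, not anything in the ast or gnu sources; the actual prefetch calls in the hacked sum may be madvise() or something else entirely, and on systems without posix_fadvise() the ifdef just drops the hint.

/*
 * hypothetical harness for the io block-size vs prefetch question:
 * read a file with a given buffer size, optionally hinting sequential
 * access first, and report elapsed wall-clock time.
 *
 *   cc -O2 -o readbench readbench.c
 *   ./readbench file 65536 1    # 64k buffer, hint on
 *   ./readbench file 65536 0    # 64k buffer, hint off
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

int main(int argc, char** argv)
{
    if (argc < 4) {
        fprintf(stderr, "usage: %s file bufsize prefetch(0|1)\n", argv[0]);
        return 2;
    }
    size_t bufsize = (size_t)strtoul(argv[2], 0, 0);
    int prefetch = atoi(argv[3]);
    char* buf = malloc(bufsize);
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0 || !buf) {
        perror(argv[1]);
        return 1;
    }
#ifdef POSIX_FADV_SEQUENTIAL
    /* stand-in for whatever explicit prefetch the hacked sum does */
    if (prefetch)
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
#endif
    struct timeval t0, t1;
    gettimeofday(&t0, 0);
    ssize_t n;
    unsigned long long total = 0;
    while ((n = read(fd, buf, bufsize)) > 0)
        total += n;
    gettimeofday(&t1, 0);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%llu bytes, bufsize %zu, prefetch %d: %.3f sec\n",
        total, bufsize, prefetch, secs);
    close(fd);
    free(buf);
    return 0;
}

running that over a set of files big enough to blow the fs cache, across a few buffer sizes, with and without the hint, would at least show whether the block-size tweak alone gets most of the win before anyone commits to per-app prefetch code.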