This is exactly what Shaun Jackman was talking about earlier. I'm actually surprised this is faster; if I can dig up his e-mail I'll forward this to him, since I remember him mentioning that he'd experimented with exactly this.
Can you profile the difference in the number of I/O system calls? (I've sketched roughly what I mean below the quoted message.)

----- Original Message ----
> From: Pádraig Brady <[email protected]>
> To: Report bugs to <[email protected]>
> Cc: Joey Degges <[email protected]>
> Sent: Tue, March 2, 2010 2:20:34 AM
> Subject: Taking advantage of L1 and L2 cache in sort
>
> Currently when sorting we take advantage of the RAM vs disk
> speed bump by using a large memory buffer dependent on the size of RAM.
> However, we don't take advantage of the cache layer in the
> memory hierarchy, which is of increasing importance in modern
> systems given the disparity between CPU and RAM speed increases.
>
> So let's test with a smaller buffer size to see whether we can
> take advantage of cache locality in the comparisons
> (the system and test setup details are included below).
>
> Using the standard large buffer as a base measurement:
> time sort < file > /dev/null
> 31.6s (26.7 user)
> Reducing this to 10M:
> time sort -S10M < file > /dev/null
> 27.3s (22.1 user)
> Now with a smaller buffer there will be many temp files
> created on disk, so to alleviate this overhead somewhat,
> let's put them on a ramdisk (mount -t tmpfs /ram /ram):
> time TMPDIR=/ram sort -S10M < file > /dev/null
> 25.0s (22.1 user)
> Now reducing the buffer to 1M to better match
> my cache size mentioned below:
> time TMPDIR=/ram sort -S1M < file > /dev/null
> 23.8s (21.1 user)
> Now with a smaller buffer, less data will be read
> up front by `sort`, so increasing the readahead
> done in the background should help (posix_fadvise(...SEQUENTIAL)):
> time TMPDIR=/ram sort -S1M < file > /dev/null
> 22.8s (21.1 user)
> Note posix_fadvise(...WILLNEED) on the whole file runs
> synchronously, on my Linux system at least. We don't want
> that, as we lose some processing time in parallel with
> the kernel readahead, which is confirmed with:
> time TMPDIR=/ram sort -S1M < file > /dev/null
> 25.4s (21.1 user)
>
> So we just knocked 21% off the CPU usage and 28% off the run time. Woot!
> That's with no code changes (well, apart from posix_fadvise(...SEQUENTIAL),
> which is being added anyway as it's seen to speed up reading
> from faster flash devices which benefit from the larger readahead).
> Code changes to manage the buffers internally, rather than
> manually using the ram disk, should increase the gains further.
>
> In summary, it's a win*4: faster, less CPU, much less RAM. Also there is
> more granular access to the system resources, so we can process more in
> parallel with readahead of the data and we contend less with other processes.
>
> cheers,
> Pádraig.
>
> p.s. I haven't actually analysed the code or cache characteristics.
> I'm just thinking about it, and testing matches the theory at least.
>
> System details:
>
> $ grep -E "(model name|cache)" /proc/cpuinfo
> model name : Intel(R) Pentium(R) M processor 1.70GHz
> cache size : 2048 KB
> $ echo "$(($(grep MemTot /proc/meminfo |
>     tr -s ' ' | cut -d' ' -f2)/(1000**2)))GB"
> 2GB
> # hdparm -tT /dev/sda  # mechanical disk
> Timing cached reads: 1470 MB in 2.00 seconds = 734.98 MB/sec
> Timing buffered disk reads: 92 MB in 3.04 seconds = 30.27 MB/sec
>
> Test setup:
>
> $ shuf -i 1-100000000 -n 10000000 > file  # 88MB
> $ export LANG=C
> # echo 1 > /proc/sys/vm/drop_caches  # before each test
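By "profile the I/O system calls" I mean something along these lines. This is an untested sketch that just reuses `file`, the -S1M buffer and the /ram tmpfs from the quoted test setup; strace -c prints a per-syscall call-count summary and -f follows any children sort spawns:

    # needs root for drop_caches, as in the test setup above
    echo 1 > /proc/sys/vm/drop_caches
    strace -c -f -o default.summary sort < file > /dev/null
    echo 1 > /proc/sys/vm/drop_caches
    TMPDIR=/ram strace -c -f -o small.summary sort -S1M < file > /dev/null
    # compare the I/O call counts between the two runs
    grep -E ' (read|write|open|openat|close)$' default.summary small.summary

That should show whether the small buffer really is doing many more (smaller) reads and writes for the temp-file merges, and so how much of the win comes from cache locality rather than from I/O behaviour.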

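If anyone wants to find the actual sweet spot rather than guessing it from the cache size, a rough sweep over buffer sizes would do it. Again just a sketch, reusing the same `file` and /ram mount from the quoted test setup; GNU time's -f format prints elapsed and user seconds:

    for size in 512K 1M 2M 4M 10M 100M; do
        echo 1 > /proc/sys/vm/drop_caches     # needs root, as in the test setup
        TMPDIR=/ram /usr/bin/time -f "-S$size: %es real, %Us user" \
            sort -S"$size" < file > /dev/null
    done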