Hi Chen,

I haven't done any new testing yet beyond repeating pretty much the same
sort of tests that Pádraig reported: on a single-processor machine with a
1 MB cache I see similar results (see below). The next step would be to
perform the merging in parallel on a multiprocessor machine.
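Roughly what I have in mind is sketched below (plain C with pthreads,
purely illustrative, not based on the coreutils sources; the file name
psort.c is just hypothetical): qsort the two halves of the buffer in
separate threads, then merge the sorted runs. Note the merge step itself
is still sequential here; parallelising it too would mean first splitting
the runs by value.

/* psort.c - sketch: sort two halves of an array concurrently, then merge.
   Compile with: cc -O2 -pthread psort.c -o psort */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct chunk { long *base; size_t n; };

static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Thread body: sort one chunk in place. */
static void *sort_chunk(void *arg)
{
    struct chunk *c = arg;
    qsort(c->base, c->n, sizeof *c->base, cmp_long);
    return NULL;
}

/* Merge two sorted runs a[0..na) and b[0..nb) into out. */
static void merge(const long *a, size_t na,
                  const long *b, size_t nb, long *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}

int main(void)
{
    enum { N = 1000000 };
    long *v = malloc(N * sizeof *v);
    long *out = malloc(N * sizeof *out);
    if (!v || !out)
        return 1;
    for (size_t i = 0; i < N; i++)
        v[i] = random();

    /* Sort the two halves concurrently... */
    struct chunk lo = { v, N / 2 }, hi = { v + N / 2, N - N / 2 };
    pthread_t t;
    pthread_create(&t, NULL, sort_chunk, &lo);
    sort_chunk(&hi);            /* main thread takes the other half */
    pthread_join(t, NULL);

    /* ...then do the (still sequential) merge. */
    merge(lo.base, lo.n, hi.base, hi.n, out);

    printf("first=%ld last=%ld\n", out[0], out[N - 1]);
    free(v);
    free(out);
    return 0;
}

On a dual-core machine the chunk sorts should roughly halve the user time
of the sort phase, leaving the merge as the sequential remainder.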
Cheers,
Shaun

$ shuf -i 1-100000000 -n 10000000 >test
$ du -b test
88888143        test
$ sudo sh -c 'echo 1 >/proc/sys/vm/drop_caches'; time sort <test >/dev/null
real    0m27.517s
user    0m25.653s
sys     0m0.459s
$ sudo sh -c 'echo 1 >/proc/sys/vm/drop_caches'; time sort -S100M <test >/dev/null
real    0m26.160s
user    0m24.187s
sys     0m0.868s
$ sudo sh -c 'echo 1 >/proc/sys/vm/drop_caches'; time sort -S1M <test >/dev/null
real    0m21.485s
user    0m19.351s
sys     0m1.637s
$ sudo sh -c 'echo 1 >/proc/sys/vm/drop_caches'; time sort -S1M -T/dev/shm <test >/dev/null
real    0m20.568s
user    0m19.306s
sys     0m0.914s
$ time sort -S1M -T/dev/shm <test >/dev/null
real    0m20.007s
user    0m19.235s
sys     0m0.766s
$ time sort -S1M -T/dev/shm --compress-program=gzip <test >/dev/null
real    1m25.868s
user    1m43.431s
sys     0m29.969s
$ time sort -S1M -T/dev/shm --compress-program='gzip --fast' <test >/dev/null
sort: couldn't execute gzip --fast: No such file or directory

(It looks like --compress-program wants a bare program name rather than a
command line, since sort execs it directly without a shell; a small wrapper
script would be needed to pass --fast to gzip.)

$ head /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping        : 1
cpu MHz         : 3194.233
cache size      : 1024 KB
physical id     : 0
siblings        : 2

On Tue, 2010-03-02 at 10:40 -0800, Chen Guo wrote:
> Hi Shaun,
>     I think this is exactly what you were talking about. I'm really
> surprised this is faster, but your idea was spot on. How goes your
> work with this?
>
> > ----- Forwarded Message ----
> > From: Pádraig Brady <p...@draigbrady.com>
> > To: Report bugs to <bug-coreutils@gnu.org>
> > Cc: Joey Degges <jdeg...@gmail.com>
> > Sent: Tue, March 2, 2010 2:20:34 AM
> > Subject: Taking advantage of L1 and L2 cache in sort
> >
> > Currently when sorting we take advantage of the RAM vs disk speed
> > bump by using a large memory buffer whose size depends on the size
> > of RAM. However, we don't take advantage of the cache layer in the
> > memory hierarchy, which is increasingly important in modern systems
> > given the disparity between CPU and RAM speed increases.
> >
> > So let's test with a smaller buffer size to see whether we can take
> > advantage of cache locality in the comparisons (the system and test
> > setup details are included below).
> >
> > Using the standard large buffer as a base measurement:
> >   time sort < file > /dev/null
> >   31.6s (26.7 user)
> >
> > Reducing this to 10M:
> >   time sort -S10M < file > /dev/null
> >   27.3s (22.1 user)
> >
> > Now with a smaller buffer there will be many temp files created on
> > disk, so to alleviate this overhead somewhat, let's put them on a
> > ramdisk (mount -t tmpfs /ram /ram):
> >   time TMPDIR=/ram sort -S10M < file > /dev/null
> >   25.0s (22.1 user)
> >
> > Now reducing the buffer to 1M to better match my cache size
> > mentioned below:
> >   time TMPDIR=/ram sort -S1M < file > /dev/null
> >   23.8s (21.1 user)
> >
> > Now with a smaller buffer, less data will be read up front by
> > `sort`, so increasing the readahead done in the background should
> > help (posix_fadvise(...SEQUENTIAL)):
> >   time TMPDIR=/ram sort -S1M < file > /dev/null
> >   22.8s (21.1 user)
> >
> > Note that posix_fadvise(...WILLNEED) on the whole file runs
> > synchronously, on my Linux system at least. We don't want that, as
> > we'd lose the processing we could otherwise do in parallel with the
> > kernel readahead, which is confirmed with:
> >   time TMPDIR=/ram sort -S1M < file > /dev/null
> >   25.4s (21.1 user)
> >
> > So we just knocked 21% off the CPU usage and 28% off the run time. Woot!
> > That's with no code changes (well, apart from
> > posix_fadvise(...SEQUENTIAL), which is being added anyway as it's
> > seen to speed up reading from faster flash devices, which benefit
> > from the larger readahead). Code changes to manage the buffers
> > internally, rather than manually using the ram disk, should
> > increase the gains further.
> >
> > In summary, it's a win*4: faster, less CPU, much less RAM. Also
> > there is more granular access to the system resources, so we can
> > process more in parallel with readahead of the data, and we contend
> > less with other processes.
> >
> > cheers,
> > Pádraig.
> >
> > p.s. I haven't actually analysed the code or the cache
> > characteristics. I'm just thinking about it, and testing matches
> > the theory at least.
> >
> > System details:
> >
> > $ grep -E "(model name|cache)" /proc/cpuinfo
> > model name : Intel(R) Pentium(R) M processor 1.70GHz
> > cache size : 2048 KB
> > $ echo "$(($(grep MemTot /proc/meminfo |
> >             tr -s ' ' | cut -d' ' -f2)/(1000**2)))GB"
> > 2GB
> > # hdparm -tT /dev/sda  # mechanical disk
> > Timing cached reads:   1470 MB in  2.00 seconds = 734.98 MB/sec
> > Timing buffered disk reads:   92 MB in  3.04 seconds = 30.27 MB/sec
> >
> > Test setup:
> >
> > $ shuf -i 1-100000000 -n 10000000 > file  # 88MB
> > $ export LANG=C
> > # echo 1 > /proc/sys/vm/drop_caches  # before each test
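P.S. For anyone following along, the posix_fadvise(...SEQUENTIAL) hint
Pádraig mentions amounts to roughly the following minimal standalone
sketch (my own illustration, not the actual coreutils patch; the program
just advises the kernel and then consumes stdin):

/* fadvise.c - sketch of the readahead hint discussed above (not the
   coreutils patch).  POSIX_FADV_SEQUENTIAL asks the kernel to increase
   readahead on the input; unlike POSIX_FADV_WILLNEED on the whole file,
   it returns immediately, so we keep processing in parallel with the
   kernel's readahead.  Compile with: cc -O2 fadvise.c -o fadvise */
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* A failure (e.g. ESPIPE when stdin is a pipe) is harmless: we just
       fall back to the kernel's default readahead. */
    (void) posix_fadvise(STDIN_FILENO, 0, 0, POSIX_FADV_SEQUENTIAL);

    /* ...then read the input as sort would; here we just count bytes. */
    char buf[1 << 16];
    ssize_t n;
    long long total = 0;
    while ((n = read(STDIN_FILENO, buf, sizeof buf)) > 0)
        total += n;
    printf("read %lld bytes\n", total);
    return 0;
}

$ ./fadvise < test
read 88888143 bytes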