Hi Chen,

I haven't done any new testing yet beyond repeating pretty much the same
sort of tests that Pádraig reported: on a single-processor machine with a
1 MB cache I see similar results (see below). The next step would be to
perform the merging in parallel on a multiprocessor machine.
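Roughly what I have in mind is sketched below (plain C with pthreads,
purely illustrative, not based on the coreutils sources; the file name
psort.c is just hypothetical): qsort the two halves of the buffer in
separate threads, then merge the sorted runs. Note the merge step itself
is still sequential here; parallelising it too would mean first splitting
the runs by value.

/* psort.c - sketch: sort two halves of an array concurrently, then merge.
   Compile with: cc -O2 -pthread psort.c -o psort */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct chunk { long *base; size_t n; };

static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Thread body: sort one chunk in place. */
static void *sort_chunk(void *arg)
{
    struct chunk *c = arg;
    qsort(c->base, c->n, sizeof *c->base, cmp_long);
    return NULL;
}

/* Merge two sorted runs a[0..na) and b[0..nb) into out. */
static void merge(const long *a, size_t na,
                  const long *b, size_t nb, long *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}

int main(void)
{
    enum { N = 1000000 };
    long *v = malloc(N * sizeof *v);
    long *out = malloc(N * sizeof *out);
    if (!v || !out)
        return 1;
    for (size_t i = 0; i < N; i++)
        v[i] = random();

    /* Sort the two halves concurrently... */
    struct chunk lo = { v, N / 2 }, hi = { v + N / 2, N - N / 2 };
    pthread_t t;
    pthread_create(&t, NULL, sort_chunk, &lo);
    sort_chunk(&hi);            /* main thread takes the other half */
    pthread_join(t, NULL);

    /* ...then do the (still sequential) merge. */
    merge(lo.base, lo.n, hi.base, hi.n, out);

    printf("first=%ld last=%ld\n", out[0], out[N - 1]);
    free(v);
    free(out);
    return 0;
}

On a dual-core machine the chunk sorts should roughly halve the user time
of the sort phase, leaving the merge as the sequential remainder.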
Cheers,
Shaun

$ shuf -i 1-100000000 -n 10000000 >test
$ du -b test
88888143        test
$ sudo sh -c 'echo 1 >/proc/sys/vm/drop_caches'; time sort <test >/dev/null
real    0m27.517s
user    0m25.653s
sys     0m0.459s
$ sudo sh -c 'echo 1 >/proc/sys/vm/drop_caches'; time sort -S100M <test >/dev/null
real    0m26.160s
user    0m24.187s
sys     0m0.868s
$ sudo sh -c 'echo 1 >/proc/sys/vm/drop_caches'; time sort -S1M <test >/dev/null
real    0m21.485s
user    0m19.351s
sys     0m1.637s
$ sudo sh -c 'echo 1 >/proc/sys/vm/drop_caches'; time sort -S1M -T/dev/shm <test >/dev/null
real    0m20.568s
user    0m19.306s
sys     0m0.914s
$ time sort -S1M -T/dev/shm <test >/dev/null
real    0m20.007s
user    0m19.235s
sys     0m0.766s
$ time sort -S1M -T/dev/shm --compress-program=gzip <test >/dev/null
real    1m25.868s
user    1m43.431s
sys     0m29.969s
$ time sort -S1M -T/dev/shm --compress-program='gzip --fast' <test >/dev/null
sort: couldn't execute gzip --fast: No such file or directory

(It looks like --compress-program wants a bare program name rather than a
command line, since sort execs it directly without a shell; a small wrapper
script would be needed to pass --fast to gzip.)

$ head /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping        : 1
cpu MHz         : 3194.233
cache size      : 1024 KB
physical id     : 0
siblings        : 2

On Tue, 2010-03-02 at 10:40 -0800, Chen Guo wrote:
> Hi Shaun,
>     I think this is exactly what you were talking about. I'm really
> surprised this is faster, but your idea was spot on. How goes your
> work with this?
>
> > ----- Forwarded Message ----
> > From: Pádraig Brady <p...@draigbrady.com>
> > To: Report bugs to <bug-coreutils@gnu.org>
> > Cc: Joey Degges <jdeg...@gmail.com>
> > Sent: Tue, March 2, 2010 2:20:34 AM
> > Subject: Taking advantage of L1 and L2 cache in sort
> >
> > Currently when sorting we take advantage of the RAM vs disk speed
> > bump by using a large memory buffer whose size depends on the size
> > of RAM. However, we don't take advantage of the cache layer in the
> > memory hierarchy, which is increasingly important in modern systems
> > given the disparity between CPU and RAM speed increases.
> >
> > So let's test with a smaller buffer size to see whether we can take
> > advantage of cache locality in the comparisons (the system and test
> > setup details are included below).
> >
> > Using the standard large buffer as a base measurement:
> >   time sort < file > /dev/null
> >   31.6s (26.7 user)
> >
> > Reducing this to 10M:
> >   time sort -S10M < file > /dev/null
> >   27.3s (22.1 user)
> >
> > Now with a smaller buffer there will be many temp files created on
> > disk, so to alleviate this overhead somewhat, let's put them on a
> > ramdisk (mount -t tmpfs /ram /ram):
> >   time TMPDIR=/ram sort -S10M < file > /dev/null
> >   25.0s (22.1 user)
> >
> > Now reducing the buffer to 1M to better match my cache size
> > mentioned below:
> >   time TMPDIR=/ram sort -S1M < file > /dev/null
> >   23.8s (21.1 user)
> >
> > Now with a smaller buffer, less data will be read up front by
> > `sort`, so increasing the readahead done in the background should
> > help (posix_fadvise(...SEQUENTIAL)):
> >   time TMPDIR=/ram sort -S1M < file > /dev/null
> >   22.8s (21.1 user)
> >
> > Note that posix_fadvise(...WILLNEED) on the whole file runs
> > synchronously, on my Linux system at least. We don't want that, as
> > we'd lose the processing we could otherwise do in parallel with the
> > kernel readahead, which is confirmed with:
> >   time TMPDIR=/ram sort -S1M < file > /dev/null
> >   25.4s (21.1 user)
> >
> > So we just knocked 21% off the CPU usage and 28% off the run time. Woot!
> > That's with no code changes (well, apart from
> > posix_fadvise(...SEQUENTIAL), which is being added anyway as it's
> > seen to speed up reading from faster flash devices, which benefit
> > from the larger readahead). Code changes to manage the buffers
> > internally, rather than manually using the ram disk, should
> > increase the gains further.
> >
> > In summary, it's a win*4: faster, less CPU, much less RAM. Also
> > there is more granular access to the system resources, so we can
> > process more in parallel with readahead of the data, and we contend
> > less with other processes.
> >
> > cheers,
> > Pádraig.
> >
> > p.s. I haven't actually analysed the code or the cache
> > characteristics. I'm just thinking about it, and testing matches
> > the theory at least.
> >
> > System details:
> >
> > $ grep -E "(model name|cache)" /proc/cpuinfo
> > model name : Intel(R) Pentium(R) M processor 1.70GHz
> > cache size : 2048 KB
> > $ echo "$(($(grep MemTot /proc/meminfo |
> >             tr -s ' ' | cut -d' ' -f2)/(1000**2)))GB"
> > 2GB
> > # hdparm -tT /dev/sda  # mechanical disk
> > Timing cached reads:   1470 MB in  2.00 seconds = 734.98 MB/sec
> > Timing buffered disk reads:   92 MB in  3.04 seconds = 30.27 MB/sec
> >
> > Test setup:
> >
> > $ shuf -i 1-100000000 -n 10000000 > file  # 88MB
> > $ export LANG=C
> > # echo 1 > /proc/sys/vm/drop_caches  # before each test
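P.S. For anyone following along, the posix_fadvise(...SEQUENTIAL) hint
Pádraig mentions amounts to roughly the following minimal standalone
sketch (my own illustration, not the actual coreutils patch; the program
just advises the kernel and then consumes stdin):

/* fadvise.c - sketch of the readahead hint discussed above (not the
   coreutils patch).  POSIX_FADV_SEQUENTIAL asks the kernel to increase
   readahead on the input; unlike POSIX_FADV_WILLNEED on the whole file,
   it returns immediately, so we keep processing in parallel with the
   kernel's readahead.  Compile with: cc -O2 fadvise.c -o fadvise */
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* A failure (e.g. ESPIPE when stdin is a pipe) is harmless: we just
       fall back to the kernel's default readahead. */
    (void) posix_fadvise(STDIN_FILENO, 0, 0, POSIX_FADV_SEQUENTIAL);

    /* ...then read the input as sort would; here we just count bytes. */
    char buf[1 << 16];
    ssize_t n;
    long long total = 0;
    while ((n = read(STDIN_FILENO, buf, sizeof buf)) > 0)
        total += n;
    printf("read %lld bytes\n", total);
    return 0;
}

$ ./fadvise < test
read 88888143 bytes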