This is exactly what Shaun Jackman was talking about earlier. I'm actually surprised this is faster; if I can dig up his e-mail I'll forward this to him, since I remember him mentioning that he'd experimented with exactly this.
Can you profile the difference in the number of I/O system calls? (I've sketched roughly what I mean below the quoted message.)

----- Original Message ----
> From: Pádraig Brady <[email protected]>
> To: Report bugs to <[email protected]>
> Cc: Joey Degges <[email protected]>
> Sent: Tue, March 2, 2010 2:20:34 AM
> Subject: Taking advantage of L1 and L2 cache in sort
>
> Currently when sorting we take advantage of the RAM vs disk
> speed bump by using a large memory buffer dependent on the size of RAM.
> However, we don't take advantage of the cache layer in the
> memory hierarchy, which is of increasing importance in modern
> systems given the disparity between CPU and RAM speed increases.
>
> So let's test with a smaller buffer size to see whether we can
> take advantage of cache locality in the comparisons
> (the system and test setup details are included below).
>
> Using the standard large buffer as a base measurement:
> time sort < file > /dev/null
> 31.6s (26.7 user)
> Reducing this to 10M:
> time sort -S10M < file > /dev/null
> 27.3s (22.1 user)
> Now with a smaller buffer there will be many temp files
> created on disk, so to alleviate this overhead somewhat,
> let's put them on a ramdisk (mount -t tmpfs /ram /ram):
> time TMPDIR=/ram sort -S10M < file > /dev/null
> 25.0s (22.1 user)
> Now reducing the buffer to 1M to better match
> my cache size mentioned below:
> time TMPDIR=/ram sort -S1M < file > /dev/null
> 23.8s (21.1 user)
> Now with a smaller buffer, less data will be read
> up front by `sort`, so increasing the readahead
> done in the background should help (posix_fadvise(...SEQUENTIAL)):
> time TMPDIR=/ram sort -S1M < file > /dev/null
> 22.8s (21.1 user)
> Note posix_fadvise(...WILLNEED) on the whole file runs
> synchronously, on my Linux system at least. We don't want
> that, as we lose some processing time in parallel with
> the kernel readahead, which is confirmed with:
> time TMPDIR=/ram sort -S1M < file > /dev/null
> 25.4s (21.1 user)
>
> So we just knocked 21% off the CPU usage and 28% off the run time. Woot!
> That's with no code changes (well, apart from posix_fadvise(...SEQUENTIAL),
> which is being added anyway as it's seen to speed up reading
> from faster flash devices which benefit from the larger readahead).
> Code changes to manage the buffers internally, rather than
> manually using the ram disk, should increase the gains further.
>
> In summary, it's a win*4: faster, less CPU, much less RAM. Also there is
> more granular access to the system resources, so we can process more in
> parallel with readahead of the data and we contend less with other processes.
>
> cheers,
> Pádraig.
>
> p.s. I haven't actually analysed the code or cache characteristics.
> I'm just thinking about it, and testing matches the theory at least.
>
> System details:
>
> $ grep -E "(model name|cache)" /proc/cpuinfo
> model name : Intel(R) Pentium(R) M processor 1.70GHz
> cache size : 2048 KB
> $ echo "$(($(grep MemTot /proc/meminfo |
>     tr -s ' ' | cut -d' ' -f2)/(1000**2)))GB"
> 2GB
> # hdparm -tT /dev/sda  # mechanical disk
> Timing cached reads: 1470 MB in 2.00 seconds = 734.98 MB/sec
> Timing buffered disk reads: 92 MB in 3.04 seconds = 30.27 MB/sec
>
> Test setup:
>
> $ shuf -i 1-100000000 -n 10000000 > file  # 88MB
> $ export LANG=C
> # echo 1 > /proc/sys/vm/drop_caches  # before each test
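By "profile the I/O system calls" I mean something along these lines. This is an untested sketch that just reuses `file`, the -S1M buffer and the /ram tmpfs from the quoted test setup; strace -c prints a per-syscall call-count summary and -f follows any children sort spawns:

    # needs root for drop_caches, as in the test setup above
    echo 1 > /proc/sys/vm/drop_caches
    strace -c -f -o default.summary sort < file > /dev/null
    echo 1 > /proc/sys/vm/drop_caches
    TMPDIR=/ram strace -c -f -o small.summary sort -S1M < file > /dev/null
    # compare the I/O call counts between the two runs
    grep -E ' (read|write|open|openat|close)$' default.summary small.summary

That should show whether the small buffer really is doing many more (smaller) reads and writes for the temp-file merges, and so how much of the win comes from cache locality rather than from I/O behaviour.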

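If anyone wants to find the actual sweet spot rather than guessing it from the cache size, a rough sweep over buffer sizes would do it. Again just a sketch, reusing the same `file` and /ram mount from the quoted test setup; GNU time's -f format prints elapsed and user seconds:

    for size in 512K 1M 2M 4M 10M 100M; do
        echo 1 > /proc/sys/vm/drop_caches     # needs root, as in the test setup
        TMPDIR=/ram /usr/bin/time -f "-S$size: %es real, %Us user" \
            sort -S"$size" < file > /dev/null
    done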