Forgot to CC the list:
> I did a quick time -v, and found that sorting a 96M file with -S500M
> there were 36358 page faults, but only 5380 page faults with -S10M.
>
> Wow.
>
> So system time goes up, but user time goes down. It seems odd
> that user time would go down, but I believe the difference is in how
> the merged output is written.
>
> In an internal sort, the output is written only after all the merging
> has finished, while in an external merge each line is written out as
> it is merged. While working on parallel sort with my group, we noticed
> a ~14% speedup when we wrote output during the top level of merging,
> as opposed to all at once after the sort completed.
>
> bash-3.2$ /usr/bin/time -v sort -S10M randL > /dev/null
> Command being timed: "sort -S10M randL"
> User time (seconds): 4.74
> System time (seconds): 0.57
> Percent of CPU this job got: 99%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:05.32
> Average shared text size (kbytes): 0
> Average unshared data size (kbytes): 0
> Average stack size (kbytes): 0
> Average total size (kbytes): 0
> Maximum resident set size (kbytes): 0
> Average resident set size (kbytes): 0
> Major (requiring I/O) page faults: 0
> Minor (reclaiming a frame) page faults: 5380
> Voluntary context switches: 14
> Involuntary context switches: 11
> Swaps: 0
> File system inputs: 0
> File system outputs: 0
> Socket messages sent: 0
> Socket messages received: 0
> Signals delivered: 0
> Page size (bytes): 4096
> Exit status: 0
> bash-3.2$ /usr/bin/time -v sort -S500M randL > /dev/null
> Command being timed: "sort -S500M randL"
> User time (seconds): 5.27
> System time (seconds): 0.28
> Percent of CPU this job got: 99%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:05.56
> Average shared text size (kbytes): 0
> Average unshared data size (kbytes): 0
> Average stack size (kbytes): 0
> Average total size (kbytes): 0
> Maximum resident set size (kbytes): 0
> Average resident set size (kbytes): 0
> Major (requiring I/O) page faults: 0
> Minor (reclaiming a frame) page faults: 36358
> Voluntary context switches: 3
> Involuntary context switches: 11
> Swaps: 0
> File system inputs: 0
> File system outputs: 0
> Socket messages sent: 0
> Socket messages received: 0
> Signals delivered: 0
> Page size (bytes): 4096
> Exit status: 0
> >
> >
> > ----- Original Message ----
> > From: Philip Rowlands
> > To: Pádraig Brady
> > Cc: Report bugs to ; Joey Degges
> > Sent: Tue, March 2, 2010 5:21:15 AM
> > Subject: Re: Taking advantage of L1 and L2 cache in sort
> >
> > On Tue, 2 Mar 2010, Pádraig Brady wrote:
> >
> > > Currently when sorting we take advantage of the RAM vs disk
> > > speed bump by using a large mem buffer dependent on the size of RAM.
> > > However we don't take advantage of the cache layer in the
> > > memory hierarchy which has an increasing importance in modern
> > > systems given the disparity between CPU and RAM speed increases.
> >
> > [snip data]
> >
> > Interesting results; this type of analysis might also benefit from
> > running the various tests under cachegrind, which would give detailed
> > results about L1/L2 cache miss rates.
> >
> > Cheers,
> > Phil
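For anyone wanting to reproduce the comparison above, here is a rough sketch. The 96M `randL` file and the buffer sizes are from the thread; how the original file was generated isn't stated, so the generation step below (and the smaller file size, to keep it quick) is an assumption:

```shell
# Generate a random test file (the thread used a ~96MB file called randL;
# a much smaller file is used here purely to keep the sketch quick).
head -c 1000000 /dev/urandom | base64 | fold -w 32 > randL.small

# Internal sort: buffer comfortably larger than the input.
sort -S500M randL.small > out.internal

# External merge sort: force a small buffer so sort spills to temp files.
sort -S1M randL.small > out.external

# The two orderings must agree regardless of the buffer size.
cmp out.internal out.external

# Page-fault counts as in the thread (GNU time, not the shell builtin):
#   /usr/bin/time -v sort -S10M randL.small > /dev/null
#
# Cache miss rates, as Phil suggests:
#   valgrind --tool=cachegrind sort -S10M randL.small > /dev/null
```

Note that `time -v` must be the external GNU time binary; bash's builtin `time` does not report page faults.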
