Fascinating, Padraig. I wonder, did this email just get to my inbox really,
really late, or what prompted you to dig up a 2 year old thread?

Anyhow, it seems you had to specifically unset MALLOC_PERTURB_,
so is it by default set? I forget if we ever made the change to default to
a small buffer, but either way if desired I can make the change to read
MALLOC_PERTURB_ from env[] and set the buffer size accordingly
based on that.

On Thu, Jun 7, 2012 at 5:14 PM, Pádraig Brady <[email protected]> wrote:

> On 03/03/2010 01:06 AM, Pádraig Brady wrote:
> > On 02/03/10 18:20, Chen Guo wrote:
> >> This is exactly what that guy Shaun Jackman was talking about earlier.
> >> I'm actually really surprised this is faster, if I can dig up his
> e-mail I'll
> >> forward him this, I remember him saying something about experimenting
> >> with exactly this.
> >
> > I missed that thread but yes he pretty much had the
> > same idea as I stumbled on when trying to perturb
> > the posix_fadvise() testing by changing the buffer size.
> > http://lists.gnu.org/archive/html/bug-coreutils/2010-02/msg00151.html
> > Spooky :)
> >
> > Shaun, you can use `taskset` to set process affinity BTW.
> >
> >> Can you profile the difference in the number of I/O system calls?
> >
> > $ TMPDIR=/ram LANG=C /usr/bin/time -v strace -c ./src/sort
> sort.t/sort.1.test > /dev/null
> > % time     seconds  usecs/call     calls    errors syscall
> > ------ ----------- ----------- --------- --------- ----------------
> >  70.70    0.283077       21775        13           read
> >  28.97    0.115983       19331         6           munmap
> >   0.32    0.001268           0     21609           write
> >   0.01    0.000054           8         7           open
> >   0.00    0.000000           0         9           close
> >   0.00    0.000000           0         1           execve
> >   0.00    0.000000           0         1         1 access
> >   0.00    0.000000           0         3           brk
> >   0.00    0.000000           0         1         1 ioctl
> >   0.00    0.000000           0         1           uname
> >   0.00    0.000000           0         5           mprotect
> >   0.00    0.000000           0        25           rt_sigaction
> >   0.00    0.000000           0         1           rt_sigprocmask
> >   0.00    0.000000           0         4           getrlimit
> >   0.00    0.000000           0        16           mmap2
> >   0.00    0.000000           0         9           fstat64
> >   0.00    0.000000           0         2         1 futex
> >   0.00    0.000000           0         1           set_thread_area
> >   0.00    0.000000           0         1           set_tid_address
> >   0.00    0.000000           0         1           fadvise64_64
> >   0.00    0.000000           0         1           set_robust_list
> > ------ ----------- ----------- --------- --------- ----------------
> > 100.00    0.400382                 21717         3 total
> >     Command being timed: "strace -c ./src/sort sort.t/sort.1.test"
> >     User time (seconds): 26.91
> >     System time (seconds): 2.01
> >     Percent of CPU this job got: 90%
> >     Elapsed (wall clock) time (h:mm:ss or m:ss): 0:32.02
> >     Average shared text size (kbytes): 0
> >     Average unshared data size (kbytes): 0
> >     Average stack size (kbytes): 0
> >     Average total size (kbytes): 0
> >     Maximum resident set size (kbytes): 0
> >     Average resident set size (kbytes): 0
> >     Major (requiring I/O) page faults: 3
> >     Minor (reclaiming a frame) page faults: 181060
> >     Voluntary context switches: 87362
> >     Involuntary context switches: 2526
> >     Swaps: 0
> >     File system inputs: 173504
> >     File system outputs: 0
> >     Socket messages sent: 0
> >     Socket messages received: 0
> >     Signals delivered: 0
> >     Page size (bytes): 4096
> >     Exit status: 0
> >
> > $ TMPDIR=/ram LANG=C /usr/bin/time -v strace -c ./src/sort -S1M
> sort.t/sort.1.test > /dev/null
> > % time     seconds  usecs/call     calls    errors syscall
> > ------ ----------- ----------- --------- --------- ----------------
> >  38.95    0.035011           1     60991           read
> >  33.47    0.030081          90       334           unlink
> >  24.17    0.021721           0     81864           write
> >   2.07    0.001862           2      1006           munmap
> >   0.75    0.000670           1       673           open
> >   0.23    0.000209           0      1016           mmap2
> >   0.19    0.000167           0       675           fstat64
> >   0.09    0.000085           0       675           close
> >   0.07    0.000062           0       334           fcntl64
> >   0.02    0.000018           0      1337           rt_sigprocmask
> >   0.00    0.000000           0         1           execve
> >   0.00    0.000000           0         1         1 access
> >   0.00    0.000000           0         3           brk
> >   0.00    0.000000           0         1         1 ioctl
> >   0.00    0.000000           0         1           gettimeofday
> >   0.00    0.000000           0         1           uname
> >   0.00    0.000000           0         5           mprotect
> >   0.00    0.000000           0       334           _llseek
> >   0.00    0.000000           0        25           rt_sigaction
> >   0.00    0.000000           0         1           getrlimit
> >   0.00    0.000000           0         2         1 futex
> >   0.00    0.000000           0         1           set_thread_area
> >   0.00    0.000000           0         1           set_tid_address
> >   0.00    0.000000           0       335           fadvise64_64
> >   0.00    0.000000           0         1           set_robust_list
> > ------ ----------- ----------- --------- --------- ----------------
> > 100.00    0.089886                149618         3 total
> >     Command being timed: "strace -c ./src/sort -S1M sort.t/sort.1.test"
> >     User time (seconds): 21.76
> >     System time (seconds): 4.51
> >     Percent of CPU this job got: 98%
> >     Elapsed (wall clock) time (h:mm:ss or m:ss): 0:26.79
> >     Average shared text size (kbytes): 0
> >     Average unshared data size (kbytes): 0
> >     Average stack size (kbytes): 0
> >     Average total size (kbytes): 0
> >     Maximum resident set size (kbytes): 0
> >     Average resident set size (kbytes): 0
> >     Major (requiring I/O) page faults: 3
> >     Minor (reclaiming a frame) page faults: 23038
> >     Voluntary context switches: 598317
> >     Involuntary context switches: 2316
> >     Swaps: 0
> >     File system inputs: 173504
> >     File system outputs: 0
> >     Socket messages sent: 0
> >     Socket messages received: 0
> >     Signals delivered: 0
> >     Page size (bytes): 4096
> >     Exit status: 0
>
> Hmm I think the results above are invalid
> due to MALLOC_PERTURB_ biasing against large buffers.
> In the test below the extra external file process dominates
> any gains from CPU cache usage:
>
> $ # Test setup
> $ cache_size() { lscpu | grep cache | tail -n1 | tr -s ' ' | cut -d' '
> -f3; }
> $ export TMPDIR=/dev/shm # RAM
> $ export OMP_NUM_THREADS=1 # limit to a single CPU to simplify
> $ unset MALLOC_PERTURB_ # negatively impacts performance tests
> $ shuf -i 1-10000000 > sort.test # 78MB
>
> $ LANG=C /usr/bin/time -v strace -c sort -S$(cache_size) sort.test >
> /dev/null
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  66.33    0.034878         185       189           unlink
>  17.89    0.009408           0     45766           read
>  8.73    0.004590           0     57711           write
>  4.53    0.002381           4       571           munmap
>  1.28    0.000671           2       382           open
>  0.46    0.000242           1       384           fstat
>  0.35    0.000183           0       580           mmap
>  0.19    0.000100           0       384           close
>  0.14    0.000074           0       757           rt_sigprocmask
>  0.11    0.000058           0       189           fcntl
>  0.00    0.000000           0       189           lseek
>  0.00    0.000000           0         5           mprotect
>  0.00    0.000000           0         4           brk
>  0.00    0.000000           0        25           rt_sigaction
>  0.00    0.000000           0         1         1 ioctl
>  0.00    0.000000           0         1         1 access
>  0.00    0.000000           0         1           execve
>  0.00    0.000000           0         1           getrlimit
>  0.00    0.000000           0         1           arch_prctl
>  0.00    0.000000           0         1           set_tid_address
>  0.00    0.000000           0       190           fadvise64
>  0.00    0.000000           0         1           set_robust_list
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.052585                107333         2 total
>        Command being timed: "strace -c sort --parallel=1 -T/dev/shm
> -S3072K sort.test"
>        User time (seconds): 25.88
>        System time (seconds): 5.81
>        Percent of CPU this job got: 94%
>        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:33.59
>        Average shared text size (kbytes): 0
>        Average unshared data size (kbytes): 0
>        Average stack size (kbytes): 0
>        Average total size (kbytes): 0
>        Maximum resident set size (kbytes): 4968
>        Average resident set size (kbytes): 0
>        Major (requiring I/O) page faults: 0
>        Minor (reclaiming a frame) page faults: 13900
>        Voluntary context switches: 429178
>        Involuntary context switches: 4882
>        Swaps: 0
>        File system inputs: 0
>        File system outputs: 0
>        Socket messages sent: 0
>        Socket messages received: 0
>        Signals delivered: 0
>        Page size (bytes): 4096
>        Exit status: 0
>
>
> $ LANG=C /usr/bin/time -v strace -c sort -S200M sort.test > /dev/null
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  80.68    0.348335         565       617           read
>  11.58    0.049994        3846        13           munmap
>  5.33    0.022995        7665         3           unlink
>  2.38    0.010281           0     38521           write
>  0.01    0.000052          52         1         1 access
>  0.01    0.000042           8         5           mprotect
>  0.01    0.000032           1        25           rt_sigaction
>  0.00    0.000000           0        10           open
>  0.00    0.000000           0        12           close
>  0.00    0.000000           0        12           fstat
>  0.00    0.000000           0         3           lseek
>  0.00    0.000000           0        22           mmap
>  0.00    0.000000           0         4           brk
>  0.00    0.000000           0        13           rt_sigprocmask
>  0.00    0.000000           0         1         1 ioctl
>  0.00    0.000000           0         1           execve
>  0.00    0.000000           0         3           fcntl
>  0.00    0.000000           0         1           getrlimit
>  0.00    0.000000           0         1           arch_prctl
>  0.00    0.000000           0         1           set_tid_address
>  0.00    0.000000           0         4           fadvise64
>  0.00    0.000000           0         1           set_robust_list
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.431731                 39274         2 total
>        Command being timed: "strace -c sort -S200M sort.test"
>        User time (seconds): 19.82
>        System time (seconds): 4.33
>        Percent of CPU this job got: 94%
>        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:25.57
>        Average shared text size (kbytes): 0
>        Average unshared data size (kbytes): 0
>        Average stack size (kbytes): 0
>        Average total size (kbytes): 0
>        Maximum resident set size (kbytes): 205532
>        Average resident set size (kbytes): 0
>        Major (requiring I/O) page faults: 355
>        Minor (reclaiming a frame) page faults: 98057
>        Voluntary context switches: 159217
>        Involuntary context switches: 4681
>        Swaps: 0
>        File system inputs: 140344
>        File system outputs: 154168
>        Socket messages sent: 0
>        Socket messages received: 0
>        Signals delivered: 0
>        Page size (bytes): 4096
>        Exit status: 0
>
> I notice also that performance gets worse for the above input
> for the small buffer size when I enable multithreading on my 4 core i3.
>        User time (seconds): 35.40
>        System time (seconds): 9.20
>        Percent of CPU this job got: 120%
>        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:36.98
>
> cheers,
> Pádraig.
>

Reply via email to