Fascinating, Padraig. I wonder, did this email just get to my inbox really, really late, or what prompted you to dig up a 2-year-old thread?
Anyhow, it seems you had to specifically unset MALLOC_PERTURB_, so is it set
by default? I forget whether we ever made the change to default to a small
buffer, but either way, if desired, I can make the change to read
MALLOC_PERTURB_ from env[] and set the buffer size accordingly (roughly along
the lines of the sketch at the end of this message).

On Thu, Jun 7, 2012 at 5:14 PM, Pádraig Brady <[email protected]> wrote:
> On 03/03/2010 01:06 AM, Pádraig Brady wrote:
> > On 02/03/10 18:20, Chen Guo wrote:
> >> This is exactly what that guy Shaun Jackman was talking about earlier.
> >> I'm actually really surprised this is faster, if I can dig up his
> >> e-mail I'll forward him this, I remember him saying something about
> >> experimenting with exactly this.
> >
> > I missed that thread but yes he pretty much had the
> > same idea as I stumbled on when trying to perturb
> > the posix_fadvise() testing by changing the buffer size.
> > http://lists.gnu.org/archive/html/bug-coreutils/2010-02/msg00151.html
> > Spooky :)
> >
> > Shaun, you can use `taskset` to set process affinity BTW.
> >
> >> Can you profile the difference in the number of I/O system calls?
> >
> > $ TMPDIR=/ram LANG=C /usr/bin/time -v strace -c ./src/sort sort.t/sort.1.test > /dev/null
> > % time     seconds  usecs/call     calls    errors syscall
> > ------ ----------- ----------- --------- --------- ----------------
> >  70.70    0.283077       21775        13           read
> >  28.97    0.115983       19331         6           munmap
> >   0.32    0.001268           0     21609           write
> >   0.01    0.000054           8         7           open
> >   0.00    0.000000           0         9           close
> >   0.00    0.000000           0         1           execve
> >   0.00    0.000000           0         1         1 access
> >   0.00    0.000000           0         3           brk
> >   0.00    0.000000           0         1         1 ioctl
> >   0.00    0.000000           0         1           uname
> >   0.00    0.000000           0         5           mprotect
> >   0.00    0.000000           0        25           rt_sigaction
> >   0.00    0.000000           0         1           rt_sigprocmask
> >   0.00    0.000000           0         4           getrlimit
> >   0.00    0.000000           0        16           mmap2
> >   0.00    0.000000           0         9           fstat64
> >   0.00    0.000000           0         2         1 futex
> >   0.00    0.000000           0         1           set_thread_area
> >   0.00    0.000000           0         1           set_tid_address
> >   0.00    0.000000           0         1           fadvise64_64
> >   0.00    0.000000           0         1           set_robust_list
> > ------ ----------- ----------- --------- --------- ----------------
> > 100.00    0.400382                 21717         3 total
> > Command being timed: "strace -c ./src/sort sort.t/sort.1.test"
> > User time (seconds): 26.91
> > System time (seconds): 2.01
> > Percent of CPU this job got: 90%
> > Elapsed (wall clock) time (h:mm:ss or m:ss): 0:32.02
> > Average shared text size (kbytes): 0
> > Average unshared data size (kbytes): 0
> > Average stack size (kbytes): 0
> > Average total size (kbytes): 0
> > Maximum resident set size (kbytes): 0
> > Average resident set size (kbytes): 0
> > Major (requiring I/O) page faults: 3
> > Minor (reclaiming a frame) page faults: 181060
> > Voluntary context switches: 87362
> > Involuntary context switches: 2526
> > Swaps: 0
> > File system inputs: 173504
> > File system outputs: 0
> > Socket messages sent: 0
> > Socket messages received: 0
> > Signals delivered: 0
> > Page size (bytes): 4096
> > Exit status: 0
> >
> > $ TMPDIR=/ram LANG=C /usr/bin/time -v strace -c ./src/sort -S1M sort.t/sort.1.test > /dev/null
> > % time     seconds  usecs/call     calls    errors syscall
> > ------ ----------- ----------- --------- --------- ----------------
> >  38.95    0.035011           1     60991           read
> >  33.47    0.030081          90       334           unlink
> >  24.17    0.021721           0     81864           write
> >   2.07    0.001862           2      1006           munmap
> >   0.75    0.000670           1       673           open
> >   0.23    0.000209           0      1016           mmap2
> >   0.19    0.000167           0       675           fstat64
> >   0.09    0.000085           0       675           close
> >   0.07    0.000062           0       334           fcntl64
> >   0.02    0.000018           0      1337           rt_sigprocmask
> >   0.00    0.000000           0         1           execve
> >   0.00    0.000000           0         1         1 access
> >   0.00    0.000000           0         3           brk
> >   0.00    0.000000           0         1         1 ioctl
> >   0.00    0.000000           0         1           gettimeofday
> >   0.00    0.000000           0         1           uname
> >   0.00    0.000000           0         5           mprotect
> >   0.00    0.000000           0       334           _llseek
> >   0.00    0.000000           0        25           rt_sigaction
> >   0.00    0.000000           0         1           getrlimit
> >   0.00    0.000000           0         2         1 futex
> >   0.00    0.000000           0         1           set_thread_area
> >   0.00    0.000000           0         1           set_tid_address
> >   0.00    0.000000           0       335           fadvise64_64
> >   0.00    0.000000           0         1           set_robust_list
> > ------ ----------- ----------- --------- --------- ----------------
> > 100.00    0.089886                149618         3 total
> > Command being timed: "strace -c ./src/sort -S1M sort.t/sort.1.test"
> > User time (seconds): 21.76
> > System time (seconds): 4.51
> > Percent of CPU this job got: 98%
> > Elapsed (wall clock) time (h:mm:ss or m:ss): 0:26.79
> > Average shared text size (kbytes): 0
> > Average unshared data size (kbytes): 0
> > Average stack size (kbytes): 0
> > Average total size (kbytes): 0
> > Maximum resident set size (kbytes): 0
> > Average resident set size (kbytes): 0
> > Major (requiring I/O) page faults: 3
> > Minor (reclaiming a frame) page faults: 23038
> > Voluntary context switches: 598317
> > Involuntary context switches: 2316
> > Swaps: 0
> > File system inputs: 173504
> > File system outputs: 0
> > Socket messages sent: 0
> > Socket messages received: 0
> > Signals delivered: 0
> > Page size (bytes): 4096
> > Exit status: 0
>
> Hmm I think the results above are invalid
> due to MALLOC_PERTURB_ biasing against large buffers.
> In the test below the extra external file process dominates
> any gains from CPU cache usage:
>
> $ # Test setup
> $ cache_size() { lscpu | grep cache | tail -n1 | tr -s ' ' | cut -d' ' -f3; }
> $ export TMPDIR=/dev/shm # RAM
> $ export OMP_NUM_THREADS=1 # limit to a single CPU to simplify
> $ unset MALLOC_PERTURB_ # negatively impacts performance tests
> $ shuf -i 1-10000000 > sort.test # 78MB
>
> $ LANG=C /usr/bin/time -v strace -c sort -S$(cache_size) sort.test > /dev/null
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  66.33    0.034878         185       189           unlink
>  17.89    0.009408           0     45766           read
>   8.73    0.004590           0     57711           write
>   4.53    0.002381           4       571           munmap
>   1.28    0.000671           2       382           open
>   0.46    0.000242           1       384           fstat
>   0.35    0.000183           0       580           mmap
>   0.19    0.000100           0       384           close
>   0.14    0.000074           0       757           rt_sigprocmask
>   0.11    0.000058           0       189           fcntl
>   0.00    0.000000           0       189           lseek
>   0.00    0.000000           0         5           mprotect
>   0.00    0.000000           0         4           brk
>   0.00    0.000000           0        25           rt_sigaction
>   0.00    0.000000           0         1         1 ioctl
>   0.00    0.000000           0         1         1 access
>   0.00    0.000000           0         1           execve
>   0.00    0.000000           0         1           getrlimit
>   0.00    0.000000           0         1           arch_prctl
>   0.00    0.000000           0         1           set_tid_address
>   0.00    0.000000           0       190           fadvise64
>   0.00    0.000000           0         1           set_robust_list
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.052585                107333         2 total
> Command being timed: "strace -c sort --parallel=1 -T/dev/shm -S3072K sort.test"
> User time (seconds): 25.88
> System time (seconds): 5.81
> Percent of CPU this job got: 94%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:33.59
> Average shared text size (kbytes): 0
> Average unshared data size (kbytes): 0
> Average stack size (kbytes): 0
> Average total size (kbytes): 0
> Maximum resident set size (kbytes): 4968
> Average resident set size (kbytes): 0
> Major (requiring I/O) page faults: 0
> Minor (reclaiming a frame) page faults: 13900
> Voluntary context switches: 429178
> Involuntary context switches: 4882
> Swaps: 0
> File system inputs: 0
> File system outputs: 0
> Socket messages sent: 0
> Socket messages received: 0
> Signals delivered: 0
> Page size (bytes): 4096
> Exit status: 0
>
>
> $ LANG=C /usr/bin/time -v strace -c sort -S200M sort.test > /dev/null
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  80.68    0.348335         565       617           read
>  11.58    0.049994        3846        13           munmap
>   5.33    0.022995        7665         3           unlink
>   2.38    0.010281           0     38521           write
>   0.01    0.000052          52         1         1 access
>   0.01    0.000042           8         5           mprotect
>   0.01    0.000032           1        25           rt_sigaction
>   0.00    0.000000           0        10           open
>   0.00    0.000000           0        12           close
>   0.00    0.000000           0        12           fstat
>   0.00    0.000000           0         3           lseek
>   0.00    0.000000           0        22           mmap
>   0.00    0.000000           0         4           brk
>   0.00    0.000000           0        13           rt_sigprocmask
>   0.00    0.000000           0         1         1 ioctl
>   0.00    0.000000           0         1           execve
>   0.00    0.000000           0         3           fcntl
>   0.00    0.000000           0         1           getrlimit
>   0.00    0.000000           0         1           arch_prctl
>   0.00    0.000000           0         1           set_tid_address
>   0.00    0.000000           0         4           fadvise64
>   0.00    0.000000           0         1           set_robust_list
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.431731                 39274         2 total
> Command being timed: "strace -c sort -S200M sort.test"
> User time (seconds): 19.82
> System time (seconds): 4.33
> Percent of CPU this job got: 94%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:25.57
> Average shared text size (kbytes): 0
> Average unshared data size (kbytes): 0
> Average stack size (kbytes): 0
> Average total size (kbytes): 0
> Maximum resident set size (kbytes): 205532
> Average resident set size (kbytes): 0
> Major (requiring I/O) page faults: 355
> Minor (reclaiming a frame) page faults: 98057
> Voluntary context switches: 159217
> Involuntary context switches: 4681
> Swaps: 0
> File system inputs: 140344
> File system outputs: 154168
> Socket messages sent: 0
> Socket messages received: 0
> Signals delivered: 0
> Page size (bytes): 4096
> Exit status: 0
>
> I notice also that performance gets worse for the above input
> for the small buffer size when I enable multithreading on my 4 core i3.
> User time (seconds): 35.40
> System time (seconds): 9.20
> Percent of CPU this job got: 120%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:36.98
>
> cheers,
> Pádraig.
>
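
For reference, here is roughly what I mean by reading MALLOC_PERTURB_ from
env[]. This is only a standalone sketch of the idea, not a patch against
sort.c: the function name and the two size constants are made up for
illustration, and the assumption is that when MALLOC_PERTURB_ is set in the
environment (glibc then fills allocated and freed memory with that byte
value), a large default buffer pays a disproportionate poisoning cost, so a
small buffer becomes the safer default.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical defaults, not the values sort actually uses.  */
enum { SMALL_SORT_BUF = 1 << 20,      /* 1 MiB   */
       LARGE_SORT_BUF = 128 << 20 };  /* 128 MiB */

/* Pick a default sort buffer size, preferring the small buffer when
   MALLOC_PERTURB_ is present and non-empty in the environment.  */
static size_t
default_sort_buffer_size (void)
{
  const char *perturb = getenv ("MALLOC_PERTURB_");
  if (perturb && *perturb)
    return SMALL_SORT_BUF;
  return LARGE_SORT_BUF;
}

int
main (void)
{
  printf ("default buffer: %zu bytes\n", default_sort_buffer_size ());
  return 0;
}

The alternative, of course, is to leave sort alone and just keep
MALLOC_PERTURB_ unset for performance runs, as you did in your test setup
above.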
