On 15/03/15 22:18, Pádraig Brady wrote: > On 15/03/15 21:14, Kristoffer Brånemyr wrote: >> >> >> >> >>> Den söndag, 15 mars 2015 20:13 skrev Pádraig Brady <[email protected]>: >>> >>> >>>> On 15/03/15 08:33, Kristoffer Brånemyr wrote: >>>> >>>> Hi, >>>> >>>> I did some tests and found out you can actually beat memchr with a simple >>>> loop. Tests were done on >>a Intel Xeon E3-1231v3 (4*3.4GHz), on a 4GB >>>> file that was already cached in memory. >>Benchmarking >was done simply >>>> with the 'time' command. I don't know how this code would run on >>other >>>> >architectures, but I guess you could put it in an #ifdef? >>>> >>>> Coreutils 2.83 version, compiled with -O3: >>>> 507755520 /home/ztion/words >>>> >>>> real 0m3.126s >>>> user 0m2.699s >>>> sys 0m0.429s >>>> >>>> >>>> Improved version compiled with -O2: >>>> 507755520 /home/ztion/words >>>> >>>> real 0m2.857s >>>> user 0m2.461s >>>> sys 0m0.396s >>>> >>>> Improved version compiled with -O3: >>>> 507755520 /home/ztion/words >>>> >>>> real 0m1.518s >>>> user 0m1.157s >>>> sys 0m0.361s >>>> >>>> I studied the generated assembly and with -O3 gcc generates some fancy SSE >>>> code, getting some nice speedups. memchr is also SSE optimized as far as I >>>> know, so it's interesting that this is so much faster, twice as fast >>>> actually. >>>> >>>> In case you don't like turning -O3 on for some reason (the default in >>>> coreutils is -O2 i think), the best version I could put together for -O2 >>>> was this: >>>> >>>> Improved version 2, compiled with -O2: >>>> 507755520 /home/ztion/words >>>> >>>> real 0m2.206s >>>> user 0m1.827s >>>> sys 0m0.379s >> >> >>> Interesting. Thanks for the results. >>> I use 'gcc -march=native -g -O3' locally, and with that can't see a >>> difference in performance. >>> >>> What version of glibc and gcc are you using? >>> gcc-4.9.2-1.fc21.x86_64 and glibc-2.20-7.fc21.x86_64 here. >>> >>> thanks, >>> Pádraig. >> >> >> Hi, >> >> This is with gcc 4.9.2-7 and glibc 2.19-17 on Debian amd64. The difference >> is still there for me when compiling with your CFLAGS. Have they improved >> memchr in glibc 2.20? I don't think they have that yet in debian >> unfortunately. >> >> What cpu do you have? > > > i3-2310M > > I was doing a very quick test with _short_ lines > Specifically /usr/share/dict/words > > Note GCC should be using builtin_memchr here so not > hitting the function call overhead. > > I'll look in more detail later.
builtin_memchr isn't significant here as it's only used for constant folding I think. https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=124617 As hinted above, the main difference in results is due to the line lengths involved. glibc memchr() is more sophisticated, though there is overhead in calling the external function. I.E. memchr() be faster for longer lines, which I tested with: yes "$(printf %80s '')" | head -n10m > 80x10M.txt With that the existing code performs like: $ time src/wc.memchr -l 80x10M.txt real 0.253s user 0.128s sys 0.126s $ time src/wc.memchr -l 2x100M.txt real 0.842s user 0.810s sys 0.033s while the the new proposed code gave both improvements and regressions: $ time src/wc.internal -l 80x10M.txt real 0.543s user 0.422s sys 0.121s $ time src/wc.internal -l 2x100M.txt real 0.156s user 0.122s sys 0.035s Given then new code is better for shorter lines, it suggests benefits from a hybrid approach, and testing with the attached gives: $ time src/wc.hybrid -l 80x10M.txt real 0.253s user 0.132s sys 0.122s $ time src/wc.hybrid -l 2x100M.txt real 0.142s user 0.111s sys 0.031s The most important technique used is to avoid conditionals within the loop, and this extends to adding a separate loop for the short line case, as gcc 4.9.2 at least isn't sophisticated enough to handle the "check_len" invariant within the loop. cheers, Pádraig.
>From d495c2d0fa009db48e3cbe33406371c4daf7fd44 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Br=C3=A5nemyr?= <[email protected]> Date: Wed, 18 Mar 2015 15:32:19 +0000 Subject: [PATCH] wc: speedup counting of short lines Using a test file generated with: yes | head -n100M > 2x100M.txt before> time wc -l 2x100M.txt real 0.842s user 0.810s sys 0.033s after> time wc -l 2x100M.txt real 0.142s user 0.111s sys 0.031s * src/wc.c (wc): Split the loop that deals with -l into 3. The first is used at the start of the input to determine if the average line length is < 30, and if so the second loop is used to look for '\n' internally to wc. For longer lines, memchr is used as before to take advantage of system specific optimizations which any outweigh function call overhead. Note the first 2 loops could be combined, though in testing, GCC 4.9.2 at least, wasn't sophisticated enough to separate the loops based on the "check_len" invariant. Note also __builtin_memchr() isn't significant here as GCC currently only applies constant folding with that. * NEWS: Mention the improvement. --- NEWS | 2 ++ src/wc.c | 29 +++++++++++++++++++++++++++++ 2 files changed, 31 insertions(+) diff --git a/NEWS b/NEWS index 1a51235..5a52074 100644 --- a/NEWS +++ b/NEWS @@ -94,6 +94,8 @@ GNU coreutils NEWS -*- outline -*- stat and tail now know about IBRIX. stat -f --format=%T now reports the file system type, and tail -f uses polling for files on IBRIX file systems. + wc -l is now up to 6 times faster with short lines. + References from --help and the man pages of utilities have been corrected in various cases, and more direct links to the corresponding online documentation are provided. diff --git a/src/wc.c b/src/wc.c index 8cb5163..8125100 100644 --- a/src/wc.c +++ b/src/wc.c @@ -264,6 +264,8 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos) { /* Use a separate loop when counting only lines or lines and bytes -- but not chars or words. */ + bool long_lines = false; + bool check_len = true; while ((bytes_read = safe_read (fd, buf, BUFFER_SIZE)) > 0) { char *p = buf; @@ -275,12 +277,39 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos) break; } + char *end = p + bytes_read; + char *line_start = p; + + /* Avoid function call overhead for shorter lines. */ + if (check_len) + while (p != end) + { + lines += *p++ == '\n'; + /* If there are more than 300 chars in the first 10 lines, + then use memchr, where system specific optimizations + may outweigh any function call overhead. */ + if (lines <= 10) + { + if (p - line_start > 300) + { + long_lines = true; + break; + } + } + } + else if (! long_lines) + while (p != end) + lines += *p++ == '\n'; + + /* memchr is more efficient with longer lines. */ while ((p = memchr (p, '\n', (buf + bytes_read) - p))) { ++p; ++lines; } + bytes += bytes_read; + check_len = false; } } #if MB_LEN_MAX > 1 -- 2.1.0
