Hi, I wanted to practice using vector intrinsics some more, so I made a small AVX2 optimization for wc -l. Depending on line length, it is about 2-5x faster than the previous version. (Looking only at user time, the speedup is much larger than that.)
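For readers who have not seen the technique, here is a minimal sketch of how newlines can be counted 32 bytes at a time with AVX2. It is not the code from the patch. The function names (count_lines, count_lines_avx2, count_lines_scalar) are made up for illustration, and the sketch assumes GCC or Clang, since it relies on the target("avx2") attribute, __builtin_cpu_supports and __builtin_popcount.

```c
#include <stddef.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define HAVE_X86 1
#endif

/* Scalar reference: count '\n' bytes one at a time. */
static size_t count_lines_scalar(const char *buf, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += (buf[i] == '\n');
    return n;
}

#ifdef HAVE_X86
/* Hypothetical AVX2 variant: compare 32 bytes against '\n' at once,
   turn the comparison result into a 32-bit mask, and popcount it. */
__attribute__((target("avx2")))
static size_t count_lines_avx2(const char *buf, size_t len)
{
    const __m256i nl = _mm256_set1_epi8('\n');
    size_t n = 0, i = 0;

    for (; i + 32 <= len; i += 32) {
        __m256i chunk = _mm256_loadu_si256((const __m256i *)(buf + i));
        __m256i eq = _mm256_cmpeq_epi8(chunk, nl);   /* 0xFF where byte == '\n' */
        unsigned mask = (unsigned)_mm256_movemask_epi8(eq);
        n += (size_t)__builtin_popcount(mask);
    }
    /* Fewer than 32 bytes remain: finish with the scalar loop. */
    return n + count_lines_scalar(buf + i, len - i);
}
#endif

/* Use the AVX2 path only when the CPU reports support for it. */
static size_t count_lines(const char *buf, size_t len)
{
#ifdef HAVE_X86
    if (__builtin_cpu_supports("avx2"))
        return count_lines_avx2(buf, len);
#endif
    return count_lines_scalar(buf, len);
}
```

The runtime check means the same binary still works on CPUs without AVX2, where it simply falls back to the scalar loop.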
I put the patch at https://github.com/coreutils/coreutils/pull/50 . Maybe this is a pointless optimization, since I guess not many people run wc -l on gigabytes of data, but maybe it could be useful for someone...

As an aside, I think that .gitignore should be updated to include src/libcksum_pclmul.a, and, if the patch is accepted, also the libwc_avx2.a I added in this patch.

Some informal benchmark results with big files already in the file cache:

-----
An HTML formatted e-book, concatenated many times. Most lines are around 80 chars, with some shorter.

avx2 (1.98x faster)
38256750 /disk2/download/storfil4

real    0m0,292s
user    0m0,040s
sys     0m0,252s

normal wc
38256750 /disk2/download/storfil4

real    0m0,580s
user    0m0,346s
sys     0m0,234s

-------
A big file with only the \n character in it.

avx2 (4.9x faster)
1328545792 /disk2/download/storfil6_bara_radbryt

real    0m0,160s
user    0m0,012s
sys     0m0,148s

normal wc
1328545792 /disk2/download/storfil6_bara_radbryt

real    0m0,768s
user    0m0,626s
sys     0m0,142s

----
A big file with no \n at all.

avx2 (I think they are basically equally fast, since over several runs it varied which one was faster)
0 /disk2/download/storfil7_inga_radbryt

real    0m0,277s
user    0m0,035s
sys     0m0,242s

normal wc
0 /disk2/download/storfil7_inga_radbryt

real    0m0,269s
user    0m0,039s
sys     0m0,230s

-- 
/Kristoffer Brånemyr