On Wednesday, 31 May 2017 at 23:03:54 UTC, H. S. Teoh wrote:
> On Wed, May 31, 2017 at 03:46:17PM -0700, Jonathan M Davis via Digitalmars-d-learn wrote:
>> On Wednesday, May 31, 2017 12:13:04 H. S. Teoh via Digitalmars-d-learn wrote:
>>> I did some digging around, and it seems that wc is using glibc's
>>> memchr, which is highly-optimized, whereas std.algorithm.count just
>>> uses a simplistic loop. Which is strange, because I'm pretty sure
>>> somebody optimized std.algorithm some time ago to use memchr()
>>> instead of a loop when searching for a byte value in an array.
>>> Whatever happened to that??
>> I don't know, but memchr wouldn't work with CTFE, so someone might
>> have removed it to make it work in CTFE (though that could be done
>> with a different branch for CTFE). Or maybe it never made it into
>> std.algorithm for one reason or another.
> [...]
> I checked the Phobos code again, and it appears that my memory
> deceived me. Somebody *did* add memchr optimization to find() and its
> friends, but not to count().
>
> CTFE compatibility is not a problem, since we can just if(__ctfe) the
> optimized block away.
> I'm currently experimenting with a memchr-optimized version of
> count(), but I'm getting mixed results: on small arrays, or large
> arrays densely packed with matching elements, the memchr version runs
> rather slowly, because it involves a function call into the C library
> per matching element. On large arrays only sparsely populated with
> matching elements, though, the memchr-optimized version beats the
> current code by about an order of magnitude.
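For reference, the approach being described amounts to something like the
following C sketch (a hypothetical illustration, not the Phobos code): each
match costs a full call into the C library, which explains the slowdown on
densely-matching inputs.

```c
#include <stddef.h>
#include <string.h>

/* Count occurrences of byte c in buf[0 .. len) by calling memchr
 * repeatedly. One library call per match: fast when matches are
 * sparse, slow when they are dense. */
static size_t count_via_memchr(const unsigned char *buf, size_t len,
                               unsigned char c)
{
    size_t n = 0;
    const unsigned char *p = buf;
    const unsigned char *end = buf + len;
    while ((p = memchr(p, c, (size_t)(end - p))) != NULL) {
        n++;
        p++;  /* resume searching just past the match */
    }
    return n;
}
```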
> Since it wouldn't be a wise idea to assume sparsity of matches in
> Phobos, I decided to do a little more digging, and looked up the
> glibc implementation of memchr. The main optimization is that it
> iterates over the array not by byte, as a naïve loop would do, but by
> ulongs.

That's what I suggested above. It's the first optimisation to do when
looping over a buffer (memcpy, memset, memchr, etc.).

> (Of course, the first n bytes and last n bytes that are not
> ulong-aligned are checked with a per-byte loop, so for very short
> arrays it doesn't lose out to the naïve loop.) In each iteration over
> a ulong, it performs the bit-twiddling hack alluded to by Nitram to
> detect the presence of matching bytes, upon which it breaks out to
> the closing per-byte loop to find the first match. For short arrays,
> or arrays where a match is quickly found, it's comparable in
> performance to the naïve loop; for large arrays where the match is
> not found until later, it easily outperforms the naïve loop.
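The bit-twiddling hack in question is the classic SWAR "does this word
contain a given byte?" test; a minimal C sketch (not glibc's exact code)
looks like this:

```c
#include <stdint.h>

/* Return nonzero if any byte of the 64-bit word w equals c.
 * XOR with c replicated into every byte turns matching bytes into
 * 0x00, then the well-known "haszero" trick sets the high bit of
 * every zero byte. */
static int word_has_byte(uint64_t w, unsigned char c)
{
    uint64_t pat = 0x0101010101010101ULL * c;  /* c in every byte lane */
    uint64_t x = w ^ pat;                      /* matching bytes -> 0x00 */
    return ((x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL) != 0;
}
```

With one such test per ulong, the inner loop touches eight bytes per
iteration and only falls back to byte-by-byte work when a match is present.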
It is also important not to overdo the optimisations, as it can happen
that the overhead they generate manifests as pessimisations that are
not visible in a specific benchmark. The code-size explosion may
induce I-cache misses, and it can also cost I-TLB misses. Worse, using
SSE or AVX can hurt thread-switch time or, worse still, reduce the
turbo frequency of the CPU.

It's currently a hot topic on realworldtech [1]. Linus Torvalds rants
about this issue with memcpy(), which is over-engineered and does more
harm than good in practice, but has nice benchmark results.
> My current thought is to adopt the same approach: iterate over size_t
> or some such larger unit, and adapt the bit-twiddling hack to be able
> to count the number of matches in each size_t. This is turning out to
> be trickier than I'd like, though, because there is a case where
> carry propagation makes it unclear how to derive the number of
> matches without iterating over the bytes a second time.
>
> But this may not be a big problem, since size_t.sizeof is relatively
> small, so I can probably loop over individual bytes when one or more
> matches is detected, and a sufficiently capable optimizer like ldc or
> gdc would be able to unroll this into a series of sete + add
> instructions, with no branches that might stall the CPU pipeline. For
> densely-matching arrays this should still have comparable performance
> to the naïve loop; for sparsely-matching arrays it should show
> significant speedups.
That's what I think too: a small and simple loop counting the matching
bytes in the ulong would be somewhat faster than the bit-twiddling
trick, which requires a population count of the bits.
[1]: http://www.realworldtech.com/forum/?threadid=168200&curpostid=168700