On Wednesday, 31 May 2017 at 23:03:54 UTC, H. S. Teoh wrote:
> On Wed, May 31, 2017 at 03:46:17PM -0700, Jonathan M Davis via Digitalmars-d-learn wrote:
>> On Wednesday, May 31, 2017 12:13:04 H. S. Teoh via Digitalmars-d-learn wrote:
>>> I did some digging around, and it seems that wc is using glibc's
>>> memchr, which is highly-optimized, whereas std.algorithm.count just
>>> uses a simplistic loop. Which is strange, because I'm pretty sure
>>> somebody optimized std.algorithm some time ago to use memchr()
>>> instead of a loop when searching for a byte value in an array.
>>> Whatever happened to that??
>> I don't know, but memchr wouldn't work with CTFE, so someone might
>> have removed it to make it work in CTFE (though that could be done
>> with a different branch for CTFE). Or maybe it never made it into
>> std.algorithm for one reason or another.
> [...]
> I checked the Phobos code again, and it appears that my memory
> deceived me. Somebody *did* add memchr optimization to find() and its
> friends, but not to count().
>
> CTFE compatibility is not a problem, since we can just if(__ctfe) the
> optimized block away.
> I'm currently experimenting with a memchr-optimized version of
> count(), but I'm getting mixed results: on small arrays, or large
> arrays densely packed with matching elements, the memchr version runs
> rather slowly, because it involves a function call into the C library
> per matching element. On large arrays only sparsely populated with
> matching elements, though, the memchr-optimized version beats the
> current code by about an order of magnitude.
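For reference, the approach being described amounts to something like the
following C sketch (a hypothetical illustration, not the Phobos code): each
match costs a full call into the C library, which explains the slowdown on
densely-matching inputs.

```c
#include <stddef.h>
#include <string.h>

/* Count occurrences of byte c in buf[0 .. len) by calling memchr
 * repeatedly. One library call per match: fast when matches are
 * sparse, slow when they are dense. */
static size_t count_via_memchr(const unsigned char *buf, size_t len,
                               unsigned char c)
{
    size_t n = 0;
    const unsigned char *p = buf;
    const unsigned char *end = buf + len;
    while ((p = memchr(p, c, (size_t)(end - p))) != NULL) {
        n++;
        p++;  /* resume searching just past the match */
    }
    return n;
}
```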
> Since it wouldn't be a wise idea to assume sparsity of matches in
> Phobos, I decided to do a little more digging, and looked up the
> glibc implementation of memchr. The main optimization is that it
> iterates over the array not by byte, as a naïve loop would do, but by
> ulongs.

That's what I suggested above. It's the first optimisation to do when
looping over a buffer (memcpy, memset, memchr, etc.).

> (Of course, the first n bytes and last n bytes that are not
> ulong-aligned are checked with a per-byte loop, so for very short
> arrays it doesn't lose out to the naïve loop.) In each iteration over
> a ulong, it performs the bit-twiddling hack alluded to by Nitram to
> detect the presence of matching bytes, upon which it breaks out to
> the closing per-byte loop to find the first match. For short arrays,
> or arrays where a match is quickly found, it's comparable in
> performance to the naïve loop; for large arrays where the match is
> not found until later, it easily outperforms the naïve loop.
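The bit-twiddling hack in question is the classic SWAR "does this word
contain a given byte?" test; a minimal C sketch (not glibc's exact code)
looks like this:

```c
#include <stdint.h>

/* Return nonzero if any byte of the 64-bit word w equals c.
 * XOR with c replicated into every byte turns matching bytes into
 * 0x00, then the well-known "haszero" trick sets the high bit of
 * every zero byte. */
static int word_has_byte(uint64_t w, unsigned char c)
{
    uint64_t pat = 0x0101010101010101ULL * c;  /* c in every byte lane */
    uint64_t x = w ^ pat;                      /* matching bytes -> 0x00 */
    return ((x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL) != 0;
}
```

With one such test per ulong, the inner loop touches eight bytes per
iteration and only falls back to byte-by-byte work when a match is present.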
It is also important not to overdo the optimisations, as it can happen
that the overhead they generate manifests as pessimisations that are
not visible in a specific benchmark. The code-size explosion may
induce I-cache misses, and it can also cost I-TLB misses. Worse, using
SSE or AVX can hurt thread-switch time or, worse still, reduce the
turbo frequency of the CPU.

It's currently a hot topic on realworldtech [1]. Linus Torvalds rants
about this issue with memcpy(), which is over-engineered and does more
harm than good in practice, but has nice benchmark results.
> My current thought is to adopt the same approach: iterate over size_t
> or some such larger unit, and adapt the bit-twiddling hack to be able
> to count the number of matches in each size_t. This is turning out to
> be trickier than I'd like, though, because there is a case where
> carry propagation makes it unclear how to derive the number of
> matches without iterating over the bytes a second time.
>
> But this may not be a big problem, since size_t.sizeof is relatively
> small, so I can probably loop over individual bytes when one or more
> matches is detected, and a sufficiently capable optimizer like ldc or
> gdc would be able to unroll this into a series of sete + add
> instructions, with no branches that might stall the CPU pipeline. For
> densely-matching arrays this should still have comparable performance
> to the naïve loop; for sparsely-matching arrays it should show
> significant speedups.
That's what I think too: a small and simple loop counting the matching
bytes in the ulong would be somewhat faster than the bit-twiddling
trick, which requires a population count of the bits.
[1]: http://www.realworldtech.com/forum/?threadid=168200&curpostid=168700