On Fri, 5 Jun 2020, Martin Storsjö wrote:

As for the speed of musl, it doesn't seem too bad, at least for strings:
https://www.etalabs.net/compare_libcs.html

Those look decent yeah. My prime concern is for memcpy, where implementations that use SIMD instructions might be even faster - which might matter for multimedia applications.

If I understand correctly, one of the reasons for having these string functions in vcruntime*.dll, separately from the fixed UCRT, is that they want to be able to ship newer tuned versions of them more easily.

But I could actually try to make a small benchmark for this, to see if there's any significant difference (and if the default one from api-ms-win-crt-private-* that is used right now isn't much faster, it isn't much of an issue).

I also noticed that api-ms-win-crt-string-* actually does contain memcpy_s and memmove_s. So we could just have small wrappers that call these instead, so we'd avoid having to maintain a performance-sensitive implementation of that. That leaves us with a few fewer functions where we need a full implementation.

I did a few measurements with this now, and the TL;DR conclusion is - redirecting to ucrtbase/api-ms-win-crt-string's memcpy_s should be a good option.

I did the measurements with the "checkasm" tool from dav1d, with local modifications here: https://code.videolan.org/mstorsjo/dav1d/-/commits/memcpy-bench

I ran the tests with "checkasm --bench --bench-c --test=memcpy", and looked at the runtimes for the 1 MB aligned case.
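
For readers without a dav1d checkout, here is a minimal sketch of the same kind of measurement. It is not what checkasm does (checkasm counts cycles; this uses the coarser standard clock()), and the names memcpy_fn and bench_memcpy as well as the 64-byte alignment are assumptions made for illustration only.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (1024 * 1024)  /* 1 MB, matching the case looked at above */

/* Signature shared by every memcpy variant under test (hypothetical name). */
typedef void *(*memcpy_fn)(void *dest, const void *src, size_t n);

/* 64-byte alignment is an assumption about what "aligned" should mean. */
static _Alignas(64) unsigned char src_buf[BUF_SIZE];
static _Alignas(64) unsigned char dst_buf[BUF_SIZE];

/* Returns the average CPU time, in seconds, of one BUF_SIZE copy.
 * iterations must be greater than zero. */
double bench_memcpy(memcpy_fn fn, int iterations)
{
    /* Calling through a volatile pointer keeps the compiler from
     * inlining or eliding the copy. */
    memcpy_fn volatile call = fn;
    clock_t start, end;
    int i;

    memset(src_buf, 0xA5, BUF_SIZE);
    call(dst_buf, src_buf, BUF_SIZE);  /* warm-up: touch pages and caches */

    start = clock();
    for (i = 0; i < iterations; i++)
        call(dst_buf, src_buf, BUF_SIZE);
    end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC / iterations;
}
```

Calling bench_memcpy(memcpy, 1000) and then the same with another implementation gives a rough relative comparison; absolute numbers from clock() are not comparable to the cycle counts below.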

On x86_64 Linux, the results look like this (cycles, smaller is better):

musl_clang:      290599.7
glibc:           138632.0
musl_gcc:        138707.8
musl_x86_64_asm:  99238.8

On x86_64 Windows, the relevant results are like this (cycles, smaller is better):

musl_clang_c:            279249.7
msvcrt.dll_memcpy:       202482.0
msvcrt.dll_memcpy_s:     134256.6
musl_gcc_c:              123527.4
vcruntime140.dll_memcpy: 101579.3
ucrtbase.dll_memcpy:      98145.5
ucrtbase.dll_memcpy_s:    97044.2

So the musl C code is pretty good when optimized by GCC, but clang does a bad job with it. The musl x86_64 assembly implementation seems quite fast, for this testcase at least. I didn't try porting the musl x86_64 assembly implementation to the Windows calling convention, but extrapolating from the results above, it looks like it'd be in line with the vcruntime/ucrtbase results anyway.
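
One way to make that extrapolation concrete is to scale the Windows musl_gcc_c number by the asm/C ratio seen on Linux. This is only a sketch of one possible projection, assuming the ratio carries over between the two setups, which may well not hold; project_asm_cycles is a hypothetical helper name.

```c
/* Scale a C-implementation cycle count by the asm/C ratio measured
 * on another platform. Purely illustrative arithmetic. */
double project_asm_cycles(double c_cycles_here,
                          double c_cycles_there,
                          double asm_cycles_there)
{
    return c_cycles_here * (asm_cycles_there / c_cycles_there);
}

/* With the figures above:
 *   project_asm_cycles(123527.4, 138707.8, 99238.8)
 * comes out at roughly 88,000 cycles, i.e. the same ballpark as the
 * vcruntime140.dll and ucrtbase.dll results. */
```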

So making a wrapper that just forwards memcpy to api-ms-win-crt-string's memcpy_s should be a performant solution that avoids us having to maintain that performance-sensitive code.
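
A minimal sketch of what such wrappers could look like follows. The names wrapper_memcpy and wrapper_memmove are hypothetical (a real patch would define memcpy and memmove themselves), and the non-Windows stand-ins exist only so the sketch compiles elsewhere; they are not the UCRT implementation.

```c
#include <stddef.h>
#include <string.h>

#if !defined(_WIN32) && !defined(__STDC_LIB_EXT1__)
/* Stand-ins for platforms without memcpy_s/memmove_s. On Windows the
 * real ones come from the CRT (api-ms-win-crt-string / ucrtbase). */
static int memcpy_s(void *dest, size_t dest_size,
                    const void *src, size_t count)
{
    if (count > dest_size)
        return 1;
    memcpy(dest, src, count);
    return 0;
}

static int memmove_s(void *dest, size_t dest_size,
                     const void *src, size_t count)
{
    if (count > dest_size)
        return 1;
    memmove(dest, src, count);
    return 0;
}
#endif

/* Plain memcpy has no destination size, so n is passed for both the
 * destination size and the count. The error return is ignored, since
 * memcpy has no way to report it. */
void *wrapper_memcpy(void *dest, const void *src, size_t n)
{
    memcpy_s(dest, n, src, n);
    return dest;
}

/* Same idea for memmove, which must also handle overlapping buffers. */
void *wrapper_memmove(void *dest, const void *src, size_t n)
{
    memmove_s(dest, n, src, n);
    return dest;
}
```

One difference to keep in mind: the _s functions validate their arguments (for example null pointers with a nonzero count) and report an error instead of just copying, so behaviour on invalid input is not identical to a plain memcpy.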

// Martin

