I thought I'd see if I could speed up PNG loading by vectorizing alpha
premultiplication, and it actually does give a nice speedup:
commit d7d592b0acb25ad8084b1d60459dd40bfd9c3356 (HEAD -> png-faster,
Author: Behdad Esfahbod <beh...@behdad.org>
Date: Tue Aug 8 21:29:25 2017 -0700

    Process four pixels at a time in premultiply_data() PNG function

    Load/store using memcpy(). Now this is finally faster than the
    code. The premultiply_data() overhead is reduced by 60%.

    $ ftbench -b a ~/.fonts/NotoColorEmoji.ttf
    Without premultiply_data:    155 us/op
    With 4-pixel vectorization:  167 us/op <---------
    Without vectorization:       182 us/op

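
To give an idea of the technique without opening the commit, here is a
minimal sketch of premultiplying four RGBA pixels at once. It is not the
code from the commit: premultiply_4px() is a hypothetical name, the
rounding formula is one common way to approximate division by 255, and
the sketch assumes the GCC/clang vector extensions. It only covers the
multiplication; it indexes lanes byte by byte, which sidesteps endianness
but is probably slower than packing the lanes with shifts and masks.

```c
#include <string.h>

/* 16 lanes of 16 bits: four RGBA pixels, widened so that the
   8-bit x 8-bit products fit.  Requires GCC/clang vector extensions. */
typedef unsigned short  v16u16 __attribute__(( vector_size( 32 ) ));

/* Hypothetical helper: premultiply four RGBA pixels (16 bytes) in place. */
static void
premultiply_4px( unsigned char*  p )
{
  unsigned char  bytes[16];
  v16u16         c, a, t;
  int            i;

  /* memcpy() keeps the load legal for unaligned buffers. */
  memcpy( bytes, p, 16 );

  for ( i = 0; i < 16; i++ )
  {
    c[i] = bytes[i];
    /* Colour lanes get their pixel's alpha; the alpha lane gets 255
       so that it passes through unchanged. */
    a[i] = ( i & 3 ) == 3 ? 255 : bytes[i | 3];
  }

  /* Rounded division by 255, applied to all 16 lanes at once. */
  t = c * a + 0x80;
  t = ( t + ( t >> 8 ) ) >> 8;

  for ( i = 0; i < 16; i++ )
    bytes[i] = (unsigned char)t[i];

  memcpy( p, bytes, 16 );
}
```

A caller would walk each row in steps of 16 bytes and handle the last
one to three pixels with the scalar code.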
The patch is rather terse but readable; I can add comments. It still
needs some GCC/clang checks, and the big-endian case must either be
implemented or disabled. I couldn't find any endianness macros in FreeType.
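
One possible shape for such a guard, as a sketch only: GCC and clang
predefine __BYTE_ORDER__ and __ORDER_LITTLE_ENDIAN__, so the vectorized
path could be enabled just for those compilers on little-endian targets.
PNG_VECTORIZE_PREMULTIPLY and is_little_endian() are hypothetical names,
not existing FreeType macros or functions.

```c
#include <string.h>

/* Hypothetical switch: 1 only for GCC/clang on a little-endian target,
   so every other configuration falls back to the scalar code. */
#if defined( __GNUC__ ) && defined( __BYTE_ORDER__ ) && \
    defined( __ORDER_LITTLE_ENDIAN__ )
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define PNG_VECTORIZE_PREMULTIPLY  1
#endif
#endif

#ifndef PNG_VECTORIZE_PREMULTIPLY
#define PNG_VECTORIZE_PREMULTIPLY  0
#endif

/* Run-time cross-check, useful in a test: returns 1 if the lowest
   address holds the least significant byte. */
static int
is_little_endian( void )
{
  unsigned int   one = 1;
  unsigned char  first;

  memcpy( &first, &one, 1 );
  return first == 1;
}
```

With this, big-endian builds simply keep using the existing scalar loop
until someone implements the other byte order.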