UTF-8 vectorization

Mark Figura Thu, 22 Dec 2016 07:43:28 -0800

Hello everybody (long time, first time),

I came across an interesting post a few days ago with a method for counting 
the number of characters in UTF-8: 
http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html


Rather than checking byte by byte, it loads as many bytes as fit into the 
system's word size and works on them in parallel.

Does anyone know of any good resources that explain how one might come up 
with lines of code like the following two?
u = ((u & (ONEMASK * 0x80)) >> 7) & ((~u) >> 6);
count += (u * ONEMASK) >> ((sizeof(size_t) - 1) * 8);

In particular, I'm not sure how the multiplications fit in. Taking "ONEMASK 
* 0x80" as a constant, the first line is pretty straightforward, but I 
haven't a clue about the second.
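
For anyone who wants to poke at those two lines, here is a minimal sketch that wraps them in a helper and compares the result against a plain byte-by-byte count. It assumes ONEMASK is the word with 0x01 in every byte, which is how I read the post. The function names, the strlen()/memcpy() framing, and the comments on what each line does are my own tentative reading, not the blog's code, which scans for the terminating NUL itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: ONEMASK is 0x01 repeated in every byte of a size_t. */
#define ONEMASK ((size_t)(-1) / 0xFF)

/* Hypothetical helper: the two lines in question, applied to one word.
 * Tentative reading: returns how many bytes of u look like 10xxxxxx. */
static size_t count_continuation_word(size_t u)
{
    /* Leaves 0x01 in each byte whose bit 7 is set and bit 6 is clear. */
    u = ((u & (ONEMASK * 0x80)) >> 7) & ((~u) >> 6);
    /* Multiplying by ONEMASK adds u shifted left by 0, 8, 16, ... bits,
     * so the top byte ends up holding the sum of all the 0/1 bytes.
     * The shift then moves that top byte down to the bottom. */
    return (u * ONEMASK) >> ((sizeof(size_t) - 1) * 8);
}

/* Reference: count every byte that is not a continuation byte. */
static size_t utf8_strlen_bytes(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        n += ((unsigned char)*s & 0xC0) != 0x80;
    return n;
}

/* Simplified word-at-a-time variant (not the blog's loop): find the byte
 * length first, then subtract the continuation bytes found per word. */
static size_t utf8_strlen_words(const char *s)
{
    size_t len = strlen(s);
    size_t continuation = 0;
    size_t i = 0;

    for (; i + sizeof(size_t) <= len; i += sizeof(size_t)) {
        size_t u;
        memcpy(&u, s + i, sizeof u);   /* avoids unaligned access */
        continuation += count_continuation_word(u);
    }
    for (; i < len; i++)               /* leftover tail bytes */
        continuation += ((unsigned char)s[i] & 0xC0) == 0x80;

    return len - continuation;
}

/* Aborts if the two counts disagree for the given string. */
static void check_agreement(const char *s)
{
    assert(utf8_strlen_words(s) == utf8_strlen_bytes(s));
}
```

Calling check_agreement() on a handful of UTF-8 strings longer than one word is a quick way to test this reading. The sum in the top byte cannot overflow because each byte contributes at most 1 and a word has at most 8 bytes.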

Apologies if I've managed to push the entire content of one of my early CS 
classes out of my head. :)

Thanks!
Mark

