On Monday, 24 March 2014 at 11:48:00 UTC, Dmitry Olshansky wrote:
RFC 3629 (http://tools.ietf.org/html/rfc3629) restricted UTF-8 to conform to constraints in UTF-16, removing all 5- and 6-byte sequences.

More importantly, the Unicode standard explicitly fixed the range of code points to what is representable in UTF-16, starting with the 5th version of the standard if memory serves me right.

I did some hacks in C at work using _pext_u32. pext is an absolutely wonderful instruction, and so is its counterpart pdep:
http://software.intel.com/sites/landingpage/IntrinsicsGuide/

And it's ridiculously fast according to Agner (latency 3, throughput 1):
http://www.agner.org/optimize/instruction_tables.pdf

I think we should add this as an intrinsic to D as well (if it isn't there already; I couldn't find it). It could do wonders for UTF decoding.
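
To make the idea concrete, here is a rough sketch in C of how pext could strip the UTF-8 framing bits in one step. This is an illustration only, not a finished decoder: the names pext32 and utf8_decode are made up for this example, the software loop is a stand-in for machines without BMI2, and the input is assumed to be well-formed (no checks for truncated input, bad continuation bytes, overlong forms or surrogates).

```c
#include <stdint.h>
#if defined(__BMI2__)
#include <immintrin.h>
#endif

/* Gather the bits of src selected by mask into the low bits of the
   result, keeping their order. Uses the real instruction when the
   compiler targets BMI2, otherwise a slow portable emulation. */
uint32_t pext32(uint32_t src, uint32_t mask)
{
#if defined(__BMI2__)
    return _pext_u32(src, mask);
#else
    uint32_t out = 0;
    uint32_t bit = 1;
    while (mask != 0) {
        uint32_t lowest = mask & (~mask + 1u); /* lowest set bit of mask */
        if (src & lowest)
            out |= bit;
        bit <<= 1;
        mask &= mask - 1;                      /* clear that bit */
    }
    return out;
#endif
}

/* Hypothetical helper: decode one code point from well-formed UTF-8.
   Returns the number of bytes consumed, or 0 for an invalid lead byte.
   The bytes are packed big-endian into one word, then a per-length
   mask selects the payload bits and pext squeezes them together. */
int utf8_decode(const unsigned char *s, uint32_t *cp)
{
    uint32_t b0 = s[0];

    if (b0 < 0x80) {                       /* 0xxxxxxx */
        *cp = b0;
        return 1;
    }
    if ((b0 & 0xE0) == 0xC0) {             /* 110xxxxx 10xxxxxx */
        uint32_t w = (b0 << 8) | s[1];
        *cp = pext32(w, 0x1F3Fu);
        return 2;
    }
    if ((b0 & 0xF0) == 0xE0) {             /* 1110xxxx 10xxxxxx 10xxxxxx */
        uint32_t w = (b0 << 16) | ((uint32_t)s[1] << 8) | s[2];
        *cp = pext32(w, 0x0F3F3Fu);
        return 3;
    }
    if ((b0 & 0xF8) == 0xF0) {             /* 11110xxx + 3 continuation bytes */
        uint32_t w = (b0 << 24) | ((uint32_t)s[1] << 16)
                   | ((uint32_t)s[2] << 8) | s[3];
        *cp = pext32(w, 0x073F3F3Fu);
        return 4;
    }
    return 0;                              /* stray continuation or 5/6-byte lead */
}
```

The point is that the usual shift-and-mask chain per continuation byte collapses into a single pext with a mask chosen by sequence length. Whether that wins in practice would need measuring, and a real decoder would still have to validate the input.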

I'm currently too busy to submit a complete solution, but please feel free to use my idea if you think it sounds promising.
