On Thursday, 2 June 2016 at 20:49:52 UTC, Andrei Alexandrescu wrote:
> On 06/02/2016 04:47 PM, tsbockman wrote:
>> That doesn't sound like much of an endorsement for defaulting to only
>> level 1 support to me - "it does not handle more complex languages or
>> extensions to the Unicode Standard very well".
>
> Code point/Level 1 support sounds like a sweet spot between
> efficiency/complexity and conviviality. Level 2 is opt-in with
> byGrapheme. -- Andrei
Actually, according to the document Walter Bright linked, level 1
does NOT operate at the code point level:
> Level 1: Basic Unicode Support. At this level, the regular
> expression engine provides support for Unicode characters as
> basic 16-bit logical units. (This is independent of the actual
> serialization of Unicode as UTF-8, UTF-16BE, UTF-16LE, or
> UTF-32.)
> ...
> Level 1 support works well in many circumstances. However, it
> does not handle more complex languages or extensions to the
> Unicode Standard very well. Particularly important cases are
> **surrogates** ...
So, level 1 appears to be UTF-16 code units, not code points. To
do code points it would have to recognize surrogates, which are
specifically mentioned as not supported.
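To make the code unit vs. code point distinction concrete, here is a rough sketch in Python (not D, purely for illustration). The helper names `utf16_code_units` and `is_surrogate` are made up for this example. It takes one character outside the Basic Multilingual Plane and shows that it is a single code point but two 16-bit units, both of them surrogates:

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane,
# so UTF-16 has to encode it as a surrogate pair.
clef = "\U0001D11E"


def utf16_code_units(text: str) -> list:
    """Return the UTF-16 code units of `text` as integers (hypothetical helper)."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def is_surrogate(unit: int) -> bool:
    """True if a 16-bit unit falls in the surrogate range D800..DFFF."""
    return 0xD800 <= unit <= 0xDFFF


units = utf16_code_units(clef)           # what a "16-bit logical unit" engine sees
code_points = [ord(ch) for ch in clef]   # what a code point engine sees

print([hex(u) for u in units])           # two units: 0xd834, 0xdd1e
print([hex(cp) for cp in code_points])   # one code point: 0x1d11e
```

An engine that treats each 16-bit unit as a character would see two "characters" here, which is presumably the limitation the quoted text is describing.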
Level 2 skips straight to graphemes, and there is no code point
level.
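For anyone wondering why code points and graphemes are different levels at all, here is a small Python sketch (again not D, just for illustration). `naive_clusters` is a made-up helper that only glues combining marks onto the preceding character. It is a crude approximation and NOT the real grapheme cluster algorithm, which handles many more cases:

```python
import unicodedata


def naive_clusters(text: str) -> list:
    """Group each base character with the combining marks that follow it.

    Assumption: this is a deliberately simplified stand-in for grapheme
    segmentation; it only looks at the canonical combining class.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch   # attach the mark to the previous base character
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


word = "cafe\u0301"              # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(len(word))                 # 5 code points
print(len(naive_clusters(word))) # 4 user-perceived characters
```

So iterating by code point splits the accented letter in two, while iterating by (approximate) grapheme keeps it together.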
However, this document is very old - from Unicode 3.0 and the
year 2000:
> While there are no surrogate characters in Unicode 3.0 (outside
> of private use characters), future versions of Unicode will
> contain them...
Perhaps level 1 has since been redefined?