On Thursday, 2 June 2016 at 20:49:52 UTC, Andrei Alexandrescu wrote:
> On 06/02/2016 04:47 PM, tsbockman wrote:
>> That doesn't sound like much of an endorsement for defaulting to only level 1 support to me - "it does not handle more complex languages or extensions to the Unicode Standard very well".
>
> Code point/Level 1 support sounds like a sweet spot between efficiency/complexity and conviviality. Level 2 is opt-in with byGrapheme. -- Andrei

Actually, according to the document Walter Bright linked, level 1 does NOT operate at the code point level:

Level 1: Basic Unicode Support. At this level, the regular expression engine provides support for Unicode characters as basic 16-bit logical units. (This is independent of the actual serialization of Unicode as UTF-8, UTF-16BE, UTF-16LE, or UTF-32.)
...
Level 1 support works well in many circumstances. However, it does not handle more complex languages or extensions to the Unicode Standard very well. Particularly important cases are **surrogates** ...

So, level 1 appears to be UTF-16 code units, not code points. To do code points it would have to recognize surrogates, which are specifically mentioned as not supported.
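
To make the code unit vs. code point distinction concrete, here is a rough sketch in Python (chosen only for illustration; it is not from the linked document or from Phobos). The helper names `utf16_code_units` and `is_surrogate` are hypothetical. It encodes a string containing one character outside the Basic Multilingual Plane and compares the number of 16-bit units with the number of code points:

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")  # little-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def is_surrogate(unit):
    """True if a 16-bit unit lies in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= unit <= 0xDFFF


# 'a' followed by U+1D11E MUSICAL SYMBOL G CLEF, which lies outside the BMP.
sample = "a\U0001D11E"

units = utf16_code_units(sample)           # what a 16-bit-unit engine sees
code_points = [ord(ch) for ch in sample]   # what a code point engine sees

print([hex(u) for u in units])
print([hex(cp) for cp in code_points])
```

An engine that treats each 16-bit unit as a "character" sees three items here, two of which are surrogate halves that mean nothing on their own. An engine that works on code points sees two.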

Level 2 skips straight to graphemes, and there is no code point level.

However, this document is very old - from Unicode 3.0 and the year 2000:

While there are no surrogate characters in Unicode 3.0 (outside of private use characters), future versions of Unicode will contain them...

Perhaps level 1 has since been redefined?
