On 19.03.2017 at 23:00, Christian Bartolomaeus via RT wrote:
> Looking at https://en.wikipedia.org/wiki/Code_point and
> http://www.unicode.org/glossary/#code_point I understand that
> U+10FFFF is indeed the maximum Unicode code point.

Yes. UTF-8 encodes it in four bytes; the four-byte form has room for 21 bits (up to 0x1fffff), but the standard caps it at 0x10ffff to match what UTF-16 can represent, see https://en.wikipedia.org/wiki/UTF-8#Description.

I was wondering how the Unicode consortium might extend this limit, so I investigated a bit.


TL;DR

I can confirm that 10ffff is going to remain the maximum for the foreseeable future.


DETAILS


Technical limits:

  UTF-8 could be extended up to 0x3ffffffffff (42 bits in 8 bytes) [1]
  UTF-16 ("surrogate pairs") cannot be extended beyond 0x10ffff
  UTF-32 can be extended up to 0xffffffff (32 bits available)
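
As a quick sanity check of the UTF-16 and UTF-32 figures, here is a minimal Python sketch. The function names are mine and purely illustrative; the surrogate ranges are the standard ones (high U+D800..U+DBFF, low U+DC00..U+DFFF).

```python
def utf16_max_code_point() -> int:
    """Largest code point reachable with one UTF-16 surrogate pair."""
    high_values = 0xDBFF - 0xD800 + 1  # 1,024 high surrogates
    low_values = 0xDFFF - 0xDC00 + 1   # 1,024 low surrogates
    # Surrogate pairs start at U+10000 and cover high * low values.
    return 0x10000 + high_values * low_values - 1


def utf32_max_value() -> int:
    """Largest value that fits into one 32-bit code unit."""
    return 2**32 - 1


if __name__ == "__main__":
    print(hex(utf16_max_code_point()))  # 0x10ffff
    print(hex(utf32_max_value()))       # 0xffffffff
```

The pair 0xDBFF 0xDFFF is the last one available, which is why UTF-16 has no room beyond 0x10ffff without changing the scheme itself.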


Political limits:

Java strings are UTF-16 and rely on surrogate pairs, and UTF-16 is not
extensible. Any motion to extend the Unicode code range would therefore meet
opposition from Oracle and from any language community that has a JVM
implementation and wants to stay interoperable with Java libraries.


Code space exhaustion:

Unicode assigns code points like this:
  characters:  128,237 code points
  private use: U+e000..U+f8ff (6,400 code points) [2]
               U+f0000..U+ffffd (65,534) [2]
               U+100000..U+10fffd (65,534) [2]
  surrogates:  U+d800..U+dfff (2,048 code points) [2]
So out of the
  0x10ffff+1=1,114,112 available code points (U+0000 included),
  128,237+6,400+65,534+65,534+2,048=267,753 are in use, leaving
  846,359 free for future character set extension.

Assuming 10,000 new characters per year (which is conservative given the numbers in [3]), the current encoding space will be exhausted in about 85 years.
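
The arithmetic can be redone with a few lines of Python. This is only a sketch under stated assumptions: the counts are the ones quoted from [2], U+0000 is counted as a code point, the surrogates are counted as used, and the 10,000 per year growth rate is the assumption from the paragraph above. The names are made up for illustration.

```python
CHARACTERS = 128_237                    # assigned characters, from [2]
PRIVATE_USE = 6_400 + 65_534 + 65_534   # BMP area plus planes 15 and 16
SURROGATES = 2_048                      # U+D800..U+DFFF
TOTAL_CODE_POINTS = 0x10FFFF + 1        # U+0000 through U+10FFFF


def free_code_points() -> int:
    """Code points not yet assigned, reserved for private use, or surrogates."""
    return TOTAL_CODE_POINTS - (CHARACTERS + PRIVATE_USE + SURROGATES)


def years_until_exhaustion(new_per_year: int = 10_000) -> float:
    """Years until the code space is full at a constant assumed growth rate."""
    return free_code_points() / new_per_year


if __name__ == "__main__":
    print(free_code_points())                 # 846359
    print(round(years_until_exhaustion()))    # 85
```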


Regards,
Jo

[1]
The Unicode consortium could extend the maximum value of UTF-8 by using more prefixes:
  111110xx for 5-byte encoding
  1111110x for 6-byte encoding
  11111110 for 7-byte encoding
  11111111 for 8-byte encoding
No prefixes are possible for a longer encoding.

Bit counts for each prefix are:
  5-byte: number of 4-byte encoding bits (21) + 5 = 26
  6-byte: 26 + 5 = 31
  7-byte: 31 + 5 = 36
  8-byte: 36 + 6 = 42 (the lead byte has no payload bit left to give up)
The maximum 8-byte-encoded value has all 42 payload bits set:
  2^42 - 1 = 0x3FFFFFFFFFF
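
For anyone who wants to check the bit counts, here is a small Python sketch. It assumes the hypothetical extension described in this footnote (lead bytes 0xFE and 0xFF for 7 and 8 bytes), which is not part of any current standard; the function names are mine.

```python
def payload_bits(length: int) -> int:
    """Payload bits of a UTF-8-style sequence of `length` bytes (2..8)."""
    if not 2 <= length <= 8:
        raise ValueError("length must be between 2 and 8")
    # The lead byte spends `length` one-bits plus a zero bit on the prefix;
    # for 7 and 8 bytes nothing is left over for payload.
    lead_bits = max(0, 7 - length)
    # Every continuation byte (10xxxxxx) contributes 6 bits.
    return lead_bits + 6 * (length - 1)


def max_value(length: int) -> int:
    """Largest value encodable in `length` bytes: all payload bits set."""
    return 2 ** payload_bits(length) - 1


if __name__ == "__main__":
    for n in range(4, 9):
        print(n, payload_bits(n), hex(max_value(n)))
```

For lengths 4 to 8 this gives 21, 26, 31, 36 and 42 bits. Each longer form can also represent every value of the shorter ones, so the overall maximum is simply the largest 8-byte value, not a sum over the lengths.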

[2]
Numbers taken from
http://unicodebook.readthedocs.io/unicode.html#statistics

[3]
See https://en.wikipedia.org/wiki/Unicode#Versions
