On June 20, 2011 13:04, Chris Hall wrote:
Karl Williamson wrote (on Sun 26-June-2011 at 01:24 +0100):
....
> I would not stand in somebody's way if they wanted to do this, but
> I'm not willing to do all the work entailed, such as proving that
> nobody has encoded a second byte after 0xFF that is not 0x80. One
> complication is that in UTF-EBCDIC, that would happen around 2**31,
> I think. I actually don't think there are any EBCDIC machines out
> there running modern Perl in native locales, but Perl officially is
> supposed to support them.
While you are tidying up this area (so that it can actually be said to
work), I think it would be a shame to leave the issue of "what is a Perl
Character" ambiguous.
Clearly the cheapest of all approaches is to declare the effective limit
to be machine specific, but with an outer limit of 72 bits for all time,
and live with the existing 0xFF encoding forever.
---------------------------------------
Since I know nothing about how this is implemented, I can offer the
following expert opinion :-)
Assuming that all code that creates and reads the Perl-Extended-UTF-8
works from/to a local machine integer, the current limit must be
64 bits? If so, one can assert that the first byte after the 0xFF in
all sequences written/readable by Perl to date *must* be 0x80.
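That can be seen from the shape of the encoding. A minimal sketch in C,
assuming the 0xFF form is a fixed 13-byte sequence whose 12 continuation
bytes carry 6 payload bits each, most significant first (the function
name is made up; this is not Perl's actual code):

    #include <assert.h>
    #include <stdint.h>

    /* Sketch only: encode a value in the current fixed-length 0xFF
     * form, assuming 0xFF is followed by 12 continuation bytes of 6
     * payload bits each (72 bits total), most significant bits first.
     */
    static void encode_ff(uint64_t v, uint8_t buf[13])
    {
        buf[0] = 0xFF;
        for (int i = 1; i <= 12; i++) {
            int shift = 72 - 6 * i;   /* byte i holds bits shift..shift+5 */
            uint8_t bits = shift < 64 ? (uint8_t)((v >> shift) & 0x3F) : 0;
            buf[i] = 0x80 | bits;
        }
        /* a 64-bit integer cannot set bits 66..71, hence: */
        assert(buf[1] == 0x80);
    }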
The least-work, but extensible, approach would then be to:
  a. assert that any value of the byte after 0xFF other
     than 0x80 is now *reserved* for future extension.
     One could define 64-bit unsigned to be the (current)
     outer limit, which implicitly limits the valid
     values for this byte (and also the most significant
     2 value bits of the next byte).
  b. throw a suitable invalid-encoding wobbly if any
     (now) reserved value is read (see the sketch below).
     I assume it already throws some wobbly if the
     value doesn't fit in a machine integer.
...so this is largely "definition engineering".
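A sketch of check (b), under the same 13-byte layout assumption as the
encoder above; the return value is a stand-in for whatever
invalid-encoding wobbly is preferred:

    #include <stdint.h>

    /* Sketch of check (b): after a 0xFF start byte, a first
     * continuation byte other than 0x80 is a (now) reserved encoding.
     * Layout assumption as above; not Perl's actual code.
     */
    static int ff_second_byte_ok(const uint8_t *s)
    {
        if (s[0] == 0xFF && s[1] != 0x80)
            return 0;           /* reserved -- throw the wobbly here */
        return 1;
    }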
At some time in the future, iff it is found necessary to go beyond 64
bits, one can then implement new-fangled sequences, with the byte(s)
after the 0xFF as a count.
The new-fangled sequences could provide shorter encodings for 37-60 bit
values. However, that messes up string comparison: old-style 0xFF 0x80
sequences do not sort correctly against new-fangled 0xFF 0x80+N ones,
because bytes of different significance line up against each other (see
the demonstration below). That is solvable, but requires special-case
handling -- including the need to know the index of the first
mismatching byte -- all of which is only required for the most remote
of fringe cases :-(
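To see why, compare a couple of hand-built sequences (byte values
follow from the layout assumptions above; sketch only):

    #include <stdio.h>
    #include <string.h>

    /* Old-style fixed 13-byte encoding of 2**60: bit 60 lands in the
     * third byte (which holds bits 60..65), so that byte is 0x81.
     */
    static const unsigned char old_style[13] = {
        0xFF, 0x80, 0x81, 0x80, 0x80, 0x80, 0x80,
        0x80, 0x80, 0x80, 0x80, 0x80, 0x80
    };

    /* New-fangled 9-byte encoding of 2**42 - 1: count byte 0x80, then
     * seven data bytes of all-ones payload, 0xBF each.
     */
    static const unsigned char new_fangled[9] = {
        0xFF, 0x80, 0xBF, 0xBF, 0xBF, 0xBF, 0xBF, 0xBF, 0xBF
    };

    int main(void)
    {
        /* 2**60 > 2**42 - 1, yet bytewise the old-style sequence
         * compares lower: its third byte holds bits 60..65, while the
         * new-fangled third byte holds bits 36..41.
         */
        printf("memcmp = %d\n", memcmp(old_style, new_fangled, 9));
        return 0;   /* prints a negative value: wrong order */
    }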
IMHO anything beyond 31 bits is "exotic", so I find it difficult to give
(damn > 0) about how wasteful the encoding is! So, sticking with the
current, fixed-length, 13-byte sequences for 37..66 bits is the
straightforward solution.
---------------------------------------
Nevertheless, the current 0xFF encoding does seem klunky :-( If there
is any general requirement -- beyond Perl-internal use -- to extend
UTF-8 beyond 31 or 36 bits, then an encoding limited by an
("interesting") early design decision in Perl is unlikely to find favour
elsewhere.
So... it would be cleaner to legislate the current 0xFF encoding out of
existence, and require anyone who (general broken-ness notwithstanding)
has 0xFF sequences in files to convert those files. Rationale: values
> 31 bits have never worked terribly well (let alone > 36); it is now
fixed, for now and into the future; BUT if you have actually managed to
use 0xFF sequences and store those in files, then here is how to convert
same (sorry). IMHO, the number of people who would be caught by this is
trivial -- but I can think of no way of verifying that.
As an intermediate step, one could now set a default limit of 36 bits
-- so that no 0xFF sequences are valid any more -- but provide a switch
to override that and use the current 0xFF sequences. Again, given the
general broken-ness, this is not a big change, and is not irreversible.
Anyone needing the override could then shout -- and it would become
clear how many people would be troubled by a later complete withdrawal
of the current 0xFF sequences and the introduction of new-fangled 0xFF
sequences (for which 0xFF 0x80 would be the start of a 9-byte sequence
for values of 37..42 bits).
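Making the new-fangled sizes explicit (my reading of the above; nothing
here is implemented):

    /* Sketch of the new-fangled sequence sizes: 0xFF, a count byte
     * 0x80+N, then 7+N data bytes of 6 payload bits each.
     */
    static int newfangled_total_bytes(int n)  { return 2 + (7 + n); }
    static int newfangled_payload_bits(int n) { return (7 + n) * 6;  }
    /* N = 0:  9 bytes total, 42 payload bits (values of 37..42 bits)
     * N = 4: 13 bytes total, 66 payload bits -- matching the capacity
     *        of the current form once its second byte is pinned to 0x80
     */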
Chris
I have now investigated this further. I now suspect that the reason a
variable-length scheme was not used initially is that it would introduce
a branch in a very commonly used construct. UTF8SKIP() is a macro that
tells how many bytes the next character occupies. It is implemented as
a simple lookup in a 256-byte const array. This would almost always be
in the cache when doing UTF-8 processing, as it is used all over the
place.
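For the curious, a cut-down sketch of such a table-driven skip -- not
Perl's actual PL_utf8skip, and it leans on the GCC range-designator
extension for brevity:

    /* Branchless, table-driven skip in the spirit of UTF8SKIP(): the
     * start byte indexes a 256-entry const array giving the total
     * sequence length.  Sketch only; uses the GCC "..." range
     * designators for brevity.
     */
    static const unsigned char utf8_skip[256] = {
        [0x00 ... 0x7F] = 1,    /* ASCII                               */
        [0x80 ... 0xBF] = 1,    /* continuation bytes (invalid starts) */
        [0xC0 ... 0xDF] = 2,
        [0xE0 ... 0xEF] = 3,
        [0xF0 ... 0xF7] = 4,
        [0xF8 ... 0xFB] = 5,
        [0xFC ... 0xFD] = 6,
        [0xFE]          = 7,    /* extended:  6 data bytes, 36 bits    */
        [0xFF]          = 13,   /* extended: 12 data bytes, 72 bits    */
    };
    #define UTF8_SKIP(s) (utf8_skip[*(const unsigned char *)(s)])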
I don't know what the effects of a rarely-taken branch would be on the
code nowadays. Probably the size gain would not be noticeable, and
branch prediction would learn that the branch isn't taken, or the check
could be inlined. But that's still my guess as to why this was done
this way originally.
As to why it's sized to allow a 72-bit code point, I don't know. It
makes some sense to stop at 64; accommodating 64-bit systems would take
12 UTF-8 bytes instead of 13. The payload is doubled -- one start byte
plus 12 data bytes, twice the 0xFE form's 6 -- so that may have some
bearing, but on what, I don't know. Doing so, though, does mean that
there are fewer overlongs than otherwise. But whether that is a
consideration they cared about, or even thought about, I don't know.
Changing things now introduces backwards compatibility issues. However,
I don't think this should be of real concern. I don't think such high
code points are used very much at all, and there is a default-off
warning raised whenever outputting a code point above Unicode. There
could for a time be a stronger, default-on, warning raised for these
very large code points.
I am not advocating for this change; I could be persuaded it is
worthwhile. I do think the pods should be changed to say that we
reserve the right to change the representation that gets written out
for very large code points, those having 0xFE and 0xFF start bytes.
And it might be a good idea to raise a different warning when
outputting these than for the run-of-the-mill above-Unicode characters.
Each added byte adds 6 bits of information.
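So, for the fixed-length extended forms, the byte count is one start
byte plus ceil(bits/6) data bytes; a sketch:

    /* One start byte plus ceil(bits / 6) data bytes of 6 bits each. */
    static int ext_utf8_bytes(int payload_bits)
    {
        return 1 + (payload_bits + 5) / 6;
    }
    /* ext_utf8_bytes(36) ==  7   (the 0xFE form)
     * ext_utf8_bytes(64) == 12   (what a 64-bit limit would need)
     * ext_utf8_bytes(72) == 13   (the 0xFF form)
     */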
UTF-EBCDIC runs out of code point space at 2**31-1 without using the
trick that UTF-8 uses to get above 2**36. But there are now 64-bit
EBCDIC platforms, so I'm going to change our implementation of
UTF-EBCDIC to use the trick. A total of 14 bytes gets it to 2**64 (as
opposed to 13 total for UTF-8 to get to 2**72).
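The 14 falls out of UTF-EBCDIC's intermediate "I8" form carrying only
5 payload bits per continuation byte; a sketch in the same shape as the
one above:

    /* UTF-EBCDIC's intermediate "I8" form carries 5 payload bits per
     * continuation byte rather than 6.
     */
    static int ext_utfebcdic_bytes(int payload_bits)
    {
        return 1 + (payload_bits + 4) / 5;
    }
    /* ext_utfebcdic_bytes(64) == 14, vs ext_utf8_bytes(72) == 13 */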