Re: perlunitut - feedback appreciated

Jarkko Hietaniemi Mon, 12 Nov 2001 01:15:10 -0800

On Sun, Nov 11, 2001 at 12:57:27PM -0800, Edward Cherlin wrote:
> Thanks. The Perl implementors and you have done a very good job. I have a
> few suggestions and one complaint.
> 
> The most important issue is chr().
> 
> >Note that C<chr(...)> for arguments less than 0x100 (decimal 256) will
> >return an eight-bit character for backward compatibility with older
> >Perls (in ISO 8859-1 platforms it can be argued to be producing
> >Unicode even then, just not Unicode encoded in UTF-8 -- the ISO 8859-1
> >is equivalent to the first 256 characters of Unicode).  For C<chr()>
> >arguments of 0x100 or more, Unicode will always be produced.
> 
> My complaint: There should be a pure Unicode alternative to this kludge.


You mean chr() producing UTF-8?  There has been talk about uchr() or
the like.  Maybe I'll just implement it in some module.
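
A minimal sketch of what such a function might look like, assuming the
utf8::upgrade() and utf8::is_utf8() functions found in later Perls (5.8
and up).  The name uchr() is hypothetical here, not an existing builtin.

```perl
use strict;
use warnings;

# Hypothetical helper, not a Perl builtin: like chr(), but the result is
# always stored internally as UTF-8, even for code points below 0x100.
sub uchr {
    my ($code_point) = @_;
    my $char = chr($code_point);
    utf8::upgrade($char);    # changes the internal storage, not the character
    return $char;
}

my $plain = chr(0xE9);       # eight-bit internally, for backward compatibility
my $wide  = uchr(0xE9);      # same character, UTF-8 internally

printf "chr:  utf8 flag %d, length %d\n",
    utf8::is_utf8($plain) ? 1 : 0, length $plain;
printf "uchr: utf8 flag %d, length %d\n",
    utf8::is_utf8($wide) ? 1 : 0, length $wide;
print "the two strings compare equal\n" if $plain eq $wide;
```

Both strings hold one character and compare equal; only the internal
representation differs.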

> Obviously, it is not hard to write one in Perl, but it should be part of the
> implementation.

> ISO Latin-1 characters encoded as 80-FF in single bytes are not Unicode.
> There is no Unicode transformation format or other encoding that permits
> this. The code point range is actually x000080-x0000FF, and the encodings
> are
> 
> 0000000010000000  0000000011111111 UTF-16 Big Endian
> 1000000000000000  1111111100000000 UTF-16 Little Endian
> 00000000000000000000000010000000  00000000000000000000000011111111 UCS-4 BE
> 10000000000000000000000000000000  11111111000000000000000000000000 UCS-4 LE
> 1100001010000000  1100001110111111 UTF-8

Okay.
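
For anyone wanting to check such bit patterns, here is a small sketch
that prints what Perl itself produces for U+0080 and U+00FF.  It assumes
the Encode module of later Perls (5.8 and up) and uses UTF-32 as a
stand-in for UCS-4, which is identical for these code points.  The
helper name bits_of() is made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the encoded form of one code point as a string of 0s and 1s.
sub bits_of {
    my ($encoding, $code_point) = @_;
    return unpack "B*", encode($encoding, chr($code_point));
}

for my $encoding (qw(UTF-16BE UTF-16LE UTF-32BE UTF-32LE UTF-8)) {
    printf "%-9s %s  %s\n", $encoding,
        bits_of($encoding, 0x80), bits_of($encoding, 0xFF);
}
```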

> >Character ranges in regular expression character classes [a-z]
> >and in the tr///, aka y///, operator are not affected by Unicode.
> 
> This could mean that they extend gracefully to Unicode, for example
> something like [\x{0300}-\x{03FF}], or that they cannot be used outside the
> 00-FF range (or would it be 00-7F?). Clarification is needed.

Hmmm.  They extend but they may not do what people are expecting them
to do: [a-z] will most certainly not mean "alphabetic characters".
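
A short sketch of the difference, assuming a Perl recent enough to
support \x{...} and \p{...} in patterns.  The helper names are made up
for the example.

```perl
use strict;
use warnings;

# A range in a character class is a range of code points, nothing more.
sub in_range  { return $_[0] =~ /^[\x{0300}-\x{03FF}]$/ ? 1 : 0 }
sub in_a_to_z { return $_[0] =~ /^[a-z]$/ ? 1 : 0 }
# \p{L} asks the Unicode question "is this a letter?" instead.
sub is_letter { return $_[0] =~ /^\p{L}$/ ? 1 : 0 }

# tr/// with an empty replacement list counts the characters in the range.
sub count_in_range {
    my ($string) = @_;
    return ($string =~ tr/\x{0300}-\x{03FF}//);
}

my $alpha = "\x{03B1}";    # GREEK SMALL LETTER ALPHA
printf "in [\\x{0300}-\\x{03FF}]: %d\n", in_range($alpha);
printf "in [a-z]:               %d\n", in_a_to_z($alpha);
printf "matches \\p{L}:          %d\n", is_letter($alpha);
printf "tr/// count:            %d\n", count_in_range("a\x{03B1}\x{03B2}z");
```

So the ranges do extend beyond 0xFF, but for "is this alphabetic?" a
property such as \p{L} is the better tool.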

> >Unicode is a standard that defines a unique number for every character.
>
> Unique: Some characters are encoded in Unicode twice. Examples include
> A-ring, also encoded as the Angstrom symbol, and a number of
> full-width/half-width variants from Japanese standards.

Argh.  This has been the most contested point of the document :-)
My take is that too many buts, ifs, and furthermores muddle the
message.
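
For the curious, the duplicates mentioned above can be seen with the
Unicode::Normalize module, assuming a Perl that ships it (5.8 and up).
This is only an illustration, not something the tutorial needs to spell
out.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFKC);

my $a_ring   = "\x{00C5}";    # LATIN CAPITAL LETTER A WITH RING ABOVE
my $angstrom = "\x{212B}";    # ANGSTROM SIGN
my $wide_a   = "\x{FF21}";    # FULLWIDTH LATIN CAPITAL LETTER A

print "A-ring and Angstrom are different code points\n"
    if $a_ring ne $angstrom;
print "but they are equal after canonical normalization (NFC)\n"
    if NFC($a_ring) eq NFC($angstrom);
print "fullwidth A becomes plain A under compatibility normalization (NFKC)\n"
    if NFKC($wide_a) eq "A";
```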

> Number: Please say "code point" rather than number.

http://www.unicode.org/unicode/standard/WhatIsUnicode.html
 
> Every character: Unicode and ISO/IEC 10646 are coordinated standards that
> provide code points for the characters in almost all modern character set
> standards, covering more than 30 writing systems and hundreds of languages,
> including all commercially important modern languages. All characters in the
> largest Chinese, Japanese, and Korean dictionaries are also encoded. The
> standards will eventually cover almost all characters in more than 250
> writing systems and thousands of languages, but will not include proprietary
> characters, personal-use characters, and some others.

Nice chunk of text.  Can I borrow?  Though the 'proprietary characters'
part is a bit debatable.  What is a proprietary character?  Is, say,
HP's roman-8 proprietary?  All its characters are in Unicode (AFAIK).

> Note that no platform today (Java, Unix, Mac, Windoze) includes rendering
> capability for all of the writing systems defined in Unicode, even where
> appropriate fonts are available. The greatest deficits are in Armenian,
> Georgian, Ethiopic, and writing systems of Asia, including India, Tibet,
> Mongolia, Sri Lanka, Burma, and Cambodia.

Hmmm.  I probably have to mention something about the display of
Unicode but I'd rather keep it short and just refer to nice URLs.

> >Since Unicode 3.1 Unicode characters have been defined all the way
> >up to 21 bits...
> 
> Unicode 1.0 began as a 16-bit character set, defining code points in the
> range 0000-FFFF. ISO/IEC 10646 defines its corresponding region
> 00000000-0000FFFF as the Basic Multilingual Plane (Plane 0). Since Unicode
> 2.0, the Unicode code space has been defined to be 000000-10FFFF, adding 16
> more planes. This is often described as a 20.5 bit encoding. A set of
> language tag characters was defined in Plane 14. Their use is highly
> deprecated.
>
> In Unicode 3.1 characters were defined in Planes 1 and 2, and there are
> plans for Plane 3, at least, to be populated in Unicode 4.0. ISO plans to
> vote soon to restrict 10646 to the corresponding range, 00000000-0010FFFF.

Uhhh, that's quite an information overload for an introductory
document.  Remember, this is not intended as a comprehensive retelling
of the Unicode FAQ, just the bare essentials to start learning more.
But saying a bit more about the history of Unicode is probably a good
idea.
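
The arithmetic behind the "21 bits" figure fits in a few lines.  A
sketch, with a made-up helper name bits_needed():

```perl
use strict;
use warnings;

# Number of bits needed to write a non-negative integer in binary.
sub bits_needed {
    my ($number) = @_;
    my $bits = 0;
    while ($number > 0) {
        $bits++;
        $number >>= 1;
    }
    return $bits;
}

my $planes          = 17;         # Plane 0 (the BMP) plus 16 more
my $per_plane       = 0x10000;    # 65536 code points in each plane
my $last_code_point = $planes * $per_plane - 1;

printf "last code point: U+%06X\n", $last_code_point;
printf "bits needed:     %d\n",     bits_needed($last_code_point);
printf "bits for BMP:    %d\n",     bits_needed(0xFFFF);
```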

> Some mention should be made of surrogates. They do not appear in UTF-8, but
> many people are unclear on this point. They are also not characters.

In the latest version (the http://www.iki.fi/jhi/perlunitut.pod is
constantly updated) I mention surrogates, but I just point to
perlunicode (the actual reference).
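
To illustrate the point about surrogates: the sketch below computes the
UTF-16 surrogate pair for a code point outside Plane 0 and shows that
its UTF-8 form is a single four-byte sequence with no surrogates in it.
It assumes the Encode module of later Perls; surrogate_pair() is a
made-up helper.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Split a supplementary code point (U+10000..U+10FFFF) into the two
# 16-bit code units UTF-16 uses for it.
sub surrogate_pair {
    my ($code_point) = @_;
    die "not a supplementary code point\n"
        if $code_point < 0x10000 || $code_point > 0x10FFFF;
    my $offset = $code_point - 0x10000;
    return (0xD800 + ($offset >> 10), 0xDC00 + ($offset & 0x3FF));
}

my $code_point = 0x1D11E;    # MUSICAL SYMBOL G CLEF
printf "UTF-16: %04X %04X\n", surrogate_pair($code_point);
printf "UTF-8:  %s\n",
    join " ", map { sprintf "%02X", $_ }
    unpack "C*", encode("UTF-8", chr($code_point));
```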

> Mention should be made of the rule requiring the use of shortest-length
> UTF-8 representations. Violations of this rule constitute a security hazard
> in communications. I hope that Perl observes this rule.

Yes, we have a regression test in our test suite that uses Markus
Kuhn's appropriate tests.  Perl generates only shortest-length, and
non-shortest UTF-8 will generate a warning.
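
A way to check the shortest-form rule from Perl code, sketched under the
assumption that the Encode module of later Perls is available.  Its
strict "UTF-8" decoder (spelled with the hyphen) refuses non-shortest
forms; is_valid_utf8() is a made-up helper, not part of Encode.

```perl
use strict;
use warnings;
use Encode ();

# True if the byte string is well-formed UTF-8 under Encode's strict
# decoder.  LEAVE_SRC keeps decode() from modifying its argument.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    return eval { Encode::decode("UTF-8", $bytes, $check); 1 } ? 1 : 0;
}

my @samples = (
    [ "2F"       => "\x2F" ],            # "/" in its shortest form
    [ "C0 AF"    => "\xC0\xAF" ],        # "/" as an overlong 2-byte form
    [ "E0 80 AF" => "\xE0\x80\xAF" ],    # "/" as an overlong 3-byte form
);

for my $sample (@samples) {
    my ($label, $bytes) = @$sample;
    printf "%-9s %s\n", $label, is_valid_utf8($bytes) ? "accepted" : "rejected";
}
```

The overlong forms are the ones that matter for security: a filter that
looks for the byte 2F will not see a "/" smuggled in as C0 AF.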

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
