On 2015/02/06 00:58, John C Klensin wrote:
--On Thursday, February 05, 2015 19:23 +0900 "Martin J. Dürst"
<[email protected]> wrote:
Except maybe for two or three people on the Unicode Technical
Committee I know, I wouldn't want to claim that anybody knows
the implications of even a significant (in terms of size and
use) part of the Unicode repertoire. And the average implementer
or system administrator of course understands much less.
But we definitely don't want that to lead to a situation where
we go back to (some time) last century and ASCII only.
Martin,
I don't see how you get from there to "ASCII only". First,
there are a lot of people in the world who don't "understand the
implications of" Latin Script, even the basic undecorated Latin
characters and even though they might use them. I think that,
while it may require some effort on their part, it is reasonable
to expect implementers and system administrators who establish
rules for identifiers to take responsibility for understanding
the use and possible risks associated with the characters of
their own scripts, especially the subset of those characters
that are relevant to their own languages.
Hello John - You are right that not all people who use a script
understand it. I could give some specific examples. But they usually
think they understand it, so the outcome, as far as what we write is
concerned, will be the same. Also, you are correct that not all implementers
and system administrators come from the Latin-writing parts of the world.
But a good majority do, and they have a huge influence on the rest of the
world; once one is in a defensive mindset, it's easy to come to the
conclusion "let's use ASCII, that has worked for decades, that can't be
wrong (and even if it is, nobody can be blamed/fired for choosing it)."
So "ASCII only" was somewhat of a simplification, but not a big one.
I recognize that makes it hard to design software systems that
are somehow internationally script-insensitive where identifiers
are concerned, but I think we have to live with that as the
price of the diversity of human languages and writing systems.
It may also imply a need for software implementations that are
far more rule-driven, possibly with locally-tailorable rules for
individual scripts, languages, and context, rather than an
approach that is construed as "this magic table of characters is
ok". Again, that may be the price of the diversity of human
writing systems and, by looking at tables and global profiles, we
may just be in denial about that diversity and its implications.
I agree, at least in theory.
None of the above is made any easier by Unicode decisions,
however justified by the same diversity issues, pushing us from
design principles that apply to all of the coding system, to
design principles that are different on a per-script basis, to
specific and exception-driven rules such as "normalization does
all of the right comparison things within the Latin script
except for the following code points for which decomposition is
appropriate under some circumstances but not others" or "there
are case matching rules that are universally applicable except
for certain specific locales and code points, where special
treatment is needed".
I spent quite a bit of my time in college learning Kanji. Quite a bit
later, I spent some time analysing Kanji shapes in an attempt to create
some software for font design. I published a few papers, but didn't get
much farther. But one thing I learned was that although Kanji were built
up highly regularly, once you had to account for them in their full
numbers, there was always some kind of edge case or exception that broke
(or confirmed, as the saying goes) the rules.
What you talk about above shows that the same applies to Unicode
overall: even with a very strong attempt at keeping everything in line
with simple rules, there will always be some corner case or exception.
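
The Turkish/Azerbaijani dotless i is probably the best-known example of
the kind of locale exception John mentions. A quick illustration in
Python (just to show the problem; nothing here is tied to any PRECIS
profile):

  # Language-independent Unicode full case folding:
  print("I".casefold())   # 'i'
  print("İ".casefold())   # 'i' followed by U+0307 COMBINING DOT ABOVE
                          # (the full case folding of U+0130)

  # In Turkish, however, "I" lowercases to "ı" (U+0131, dotless i) and
  # "İ" to plain "i". Without a language tag, casefold() cannot apply
  # those rules, so whether two identifiers "match" genuinely depends
  # on whose rules you think should apply.
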
It may be that we have been in denial, that the whole concept of
identifiers without language context is unworkable for at least
some protocols, and that we should be thinking of an
"internationalized identifier" as a tuple with a string and
language identifier. Comparisons would then depend, not on
catenation and bit-by-bit comparison but on
consideration of the language identifier based on RFC 4647 and
then interpretation and comparison of the string based on that
information.
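
To make that concrete, such a tuple comparison might look roughly like
the following Python sketch (the type and function names are made up for
illustration, and both the RFC 4647 matching and the per-language string
rules are only stubbed out):

  from dataclasses import dataclass

  @dataclass(frozen=True)
  class I18nIdentifier:
      text: str        # the identifier string itself
      language: str    # a BCP 47 language tag, e.g. "tr" or "de-CH"

  def identifiers_match(a: I18nIdentifier, b: I18nIdentifier) -> bool:
      # Step 1: decide whether the language tags are compatible. A real
      # implementation would use RFC 4647 matching (filtering or lookup);
      # comparing only the primary language subtag is just a stand-in.
      if a.language.split("-")[0].casefold() != b.language.split("-")[0].casefold():
          return False
      # Step 2: compare the strings under rules appropriate to that
      # language. casefold() is only a placeholder here; the whole point
      # of the tuple is that the appropriate rules would be selected per
      # language.
      return a.text.casefold() == b.text.casefold()
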
I think we have very good reasons to have rejected this approach
virtually since the first moment we thought about internationalized
identifiers. Human eyesight doesn't see invisible BCP 47 'color' painted
over text, and people don't think that way.
That suggests that we should finish the PRECIS work based on
current documents rather than looking for a more perfect
solution (or textual phrasing) now. However, it does also
suggest that, for at least some purposes, the PRECIS work may be
a waypoint rather than a final answer.
As I tried to explain, I just wanted to make sure we don't throw out the
baby with the bathwater. The new text is fine with me.
Regards, Martin.