On Thu, 22 Feb 2001, David Starner wrote:

> On Wed, Feb 21, 2001 at 10:58:06PM -0800, Thomas Chan wrote:
> > First, there are the 4000 new[4] "CJK Ideographs" that he created solely
> > for a work called _Tianshu_ (A Book from the Sky)[5] (1987-1991), which Xu
> > spent three years carving movable wooden type for.  There is no doubt that
> > these are bona fide Han characters, albeit without readings and meanings.
> 
> Idiosyncratic and personal characters are not encoded in Unicode. 

I know they aren't, but they have been in the past, and some have just sneaked
in with Plane 2.  Of course, policy can also change--it wasn't that long
ago that musical notation and braille were barred.

I'm not suggesting that Xu's 4000 bogus characters deserve to be included;
this is merely an example of a possibility to think about.  If, say, they
were included in a future CJK vertical extension "M" (to pick a letter
safely in the distant future), who would know to object to them?

(Someone will probably dig up this thread, now that I've mentioned it...)

 
> > However, the lack of readings and/or meanings, or nonce usage, has not
> > stopped characters before from being included in Unicode or precursor
> > dictionaries and standards, e.g., U+20091 and U+219CC, created as a "I
> > know these two characters and you don't" one-upmanship stunt; or the
> > various typos inherited from JIS standards.  
> 
> I believe Unicode, as a general rule, does not encode meaningless
> characters. Any currently in Unicode are either mistakes, or come from 
> preexisting standards.

Mistakes are mistakes; they happen.  But how does one decide how to handle
pre-existing sources?  Set 1991 as a cutoff date?  It really becomes a
delicate issue.

 
> > But consider that these represent potentially 4000
> > codepoints that could be gobbled up by "fictional characters", and it
> > only took a single individual three years to come up with them.
> 
> But that's not true. No one is proposing that every newly created script
> that comes along be encoded in Unicode. For them to gobble up 4000 
> codepoints, it would take a body of work by a number of authors, like 
> Tengwar and Cirth have had. 

A number of the characters in Plane 2 were grandfathered in because of
their inclusion in dictionaries, despite lacking reading, meaning, or
both, or being outright typos.  If those bogus 4000 made their way into a
dictionary or standard, and some large country or countries pressed to
include them, then there'd be an ugly situation--one can probably think of
a few examples of compromises made, in Unicode and elsewhere.
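
As a quick check on the "Plane 2" references in this thread, here is a
minimal Python sketch that derives the plane of a code point from its
numeric value.  The helper name plane_of is made up for illustration; the
only thing it relies on is that each plane spans 0x10000 code points, so
the plane number is the code point shifted right by 16 bits.

```python
def plane_of(code_point: int) -> int:
    """Return the Unicode plane (0-16) that contains code_point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Each plane holds 0x10000 (65,536) code points.
    return code_point >> 16


# The two characters cited earlier in the thread.
for cp in (0x20091, 0x219CC):
    print(f"U+{cp:04X} is in Plane {plane_of(cp)}")
```

Both of the characters cited earlier come out in Plane 2.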

 
> > The second example I would like to raise are the "Square Words" or "New
> > English Calligraphy"[6] (I don't know which name is more appropriate,
> > but I will refer to it hereafter as "NEC"), which is a Sinoform script.
> > NEC is a system where each letter of the English alphabet[7] is equated
> > with one (?) component of Han characters, and each orthographic word is
> > written within the confines of a square block, in imitation of Chinese
> > writing [... CJK ideographs are precomposed in Unicode ...]
> > Thus, there's no reason to expect that NEC would be encoded any
> > differently.
> 
> I disagree. Say, for instance, some small* country decided to adopt NEC 
> as a writing style, and hence Unicode had to include it. There are 
> 1,000,000 words in English by some counts, so it's not feasible to 
> encode them all in Unicode, or even some semi-complete subset. So
> it would be encoded by component and treated like any other complex
> script. (* I say a small country, because a large country might be
> able to get a large chunk of precomposed characters stored in Unicode.
> I still don't think that it would be done solely precomposed.)

Even very small countries using a certain script would have more users
than scholars of dead scripts, even if the margin is measured in
thousands, and countries have political clout that loose federations of
scholars do not.  Yet the historical Sinoform scripts of the Khitan,
Jurchen, and Tanguts[1] are given many rows in WG2 N2314[2] (2001.1.9),
the Plane 1 roadmap, and there are no modern successors who can champion
their cause for cultural reasons, like Vietnam for chu+ no^m characters,
or Ireland for ogham.  I'm not privy to WG2's workings, but what else
would one conclude from this roadmap, and from the treatment of Han
characters and Hangul, except that a script like NEC would be encoded
precomposed?
Of course, this is only a transient roadmap.

[1] On the same roadmap are South American, Near Eastern, and other
large scripts in the same position.

[2] http://www.egt.ie/standards/iso10646/plane1-roadmap-table.html

 
> > At the inception of various other fictional scripts, no one could foresee
> > the growth of scholarly and/or amateur interest in them; 
> 
> True. That's why we wait until there is, before we consider encoding
> a script.

Yes, I agree.  It is harder to find historical scripts and characters than
to create new ones, and it is the latter, especially for large sets like
the two I raised, whose rate of inclusion must be tempered. 


Thomas Chan
[EMAIL PROTECTED]
