Re: New version of UTF-8 on Linux

Arne Götje (高盛華) Wed, 23 Mar 2005 21:06:23 -0800

On Thursday 24 March 2005 03:45, Jan Willem Stumpel wrote:
> > But not all fonts reflect those attitudes. Fonts developed in
> > Mainland China *have to follow the GB18030 standard*, so there
> > is no other option.
>
> So GB18030 is a standard for the actual appearance of the glyphs?


Yes. I don't know of any online resource, but I have a printed copy at 
home. The same applies to Japan (JISX0213) and Taiwan (CNS11643).

Take the 'bone' radical as an example. To find it in CNS11643, go to
http://www.cns11643.gov.tw/web/seek_01.jsp
and enter the radical index number (188) in the second field (the link 
next to the field displays a list of all radicals with their 
corresponding index numbers). Then press the submit button. The result 
lists all characters for that radical.

The free Arphic fonts don't follow any of these standards. There are 
commercial ones available which follow the CNS11643 or the GB18030 
standard.

> > [..] However, I'm experimenting with OTF features, like
> > providing multiple variants for different regions. The next
> > release (scheduled for March 27) will contain the variants for
> > the "bone" character. Currently OpenOffice.org is the only
> > application I know of that supports this function. I am only
> > doing this for testing at first.
>
> OK.. I downloaded your 'AR PL ShanHeiSun Uni' font. It looks nice
> & stylish. Do you aim to make it a complete typographically
> consistent font for all of CJK (not including Korean syllables I
> suppose)? And if I understand you correctly, you also intend to

Yes. This is a work in progress. Every release contains more characters.

> provide regional variants of 'bone', 'rain', 'meat', etc., but
> only very few apps, only Openoffice at this moment, can select
> between them. So browsers can't?

As far as I know, only OpenOffice.org is *supposed* to work. I don't 
think that any other application supports this feature. (Most of them 
cannot support *any* OTF feature... including combining 
diacritics, metrics and anchors... :( )

> The 'bone' and 'rain' are characters with different appearance in
> East-Asian countries, but with the same Unicode code points. Now
> it appears that there are also characters which are so different
> (so drastically simplified) that they have been given different
> code points, although they are 'basically' the same (used in
> cognate words), like 'electricity': 電 (J) and 电 (mainland-C).
> For instance in 'computer' which I gather from Chinese billboards
> & advertisements to be 电脑; some professor in Japan years ago
> proposed dennō, 電脳, instead of konpyūtā. I don't know any
> Chinese but am curious to know the difference between the two
> types of 'equivalent' characters. Are both forms of the
> electricity character (or the word 'computer') allowed in mainland
> China?

In fact, all forms can be used in any region, but every region has a 
national standard (GB18030, JISX0213, CNS11643, etc.) which specifies 
not only which characters have to be in its charset, but also what 
they should look like.
I have heard that in China every *commercially* sold font must follow 
the GB18030 standard, but only government agencies and the military are 
*required* to use that standard.
I don't know of any such requirement in Japan or Taiwan. In fact, all 
three presentation forms can be found in public use in any of these 
regions.
However, you cannot just convert the characters between traditional and 
simplified variants to 'translate' a text between Taiwan and Mainland 
China or Japan. The vocabulary also differs.
The characters in Unicode have been 'unified' by a standards committee 
(IRG - Ideographic Rapporteur Group), which includes delegates from 
every region where CJK characters are in use. Those guys have the job 
of digging through the huge number of character variants used in each 
region and deciding which variants represent the *same* character and 
thus can be unified. If a character has been simplified by one of the 
regions and both variants are in use, then they get different 
codepoints, so that they can coexist. If the differences are only minor 
(like 'bone'), then it is considered a presentation form issue and is 
left to each region to deal with...
But even the Unicode standard is not consistent in this policy...
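
To make the two cases concrete, here is a minimal sketch using Python's 
standard unicodedata module. The characters are the ones mentioned in 
this thread; the helper name describe() is made up for illustration. 
It only shows how the characters are encoded, not how any font or 
application renders them.

```python
import unicodedata

# Characters taken from the discussion in this thread.
BONE = "\u9aa8"              # 'bone': unified, one codepoint for all regions
ELECTRICITY_TRAD = "\u96fb"  # 'electricity', traditional / Japanese form
ELECTRICITY_SIMP = "\u7535"  # 'electricity', mainland simplified form


def describe(ch):
    """Return the codepoint and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in (BONE, ELECTRICITY_TRAD, ELECTRICITY_SIMP):
    print(ch, describe(ch))

# The simplified and traditional forms are distinct characters: even the
# most aggressive Unicode normalization form (NFKC) does not merge them.
same_after_nfkc = (
    unicodedata.normalize("NFKC", ELECTRICITY_TRAD)
    == unicodedata.normalize("NFKC", ELECTRICITY_SIMP)
)
print("merged by NFKC:", same_after_nfkc)
```

The regional shapes of 'bone' cannot be told apart at this level at 
all, since the text contains the same codepoint either way; the 
difference exists only in the font.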

There was an effort in Taiwan in the 1980s to create a charset that 
includes all known variants (more than 100,000 characters), each one 
having a distinct codepoint. The idea was to let the input method 
engine handle this. For example, you type 'gu3' for 'bone', select the 
bone character and then get a list of possible presentation forms to 
choose from. The character set still exists (I don't know the standard's 
number though), but no one has ever implemented fonts or input methods 
for it. It's simply too much work and not really needed.

On the topic of applications and their support for OTF features, 
the following is urgently needed:
1. Support for metrics and anchors: freetype does support these features 
and so does 'fontforge', which I use to create my fonts. But no other 
application I know of supports them. So combining diacritics on Latin 
characters don't work correctly; only the precomposed characters (which 
have a separate codepoint for each combination) are used currently. The 
disadvantage: languages like Taiwanese (Holo) use diacritic 
combinations which have no precomposed codepoints. They rely on the 
combining sequence to work and thus look ugly when used in open 
source applications. The diacritics are not in the correct position 
above the Latin character.
2. Some applications (some GTK2 apps, for example) use a Unicode 
database which specifies which codepoints are allocated and which 
ones are not. Codepoints which are 'reserved' cannot be 
displayed. As the Unicode standard is constantly expanded and 
rearranged, new characters like U+0358 (Combining Dot Above Right), 
used in Taiwanese (Holo), cannot be displayed. This character will be 
introduced in the upcoming Unicode 4.1.0 standard.
3. Variant selection: OTF includes a feature in the GSUB table to 
specify which glyphs should be used for which codepoint in a specific 
region (simplified Chinese, trad. Chinese, Japanese, etc.). Currently 
OO.o is the only application supposed to support this feature.
I will have to study the Unicode spec to see how the variation selectors 
in Unicode are supposed to work. However, I don't know of any 
application that supports them, and if we wanted to use them, we would 
have to 'define' a standard for *which* of the variation selectors 
represents which presentation form.
Another feature is 'stylistic alternates (salt)', which is also not 
supported in any application except fontforge... this one lets you 
decide which presentation form to display without depending on the 
region. However, I don't know how this should work when you exchange 
documents...
4. SIL Graphite support should be included. I have heard that OO.o 
should be able to support Graphite in version 2.0... but I don't know 
if this is true.
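
Points 1 to 3 can be illustrated at the encoding level with a short 
sketch, again using only Python's standard unicodedata module. The 
helper names is_assigned() and compose() are hypothetical, and the 
pairing of 'bone' with VARIATION SELECTOR-1 is purely an assumption 
for illustration: as noted above, no standard defines which selector 
would stand for which regional form. The result for U+0358 also 
depends on the Unicode version bundled with the Python in use, which 
is exactly the problem described in point 2.

```python
import unicodedata


def is_assigned(ch):
    """True if this Python's Unicode database has the codepoint allocated.

    Unallocated ('reserved') codepoints have general category 'Cn'.
    """
    return unicodedata.category(ch) != "Cn"


def compose(seq):
    """Apply NFC, which replaces a combining sequence with a precomposed
    character whenever Unicode defines one."""
    return unicodedata.normalize("NFC", seq)


# Point 1: 'e' + COMBINING ACUTE ACCENT has a precomposed form, so an
# application without anchor support can fall back to a single glyph.
E_ACUTE = compose("e\u0301")

# 'o' + U+0358 (used in Taiwanese / Holo) has no precomposed form, so it
# stays a two-codepoint sequence that the renderer must position itself.
O_DOT = compose("o\u0358")

print(len(E_ACUTE), len(O_DOT))

# Point 2: an application with an older Unicode database would report
# U+0358 as unallocated and refuse to display it.
print("U+0358 allocated:", is_assigned("\u0358"))

# Point 3: a variation sequence is just base character + selector.
# Which presentation form VS1 would select is NOT defined here.
VS1 = "\ufe00"
bone_variant = "\u9aa8" + VS1
print(unicodedata.name(VS1), [f"U+{ord(c):04X}" for c in bone_variant])
```

An application that cannot position the dot of the second sequence via 
the font's anchors has no precomposed glyph to fall back on, which is 
why such text looks wrong there.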

Cheers
Arne
-- 
Arne Götje (高盛華) <[EMAIL PROTECTED]>
PGP/GnuPG key: 1024D/685D1E8C
Fingerprint: 2056 F6B7 DEA8 B478 311F  1C34 6E9F D06E 685D 1E8C
Key available at wwwkeys.pgp.net.   Encrypted e-mail preferred.
