From: "Anto'nio Martins-Tuva'lkin" <[EMAIL PROTECTED]>
> On 2003.05.25, 00:00, Philippe Verdy <[EMAIL PROTECTED]> wrote:
> > even if the Dutch language considers it as a single letter, in a
> > way similar to the Spanish "ch"
> 
> I see one major difference: When you apply extra wide inter-char
> distance, you (should) get, f.i.:
> K  o  r  t  r  ij  k     and not     K  o  r  t  r  i  j  k
> but    E  l  c  h  e     and not     E  l  ch  e
> This is common practice in both spanish and dutch typography, ISTK.
> I was told in this forum that the surest way to keep this working in
> Unicode texts is to use "i<WJ>j" for Dutch and plain "ij" for other
> languages.

My opinion about this is not related to the use or non-use of joiner and disjoiner 
controls.

I think it belongs to the locale definition of breakers (I mean the set of breakers 
for sentences, lines, words, hyphenation):

Shouldn't this go into the definition of locale-specific ***character (or 
character-cluster) breakers***, going beyond what Unicode can provide in a single, 
unified character model that only tries to represent international text independently 
of the language?

After all, Unicode mostly defines only the abstract characters required to encode a 
given script, outside of any typographical considerations of fonts and style effects. 
It does not really address locale-specific needs for particular typographical uses 
such as line justification...

Once again, Unicode should not attempt to be a markup language. It only represents 
text as a linear stream of abstract characters encoded in strings that can be 
transmitted. Unicode does not specify typographic needs. Those belong to other 
systems such as HTML, SGML, or XSLT and CSS, plus other internationalization standards 
such as transliteration rules, and domain-specific conventions, or even the art of 
text translation...

Regarding your request to handle ij specially in Dutch, nothing prevents a locale-aware 
rendering application from remapping the i+j pair to a single ij character before 
rendering it, if the text is labelled as Dutch...

So with a few locale-specific character-cluster breaking rules you could get:
    K  o  r  t  r  ij  k     and not   K  o  r  t  r  i  j  k
    B  i  j  e  c  t  i  e   and not    B  ij  e  c  t  i  e
(simply because i+j is a single combined Dutch ij character only if it is not followed 
by a vowel)
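
As a rough illustration of such a rule, here is a minimal Python sketch. The names 
clusters() and letterspace() and the "nl" language code are only assumptions made for 
the example, and the vowel test is the simplified rule stated above, not a complete 
description of Dutch orthography. It also treats every code point as one unit, so 
combining marks are not handled.

```python
import re

# Hypothetical per-language cluster rules: the Dutch rule keeps i+j together
# unless a vowel follows (the simplified rule from the message above).
_DUTCH = re.compile(r"ij(?![aeiou])|.", re.IGNORECASE | re.DOTALL)
_DEFAULT = re.compile(r".", re.DOTALL)


def clusters(text, lang):
    """Split text into the units that letter-spacing must not break."""
    pattern = _DUTCH if lang == "nl" else _DEFAULT
    return pattern.findall(text)


def letterspace(text, lang, gap="  "):
    """Join the clusters of text with an extra-wide gap."""
    return gap.join(clusters(text, lang))


print(letterspace("Kortrijk", "nl"))  # ij kept as one unit
print(letterspace("Bijectie", "nl"))  # i and j spaced separately
print(letterspace("Elche", "es"))     # no special rule applied for "ch"
```

The point of the sketch is only that the rule lives in the renderer and is selected 
by the language label, while the encoded text stays a plain "i" followed by "j".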

For the same reason, a French text rendered with strict typography would give:
    B  oe  u  f    and not    B  o  e  u  f
(in this case it would render the oe ligature)

Such an approach is still much less complicated than what is actually needed for 
Brahmic scripts, and even worse for Thai! And it could handle the deficiencies of some 
conversions from legacy character sets, for example restoring the final form of a 
Greek sigma when appropriate.
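
The sigma case can be sketched in a few lines of Python. This is a deliberately 
simplified rule (a sigma preceded by a word character and not followed by one), and 
restore_final_sigma() is a name invented for the example. A real implementation would 
need a more careful definition of word boundaries.

```python
import re

# Simplified rule: a small sigma that ends a word takes its final form.
# "Word" here means a run of characters matched by \w, which is an assumption.
_WORD_FINAL_SIGMA = re.compile(r"(?<=\w)\u03c3(?!\w)")


def restore_final_sigma(text):
    """Replace word-final U+03C3 (sigma) with U+03C2 (final sigma)."""
    return _WORD_FINAL_SIGMA.sub("\u03c2", text)


# The last sigma of each word is converted; the others are left alone.
print(restore_final_sigma("οδυσσευσ"))
print(restore_final_sigma("λογοσ και"))
```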

So the only good question to ask is whether we can label the text with its language, 
using some markup system, or at least using the Unicode language tags as a possible 
interface for font renderers that cannot interpret a markup system...
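
For a renderer without markup support, reading such tags could look like the Python 
sketch below. It assumes the plain-text tag characters (U+E0001 LANGUAGE TAG, the tag 
letters U+E0020..U+E007E that mirror ASCII, and U+E007F CANCEL TAG). The function name 
split_language_runs() is invented for the example, and the handling is reduced to the 
simplest case.

```python
LANGUAGE_TAG = 0xE0001                  # introduces a language tag
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007E  # tag characters mirroring ASCII
CANCEL_TAG = 0xE007F                    # ends the scope of the current tag


def split_language_runs(text, default=None):
    """Return (language, visible_text) runs with the tag characters removed."""
    runs, buf, lang = [], [], default

    def flush():
        if buf:
            runs.append((lang, "".join(buf)))
            buf.clear()

    i = 0
    while i < len(text):
        cp = ord(text[i])
        if cp == LANGUAGE_TAG:
            flush()
            i += 1
            tag = []
            # Collect the tag letters and map them back to ASCII.
            while i < len(text) and TAG_FIRST <= ord(text[i]) <= TAG_LAST:
                tag.append(chr(ord(text[i]) - 0xE0000))
                i += 1
            lang = "".join(tag) or default
        elif cp == CANCEL_TAG:
            flush()
            lang = default
            i += 1
        else:
            buf.append(text[i])
            i += 1
    flush()
    return runs
```

Each run could then be handed to the language-specific cluster rules, and a run whose 
language is unknown simply gets the default, unligated rendering.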

I would not be shocked to see the ligated or combined forms left unrendered in a text 
simply because the text is marked with the wrong language, or because such markup is 
simply not available. This case is similar to the common approach of rendering the 
text as best we can with the tools we have, by using canonical or compatibility 
equivalences.

But I see nothing in Unicode that would require text to be encoded only with the 
character Unicode prefers, merely because Unicode recommends it, when in practice 
other standards mandate input methods or keyboards on which such composition is 
highly impractical. The strict typographic rules cannot be applied without some 
smart algorithm, but the reader will always interpret the text correctly (it is the 
interpretation of text that Unicode standardizes, not its rendering).

-- Philippe.

