On Thursday, June 26, 2003 1:04 AM, Andrew C. West <[EMAIL PROTECTED]> wrote:

> On Wed, 25 Jun 2003 13:41:27 -0700 (PDT), Kenneth Whistler wrote:
> 
> > 
> > Peter asked:
> > 
> > > How can things that are visually indistinguishable be lexically
> > > different? 
> > 
> > chat (en)
> > chat (fr)
> 
> And if Unicode reordered vowels in front of consonants, then we
> wouldn't be able to distinguish :
> 
> chat (en)
> chat (fr)
> acht (de)
> 
> Andrew

Distinguishing by language in this way is futile: you are trying to add a language-specific 
lexical meaning that simply does not exist in Unicode, which standardizes only the *script* 
so that text *can* be rendered correctly independently of the actual language...

So you need to assume a single language when interpreting an encoded string, but this 
is out of scope for Unicode (which at best will define language-dependent character 
properties, but not language-dependent canonical equivalences).

When Unicode defines such a canonical equivalence, the contract must be based *only* on 
the rendered text: if the text is rendered identically, so that it becomes impossible 
to determine which order was used to encode it as abstract character sequences, then 
all these orders should be made canonically equivalent.
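
To illustrate what canonical equivalence means in practice, here is a minimal sketch 
using Python's standard unicodedata module. The helper name canonically_equivalent is 
only an illustration, not part of any standard API; it compares the canonical 
decompositions (NFD) of two strings.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Return True when both strings have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Precomposed e-acute versus "e" followed by COMBINING ACUTE ACCENT.
print(canonically_equivalent("\u00e9", "e\u0301"))

# Acute above and dot below do not interact, so either encoding order is
# canonically equivalent.
print(canonically_equivalent("a\u0301\u0323", "a\u0323\u0301"))

# The same letters in a different order render differently and are not equivalent.
print(canonically_equivalent("chat", "acht"))
```

Note that nothing in this comparison knows or cares whether "chat" is English or French: 
the encoded string is the same in both cases.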

The only exception is for abstract character properties. Normative properties MUST be 
language independent (the only exception being character transformations such as case 
mappings, which change the semantics of the text), but properties sometimes need to be 
distinct for correct processing in the rendering process. For example, the Mathematics 
Symbol category and the Letter category influence the layout in actual renderers, 
notably the choice of font styles, point sizes, or alignment, and the extraction of 
entities sharing a common set of properties, such as breaking rules, which also 
influence the correct rendering of text in display environments with different 
capabilities.
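
As a small sketch of such a property distinction, again assuming Python's standard 
unicodedata module (the helper name describe is hypothetical): the Greek capital sigma 
and the summation sign can look nearly identical, yet they carry different general 
categories, and that is information a renderer can act on.

```python
import unicodedata

def describe(ch: str) -> tuple[str, str]:
    """Return the character name and general category of a single character."""
    return unicodedata.name(ch), unicodedata.category(ch)

# U+03A3 is a letter (Lu); U+2211 is a math symbol (Sm).
for ch in ("\u03a3", "\u2211"):
    name, category = describe(ch)
    print(f"U+{ord(ch):04X} {name}: {category}")
```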

Labelling the text with extra information such as language, word semantics, or 
phonetic values is not part of the Unicode standard. The Unicode standard stops at the 
point where a text *can* be rendered with its original semantics, and this excludes 
any phonological, phonetic, or logical ordering analysis that can be made 
equally well on the rendered text or on the encoded text.

-- Philippe.
