On Wednesday, June 25, 2003 8:14 PM, Peter Lofting <[EMAIL PROTECTED]> wrote:

> At 7:41 PM +0200 6/25/03, Philippe Verdy wrote:
> > If there are real distinct semantics that were "abusively" unified
> > by the canonicalization, the only safe way would be to create a
> > second character that would have another combining class than the
> > existing one, to be used when lexical distinction from the most
> > common use is necessary.
> > 
> > So the added character for the modified vowel signs would have the
> > same representative glyph, but would have the additional semantic
> > "contraction" (clearly indicated in their name). This does not break
> > the existing encoding of most texts, but allows a specific usage for
> > contractions where the existing canonical equivalences would be
> > inappropriate.
> 
> How do you envisage this getting into the data?
> 
> Often in Tibetan data capture, operators are keying in the appearance
> of a text and do not know what a stack represents.
> 
> So the data then requires expert review after input to verify and
> assign the semantic representation.

This is not a major problem; in fact it occurs every day in all scripts: there are 
correctors, and dictionary-based corrections can be used to help fix the 
"incorrectly" or ambiguously encoded string...

This is true even for Latin-based languages, where incorrect accents are used or 
accents are missing, and only native readers will be able to spot the incorrect 
interpretation of a grapheme cluster, using their own knowledge of the language when 
the "error" (introduced by some intermediate technical constraint, such as a past gap 
in the standard) appears.

I still think that the contraction "problem" has a limited impact, which does not 
affect the normal written form of the Tibetan language, which clearly uses a single 
interpretation. If both interpretations of a grapheme cluster are needed, then we 
should keep the encoding of the existing characters for the most common interpretation 
(without the contraction semantics), and assign a variant specifically to allow encoding 
the other interpretation or reading of the grapheme cluster.
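To picture the distinction in data, here is a minimal sketch. It swaps in a variation 
selector purely for illustration (the proposal above is actually a second character 
with a different combining class, not a variation sequence, and no such sequence is 
defined for Tibetan): the two readings share a glyph but remain distinct code point 
sequences.

```python
# Illustration only: two readings of one visually identical cluster.
# U+0F40 TIBETAN LETTER KA and U+0F71 TIBETAN VOWEL SIGN AA are real;
# the use of VARIATION SELECTOR-1 to mark the "contraction" reading is
# an assumption made for this sketch, not an actual Unicode assignment.
BASE = "\u0F40\u0F71"          # KA + vowel sign: common reading
VS1 = "\uFE00"                 # VARIATION SELECTOR-1

plain = BASE                   # most common interpretation
contraction = BASE + VS1       # same glyph, "contraction" reading

# The sequences compare as distinct, so lexical tools can tell them
# apart, while a renderer ignoring the selector draws the same glyph.
print(plain == contraction)    # False: distinct encodings
```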

Legacy encoded text may still contain ambiguous encodings that will look erroneous 
under the updated standard, but this offers a way to correct the encoded text later: 
by looking for occurrences of such ambiguous sequences and letting actual native 
readers correct their interpretation, if the correction is absolutely required for 
some text processing.

I do think that most already encoded text will not need such correction, if the 
encoding is just a way to transport a text intended only to be rendered or printed, 
not used for automated lexical analysis. And even in that case, if the encoding 
ambiguity is well documented in a revision of the standard, tools such as automated 
full-text search engines can be enhanced to search for both encodings of the 
character, based on their identical glyphic representation.
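A search engine could do this by folding the two encodings together before comparing, 
as in this sketch. The code points are placeholders, not real assignments: EXISTING 
stands for the current vowel sign (U+0F71 is used only as an example) and VARIANT, a 
Private Use Area code point here, stands for the hypothetical "contraction" character 
discussed above.

```python
# Sketch: full-text search that matches both encodings of a grapheme
# cluster sharing one glyph. VARIANT is a placeholder for the proposed
# contraction character; no such character actually exists.
EXISTING = "\u0F71"   # e.g. TIBETAN VOWEL SIGN AA
VARIANT = "\uF000"    # placeholder in the Private Use Area

def fold(text: str) -> str:
    """Map the hypothetical variant onto the existing character, so
    that glyph-identical spellings compare equal."""
    return text.replace(VARIANT, EXISTING)

def search(haystack: str, needle: str) -> bool:
    """Substring search that is blind to the existing/variant split."""
    return fold(needle) in fold(haystack)
```

A query keyed with either encoding then finds text stored with the other, while tools 
that need the lexical distinction can still read it from the raw, unfolded data.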

-- Philippe.

