More to Francois:


>> When I create and exchange an HTML file for instance:
>> <HTML>
>> <TITLE>bla</TITLE>
>> </HTML>
>>
>> only 'bla' is plain text. To conform to Unicode, does it mean I have to use
>> the Unicode character set and encoding ONLY for 'bla'? (which would indicate
>> mixing of character encoding in one single file)

The HTML spec uses textual markup (as opposed to some binary file format),
so what constitutes plain text depends on how you're interpreting an HTML
file. To an HTML parser, in a sense, it's all plain text; that is, the
parser has to interpret the plain text data to identify tokens like <HTML>.
Once that parsing has occurred, then at a different level the file has been
analysed into content portions and markup portions, and at this level only
the content portions are seen as plain text.
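
To make the two levels concrete, here is a minimal sketch in Python using the
standard library's html.parser module. The class name Splitter and the choice
of UTF-8 for the bytes are my own assumptions for illustration, not anything
the HTML spec prescribes.

```python
from html.parser import HTMLParser


class Splitter(HTMLParser):
    """Hypothetical helper: sorts a parsed file into markup and content."""

    def __init__(self):
        super().__init__()
        self.markup = []   # tag names seen by the parser (lowercased by HTMLParser)
        self.content = []  # non-blank text found between tags

    def handle_starttag(self, tag, attrs):
        self.markup.append(tag)

    def handle_endtag(self, tag):
        self.markup.append("/" + tag)

    def handle_data(self, data):
        if data.strip():
            self.content.append(data.strip())


# Level 1: the whole file is plain text in one encoding (UTF-8 assumed here).
raw = "<HTML>\n<TITLE>bla</TITLE>\n</HTML>\n".encode("utf-8")
text = raw.decode("utf-8")

# Level 2: after parsing, only the content portion is left as plain text.
splitter = Splitter()
splitter.feed(text)
splitter.close()
```

The decode step runs over every byte of the file, tags included; only
afterwards does 'bla' get singled out as content.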

While it might be possible to create an HTML-like specification in which
the markup and the content could potentially be in different encodings (with
some constraints: you need to avoid byte sequences in content that can be
wrongly interpreted as markup), this is not the case for HTML or for XML:
the entire file, markup and content, must be in the same encoding.
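
A small sketch of the constraint mentioned above, in Python. The mixed-encoding
file is purely hypothetical (ASCII markup around UTF-16BE content), and I picked
the character U+3C00 only because its UTF-16BE bytes happen to begin with 0x3C,
the ASCII code for '<'.

```python
# Hypothetical mixed-encoding file: ASCII markup, UTF-16BE content.
content = "\u3c00"  # a CJK character whose code point is U+3C00

as_utf16be = content.encode("utf-16-be")  # bytes 0x3C 0x00
as_utf8 = content.encode("utf-8")         # bytes 0xE3 0xB0 0x80

mixed = b"<TITLE>" + as_utf16be + b"</TITLE>"

# A byte-level scanner looking for '<' now finds three of them, not two.
lt_count = mixed.count(b"<")

# With one encoding for the whole file (UTF-8 here), the collision cannot occur.
uniform = b"<TITLE>" + as_utf8 + b"</TITLE>"
uniform_lt_count = uniform.count(b"<")
```

UTF-8 never reuses ASCII byte values inside a multi-byte sequence, which is one
reason a single encoding for the whole file keeps markup detection simple.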



>> The second problem I can see with Unicode is the fact that although the
>> character set is universal, the encoding forms are multiple (UTF-8, UTF-16
>> and UTF-32).

How is that a problem? It is the kind of flexibility that makes Unicode
very practical for implementers. It may be necessary to translate from one
encoding form to another on occasion, but that conversion is simple and
lossless.
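
As a rough illustration of how simple the translation is, here is a sketch
using Python's built-in codecs. The sample string is my own choice; it mixes
ASCII, a Latin-1 letter, a CJK character and a supplementary-plane character
so that each encoding form uses a different number of bytes.

```python
# Sample text (an assumption for illustration): 9 code points in total.
text = "bla \u00e9 \u4e2d \U0001F600"

# Convert through all three encoding forms in turn.
utf8 = text.encode("utf-8")
utf16 = utf8.decode("utf-8").encode("utf-16-le")
utf32 = utf16.decode("utf-16-le").encode("utf-32-le")

# Decoding the last form gives back exactly the original characters.
round_trip = utf32.decode("utf-32-le")

sizes = (len(utf8), len(utf16), len(utf32))  # bytes used by each form
```

The byte counts differ between the forms, but the sequence of characters is
identical after every step.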




- Peter


---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>

