On Wed, Jan 09, 2002 at 01:10:45PM +0100, Pablo Saratxaga wrote:
> Even in a text-only, monofont appliance like the display of a VCR
> controller, or a GSM phone display?  Even in a road sign?  Even when
> you handwrite the text?
>
> There are cases where the concept of using different fonts for
> different portions of text (depending on language or any other
> criteria) applies, and other cases where that doesn't apply at all.

And it's useful and reasonably easy to support both in many cases.
Keeping the original language stored somewhere allows for more than
discarding it does, and it's optional.  (Allowing the language to change
mid-field is beyond the scope of the tags; whether that would be useful
for a comprehensive metadata stream I don't yet know.  I suspect
not--at least not for CJK.)

> People who say they are annoyed are in fact annoyed by the Unicode
> unification itself more than by anything else; and even if that
> unification never has any visible consequence in their lives, they
> will still be annoyed.  If they had never heard of the Unicode
> unification, they would never have noticed it.

This is why I suggested that the ID3v2 encoding problem may have been
caused more by political objections than by technical ones.

> EUC-JP and Shift-JIS can encode *only* Japanese; so, what is the
> difference between encoding Japanese in a Japanese-only encoding and
> using a Japanese-only font, and encoding Japanese text in Unicode and
> using a Japanese font?  There is absolutely no visible difference.
>
> That is why that proposal is nonsense.

You could encode multiple RFC 2047 blocks in a single line, using a
different encoding for each block.  *That* is why it's nonsense--dealing
with multiple encodings within a single block of data is absurd.
(Actually, there are a ton of reasons it's nonsense.)
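For instance (a minimal Python sketch, not anything proposed in this
thread, and the header value is made up), RFC 2047 happily mixes
charsets within one header line, and the decoder hands each block back
with its own charset:

    from email.header import decode_header

    # A made-up Subject-style value mixing two charsets in one line,
    # exactly as RFC 2047 permits:
    raw = "=?ISO-2022-JP?B?GyRCJUYlOSVIGyhC?= =?ISO-8859-1?Q?caf=E9?="

    for chunk, charset in decode_header(raw):
        # Unencoded runs come back with charset None.
        text = chunk.decode(charset) if charset else chunk
        print(charset, text)   # iso-2022-jp, then iso-8859-1

Every consumer has to carry the full set of charset converters just to
read one field; storing a single known encoding avoids that entirely.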
> The problem is that proper Unicode support needs much more than simply
> Japanese support.  You need to handle a complex multi-byte encoding,
> with multi-width characters as well (while Japanese-only encodings are
> quite simple--only two kinds of characters: ASCII (1 byte, 1 column)
> and Japanese (2 bytes, 2 columns)).  There are no non-spacing
> characters, no combining characters, no characters encoded in 3, 4, 5
> or 6 bytes...  On top of that, libraries to convert between the
> various Japanese encodings have been around for years; they are
> mature, there is lots of sample code (including real applications)
> using them, and many programmers are experienced with them.
> UTF-8 is a whole new world, and needs some time to mature to the same
> level, and to be understood and used by programmers.

This is why I believe it's extremely important for the libraries to
deal with encodings correctly.  If a programmer is working in
Shift-JIS, and the library handles that transparently (converting as
necessary), they may not care that the underlying data is UTF-8.
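Something like this (a hypothetical sketch; write_tag and read_tag are
invented names, and Python's built-in codecs stand in for the
iconv-like C library being discussed).  The caller only ever sees
Shift-JIS; the stored form is always UTF-8:

    def write_tag(value, source_encoding="shift_jis"):
        # Hypothetical tag writer: accepts bytes in the caller's
        # encoding, stores UTF-8 internally.
        return value.decode(source_encoding).encode("utf-8")

    def read_tag(stored, target_encoding="shift_jis"):
        # Hypothetical tag reader: hands back bytes in the caller's
        # encoding; the UTF-8 storage never shows through.
        return stored.decode("utf-8").encode(target_encoding)

    original = "日本語".encode("shift_jis")
    stored = write_tag(original)           # UTF-8 in the file
    assert read_tag(stored) == original    # Shift-JIS round trip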
> On the other hand, for a completely new development, it makes sense
> to use Unicode (UTF-8 or another encoding) internally, and to use a
> good iconv-like library to convert to the locale encoding, if needed.
> So, as the Ogg format is quite new, it makes sense to mandate UTF-8
> as the default and *only* encoding used for all embedded text.
> That will also have the extra advantage of avoiding all the problems
> of misinterpreting the encoding.  No mojibake.

Not quite all.  As Tomohiro said, the conversion tables between Unicode
and the other CJK encodings aren't properly defined yet.  This probably
means that text will look fine on the local system, and on systems that
use the same underlying iconv()-like implementation, but there may be
variation when the files are loaded elsewhere.  I don't know how
serious a problem this is today, but I think simply giving Unicode a
chance to fix it is the best possible solution right now.

-- 
Glenn Maynard

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/