Re: Unicode, character ambiguities

Glenn Maynard Tue, 08 Jan 2002 23:43:49 -0800

On Wed, Jan 09, 2002 at 08:12:55AM +0100, Pablo Saratxaga wrote:
> Kaixo!


Gesundheit!

> > A couple people on a Vorbis list are suggesting allowing RFC2047
> > encoding in Ogg tags, to let people use encodings other than UTF-8, as a
> > "fix" for these problems.
> 
> A *VERY BAD* idea.

I agree; I'm trying to convince them of this.

(Mind you, this is only in a proposal.  I suspect the Xiph-Powers-That-Be
would overrule such an idea, anyway.)

> The only result will be that most implementations will ignore non unicode
> encodings.

One argument I'm being given is that "if we try to force UTF-8 on users,
they'll ignore us and use other encodings, so at least define how it
should be done."  This did happen with ID3 and ID3V2; I've hit tags
myself that are in SJIS, even though ID3V2 is UTF-8.

I think part of this is due to the poor state of ID3V2 libraries.  (The
last time I used ID3LIB, it was unstable; I'm not going to use a
library that might trash my data, and I didn't trust this one.)  I
believe a lot of this would go away if the library properly converted
tags to the encoding of the user's locale (Unix) or codepage (Win32), as
long as the library was *used*.  I expect Xiph's libraries to be much
more stable than ID3LIB.
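
To make "proper conversion" concrete, here is a minimal sketch of what
such a library routine might do.  It assumes tags are stored as UTF-8;
the function name tag_for_display is hypothetical and not part of any
Xiph or ID3 library.

```python
import locale


def tag_for_display(raw_tag: bytes, target_encoding: str = "") -> bytes:
    """Decode a UTF-8 tag value and re-encode it for the user's environment.

    Hypothetical helper for illustration; not taken from any real tag library.
    """
    # Fall back to the locale's preferred encoding when none is given.
    encoding = target_encoding or locale.getpreferredencoding(False)
    # The stored form is assumed to be UTF-8.
    text = raw_tag.decode("utf-8")
    # Characters the target encoding cannot represent become "?"
    # instead of raising an error.
    return text.encode(encoding, errors="replace")
```

With this shape, the file always holds UTF-8 and only the display path
depends on the user's locale or codepage, so nobody needs to write SJIS
into the tag itself.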

> > I think RFC2047 is a fairly horrible
> > solution.  An alternative is simply to store the language of the text;
> > is that sufficient, or are there deeper problems?
> 
> Why even bother with that?
> When you write a web page, for example, do you put language tags everywhere
> and around any text?
> And is there any browser that actually uses language tags for rendering?

Around Japanese text, yes, I do.  Internet Explorer honors them, displaying
<p lang="ja"> in a Japanese font and <p lang="zh"> in a Chinese font.
Fonts are selectable per-language.  (For larger amounts of text, I
believe you can put the lang attribute on the <BODY> tag instead.)

Try http://zewt.org/~glenn/test.html on a system with both Japanese and
Chinese fonts installed.
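
For anyone who wants to build a similar page locally, the following is a
rough sketch that generates one.  It is not the content of the page
linked above; the function name lang_test_page and the sample character
are assumptions chosen for illustration (直 is a commonly cited case
where Japanese and Chinese fonts tend to draw the glyph differently).

```python
from html import escape


def lang_test_page(sample: str, langs=("ja", "zh")) -> str:
    """Build a small HTML page showing `sample` once per language tag."""
    # One paragraph per language, each carrying its own lang attribute.
    rows = "\n".join(
        f'<p lang="{escape(lang, quote=True)}">{escape(sample)} ({escape(lang)})</p>'
        for lang in langs
    )
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        '<head><meta charset="utf-8"><title>lang test</title></head>\n'
        f"<body>\n{rows}\n</body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    # Write the page out so it can be opened in a browser.
    with open("lang_test.html", "w", encoding="utf-8") as handle:
        handle.write(lang_test_page("直"))
```

Whether the two paragraphs actually look different depends on the
browser and on which fonts are installed.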

> The main purpose of a sound format is to contain sound. The text portions
> are just informative; they are not intended to have all the flexibility
> of a word processor and produce LaTeX-quality printouts; they are intended
> to be just plaintext.
> Simply, utf-8 is the ascii of the new millennium.

And it should be possible to display the text in a font native to the
language it was originally written in, as long as it's not difficult to
do so.  (It's not; there's a UTF8_LANG tag in the proposal for just this
purpose.  It's limited in that you can't tag in more than one language
where the font matters, but that apparently doesn't matter for most
cases, and I suspect the "variation selectors" might allow that anyway,
though there aren't many details available yet.)
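
As a sketch of how a player might use such a tag, the code below parses
"NAME=value" comment entries and picks a font from UTF8_LANG.  The
details are assumptions: the proposal's exact value format is not
described here, so the code assumes a language code such as "ja", and
the font names are placeholders.  Real Vorbis comments may repeat a
field name; this sketch keeps only the last value for simplicity.

```python
# Hypothetical font table; the family names are placeholders.
FONT_BY_LANG = {"ja": "Japanese font", "zh": "Chinese font"}
DEFAULT_FONT = "default font"


def parse_comments(raw_comments):
    """Turn UTF-8 encoded b"NAME=value" entries into a dict.

    Field names are upper-cased because comment field names are
    case-insensitive. Repeated names overwrite earlier ones here.
    """
    tags = {}
    for raw in raw_comments:
        name, _, value = raw.decode("utf-8").partition("=")
        tags[name.upper()] = value
    return tags


def font_for(tags):
    """Choose a font from the (assumed) UTF8_LANG language code."""
    lang = tags.get("UTF8_LANG", "").lower()
    # Match on the primary subtag so a value like "ja-JP" still selects "ja".
    return FONT_BY_LANG.get(lang.split("-")[0], DEFAULT_FONT)
```

A file with no UTF8_LANG entry simply falls through to the default
font, so untagged files lose nothing.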

> People choose the fonts in function of their needs.
> Those "problems" are *not* encoding problems; only font problems.
> 
> Saying that unicode is not good because the above reasons is nearly as stupid
> as saying unicode is not good for latin based languages as it doesn't
> disambiguate between serif and sans serif styles.

Every reader of English can read any Roman characters in any reasonable
font.  The same is not true of CJK variants, so the comparison doesn't
really work.

-- 
Glenn Maynard
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
