Re: Doublewidth EM DASH for unhappy English people

Markus Kuhn Wed, 11 Apr 2001 05:12:25 -0700
Bram Moolenaar wrote on 2001-04-11 11:36 UTC:
> I'm confused.  I thought that the width of a Unicode character was fixed.
> Thus when I take a Unicode character, it is either defined to be single-width
> or double-width.

I published such a definition in preliminary form on

  http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

and both xterm and the glibc 2.2 UTF-8 locales implement the same. Read
the comments in the source file to see how that function was
constructed.

ISO 10646:2000 remains completely silent on the issue.

The Unicode Consortium has only published

  http://www.unicode.org/unicode/reports/tr11/

which assigns each Unicode character to one of six East Asian width
categories: F, H, W, Na, A, N. In other words, Unicode only documents
the width semantics that each character has in legacy standards; it does
*not* prescribe the width semantics of a UTF-8 terminal emulator.

> If this is not true, I won't be able to edit Unicode with Vim reliably.  I'm
> using the current version of wcwidth().  When someone decides to make a font
> with different widths, the display will be messed up.  I suppose xterm has the
> same problem.  Running Vim in a xterm has a double problem (Vim can only guess
> which characters will end up double-width in the xterm).

For xterm at least, we have made sure that this is under the control of
xterm, *not* under the control of the font. Xterm decides which glyphs
are normal or double-width and then picks the glyphs accordingly from
one of two monospaced fonts (one normal and one double-width). This way,
the same font (pair) can be used with different wcwidth conventions,
which even allows us to define ESC sequences later to switch between
different width conventions, should that be necessary. I think this is
clearly the right and most flexible approach. It also solves the problem
that the CharCell XLFD font category that we want to use for
applications such as xterm does not allow two different widths in a
single font.
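
As an illustration of that division of labour, here is a minimal sketch
in C. It is not taken from xterm: the names font_pair, width_fn,
toy_width and font_for are hypothetical, and toy_width is a deliberately
crude stand-in for a real wcwidth convention. The point is only that the
terminal asks its width convention first and then picks the font, so
swapping the convention does not require a different font pair.

```c
#include <stddef.h>

/* Hypothetical pair of monospaced fonts; the strings are placeholders. */
struct font_pair {
    const char *normal; /* glyphs one cell wide  */
    const char *wide;   /* glyphs two cells wide */
};

/* Any width convention: returns the number of cells for a code point. */
typedef int (*width_fn)(unsigned long ucs);

/* Toy convention (assumption, for illustration only): treat the
 * CJK Unified Ideographs block U+4E00..U+9FFF as double-width. */
int toy_width(unsigned long ucs)
{
    return (ucs >= 0x4E00 && ucs <= 0x9FFF) ? 2 : 1;
}

/* The terminal consults the convention, then chooses the font.
 * Returns NULL for widths this sketch cannot draw. */
const char *font_for(const struct font_pair *fonts, width_fn width,
                     unsigned long ucs)
{
    switch (width(ucs)) {
    case 1:  return fonts->normal;
    case 2:  return fonts->wide;
    default: return NULL;
    }
}
```

Calling font_for with a different width_fn changes which font is chosen
for a character without touching the fonts themselves.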

If xterm bases its decision on the libc implementation, then at least as
long as xterm and the text-mode application using it run under the same
locale, they are guaranteed to agree on the width of every character.
The problem arises if you telnet within xterm to another machine and
your application potentially runs under a different locale there. Then
xterm and the text-mode application have incompatible wcwidth
conventions. Because of this problem, I am playing with the idea of
trying to become the almighty wcwidth dictator and tell everyone to use

  http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
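
To make the mismatch concrete, here is a small sketch in C. The two
width functions are toy stand-ins (assumptions, not code from wcwidth.c
or any libc) that differ only on U+2014 EM DASH, which TR11 lists as
class A. The application and the terminal each compute where the cursor
ends up after the same text, and they disagree by one column.

```c
#include <stddef.h>

/* Any width convention: returns the number of cells for a code point. */
typedef int (*width_fn)(unsigned long ucs);

/* Toy convention 1 (assumption): every character is one cell wide. */
int width_narrow(unsigned long ucs)
{
    (void)ucs;
    return 1;
}

/* Toy convention 2 (assumption): EM DASH takes two cells, the rest one. */
int width_cjk(unsigned long ucs)
{
    return ucs == 0x2014 ? 2 : 1;
}

/* Column of the cursor after printing n code points from column 0. */
int cursor_column(width_fn width, const unsigned long *text, size_t n)
{
    int col = 0;
    for (size_t i = 0; i < n; i++)
        col += width(text[i]);
    return col;
}
```

For the three code points 'a', U+2014, 'b', one side believes the cursor
is in column 3 and the other in column 4, so every later cursor movement
on that line is off by one.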

Since some CJK users have complained about the wcwidth.c definition
(which makes all Class A characters, i.e. those of ambiguous width in
legacy implementations, narrow), I have recently added to that file a
second convention, wcwidth_cjk, in which all Class A characters are
double-width, thus providing an EUC backwards-compatible convention.
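
The difference between the two conventions can be sketched in C as
follows. This is an illustration only, not the code in wcwidth.c: the
enum and the function cells_for_class are hypothetical names, the
treatment of F and W as two cells and of H, Na and N as one cell is
stated here as an assumption, and combining and control characters
(which a real wcwidth has to handle separately) are ignored.

```c
#include <stdbool.h>

/* The six East Asian width classes named in TR11. */
enum ea_width { EA_F, EA_H, EA_W, EA_Na, EA_A, EA_N };

/*
 * Cells occupied by a printable character of the given class.
 *   cjk == false: class A is narrow (the first convention)
 *   cjk == true:  class A is wide   (the wcwidth_cjk convention)
 * Assumption: F and W are two cells; H, Na and N are one cell.
 */
int cells_for_class(enum ea_width cls, bool cjk)
{
    switch (cls) {
    case EA_F:
    case EA_W:
        return 2;
    case EA_A:
        return cjk ? 2 : 1;
    default:
        return 1;
    }
}
```

Only class A changes between the two calls; every other class gets the
same answer under both conventions.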

We agree that a single wcwidth convention can't make everyone happy.
Perhaps two are sufficient?

> Perhaps using wcwidth() is wrong and it should be deleted?

We have no other choice in text-mode applications.

> Should the width of a character be obtained from the font information?

Only in situations where you also want to support proportional fonts.
The classical tty model does not provide a communication mechanism for
that sort of information. The goal of the exercise here is to keep the
classical tty model alive. I don't think we want to add ESC sequences to
query the width semantics of the terminal.

> Either that or the results of wcwidth() should be set in stone.

That's what I've tried to do in the above wcwidth and wcwidth_cjk,
though at the moment it is not yet a formally recognised standard. It is
more of a simple and stable discussion proposal, and something to base
the first generations of implementations on. Once I have a bit more time, I might
start editing an RFC on a revised text terminal model (the project to
fix the many problems of ISO 6429, including its ambiguities, useless
features and incomprehensible writing style), which will also come with
a single or at least a very small number of wcwidth conventions. (I
won't have time to do that in the next 3-4 months or so, but once I do,
I hope 2-3 experienced terminal emulation gurus will want to join a team
for doing this properly.)

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/