Jeff wrote:
>
> If the Keyboard Language is ISO Latin1 and the clipboard contains
> Unicode data containing Cyrillic, then there is a lossy translation.
> If the Keyboard Language is UTF-8, then there is no loss at all.
>
But as noted previously, UTF-8 encodes graphic characters into sequences
that include C1 bytes, and this can cause serious confusion in an ISO
2022/4873/6429 setting. The problem is not so bad in the host-to-terminal
(emulator) direction, because the emulator can decode UTF-8 before
interpreting escape sequences. In the terminal-to-host direction, however,
there is (in general) no such decoding layer -- most of us who use UTF-8 are
using it without the host's knowledge or complicity. The console terminal
driver gets the bytes first and what happens next is anybody's guess.
(If the host does not support C1 controls, it probably treats them as the
C0 controls with their 8th bits on, so valid UTF-8 sequences are likely
to interrupt, suspend, or freeze our sessions... but we might not experience
this effect unless we are using certain scripts, such as Cyrillic.)
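
To make the effect described above concrete, here is a minimal Python sketch.
It is an illustration only, not code from this thread: the function names and
the choice of sample letters are assumptions. It lists which bytes of a UTF-8
string fall in the C1 range (0x80-0x9F) and which C0 control each one becomes
if a host clears the 8th bit.

```python
# Illustrative sketch: names and sample letters are hypothetical.

# A few C0 controls that terminal drivers commonly act on.
C0_NAMES = {
    0x03: "ETX (Ctrl-C, commonly interrupt)",
    0x11: "DC1 (Ctrl-Q, XON)",
    0x13: "DC3 (Ctrl-S, XOFF)",
    0x1A: "SUB (Ctrl-Z, commonly suspend)",
}


def c1_bytes(text):
    """Return the bytes of text's UTF-8 encoding that lie in 0x80-0x9F."""
    return [b for b in text.encode("utf-8") if 0x80 <= b <= 0x9F]


def stripped_controls(text):
    """Return the C0 codes produced by clearing the 8th bit of each C1 byte."""
    return [b & 0x7F for b in c1_bytes(text)]


if __name__ == "__main__":
    for ch in "\u0413\u041a\u0403":  # Cyrillic GHE, KA, GJE
        encoded = ch.encode("utf-8")
        for code in stripped_controls(ch):
            name = C0_NAMES.get(code, "other C0 control")
            print(f"U+{ord(ch):04X} = {encoded.hex(' ')} -> 0x{code:02X} {name}")
```

For example, U+0413 encodes as the bytes D0 93; with the 8th bit cleared, 0x93
becomes 0x13, which is XOFF. Plain ASCII text produces no C1 bytes at all.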

By the way, I've seen some of the recent discussions without having time to
participate, and they worry me a bit. Host/Terminal protocols are not to be
taken lightly. In the old days, they were cast in stone -- or at least ink
and bound paper. We have terminal manuals specifying these protocols going
back 30 years, and we can (and do) use them to recreate these old terminals,
not just for fun, but because a surprising number of applications are
hardwired to use them.

But now it feels like we have a distributed group of people making
off-the-cuff decisions about how UTF-8 xterm should work and then changing
their minds on a continuing basis -- as if whoever has a cool idea and
codes it first wins. I still don't have the time to get involved in this,
but let's try to remember how important it is:
. To nail down a specification and stick to it.
. Not to violate ISO 2022, 4873, or 6429, because if you do, you will
  break the state machines upon which current terminals and emulators
  are based. Work within this well-established framework. Then the
  many emulators that comply with these standards can be modified with
  relative ease to support UTF-8.
. If it turns out the specification is flawed and needs to be changed,
  start a new one and give it a different name.

The business about the duospaced font is also troubling. Obviously we need
such a font for character-cell graphics. Duospaced fonts are proven in
practice throughout the Far East -- or at least certainly in Japan, going
back 20 years. Until
now there has never been a need for the host to tell the terminal how wide
each character is. The terminal simply knows it already. I think I remember
Markus posting code here a while back that takes care of this. I'd recommend
we use this approach rather than worrying about single-shifts, stateful
shifts, and who knows what else. Especially since there is no longer any
recognized authority for coordinating new escape sequences -- whatever you
pick, somebody else will be picking the same thing at the same time for some
other purpose. That's how things work these days.
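
As a rough illustration of "the terminal simply knows", the sketch below
derives a cell width from the character itself, with no signalling from the
host. It is not the code recalled above; it is a simplified stand-in that
assumes Python's standard unicodedata tables, and the names cell_width and
string_width are hypothetical. Control characters and ambiguous-width
characters are not treated specially here.

```python
import unicodedata


def cell_width(ch):
    """Return the number of character cells (0, 1, or 2) for one character."""
    # Nonspacing and enclosing marks overlay the preceding cell.
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0
    # East Asian Wide and Fullwidth characters occupy two cells.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    # Everything else, including ambiguous-width characters, gets one cell
    # in this simplified sketch.
    return 1


def string_width(text):
    """Return the total number of cells needed to display text."""
    return sum(cell_width(ch) for ch in text)


if __name__ == "__main__":
    for sample in ("abc", "\u65e5\u672c\u8a9e", "e\u0301"):
        print(repr(sample), string_width(sample))
```

Because the width comes from a fixed table, a host and a terminal that use the
same table agree on the layout without any new escape sequence.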
- Frank
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/