On Fri, Aug 04, 2006 at 02:16:04PM +1000, George W Gerrity wrote:
> Actually, that is what I was opposing. But any solution to console
> representation has to handle three things together -- localisation,
> internationalisation, and multilingualisation -- or there will still
> be the mess where these things are dealt with inconsistently in
> separate and in multiple places in existing *NIX systems, and even in
> the POSIX standard.

If you're going to make bold claims like this you need to back them
up, especially claiming POSIX is inconsistent on the matter.

> The font encoding is incidental unless it is too simple to provide
> the rendering required for complex script systems. Moreover, the

That is exactly the topic: rendering "complex" scripts. Better Unicode
console support would still be interesting even without this (for
example, CJK would be very useful to many many people), but I'm not
interested in any solution that doesn't cover the so-called (IMO a
misnomer) complex scripts.

> [...] then you are wasting your time.

How am I "wasting my time" if the end result is something I can use,
whereas nothing I can use exists now??

> A font requires more than  
> the encoding of glyph representation if it is to be compact: there  
> must be some way to combine simple glyphs to form a more complex  
> glyph before rendering as a glyph image.

If you'd read any of this thread you'd know that I'm quite aware of
combining and the requirements for characters to have varying glyphs
under different contexts and combinations. However, the combining
process is not complex. It can all be accomplished with simple
overstrike, provided you have sufficiently powerful rules for
expressing which glyph to use for a character depending on context.
Developing such a system is the question this thread was started to
answer.
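
To make "rules" concrete, here's a rough sketch of the kind of table
I have in mind (the struct and names are invented for illustration,
not taken from any existing font format):

    /* Rough sketch of a contextual glyph-selection rule table. All
     * names are invented for illustration; this is not the format of
     * any real font. */
    #include <stddef.h>
    #include <stdint.h>

    struct glyph_rule {
        uint32_t ch;      /* character this rule applies to */
        uint32_t before;  /* required preceding char, 0 = don't care */
        uint32_t after;   /* required following char, 0 = don't care */
        uint32_t glyph;   /* glyph index to use when the rule matches */
    };

    /* Pick a glyph for ch given its neighbours; fall back to the
     * font's plain character->glyph mapping when no rule matches. */
    uint32_t select_glyph(const struct glyph_rule *rules, size_t nrules,
                          uint32_t prev, uint32_t ch, uint32_t next,
                          uint32_t default_glyph)
    {
        size_t i;
        for (i = 0; i < nrules; i++) {
            if (rules[i].ch != ch)
                continue;
            if (rules[i].before && rules[i].before != prev)
                continue;
            if (rules[i].after && rules[i].after != next)
                continue;
            return rules[i].glyph;
        }
        return default_glyph;
    }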

> Experts in font encoding  

Proof by appeal to authority generally does not impress
mathematicians. :)

> have spent years in developing their encoding methods to be both  
> efficient in time and in space,

It's been established that their methods are _not_ efficient in space.
They're only efficient in time because of their severe limitations (at
most 64k glyphs, etc.).

> while at the same time enabling the  
> encoding to handle fonts for _any_ script system:

Only with added script-specific knowledge in the rendering
implementation, which may need to be upgraded when new scripts are
added. This is not acceptable, since some software will never be
updated, whether from laziness, disappearance of the authors, etc. As
long as the only burden is on the font files, any font containing
glyphs for a script will necessarily provide full support for that
script in any application, without the application's authors having
to explicitly include support.

> >The system level has nothing to do with fonts... Until you get to  
> >fonts and rendering, m17n and i18n are extremely trivial.
> 
> It depends on how character strings are handled before they get to  
> the console application. In some *NIX systems, this is handled in the  
> kernel, mixed up with I/O handling. This was done for efficient I/O  
> handling, including efficient buffering. As I said in my first email,  
> I am no longer cognisant of how this sort of code is handled, but  
> when I was working on *NIX, I had to rewrite a lot of that code to  
> remove assumptions about what a word was, what a char was, what a  
> byte was. I know that this has been cleaned up since, but I would be  
> surprised if all the dependencies of low-lying data handling have  
> been removed.

This entire paragraph shows a complete ignorance about unix which
almost amounts to trolling. There is exactly one place where the
kernel needs to have an awareness of character encoding, and this is
at the 'cooked/icanon' tty level where the kernel handles simple
line-editing operations. Failure to be aware of multibyte encoding,
fullwidth characters, and nonspacing characters will result in
backspace and such behaving incorrectly when the terminal is in
canonical input mode.
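
To illustrate, the logic the icanon erase processing needs is roughly
this (a userspace sketch using mbrtowc/wcwidth; the kernel can't call
libc and has to open-code the equivalent, and erase_last_char is just
a name made up for this example):

    /* What a UTF-8-aware erase (backspace) in canonical mode has to
     * figure out: how many bytes the last character occupies in the
     * line buffer, and how many screen columns to back up over.
     * Assumes a UTF-8 locale has already been selected. */
    #include <wchar.h>
    #include <string.h>

    size_t erase_last_char(const char *buf, size_t len, int *cols)
    {
        size_t start = len, r;
        wchar_t wc;
        mbstate_t st;

        /* Step back over UTF-8 continuation bytes (10xxxxxx). */
        while (start > 0 && (buf[start-1] & 0xc0) == 0x80)
            start--;
        if (start > 0)
            start--;

        memset(&st, 0, sizeof st);
        r = mbrtowc(&wc, buf + start, len - start, &st);
        if (r == (size_t)-1 || r == (size_t)-2 || wcwidth(wc) < 0)
            *cols = 1;            /* invalid: treat as one cell */
        else
            *cols = wcwidth(wc);  /* 0 (nonspacing), 1, or 2 columns */

        return len - start;       /* bytes to drop from the buffer */
    }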

Otherwise, there are a few optional things like filename translation
when mounting windows UTF-16 filesystems, which the kernel can handle.
But for the most part the kernel is unaware of encoding and has no
need to care about encoding.

> The other point is that rendering _is_ required at the console level  

The kernel-internal terminal on the console video device is another
issue that can be handled by the kernel, but there's no fundamental
reason one of these is needed. It can also be implemented in
userspace, which is what I'm doing for the time being. If my work (or
someone else's better work) can eventually be integrated into Linux,
I'll be happy, but it's not essential. Terminals that run on the
framebuffer device, svgalib, or under X are all perfectly usable.

> for more complex script systems: you cannot special-case consoles to  
> fixed width and avoid rendering problems in the _majority_ of non- 
> Latin scripts.

A terminal is a character-cell device, with fixed-width character
cells. This is not open to discussion, but fear not, it's not a
problem! On a modern terminal there are three character widths: zero
(nonspacing/combining), one (most scripts including Latin, Greek,
Cyrillic, Indic, ...), and two ("full width" CJK ideographs, etc.).

To my knowledge there is still no official standard as to which
characters have which width, but POSIX specifies the function
(wcwidth) used to obtain the width of each character (and defines the
results as 'locale-specific'), and Markus Kuhn's implementation is
the de facto
standard and is based on applying very reasonable rules to the
published Unicode data (East Asian Width tables and Mn and Cf classes,
mainly).
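
For reference, this is all an application needs in order to find out
how many cells a UTF-8 string occupies (display_width is just a name
made up for this example; it assumes the process has selected a UTF-8
locale):

    /* Count the display cells a UTF-8 string occupies, using only
     * the standard mbrtowc/wcwidth calls. */
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    int display_width(const char *s)
    {
        mbstate_t st;
        wchar_t wc;
        size_t left = strlen(s), n;
        int cols = 0, w;

        memset(&st, 0, sizeof st);
        while (left) {
            n = mbrtowc(&wc, s, left, &st);
            if (n == (size_t)-1 || n == (size_t)-2)
                return -1;            /* invalid or truncated input */
            w = wcwidth(wc);
            cols += (w > 0) ? w : 0;  /* nonspacing chars add nothing */
            s += n;
            left -= n;
        }
        return cols;
    }

    int main(void)
    {
        setlocale(LC_CTYPE, "");      /* must pick up a UTF-8 locale */
        /* 'a' followed by U+0301 (combining acute): one cell. */
        printf("%d\n", display_width("a\xcc\x81"));
        return 0;
    }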

> With proper m17n, L10n and I18n, someone speaking Hindi, for instance  
> (more of them than English speakers!) should be able to boot into  
> single user with both prompts and commands using the appropriate  
> script for Hindi.

I agree completely.

> Correct rendering of Indic scripts is _not_ trivial  
> (and therefore the code is bulky).

This is a non sequitur. I've written plenty of non-trivial code in
well under 10k. Also rendering Indic scripts is nowhere near as
complicated as people make it sound. The main issue, vowel placement,
can be handled with simple glyph selection rules in the font as
follows: any character followed by the reordering vowel uses the vowel
glyph as its glyph; the vowel takes its glyph from the character
appearing before it. Notice that neither the application using the
terminal nor the terminal itself had to know anything about the
concept of reordering. Everything takes place at the glyph selection
stage.
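
As a toy illustration, the two rules for one pre-base vowel
(Devanagari vowel sign I, U+093F) could look like the following. The
names are invented and this is obviously not complete Devanagari
support; in the real thing the rules live as data in the font file:

    #include <stdint.h>

    #define VOWEL_SIGN_I 0x093Fu

    /* Stand-in for the font's plain character->glyph map. */
    static uint32_t default_glyph(uint32_t ch)
    {
        return ch;
    }

    uint32_t glyph_for(uint32_t prev, uint32_t ch, uint32_t next)
    {
        /* Rule 1: a character followed by the reordering vowel takes
         * the vowel's glyph. */
        if (next == VOWEL_SIGN_I && ch != VOWEL_SIGN_I)
            return default_glyph(VOWEL_SIGN_I);

        /* Rule 2: the reordering vowel takes the glyph of the
         * character appearing before it. */
        if (ch == VOWEL_SIGN_I && prev != 0)
            return default_glyph(prev);

        return default_glyph(ch);
    }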

> Naturally, one wouldn't want this code for rendering every script  
> system incorporated into each and every system, and therefore,  

Why not? If it's 5k what's the harm?? My claim is that I can support
all scripts in under 10k of code, provided you have an appropriate
font with the right (also small) tables. In this implementation, users
wanting to minimize system size can select a font with only the
scripts they need, rather than compiling a crippled application. The
same application/kernel binaries will then support all scripts if you
just give them a more-complete font.

> >I'm not English-only speaking yet I'm quite confident that it  
> >should be trivial and not bulky in code, and that applications  
> >should not even have to think about it.
> 
> But do your linguistic skills extend to a language using a non-latin  
> script -- or, more relevant -- a language that uses a complex script  
> system?

Yes. བོད་སྐད་དང་བོད་ཡིག་ཤེས་ཀྱི་ཡོད༌
(Hope I didn't mess that up... I typed it blind since I don't yet have
a terminal that can display it.)

> >The difference between your approach (and the approach of people  
> >who have written most of the existing applications with extensive  
> >script support) and mine is exactly the same as the difference  
> >between the early efforts at converting to Unicode (especially by  
> >MS) and UTF-8: The MS/Unicode approach was to pull the rug out from  
> >under everyone and force them to drop C, drop UNIX, drop all  
> >existing internet protocols, and store text as 16bit characters in  
> >UCS-2.
> 
> Please don't imply that I would support MS in any way. They have used  

I'm not; I was just implying that your "pull out the rug" idea is
similar to theirs, and that it's based on a wrong assumption that
unix/C/etc. are somehow broken when it comes to m17n, which they're
not.

> them hidden from independent developers. In any case, they had no  
> choice but to pull the rug out from under anyone (including their own  
> people) because there was no other way to upgrade to Unicode from the  
> crap API they had.

UTF-8 would have worked just as well on Windows as it does on UNIX.
Maybe someday they'll finally offer UTF-8 as an option for the 8bit
character encoding, though somehow I doubt it since that would be like
admitting UCS-2 was a stupid idea.

> I applaud the extension of C/ 
> C++, etc, to use and represent variable names and commands in scripts  
> other than latin-1.

Yes! Actually AFAIK Latin-1 was never legal. They went straight from
ASCII to Unicode.

> >Now, the same requires mbtowc/wcwidth, but it's not any huge  
> >burden. Surely a lot less burden than doing the text rendering  
> >yourself.
> 
> If you have to map code to representation, then you are doing  
> rendering.

You don't. A program with a terminal interface works only with
characters, never glyphs/representation. There are still some
questions about r2l/bidi stuff which I don't think have satisfactory
answers yet, but this is a different issue from glyphs/characters.

> Representing Vietnamese in a fixed-width simple  
> terminal emulator requires considerable rendering code, even though  
> most of the required accents and all of the alphabet is found in ascii.

The rendering code is trivial. The terminal emulator just accumulates
all nonspacing characters after the base character in one character
cell, and blits the appropriate glyphs on top of one another. The only
nontrivial part is when the glyphs to use depend on the context. This
is the problem I'm working on -- efficiently mapping characters
(including combining characters) to glyphs based on context.
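
Something like this is all the per-cell state and rendering code a
terminal needs (a sketch; draw_glyph() and glyph_for() are placeholder
hooks, not real APIs):

    #include <stdint.h>

    #define MAX_COMBINING 4            /* arbitrary cap for the sketch */

    struct cell {
        uint32_t base;                 /* base (spacing) character */
        uint32_t comb[MAX_COMBINING];  /* accumulated nonspacing chars */
        int ncomb;
    };

    /* Blit one glyph into a cell; overlapping calls = overstrike. */
    extern void draw_glyph(int row, int col, uint32_t glyph);
    /* Contextual character->glyph selection, as sketched earlier. */
    extern uint32_t glyph_for(uint32_t prev, uint32_t ch, uint32_t next);

    void render_cell(const struct cell *c, int row, int col)
    {
        int i;
        /* Base glyph first... */
        draw_glyph(row, col, glyph_for(0, c->base,
                                       c->ncomb ? c->comb[0] : 0));
        /* ...then every combining mark on top of it. */
        for (i = 0; i < c->ncomb; i++)
            draw_glyph(row, col, glyph_for(c->base, c->comb[i],
                       i + 1 < c->ncomb ? c->comb[i + 1] : 0));
    }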

> Not true! Rendering of non-latin fonts is much more complex than  
> that. Rendering involves a complex (multiple) code point-to-glyph  
> mapping that can be context dependent

There's never a need for a mapping of a multi-element sequence of
codepoints to one glyph. You just use overstrike with appropriate
variants. In the worst case you can emulate many-to-one with
sufficient use of contextual glyph selection to make the base
character map to a glyph containing all of the modifications while the
combining characters all map to a blank glyph, but in practice you can
almost always do something much more reasonable with fewer contextual
rules.

> It is  
> also language dependent, since, for instance, Farsi (spoken in Iran  
> and Afghanistan), uses some extra glyphs not found in Arabic-language  
> Arabic script. I don't believe that reordering is required, but then  
> I am not a user of Arabic script.

AFAIK it's not required, but I'm not a user of Arabic either.

> Both modern Greek and modern Hebrew also have a few consonants that  
> are rendered differently when they are at the end of a word.

Greek has separate codepoints for final/nonfinal sigma. I don't know
whether Greek users are used to typing these as separate characters
or whether the font should do it for them. Anyway, it's a trivial
context-based replacement.
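
Expressed as a single contextual rule it's roughly this (the "end of
word" test is simplified and the names are invented; returning a
codepoint stands in for returning a glyph index):

    #include <stdint.h>

    static int is_greek_letter(uint32_t ch)
    {
        return ch >= 0x0391 && ch <= 0x03C9;  /* rough, for illustration */
    }

    uint32_t sigma_glyph(uint32_t ch, uint32_t next)
    {
        if (ch == 0x03C3 && !is_greek_letter(next))
            return 0x03C2;  /* use the final sigma's glyph */
        return ch;          /* identity stand-in for char->glyph map */
    }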

> >>energised to tackle this can of worms, but a quick fix or re- 
> >>invention of the wheel is just not the way to go.
> >
> >Someone once said: "when the wheel is square you need to reinvent it".
> 
> Agreed.

:)

> Your basic premise is wrong: I18n, m17n, and L10n _is_ very complex,  
> even if implemented with fixed-width fonts at the console level. your  
> little hammer won't do, and the solution will be big compared to  
> handling ascii.

Yes, compared to ASCII handling it will be bigger. It's about a
thousand times bigger. 5k instead of 5 bytes. :)

> >There is a possibility here to solve a simple, almost-trivial  
> >unsolved problem.
> 
> If it were trivial, it would have been solved long ago: it is not.

UTF-8 encoding: solved a long time ago with mb[r]towc and friends.

Unicode terminals with nonspacing and doublewide characters: solved by
the wcwidth library call and implemented by urxvt, xterm, mlterm, ...

"Complex" scripts on terminal: not solved. mlterm handles some scripts
but only the ones it speficially supports. The problem is not
complexity but the lack of the appropriate data. Due to the rendering
methods mlterm uses (especially if using core X font system) it
doesn't have access to GSUB/GPOS type info. This can be solved for
most scripts just by getting more direct/intelligent access to the
fonts, but I'm taking a different approach (which works for all
scripts not just some) of letting the font tell you what glyphs to
use.

The amount that remains to be done here is really small. It's like
putting in the last few pieces of a puzzle. I don't claim getting bidi
perfect or getting all the data tables right for perfect glyph mapping
will be easy or something that I can do on my own, but I _do_ claim
that building the framework that can support all cases is trivial.

Rich


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
