On 2006-08-04, at 01:58, Rich Felker wrote:

On Thu, Aug 03, 2006 at 03:40:29PM +1000, George W Gerrity wrote:

Please. Let's not have yet another *NIX font encoding and presentation scheme! Why don't you set up a team to rationalise the existing encodings and presentation methods?

This is the sort of mentality that sickens me. "Please oh please don't make something good because there's so much crap out there that you should fix instead!" This is the sort of mentality that led to abominations like BIND and Sendmail surviving as long as they did, OpenSSH (in all its glory of vulnerabilities) being forked from the old SSH code instead of rewritten from scratch, etc.

Actually, that is what I was opposing. But any solution to console representation has to handle three things together — localisation, internationalisation, and multilingualisation — or we will keep the present mess, where these things are dealt with inconsistently, in multiple separate places, in existing *NIX systems and even in the POSIX standard.

The font encoding is incidental unless it is too simple to provide the rendering required for complex script systems. Moreover, the problem has nothing to do with font encoding, except that the decoding and rendering are done in so many different places in a *NIX system.

If you are spending your effort on a new (compact) glyph representation to use at the console to avoid bloat or proprietary software, then you are wasting your time. A font requires more than the encoding of glyph representations if it is to be compact: there must be some way to combine simple glyphs into a more complex glyph before rendering it as a glyph image. Experts in font encoding have spent years developing encoding methods that are efficient in both time and space while still able to handle fonts for _any_ script system. I doubt that you can improve on them, but go ahead and try, keeping in mind that the description of some glyphs in complex fonts is still not fully specified even in Unicode, much less in the font encodings.

And having done that, you still have to fix L10n, I18n, and m17n so that they are handled properly at the console level, and so that the routines for these features are implemented only once and do not have to be replicated by every application and/or in different interfaces.

The biggest headache in *NIX (with the exception of Mac OS X's underlying version) is the haphazard way that handling of non-ASCII characters and I18n has developed. It is especially grotty at the system level, and as you ...

The system level has nothing to do with fonts... Until you get to fonts and rendering, m17n and i18n are extremely trivial.

It depends on how character strings are handled before they get to the console application. In some *NIX systems, this is handled in the kernel, mixed up with I/O handling. This was done for efficient I/O handling, including efficient buffering. As I said in my first email, I am no longer cognisant of how this sort of code is handled, but when I was working on *NIX, I had to rewrite a lot of that code to remove assumptions about what a word, a char, or a byte was. I know that this has been cleaned up since, but I would be surprised if all the dependencies in low-level data handling have been removed.
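To give a flavour of the kind of assumption I mean, here is a minimal C sketch (mine, not drawn from any actual kernel) of the classic "one char is one character" bug: even naive buffer truncation can cut a multibyte sequence in half, so this trivial operation already has to know about the encoding:

    #include <string.h>

    /* Truncate a UTF-8 string to at most max_bytes bytes without
       splitting a multibyte sequence. Assumes valid UTF-8 input;
       a sketch, not a general multibyte solution. */
    static void truncate_utf8(char *s, size_t max_bytes)
    {
        if (strlen(s) <= max_bytes)
            return;
        /* Never cut on a continuation byte (10xxxxxx). */
        while (max_bytes > 0 &&
               ((unsigned char)s[max_bytes] & 0xC0) == 0x80)
            max_bytes--;
        s[max_bytes] = '\0';
    }

Byte-counting code that skips this check silently corrupts text, which is exactly the sort of dependency I would expect still lurks in low-level data handling.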

The other point is that rendering _is_ required at the console level for more complex script systems: you cannot special-case consoles to fixed width and avoid rendering problems in the _majority_ of non-Latin scripts.

With proper m17n, L10n and I18n, someone speaking Hindi, for instance (there are more of them than English speakers!), should be able to boot into single-user mode with both prompts and commands using the appropriate script for Hindi. Correct rendering of Indic scripts is _not_ trivial (and therefore the code is bulky). At the single-user level, many implementations of *NIX incorporate this terminal rendering code either into the kernel or as system code.

Naturally, one wouldn't want the rendering code for every script system incorporated into each and every system, and therefore modularity makes sense: you put in what you need at build time. That's done all the time with *NIX implementations. Moreover, with proper (modular) design, you can build in your m17n one script system and its supported languages at a time, releasing new code as you go, without having to rewrite what you started with every time you add support.

commented below, one of the reasons is that (English-only speaking) *NIX systems people think that handling of non-ASCII charsets should somehow be trivial and not bulky in code.

I'm not English-only speaking, yet I'm quite confident that it should be trivial and not bulky in code, and that applications should not even have to think about it.

But do your linguistic skills extend to a language using a non-Latin script — or, more relevantly, a language that uses a complex script system?

The difference between your approach (and the approach of people who have written most of the existing applications with extensive script support) and mine is exactly the same as the difference between the early efforts at converting to Unicode (especially by MS) and UTF-8: the MS/Unicode approach was to pull the rug out from under everyone and force them to drop C, drop UNIX, drop all existing internet protocols, and store text as 16-bit characters in UCS-2.

Please don't imply that I would support MS in any way. They have used every trick in the book to lock users into their products, and all of their products started as spaghetti code written in assembler. They eventually switched to C, and then to their brand of C++, but still produced monolithic code to control their APIs and keep most of them hidden from independent developers. In any case, they had no choice but to pull the rug out from under everyone (including their own people), because there was no other way to upgrade to Unicode from the crap API they had.

Modular approaches to programming have nothing whatever to do with bloat and everything to do with maintainability, understandability, orthogonality, and extensibility. Indeed, some *NIX systems (such as Mach) are modular, are _not_ bloated, and are highly efficient. A good argument can be made that Mach is more easily maintained than, say, Linux, which has a huge, non-modular kernel.

The UTF-8 approach on the other hand recognizes that most of the time when programs are dealing with text they don't care about the encoding or meaning of the text at all. At most they care about some codepoints in the first 128 positions that have special meaning to the software. Thus Unicode can be supported _without_ any special effort from developers.

Yep. As long as all code developers are required to learn English to an acceptable level: very Anglo-centric. I applaud the extension of C/C++, etc., to represent variable names and commands in scripts other than Latin-1.
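To be fair, the pass-through property Rich describes is real and easy to demonstrate. A minimal sketch (the Devanagari bytes are an illustrative example of my own): byte-oriented code such as strrchr() keeps working on UTF-8 because no byte of a multibyte sequence falls in the ASCII range:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "\xE0\xA4\xA6\xE0\xA4\xB5" is the UTF-8 encoding of two
           Devanagari letters; the '/' separators remain unambiguous
           single bytes, so no decoding is needed to split the path. */
        const char *path = "/home/\xE0\xA4\xA6\xE0\xA4\xB5/notes.txt";
        const char *slash = strrchr(path, '/');
        printf("basename: %s\n", slash + 1);
        return 0;
    }

But that property only holds until the text must be displayed, which is precisely where my objections begin.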

The obvious exception to this comes when it's time to display the text on a visual device for the user. :) Terminals, if they work correctly with the necessary scripts, provide a very clean solution to the problem because the application doesn't have to think about the presentation of the text. Historically it meant the application could just assume 1 byte == 1 character position for non-control characters.

Obviously, we are talking at cross purposes. You seem to be agreeing that text rendering for multiple scripts needs to be available to any application (including the login process and "sh" and its derivatives?). That requires delving into pretty low-level and pretty ancient code, some of which _may_ require a change to the interface, i.e., the API, and that would mean that the POSIX standard would be breached unless and until it was altered. It also means a complete rewrite of this low-level code.

I am asking for this complete rewrite, as opposed to quick fixes. If that is what you are actually doing, I support it. But it is a big job, and it would benefit from some help and consultation with like-minded individuals. I am _not_ suggesting an IBM/MS-type code team: efficiency can be — and often is — achieved by a small team of experienced and dedicated programmers working together closely.

And I repeat, writing rendering code is _not_ trivial, and it _is_ bulky, and it _is_ necessary at the console level.


Now, the same requires mbtowc/wcwidth, but it's not any huge burden. Surely a lot less burden than doing the text rendering yourself.

If you have to map code points to a representation, then you are doing rendering. Rendering in some scripts maps multiple code points to one glyph position. For instance, Vietnamese, which uses the Latin alphabet, can stack two accents on a basic Latin character (a vowel-quality mark plus a tone mark) to present one glyph fitting into one (wide or narrow) character position. Representing Vietnamese in a fixed-width simple terminal emulator requires considerable rendering code, even though most of the required accents and all of the letters are found in ASCII.
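A minimal sketch of what I mean, using only the standard wcwidth family Rich mentions (assuming a UTF-8 locale, the XSI wcswidth() function, and a system such as glibc where wchar_t holds Unicode code points):

    #define _XOPEN_SOURCE 700
    #include <stdio.h>
    #include <wchar.h>
    #include <locale.h>

    int main(void)
    {
        setlocale(LC_ALL, "");
        /* Vietnamese "e with circumflex and acute" spelled as
           e + combining circumflex + combining acute:
           three code points, one display cell. */
        wchar_t s[] = { 0x0065, 0x0302, 0x0301, 0 };
        printf("code points: 3, columns: %d\n", wcswidth(s, 3));
        return 0;
    }

On a conforming system wcswidth() returns 1 here, because the two combining marks have width 0; but someone still has to overstrike those marks onto the base glyph in that one cell, and that overstriking _is_ rendering.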

But what about applications that _do_ want/need to do the text rendering themselves? This must include at least the terminal emulator, :) and also imaging programs, presentation apps, visually-oriented web browsers, ... As long as the program does not need to do its _own_ text display it may be able to rely on a widget set, which basically gives all the same advantages as using a terminal with regard to implementation simplicity. (However now we need to add widget sets to the list of things that need to do text rendering..)

Agreed.

This whole line of questioning raises a lot more questions than it answers and I'm going to easily get sidetracked if I continue...

If you don't get sidetracked enough to deal with difficult scripts — at least at the design level — your solution will be yet another inadequate kludge: it won't be flexible enough to add support for more difficult scripts.

I am no longer up-to-date with kernel and system details in *NIX, and am not a developer — perhaps an interested bystander is where I fit in — but I used to do a lot of coding in that area, so I know how difficult it can be. My view is that what is needed is a modular (and ...

Why modular? "Modular" is the magic panacea word among people writing this bloatware, and all it does is massively increase memory requirements and complexity.

See my comments above. Modularity generally does increase code size compared to spaghetti code, but it has a number of advantages:

1) In the case discussed above, it allows you to leave out, at build time, the rendering code for script systems that do not need to be supported;

2) It allows you to develop code step by step, or in parallel with others. You can start with simple Latin scripts, then add support for Cyrillic and Greek. Then you need to tackle the simplest right-to-left script, Hebrew. Next you can deal with something like Vietnamese, which uses multiple code points from Latin-1 to render single glyphs. And so on;

3) It is more easily maintained than non-modular designs.

In addition, well-designed modular code is not particularly larger than non-modular code. Bulky code is usually feature-bloated code generated by tacking features onto a core application that was itself badly designed. MS Word is a good example: there is still an annoying rendering flaw in it that has been there since its inception, but nobody can fix it, because the original code was so badly designed and written, and the add-ons only bury it deeper.

unified) way of slotting in support for handling various alphabets and languages,

The view that supporting a new alphabet or language requires a new module is fundamentally wrong. All it should require is proper information in the font.

Not true! Rendering of non-Latin scripts is much more complex than that. Rendering involves a complex (multiple) code-point-to-glyph mapping that can be context dependent and may require reordering of the code points before mapping. The mapping is script dependent, language dependent, and font dependent. Rendering Arabic script, for instance, is highly context dependent, since the form of glyph used depends on whether the consonant is at the beginning, middle, or end of the word, and on what vowel is associated with the consonant. It is also language dependent, since, for instance, Farsi (spoken in Iran and Afghanistan) uses some extra glyphs not found in Arabic-language Arabic script. I don't believe that reordering is required, but then I am not a user of Arabic script.
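A toy C illustration of the contextual mapping, using the real presentation forms for one letter (BEH, U+0628); an actual shaper must consult the Unicode joining classes of the neighbouring letters, which this sketch deliberately ignores:

    /* One code point, four glyphs, chosen by position in the word. */
    enum pos { ISOLATED, INITIAL, MEDIAL, FINAL };

    static unsigned beh_glyph(enum pos p)
    {
        /* Arabic Presentation Forms-B for BEH (U+0628). */
        static const unsigned form[] = {
            0xFE8F, /* isolated */
            0xFE91, /* initial  */
            0xFE92, /* medial   */
            0xFE90, /* final    */
        };
        return form[p];
    }

Multiply that little table by every letter, add the joining rules, ligatures such as lam-alef, and the language-specific extras, and you have some idea of the bulk involved.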

Both modern Greek (with its final sigma) and modern Hebrew (with its five final letter forms) have consonants that are rendered differently when they fall at the end of a word.

All Indic scripts require extensive code-point reordering, since traditionally some vowel signs are written before certain consonants even though they are pronounced after the consonant: the sign is a key to the combined vowel-consonant glyph to be rendered. I could go on, but I repeat: rendering of non-Latin scripts is _not_ trivial. Moreover, like Arabic script, rendering of an Indic script is also language dependent. A Devanagari font has alternate and extra glyphs in order to cater for a number of Sanskrit-derived modern languages that use it.
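For the flavour of it, here is a deliberately over-simplified C sketch of one such reordering: the Devanagari vowel sign I (U+093F) is stored after its consonant in logical order but drawn to the consonant's left, so a renderer must swap the two before glyph lookup. A real shaper must handle whole clusters, conjuncts, and many more signs; this handles only the one matra:

    #include <stddef.h>

    /* Swap DEVANAGARI VOWEL SIGN I in front of its base consonant.
       Toy example: one matra only, no cluster analysis. */
    static void reorder_i_matra(unsigned *text, size_t n)
    {
        for (size_t i = 1; i < n; i++) {
            if (text[i] == 0x093F) {
                unsigned base = text[i - 1];
                text[i - 1] = text[i];
                text[i] = base;
            }
        }
    }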

based on Unicode categories, that can be easily set up at system build time.

So at build time you either choose "bloatware with m17n" or "legacy ascii/latin1 crap"? Sounds like the current problem we're stuck with. The bloatware distros will mostly choose the former and the ones targeting more advanced users who dislike bloat will choose the latter, perpetuating the problem that competent developers

I would use "one-eyed" rather than "competent". Competency is not limited to *NIX system coders, nor are all *NIX coders competent.

despise m17n and i18n and therefore do not include support in their programs.

Once again, you misunderstand my position: bloatware _will_ result if you tack L10n, m17n, and I18n onto the existing "legacy ascii/latin1 crap". You need to start from scratch, maybe even changing some APIs, to remove the ASCII/Latin-1 design bias built into the original UNIX. But handling and rendering non-Latin-1 scripts is complex, and hence the code will be considerably larger than what it replaces. Moreover, this needs to be done whatever font encoding is used, and font encoding is largely orthogonal to L10n, m17n, and I18n. Only the rendering engine depends on the font encoding.

Moreover, *NIX is greatly in need of a way of unifying all the various ways of formatting and representing characters at all levels, using system-level code.

Huh? What does this even mean? Are you confusing glyphs with characters? Representing characters is trivial.

My slip of the tongue. What I was trying to say is that various applications, including printer drivers, terminal drivers, text editors, etc., seem to use different APIs, software, and tables to format and render text. If the system is rewritten from scratch on the premise that all text is Unicode, with rendering done at a low level based on L10n and on embedded Unicode control characters (some of which are multiple bytes long, and some of which specify the language and script system to be rendered), then it can present a uniform API to all applications that need text rendered. Perhaps some text-processing systems will need to do their own rendering, but I imagine that in some cases they can apply a fixed-width rendering system to a variable-width font and do their own spacing. Alternatively, if rendered fonts use anti-aliasing and colours/shades, maybe your basic renderer won't be useful.
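To make the idea concrete, here is a purely hypothetical sketch of such a uniform interface (the names and types are invented for illustration and drawn from no existing system):

    #include <stddef.h>

    /* One shaping entry point shared by the console driver, printer
       driver, editor, etc. Hypothetical API, for illustration only. */
    struct cell {
        unsigned      glyph;  /* glyph index in the active font */
        unsigned char width;  /* 1 or 2 terminal columns        */
    };

    /* Shape a UTF-8 string into fixed-width cells according to the
       current locale; returns the number of cells produced. */
    size_t render_utf8(const char *text, struct cell *out,
                       size_t max_cells);

Whatever the actual signature, the point is that the tables and the reordering logic live in one place, below every application.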

This may even imply some minor tweaking of the POSIX standard.

.....

I know that a real-life problem (with a deadline?) has got you

No deadline except being tired of having a legacy system.

Great! You have time to spend on a careful, modular design based on a good understanding of the problems that arise with m17n being included at the basic console level.


energised to tackle this can of worms, but a quick fix or re-invention of the wheel is just not the way to go.

Someone once said: "when the wheel is square you need to reinvent it".

Agreed.

Someone with energy and know-how has got to get a team together and fix what is broken in the guts of *NIX so that it presents a good, clean interface for I18n and multiple character set representation.

Absolutely not. This is the bloatware doctrine, that new interfaces and libs are a panacea, that they're best designed by teams and committees, etc. What's needed is _simplicity_. When you have simplicity everything else follows.

A hammer is a simple tool. If that's all you've got, then every problem is a nail that needs to be whacked in. Unfortunately, not all mechanical problems can be rectified with just a hammer.

Cars are much more complex than the original Model T, and a lot of that complexity is bloat. But no one in their right mind would want to drive at 100 km/h with mechanical-linkage brakes. No one could afford an engine as inefficient as the V-12s in some 1940s luxury cars. No one today would design an engine without fuel injection and computer control of the fuel/air ratio, because in this case complexity yields efficiency.

Your basic premise is wrong: I18n, m17n, and L10n _are_ very complex, even if implemented with fixed-width fonts at the console level. Your little hammer won't do, and the solution will be big compared to handling ASCII.

There is a possibility here to solve a simple, almost-trivial unsolved problem.

If it were trivial, it would have been solved long ago: it is not.

What you propose is abandoning the simple problem

I repeat: the problem is not simple.

and trying to solve much more difficult problems instead, many of which will not be solved anytime in the near future due as much to personal and political reasons as to technical ones.

The only political problem has to do with font encoding: both Adobe and MS want to keep control of the huge font-rendering and font-foundry industry, Adobe because they have a near monopoly on firmware rendering systems, and MS so they can keep technology away from any competitors, including Apple. Their alliance is unstable, and conflicting interests have indeed led to complex font encodings and inadequate standardisation. Having said that, Adobe has been trying to deal with the complexities of rendering non-Latin script systems without having to start from scratch: that leads to bloat, but you do have to take economics into consideration.

Moreover, even if the more difficult problem is solved, the solution will not be useful to anyone except the ./configure --enable-bloat crowd (myself included). Why would I want to abandon a real, solvable problem in order to attempt to solve a problem that's uninteresting to me?

Because you misunderstand the problem? Because if it really were simple, it would have been simply done already?

George
------
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
