On 2006-08-04, at 01:58, Rich Felker wrote:

On Thu, Aug 03, 2006 at 03:40:29PM +1000, George W Gerrity wrote:

Please. Let's not have yet another *NIX font encoding and presentation scheme! Why don't you set up a team to rationalise the existing encodings and presentation methods?

This is the sort of mentality that sickens me. "Please oh please don't make something good because there's so much crap out there that you should fix instead!" This is the sort of mentality that led to abominations like BIND and Sendmail surviving as long as they did, OpenSSH (in all its glory of vulnerabilities) being forked from the old SSH code instead of rewritten from scratch, etc.

Actually, that is what I was opposing. But any solution to console representation has to handle three things together — localisation, internationalisation, and multilingualisation — or we will keep the present mess, where these things are dealt with inconsistently, in multiple separate places, in existing *NIX systems and even in the POSIX standard.

The font encoding is incidental unless it is too simple to provide the rendering required for complex script systems. Moreover, the problem has nothing to do with font encoding, except that the decoding and rendering are done in so many different places in a *NIX system.

If you are spending your effort on a new (compact) glyph representation to use at the console to avoid bloat or proprietary software, then you are wasting your time. A font requires more than the encoding of glyph representations if it is to be compact: there must be some way to combine simple glyphs into a more complex glyph before rendering it as a glyph image. Experts in font encoding have spent years developing encoding methods that are efficient in both time and space while still able to handle fonts for _any_ script system. I doubt that you can improve on them, but go ahead and try, keeping in mind that the description of some glyphs in complex fonts is still not fully specified even in Unicode, much less in the font encodings.

And having done that, you still have to fix L10n, I18n, and m17n so that they are handled properly at the console level, and so that the routines for these features are implemented only once and do not have to be replicated by every application and/or in different interfaces.

The biggest headache in *NIX (with the exception of Mac OS X's underlying version) is the haphazard way that handling of non-ASCII characters and I18n has developed. It is especially grotty at the system level, and as you ...

The system level has nothing to do with fonts... Until you get to fonts and rendering, m17n and i18n are extremely trivial.

It depends on how character strings are handled before they get to the console application. In some *NIX systems, this is handled in the kernel, mixed up with I/O handling. This was done for efficient I/O handling, including efficient buffering. As I said in my first email, I am no longer cognisant of how this sort of code is handled, but when I was working on *NIX, I had to rewrite a lot of that code to remove assumptions about what a word, a char, or a byte was. I know that this has been cleaned up since, but I would be surprised if all the dependencies in low-level data handling have been removed.
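To give a flavour of the kind of assumption I mean, here is a minimal C sketch (mine, not drawn from any actual kernel) of the classic "one char is one character" bug: even naive buffer truncation can cut a multibyte sequence in half, so this trivial operation already has to know about the encoding:

    #include <string.h>

    /* Truncate a UTF-8 string to at most max_bytes bytes without
       splitting a multibyte sequence. Assumes valid UTF-8 input;
       a sketch, not a general multibyte solution. */
    static void truncate_utf8(char *s, size_t max_bytes)
    {
        if (strlen(s) <= max_bytes)
            return;
        /* Never cut on a continuation byte (10xxxxxx). */
        while (max_bytes > 0 &&
               ((unsigned char)s[max_bytes] & 0xC0) == 0x80)
            max_bytes--;
        s[max_bytes] = '\0';
    }

Byte-counting code that skips this check silently corrupts text, which is exactly the sort of dependency I would expect still lurks in low-level data handling.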

The other point is that rendering _is_ required at the console level for more complex script systems: you cannot special-case consoles to fixed width and avoid rendering problems in the _majority_ of non-Latin scripts.

With proper m17n, L10n and I18n, someone speaking Hindi, for instance (there are more of them than English speakers!), should be able to boot into single-user mode with both prompts and commands using the appropriate script for Hindi. Correct rendering of Indic scripts is _not_ trivial (and therefore the code is bulky). At the single-user level, many implementations of *NIX incorporate this terminal rendering code either into the kernel or as system code.

Naturally, one wouldn't want the rendering code for every script system incorporated into each and every system, and therefore modularity makes sense: you put in what you need at build time. That's done all the time with *NIX implementations. Moreover, with proper (modular) design, you can build in your m17n one script system and its supported languages at a time, releasing new code as you go, without having to rewrite what you started with every time you add support.

commented below, one of the reasons is that (English-only speaking) *NIX systems people think that handling of non-ASCII charsets should somehow be trivial and not bulky in code.

I'm not English-only speaking, yet I'm quite confident that it should be trivial and not bulky in code, and that applications should not even have to think about it.

But do your linguistic skills extend to a language using a non-Latin script — or, more relevantly, a language that uses a complex script system?

The difference between your approach (and the approach of people who have written most of the existing applications with extensive script support) and mine is exactly the same as the difference between the early efforts at converting to Unicode (especially by MS) and UTF-8: the MS/Unicode approach was to pull the rug out from under everyone and force them to drop C, drop UNIX, drop all existing internet protocols, and store text as 16-bit characters in UCS-2.

Please don't imply that I would support MS in any way. They have used every trick in the book to lock users into their products, and all of their products started as spaghetti code written in assembler. They eventually switched to C, and then to their brand of C++, but still produced monolithic code to control their APIs and keep most of them hidden from independent developers. In any case, they had no choice but to pull the rug out from under everyone (including their own people), because there was no other way to upgrade to Unicode from the crap API they had.

Modular approaches to programming have nothing whatever to do with bloat and everything to do with maintainability, understandability, orthogonality, and extensibility. Indeed, some *NIX systems (such as Mach) are modular, are _not_ bloated, and are highly efficient. A good argument can be made that Mach is more easily maintained than, say, Linux, which has a huge, non-modular kernel.

The UTF-8 approach on the other hand recognizes that most of the time when programs are dealing with text they don't care about the encoding or meaning of the text at all. At most they care about some codepoints in the first 128 positions that have special meaning to the software. Thus Unicode can be supported _without_ any special effort from developers.

Yep. As long as all code developers are required to learn English to an acceptable level: very Anglo-centric. I applaud the extension of C/C++, etc., to represent variable names and commands in scripts other than Latin-1.
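To be fair, the pass-through property Rich describes is real and easy to demonstrate. A minimal sketch (the Devanagari bytes are an illustrative example of my own): byte-oriented code such as strrchr() keeps working on UTF-8 because no byte of a multibyte sequence falls in the ASCII range:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "\xE0\xA4\xA6\xE0\xA4\xB5" is the UTF-8 encoding of two
           Devanagari letters; the '/' separators remain unambiguous
           single bytes, so no decoding is needed to split the path. */
        const char *path = "/home/\xE0\xA4\xA6\xE0\xA4\xB5/notes.txt";
        const char *slash = strrchr(path, '/');
        printf("basename: %s\n", slash + 1);
        return 0;
    }

But that property only holds until the text must be displayed, which is precisely where my objections begin.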

The obvious exception to this comes when it's time to display the text on a visual device for the user. :) Terminals, if they work correctly with the necessary scripts, provide a very clean solution to the problem because the application doesn't have to think about the presentation of the text. Historically it meant the application could just assume 1 byte == 1 character position for non-control characters.

Obviously, we are talking at cross purposes. You seem to be agreeing that text rendering for multiple scripts needs to be available to any application (including the login process and "sh" and its derivatives?). That requires delving into pretty low-level and pretty ancient code, some of which _may_ require a change to the interface, i.e., the API, and that would mean that the POSIX standard would be breached unless and until it was altered. It also means a complete rewrite of this low-level code.

I am asking for this complete rewrite, as opposed to quick fixes. If that is what you are actually doing, I support it. But it is a big job, and it would benefit from some help and consultation with like-minded individuals. I am _not_ suggesting an IBM/MS-type code team: efficiency can be — and often is — achieved by a small team of experienced and dedicated programmers working together closely.

And I repeat, writing rendering code is _not_ trivial, and it _is_ bulky, and it _is_ necessary at the console level.


Now, the same requires mbtowc/wcwidth, but it's not any huge burden. Surely a lot less burden than doing the text rendering yourself.

If you have to map code points to a representation, then you are doing rendering. Rendering in some scripts maps multiple code points to one glyph position. For instance, Vietnamese, which uses the Latin alphabet, can stack two accents on a basic Latin character (a vowel-quality mark plus a tone mark) to present one glyph fitting into one (wide or narrow) character position. Representing Vietnamese in a fixed-width simple terminal emulator requires considerable rendering code, even though most of the required accents and all of the letters are found in ASCII.
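A minimal sketch of what I mean, using only the standard wcwidth family Rich mentions (assuming a UTF-8 locale, the XSI wcswidth() function, and a system such as glibc where wchar_t holds Unicode code points):

    #define _XOPEN_SOURCE 700
    #include <stdio.h>
    #include <wchar.h>
    #include <locale.h>

    int main(void)
    {
        setlocale(LC_ALL, "");
        /* Vietnamese "e with circumflex and acute" spelled as
           e + combining circumflex + combining acute:
           three code points, one display cell. */
        wchar_t s[] = { 0x0065, 0x0302, 0x0301, 0 };
        printf("code points: 3, columns: %d\n", wcswidth(s, 3));
        return 0;
    }

On a conforming system wcswidth() returns 1 here, because the two combining marks have width 0; but someone still has to overstrike those marks onto the base glyph in that one cell, and that overstriking _is_ rendering.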

But what about applications that _do_ want/need to do the text rendering themselves? This must include at least the terminal emulator, :) and also imaging programs, presentation apps, visually-oriented web browsers, ... As long as the program does not need to do its _own_ text display it may be able to rely on a widget set, which basically gives all the same advantages as using a terminal with regard to implementation simplicity. (However now we need to add widget sets to the list of things that need to do text rendering..)

Agreed.

This whole line of questioning raises a lot more questions than it answers and I'm going to easily get sidetracked if I continue...

If you don't get sidetracked enough to deal with difficult scripts — at least at the design level — your solution will be yet another inadequate kludge: it won't be flexible enough to add support for more difficult scripts.

I am no longer up-to-date with kernel and system details in *NIX, and am not a developer — perhaps an interested bystander is where I fit in — but I used to do a lot of coding in that area, so I know how difficult it can be. My view is that what is needed is a modular (and ...

Why modular? "Modular" is the magic panacea word among people writing this bloatware, and all it does is massively increase memory requirements and complexity.

See my comments above. Modularity generally does increase code size compared to spaghetti code, but it has a number of advantages:

1) In the case discussed above, it allows you to leave out, at build time, the rendering code for script systems that do not need to be supported;

2) It allows you to develop code step by step, or in parallel with others. You can start with simple Latin scripts, then add support for Cyrillic and Greek. Then you need to tackle the simplest right-to-left script, Hebrew. Next you can deal with something like Vietnamese, which uses multiple code points from Latin-1 to render single glyphs. And so on;

3) It is more easily maintained than non-modular designs.

In addition, well-designed modular code is not particularly larger than non-modular code. Bulky code is usually feature-bloated code generated by tacking features onto a core application that was itself badly designed. MS Word is a good example: there is still an annoying rendering flaw in it that has been there since its inception, but nobody can fix it, because the original code was so badly designed and written, and the add-ons only bury it deeper.

unified) way of slotting in support for handling various alphabets and languages,

The view that supporting a new alphabet or language requires a new module is fundamentally wrong. All it should require is proper information in the font.

Not true! Rendering of non-Latin scripts is much more complex than that. Rendering involves a complex (multiple) code-point-to-glyph mapping that can be context dependent and may require reordering of the code points before mapping. The mapping is script dependent, language dependent, and font dependent. Rendering Arabic script, for instance, is highly context dependent, since the form of glyph used depends on whether the consonant is at the beginning, middle, or end of the word, and on what vowel is associated with the consonant. It is also language dependent, since, for instance, Farsi (spoken in Iran and Afghanistan) uses some extra glyphs not found in Arabic-language Arabic script. I don't believe that reordering is required, but then I am not a user of Arabic script.
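A toy C illustration of the contextual mapping, using the real presentation forms for one letter (BEH, U+0628); an actual shaper must consult the Unicode joining classes of the neighbouring letters, which this sketch deliberately ignores:

    /* One code point, four glyphs, chosen by position in the word. */
    enum pos { ISOLATED, INITIAL, MEDIAL, FINAL };

    static unsigned beh_glyph(enum pos p)
    {
        /* Arabic Presentation Forms-B for BEH (U+0628). */
        static const unsigned form[] = {
            0xFE8F, /* isolated */
            0xFE91, /* initial  */
            0xFE92, /* medial   */
            0xFE90, /* final    */
        };
        return form[p];
    }

Multiply that little table by every letter, add the joining rules, ligatures such as lam-alef, and the language-specific extras, and you have some idea of the bulk involved.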

Both modern Greek (with its final sigma) and modern Hebrew (with its five final letter forms) have consonants that are rendered differently when they fall at the end of a word.

All Indic scripts require extensive code-point reordering, since traditionally some vowel signs are written before certain consonants even though they are pronounced after the consonant: the sign is a key to the combined vowel-consonant glyph to be rendered. I could go on, but I repeat: rendering of non-Latin scripts is _not_ trivial. Moreover, like Arabic script, rendering of an Indic script is also language dependent. A Devanagari font has alternate and extra glyphs in order to cater for a number of Sanskrit-derived modern languages that use it.
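For the flavour of it, here is a deliberately over-simplified C sketch of one such reordering: the Devanagari vowel sign I (U+093F) is stored after its consonant in logical order but drawn to the consonant's left, so a renderer must swap the two before glyph lookup. A real shaper must handle whole clusters, conjuncts, and many more signs; this handles only the one matra:

    #include <stddef.h>

    /* Swap DEVANAGARI VOWEL SIGN I in front of its base consonant.
       Toy example: one matra only, no cluster analysis. */
    static void reorder_i_matra(unsigned *text, size_t n)
    {
        for (size_t i = 1; i < n; i++) {
            if (text[i] == 0x093F) {
                unsigned base = text[i - 1];
                text[i - 1] = text[i];
                text[i] = base;
            }
        }
    }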

based on Unicode categories, that can be easily set up at system build time.

So at build time you either choose "bloatware with m17n" or "legacy ascii/latin1 crap"? Sounds like the current problem we're stuck with. The bloatware distros will mostly choose the former and the ones targeting more advanced users who dislike bloat will choose the latter, perpetuating the problem that competent developers

I would use "one-eyed" rather than "competent". Competency is not limited to *NIX system coders, nor are all *NIX coders competent.

despise m17n and i18n and therefore do not include support in their programs.

Once again, you misunderstand my position: bloatware _will_ result if you tack L10n, m17n, and I18n onto the existing "legacy ascii/latin1 crap". You need to start from scratch, maybe even changing some APIs, to remove the ASCII/Latin-1 design bias built into the original UNIX. But handling and rendering non-Latin-1 scripts is complex, and hence the code will be considerably larger than what it replaces. Moreover, this needs to be done whatever font encoding is used, and font encoding is largely orthogonal to L10n, m17n, and I18n. Only the rendering engine depends on the font encoding.

Moreover, *NIX is greatly in need of a way of unifying all the various ways of formatting and representing characters at all levels, using system-level code.

Huh? What does this even mean? Are you confusing glyphs with characters? Representing characters is trivial.

My slip of the tongue. What I was trying to say is that various applications, including printer drivers, terminal drivers, text editors, etc., seem to use different APIs, software, and tables to format and render text. If the system is rewritten from scratch on the premise that all text is Unicode, with rendering done at a low level based on L10n and on embedded Unicode control characters (some of which are multiple bytes long, and some of which specify the language and script system to be rendered), then it can present a uniform API to all applications that need text rendered. Perhaps some text-processing systems will need to do their own rendering, but I imagine that in some cases they can apply a fixed-width rendering system to a variable-width font and do their own spacing. Alternatively, if rendered fonts use anti-aliasing and colours/shades, maybe your basic renderer won't be useful.
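To make the idea concrete, here is a purely hypothetical sketch of such a uniform interface (the names and types are invented for illustration and drawn from no existing system):

    #include <stddef.h>

    /* One shaping entry point shared by the console driver, printer
       driver, editor, etc. Hypothetical API, for illustration only. */
    struct cell {
        unsigned      glyph;  /* glyph index in the active font */
        unsigned char width;  /* 1 or 2 terminal columns        */
    };

    /* Shape a UTF-8 string into fixed-width cells according to the
       current locale; returns the number of cells produced. */
    size_t render_utf8(const char *text, struct cell *out,
                       size_t max_cells);

Whatever the actual signature, the point is that the tables and the reordering logic live in one place, below every application.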

This may even imply some minor tweaking of the POSIX standard.

.....

I know that a real-life problem (with a deadline?) has got you

No deadline except being tired of having a legacy system.

Great! You have time to spend on a careful, modular design based on a good understanding of the problems that arise with m17n being included at the basic console level.


energised to tackle this can of worms, but a quick fix or re-invention of the wheel is just not the way to go.

Someone once said: "when the wheel is square you need to reinvent it".

Agreed.

Someone with energy and know-how has got to get a team together and fix what is broken in the guts of *NIX so that it presents a good, clean interface for I18n and multiple character set representation.

Absolutely not. This is the bloatware doctrine, that new interfaces and libs are a panacea, that they're best designed by teams and committees, etc. What's needed is _simplicity_. When you have simplicity everything else follows.

A hammer is a simple tool. If that's all you've got, then every problem is a nail that needs to be whacked in. Unfortunately, not all mechanical problems can be rectified with just a hammer.

Cars are much more complex than the original Model T, and a lot of that complexity is bloat. But no one in their right mind would want to drive at 100 km/h with mechanical-linkage brakes. No one could afford an engine as inefficient as the V-12s in some 1940s luxury cars. No one today would design an engine without fuel injection and computer control of the fuel/air ratio, because in this case complexity yields efficiency.

Your basic premise is wrong: I18n, m17n, and L10n _are_ very complex, even if implemented with fixed-width fonts at the console level. Your little hammer won't do, and the solution will be big compared to handling ASCII.

There is a possibility here to solve a simple, almost-trivial unsolved problem.

If it were trivial, it would have been solved long ago: it is not.

What you propose is abandoning the simple problem

I repeat: the problem is not simple.

and trying to solve much more difficult problems instead, many of which will not be solved anytime in the near future due as much to personal and political reasons as to technical ones.

The only political problem has to do with font encoding: both Adobe and MS want to keep control of the huge font-rendering and font-foundry industry, Adobe because they have a near monopoly on firmware rendering systems, and MS so they can keep technology away from any competitors, including Apple. Their alliance is unstable, and conflicting interests have indeed led to complex font encodings and inadequate standardisation. Having said that, Adobe has been trying to deal with the complexities of rendering non-Latin script systems without having to start from scratch: that leads to bloat, but you do have to take economics into consideration.

Moreover, even if the more difficult problem is solved, the solution will not be useful to anyone except the ./configure --enable-bloat crowd (myself included). Why would I want to abandon a real, solvable problem in order to attempt to solve a problem that's uninteresting to me?

Because you misunderstand the problem? Because if it really were simple, it would have been simply done already?

George
------
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
