On 2006-08-04, at 01:58, Rich Felker wrote:
> On Thu, Aug 03, 2006 at 03:40:29PM +1000, George W Gerrity wrote:
>> Please. Let's not have yet another *NIX font encoding and
>> presentation scheme! Why don't you set up a team to rationalise
>> the existing encodings and presentation methods?
> This is the sort of mentality that sickens me. "Please oh please
> don't make something good because there's so much crap out there
> that you should fix instead!" This is the sort of mentality that
> led to abominations like BIND and Sendmail surviving as long as
> they did, OpenSSH (in all its glory of vulnerabilities) being
> forked from the old SSH code instead of rewritten from scratch, etc.
Actually, that is what I was opposing. But any solution to console
representation has to handle three things together — localisation,
internationalisation, and multilingualisation — or there will still
be the mess where these things are dealt with inconsistently, in
many separate places, in existing *NIX systems, and even in
the POSIX standard.
The font encoding is incidental unless it is too simple to provide
the rendering required for complex script systems. Moreover, the
problem has nothing to do with font encoding, except that the
decoding and rendering are done in so many different places in a *NIX
system.
If you are spending your effort on a new (compact) glyph
representation to use at the console to avoid bloat or proprietary
software, then you are wasting your time. A font requires more than
the encoding of glyph representation if it is to be compact: there
must be some way to combine simple glyphs to form a more complex
glyph before rendering as a glyph image. Experts in font encoding
have spent years developing their encoding methods to be both
efficient in time and in space, while at the same time enabling the
encoding to handle fonts for _any_ script system: I doubt that you
can improve on them, but go ahead and try, keeping in mind that
support for describing some glyphs in complex fonts is still not
fully specified even in Unicode, let alone in the font encodings.
And having done that, you still have to fix L10n, I18n, and m17n so
that it is handled properly at the console level and so that the
routines for these features are implemented once, in one place, and
don't have to be replicated by every application and/or in different
interfaces.
>> The biggest headache in *NIX (with the exception of Mac OS X's
>> underlying version) is the haphazard way that handling of
>> non-ASCII characters and I18n has developed. It is especially
>> grotty at the system level, and as you ...
> The system level has nothing to do with fonts... Until you get to
> fonts and rendering, m17n and i18n are extremely trivial.
It depends on how character strings are handled before they get to
the console application. In some *NIX systems, this is handled in the
kernel, mixed up with I/O handling. This was done for efficient I/O
handling, including efficient buffering. As I said in my first email,
I am no longer cognisant of how this sort of code is handled, but
when I was working on *NIX, I had to rewrite a lot of that code to
remove assumptions about what a word was, what a char was, what a
byte was. I know that this has been cleaned up since, but I would be
surprised if all such assumptions in the low-level data handling have
been removed.
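
To give one concrete (if contrived) illustration of the kind of
assumption I mean, and this is a sketch, not a quote from any
particular kernel: code of this shape was once common in drivers and
filters, and it silently mangles non-ASCII data wherever plain char
happens to be signed.

    #include <stdio.h>

    /* Broken: where plain char is signed, the byte 0xFF read from
     * the input compares equal to EOF (-1) and truncates the stream.
     * 0xFF is an ordinary byte in Latin-1 and in UTF-8. */
    void copy_broken(FILE *in, FILE *out)
    {
        char c;
        while ((c = getc(in)) != EOF)
            putc(c, out);
    }

    /* Correct: getc() returns int precisely so that EOF can be
     * distinguished from all 256 possible byte values. */
    void copy_fixed(FILE *in, FILE *out)
    {
        int c;
        while ((c = getc(in)) != EOF)
            putc(c, out);
    }

Multiply that by every place the system buffers I/O and you have the
sort of rewrite I was doing.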
The other point is that rendering _is_ required at the console level
for more complex script systems: you cannot special-case consoles to
fixed width and thereby avoid rendering problems in the _majority_ of
non-Latin scripts.
With proper m17n, L10n and I18n, someone speaking Hindi, for instance
(more of them than English speakers!) should be able to boot into
single user with both prompts and commands using the appropriate
script for Hindi. Correct rendering of Indic scripts is _not_ trivial
(and therefore the code is bulky). At the single user level, many
implementations of *NIX incorporate this terminal rendering code
either into the kernel or as system code.
Naturally, one wouldn't want this code for rendering every script
system incorporated into each and every system, and therefore,
modularity makes sense: you put in what you need at build time.
That's done all the time with *NIX implementations. Moreover, with
proper (modular) design, you can build in your m17n one script system
and its supported languages at a time, releasing new code as you go,
and not having to rewrite what you started with every time you add.
>> commented below, one of the reasons is that (English-only
>> speaking) *NIX systems people think that handling of non-ASCII
>> charsets should somehow be trivial and not bulky in code.
> I'm not English-only speaking yet I'm quite confident that it
> should be trivial and not bulky in code, and that applications
> should not even have to think about it.
But do your linguistic skills extend to a language using a non-Latin
script — or, more to the point — a language that uses a complex
script system?
> The difference between your approach (and the approach of people
> who have written most of the existing applications with extensive
> script support) and mine is exactly the same as the difference
> between the early efforts at converting to Unicode (especially by
> MS) and UTF-8: the MS/Unicode approach was to pull the rug out from
> under everyone and force them to drop C, drop UNIX, drop all
> existing internet protocols, and store text as 16-bit characters in
> UCS-2.
Please don't imply that I would support MS in any way. They have used
every trick in the book to lock users into their products, and all
of their products started as spaghetti code written in assembler.
They eventually switched to C and then to their brand of C++, but
still produced monolithic code to control their APIs and keep most of
them hidden from independent developers. In any case, they had no
choice but to pull the rug out from under everyone (including their own
people) because there was no other way to upgrade to Unicode from the
crap API they had.
Modular approaches to programming have nothing whatever to do with
bloat and everything to do with maintainability, understandability,
orthogonality, and extensibility. Indeed, some *NIX systems (such as
Mach) are modular and are _not_ bloated and are highly efficient. A
good argument can be made that Mach is more easily maintained than
say, Linux, which has a huge, monolithic kernel.
> The UTF-8 approach on the other hand recognizes that most of the
> time when programs are dealing with text they don't care about the
> encoding or meaning of the text at all. At most they care about
> some codepoints in the first 128 positions that have special
> meaning to the software. Thus Unicode can be supported _without_
> any special effort from developers.
Yep. As long as all code developers are required to learn English to
an acceptable level: very Anglo-centric. I applaud the extension of
C/C++, etc, to use and represent variable names and commands in
scripts other than Latin-1.
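
Still, to give Rich's point its due with a sketch: because UTF-8
never reuses bytes below 0x80 inside a multi-byte sequence, a naive
byte-oriented search for an ASCII delimiter keeps working on
multilingual text without the program knowing anything about
encodings. For example:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "résumé" encoded in UTF-8: the accented letters use only
         * bytes >= 0x80, so a byte-wise search for the ASCII '/'
         * can never land inside them. */
        const char *path = "/home/r\xC3\xA9sum\xC3\xA9/notes.txt";
        const char *slash = strrchr(path, '/');
        printf("basename: %s\n", slash + 1);  /* notes.txt */
        return 0;
    }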
> The obvious exception to this comes when it's time to display the
> text on a visual device for the user. :) Terminals, if they work
> correctly with the necessary scripts, provide a very clean solution
> to the problem because the application doesn't have to think about
> the presentation of the text. Historically it meant the application
> could just assume 1 byte == 1 character position for non-control
> characters.
Obviously, we are talking at cross purposes. You seem to be agreeing
that text rendering for multiple scripts needs to be available for
any application (including the login process and "sh" and its
derivatives?). That requires delving into pretty low-level and pretty
ancient code bits, some of which _may_ require a change to the
interface, ie, the API, and that would mean that the POSIX standard
would be breached unless and until it was altered. It also means a
complete rewrite of this low-level code.
I am asking for this complete rewrite, as opposed to quick fixes. If
that is what you are actually doing, I support it. But, it is a big
job, and would benefit from some help and consultation with
like-minded individuals. I am _not_ suggesting an IBM/MS-type code
team: efficiency can be — and often is — achieved by a small team of
experienced and dedicated programmers working together closely.
And I repeat, writing rendering code is _not_ trivial, and it _is_
bulky, and it _is_ necessary at the console level.
> Now, the same requires mbtowc/wcwidth, but it's not any huge
> burden. Surely a lot less burden than doing the text rendering
> yourself.
If you have to map code to representation, then you are doing
rendering. Rendering in some scripts maps multiple code points to one
glyph position. For instance, Vietnamese, which uses the Latin
alphabet, can apply any of five tone marks, sometimes on top of a
vowel diacritic, to a basic Latin letter, to present one glyph
fitting into one (wide or narrow) character position. Representing
Vietnamese in a fixed-width simple terminal emulator requires
considerable rendering code, even though the base alphabet itself is
plain ASCII.
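
That said, the mbtowc/wcwidth route does at least expose the
many-to-one mapping: wcwidth() reports zero width for combining
marks, so a terminal emulator can fold a base letter and its accents
into one cell. A minimal sketch, assuming a UTF-8 locale is
installed:

    #define _XOPEN_SOURCE 700   /* for wcwidth() */
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    /* Count the terminal columns occupied by a UTF-8 string; a base
     * letter plus its combining accents share a single column. */
    int columns(const char *s)
    {
        mbstate_t st;
        wchar_t wc;
        size_t n, len = strlen(s);
        int w, cols = 0;

        memset(&st, 0, sizeof st);
        while (len && (n = mbrtowc(&wc, s, len, &st)) != 0) {
            if (n == (size_t)-1 || n == (size_t)-2)
                return -1;             /* invalid or truncated input */
            if ((w = wcwidth(wc)) > 0) /* combining marks give 0 */
                cols += w;
            s += n;
            len -= n;
        }
        return cols;
    }

    int main(void)
    {
        setlocale(LC_ALL, "");  /* assumes a UTF-8 locale */
        /* Vietnamese "e" with combining circumflex and acute:
         * three code points, one column. */
        printf("%d\n", columns("e\xCC\x82\xCC\x81"));  /* prints 1 */
        return 0;
    }

Of course this merely delegates the hard decisions to the C
library's tables; the rendering itself is still someone's problem.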
> But what about applications that _do_ want/need to do the text
> rendering themselves? This must include at least the terminal
> emulator, :) and also imaging programs, presentation apps,
> visually-oriented web browsers, ... As long as the program does not
> need to do its _own_ text display it may be able to rely on a
> widget set, which basically gives all the same advantages as using
> a terminal with regard to implementation simplicity. (However now
> we need to add widget sets to the list of things that need to do
> text rendering..)
Agreed.
> This whole line of questioning raises a lot more questions than it
> answers and I'm going to easily get sidetracked if I continue...
If you don't get sidetracked enough to deal with difficult scripts —
at least at the design level — your solution will be yet another
inadequate kludge: it won't be flexible enough to add support for
more difficult scripts.
>> I am no longer up-to-date with kernel and system details in *NIX,
>> and am not a developer — perhaps an interested bystander is where
>> I fit in — but I used to do a lot of coding in that area, so I
>> know how difficult it can be. My view is that what is needed is a
>> modular (and ...
> Why modular? "Modular" is the magic panacea word among people
> writing this bloatware, and all it does is massively increase
> memory requirements and complexity.
See my comments above. Modularity generally does increase code size
compared to spaghetti code, but it has a number of advantages: 1) In
the case discussed above, it allows one to remove rendering code for
script systems that do not need to be supported at build time; 2) it
allows you to develop code step by step or in parallel with others.
You can start with simple Latin scripts, add support for Cyrillic and
Greek. Then you need to tackle the simplest right-to-left script,
Hebrew. Next you can deal with something like Vietnamese, which uses
multiple code points from Latin-1 to render single glyphs. And so on;
3) It is more easily maintained than non-modular designs.
In addition, well-designed modular code is not particularly larger
than non-modular code. Bulky code is usually feature-bloated code
generated by tacking these features onto a core application that was
in itself badly designed. MS Word is a good example: there is still
an annoying rendering flaw in it that has been there since its
inception, but nobody can fix it because the original code was so
badly designed and written, and the add-ons only bury it deeper.
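
As a sketch of what I mean by build-time modularity (all names here
are invented for illustration, not any existing interface): each
script system contributes one shaping routine, and the configuration
trims the table, so an unselected script costs nothing in the binary.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical interface: each script module turns a run of
     * code points into glyph numbers for the renderer. */
    typedef size_t shape_fn(const uint32_t *in, size_t n,
                            uint32_t *out);

    struct script_module {
        uint32_t first, last;   /* code-point range the module claims */
        shape_fn *shape;
    };

    #ifdef CONFIG_SCRIPT_ARABIC
    extern shape_fn shape_arabic;
    #endif
    #ifdef CONFIG_SCRIPT_DEVANAGARI
    extern shape_fn shape_devanagari;
    #endif
    extern shape_fn shape_latin;    /* always built in */

    /* Unsupported scripts are simply compiled out. */
    static const struct script_module modules[] = {
    #ifdef CONFIG_SCRIPT_ARABIC
        { 0x0600, 0x06FF, shape_arabic },
    #endif
    #ifdef CONFIG_SCRIPT_DEVANAGARI
        { 0x0900, 0x097F, shape_devanagari },
    #endif
        { 0x0000, 0x024F, shape_latin },
    };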
>> unified) way of slotting in support for handling various
>> alphabets and languages,
> The view that supporting a new alphabet or language requires a new
> module is fundamentally wrong. All it should require is proper
> information in the font.
Not true! Rendering of non-Latin scripts is much more complex than
that. Rendering involves a complex (often many-to-one) mapping from
code points to glyphs that can be context dependent and may require
reordering of the code points before mapping. The mapping is script
dependent, language dependent, and font dependent. Rendering Arabic
script, for instance, is highly context dependent, since the form of
glyph used depends on whether the consonant is at the beginning,
middle, or end of the word, and on what vowel is associated with the
consonant. It is also language dependent, since, for instance, Farsi
(spoken in Iran and Afghanistan) uses some extra glyphs not found in
Arabic-language Arabic script. I don't believe that reordering is
required, but then I am not a user of Arabic script.
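
To make the Arabic case concrete, here is a toy sketch of the
context dependence; real shaping engines are far more involved, but
the principle is that one abstract letter maps to up to four
presentation glyphs depending on whether it joins its neighbours:

    #include <stdint.h>

    /* Toy contextual shaper for two Arabic letters. Unicode encodes
     * the abstract letter (e.g. BEH, U+0628) once; the isolated,
     * final, initial, and medial glyph forms live in the
     * presentation-forms block and must be chosen from context. */
    struct forms { uint32_t letter, isolated, final, initial, medial; };

    static const struct forms tab[] = {
        /* BEH joins on both sides, so it has all four forms. */
        { 0x0628, 0xFE8F, 0xFE90, 0xFE91, 0xFE92 },
        /* ALEF never joins to the following letter, so it has only
         * isolated and final forms. */
        { 0x0627, 0xFE8D, 0xFE8E, 0xFE8D, 0xFE8E },
    };

    static uint32_t pick_form(const struct forms *f,
                              int joins_prev, int joins_next)
    {
        if (joins_prev && joins_next) return f->medial;
        if (joins_prev)               return f->final;
        if (joins_next)               return f->initial;
        return f->isolated;
    }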
Modern Greek has one letter (sigma), and modern Hebrew five
consonants, that are rendered differently when they fall at the end
of a word.
All Indic scripts require extensive code-point reordering, since some
vowel signs are traditionally written before the consonant they
logically follow, even though they are pronounced after it, and the
reordered pair is the key to the combined vowel-consonant glyph to be
rendered. I could go on, but I repeat: rendering of non-Latin scripts
is _not_ trivial. Moreover, like Arabic script, rendering of an Indic
script is also language dependent. A Devanagari font has alternate
and extra glyphs in order to cater for the several Sanskrit-derived
modern languages that use it.
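
And the Indic reordering in miniature (a toy sketch only; real Indic
shaping must also handle conjuncts and much else): the Devanagari
vowel sign I (U+093F) is stored after its consonant in logical order
but drawn before it, so even a fixed-width renderer must swap the
pair before output.

    #include <stddef.h>
    #include <stdint.h>

    #define VOWEL_SIGN_I 0x093F  /* Devanagari "i" matra, drawn to
                                    the left of the consonant it
                                    logically follows */

    /* Toy visual reordering: swap each (consonant, i-matra) pair
     * into display order. */
    static void reorder(uint32_t *cp, size_t n)
    {
        for (size_t i = 1; i < n; i++) {
            if (cp[i] == VOWEL_SIGN_I) {
                uint32_t c = cp[i - 1];
                cp[i - 1] = cp[i];
                cp[i] = c;
            }
        }
    }
    /* Input (logical order):  U+0915 KA, U+093F I   ("ki")
     * Output (visual order):  U+093F I,  U+0915 KA */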
>> based on Unicode categories, that can be easily set up at system
>> build time.
> So at build time you either choose "bloatware with m17n" or "legacy
> ascii/latin1 crap"? Sounds like the current problem we're stuck
> with. The bloatware distros will mostly choose the former and the
> ones targeting more advanced users who dislike bloat will choose
> the latter, perpetuating the problem that competent developers
I would use "one-eyed" rather than "competent". Competency is not
limited to *NIX system coders, nor are all *NIX coders competent.
> despise m17n and i18n and therefore do not include support in their
> programs.
Once again, you misunderstand my position: bloatware _will_ result if
you tack on L10n, m17n, and I18n to the existing "legacy ascii/latin1
crap". You need to start from scratch, maybe even changing some APIs,
to remove the ascii/latin1 design bias built into original UNIX. But,
handling and rendering non-latin-1 scripts is complex, and hence the
code will be considerably larger than what is being replaced.
Moreover, this needs to be done whatever font encoding is used, and
font encoding is largely orthogonal to L10n, m17n, and I18n. Only the
rendering engine is dependent on font encoding.
>> Moreover, *NIX is greatly in need of a way of unifying all the
>> various ways of formatting and representing characters at all
>> levels, using system-level code.
> Huh? What does this even mean? Are you confusing glyphs with
> characters? Representing characters is trivial.
My slip of the tongue. What I was trying to say is that various
applications, including printer drivers, terminal drivers, text
editors, etc, seem to use different APIs, software, and tables to
format and to render text. If the system is rewritten from scratch on
the premise that all text is Unicode, with rendering done at a low
level driven by L10n and embedded Unicode control characters (some of
which are multiple bytes long and some of which specify the language
and script system to be rendered), then it can present a uniform API
to all applications that need text rendered. Perhaps some
text-processing systems will need to do their own rendering, but I
imagine that in some cases they can use a fixed-width rendering
system applied to a variable-width font and do their own spacing.
Alternatively, if rendered fonts use anti-aliasing and
colours/shades, maybe your basic renderer won't be useful.
This may even imply some minor tweaking of the POSIX standard.
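
Something like the following, and I stress that this interface is
entirely hypothetical, with names invented for illustration, is the
kind of single choke point I have in mind: every consumer, whether
terminal driver, printer driver, or editor, hands UTF-8 text plus a
locale to one service and gets back glyphs in visual order.

    #include <stddef.h>
    #include <stdint.h>

    /* Entirely hypothetical interface, for illustration only. */
    struct glyph_pos {
        uint32_t glyph;   /* font-specific glyph number */
        int      col;     /* target column (cell) on the device */
    };

    /* One rendering service for the whole system: takes UTF-8 text
     * and a locale, applies the shaping and reordering the script
     * requires, and returns glyphs in visual order. Terminal
     * drivers, printer drivers, and editors would all call this
     * instead of each carrying its own tables. */
    size_t render_line(const char *utf8, const char *locale,
                       struct glyph_pos *out, size_t max_out);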
.....
>> I know that a real-life problem (with a deadline?) has got you
> No deadline except being tired of having a legacy system.
Great! You have time to spend on a careful, modular design based on a
good understanding of the problems that arise with m17n being
included at the basic console level.
>> energised to tackle this can of worms, but a quick fix or
>> reinvention of the wheel is just not the way to go.
> Someone once said: "when the wheel is square you need to reinvent it".
Agreed.
>> Someone with energy and know-how has got to get a team together
>> and fix what is broken in the guts of *NIX so that it presents a
>> good, clean interface for I18n and multiple character set
>> representation.
> Absolutely not. This is the bloatware doctrine, that new interfaces
> and libs are a panacea, that they're best designed by teams and
> committees, etc. What's needed is _simplicity_. When you have
> simplicity everything else follows.
A hammer is a simple tool. If that's all you've got, then every
problem is a nail that needs to be whacked in. Unfortunately, not all
mechanical problems can be rectified with just a hammer.
Cars are much more complex than the original Model T, and a lot of it
is bloatware. But no one in their right mind would want to drive at
100kph with mechanical-linkage brakes. No one could afford an engine
as inefficient as the V-12 that was in some 40s and 50s Oldsmobiles.
No one today would design an engine w/o fuel injection and computer
control of fuel/air ratio, because in this case, complexity yields
efficiency.
Your basic premise is wrong: I18n, m17n, and L10n _are_ very complex,
even if implemented with fixed-width fonts at the console level. Your
little hammer won't do, and the solution will be big compared to
handling ASCII.
> There is a possibility here to solve a simple, almost-trivial
> unsolved problem.
If it were trivial, it would have been solved long ago: it is not.
> What you propose is abandoning the simple problem
I repeat: the problem is not simple.
> and trying to solve much more difficult problems instead, many of
> which will not be solved anytime in the near future due as much to
> personal and political reasons as to technical ones.
The only political problem has to do with font encoding: both Adobe
and MS want to keep control of the huge font-rendering and
font-foundry industry, Adobe because they have a near monopoly on
firmware rendering systems, and MS so they can keep technology away
from any competitors, including Apple. Their union is unstable, and
conflicting interests have indeed led to complex font encodings and
to inadequate standardisation. Having said that, Adobe has been
trying to deal with the complexities of rendering non-Latin script
systems without having to start from scratch: that leads to bloat,
but you do have to take economics into consideration.
> Moreover, even if the more difficult problem is solved, the
> solution will not be useful to anyone except the ./configure
> --enable-bloat crowd (myself included). Why would I want to abandon
> a real solvable problem in order to attempt to solve a problem
> that's uninteresting to me?
Because you misunderstand the problem? Because if it really were
simple, it would have been simply done already?
George
------
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/