On Fri, Aug 04, 2006 at 01:30:34AM +0200, Werner LEMBERG wrote:
> > > What you probably mean is that some language data needs to be
> > > preprocessed into a normalized form before it is fed into the
> > > font, for example Indic and Arabic scripts.
> >
> > What sort of preprocessing? Reordering vowels? Replacement of Arabic
> > characters with the appropriate presentation forms?
> 
> Arabic needs tagging of glyphs as being `initial', `medial', `final',
> and `isolated', as specified in the Unicode book.  Since this is
> identical for all fonts the OpenType designers have decided not to
> make this information part of the font itself.  In the long run,
> this makes the fonts smaller.

With my proposed context system, leaving this information out saves
only a few bytes in the font file, since the context rules can be
shared by all the characters that need them.
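
To make the tagging concrete, here is a rough sketch of how the four
forms fall out of the joining types of neighbouring letters. It is an
illustration only, not code from any font system: the table covers just
a handful of letters and is hand-written (the complete data is in the
Unicode file ArabicShaping.txt), the name joining_forms is made up, and
transparent marks and join-causing characters such as ZWJ are ignored.

```python
# Joining types for a handful of letters only, hand-written for this sketch.
# The complete data is in the Unicode data file ArabicShaping.txt.
DUAL, RIGHT = "D", "R"
JOINING_TYPE = {
    "\u0628": DUAL,   # BEH
    "\u062A": DUAL,   # TEH
    "\u0643": DUAL,   # KAF
    "\u0644": DUAL,   # LAM
    "\u0645": DUAL,   # MEEM
    "\u0627": RIGHT,  # ALEF
    "\u062F": RIGHT,  # DAL
    "\u0631": RIGHT,  # REH
    "\u0648": RIGHT,  # WAW
}


def joining_forms(text):
    """Return one tag per character of `text`, given in logical order."""
    forms = []
    for i, ch in enumerate(text):
        jt = JOINING_TYPE.get(ch)
        prev_jt = JOINING_TYPE.get(text[i - 1]) if i > 0 else None
        next_jt = JOINING_TYPE.get(text[i + 1]) if i + 1 < len(text) else None
        # A letter connects backwards only if the previous letter is
        # dual-joining; it connects forwards only if it is dual-joining
        # itself and the next character is a joining letter at all.
        joins_prev = jt is not None and prev_jt == DUAL
        joins_next = jt == DUAL and next_jt is not None
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


# KAF TEH ALEF BEH ("kitab")
print(joining_forms("\u0643\u062A\u0627\u0628"))
# -> ['initial', 'medial', 'final', 'isolated']
```

The final BEH comes out isolated because ALEF only joins to the letter
before it, which is exactly the kind of rule that is the same for every
font.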

> The process is simple (at least in theory -- there are many tricky
> details): A font contains a number of `features' like `use small caps'
> or `use old ligatures', or `use a different set of digits'.  Each
> feature consists of an ordered set of `lookups'.
> 
> Having a string of input character codes, you apply the first lookup
> table, then you start again and process the next one, and so on until
> all lookup tables have been applied.

Wow, what a horribly bad design. No wonder including Arabic
initial/medial/final information would make the font so big.
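
As I understand the description, the processing model is roughly the
following. This is a toy sketch under my own assumptions, not the real
GSUB machinery: a "lookup" here is just a dict from a tuple of glyph
names to a replacement tuple, the glyph names are invented, and the
functions apply_lookup and shape are hypothetical.

```python
def apply_lookup(glyphs, lookup):
    """Run one lookup over the whole glyph string, longest match first."""
    out, i = [], 0
    max_len = max((len(key) for key in lookup), default=0)
    while i < len(glyphs):
        for n in range(min(max_len, len(glyphs) - i), 0, -1):
            key = tuple(glyphs[i:i + n])
            if key in lookup:
                out.extend(lookup[key])
                i += n
                break
        else:
            # No rule matched at this position: copy the glyph through.
            out.append(glyphs[i])
            i += 1
    return out


def shape(glyphs, lookups):
    """Apply each lookup in order, restarting at the first glyph each time."""
    for lookup in lookups:
        glyphs = apply_lookup(glyphs, lookup)
    return glyphs


lookups = [
    {("f", "f"): ("f_f",)},        # first pass: ff ligature
    {("f_f", "i"): ("f_f_i",)},    # second pass: builds on the first
]
print(shape(list("office"), lookups))
# -> ['o', 'f_f_i', 'c', 'e']
```

Note that the second lookup only fires because the first one has already
rewritten the whole string, so every lookup costs another full pass.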

> > How will this solve anything?  The core protocol is still
> > unacceptable because all the glyph info has to be transmitted to the
> > client, and this info is way too big.
> 
> AFAIK it is possible to have fonts on the client side, avoiding the
> overhead of transmitting fonts.

1. If the fonts are client side you're no longer using the core font
   system which was the topic under discussion: whether or not the
   core font system is usable.
2. If the fonts are client side you must transmit large amounts of
   data to the server for rendering.

> Please read this:
>   http://keithp.com/~keithp/talks/usenix2001/xrender/
> It discusses the X Rendering Extension which has become standard
> meanwhile, I think.

Again this has nothing to do with the original X font system.

> You still need code to handle the SFNT format.  As mentioned in
> another mail, you can compile FreeType without any support for outline
> formats, using SFNT bitmap fonts only.

Why would I want to use this format after you explained above how
stupid its substitution system is? I designed something much better on
the very first attempt and have since refined it much more. I'll post
the new ideas soon.

> > What I mean by bitmap font format is the character->glyph mapping
> > system.
> 
> I doubt that you will find something really better than the abilities of
> GSUB

Already have.

> and GPOS tables.

GPOS is undesirable for character cell glyphs. It makes much more
sense just to include variants with the position pre-applied. The only
advantage of GSUB is for doing arbitrarily-long combining/stacking,
which has no hope of working except with variable-size fonts that can
extend outside of their original bounding boxes.

> > Is 3-4 bytes per potential substitution inefficient? That's what I'm
> > looking at. This is not counting context definitions specifying when
> > the substitution would be applied, but these definitions can often
> > be reused by many glyphs in the same script. As a simple example all
> > the Latin capital letters can share the "if superscribed combining
> > mark is attached" context.
> 
> In OpenType parlance this is called a `glyph class', defined in the
> GDEF table.
> 
> > Mongolian can be and is written horizontally as well.
> 
> Using Cyrillic, yes, but not the traditional script, AFAIK.

No, in the traditional script.

> > Certainly you can write vertical Mongolian in a Mongolian-only
> > editor, or in a top-down context in some sort of higher level word
> > processor or markup file, but the idea that you should see Mongolian
> > filenames vertically when you type "ls" somehow mixed in with other
> > filenames in horizontal orientation is hopeless.
> 
> Exactly.  We are again at the point where we have to define which
> scripts should be supported...

No. Like I said Mongolian can be and is written horizontally (L2R I
believe even though the original vertical version was written from
right to left). I believe the Unicode standard even mentions this
somewhere though I may be mistaken. If you want me to check again I
can ask friends who are familiar with the language.

> > I'm told there's also a script that runs R2L and L2R alternating on
> > successive rows, i.e. snakes back and forth, though I've never
> > actually been told what it is so perhaps it's a myth.
> 
> This is called `boustrophedon'.  Ancient Greek uses it, and Rongorongo
> also (the undeciphered script from Easter Island).

Well, as long as it's dead languages only I won't lose any sleep over
not supporting it. The dead can rise and complain to me if they care.

> > Whether it's possible or reasonable to support such things remains
> > to be seen.
> 
> It's not reasonable IMHO.  Another (quite natural) limitation of the
> scripts to support.

Another viewpoint is that the script is obviously readable in both
directions and that printing it in either is equally correct.
Alternating orientation on successive lines would then be seen as a
local stylistic preference of the people who used those scripts rather
than a necessity of the script, and would be available as an option in
applications supporting it.
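
As an option, that could be as small as the sketch below. It is only an
illustration under simplifying assumptions: the function name and
parameters are made up, lines are treated as plain strings of
single-cell characters, and reversing by code point would break
combining sequences and does not mirror the glyphs themselves.

```python
def boustrophedon(lines, width=None, first_left_to_right=True):
    """Lay out `lines` with the direction alternating on each row."""
    if width is None:
        width = max((len(line) for line in lines), default=0)
    out = []
    for row, line in enumerate(lines):
        left_to_right = (row % 2 == 0) == first_left_to_right
        if left_to_right:
            out.append(line)
        else:
            # Reverse the row and start it at the right margin, where
            # the previous row ended.
            out.append(line[::-1].rjust(width))
    return out


for row in boustrophedon(["abc", "de", "f"]):
    print(row)
```

With the default arguments the three rows come out as "abc", " ed" and
"f", i.e. the middle row runs right to left from the right margin.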

> > Anyway the question with stuff like Urdu is whether it's imperative
> > to typeset the text in its standard written form or whether a
> > 'computer style' line-based form or something is acceptable.
> 
> Ah, this is similar to the discussion whether it is acceptable to
> represent the German `ü', `ä', and `ö' with `ue', `ae', and `oe',
> respectively.

How is it similar at all? One is a question of using the correct
characters (even the correct number of characters!) while the other is
a matter of spacing and layout. You have a justification to be mad
when "ü" is written as "ue" because it's a result of English-centric
legacy systems. There's nothing _fundamentally_ difficult about
displaying a "ü". 

On the other hand, ...

[Now before we go on, there's a difference between a word processor and
a text editor or command line. This needs to be abundantly clear
because what I'm about to say of course does not apply to a word
processor or quality typesetting system intended to produce
publications for print.]

...a speaker of Urdu is (IMO) not justified in being mad about the
lack of support for this diagonal layout. All the glyphs are correct.
They're all in the right order and orientation. Etc. etc. etc.

Ask yourself this: what would a speaker of Urdu do if they needed to
write a message and the only paper they had was barely tall enough for
one handwritten letter. If your answer is "write it all on one line"
or even "write it essentially on one line with each word slanted
slightly diagonal" then there's absolutely no reason the same can't be
done on a computer terminal.

> For me as a native German speaker, this is extremely
> ugly, and still a lot of computerized systems used in, say, public
> transport facilities are displaying this.

I agree, I just don't find it relevant.

> > I'm not saying that it's justifiable to have crap support for
> > languages or scripts, just that sometimes a language has to adapt
> > and develop alternate presentation forms that _will_ work with
> > technology, or risk becoming irrelevant as technology becomes more
> > important in society.
> 
> I'm quite conservative here: It's a very bad idea to adapt a language
> or script to the computer.  It should be the opposite.

You've hit the nail on the head with regard to why so much of m17n and
i18n software is bloated to hell: this is the exact philosophy of
bloatware! Rather than thinking like a computer and using a computer
in the natural way a computer works, the bloatware philosophy is to
stop at nothing until the computer has been beaten into submission and
forced to think like a human. This is why we have abominations like
Lisp, Java, Perl, garbage collection, MS Office assistant, "Wizards",
'auto-correct', ... (and yet they all still fail to think like a
human... ;)

If you insisted on this kind of inefficiency with any other kind of
technology, people would call you mad. It's like spending billions of
dollars to create a fertile paradise in the middle of a desert (hey
they did it in Dubai...) rather than building your home in a sane
place to begin with.

I have no interest in telling people to revise their languages to make
them conform to legacy ASCII limitations, nor to make them conform to
international expectations or any other imperial bullshit. However I
do believe very strongly that all people (including English speakers)
should tolerate having their language displayed in a form that
respects both the intended look of the script and the nature of the
medium on which it's displayed. Most languages already have many
different ways of being written depending on whether their use is in a
book, on a street sign, on the sign for a place of business (often
vertical even in non-vertical-script languages), on a poster or flyer,
etc. Computers (and particularly on-topic now, terminals) are yet
another such medium.

Rich



--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
