On Fri, Sep 01, 2006 at 03:46:44PM -0600, Mark Leisher wrote:
> Did it ever occur to you that it wasn't the "word processing mentality" 
> of the Unicode designers that led to ambiguities surviving in plain 
> text? It is simply the fact that there is no nice neat solution. Unicode 
> went farther than just about anyone else in solving the general case of 
> reordering plain bidi text for display without explicit directional codes.

It went farther because it imposed language-specific semantics in
places where they do not belong. These semantics are correct for
sentences written in human languages, which would not have been hard
to mark up explicitly, especially with a word processor doing it for
you. On the other hand, they're horribly wrong in computer languages
(meaning any text file meant to be read and interpreted by a machine,
not just programming languages), where explicit markup is highly
undesirable or even illegal -- think of a delimited data file with
Hebrew fields, where implicit reordering of the neutral delimiters
obscures which delimiter is which.

> Why does plain text still exist?

Read Eric Raymond's "The Art of Unix Programming". He answers the
question quite well.

Or I could just ask: should we write C code in MS Word .doc format?

> >A bidi algorithm with minimal/no
> >implicit behavior works fine as long as you are not mixing
> >languages/scripts, and when mixing scripts it makes sense to use
> >explicit embedding -- especially since the cases of mixed scripts that
> >MUST work without formatting controls are files that are meant to be
> >machine-interpreted as opposed to pretty-printed for human
> >consumption.
> 
> I'm not quite sure what point you are trying to make here. Do away with 
> plain text?

No, rather that handling of bidi scripts in plain text should be
biased towards computer languages rather than human languages. This is
both because plain text files are declining in use for human language
texts and increasing in use for computer language texts, and because
the display issues in human language texts can be solved with explicit
embedding markers (which an editor or word processor could even
auto-insert for you) while the same marks are unwelcome in computer
languages.
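
To illustrate (a minimal sketch, not code from any actual editor, and
the function name is made up): the auto-insertion could be as simple
as wrapping each RTL phrase in RIGHT-TO-LEFT EMBEDDING (U+202B) and
POP DIRECTIONAL FORMATTING (U+202C):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Wrap a UTF-8 string in U+202B (RLE) and U+202C (PDF), the way
     * an editor might do automatically around a phrase typed in an
     * RTL script. Caller frees the result. */
    char *rle_wrap(const char *s)
    {
        size_t n = strlen(s);
        char *out = malloc(n + 7);   /* two 3-byte controls + NUL */
        if (!out) return NULL;
        sprintf(out, "\xE2\x80\xAB%s\xE2\x80\xAC", s);
        return out;
    }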

> >In particular, an algorithm that only applies reordering within single
> >'words' would give the desired effects for writing numbers in an RTL
> >context and for writing single LTR words in a RTL context or single
> >RTL words in a LTR context. Anything more than that (with unlimited
> >long range reordering behavior) would then require explicit embedding.
> 
> You are aware that numeric expressions can be written differently in 
> Hebrew and Arabic, yes? Sometimes the reordering of numeric expressions 
> differs (i.e. 1/2 in Latin and Hebrew would be presented as 2/1 in 
> Arabic). This also affects other characters often used with numbers such 
> as percent and dollar sign. So even within strictly RTL scripts, 
> different reordering is required depending on which script is being 
> used. But if you know a priori which script is in use, reordering is 
> trivial.

This is part of the "considered harmful" of bidi. :)
I'm not familiar with all this stuff, but as a mathematician I'm
curious how mathematicians working in these languages write. BTW
mathematical notation is an interesting example where traditional
storage order is visual and not logical.
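
Coming back to the word-local proposal quoted above, here is roughly
what it could look like (a sketch only: it assumes already-decoded
code points, tests RTL-ness by raw Hebrew/Arabic block ranges rather
than real Bidi_Class data, and ignores combining marks and Arabic
shaping, which would need the special treatment discussed elsewhere
in this thread):

    #include <stddef.h>
    #include <stdint.h>

    /* Crude RTL test covering only the Hebrew and Arabic blocks; a
     * real implementation would consult the Bidi_Class property. */
    static int is_rtl(uint32_t c)
    {
        return (c >= 0x0590 && c <= 0x05FF) ||  /* Hebrew */
               (c >= 0x0600 && c <= 0x06FF);    /* Arabic */
    }

    /* Reverse a span of code points in place. */
    static void revspan(uint32_t *w, size_t n)
    {
        for (size_t i = 0; i < n/2; i++) {
            uint32_t t = w[i]; w[i] = w[n-1-i]; w[n-1-i] = t;
        }
    }

    /* Word-local display reordering: mirror a space-delimited word
     * only if every character in it is RTL. Nothing is ever moved
     * across a word boundary, so long-range surprises are
     * impossible. */
    void reorder_words(uint32_t *s, size_t len)
    {
        for (size_t i = 0; i < len; ) {
            while (i < len && s[i] == ' ') i++;
            size_t start = i;
            int all_rtl = (i < len);
            while (i < len && s[i] != ' ') {
                if (!is_rtl(s[i])) all_rtl = 0;
                i++;
            }
            if (all_rtl) revspan(s + start, i - start);
        }
    }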

> This is the choice of each programming language designer: either allow 
> directional override codes in the source or ban them. Those that ban 
> them obviously assume that knowledge of the language's syntax is 
> sufficient to allow an editor to present the source code text reasonably 
> well.

It's simply not acceptable to need an editor that's aware of language
syntax just to present the code for viewing and editing. You could
work around the problem by inserting dummy comments to keep the bidi
algorithm from taking effect, but that's really ugly and essentially
makes RTL scripts unusable for programming if the editor applies the
Unicode bidi algorithm to the display.
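
A hypothetical example of that ugliness (Hebrew identifiers shown in
logical order; whether a compiler accepts non-ASCII identifiers at
all is a separate problem):

    sum = שלום + עליכם;        /* under UAX#9 rule N1, the '+' between
                                  two RTL words resolves RTL, so the
                                  editor shows the second operand to
                                  the LEFT of the first */
    sum = שלום /*x*/ + עליכם;  /* strong LTR letters in a dummy
                                  comment re-anchor the operator and
                                  keep the operands in their logical
                                  positions */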

> >>You left out the part where Unicode says that none of these things is 
> >>strictly required.
> >
> >This is blatantly false. UAX#9 talks about PARAGRAPHS. Text files
> >consist of LINES. If you don't believe me read ISO/IEC 9899 or IEEE
> >1003.1-2001. The algorithm in UAX#9 simply cannot be applied to text
> >files as it is written because text files do not define paragraphs.
> 
> How is a line ending with newline in a text file not a paragraph? A 
> poorly formatted paragraph, to be sure, but a paragraph nonetheless. The 
> Unicode Standard says paragraph separators are required for the 
> reordering algorithm. There is no reason why a line can't be viewed as a 
> paragraph. And it even works reasonably well most of the time.

Actually, it does not work with embedding. If a (semantic) paragraph
has been split into multiple lines, the explicit embedding levels are
broken at each line boundary and the UAX#9 algorithm cannot process
them without reconstructing an idea of what the whole "paragraph"
meant. A problem also occurs if the first character of a new line
happens to be an LTR character (some embedded English text, say) in a
(semantic) paragraph that's Arabic or Hebrew: the line-as-paragraph
heuristic then picks the wrong base direction.

As you acknowledge below, a line is not necessarily an
unlimited-length object, and in email it should not be longer than 80
characters (preferably 72 or so, to allow for quoting). So you can't
necessarily just take the MS Notepad approach of omitting newlines and
treating lines as paragraphs, although this may be appropriate in some
uses of text files.

> BTW, what part of ISO/IEC 9899 are you referring to? All I see is 
> §7.19.2.7 which says something about lines being limited to 254 
> characters and a terminating newline character. No definitions of lines 
> or paragraphs that I see off hand.

I'm talking about the definition of a text file as a sequence of
lines, which might (on stupid legacy implementations) even be
fixed-width fields. It's under the stdio stuff about the difference
between text and binary mode -- the "Streams" subclause, 7.19.2 in
C99, if memory serves; I don't feel like digging through the PDF to
quote it exactly.
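
The distinction shows up directly in fopen's mode strings; a minimal
sketch, with nothing specific to any one implementation:

    #include <stdio.h>

    int main(void)
    {
        FILE *txt = fopen("notes.txt", "r");   /* text mode: a
                                                  sequence of lines */
        FILE *bin = fopen("notes.txt", "rb");  /* binary mode: a raw
                                                  byte sequence */
        if (txt) fclose(txt);
        if (bin) fclose(bin);
        return 0;
    }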

> Han unification did indeed break existing practice, but I think you will 
> find that the IRG (group of representatives from all Han-using 
> countries) feels that in the long run, it was the best thing to do.

I agree it was the best thing to do, too. I just pointed it out as
contradicting your claim that they made every effort not to break
existing practice.

> UCS-2 didn't so much break existing practice as come along at one of the 
> most confusing periods of internationalization retrofitting of the C 
> libraries and language.

I suppose this is true and I don't know the history and internal
politics well enough to know who was responsible for what. However,
unlike doublebyte/multibyte charsets which were becoming prevalent at
the time, UCS-2 data does not form valid C strings. A quick glance at
some historical remarks from unicode.org and Rob Pike suggests that
UTF-8 was invented well before any serious deployment of Unicode, i.e.
that the push for UCS-2 was deliberately aimed at breaking things,
though I suspect it was Apple, Sun, and Microsoft pushing UCS-2 more
than the consortium as a whole.
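
The C string point is easy to demonstrate (a minimal sketch; the
bytes shown are big-endian UCS-2, but either byte order has the same
problem):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "Ab" in big-endian UCS-2: 0x0041 0x0062. Every other byte
         * is NUL, so as a C string the data is truncated at once. */
        const char ucs2[] = { 0x00, 0x41, 0x00, 0x62, 0x00, 0x00 };

        /* UTF-8 encodes every code point except U+0000 without any
         * zero bytes, so it always forms a valid C string. */
        const char utf8[] = "Ab";

        printf("%zu\n", strlen(ucs2));  /* prints 0 */
        printf("%zu\n", strlen(utf8));  /* prints 2 */
        return 0;
    }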

> Those "useless" legacy characters avoided breaking many existing 
> applications, most of which were not written for Southeast Asia. Some 
> scripts had to end up in the 3-byte range of UTF-8. Are you in a 
> position to determine who should and should not be in that range? Have 

IMO the answer is common sense. Languages with low information
density per character (lots of letters/marks per word, especially
Indic) should be in the 2-byte range, and those with high information
density (especially ideographic scripts) should be in the 3-byte
range. If it weren't for so many legacy Latin blocks near the
beginning of the character set, most or all scripts for low-density
languages could have fit in the 2-byte range.
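
For reference, the boundaries at issue are pure arithmetic on the
code point:

    /* Bytes needed to encode one code point in UTF-8. */
    int utf8_bytes(unsigned long c)
    {
        if (c < 0x80)    return 1;  /* ASCII */
        if (c < 0x800)   return 2;  /* U+0080..U+07FF: Latin suppl.,
                                       Greek, Cyrillic, Hebrew,
                                       Arabic, Syriac, Thaana, ... */
        if (c < 0x10000) return 3;  /* U+0800..U+FFFF: the Indic
                                       blocks (Devanagari starts at
                                       U+0900), Thai, Tibetan, CJK,
                                       ... */
        return 4;                   /* beyond the BMP */
    }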

Of course it's pointless to discuss this since we can't change it now
anyway.

> you even considered why they ended up in that range?

Probably at the time of allocation, UTF-8 was not even around yet. I
haven't studied that portion of Unicode history. Still, the legacy
characters could have been put at the end with the CJK compatibility
forms, the preshaped Arabic presentation forms, etc., or even outside
the BMP.

> Like others who didn't like the Unicode bidi reordering approach, the 
> Arabeyes people were welcome to continue doing things the way they 
> wanted. Interoperability problems often either kill these companies or 
> force them to go Unicode at some level.

Thankfully there's not too much room for interoperability problems
with the data itself as long as you stick to logical order, especially
since the need for more than a single embedding level is rare. Unless
you're arguing for visual order, the question is entirely a display
matter: whether bidi display is compatible with other requirements.

> So you do understand. If it isn't fixable, what point is there in 
> complaining about it? Find a better way.

That's what I'm trying to do... Maybe some Hebrew or Arabic users who
dislike the whole bidi mess (the one Israeli user I'm in contact with
hates bidi and thinks it's backwards...not a good sample size but
interesting nonetheless) will agree and try my ideas for a
unidirectional presentation and like them. Or maybe they'll think it's
ugly and look for other solutions.

> >Naturally fullscreen programs can draw bidi text in its natural
> >directionality either by swapping character order (but some special
> >treatment of combining marks and Arabic shaping must be done by the
> >application first in order for this to work, and it will render
> >copy-and-paste from the terminal mostly useless) or by using ECMA-48
> >bidi controls (but don't expect anyone to use these until curses is
> >adapted for it and until screen supports it).
> 
> Hmm. Sounds just like a bidi reordering algorithm I heard about. You 
> know. The one the Unicode Consortium is touting.

Applications can draw their own bidi text with higher level formatting
information, of course. I'm thinking of a terminal-mode browser that
has the bidi text in HTML with dir attributes and whatnot, or apps
with a text 'gui' consisting of separate interface elements.

> I have a lot of experience

Could you tell me some of what you've worked on and what conclusions
you reached? I'm not familiar with your work.

> with ECMA-48 (ISO/IEC 6429) and ISO/IEC 2022. 

ISO 2022 is an abomination, certainly not an acceptable way to store
text due to its stateful nature, and although it works for
_displaying_ text, it's ugly even for that.

I've read ECMA-48 bidi stuff several times and still can't make any
sense of it, so I agree it's disgusting too. It does seem powerful but
powerful is often a bad thing. :)

> All I will say about them is Unicode is a lot easier to deal with. Have 

Easier to deal with because it solves an easier problem. UAX#9 tells
you what to do when you have explicit paragraph division and unbounded
search capability forwards and backwards. Neither of these exists in a
character cell device environment, and (depending on your view of what
constitutes a proper text file) possibly not in a text file either. My
view of a text file (maybe not very popular these days?) is that it's
a more-restricted version of a character cell terminal (no cursor
positioning allowed) but with unlimited height.

> a look at the old kterm code if you want to see how complicated things 
> can get. And that was one of the cleaner implementations I've seen over 
> the years.

Does it implement the ECMA-48 version of bidi? Or random unspecified
bidi like mlterm? Or something else?

> >Languages are messy. Scripts are not, except for a very few bidi
> >scripts. Even Indic scripts are relatively easy.
> 
> Hah! I often hear the same sentiment from people who don't know the 
> difference between a glyph and a character.

I think we've established that I know the difference..

> Yes, it is true that Indic, 
> and even Khmer and Burmese scripts are relatively easy. All you need to 
> do is create the right set of glyphs.

Exactly. That's a lot of work...for the font designer. Almost no work
for the application author or for the machine at runtime.

> This frequently gives you multiple glyph codes for each abstract 
> character. To do anything with the text, a mapping between glyph and 
> abstract character is necessary for every program that uses that text.

No, it's necessary only for the terminal. The programs using the text
need not have any idea what language/script it comes from. This is the
whole beauty of using such apps.

The same applies to GUI apps too, if they're using a nice widget kit.
Unfortunately all the existing widget kits are horribly bloated and
very painful to work with for someone not coming from an MS Windows
mentality (i.e. if you want to actually have control over the flow of
execution of your program..).

> Again, I would encourage you to try it yourself. Talk is cheap, 
> experience teaches.

That's what I'm working on, but sometimes discussing the issues at the
same time helps.

> >UAX#9 requires
> >imposing language semantics onto characters which is blatently wrong
> >and which is the source of the mess.
> 
> If 30 years of experience has led to blatantly wrong semantics, then 
> quit whining about it and fix it! The Unicode Consortium isn't deaf, 
> dumb, or stupid. They have been known to honor actual evidence of 
> incorrect behavior and change things when necessary. But they aren't 
> going to change things just because you find it inconveniently complicated.

They generally don't change things in incompatible ways, certainly not
in ways that would require retrofitting existing data with proper
embedding codes. What they might consider doing, though, is adding a
support level 1.5 or some such. Right now UAX#9 (implicitly?) says
that an application not implementing at least the implicit bidi
algorithm must not interpret RTL characters visually at all.

> >This insistence on making simple
> >things into messes is why no real developers want to support i18n and
> >why only the crap like GNOME and KDE and OOO support it decently. I
> >believe that both 'sides' are wrong and that universal i18n on unix is
> >possible but only after you accept that unix lives by the New Jersey
> >approach and not the MIT approach.
> 
> I have been complaining about the general trend to over-complicate and 
> over-standardize software for years. These days the "art" of programming 
> only exists in the output of a rare handful of programmers. Don't worry 
> about it. Software will collapse under its own weight in time. You just 
> have to be patient and wait until that happens and be ready with all 
> your simpler solutions.

Well, in many cases my "simple solutions" are too simple for people
who've gotten used to bloated feature sets and to putting up with
slowness, bugs, and insecurity. But we'll see. My whole family
of i18n-related projects started out with a desire to switch to UTF-8
everywhere and to have Latin, Tibetan, and Japanese support at the
console level without increased bloat, performance penalties, and huge
dependency trees. From there I first wrote a super-small UTF-8-only C
library and then turned towards the terminal emulator issue, which in
turn led to the font format issue, etc. etc. :) Maybe after a whole
year passes I'll have roughly what I wanted.

> <sarcasm>
> But you better hurry up with those simpler solutions, the increasing 
> creep of unnecessary complexity into software is happening fast. The 
> crash is coming! It will probably arrive with /The Singularity/.
> </sarcasm>

Keep an eye on busybox. It's quickly gaining features while shrinking
in size, and while its i18n support is currently rather poor, the
developers are open to adding good support as long as it's a
compile-time option. Along with my project I've been documenting the
quality, portability, i18n/m17n support, bloat, etc. of lots of other
software, and I'll eventually make the results publicly available.

Rich


> ------------------------------------------------------------------------
> Mark Leisher
> Computing Research Lab              We find comfort among those who
> New Mexico State University         agree with us, growth among those
> Box 30001, MSC 3CRL                 who don't.
> Las Cruces, NM  88003                 -- Frank A. Clark
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                         Somehow seems appropriate
                                           to the topic at hand.

