Rich Felker wrote:
I can say with certainty born of 10+ years of trying to implement an
implicit bidi reordering routine that "just does the right thing," there
are ambiguities that simply can't be avoided. Like your example.
Are one or both numbers associated with the RTL text or the LTR text?
Simple question, multiple answers. Some answers are simple, some are not.
Exactly. The Unicode bidi algorithm assumes that anyone putting bidi
characters in a text stream will give them special consideration and
manually resolve these issues with explicit embedding. That is, it
comes from the word processor mentality of the designers of Unicode.
They never stop to think that maybe an automated process that doesn't
know about character semantics could be writing strings, or that
syntax in a particular text file (like passwd, csv files, tsv files,
etc.) could preclude such treatment.
Did it ever occur to you that it wasn't the "word processing mentality"
of the Unicode designers that led to ambiguities surviving in plain
text? It is simply the fact that there is no nice, neat solution. Unicode
went further than just about anyone else in solving the general case of
reordering plain bidi text for display without explicit directional codes.
The Unicode bidi reordering algorithm is not fundamentally broken, it
simply provides a result that is correct in many, but not all cases. If
you can defy 30 years of experience in implicit bidi reordering
implementations and come up with one that does the correct thing all the
time, you could be a very rich man.
Why is implicit so important?
Why does plain text still exist?
A bidi algorithm with minimal/no
implicit behavior works fine as long as you are not mixing
languages/scripts, and when mixing scripts it makes sense to use
explicit embedding -- especially since the cases of mixed scripts that
MUST work without formatting controls are files that are meant to be
machine-interpreted as opposed to pretty-printed for human
consumption.
I'm not quite sure what point you are trying to make here. Do away with
plain text?
In particular, an algorithm that only applies reordering within single
'words' would give the desired effects for writing numbers in an RTL
context and for writing single LTR words in a RTL context or single
RTL words in a LTR context. Anything more than that (with unlimited
long range reordering behavior) would then require explicit embedding.
You are aware that numeric expressions can be written differently in
Hebrew and Arabic, yes? Sometimes the reordering of numeric expressions
differs (e.g. 1/2 in Latin and Hebrew would be presented as 2/1 in
Arabic). This also affects other characters often used with numbers, such
as percent and dollar sign. So even within strictly RTL scripts,
different reordering is required depending on which script is being
used. But if you know a priori which script is in use, reordering is
trivial.
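The script-dependent behavior comes straight from the character-level bidi classes the Unicode Character Database assigns: European digits, Arabic-Indic digits, and the separators around them all carry different classes, and the weak-type rules of UAX#9 resolve them differently depending on whether the nearest strong context is Hebrew (class R) or Arabic (class AL). A quick look at the raw classes using Python's standard `unicodedata` module:

```python
import unicodedata

# Bidi classes assigned by the Unicode Character Database:
print(unicodedata.bidirectional('1'))       # 'EN'  European Number
print(unicodedata.bidirectional('\u0661'))  # 'AN'  Arabic-Indic digit one
print(unicodedata.bidirectional('\u05d0'))  # 'R'   Hebrew alef
print(unicodedata.bidirectional('\u0627'))  # 'AL'  Arabic alef
print(unicodedata.bidirectional('/'))       # 'CS'  Common Separator
print(unicodedata.bidirectional('%'))       # 'ET'  European Terminator
```

UAX#9 rule W2 turns EN into AN when the last strong character was AL, which is exactly why 1/2 lays out differently after Arabic text than after Hebrew text.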
So you have a choice, adapt your config file reader to ignore a few
characters or come up with an algorithm that displays plain text
correctly all the time.
What should happen when editing source code? Should x = FOO(BAR);
have the argument on the left while x = FOO(bar); has it on the right?
Should source code require all RTL identifiers to be wrapped in
embedding codes? (They're illegal in ISO C and any language taking
identifier rules from ISO/IEC TR 10176, yet Hebrew and Arabic
characters are legal like all other characters used to write non-dead
languages.)
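The "word-local" reordering described above can be sketched in a few lines. This is a deliberately naive toy, not UAX#9: it reverses each maximal run of strong RTL characters (bidi class R or AL) and leaves neutrals, digits, and punctuation in place, whereas real implicit bidi would also pull the neutral characters sitting between two RTL runs into the reversed region. The function name is hypothetical:

```python
import unicodedata

def display_word_local(line):
    # Toy "word-local" reordering: reverse each maximal run of strong
    # right-to-left characters, leave everything else in place.
    # NOT UAX#9 -- numbers, neutrals between RTL runs, and explicit
    # embedding controls are all ignored.
    out, run = [], []
    for ch in line:
        if unicodedata.bidirectional(ch) in ('R', 'AL'):
            run.append(ch)
        else:
            out.extend(reversed(run))
            run = []
            out.append(ch)
    out.extend(reversed(run))
    return ''.join(out)

# Hebrew identifiers stay inside their parentheses; under full implicit
# bidi the '(' between two adjacent RTL runs would be reordered too.
print(display_word_local('x = \u05e4\u05e2\u05dc(\u05d0\u05d1\u05d2);'))
```

Under this scheme the argument stays on the same side regardless of the identifier's script, which is the behavior being argued for, at the cost of never handling long-range RTL runs.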
This is the choice of each programming language designer: either allow
directional override codes in the source or ban them. Those that ban
them obviously assume that knowledge of the language's syntax is
sufficient to allow an editor to present the source code text reasonably
well.
You left out the part where Unicode says that none of these things is
strictly required.
This is blatantly false. UAX#9 talks about PARAGRAPHS. Text files
consist of LINES. If you don't believe me read ISO/IEC 9899 or IEEE
1003.1-2001. The algorithm in UAX#9 simply cannot be applied to text
files as it is written because text files do not define paragraphs.
How is a line ending with newline in a text file not a paragraph? A
poorly formatted paragraph, to be sure, but a paragraph nonetheless. The
Unicode Standard says paragraph separators are required for the
reordering algorithm. There is no reason why a line can't be viewed as a
paragraph. And it even works reasonably well most of the time.
BTW, what part of ISO/IEC 9899 are you referring to? All I see is
§7.19.2.7, which says something about lines being limited to 254
characters and a terminating newline character. No definitions of lines
or paragraphs that I see offhand.
I'm aware that unlike many other standardization processes, the
Unicode Consortium was very inconsistent in its application of this
rule. Many people consider Han unification to break existing practice.
UCS-2 which they initially tried to push onto people, as well as
UTF-1, heavily broke existing practice as well. The semantics of SHY
break existing standards from ISO-8859. Replacing UCS-2 with UTF-16
broke existing practice on Windows by causing MS's implementation of
wchar_t to violate C/POSIX by not representing a complete character
anymore. Etc. etc. etc.
Han unification did indeed break existing practice, but I think you will
find that the IRG (group of representatives from all Han-using
countries) feels that in the long run, it was the best thing to do.
UCS-2 didn't so much break existing practice as come along at one of the
most confusing periods of internationalization retrofitting of the C
libraries and language. The wchar_t type was in the works before UCS-2
came along. And in most implementations it could hold a UCS-2 character.
I don't recall UTF-1 being around long enough to have much of an impact.
Consider how quickly it was discarded in favor of UTF-8. And I certainly
don't recall UTF-1 being forced on anyone.
On the other hand they had no problem with filling the beginning of
the BMP with useless legacy characters for the sake of compatibility,
thus forcing South[east] Asian scripts which use many characters in
each word into the 3-byte range of UTF-8...
Those "useless" legacy characters avoided breaking many existing
applications, most of which were not written for Southeast Asia. Some
scripts had to end up in the 3-byte range of UTF-8. Are you in a
position to determine who should and should not be in that range? Have
you even considered why they ended up in that range?
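The byte cost being argued about is a mechanical consequence of where a script sits in the code space: UTF-8 uses one byte for U+0000..U+007F, two for U+0080..U+07FF, and three for U+0800..U+FFFF. Hebrew and Arabic fall in the two-byte range; Thai and Devanagari fall in the three-byte range. A quick check with nothing beyond Python's built-in encoder:

```python
# UTF-8 length is purely a function of the code point value.
for ch, name in [('e',      'ASCII e'),
                 ('\u00e9', 'Latin-1 e-acute'),
                 ('\u05d0', 'Hebrew alef'),
                 ('\u0e01', 'Thai ko kai'),
                 ('\u0905', 'Devanagari a')]:
    print(name, len(ch.encode('utf-8')), 'byte(s)')
```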
Unicode is far from ideal, but it's what we're stuck with, I agree.
However UAX#9 is inconsistent with the definition of a text file and
with good programming practice and thus alternate ways to present RTL
text acceptably (such as an entirely RTL display for RTL users) are
needed. I've read rants from some of the Arabeyes folks that they're
so disappointed with UAX#9 that they'd rather go the awful route of
storing text backwards!!
So are you implying that good programming practice requires lines to be
ended with a newline and paragraphs to be separated by two newlines?
What about the 25-year convention of CRLF on DOS/Win? What about the
20-year practice of using CR on Mac? Should we denounce them as heretics to
be excommunicated and unilaterally dictate to all that newline is the
only answer, just like you seem to think the Unicode Consortium did?
Like others who didn't like the Unicode bidi reordering approach, the
Arabeyes people were welcome to continue doing things the way they
wanted. Interoperability problems often either kill such efforts or
force them to go Unicode at some level.
Why is it someone else's responsibility to code it? You are the one that
finds decades of experience unacceptable. Stop whining and fix it.
That's what I did. I'm still working on it 13 years later, but I'm not
complaining any more.
You did not fix it because it cannot be fixed anymore than you can
tell me whether 1,200 means the number 1200 (printed in an ugly legacy
form) or a CSV list of 1 and 200. Nor can I fix it. I'm well aware
that any implicit bidi at the terminal level WILL display blatantly
wrong and misleading information in numerous real world cases, and
that text will jump around the terminal in an unpredictable and
illogical fashion under cursor control and deletion, replacement, or
insertion over existing text. As such, I deem such a feature a waste
of time to implement. It will mess up more stuff than it 'fixes'.
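The 1,200 example is easy to make concrete: the same six characters form a well-formed token under two incompatible readings, and nothing in the character stream itself distinguishes them, so no display or parsing algorithm can pick the right one without outside knowledge:

```python
s = "1,200"

# Reading 1: a decimal number with a grouping separator.
as_number = int(s.replace(",", ""))
print(as_number)    # 1200

# Reading 2: two comma-separated fields.
as_fields = s.split(",")
print(as_fields)    # ['1', '200']
```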
So you do understand. If it isn't fixable, what point is there in
complaining about it? Find a better way.
The alternatives are either to display characters in the wrong order
(siht ekil) or to unify the flow of text to one direction without
altering the visual representation (which rotation accomplishes).
Naturally fullscreen programs can draw bidi text in its natural
directionality either by swapping character order (but some special
treatment of combining marks and Arabic shaping must be done by the
application first in order for this to work, and it will render
copy-and-paste from the terminal mostly useless) or by using ECMA-48
bidi controls (but don't expect anyone to use these until curses is
adapted for it and until screen supports it).
Hmm. Sounds just like a bidi reordering algorithm I heard about. You
know. The one the Unicode Consortium is touting.
I have a lot of experience with ECMA-48 (ISO/IEC 6429) and ISO/IEC 2022.
All I will say about them is Unicode is a lot easier to deal with. Have
a look at the old kterm code if you want to see how complicated things
can get. And that was one of the cleaner implementations I've seen over
the years.
Again, I would encourage you to try it yourself. Talk is cheap,
experience teaches.
Languages are messy. Scripts are not, except for a very few bidi
scripts. Even Indic scripts are relatively easy.
Hah! I often hear the same sentiment from people who don't know the
difference between a glyph and a character. Yes, it is true that Indic,
and even Khmer and Burmese scripts are relatively easy. All you need to
do is create the right set of glyphs.
This frequently gives you multiple glyph codes for each abstract
character. To do anything with the text, a mapping between glyph and
abstract character is necessary for every program that uses that text.
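The glyph-versus-character distinction is visible in Unicode itself, which carries legacy presentation-form blocks precisely so that glyph-coded text can be mapped back: each positional form of an Arabic letter is a separate code point whose compatibility decomposition recovers the abstract character, and ligature glyphs decompose to more than one. A sketch using the standard `unicodedata` module:

```python
import unicodedata

# U+FE8D (isolated form) and U+FE8E (final form) are glyph codes for
# the single abstract character ALEF, U+0627.
for glyph in ('\ufe8d', '\ufe8e'):
    abstract = unicodedata.normalize('NFKC', glyph)
    print(hex(ord(glyph)), 'maps to alef:', abstract == '\u0627')

# The lam-alef ligature glyph decomposes to TWO abstract characters.
print(unicodedata.normalize('NFKC', '\ufefb') == '\u0644\u0627')
```

This one-to-many mapping between glyph codes and characters is exactly why every program touching glyph-coded text needs the reverse table.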
Again, I would encourage you to try it yourself. Talk is cheap,
experience teaches.
UAX#9 requires
imposing language semantics onto characters, which is blatantly wrong
and which is the source of the mess.
If 30 years of experience has led to blatantly wrong semantics, then
quit whining about it and fix it! The Unicode Consortium isn't deaf,
dumb, or stupid. They have been known to honor actual evidence of
incorrect behavior and change things when necessary. But they aren't
going to change things just because you find it inconveniently complicated.
This insistence on making simple
things into messes is why no real developers want to support i18n and
why only the crap like GNOME and KDE and OOO support it decently. I
believe that both 'sides' are wrong and that universal i18n on unix is
possible but only after you accept that unix lives by the New Jersey
approach and not the MIT approach.
I have been complaining about the general trend to over-complicate and
over-standardize software for years. These days the "art" of programming
only exists in the output of a rare handful of programmers. Don't worry
about it. Software will collapse under its own weight in time. You just
have to be patient and wait until that happens and be ready with all
your simpler solutions.
<sarcasm>
But you better hurry up with those simpler solutions, the increasing
creep of unnecessary complexity into software is happening fast. The
crash is coming! It will probably arrive with /The Singularity/.
</sarcasm>
--
------------------------------------------------------------------------
Mark Leisher
Computing Research Lab We find comfort among those who
New Mexico State University agree with us, growth among those
Box 30001, MSC 3CRL who don't.
Las Cruces, NM 88003 -- Frank A. Clark
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/