On Fri, Sep 01, 2006 at 09:36:44AM -0600, Mark Leisher wrote:
> Rich Felker wrote:
> >
> >If that were the problem it would be trivial. The problems are much
> >more fundamental. The key examples you should look at are things like:
> >printf("%s %d %d %s\n", string1, number2, number3, string4); where the
> >output is intended to be columnar. Everything is fine until someone
> >puts in data where string1 ends in RTL text and string4 begins with
> >RTL text, in which case the numbers switch places. This kind of
> >instability is not just awkward; it shows that implicit bidi is
> >fundamentally broken.
>
> I can say with certainty born of 10+ years of trying to implement an
> implicit bidi reordering routine that "just does the right thing," there
> are ambiguities that simply can't be avoided. Like your example.
>
> Are one or both numbers associated with the RTL text or the LTR text?
> Simple question, multiple answers. Some answers are simple, some are not.

Exactly. The Unicode bidi algorithm assumes that anyone putting bidi
characters in a text stream will give them special consideration and
manually resolve these issues with explicit embedding. That is, it
comes from the word processor mentality of the designers of Unicode.
They never stop to think that maybe an automated process that doesn't
know about character semantics could be writing strings, or that
syntax in a particular text file (like passwd, csv files, tsv files,
etc.) could preclude such treatment.
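
The column swap in the printf example quoted above can be reproduced
with a toy model. This is only a sketch under loud assumptions, not an
implementation of UAX#9: uppercase ASCII stands in for RTL letters,
lowercase for LTR letters, the base direction is LTR, digits after an
RTL letter get level 2, and neutrals flanked by RTL on both sides get
level 1. The name toy_bidi_visual is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX 256
#define TOY_NEUTRAL 255

/* Reverse buf[lo..hi) in place. */
static void reverse_span(char *buf, size_t lo, size_t hi)
{
    while (lo + 1 < hi) {
        char t = buf[lo];
        buf[lo] = buf[hi - 1];
        buf[hi - 1] = t;
        lo++;
        hi--;
    }
}

/* Toy implicit reordering (assumption: uppercase = RTL, base = LTR).
 * Returns 0 on success, -1 if the input does not fit. */
int toy_bidi_visual(const char *logical, char *visual, size_t cap)
{
    unsigned char level[TOY_MAX];
    size_t n = strlen(logical), i, j;
    int last_strong_rtl = 0;

    if (n >= TOY_MAX || n >= cap)
        return -1;

    /* Pass 1: letters get a fixed level; digits follow the last letter. */
    for (i = 0; i < n; i++) {
        unsigned char c = (unsigned char)logical[i];
        if (isupper(c)) { level[i] = 1; last_strong_rtl = 1; }
        else if (islower(c)) { level[i] = 0; last_strong_rtl = 0; }
        else if (isdigit(c)) level[i] = last_strong_rtl ? 2 : 0;
        else level[i] = TOY_NEUTRAL;
    }

    /* Pass 2: a neutral run is RTL only if both neighbours are RTL-ish. */
    for (i = 0; i < n; ) {
        if (level[i] != TOY_NEUTRAL) { i++; continue; }
        j = i;
        while (j < n && level[j] == TOY_NEUTRAL) j++;
        {
            int before = i > 0 && level[i - 1] != 0;
            int after = j < n && level[j] != 0;
            unsigned char lv = (before && after) ? 1 : 0;
            for (; i < j; i++) level[i] = lv;
        }
    }

    /* Pass 3: reverse runs at level >= 2, then runs at level >= 1. */
    memcpy(visual, logical, n + 1);
    for (int lv = 2; lv >= 1; lv--) {
        for (i = 0; i < n; ) {
            if (level[i] < lv) { i++; continue; }
            j = i;
            while (j < n && level[j] >= lv) j++;
            reverse_span(visual, i, j);
            i = j;
        }
    }
    return 0;
}
```

With this model "abc 12 34 def" comes out unchanged, while
"ABC 12 34 DEF" comes out as "FED 34 12 CBA": the same format string
puts 34 in the column where 12 was a moment ago.
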
> The Unicode bidi reordering algorithm is not fundamentally broken, it
> simply provides a result that is correct in many, but not all cases. If
> you can defy 30 years of experience in implicit bidi reordering
> implementations and come up with one that does the correct thing all the
> time, you could be a very rich man.

Why is implicit so important? A bidi algorithm with minimal/no
implicit behavior works fine as long as you are not mixing
languages/scripts, and when mixing scripts it makes sense to use
explicit embedding -- especially since the cases of mixed scripts that
MUST work without formatting controls are files that are meant to be
machine-interpreted as opposed to pretty-printed for human
consumption.

In particular, an algorithm that only applies reordering within single
'words' would give the desired effects for writing numbers in an RTL
context and for writing single LTR words in an RTL context or single
RTL words in an LTR context. Anything more than that (with unlimited
long range reordering behavior) would then require explicit embedding.
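
A minimal sketch of what word-local reordering might look like, under
the same stand-in assumption that uppercase ASCII represents RTL
letters and the display is LTR. Each maximal run of RTL letters is
reversed where it stands; nothing ever moves across a space, digit, or
punctuation mark. The name word_local_visual is hypothetical, and a
real version would have to work on code points and handle combining
marks.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Word-local reordering sketch (assumption: uppercase = RTL letters).
 * 'visual' must have room for strlen(logical) + 1 bytes. */
void word_local_visual(const char *logical, char *visual)
{
    size_t n = strlen(logical);
    memcpy(visual, logical, n + 1);

    for (size_t i = 0; i < n; ) {
        if (!isupper((unsigned char)visual[i])) { i++; continue; }

        /* Find the end of this run of RTL letters. */
        size_t j = i;
        while (j < n && isupper((unsigned char)visual[j])) j++;

        /* Reverse visual[i..j) in place; nothing outside the run moves. */
        for (size_t lo = i, hi = j; lo + 1 < hi; lo++, hi--) {
            char t = visual[lo];
            visual[lo] = visual[hi - 1];
            visual[hi - 1] = t;
        }
        i = j;
    }
}
```

Here "ABC 12 34 DEF" becomes "CBA 12 34 FED": both numbers stay in
their columns, and anything needing longer-range movement would have
to ask for it explicitly.
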
> So you have a choice, adapt your config file reader to ignore a few
> characters or come up with an algorithm that displays plain text
> correctly all the time.

What should happen when editing source code? Should x = FOO(BAR);
have the argument on the left while x = FOO(bar); has it on the right?
Should source code require all RTL identifiers to be wrapped in
embedding codes? (They're illegal in ISO C and any language taking
identifier rules from ISO/IEC TR 10176, yet Hebrew and Arabic
characters are legal like all other characters used to write non-dead
languages.)

> >One of the unacceptable things that the Unicode consortium has done
> >(as opposed to ISO 10646 which, after their initial debacle, has been
> >quite reasonable and conservative in what they specify) is to presume
> >they can redefine what a text file is. This has included BOMs,
> >paragraph break character, implicit(?) deprecation of newline
> >character as a line/paragraph break, etc. Notice that all of these
> >redefinitions have been universally rejected by *NIX users because
> >they are incompatible with the *NIX notion of a text file. My view is
> >that implicit bidi is equally incompatible with text files and should
> >be rejected for the same reasons.
> >
>
> You left out the part where Unicode says that none of these things is
> strictly required.

This is blatantly false. UAX#9 talks about PARAGRAPHS. Text files
consist of LINES. If you don't believe me read ISO/IEC 9899 or IEEE
1003.1-2001. The algorithm in UAX#9 simply cannot be applied to text
files as it is written because text files do not define paragraphs.

> The *NIX community didn't reject anything. They
> didn't need to. You also seem unaware of how much effort was made by
> ISO, the Unicode Consortium, and all the national standards bodies to
> avoid breaking a lot of existing practice.

I'm aware that unlike many other standardization processes, the
Unicode Consortium was very inconsistent in its application of this
rule. Many people consider Han unification to break existing practice.
UCS-2, which they initially tried to push onto people, as well as
UTF-1, heavily broke existing practice as well. The semantics of SHY
break existing standards from ISO-8859. Replacing UCS-2 with UTF-16
broke existing practice on Windows by causing MS's implementation of
wchar_t to violate C/POSIX by not representing a complete character
anymore. Etc. etc. etc.
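
To make the wchar_t point concrete, here is a sketch of UTF-16
encoding for a single code point. Anything above U+FFFF needs two
16-bit units (a surrogate pair), so a 16-bit wchar_t can hold only half
of such a character. The function name utf16_encode is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16.
 * Returns the number of 16-bit units written (1 or 2), or 0 if 'cp'
 * is a surrogate or lies beyond U+10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {          /* BMP: one unit is enough */
        out[0] = (uint16_t)cp;
        return 1;
    }

    cp -= 0x10000;               /* 20 bits left, split 10 + 10 */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

For example, U+05D0 fits in one unit, while U+1D11E encodes as the
pair D834 DD1E.
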
On the other hand they had no problem with filling the beginning of
the BMP with useless legacy characters for the sake of compatibility,
thus forcing South[east] Asian scripts which use many characters in
each word into the 3-byte range of UTF-8...
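
The size penalty follows directly from the UTF-8 length boundaries. A
small sketch (the name utf8_len is hypothetical, and for brevity it
does not reject surrogate code points):

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bytes UTF-8 needs for one code point, or 0 if the value
 * is beyond U+10FFFF. Surrogates are not rejected in this sketch. */
size_t utf8_len(uint32_t cp)
{
    if (cp < 0x80)      return 1;   /* ASCII */
    if (cp < 0x800)     return 2;   /* e.g. Latin-1, Hebrew, Arabic */
    if (cp < 0x10000)   return 3;   /* rest of the BMP, e.g. Thai */
    if (cp <= 0x10FFFF) return 4;   /* supplementary planes */
    return 0;
}
```

So 'A' (U+0041) takes 1 byte, Hebrew alef (U+05D0) takes 2, and both
Thai ko kai (U+0E01) and Devanagari ka (U+0915) take 3.
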
Unicode is far from ideal, but it's what we're stuck with, I agree.
However UAX#9 is inconsistent with the definition of a text file and
with good programming practice and thus alternate ways to present RTL
text acceptably (such as an entirely RTL display for RTL users) are
needed. I've read rants from some of the Arabeyes folks that they're
so disappointed with UAX#9 that they'd rather go the awful route of
storing text backwards!!

> >I'm not unwilling to support implicit bidi if somebody else wants to
> >code it, but the output WILL BE WRONG in many cases and thus will be
> >off by default. The data needed to do it correctly is simply not
> >there.
>
> Why is it someone else's responsibility to code it? You are the one that
> finds decades of experience unacceptable. Stop whining and fix it.
> That's what I did. I'm still working on it 13 years later, but I'm not
> complaining any more.

You did not fix it because it cannot be fixed any more than you can
tell me whether 1,200 means the number 1200 (printed in an ugly legacy
form) or a CSV list of 1 and 200. Nor can I fix it. I'm well aware
that any implicit bidi at the terminal level WILL display blatantly
wrong and misleading information in numerous real world cases, and
that text will jump around the terminal in an unpredictable and
illogical fashion under cursor control and deletion, replacement, or
insertion over existing text. As such, I deem such a feature a waste
of time to implement. It will mess up more stuff than it 'fixes'.
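
The 1,200 ambiguity is easy to show in code. Both readers below are
reasonable, both accept the same bytes, and nothing in the data says
which one was meant. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Reading 1: comma is a thousands separator. Returns -1 on any
 * character that is neither a digit nor a comma. */
long parse_grouped(const char *s)
{
    long v = 0;
    for (; *s; s++) {
        if (*s == ',')
            continue;
        if (*s < '0' || *s > '9')
            return -1;
        v = v * 10 + (*s - '0');
    }
    return v;
}

/* Reading 2: comma is a CSV field delimiter. Stores up to 'cap'
 * fields in 'out' and returns how many were stored. */
size_t parse_csv_longs(const char *s, long *out, size_t cap)
{
    size_t n = 0;
    while (n < cap) {
        char *end;
        out[n++] = strtol(s, &end, 10);
        if (*end != ',')
            break;
        s = end + 1;
    }
    return n;
}
```

parse_grouped("1,200") yields 1200, while parse_csv_longs("1,200", ...)
yields the two fields 1 and 200.
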
The alternatives are either to display characters in the wrong order
(siht ekil) or to unify the flow of text to one direction without
altering the visual representation (which rotation accomplishes).

Naturally fullscreen programs can draw bidi text in its natural
directionality either by swapping character order (but some special
treatment of combining marks and Arabic shaping must be done by the
application first in order for this to work, and it will render
copy-and-paste from the terminal mostly useless) or by using ECMA-48
bidi controls (but don't expect anyone to use these until curses is
adapted for it and until screen supports it).

> Human languages and the scripts used to represent them are messy.
> There are no neat solutions. Get used to it.

Languages are messy. Scripts are not, except for a very few bidi
scripts. Even Indic scripts are relatively easy. UAX#9 requires
imposing language semantics onto characters which is blatantly wrong
and which is the source of the mess. This insistence on making simple
things into messes is why no real developers want to support i18n and
why only the crap like GNOME and KDE and OOO support it decently. I
believe that both 'sides' are wrong and that universal i18n on unix is
possible but only after you accept that unix lives by the New Jersey
approach and not the MIT approach.

Rich
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/