Rich Felker wrote:
If that were the problem it would be trivial. The problems are much
more fundamental. The key examples you should look at are things like:
printf("%s %d %d %s\n", string1, number2, number3, string4); where the
output is intended to be columnar. Everything is fine until someone
puts in data where string1 ends in RTL text and string4 begins with
RTL text, in which case the numbers switch places. This kind of
instability is not just awkward; it shows that implicit bidi is
fundamentally broken.
I can say, with certainty born of 10+ years of trying to implement an
implicit bidi reordering routine that "just does the right thing," that
there are ambiguities that simply can't be avoided. Your example is one
of them.
Are one or both numbers associated with the RTL text or the LTR text?
Simple question, multiple answers. Some answers are simple, some are not.
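
One way to see why the question has no single answer is to look at the bidi classes the Unicode Character Database assigns to the characters involved. The sketch below uses Python's standard `unicodedata` module; the helper name `bidi_classes` and the Hebrew sample string are illustrative assumptions, not anything from the thread.

```python
import unicodedata


def bidi_classes(text):
    """Return (code point, bidi class) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in text]


# Hebrew alef, space, the digits 1 and 2, space, Latin "a".
sample = "\u05d0 12 a"
for code_point, cls in bidi_classes(sample):
    print(code_point, cls)
```

Letters carry a strong class (`R` or `L`), while digits are `EN` (a weak type) and the space is `WS` (a neutral). The numbers and the spaces around them have no direction of their own, so any implicit algorithm has to infer one from the neighbouring text.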
The Unicode bidi reordering algorithm is not fundamentally broken; it
simply provides a result that is correct in many, but not all, cases. If
you can defy 30 years of experience in implicit bidi reordering
implementations and come up with one that does the correct thing all the
time, you could be a very rich man.
Implicit bidi requires interpreting a flow of plain text as
sentence/paragraph content which is simply not a reasonable
assumption. Consider also what would happen if your text file is two
preformatted 32-character-wide paragraph columns side-by-side. Now
imagine the kind of havoc that could result if this sort of insanity
took place in the presentation of configuration files with critical
security settings, for instance where the strings are usernames (which
MUST be able to contain any letter character from any language) and
the numbers are permission levels. And certainly you can't just throw
explicit direction markers into a config file like that because they'd
alter the semantics (which should be purely byte-oriented; there's no
reason any program not displaying text should include code to process
the contents).
So you have a choice: adapt your config file reader to ignore a few
characters, or come up with an algorithm that displays plain text
correctly all the time.
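
As a rough illustration of the first option, a reader could drop the directional formatting characters before interpreting a line. This is a minimal sketch: the "username level" line format and the function names are hypothetical, and the character set is the one Unicode currently lists under the Bidi_Control property.

```python
# Characters with the Unicode Bidi_Control property.
BIDI_CONTROLS = frozenset(
    "\u061c"                        # ARABIC LETTER MARK
    "\u200e\u200f"                  # LRM, RLM
    "\u202a\u202b\u202c\u202d\u202e"  # LRE, RLE, PDF, LRO, RLO
    "\u2066\u2067\u2068\u2069"      # LRI, RLI, FSI, PDI
)


def strip_bidi_controls(text):
    """Remove directional formatting characters, keep everything else."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


def parse_permission_line(line):
    """Parse a hypothetical 'username level' config line."""
    name, level = strip_bidi_controls(line).split()
    return name, int(level)


# A Hebrew username wrapped in RLM ... LRM so that it displays sensibly.
print(parse_permission_line("\u200f\u05d0\u05d1\u200e 7\n"))
```

The marks influence only how the line is displayed; the parsed username and level are the same with or without them. The cost is that the reader is no longer purely byte-oriented, which is the trade-off under dispute.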
One of the unacceptable things that the Unicode consortium has done
(as opposed to ISO 10646 which, after their initial debacle, has been
quite reasonable and conservative in what they specify) is to presume
they can redefine what a text file is. This has included BOMs, a
paragraph break character, implicit(?) deprecation of the newline
character as a line/paragraph break, etc. Notice that all of these
redefinitions have been universally rejected by *NIX users because
they are incompatible with the *NIX notion of a text file. My view is
that implicit bidi is equally incompatible with text files and should
be rejected for the same reasons.
You left out the part where Unicode says that none of these things is
strictly required. The *NIX community didn't reject anything. They
didn't need to. You also seem unaware of how much effort was made by
ISO, the Unicode Consortium, and all the national standards bodies to
avoid breaking a lot of existing practice.
I highly recommend participating in any standards development process
managed by any national or international standards body. You will find
an obsession with avoidance of breaking existing practice.
I'm not unwilling to support implicit bidi if somebody else wants to
code it, but the output WILL BE WRONG in many cases and thus will be
off by default. The data needed to do it correctly is simply not
there.
Why is it someone else's responsibility to code it? You are the one who
finds decades of experience unacceptable. Stop whining and fix it.
That's what I did. I'm still working on it 13 years later, but I'm not
complaining any more.
Human languages and the scripts used to represent them are messy. There
are no neat solutions. Get used to it.
Good day and good luck.
--
------------------------------------------------------------------------
Mark Leisher
Computing Research Lab            We find comfort among those who
New Mexico State University       agree with us, growth among those
Box 30001, MSC 3CRL               who don't.
Las Cruces, NM 88003                -- Frank A. Clark
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/