Rich Felker wrote:
I can say with certainty born of 10+ years of trying to implement an
implicit bidi reordering routine that "just does the right thing," there
are ambiguities that simply can't be avoided. Like your example.
Are one or both numbers associated with the RTL text or the LTR text?
Simple question, multiple answers. Some answers are simple, some are not.
Exactly. The Unicode bidi algorithm assumes that anyone putting bidi
characters in a text stream will give them special consideration and
manually resolve these issues with explicit embedding. That is, it
comes from the word processor mentality of the designers of Unicode.
They never stop to think that maybe an automated process that doesn't
know about character semantics could be writing strings, or that
syntax in a particular text file (like passwd, csv files, tsv files,
etc.) could preclude such treatment.
Did it ever occur to you that it wasn't the "word processing mentality"
of the Unicode designers that led to ambiguities surviving in plain
text? It is simply the fact that there is no nice, neat solution. Unicode
went further than just about anyone else in solving the general case of
reordering plain bidi text for display without explicit directional codes.
The Unicode bidi reordering algorithm is not fundamentally broken, it
simply provides a result that is correct in many, but not all cases. If
you can defy 30 years of experience in implicit bidi reordering
implementations and come up with one that does the correct thing all the
time, you could be a very rich man.
Why is implicit so important?
Why does plain text still exist?
A bidi algorithm with minimal/no
implicit behavior works fine as long as you are not mixing
languages/scripts, and when mixing scripts it makes sense to use
explicit embedding -- especially since the cases of mixed scripts that
MUST work without formatting controls are files that are meant to be
machine-interpreted as opposed to pretty-printed for human
consumption.
I'm not quite sure what point you are trying to make here. Do away with
plain text?
In particular, an algorithm that only applies reordering within single
'words' would give the desired effects for writing numbers in an RTL
context and for writing single LTR words in a RTL context or single
RTL words in a LTR context. Anything more than that (with unlimited
long range reordering behavior) would then require explicit embedding.
You are aware that numeric expressions can be written differently in
Hebrew and Arabic, yes? Sometimes the reordering of numeric expressions
differs (e.g. 1/2 in Latin and Hebrew would be presented as 2/1 in
Arabic). This also affects other characters often used with numbers, such
as percent and dollar sign. So even within strictly RTL scripts,
different reordering is required depending on which script is being
used. But if you know a priori which script is in use, reordering is
trivial.
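The script-dependent behavior comes straight from the character-level bidi classes the Unicode Character Database assigns: European digits, Arabic-Indic digits, and the separators around them all carry different classes, and the weak-type rules of UAX#9 resolve them differently depending on whether the nearest strong context is Hebrew (class R) or Arabic (class AL). A quick look at the raw classes using Python's standard `unicodedata` module:

```python
import unicodedata

# Bidi classes assigned by the Unicode Character Database:
print(unicodedata.bidirectional('1'))       # 'EN'  European Number
print(unicodedata.bidirectional('\u0661'))  # 'AN'  Arabic-Indic digit one
print(unicodedata.bidirectional('\u05d0'))  # 'R'   Hebrew alef
print(unicodedata.bidirectional('\u0627'))  # 'AL'  Arabic alef
print(unicodedata.bidirectional('/'))       # 'CS'  Common Separator
print(unicodedata.bidirectional('%'))       # 'ET'  European Terminator
```

UAX#9 rule W2 turns EN into AN when the last strong character was AL, which is exactly why 1/2 lays out differently after Arabic text than after Hebrew text.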
So you have a choice, adapt your config file reader to ignore a few
characters or come up with an algorithm that displays plain text
correctly all the time.
What should happen when editing source code? Should x = FOO(BAR);
have the argument on the left while x = FOO(bar); has it on the right?
Should source code require all RTL identifiers to be wrapped in
embedding codes? (They're illegal in ISO C and any language taking
identifier rules from ISO/IEC TR 10176, yet Hebrew and Arabic
characters are legal like all other characters used to write non-dead
languages.)
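The "word-local" reordering described above can be sketched in a few lines. This is a deliberately naive toy, not UAX#9: it reverses each maximal run of strong RTL characters (bidi class R or AL) and leaves neutrals, digits, and punctuation in place, whereas real implicit bidi would also pull the neutral characters sitting between two RTL runs into the reversed region. The function name is hypothetical:

```python
import unicodedata

def display_word_local(line):
    # Toy "word-local" reordering: reverse each maximal run of strong
    # right-to-left characters, leave everything else in place.
    # NOT UAX#9 -- numbers, neutrals between RTL runs, and explicit
    # embedding controls are all ignored.
    out, run = [], []
    for ch in line:
        if unicodedata.bidirectional(ch) in ('R', 'AL'):
            run.append(ch)
        else:
            out.extend(reversed(run))
            run = []
            out.append(ch)
    out.extend(reversed(run))
    return ''.join(out)

# Hebrew identifiers stay inside their parentheses; under full implicit
# bidi the '(' between two adjacent RTL runs would be reordered too.
print(display_word_local('x = \u05e4\u05e2\u05dc(\u05d0\u05d1\u05d2);'))
```

Under this scheme the argument stays on the same side regardless of the identifier's script, which is the behavior being argued for, at the cost of never handling long-range RTL runs.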
This is the choice of each programming language designer: either allow
directional override codes in the source or ban them. Those that ban
them obviously assume that knowledge of the language's syntax is
sufficient to allow an editor to present the source code text reasonably
well.
You left out the part where Unicode says that none of these things is
strictly required.
This is blatantly false. UAX#9 talks about PARAGRAPHS. Text files
consist of LINES. If you don't believe me read ISO/IEC 9899 or IEEE
1003.1-2001. The algorithm in UAX#9 simply cannot be applied to text
files as it is written because text files do not define paragraphs.
How is a line ending with newline in a text file not a paragraph? A
poorly formatted paragraph, to be sure, but a paragraph nonetheless. The
Unicode Standard says paragraph separators are required for the
reordering algorithm. There is no reason why a line can't be viewed as a
paragraph. And it even works reasonably well most of the time.
BTW, what part of ISO/IEC 9899 are you referring to? All I see is
§7.19.2.7, which says something about lines being limited to 254
characters and a terminating newline character. No definitions of lines
or paragraphs that I see offhand.
I'm aware that unlike many other standardization processes, the
Unicode Consortium was very inconsistent in its application of this
rule. Many people consider Han unification to break existing practice.
UCS-2 which they initially tried to push onto people, as well as
UTF-1, heavily broke existing practice as well. The semantics of SHY
break existing standards from ISO-8859. Replacing UCS-2 with UTF-16
broke existing practice on Windows by causing MS's implementation of
wchar_t to violate C/POSIX by not representing a complete character
anymore. Etc. etc. etc.
Han unification did indeed break existing practice, but I think you will
find that the IRG (group of representatives from all Han-using
countries) feels that in the long run, it was the best thing to do.
UCS-2 didn't so much break existing practice as come along at one of the
most confusing periods of internationalization retrofitting of the C
libraries and language. The wchar_t type was in the works before UCS-2
came along. And in most implementations it could hold a UCS-2 character.
I don't recall UTF-1 being around long enough to have much of an impact.
Consider how quickly it was discarded in favor of UTF-8. And I certainly
don't recall UTF-1 being forced on anyone.
On the other hand they had no problem with filling the beginning of
the BMP with useless legacy characters for the sake of compatibility,
thus forcing South[east] Asian scripts which use many characters in
each word into the 3-byte range of UTF-8...
Those "useless" legacy characters avoided breaking many existing
applications, most of which were not written for Southeast Asia. Some
scripts had to end up in the 3-byte range of UTF-8. Are you in a
position to determine who should and should not be in that range? Have
you even considered why they ended up in that range?
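The byte cost being argued about is a mechanical consequence of where a script sits in the code space: UTF-8 uses one byte for U+0000..U+007F, two for U+0080..U+07FF, and three for U+0800..U+FFFF. Hebrew and Arabic fall in the two-byte range; Thai and Devanagari fall in the three-byte range. A quick check with nothing beyond Python's built-in encoder:

```python
# UTF-8 length is purely a function of the code point value.
for ch, name in [('e',      'ASCII e'),
                 ('\u00e9', 'Latin-1 e-acute'),
                 ('\u05d0', 'Hebrew alef'),
                 ('\u0e01', 'Thai ko kai'),
                 ('\u0905', 'Devanagari a')]:
    print(name, len(ch.encode('utf-8')), 'byte(s)')
```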
Unicode is far from ideal, but it's what we're stuck with, I agree.
However UAX#9 is inconsistent with the definition of a text file and
with good programming practice and thus alternate ways to present RTL
text acceptably (such as an entirely RTL display for RTL users) are
needed. I've read rants from some of the Arabeyes folks that they're
so disappointed with UAX#9 that they'd rather go the awful route of
storing text backwards!!
So are you implying that good programming practice requires lines to be
ended with a newline and paragraphs to be separated by two newlines?
What about the 25-year convention of CRLF on DOS/Win? What about the
20-year practice of using CR on Mac? Should we denounce them as heretics to
be excommunicated and unilaterally dictate to all that newline is the
only answer, just like you seem to think the Unicode Consortium did?
Like others who didn't like the Unicode bidi reordering approach, the
Arabeyes people were welcome to continue doing things the way they
wanted. Interoperability problems often either kill such efforts or
force them to go Unicode at some level.
Why is it someone else's responsibility to code it? You are the one that
finds decades of experience unacceptable. Stop whining and fix it.
That's what I did. I'm still working on it 13 years later, but I'm not
complaining any more.
You did not fix it because it cannot be fixed anymore than you can
tell me whether 1,200 means the number 1200 (printed in an ugly legacy
form) or a CSV list of 1 and 200. Nor can I fix it. I'm well aware
that any implicit bidi at the terminal level WILL display blatantly
wrong and misleading information in numerous real world cases, and
that text will jump around the terminal in an unpredictable and
illogical fashion under cursor control and deletion, replacement, or
insertion over existing text. As such, I deem such a feature a waste
of time to implement. It will mess up more stuff than it 'fixes'.
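The 1,200 example is easy to make concrete: the same six characters form a well-formed token under two incompatible readings, and nothing in the character stream itself distinguishes them, so no display or parsing algorithm can pick the right one without outside knowledge:

```python
s = "1,200"

# Reading 1: a decimal number with a grouping separator.
as_number = int(s.replace(",", ""))
print(as_number)    # 1200

# Reading 2: two comma-separated fields.
as_fields = s.split(",")
print(as_fields)    # ['1', '200']
```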
So you do understand. If it isn't fixable, what point is there in
complaining about it? Find a better way.
The alternatives are either to display characters in the wrong order
(siht ekil) or to unify the flow of text to one direction without
altering the visual representation (which rotation accomplishes).
Naturally fullscreen programs can draw bidi text in its natural
directionality either by swapping character order (but some special
treatment of combining marks and Arabic shaping must be done by the
application first in order for this to work, and it will render
copy-and-paste from the terminal mostly useless) or by using ECMA-48
bidi controls (but don't expect anyone to use these until curses is
adapted for it and until screen supports it).
Hmm. Sounds just like a bidi reordering algorithm I heard about. You
know. The one the Unicode Consortium is touting.
I have a lot of experience with ECMA-48 (ISO/IEC 6429) and ISO/IEC 2022.
All I will say about them is Unicode is a lot easier to deal with. Have
a look at the old kterm code if you want to see how complicated things
can get. And that was one of the cleaner implementations I've seen over
the years.
Again, I would encourage you to try it yourself. Talk is cheap,
experience teaches.
Languages are messy. Scripts are not, except for a very few bidi
scripts. Even Indic scripts are relatively easy.
Hah! I often hear the same sentiment from people who don't know the
difference between a glyph and a character. Yes, it is true that Indic,
and even Khmer and Burmese scripts are relatively easy. All you need to
do is create the right set of glyphs.
This frequently gives you multiple glyph codes for each abstract
character. To do anything with the text, a mapping between glyph and
abstract character is necessary for every program that uses that text.
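The glyph-versus-character distinction is visible in Unicode itself, which carries legacy presentation-form blocks precisely so that glyph-coded text can be mapped back: each positional form of an Arabic letter is a separate code point whose compatibility decomposition recovers the abstract character, and ligature glyphs decompose to more than one. A sketch using the standard `unicodedata` module:

```python
import unicodedata

# U+FE8D (isolated form) and U+FE8E (final form) are glyph codes for
# the single abstract character ALEF, U+0627.
for glyph in ('\ufe8d', '\ufe8e'):
    abstract = unicodedata.normalize('NFKC', glyph)
    print(hex(ord(glyph)), 'maps to alef:', abstract == '\u0627')

# The lam-alef ligature glyph decomposes to TWO abstract characters.
print(unicodedata.normalize('NFKC', '\ufefb') == '\u0644\u0627')
```

This one-to-many mapping between glyph codes and characters is exactly why every program touching glyph-coded text needs the reverse table.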
Again, I would encourage you to try it yourself. Talk is cheap,
experience teaches.
UAX#9 requires
imposing language semantics onto characters, which is blatantly wrong
and which is the source of the mess.
If 30 years of experience has led to blatantly wrong semantics, then
quit whining about it and fix it! The Unicode Consortium isn't deaf,
dumb, or stupid. They have been known to honor actual evidence of
incorrect behavior and change things when necessary. But they aren't
going to change things just because you find it inconveniently complicated.
This insistence on making simple
things into messes is why no real developers want to support i18n and
why only the crap like GNOME and KDE and OOO support it decently. I
believe that both 'sides' are wrong and that universal i18n on unix is
possible but only after you accept that unix lives by the New Jersey
approach and not the MIT approach.
I have been complaining about the general trend to over-complicate and
over-standardize software for years. These days the "art" of programming
only exists in the output of a rare handful of programmers. Don't worry
about it. Software will collapse under its own weight in time. You just
have to be patient and wait until that happens and be ready with all
your simpler solutions.
<sarcasm>
But you better hurry up with those simpler solutions, the increasing
creep of unnecessary complexity into software is happening fast. The
crash is coming! It will probably arrive with /The Singularity/.
</sarcasm>
--
------------------------------------------------------------------------
Mark Leisher
Computing Research Lab We find comfort among those who
New Mexico State University agree with us, growth among those
Box 30001, MSC 3CRL who don't.
Las Cruces, NM 88003 -- Frank A. Clark
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/