Rich Felker wrote:

It went farther because it imposed language-specific semantics in
places where they do not belong. These semantics are correct for
sentences written in human languages, which would not have been hard
to mark up explicitly, especially with a word processor doing it for
you. On the other hand, they're horribly wrong in computer languages
(meaning any text file meant to be read and interpreted by a computer,
not just programming languages), where explicit markup is highly
undesirable or even illegal.


The Unicode Consortium is quite correctly more concerned with human
languages than programming languages. I think you are arguing yourself
into a dead end. Programming languages are ephemeral and some might argue they are in fact slowly converging with human languages.

Why does plain text still exist?

Read Eric Raymond's "The Art of Unix Programming". He answers the
question quite well.


You missed the point completely. Support of implicit bidirectionality
exists precisely because plain text exists. And it isn't going away any
time soon.

Or I could just ask: should we write C code in MS Word .doc format?

No reason to. Programming editors work well as they are and will
continue to work well after being adapted for Unicode.

I'm not quite sure what point you are trying to make here. Do away with plain text?

No, rather that handling of bidi scripts in plain text should be
biased towards computer languages rather than human languages. This is
both because plain text files are declining in use for human-language
texts and increasing in use for computer-language texts, and because
the display issues in human-language texts can be solved with explicit
embedding markers (which an editor or word processor could even
auto-insert for you), while the same marks are unwelcome in computer
languages.


You don't appear to have any experience writing lexical scanners for
programming languages. If you did, you would know how utterly trivial it
is to ignore embedded bidi codes an editor might introduce.

Though I haven't checked myself, I wouldn't be surprised if Perl,
Python, PHP, and a host of other programming languages were already
doing this, making your concerns pointless. You would probably find it
instructive to look at some lexical scanners.
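
For what it's worth, here is a minimal sketch (mine, not taken from any
particular language's scanner) of how a UTF-8-aware lexer could simply
drop the explicit bidi format characters. The names are made up; a real
scanner would fold the test into its "skip ignorable characters" path:

    #include <stdint.h>
    #include <stddef.h>

    /* Nonzero if cp is one of the explicit bidi format characters:
     * U+200E LRM, U+200F RLM, U+202A LRE, U+202B RLE,
     * U+202C PDF, U+202D LRO, U+202E RLO. */
    static int is_bidi_format(uint32_t cp)
    {
        return cp == 0x200E || cp == 0x200F ||
               (cp >= 0x202A && cp <= 0x202E);
    }

    /* Strip bidi format characters from an already-decoded buffer of
     * code points, in place; returns the new length. */
    static size_t strip_bidi(uint32_t *buf, size_t len)
    {
        size_t i, j = 0;
        for (i = 0; i < len; i++)
            if (!is_bidi_format(buf[i]))
                buf[j++] = buf[i];
        return j;
    }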

In particular, an algorithm that only applies reordering within single
'words' would give the desired effects for writing numbers in an RTL
context and for writing single LTR words in a RTL context or single
RTL words in a LTR context. Anything more than that (with unlimited
long range reordering behavior) would then require explicit embedding.
You are aware that numeric expressions can be written differently in Hebrew and Arabic, yes? Sometimes the reordering of numeric expressions differs (e.g. 1/2 in Latin and Hebrew would be presented as 2/1 in Arabic). This also affects other characters often used with numbers, such as the percent and dollar signs. So even within strictly RTL scripts, different reordering is required depending on which script is being used. But if you know a priori which script is in use, reordering is trivial.
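
For concreteness, a word-at-a-time reordering of the kind proposed
above could be sketched roughly as follows. This is my own sketch under
obvious simplifications: the input is already decoded to code points,
only the main Hebrew and Arabic blocks are classified as RTL, and the
numeric-expression differences just mentioned are not handled at all:

    #include <stdint.h>
    #include <stddef.h>

    /* Crude classifier: only the main Hebrew and Arabic blocks. */
    static int is_rtl(uint32_t cp)
    {
        return (cp >= 0x0590 && cp <= 0x05FF) ||   /* Hebrew */
               (cp >= 0x0600 && cp <= 0x06FF);     /* Arabic */
    }

    /* Reverse s[a..b) in place. */
    static void reverse_run(uint32_t *s, size_t a, size_t b)
    {
        while (b - a > 1) {
            uint32_t t = s[a];
            s[a++] = s[--b];
            s[b] = t;
        }
    }

    /* Reorder for display by reversing each maximal run of RTL
     * characters. Runs never span spaces or LTR text, so no
     * long-range reordering can occur. */
    static void reorder_words(uint32_t *s, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            if (is_rtl(s[i])) {
                size_t start = i;
                while (i < len && is_rtl(s[i]))
                    i++;
                reverse_run(s, start, i);
            } else {
                i++;
            }
        }
    }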

This is part of the "considered harmful" of bidi. :)
I'm not familiar with all this stuff, but as a mathematician I'm
curious how mathematicians working in these languages write. BTW
mathematical notation is an interesting example where traditional
storage order is visual and not logical.


Considered harmful? This is standard practice in these languages and has
been for a long time. You can't seriously expect readers of RTL
languages to just throw away everything they've learned since childhood
and learn to read their mathematical expressions backwards? Or simply
require that their scripts never appear in a plain text file? That is
ignorant at best and arrogant at worst.

This is the choice of each programming language designer: either allow directional override codes in the source or ban them. Those that ban them obviously assume that knowledge of the language's syntax is sufficient to allow an editor to present the source code text reasonably well.

It's simply not acceptable to need an editor that's aware of language
syntax in order to present the code for viewing and editing. You could
work around the problem by inserting dummy comments to prevent the
bidi algo from taking effect but that's really ugly and essentially
makes RTL scripts unusable in programming if the editor applies
Unicode bidi algo to the display.


You really need to start looking at code and stop pontificating from a
poorly understood position. Just about every programming editor out
there is already aware of programming language syntax. Many different
programming languages in most cases.

You left out the part where Unicode says that none of these things is strictly required.
This is blatantly false. UAX#9 talks about PARAGRAPHS. Text files
consist of LINES. If you don't believe me read ISO/IEC 9899 or IEEE
1003.1-2001. The algorithm in UAX#9 simply cannot be applied to text
files as it is written because text files do not define paragraphs.
How is a line ending with newline in a text file not a paragraph? A poorly formatted paragraph, to be sure, but a paragraph nonetheless. The Unicode Standard says paragraph separators are required for the reordering algorithm. There is no reason why a line can't be viewed as a paragraph. And it even works reasonably well most of the time.
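
Concretely, the line-as-paragraph approach amounts to nothing more than
this. The reorder_paragraph() below is a stub of my own standing in for
a real UAX #9 implementation, not an actual API:

    #include <stdio.h>
    #include <string.h>

    /* Stub standing in for a UAX #9 implementation: a real one would
     * reorder the line from logical to visual order, taking the base
     * direction from the first strong character it finds. */
    static void reorder_paragraph(char *line)
    {
        (void)line;   /* identity in this sketch */
    }

    int main(int argc, char **argv)
    {
        char line[4096];
        FILE *f = argc > 1 ? fopen(argv[1], "r") : stdin;
        if (!f)
            return 1;

        /* Each newline-terminated line is treated as a paragraph.
         * Embedding that spans lines of the original (semantic)
         * paragraph is lost, hence the occasional glitches. */
        while (fgets(line, sizeof line, f)) {
            line[strcspn(line, "\n")] = '\0';
            reorder_paragraph(line);
            puts(line);
        }
        return 0;
    }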

Actually it does not work with embedding. If a (semantic) paragraph
has been split into multiple lines, the bidi embedding levels will be
broken and cannot be processed by the UAX#9 algorithm without trying
to reconstruct an idea of what the whole "paragraph" meant. Also, a
problem occurs if the first character of a new line happens to be an
LTR character (some embedded English text?) in a (semantic) paragraph
that's Arabic or Hebrew.


This is trivially obvious. Why do you think I said "poorly formatted
paragraph"? The obvious implication is that every once in a while,
reordering errors will happen because the algorithm is being applied to
a single line of a paragraph.

As you acknowledge below, a line is not necessarily an
unlimited-length object, and in email it should not be longer than 80
characters (or preferably 72 or so to allow for quoting). So you can't
necessarily just take the MS Notepad approach of omitting newlines and
treating lines as paragraphs, although this may be appropriate in some
uses of text files.

So instead of a substantive argument why a line can't be viewed as a
paragraph, you simply imply that it just can't be done. Weak.


I'm talking about the definition of a text file as a sequence of
lines, which might (on stupid legacy implementations) even be
fixed-width fields. It's under the stdio stuff about the difference
between text and binary mode. I could look it up but I don't feel like
digging thru the pdf file right now..


That section doesn't provide definitions of line or paragraph.

Han unification did indeed break existing practice, but I think you will find that the IRG (group of representatives from all Han-using countries) feels that in the long run, it was the best thing to do.

I agree it was the best thing to do, too. I just pointed it out as
being contrary to your claim that they made every effort not to break
existing practice.


For a mathematician, you are quite good at ignoring inconvenient logic.
The phrase "every effort to avoid breaking existing practice" does not
logically imply that no existing practice was broken. Weak.


I suppose this is true and I don't know the history and internal
politics well enough to know who was responsible for what. However,
unlike doublebyte/multibyte charsets which were becoming prevalent at
the time, UCS-2 data does not form valid C strings. A quick glance at
some historical remarks from unicode.org and Rob Pike suggests that
UTF-8 was invented well before any serious deployment of Unicode, i.e.
that the push for UCS-2 was deliberately aimed at breaking things,
though I suspect it was Apple, Sun, and Microsoft pushing UCS-2 more
than the consortium as a whole.
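
The C string point is easy to demonstrate: any character below U+0100
encodes in UCS-2 with a zero byte, and the byte-oriented string
functions treat that as the terminator. A trivial illustration:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "Hi" in little-endian UCS-2: every ASCII character carries
         * an embedded 0x00 byte. */
        char ucs2[] = { 'H', 0x00, 'i', 0x00, 0x00, 0x00 };

        /* strlen() stops at the first zero byte. */
        printf("%zu\n", strlen(ucs2));   /* prints 1: only "H" seen */

        /* The same text in UTF-8 contains no zero bytes at all. */
        printf("%zu\n", strlen("Hi"));   /* prints 2 */
        return 0;
    }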


You can ask any of the Unicode people from those companies and will get
the same answer. Something had to be done and UCS-2 was the answer at
the time. Conspiracy theories do not a substantive argument make.

Those "useless" legacy characters avoided breaking many existing applications, most of which were not written for Southeast Asia. Some scripts had to end up in the 3-byte range of UTF-8. Are you in a position to determine who should and should not be in that range? Have

IMO the answer is common sense. Languages that have a low information
per character density (lots of letters/marks per word, especially
Indic) should be in the 2-byte range and those with high information
density (especially ideographic) should be in the 3-byte range. If it
weren't for so many legacy Latin blocks near the beginning of the
character set, most or all scripts for low-density languages could
have fit in the 2-byte range.
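
For reference, the byte ranges in question are fixed by the encoding
itself: one byte up to U+007F, two bytes up to U+07FF, three bytes for
the rest of the BMP, and four above it. In code:

    #include <stdint.h>

    /* Bytes needed to encode one code point in UTF-8. */
    static int utf8_len(uint32_t cp)
    {
        if (cp < 0x80)    return 1;  /* U+0000..U+007F: ASCII */
        if (cp < 0x800)   return 2;  /* U+0080..U+07FF: Latin suppl.
                                        through Arabic, etc. */
        if (cp < 0x10000) return 3;  /* U+0800..U+FFFF: Indic, CJK,
                                        and the rest of the BMP */
        return 4;                    /* supplementary planes */
    }

    /* Devanagari KA, U+0915, already falls in the 3-byte range,
     * which is exactly the allocation being complained about. */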


So you simply assume that nobody bothered to look into things like
information density during the formation of the Unicode Standard? You
don't appear to be aware of the social and political ramifications
involved in making decisions like that. It doesn't matter if it makes
sense from a mathematical point of view; nations and people are
involved.

you even considered why they ended up in that range?

Probably at the time of allocation, UTF-8 was not even around yet. I
haven't studied that portion of Unicode history. Still, the legacy
characters could have been put at the end with the CJK compatibility
forms, preshaped Arabic forms, etc., or even outside the BMP.

Scripts were placed when information about their encodings became
available to the Unicode Consortium. It's that simple. No big conspiracy
to give SEA scripts short shrift.


So you do understand. If it isn't fixable, what point is there in complaining about it? Find a better way.

That's what I'm trying to do... Maybe some Hebrew or Arabic users who
dislike the whole bidi mess (the one Israeli user I'm in contact with
hates bidi and thinks it's backwards...not a good sample size but
interesting nonetheless) will agree and try my ideas for a
unidirectional presentation and like them. Or maybe they'll think it's
ugly and look for other solutions.


Sure. Lots of people don't like the situation, but nobody has come up
with anything better. There is a very good reason for that.


Applications can draw their own bidi text with higher level formatting
information, of course. I'm thinking of a terminal-mode browser that
has the bidi text in HTML with <dir> tags and whatnot, or apps with a
text 'gui' consisting of separated interface elements.


Ahh. Yes. That sounds a lot like lynx. A popular terminal-mode browser.
Have you checked out how it handles Unicode?

I have a lot of experience

Could you tell me some of what you've worked on and what conclusions
you reached? I'm not familiar with your work.

Well, you can refer to the kterm code for some of my work with ISO/IEC 2022, and I may be able to dig up an ancient version of Motif (ca. 1993) I adapted to use ISO/IEC 6429 and ISO/IEC 2022, and shortly after that first Motif debacle, I attempted unsuccessfully to get a variant of cxterm working with a combination of the two standards.

The conclusion was simple. The code quickly got too complicated to debug. All kinds of little boundary (buffer/screen) effects kept cropping up thanks to multi-byte escape sequences.


ISO 2022 is an abomination, certainly not an acceptable way to store
text due to its stateful nature, and although it works for
_displaying_ text, it's ugly even for that.

I've read ECMA-48 bidi stuff several times and still can't make any
sense of it, so I agree it's disgusting too. It does seem powerful but
powerful is often a bad thing. :)


Well, ISO/IEC 2022 and ISO/IEC 6429 do things the same way: multibyte escape sequences.

All I will say about them is that Unicode is a lot easier to deal with. Have

Easier to deal with because it solves an easier problem. UAX#9 tells
you what to do when you have explicit paragraph division and unbounded
search capability forwards and backwards. Neither of these exists in a
character cell device environment, and (depending on your view of what
constitutes a proper text file) possibly not in a text file either. My
view of a text file (maybe not very popular these days?) is that it's
a more-restricted version of a character cell terminal (no cursor
positioning allowed) but with unlimited height.

Having implemented UAX #9 and a couple of other approaches that produce the same or similar results, I don't see any problem using it to render text files. If your text file has one paragraph per line, then you will see occasional glitches in mixed LTR & RTL text.


a look at the old kterm code if you want to see how complicated things can get. And that was one of the cleaner implementations I've seen over the years.

Does it implement the ECMA-48 version of bidi? Or random, unspecified
bidi like mlterm? Or..?

kterm had ISO/IEC 2022 support. Very few people attempted to use ISO/IEC 6429 because they didn't understand it very well and they knew how complicated ISO/IEC 2022 was all by itself.

Hah! I often hear the same sentiment from people who don't know the difference between a glyph and a character.

I think we've established that I know the difference..

Yes, it is true that Indic, and even Khmer and Burmese scripts are relatively easy. All you need to do is create the right set of glyphs.

Exactly. That's a lot of work...for the font designer. Almost no work
for the application author or for the machine at runtime.

This frequently gives you multiple glyph codes for each abstract character. To do anything with the text, a mapping between glyph and abstract character is necessary for every program that uses that text.

No, it's necessary only for the terminal. The programs using the text
need not have any idea what language/script it comes from. This is the
whole beauty of using such apps.


I suspect you missed my point. Using glyph codes as an encoding gets complicated fast. You can ask anyone who has tried to do any serious NLP work with pre-Unicode Indic text. We are still having to write analysers and converters to figure out the correct abstract characters and their order for many scripts. I can provide a mapping table for one Burmese encoding that shows how hideously complicated it can get to map a glyph encoding to the underlying linear abstract character necessary to do any kind of linguistic analysis.
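
To make that concrete with characters that are actually in Unicode
rather than a proprietary Burmese encoding: the Arabic presentation
forms give four distinct glyph codes for the single abstract letter
BEH (U+0628). A toy mapping table, which ignores the reordering and
decomposition that real glyph encodings also need, looks like this:

    #include <stdint.h>
    #include <stddef.h>

    /* The four contextual glyph codes for ARABIC LETTER BEH in the
     * Presentation Forms-B block all map back to one abstract
     * character, U+0628. */
    static const struct { uint32_t glyph, abstract; } beh_map[] = {
        { 0xFE8F, 0x0628 },   /* isolated form */
        { 0xFE90, 0x0628 },   /* final form    */
        { 0xFE91, 0x0628 },   /* initial form  */
        { 0xFE92, 0x0628 },   /* medial form   */
    };

    static uint32_t to_abstract(uint32_t glyph)
    {
        size_t i;
        for (i = 0; i < sizeof beh_map / sizeof beh_map[0]; i++)
            if (beh_map[i].glyph == glyph)
                return beh_map[i].abstract;
        return glyph;   /* not a presentation form we know about */
    }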

They generally don't change things in incompatible ways, certainly not
in ways that would require retrofitting existing data with proper
embedding codes. What they might consider doing though is adding a
support level 1.5 or such. Right now UAX#9 (implicitly?) says that an
application not implementing at least the implicit bidi algorithm must
not interpret RTL characters visually at all.

Well, they don't want a program that simply reverses RTL segments claiming conformance with UAX #9; it is better to see the text backward than to see it wrong. You can ask native users of RTL scripts about that. And ask more than one.


Well, in many cases my "simple solutions" are too simple for people
who've gotten used to bloated featuresets and to putting up with
slowness, bugs, and insecurity. But we'll see. My whole family
of i18n-related projects started out with a desire to switch to UTF-8
everywhere and to have Latin, Tibetan, and Japanese support at the
console level without increased bloat, performance penalties, and huge
dependency trees. From there I first wrote a super-small UTF-8-only C
library and then turned towards the terminal emulator issue, which in
turn led to the font format issue, etc. etc. :) Maybe after a whole
year passes I'll have roughly what I wanted.


I don't recall having seen your "simple solutions" so I can't dismiss them off-hand as not being complicated enough yet. Like I said a couple emails ago, sometimes it doesn't matter if you have a better answer, but if it really is simple, accurate, and on the Internet, you can count on it supplanting the bloat eventually.

BTW, now that the holiday has passed, I probably won't have time to reply at similar length. But it's been fun.
--
---------------------------------------------------------------------------
Mark Leisher
Computing Research Lab             Nowadays, the common wisdom is to
New Mexico State University        celebrate diversity - as long as you
Box 30001, MSC 3CRL                don't point out that people are
Las Cruces, NM  88003              different.    -- Colin Quinn


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
