On Mon, Sep 04, 2006 at 08:19:02PM -0600, Mark Leisher wrote:
> Rich Felker wrote:
> >
> >It went farther because it imposed language-specific semantics in
> >places where they do not belong. These semantics are correct with
> >sentences written in human languages which would not have been hard
> >to explicitly mark up, especially with a word processor doing it for
> >you. On the other hand they're horribly wrong in computer languages
> >(meaning any text file meant to be computer-read and -interpreted,
> >not just programming languages) where explicit markup is highly
> >undesirable or even illegal.
>
> The Unicode Consortium is quite correctly more concerned with human
> languages than programming languages. I think you are arguing yourself
> into a dead end. Programming languages are ephemeral and some might
> argue they are in fact slowly converging with human languages.
Arrg, C is not going away anytime soon. C is THE LANGUAGE as far as
POSIX is concerned. The reason I said "arrg" is that I feel like this
gap between the core values of the "i18n bloatware crowd" and the
"hardcore lowlevel efficient software crowd" is what keeps good i18n
out of the best software. When you talk about programming languages
converging with human languages, somehow all I can think of is Perl...
yuck! Larry Wall's been great about pushing Unicode and UTF-8, but Perl
itself is a horrible mess. The implementation is hopelessly bad and
there's little hope of there ever being a reimplementation.

Anyway, as I've said again and again, it's no problem for human
language text to have explicit embedding tagging. It doesn't need to
conform to syntax rules (oh yeah, Perl code doesn't need to either ;)).
Fancy editors can even insert the tags for you. On the other hand,
stuffing extra control characters into machine-read texts with specific
syntactical and semantic rules is not possible. You can't even just
strip these characters when processing because, depending on the
semantics of the file, they may either be controlling the display of
the file itself or be literal embedding controls to be used when the
strings from the file are printed to their final destination.

> >Or I could just ask: should we write C code in MS Word .doc format?
>
> No reason to. Programming editors work well as they are and will
> continue to work well after being adapted for Unicode.

No, if they perform the algorithm in UAX#9 they will display garbled,
unreadable code. Or does C somehow qualify as a "higher level protocol"
for formatting?

> You don't appear to have any experience writing lexical scanners for
> programming languages. If you did, you would know how utterly trivial
> it is to ignore embedded bidi codes an editor might introduce.

I'm quite aware that it's simple to code, but it's also illegal
according to the specs. And you're ignoring the more troublesome
issues... Obviously you can't remove them inside strings. :) There are
issues with comments too (see the sketch below).

> Though I haven't checked myself, I wouldn't be surprised if Perl,
> Python, PHP, and a host of other programming languages weren't already
> doing this, making your concerns pointless.

I doubt it, but even if they do, these are toy languages with one
implementation and no specification (and in Perl's case, one for which
it's hopeless to even try to write a specification). It's easy to hack
in whatever you want and break compatibility with every new release of
the language when your implementation is the only one. It's much harder
when you're working with an international standard for a language
that's been around (and rather stable!) for approaching 40 years and is
intended to have multiple interoperable implementations.

> You can't seriously expect readers of RTL
> languages to just throw away everything they've learned since
> childhood and learn to read their mathematical expressions backwards?
> Or simply require that their scripts never appear in a plain text
> file? That is ignorant at best and arrogant at worst.

I've seen examples showing that UAX#9 just butchers mathematical
expressions in the absence of explicit bidi control.

> You really need to start looking at code and stop pontificating from a
> poorly understood position. Just about every programming editor out
> there is already aware of programming language syntax. Many different
> programming languages in most cases.

Cheap regex-based syntax highlighting is not the same thing at all.
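Just to make the "utterly trivial" point concrete, here's a rough
sketch of that kind of filtering in C (my own illustration, not
anything from the specs, and assuming UTF-8 input). Even this toy
version already has to special-case string literals, where the controls
are data and must be preserved, and it still punts on character
constants and comments:

/* Sketch only: strip the Unicode bidi controls (LRM, RLM, LRE, RLE,
 * PDF, LRO, RLO) from UTF-8 "source code", except inside double-quoted
 * string literals, where they are data and have to stay. Character
 * constants and comments are deliberately not handled. */
#include <stdio.h>
#include <string.h>

static int is_bidi_ctrl(const unsigned char *s, size_t left)
{
	/* U+200E/U+200F are E2 80 8E/8F; U+202A..U+202E are E2 80 AA..AE */
	return left >= 3 && s[0] == 0xE2 && s[1] == 0x80 &&
		(s[2] == 0x8E || s[2] == 0x8F || (s[2] >= 0xAA && s[2] <= 0xAE));
}

static size_t strip_bidi(char *dst, const char *src, size_t n)
{
	const unsigned char *s = (const unsigned char *)src;
	size_t i = 0, o = 0;
	int in_str = 0;

	while (i < n) {
		if (!in_str && is_bidi_ctrl(s + i, n - i)) {
			i += 3;		/* drop the control character */
			continue;
		}
		/* crude string-literal tracking (char constants and
		 * tricky escape sequences are not handled) */
		if (s[i] == '"' && (i == 0 || s[i-1] != '\\'))
			in_str = !in_str;
		dst[o++] = src[i++];
	}
	return o;
}

int main(void)
{
	/* RLE and PDF around a token, plus an LRM inside a string literal */
	const char *code =
		"int x = 1; \xE2\x80\xAB" "foo\xE2\x80\xAC"
		" char *s = \"\xE2\x80\x8E" "abc\";";
	char out[128];
	size_t len = strip_bidi(out, code, strlen(code));
	fwrite(out, 1, len, stdout);
	putchar('\n');
	return 0;
}

And that's exactly the problem: the moment the filter needs to know
what's a string, a comment, or a character constant, it's no longer a
generic text tool but a language-specific parser.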
But this is aside from the point, which is that it's fundamentally
WRONG to need a special tool that knows about the syntax of your
computer language in order to edit it. What if you've designed your own
language to solve a particular problem? Do you have to go and modify
your editor to make it display text correctly for this language? NO!
That's the whole reason we have plain text: you can edit it without
having to have a special program!

> >As you acknowledge below, a line is not necessarily an
> >unlimited-length object, and in email it should not be longer than 80
> >characters (or preferably 72 or so to allow for quoting). So you
> >can't necessarily just take the MS Notepad approach of omitting
> >newlines and treating lines as paragraphs, although this may be
> >appropriate in some uses of text files.
>
> So instead of a substantive argument why a line can't be viewed as a
> paragraph, you simply imply that it just can't be done. Weak.

No, I agree that it can be. I'm just saying that a line can't do all
the things you expect a paragraph to do. In particular it can't be
arbitrarily long in every plain text context, although it can be in
some.

> >I'm talking about the definition of a text file as a sequence of
> >lines, which might (on stupid legacy implementations) even be
> >fixed-width fields. It's under the stdio stuff about the difference
> >between text and binary mode. I could look it up but I don't feel
> >like digging thru the pdf file right now..
>
> That section doesn't provide definitions of line or paragraph.

See 7.19.2 Streams.

> >I agree it was best to do too. I just pointed it out as being
> >contrary to your claim that they made every effort not to break
> >existing practice.
>
> For a mathematician, you are quite good at ignoring inconvenient
> logic. The phrase "every effort to avoid breaking existing practice"
> does not logically imply that no existing practice was broken. Weak.

Read the history. Han unification was one of the very first points of
Unicode, even though it was obvious that it would break much existing
practice. This seems to have been connected to the misguided goal of
trying to make everything into fixed-width 16-bit characters. From what
I understand, early Unicode was making every effort _to break_ existing
practice. Their motto was "...begin at 0 and add the next character",
which to me implies "throw out everything that already exists and start
from scratch." I've never seen the early drafts, but I wouldn't be
surprised if the original characters 0-127 didn't even match ASCII.

> >I suppose this is true and I don't know the history and internal
> >politics well enough to know who was responsible for what. However,
> >unlike doublebyte/multibyte charsets which were becoming prevalent at
> >the time, UCS-2 data does not form valid C strings. A quick glance at
> >some historical remarks from unicode.org and Rob Pike suggests that
> >UTF-8 was invented well before any serious deployment of Unicode,
> >i.e. that the push for UCS-2 was deliberately aimed at breaking
> >things, though I suspect it was Apple, Sun, and Microsoft pushing
> >UCS-2 more than the consortium as a whole.
>
> You can ask any of the Unicode people from those companies and will
> get the same answer. Something had to be done and UCS-2 was the answer
> at the time. Conspiracy theories do not substantive argument make.
I've been researching what I can with the little information available,
and it seems that the early Unicode architects developed a strong
disgust for variable-size characters from their experience with
Shift_JIS (which was extremely poorly designed) and other CJK
encodings, and turned it into a dogma that fixed-width was the way to
go. There are numerous references to this sort of thinking in "10 Years
of Unicode", published under history on unicode.org.

> So you simply assume that nobody bothered to look into things like
> information density et al during the formation of the Unicode
> Standard? You don't appear to be aware of the social and political
> ramifications involved in making decisions like that. It doesn't
> matter if it makes sense from a mathematical point of view, nations
> and people are involved.

Latin text (which is mostly ASCII anyway) would go up in size by a few
percent while many languages would go down by 33%. Sounds like a fair
trade. I'm sure there are political ramifications, and of course the
answer is always: do what pleases the countries with the most
money/power rather than what serves the largest population and the
population that has the greatest scarcity of storage space...

> Scripts were placed when information about their encodings became
> available to the Unicode Consortium. It's that simple. No big
> conspiracy to give SEA scripts short shrift.

Honestly I think they just didn't care about UTF-8 at the time because
they still had delusions that people would switch to UCS-2 for
everything. Also, I've been told that the arrangement was intended to
be "West to East"..

> >Applications can draw their own bidi text with higher level
> >formatting information, of course. I'm thinking of a terminal-mode
> >browser that has the bidi text in HTML with <dir> tags and whatnot,
> >or apps with a text 'gui' consisting of separated interface elements.
>
> Ahh. Yes. That sounds a lot like lynx. A popular terminal-mode
> browser. Have you checked out how it handles Unicode?

The only app I've seriously checked out is mined, simply because most
apps don't have support for bidi on the console (and many still don't
even know how to use wcwidth...! including emacs!! :( ). If lynx
handles bidi specially I'd be interested in seeing what it does.
However, this brings up another interesting question: what should
lynx -dump do? :) Naturally dumping in visual order is wrong, but
generating a text file that will look right when displayed according to
UAX#9 sounds quite difficult, especially when you take multiple
columns, etc. into account. Of course lynx is old broken crap that
doesn't even support tables, so maybe it has it easier.. :) These days
I use ELinks, but it has very very poor i18n support. :(

> >I've read the ECMA-48 bidi stuff several times and still can't make
> >any sense of it, so I agree it's disgusting too. It does seem
> >powerful, but powerful is often a bad thing. :)
>
> Well, ISO/IEC 2022 and ISO/IEC 6429 do things the same way: multibyte
> escape sequences.

I'm confused about what you mean by multibyte escape sequences. What I
know of as ISO 2022 is the charset-switching escapes used for legacy
CJK support and "vt100 linedrawing characters", but you seem to be
talking about something related to bidi. Does ISO 2022 have bidi
controls as well?

> >>All I will say about them is Unicode is a lot easier to deal with.
> >>Have
> >
> >Easier to deal with because it solves an easier problem.
> >UAX#9 tells you what to do when you have explicit paragraph division
> >and unbounded search capability forwards and backwards. Neither of
> >these exists in a character cell device environment, and (depending
> >on your view of what constitutes a proper text file) possibly not in
> >a text file either. My view of a text file (maybe not very popular
> >these days?) is that it's a more-restricted version of a character
> >cell terminal (no cursor positioning allowed) but with unlimited
> >height.
>
> Having implemented UAX #9 and a couple of other approaches that
> produce the same or similar results, I don't see any problem using it
> to render text files. If your text file has one paragraph per line,
> then you will see occasional glitches in mixed LTR & RTL text.

Seek somewhere in the middle of the line and type a character of the
opposite directionality. Watch the whole line jump around and the
character you just typed end up in a different column from where your
cursor was placed. This sort of thing will happen all the time in a
terminal when the app goes to draw interface elements, etc. over top of
part of the text. If it doesn't, i.e. if the terminal implements a sort
of "hard implicit bidi", then the terminal display will just become
hopelessly corrupted unless the program has explicit bidi logic
matching the terminal's.

> >>This frequently gives you multiple glyph codes for each abstract
> >>character. To do anything with the text, a mapping between glyph and
> >>abstract character is necessary for every program that uses that
> >>text.
> >
> >No, it's necessary only for the terminal. The programs using the text
> >need not have any idea what language/script it comes from. This is
> >the whole beauty of using such apps.
>
> I suspect you missed my point. Using glyph codes as an encoding gets
> complicated fast.

Yes, but where did I say anything about glyph codes? In both Unicode
and ISCII text, everything is character codes, not glyph codes. Sorry,
but I don't understand what you were trying to say..

> Well, they don't want a program that simply reverses RTL segments
> claiming conformance with UAX #9, it is better to see it backward than
> to see it wrong. You can ask native users of RTL scripts about that.
> And ask more than one.

It says more than that; it says that a program is forbidden from
interpreting the characters visually at all if it doesn't perform at
least the implicit part of UAX#9. From my reading, this means that
UAX#9 deems it worse to show the RTL characters in LTR order than not
to show them at all. It also precludes display strategies like the one
I proposed.

> >Well in many cases my "simple solutions" are too simple for people
> >who've gotten used to bloated featuresets and gotten used to putting
> >up with slowness, bugs, and insecurity. But we'll see. My whole
> >family of i18n-related projects started out with a desire to switch
> >to UTF-8 everywhere and to have Latin, Tibetan, and Japanese support
> >at the console level without increased bloat, performance penalties,
> >and huge dependency trees. From there I first wrote a super-small
> >UTF-8-only C library and then turned towards the terminal emulator
> >issue, which in turn led to the font format issue, etc. etc. :) Maybe
> >after a whole year passes I'll have roughly what I wanted.
> >
>
> I don't recall having seen your "simple solutions" so I can't dismiss

http://svn.mplayerhq.hu/libc/trunk/

About 100kb of code and a few kb of data. E.g.
iconv is 2kb, missing support for CJK legacy encodings at present;
final size should be about 2.5-2.7kb. The terminal emulator uuterm
isn't checked in yet, but it's looking like the whole program, with
support for all scripts (except RTL scripts, if you don't count
non-UAX#9-conformant display as support), will come to about 50kb of
code statically linked. Plus about 1.5 meg for a complete font.

On a separate note... maybe it would help if I express and clarify my
view on UAX#9: I think it very much has its place, and it's great when
formatting content that is known to be human-language text for display
in the traditional form expected by most readers. However, IMO what
UAX#9 should be seen as is a specification of the correspondence
between the stored "logical order" text and the traditional print form,
in a way a definition of what "logical order" text means. It's
important to have this kind of definition especially for legal
purposes, so that e.g. if someone has signed a document containing
particular bidi text, it's clear what printed text ordering that stored
binary text is meant to represent and thus clear what was signed.

On the other hand, I find the whole idea of bidirectionality harmful.
Human language text has always involved ambiguity as far as
interpreting the meaning goes, but aside from bidi text, at least there
is an unambiguous way to display the characters so that their logical
order is clear to the reader, and this method does not require the
machine to interpret the human language at all. With bidi thrown in,
not only does the presentation completely _fail_ to represent the
logical order of the text; it's even possible to construct bidi text
where the presentation order is completely deceptive... this could, for
example, be used for googlebombing or for evading spam filters by
permuting the characters of your text to include or avoid certain words
or phrases. The author of Yudit also identifies examples that have
security implications. Along with the other reasons I have discussed
regarding breaking text file and character cell sanity, this is why, in
my view, bidi is "considered harmful".

I don't expect RTL script users to switch to LTR. What I do propose is
a way for LTR users to view text containing RTL characters without the
need for bidi and without "ekil esnesnon siht", as well as a way for
RTL users to have an entirely-RTL environment rather than a bidi one.
The latter still requires some more consideration regarding
mathematical expressions and numerals. At this point I have no idea
whether such a thing would be of interest to a significant number of
RTL users, but I suspect primarily-LTR users with an occasional need
for reading Arabic or Hebrew words or phrases would like it. Both of
these approaches have the side effect of making RTL scripts "just work"
in any application, without the need for special bidi support at the
application level or the terminal level.

> BTW, now that the holiday has passed, I probably won't have time to
> reply at similar length. But it's been fun.

Ah well, I tried to strip my reply down to the most
interesting/relevant parts in case you do have time for some replies,
but it looks like I've still left a lot in. Thanks for discussing in
any case.

Rich

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
