On Mon, Sep 04, 2006 at 08:19:02PM -0600, Mark Leisher wrote:
> Rich Felker wrote:
> >
> >It went farther because it imposed language-specific semantics in
> >places where they do not belong. These semantics are correct with
> >sentences written in human languages which would not have been hard
> >to explicitly mark up, especially with a word processor doing it for
> >you. On the other hand they're horribly wrong in computer languages
> >(meaning any text file meant to be computer-read and -interpreted,
> >not just programming languages) where explicit markup is highly
> >undesirable or even illegal.
>
> The Unicode Consortium is quite correctly more concerned with human
> languages than programming languages. I think you are arguing yourself
> into a dead end. Programming languages are ephemeral and some might
> argue they are in fact slowly converging with human languages.
Arrg, C is not going away anytime soon. C is THE LANGUAGE as far as
POSIX is concerned. The reason I said "arrg" is that I feel like this
gap between the core values of the "i18n bloatware crowd" and the
"hardcore lowlevel efficient software crowd" is what keeps good i18n
out of the best software. When you talk about programming languages
converging with human languages, somehow all I can think of is Perl...
yuck! Larry Wall's been great about pushing Unicode and UTF-8, but Perl
itself is a horrible mess. The implementation is hopelessly bad and
there's little hope of there ever being a reimplementation.

Anyway, as I've said again and again, it's no problem for human
language text to have explicit embedding tagging. It doesn't need to
conform to syntax rules (oh yeah, Perl code doesn't need to either ;)).
Fancy editors can even insert the tags for you. On the other hand,
stuffing extra control characters into machine-read texts with specific
syntactical and semantic rules is not possible. You can't even just
strip these characters when processing because, depending on the
semantics of the file, they may either be controlling the display of
the file itself or be literal embedding controls to be used when the
strings from the file are printed to their final destination.

> >Or I could just ask: should we write C code in MS Word .doc format?
>
> No reason to. Programming editors work well as they are and will
> continue to work well after being adapted for Unicode.

No, if they perform the algorithm in UAX#9 they will display garbled,
unreadable code. Or does C somehow qualify as a "higher level protocol"
for formatting?

> You don't appear to have any experience writing lexical scanners for
> programming languages. If you did, you would know how utterly trivial
> it is to ignore embedded bidi codes an editor might introduce.

I'm quite aware that it's simple to code, but it's also illegal
according to the specs. And you're ignoring the more troublesome
issues... Obviously you can't remove them inside strings. :) There are
issues with comments too (see the sketch below).

> Though I haven't checked myself, I wouldn't be surprised if Perl,
> Python, PHP, and a host of other programming languages weren't already
> doing this, making your concerns pointless.

I doubt it, but even if they do, these are toy languages with one
implementation and no specification (and in Perl's case, one for which
it's hopeless to even try to write a specification). It's easy to hack
in whatever you want and break compatibility with every new release of
the language when your implementation is the only one. It's much harder
when you're working with an international standard for a language
that's been around (and rather stable!) for approaching 40 years and is
intended to have multiple interoperable implementations.

> You can't seriously expect readers of RTL
> languages to just throw away everything they've learned since
> childhood and learn to read their mathematical expressions backwards?
> Or simply require that their scripts never appear in a plain text
> file? That is ignorant at best and arrogant at worst.

I've seen examples showing that UAX#9 just butchers mathematical
expressions in the absence of explicit bidi control.

> You really need to start looking at code and stop pontificating from a
> poorly understood position. Just about every programming editor out
> there is already aware of programming language syntax. Many different
> programming languages in most cases.

Cheap regex-based syntax highlighting is not the same thing at all.
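Just to make the "utterly trivial" point concrete, here's a rough
sketch of that kind of filtering in C (my own illustration, not
anything from the specs, and assuming UTF-8 input). Even this toy
version already has to special-case string literals, where the controls
are data and must be preserved, and it still punts on character
constants and comments:

/* Sketch only: strip the Unicode bidi controls (LRM, RLM, LRE, RLE,
 * PDF, LRO, RLO) from UTF-8 "source code", except inside double-quoted
 * string literals, where they are data and have to stay. Character
 * constants and comments are deliberately not handled. */
#include <stdio.h>
#include <string.h>

static int is_bidi_ctrl(const unsigned char *s, size_t left)
{
	/* U+200E/U+200F are E2 80 8E/8F; U+202A..U+202E are E2 80 AA..AE */
	return left >= 3 && s[0] == 0xE2 && s[1] == 0x80 &&
		(s[2] == 0x8E || s[2] == 0x8F || (s[2] >= 0xAA && s[2] <= 0xAE));
}

static size_t strip_bidi(char *dst, const char *src, size_t n)
{
	const unsigned char *s = (const unsigned char *)src;
	size_t i = 0, o = 0;
	int in_str = 0;

	while (i < n) {
		if (!in_str && is_bidi_ctrl(s + i, n - i)) {
			i += 3;		/* drop the control character */
			continue;
		}
		/* crude string-literal tracking (char constants and
		 * tricky escape sequences are not handled) */
		if (s[i] == '"' && (i == 0 || s[i-1] != '\\'))
			in_str = !in_str;
		dst[o++] = src[i++];
	}
	return o;
}

int main(void)
{
	/* RLE and PDF around a token, plus an LRM inside a string literal */
	const char *code =
		"int x = 1; \xE2\x80\xAB" "foo\xE2\x80\xAC"
		" char *s = \"\xE2\x80\x8E" "abc\";";
	char out[128];
	size_t len = strip_bidi(out, code, strlen(code));
	fwrite(out, 1, len, stdout);
	putchar('\n');
	return 0;
}

And that's exactly the problem: the moment the filter needs to know
what's a string, a comment, or a character constant, it's no longer a
generic text tool but a language-specific parser.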
But this is aside from the point, which is that it's fundamentally
WRONG to need a special tool that knows about the syntax of your
computer language in order to edit it. What if you've designed your own
language to solve a particular problem? Do you have to go and modify
your editor to make it display text correctly for this language? NO!
That's the whole reason we have plain text: you can edit it without
having to have a special program!

> >As you acknowledge below, a line is not necessarily an
> >unlimited-length object, and in email it should not be longer than 80
> >characters (or preferably 72 or so to allow for quoting). So you
> >can't necessarily just take the MS Notepad approach of omitting
> >newlines and treating lines as paragraphs, although this may be
> >appropriate in some uses of text files.
>
> So instead of a substantive argument why a line can't be viewed as a
> paragraph, you simply imply that it just can't be done. Weak.

No, I agree that it can be. I'm just saying that a line can't do all
the things you expect a paragraph to do. In particular it can't be
arbitrarily long in every plain text context, although it can be in
some.

> >I'm talking about the definition of a text file as a sequence of
> >lines, which might (on stupid legacy implementations) even be
> >fixed-width fields. It's under the stdio stuff about the difference
> >between text and binary mode. I could look it up but I don't feel
> >like digging thru the pdf file right now..
>
> That section doesn't provide definitions of line or paragraph.

See 7.19.2 Streams.

> >I agree it was best to do too. I just pointed it out as being
> >contrary to your claim that they made every effort not to break
> >existing practice.
>
> For a mathematician, you are quite good at ignoring inconvenient
> logic. The phrase "every effort to avoid breaking existing practice"
> does not logically imply that no existing practice was broken. Weak.

Read the history. Han unification was one of the very first points of
Unicode, even though it was obvious that it would break much existing
practice. This seems to have been connected to the misguided goal of
trying to make everything into fixed-width 16-bit characters. From what
I understand, early Unicode was making every effort _to break_ existing
practice. Their motto was "...begin at 0 and add the next character",
which to me implies "throw out everything that already exists and start
from scratch." I've never seen the early drafts, but I wouldn't be
surprised if the original characters 0-127 didn't even match ASCII.

> >I suppose this is true and I don't know the history and internal
> >politics well enough to know who was responsible for what. However,
> >unlike doublebyte/multibyte charsets which were becoming prevalent at
> >the time, UCS-2 data does not form valid C strings. A quick glance at
> >some historical remarks from unicode.org and Rob Pike suggests that
> >UTF-8 was invented well before any serious deployment of Unicode,
> >i.e. that the push for UCS-2 was deliberately aimed at breaking
> >things, though I suspect it was Apple, Sun, and Microsoft pushing
> >UCS-2 more than the consortium as a whole.
>
> You can ask any of the Unicode people from those companies and will
> get the same answer. Something had to be done and UCS-2 was the answer
> at the time. Conspiracy theories do not substantive argument make.
I've been researching what I can with the little information available,
and it seems that the early Unicode architects developed a strong
disgust for variable-size characters from their experience with
Shift_JIS (which was extremely poorly designed) and other CJK
encodings, and turned it into a dogma that fixed-width was the way to
go. There are numerous references to this sort of thinking in "10 Years
of Unicode", published under history on unicode.org.

> So you simply assume that nobody bothered to look into things like
> information density et al during the formation of the Unicode
> Standard? You don't appear to be aware of the social and political
> ramifications involved in making decisions like that. It doesn't
> matter if it makes sense from a mathematical point of view, nations
> and people are involved.

Latin text (which is mostly ASCII anyway) would go up in size by a few
percent while many languages would go down by 33%. Sounds like a fair
trade. I'm sure there are political ramifications, and of course the
answer is always: do what pleases the countries with the most
money/power rather than what serves the largest population and the
population that has the greatest scarcity of storage space...

> Scripts were placed when information about their encodings became
> available to the Unicode Consortium. It's that simple. No big
> conspiracy to give SEA scripts short shrift.

Honestly I think they just didn't care about UTF-8 at the time because
they still had delusions that people would switch to UCS-2 for
everything. Also, I've been told that the arrangement was intended to
be "West to East"..

> >Applications can draw their own bidi text with higher level
> >formatting information, of course. I'm thinking of a terminal-mode
> >browser that has the bidi text in HTML with <dir> tags and whatnot,
> >or apps with a text 'gui' consisting of separated interface elements.
>
> Ahh. Yes. That sounds a lot like lynx. A popular terminal-mode
> browser. Have you checked out how it handles Unicode?

The only app I've seriously checked out is mined, simply because most
apps don't have support for bidi on the console (and many still don't
even know how to use wcwidth...! including emacs!! :( ). If lynx
handles bidi specially I'd be interested in seeing what it does.
However, this brings up another interesting question: what should
lynx -dump do? :) Naturally dumping in visual order is wrong, but
generating a text file that will look right when displayed according to
UAX#9 sounds quite difficult, especially when you take multiple
columns, etc. into account. Of course lynx is old broken crap that
doesn't even support tables, so maybe it has it easier.. :) These days
I use ELinks, but it has very very poor i18n support. :(

> >I've read the ECMA-48 bidi stuff several times and still can't make
> >any sense of it, so I agree it's disgusting too. It does seem
> >powerful, but powerful is often a bad thing. :)
>
> Well, ISO/IEC 2022 and ISO/IEC 6429 do things the same way: multibyte
> escape sequences.

I'm confused about what you mean by multibyte escape sequences. What I
know of as ISO 2022 is the charset-switching escapes used for legacy
CJK support and "vt100 linedrawing characters", but you seem to be
talking about something related to bidi. Does ISO 2022 have bidi
controls as well?

> >>All I will say about them is Unicode is a lot easier to deal with.
> >>Have
> >
> >Easier to deal with because it solves an easier problem.
> >UAX#9 tells you what to do when you have explicit paragraph division
> >and unbounded search capability forwards and backwards. Neither of
> >these exists in a character cell device environment, and (depending
> >on your view of what constitutes a proper text file) possibly not in
> >a text file either. My view of a text file (maybe not very popular
> >these days?) is that it's a more-restricted version of a character
> >cell terminal (no cursor positioning allowed) but with unlimited
> >height.
>
> Having implemented UAX #9 and a couple of other approaches that
> produce the same or similar results, I don't see any problem using it
> to render text files. If your text file has one paragraph per line,
> then you will see occasional glitches in mixed LTR & RTL text.

Seek somewhere in the middle of the line and type a character of the
opposite directionality. Watch the whole line jump around and the
character you just typed end up in a different column from where your
cursor was placed. This sort of thing will happen all the time in a
terminal when the app goes to draw interface elements, etc. over top of
part of the text. If it doesn't, i.e. if the terminal implements a sort
of "hard implicit bidi", then the terminal display will just become
hopelessly corrupted unless the program has explicit bidi logic
matching the terminal's.

> >>This frequently gives you multiple glyph codes for each abstract
> >>character. To do anything with the text, a mapping between glyph and
> >>abstract character is necessary for every program that uses that
> >>text.
> >
> >No, it's necessary only for the terminal. The programs using the text
> >need not have any idea what language/script it comes from. This is
> >the whole beauty of using such apps.
>
> I suspect you missed my point. Using glyph codes as an encoding gets
> complicated fast.

Yes, but where did I say anything about glyph codes? In both Unicode
and ISCII text, everything is character codes, not glyph codes. Sorry,
but I don't understand what you were trying to say..

> Well, they don't want a program that simply reverses RTL segments
> claiming conformance with UAX #9, it is better to see it backward than
> to see it wrong. You can ask native users of RTL scripts about that.
> And ask more than one.

It says more than that; it says that a program is forbidden from
interpreting the characters visually at all if it doesn't perform at
least the implicit part of UAX#9. From my reading, this means that
UAX#9 deems it worse to show the RTL characters in LTR order than not
to show them at all. It also precludes display strategies like the one
I proposed.

> >Well in many cases my "simple solutions" are too simple for people
> >who've gotten used to bloated featuresets and gotten used to putting
> >up with slowness, bugs, and insecurity. But we'll see. My whole
> >family of i18n-related projects started out with a desire to switch
> >to UTF-8 everywhere and to have Latin, Tibetan, and Japanese support
> >at the console level without increased bloat, performance penalties,
> >and huge dependency trees. From there I first wrote a super-small
> >UTF-8-only C library and then turned towards the terminal emulator
> >issue, which in turn led to the font format issue, etc. etc. :) Maybe
> >after a whole year passes I'll have roughly what I wanted.
> >
>
> I don't recall having seen your "simple solutions" so I can't dismiss

http://svn.mplayerhq.hu/libc/trunk/

About 100kb of code and a few kb of data. E.g.
iconv is 2kb, missing support for CJK legacy encodings at present;
final size should be about 2.5-2.7kb. The terminal emulator uuterm
isn't checked in yet, but it's looking like the whole program, with
support for all scripts (except RTL scripts, if you don't count
non-UAX#9-conformant display as support), will come to about 50kb of
code statically linked. Plus about 1.5 meg for a complete font.

On a separate note... maybe it would help if I express and clarify my
view on UAX#9: I think it very much has its place, and it's great when
formatting content that is known to be human-language text for display
in the traditional form expected by most readers. However, IMO what
UAX#9 should be seen as is a specification of the correspondence
between the stored "logical order" text and the traditional print form,
in a way a definition of what "logical order" text means. It's
important to have this kind of definition especially for legal
purposes, so that e.g. if someone has signed a document containing
particular bidi text, it's clear what printed text ordering that stored
binary text is meant to represent and thus clear what was signed.

On the other hand, I find the whole idea of bidirectionality harmful.
Human language text has always involved ambiguity as far as
interpreting the meaning goes, but aside from bidi text, at least there
is an unambiguous way to display the characters so that their logical
order is clear to the reader, and this method does not require the
machine to interpret the human language at all. With bidi thrown in,
not only does the presentation completely _fail_ to represent the
logical order of the text; it's even possible to construct bidi text
where the presentation order is completely deceptive... this could, for
example, be used for googlebombing or for evading spam filters by
permuting the characters of your text to include or avoid certain words
or phrases. The author of Yudit also identifies examples that have
security implications. Along with the other reasons I have discussed
regarding breaking text file and character cell sanity, this is why, in
my view, bidi is "considered harmful".

I don't expect RTL script users to switch to LTR. What I do propose is
a way for LTR users to view text containing RTL characters without the
need for bidi and without "ekil esnesnon siht", as well as a way for
RTL users to have an entirely-RTL environment rather than a bidi one.
The latter still requires some more consideration regarding
mathematical expressions and numerals. At this point I have no idea
whether such a thing would be of interest to a significant number of
RTL users, but I suspect primarily-LTR users with an occasional need
for reading Arabic or Hebrew words or phrases would like it. Both of
these approaches have the side effect of making RTL scripts "just work"
in any application, without the need for special bidi support at the
application level or the terminal level.

> BTW, now that the holiday has passed, I probably won't have time to
> reply at similar length. But it's been fun.

Ah well, I tried to strip my reply down to the most
interesting/relevant parts in case you do have time for some replies,
but it looks like I've still left a lot in. Thanks for discussing in
any case.

Rich

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
