Re: suit file
If it was packaged up correctly on the sending side (using BinHex, probably) then the bits might still be intact. I believe FontForge can read BinHex'ed files even on non-Mac OS operating systems and from any filesystem. Good luck!

-Ben

On 2009-05-04, Jan Willem Stumpel jstum...@planet.nl wrote:
> Ben Wiley Sittler wrote:
> > It's a font suitcase, and IIRC the font data is actually in the
> > resource fork. At least under Mac OS X, FontForge seems to be able
> > to deal with these. If you have the file on a non-Mac OS machine it
> > may well be corrupt, since non-Mac filesystems do not preserve the
> > resource fork data.
>
> This file was sent to me by a friend, from a Mac computer, by e-mail,
> and then saved on my ext3 HD. Any danger that it was corrupted, or
> incomplete?
>
> Regards, Jan

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
Re: suit file
It's a font suitcase, and IIRC the font data is actually in the resource fork. At least under Mac OS X, FontForge seems to be able to deal with these. If you have the file on a non-Mac OS machine it may well be corrupt, since non-Mac filesystems do not preserve the resource fork data.

On 2009-05-03, Rich Felker dal...@aerifal.cx wrote:
> On Sun, May 03, 2009 at 08:02:40AM +0200, Jan Willem Stumpel wrote:
> > I have a font for an exotic language (Javanese) that I want to
> > convert to UTF-8 encoding. Problem is, the font file was made on a
> > Macintosh using Fontographer, and it has a .suit file extension that
> > FontForge doesn't know how to handle. Anyone know of a conversion
> > tool under Linux that can change a *.suit file to ttf? Googling for
> > "suit file format" turns up lots of SEO-spam sites with no details
> > on what the format really looks like.
>
> I think it's just some sort of primitive archive format that contains
> the ttf (or several ttf's) and you may be able to search for a ttf
> header within it and then just throw away the suit crap at the
> beginning using dd.
>
> Rich
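Rich's dd suggestion can be sketched in a few lines. This is a hypothetical helper (the name `find_ttf_offsets` is mine, not from the thread), and it assumes the suitcase reached you as a flat data stream with the TrueType table embedded somewhere in it; TrueType fonts begin with the sfnt version 0x00010000, and some old Mac fonts use the tag 'true' instead:

```python
def find_ttf_offsets(data: bytes) -> list[int]:
    """Return candidate offsets of embedded TrueType data.

    TrueType fonts begin with the sfnt version bytes 00 01 00 00;
    old Mac fonts may use the tag b'true' instead.
    """
    offsets = []
    for magic in (b'\x00\x01\x00\x00', b'true'):
        start = 0
        # scan for every occurrence, since a suitcase may hold several fonts
        while (pos := data.find(magic, start)) != -1:
            offsets.append(pos)
            start = pos + 1
    return sorted(offsets)

# Roughly equivalent to: dd if=font.suit of=font.ttf bs=1 skip=<offset>
```

False positives are possible (those four bytes can occur in unrelated data), so treat each offset as a candidate and let FontForge validate the result.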
Re: garbled file names on a linux/windows volume
If you need to fix a lot of these automatically from a shell script, you might consider something like this:

python -c 'import sys, urllib; print urllib.unquote(" ".join(sys.argv[1:])).decode("utf-8").encode("iso-8859-1")' \
  '%C3%83%C2%A9' \
  '%C3%A4%C2%B8%C2%93%C3%A8%C2%BE%C2%91'
é 专辑

It works like echo, but decodes the %-escaping and one of the levels of UTF-8 encoding.

On Fri, Oct 31, 2008 at 1:31 PM, Andries E. Brouwer [EMAIL PROTECTED] wrote:
> On Sat, Nov 01, 2008 at 01:51:42AM +0800, Ray Chuan wrote:
> > Using an edonkey client, which has a function to convert file names
> > to url-friendly strings (aka ed2k links), I was able to see that é
> > showed up as %C3%83%C2%A9, while the more complex 专辑 would be
> > %C3%A4%C2%B8%C2%93%C3%A8%C2%BE%C2%91.
>
> You converted twice to UTF-8, so have to go back once. (é is U+00E9,
> which is 11000011 10101001 in UTF-8, but if you read the latter as
> Latin-1 and convert once more to UTF-8 you get
> 11000011 10000011 11000010 10101001, that is, %C3%83%C2%A9 as you
> reported.)
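The same repair in modern Python 3 (the function name `fix_double_utf8` is my own label for the technique in the one-liner, not something from the thread): undo the %-escaping, decode one UTF-8 layer, then reinterpret the result as Latin-1 bytes and decode UTF-8 once more.

```python
from urllib.parse import unquote_to_bytes

def fix_double_utf8(escaped: str) -> str:
    """Undo one spurious round of UTF-8 encoding in a %-escaped name."""
    raw = unquote_to_bytes(escaped)        # '%C3%83%C2%A9' -> b'\xc3\x83\xc2\xa9'
    once = raw.decode('utf-8')             # undo the outer UTF-8 layer -> 'Ã©'
    return once.encode('latin-1').decode('utf-8')  # undo the inner layer -> 'é'

print(fix_double_utf8('%C3%83%C2%A9'))                          # é
print(fix_double_utf8('%C3%A4%C2%B8%C2%93%C3%A8%C2%BE%C2%91'))  # 专辑
```

The `encode('latin-1')` step works because Latin-1 maps code points U+0000..U+00FF one-to-one onto bytes 0x00..0xFF, so it exactly reverses the erroneous "read UTF-8 bytes as Latin-1 text" step.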
Re: Unicode, ISO/IEC 10646 Synchronization Issues for UTF-8
Glad it was rejected. The only really sensible approach I have yet seen is UTF-8B (see my take on it here: http://bsittler.livejournal.com/10381.html and another implementation here: http://hyperreal.org/~est/utf-8b/ ).

The UTF-8B approach is superior to many others in that binary is preserved, but it does not inject control characters. Instead it is an extension to UTF-8 that allows all byte sequences, both those that are valid UTF-8 and those that are not. When converting UTF-8 → UTF-16, the bytes in invalid UTF-8 sequences are mapped to unpaired UTF-16 surrogates. The correspondence is 1-1, so data is never lost. Valid paired surrogates are unaffected (and are used for characters outside the BMP.)

I realize I've mentioned this before, but I feel I should mention it whenever someone mentions a non-data-preserving proposal (like converting everything invalid to U+FFFD REPLACEMENT CHARACTER) or an actively harmful proposal (like converting invalid bytes into U+001A SUB, which has well-defined and sometimes-destructive semantics.)

On 4/27/07, Christopher Fynn [EMAIL PROTECTED] wrote:
> Rich Felker wrote:
> > On Fri, Apr 27, 2007 at 05:15:16PM +0600, Christopher Fynn wrote:
> > > N3266 was discussed and rejected by WG2 yesterday. As you pointed
> > > out there are all sorts of problems with this proposal, and
> > > accepting it would break many existing implementations.
> >
> > That's good to hear. In followup, I think the whole idea of trying
> > to standardize error handling is flawed. What you should do when
> > encountering invalid data varies a lot depending on the application.
> > For filenames or text file contents you probably want to avoid
> > corrupting them at all costs, even if they contain illegal
> > sequences, to avoid catastrophic data loss or vulnerabilities. On
> > the other hand, when presenting or converting data, there are many
> > approaches that are all acceptable. These include dropping the
> > corrupt data, replacing it with U+FFFD, or even interpreting the
> > individual bytes according to a likely legacy codepage. This last
> > option is popular for example in IRC clients and works well to deal
> > with the stragglers who refuse to upgrade their clients to use
> > UTF-8. Also, some applications may wish to give fatal errors and
> > refuse to process data at all unless it's valid to begin with.
> >
> > Rich
>
> Yes. Someone who was there tells me the main reason it was rejected
> was that it was considered out of scope for ISO 10646 or even Unicode
> to dictate what a process should do in an error condition. Should it
> throw an exception, etc. etc. The UTF-8 validity specification is
> expressed in terms of what constitutes a valid string or substring
> rather than what a process needs to do in a given condition. Neither
> standard wants to get into the game of standardizing API-type things
> like what processes should do.
>
> - Chris
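The UTF-8B scheme described above was later adopted by Python as the `surrogateescape` error handler (PEP 383), which makes it easy to demonstrate: invalid bytes survive a decode/encode round trip as unpaired surrogates, while valid UTF-8 is decoded normally.

```python
# Python's 'surrogateescape' error handler implements the UTF-8B idea:
# each invalid byte 0xNN round-trips as the unpaired surrogate U+DCNN.
data = b'ok \xc3\xa9 bad \xff\xfe end'

text = data.decode('utf-8', errors='surrogateescape')
assert '\udcff' in text and '\udcfe' in text  # invalid bytes preserved
assert '\xe9' in text                         # valid UTF-8 decoded normally

# Re-encoding restores the original bytes exactly -- nothing is lost.
assert text.encode('utf-8', errors='surrogateescape') == data
```

A plain `data.decode('utf-8')` on the same input would raise `UnicodeDecodeError`, and `errors='replace'` would clobber the bytes with U+FFFD; `surrogateescape` is the only stock handler that preserves them.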
Re: perl unicode support
Please, before embarking on such a path, think about what happens when someone else happens to use an actual character in the PUA which collides with your escape. Better to use something invalid to represent something invalid.

Markus Kuhn said it best; see e.g. here:
http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/kuhn-utf-8b.html
and specifically option D, "Emit a malformed UTF-16 sequence for every byte in a malformed UTF-8 sequence". Basically, each invalid input byte 0xnn is mapped to the unpaired surrogate 0xDCnn (which are all in the range 0xDC80 ... 0xDCFF). On output, the reverse is done (unpaired surrogates from that range are mapped back to the corresponding bytes.)

The particular scheme described there has a name (UTF-8B) and several implementations, and is widely applicable to situations involving mixed UTF-8 and binary data where the binary needs to be preserved while also treating the UTF-8 parts with Unicode or UCS semantics.

-ben

On 3/31/07, Rich Felker [EMAIL PROTECTED] wrote:
> On Sat, Mar 31, 2007 at 07:44:39PM -0400, Daniel B. wrote:
> > Rich Felker wrote:
> > > Again, software which does not handle corner cases correctly is
> > > crap.
> >
> > Why are you confusing special-case with corner case? I never said
> > that software shouldn't handle corner cases such as illegal UTF-8
> > sequences. I meant that an editor that handles illegal UTF-8
> > sequences other than by simply rejecting the edit request is a bit
> > of a special case compared to general-purpose software, say an XML
> > processor, for which some specification requires (or recommends?)
> > that the processor ignore or reject any illegal sequences. The
> > software isn't failing to handle the corner case; it is handling
> > it--by explicitly rejecting it.
>
> It is a corner case! Imagine a situation like this:
>
> 1. I open a file in my text editor for editing, unaware that it
>    contains invalid sequences.
> 2. The editor either silently clobbers them, or presents some sort of
>    warning (which, as a newbie, I will skip past as quickly as I can)
>    and then clobbers them.
> 3. I save the file, and suddenly I've irreversibly destroyed huge
>    amounts of data.
>
> It's simply not acceptable for opening a file and resaving it to not
> yield exactly the same, byte-for-byte identical file, because it can
> lead either to horrible data corruption or inability to edit when
> your file has somehow gotten malformed data into it. If your editor
> corrupts files like this, it's broken and I would never even consider
> using it.
>
> As an example of broken behavior (but different from what you're
> talking about since it's not UTF-8), XEmacs converts all characters
> to its own nasty mule encoding when it loads the file. It proceeds to
> clobber all Unicode characters which don't also exist in legacy mule
> character sets, and upon saving, the file is horribly destroyed. Yes,
> this situation is different, but the only difference is that UTF-8 is
> a proper standard and mule is a horrible hack. The clobbering is just
> as wrong either way. (I'm hoping that XEmacs developers will fix this
> someday soon since I otherwise love XEmacs, but this is pretty much a
> show-stopper since it clobbers characters I actually use..)
>
> > What I meant (given the quoted part below you replied to before) was
> > that if you're dealing with a file that overall isn't valid UTF-8,
> > how would you know whether a particular part that looks like valid
> > UTF-8, representing some characters per the UTF-8 interpretation,
> > really represents those characters or is an erroneously mixed-in
> > representation of other characters in some other encoding? Since
> > you're talking about preserving what's there as opposed to doing
> > anything more than that, I would guess your answer is that it really
> > doesn't matter. (Whether you treated 0xCF 0xBF as a correct UTF-8
> > sequence and displayed the character U+03FF or, hypothetically,
> > treated it as an incorrectly-inserted Latin-1 encoding of U+00CF
> > U+00BF and displayed those characters, you'd still write the same
> > bytes back out.)
>
> Yes, that's exactly my answer. You might as well show it as the
> character in case it really was supposed to be the character.
>
> > Now it sounds like we at least understand what one another are
> > saying.
> >
> > > > example, if at one point you see the UTF-8-illegal byte sequence
> > > > 0x00 0xBF and assume that that 0xBF byte means character U+00BF,
> > > > then
> > >
> > > This is incorrect. It means the byte 0xBF, and NOT ANY CHARACTER.
> >
> > You said you're talking about a text editor, that reads bytes,
> > displays legal UTF-8 sequences as the characters they represent in
> > UTF-8, doesn't reject other UTF-8-illegal bytes, and does something
> > with those bytes. What does it do with such a byte? It seems you
> > were talking about mapping it to some character to display it. Are
> > you talking about something else, such as displaying the hex value
> > of the byte?
>
> Yes. Actually GNU Emacs displays octal instead of hex, but it's the
> same idea. The pager less displays hex, such as <BF>, in reverse
> video, and shows legal sequences that make
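The 0xnn → 0xDCnn mapping described above is simple arithmetic; a minimal sketch (the helper names are mine, for illustration):

```python
def byte_to_surrogate(b: int) -> str:
    """Map an invalid input byte 0xNN to the lone surrogate U+DCNN."""
    assert 0x80 <= b <= 0xFF, 'only non-ASCII bytes can be invalid UTF-8'
    return chr(0xDC00 + b)

def surrogate_to_byte(ch: str) -> int:
    """Reverse mapping, applied on output."""
    cp = ord(ch)
    assert 0xDC80 <= cp <= 0xDCFF, 'not a UTF-8B escape surrogate'
    return cp - 0xDC00

# 0xBF maps to U+DCBF and back, losslessly:
assert byte_to_surrogate(0xBF) == '\uDCBF'
assert surrogate_to_byte('\uDCBF') == 0xBF
```

Because the escape range 0xDC80..0xDCFF consists of unpaired low surrogates, which can never appear in well-formed UTF-16, the mapping cannot collide with any real character, unlike a PUA-based escape.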
Re: Non-ASCII characters in file names
Awesome, and thank you! However, UTF-8 filenames given on the command line still do not work... they get turned into ISO-8859-1, which is then UTF-8 encoded before saving (?!)

Here's my (partial) UTF-8 workaround for emacs so far:

(if (string-match "XEmacs\\|Lucid" emacs-version)
    nil
  (condition-case nil
      (eval (if (string-match "\\.\\(UTF\\|utf\\)-?8$"
                              (or (getenv "LC_CTYPE")
                                  (or (getenv "LC_ALL")
                                      (or (getenv "LANG") "C"))))
                '(concat
                  (set-terminal-coding-system 'utf-8)
                  (set-keyboard-coding-system 'utf-8)
                  (set-default-coding-systems 'utf-8)
                  (setq file-name-coding-system 'utf-8)
                  (set-language-environment "UTF-8"))))
    ;; guards against: Language environment not defined: "UTF-8"
    (error nil)))

On 3/17/07, Rich Felker [EMAIL PROTECTED] wrote:
> On Sat, Mar 17, 2007 at 09:51:53AM -0700, Ben Wiley Sittler wrote:
> > emacs seems not to handle utf-8 filenames at all, regardless of
> > locale.
>
> (setq file-name-coding-system 'utf-8)
>
> ~Rich
Re: Non-ASCII characters in file names
Yeah, using the newer 'emacs-snapshot' (GNU Emacs 22.0.91.1) here on Ubuntu Feisty solves most of the UTF-8 related problems in emacs, including command line argument encoding.

Since I deal with some data in non-UTF-8 encodings (iso-2022, iso-2022-jp, iso-8859-x, etc.) and interact with other X11 applications that use compound-text in their selections, I do not think some of those settings would work for me. I agree that looking for a particular substring in the locale name is the wrong approach. On a Linux system I should perhaps base this on the output of the locale charmap command instead, but my rusty elisp is not up to that task at the moment. Fortunately the UTF-8 locales all seem to end with .UTF-8 on this system.

On 3/18/07, Rich Felker [EMAIL PROTECTED] wrote:
> On Sun, Mar 18, 2007 at 08:41:48AM -0700, Ben Wiley Sittler wrote:
> > Awesome, and thank you! However, UTF-8 filenames given on the
> > command line still do not work... [elisp workaround snipped]
>
> Here are all my relevant emacs settings. They work in at least
> emacs-21 and later; however, emacs-21 seems to be having trouble with
> UTF-8 on the command line and I don't know any way around that.
>
> ; Force unix and utf-8
> (setq inhibit-eol-conversion t)
> (prefer-coding-system 'utf-8)
> (setq locale-coding-system 'utf-8)
> (set-terminal-coding-system 'utf-8)
> (set-keyboard-coding-system 'utf-8)
> (set-selection-coding-system 'utf-8)
> (setq file-name-coding-system 'utf-8)
> (setq coding-system-for-read 'utf-8)
> (setq coding-system-for-write 'utf-8)
>
> Note that the last two may be undesirable; they force ALL files to be
> treated as UTF-8, skipping any detection. This allows me to edit
> files which may have invalid sequences in them (like Kuhn's decoder
> test file) or which are a mix of binary data and UTF-8. I use the
> experimental unicode-2 branch of GNU emacs, and with it, forcing
> UTF-8 does not corrupt non-UTF-8 files. The invalid sequences are
> simply shown as octal byte codes and saved back to the file as they
> were in the source. I cannot confirm that this will not corrupt files
> on earlier versions of GNU emacs, however, and XEmacs ALWAYS corrupts
> files visited as UTF-8 (it converts any unicode character for which
> it does not have a corresponding emacs-mule character into a
> replacement character) so it's entirely unsuitable for use with UTF-8
> until that's fixed (still broken in latest cvs as of a few months
> ago..).
>
> BTW, looking for UTF-8 in the locale string is a bad idea since UTF-8
> is not necessarily a special encoding but may be the native encoding
> for the selected language. nl_langinfo(CODESET) is the only reliable
> determination and I doubt emacs provides any direct way of accessing
> it. :(
>
> ~Rich
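For reference, the nl_langinfo(CODESET) check Rich recommends (the reliable equivalent of the `locale charmap` command) is directly available from Python; a minimal sketch, assuming a POSIX system:

```python
import codecs
import locale

# Adopt the environment's locale, then ask for its codeset by name.
locale.setlocale(locale.LC_CTYPE, '')
codeset = locale.nl_langinfo(locale.CODESET)
print(codeset)  # e.g. 'UTF-8' in a UTF-8 locale, 'ANSI_X3.4-1968' under C

# Normalize through the codec registry instead of substring-matching
# the locale name -- avoiding exactly the ".UTF-8$" heuristic above.
is_utf8 = codecs.lookup(codeset).name == 'utf-8'
```

This sidesteps both failure modes discussed in the thread: locales whose names don't mention UTF-8 but whose native codeset is UTF-8, and spelling variants like `utf8` vs `UTF-8`.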
Re: High-Speed UTF-8 to UTF-16 Conversion
Just a hypothesis, but I'm guessing that at the time they put this together, both major platforms (Win32 and Java) dealing with DOM used UCS-2 (and now use UTF-16) internally. Even today, Win32 and Java mostly do not use UTF-8. The only form widely supported outside of Linux and Unix systems is UTF-8 with a fictitious byte order mark (obviously, as a byte-oriented encoding this is useless), which is of course incompatible with tools used on Unix and Linux, and with many web browsers. Notepad uses this form, and Java uses a bunch of incompatible UTF-8 extensions in its serializations (incorrect encoding of NUL, and incorrect encoding of plane 1 ... plane 16 using UTF-8 sequences corresponding to individual surrogate codes). Unfortunately this is perpetuated in several network protocols, and e.g. is what one does when interfacing to Oracle or MySQL.

Even on Mac OS X, where it's the encoding used for the Unix-type filesystem access, it's still not the default text encoding in TextEdit, and UTF-8 text files don't work (i.e. they open as MacRoman or whatever Mac* encoding is paired with the OS language.) Fortunately this is configurable; unfortunately changing it breaks all sorts of other stuff (apps frequently still ship with MacRoman README files, etc.)

So basically, if you want it to work I recommend switching to Linux, Unix, Plan 9, or similar :(

On 3/17/07, Christopher Fynn [EMAIL PROTECTED] wrote:
> Colin Paul Adams wrote:
> > "Rich" == Rich Felker [EMAIL PROTECTED] writes:
> >
> > Rich> Indeed, this was what I was thinking of. Thanks for
> > Rich> clarifying. BTW, any idea WHY they brought the UTF-16
> > Rich> nonsense to DOM/DHTML/etc.?
> >
> > I don't know for certain, but I can speculate well, I think. DOM was
> > a micros**t invention (and how it shows!). NT was UCS-2
> > (effectively).
>
> AFAIK Unicode was originally only planned to be a 16-bit encoding.
> The Unicode Consortium and ISO 10646 then agreed to synchronize the
> two standards - though originally Unicode was only going to be a
> 16-bit subset of the UCS. A little after that, Unicode decided to
> support UCS characters beyond plane 0. Anyway, at the time NT was
> being designed (late eighties) Unicode was supposed to be limited to
> 65536 characters and UTF-8 hadn't been thought of, so 16 bits
> probably seemed like a good idea.
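The two Java quirks mentioned above (overlong NUL, and astral characters serialized as one 3-byte sequence per UTF-16 surrogate, i.e. CESU-8-style) can be illustrated with a small sketch; `java_modified_utf8` is a hypothetical name of mine, not a real Java API, and it implements only the two deviations under discussion:

```python
def java_modified_utf8(s: str) -> bytes:
    """Sketch of Java-style "modified UTF-8" serialization quirks:
    NUL becomes the overlong pair C0 80, and astral characters are
    encoded as two 3-byte sequences (one per UTF-16 surrogate)."""
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if cp == 0:
            out += b'\xc0\x80'                 # overlong NUL
        elif cp < 0x10000:
            out += ch.encode('utf-8')          # BMP: normal UTF-8
        else:
            cp -= 0x10000
            hi = chr(0xD800 + (cp >> 10))      # high surrogate
            lo = chr(0xDC00 + (cp & 0x3FF))    # low surrogate
            out += hi.encode('utf-8', 'surrogatepass')
            out += lo.encode('utf-8', 'surrogatepass')
    return bytes(out)

# U+10400: proper UTF-8 is 4 bytes; the surrogate-pair form is 6.
assert '\U00010400'.encode('utf-8') == b'\xf0\x90\x90\x80'
assert java_modified_utf8('\U00010400') == b'\xed\xa0\x81\xed\xb0\x80'
```

Both outputs are rejected by a strict UTF-8 decoder (overlong sequences and encoded surrogates are invalid by definition), which is exactly why these extensions are incompatible with standard tools.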
Re: Non-ASCII characters in file names
emacs seems not to handle utf-8 filenames at all, regardless of locale.

On 3/17/07, SrinTuar [EMAIL PROTECTED] wrote:
> > The test suite is currently distributed as a zip file. It so happens
> > that the file concerned is named using ISO-8859-1 on the
> > distributor's system. On my system, doing ls from the GNOME console
> > shows the name as xgespr?ch.xml. Whereas Emacs dired shows the name
> > as xgespräch.xml.
>
> Zip files treat filenames as byte arrays, so zip tends to be clumsy
> when you get zipfiles created on legacy systems. It's compatible with
> UTF-8 at least, so zipfiles you make yourself should have no
> problems.
>
> > So I went back to LANG=en_GB.UTF-8, unzipped the distribution again,
> > and re-named the file, thanks to your help. ls now shows the correct
> > file name. Emacs shows xgespräch.xml. And the test works.
>
> I tried emacs and saw the same problem you did. vim seems to work
> correctly with locales. Although advising a switch to vim is probably
> more responsible, a quick search revealed this link:
> http://linux.seindal.dk/item32.html
>
> > Has anyone any illuminating comments to make? I'm particularly
> > interested in the distribution problem.
>
> You could have the distributor change his locale to utf-8 and rename
> the files on his filesystem.
Re: High-Speed UTF-8 to UTF-16 Conversion
I believe it's more DHTML that is the problem. DOMString is specified to be UTF-16. Likewise for ECMAScript strings, IIRC, although they may still be officially UCS-2. In practice ECMAScript specifies (and implementations provide) such minimal Unicode support (no canonicalization or character class primitives [combining, etc.], for instance, no way to work with characters rather than UCS-2 codes/surrogate halves, no access to codecs other than UTF-8 ↔ UTF-16 [often buggy and incomplete, and rarely able to deal with errors in any way other than throwing exceptions], nor any access to the Unicode names database or the Unihan database) that applications are basically on their own.

On 16 Mar 2007 21:59:06 +0000, Colin Paul Adams [EMAIL PROTECTED] wrote:
> "Rich" == Rich Felker [EMAIL PROTECTED] writes:
>
> Rich> UTF-8. There's no good reason for using UTF-16 at all; it's
> Rich> just a bad implementation choice. IIRC either HTML or XML
> Rich> (yes I know they're different but I forget which does it..)
>
> I don't ever recall seeing this in HTML, but it certainly isn't in
> XML. The only thing XML has to say on the subject is that XML parsers
> must be able to read both.
>
> --
> Colin Adams
> Preston Lancashire
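The "surrogate halves vs. characters" distinction above is easy to make concrete: a DOMString/ECMAScript string length counts UTF-16 code units, so one astral character counts as two. A quick sketch of the same arithmetic:

```python
# One astral character: U+1D11E MUSICAL SYMBOL G CLEF.
s = '\U0001D11E'

code_points = len(s)                            # Python counts characters
utf16_units = len(s.encode('utf-16-le')) // 2   # UTF-16 sees 2 code units

assert code_points == 1
assert utf16_units == 2   # what DOMString/ECMAScript 'length' would report
```

This is the gap the email describes: an API that exposes only the code-unit view forces applications to reassemble surrogate pairs themselves before doing any real character-level processing.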
Re: A call for fixing aterm/rxvt/etc...
Just two cents: I did this some years back for the links and elinks web browsers (it's the utf-8 i/o option available in some versions of each) and the results are fairly mixed -- copy-n-paste fails horribly in an app converted in this way, and I assume the same would be true of a terminal emulator in a window system like X11. On the other hand, it meant I and others could use these browsers on e.g. Mac OS X years before someone undertook the much more in-depth UTF-8 and Unicode support now in progress for elinks.

Using luit for this sounds appealing, but in my experience luit (a) crashes frequently and (b) is easily confused by escape sequences and has no user interface for resetting all its iso-2022 state, so in practice it works for only a few apps. That said, it would probably be better than the current state of affairs.

On 2/23/07, Rich Felker [EMAIL PROTECTED] wrote:
> These days we have at least xterm, urxvt, mlterm, gnome-terminal, and
> konsole which support utf-8 fairly well, but on the flip side there's
> still a huge number of terminal emulators which do not respect the
> user's encoding at all and always behave in a legacy-8bit-codepage
> way. Trying to help users in #irssi, etc. with charset issues, I've
> come to believe that it's a fairly significant problem: users get
> frustrated with utf-8 because the terminal emulator they want to use
> (which might be chosen based on anti-bloat sentiment or, quite the
> opposite, on a desire for specialized eye candy only available in one
> or two programs) forces their system into a mixed-encoding scenario
> where they have both utf-8 and non-utf-8 data in the filesystem and
> text files.
>
> How hard would it be to go through the available terminal emulators,
> evaluate which ones lack utf-8 support, and provide at least minimal
> fixes? In particular, are there any volunteers?
>
> What I'm thinking of as a minimal fix is just putting utf-8
> conversion into the input and output layers. It would still be fine
> for most users of these apps if the terminal were limited to a
> 256-character subset of UCS, didn't support combining characters or
> CJK, etc. as long as the data sent and received over the PTY device
> is valid UTF-8, so that the (valid and correct) assumption of
> applications running on the terminal that characters are encoded in
> the locale's encoding is satisfied.
>
> Perhaps this could be done via a reverse luit -- that is, a program
> like luit or an extension to luit that assumes the physical terminal
> is using an 8bit legacy codepage rather than UTF-8. Then these
> terminals could simply be patched to run luit if the locale's
> encoding is not single-byte.
>
> Rich
Re: Proposed fix for Malayalam ( other Indic?) chars and wcwidth
Just tried this in a few terminals; here are the results:

GNOME Terminal 2.16.1:
  U+0D30 U+0D4A displayed with width 3
  U+0D30 U+0D46 U+0D3E displayed with width 3
  NOTE: displays very differently in each case

Konsole 1.6.5:
  U+0D30 U+0D4A displayed with width 3
  U+0D30 U+0D46 U+0D3E displayed with width 4
  NOTE: displays very differently in each case

mlterm 2.9.3:
  U+0D30 U+0D4A displayed with width 2
  U+0D30 U+0D46 U+0D3E displayed with width 2
  NOTE: displays identically in each case

On 10/16/06, Bruno Haible [EMAIL PROTECTED] wrote:
> Hello Rich,
>
> > These characters are combining marks that attach on both sides of a
> > cluster, and have canonical equivalence to the two separate pieces
> > from which they are built, but yet Markus' wcwidth implementation
> > and GNU libc assign them a width of 1. It appears very obvious to me
> > that there's no hope of rendering both of these parts using only 1
> > character cell on a character cell device, and even if it were
> > possible, it also seems horribly wrong for canonically equivalent
> > strings to have different widths.
>
> What rendering do other terminal emulators produce for these
> characters, especially the ones from GNOME, KDE, Apple, and mlterm? I
> cannot submit a patch to glibc based on the data of just 1 terminal
> emulator.
>
> Bruno
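The canonical equivalence at issue in the two test strings can be verified directly from the Unicode character database: U+0D4A MALAYALAM VOWEL SIGN O decomposes canonically into U+0D46 + U+0D3E, so the two strings tested above normalize to each other.

```python
import unicodedata

composed = '\u0D30\u0D4A'          # RA + vowel sign O
decomposed = '\u0D30\u0D46\u0D3E'  # RA + vowel sign E + vowel sign AA

# U+0D4A is canonically equivalent to the two-piece sequence:
assert unicodedata.decomposition('\u0D4A') == '0D46 0D3E'
assert unicodedata.normalize('NFC', decomposed) == composed
assert unicodedata.normalize('NFD', composed) == decomposed

# All three vowel signs are spacing combining marks (category Mc),
# which is why assigning the pair a total width of 1 cannot be right.
for cp in (0x0D3E, 0x0D46, 0x0D4A):
    assert unicodedata.category(chr(cp)) == 'Mc'
```

Since canonically equivalent strings should occupy the same number of cells, any terminal reporting different widths for the two forms (as GNOME Terminal and Konsole do above) is inconsistent with itself, whatever width wcwidth assigns.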
Re: Next Generation Console Font?
See mlterm, please — some of these are very useful display forms, and already in use for character-cell terminal emulators. As for triple-cell glyphs, see emacs with Arabic presentation forms.

On 8/20/06, Rich Felker [EMAIL PROTECTED] wrote:
> On Sat, Aug 19, 2006 at 11:20:55AM -0700, Ben Wiley Sittler wrote:
> > sorry, cat-typing sent that email a bit early. here's the rest: for
> > indic scripts and arabic having triple-cell ligatures is really
> > indispensable for readable text. for east asian text a ttb, rtl
> > columnar display mode is really, really nice.
>
> For a terminal? Why? Do you want to see:
>
> l
> s
>
> -
> l
> [...]
>
> ??? I suspect not. If anyone really does want this behavior, then by
> all means they can make a terminal with different orientation. But
> until I hear about someone really wanting this I'll assume such
> claims come from faux-counter-imperial chauvinism where western
> academics in ivory towers tell people in other cultures that they
> must preserve their traditions for their own sake with no regard for
> practicality, and end up doing nothing but _disadvantaging_ people.
>
> > a passable job at least for CJK. how to handle single-cell vs.
> > double-cell vs. triple-cell glyphs in vertical presentation is a
>
> I've never heard of a triple-cell glyph. Certainly the standard
> wcwidth (Kuhn's version) has no such thing.
>
> > tricky problem - short runs (<= 2 cells) should probably be
> > displayed as horizontal inclusions, longer runs should probably be
> > rotated.
>
> Nonsense. A terminal does not have the luxury to decide such things.
> You're confusing terminal with word processor or maybe even with
> TeX...
>
> > why don't we have escape sequences for switching between the DBCS
> > and non-DBCS cell behaviors, and for rotating the terminal display
> > for vertical text vs. horizontal text?
>
> Because it's not useful. Applications will not use it. All the
> terminal emulator needs to do is:
>
> 1. display raw text in a form that's not offensive -- this is
>    necessary so that terminal-unaware programs just writing to stdout
>    will work.
> 2. provide cursor positioning functions (minimal) and (optionally)
>    scrolling/insert/delete and other small optimizations.
>
> Anything more is just pure bloat because it won't be supported by
> curses, and applications are written either to curses or to vt102.
>
> > Note that mixing vertical and horizontal is sometimes done in the
> > typographic world but is probably not needed for terminal emulators
> > (this requires a layout engine much more advanced than the unicode
> > bidi algorithm, capable of laying out
>
> This most certainly does not belong in a terminal emulator. Apps
> (such as text based web browsers) wishing to do elegant
> multi-orientation formatting can do the cursor positioning and such
> themselves. Users preferring a vertical orientation can configure
> their terminals as such. This is a matter of user preference, not
> application control, and thus there should NOT be a way for
> applications to control or override it.
>
> Rich
Re: Next Generation Console Font?
For displaying doublebyte-charset documents the East Asian width semantics are indispensable. There are very good reasons to have two modes for the terminal — East Asian (all but ASCII and explicitly narrow kana/hangeul/etc. as two cells) and non-East-Asian (all but kanji/hanzi/hanja, hangeul, and kana single-width). The first is cell-compatible with the DBCS terminals (useful for viewing forms, character-cell art, webpages, etc., including e.g. doublewidth cyrillic characters used as graphics) and the second with non-DBCS terminals (actual cyrillic text, for example.)

On 8/17/06, David Starner [EMAIL PROTECTED] wrote:
> On 8/17/06, Rich Felker [EMAIL PROTECTED] wrote:
> > This is nothing but glibc being idiotic. Yes, it's _allowed_ to do
> > this according to POSIX (POSIX makes no requirements about
> > correspondence of the values returned to any other standard) but
> > it's obviously incorrect for the width of À to be anything but 1,
> > even if it was historically displayed wide (wtf?!) on some legacy
> > CJK terminal types.
>
> It's not obviously incorrect; in a CJK terminal, everything but ASCII
> was double-width, which is actually a very convenient way of doing
> things. Many of these fonts are still around, and I suspect that many
> users still use terminals that expect everything but ASCII to be
> double-width. glibc here is merely supporting the way things work.
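The two-mode split described above corresponds to Unicode's East Asian Width property (UAX #11): characters classed "Ambiguous" are the ones that render wide on DBCS terminals and narrow elsewhere, which is exactly the case of the doublewidth Cyrillic. A quick check:

```python
import unicodedata

# Width classes from UAX #11 (East Asian Width):
#   W  = wide, Na = narrow,
#   A  = ambiguous: wide in East Asian (DBCS) context, narrow otherwise,
#        i.e. the two-mode split described above.
assert unicodedata.east_asian_width('\u4E2D') == 'W'   # 中, kanji/hanzi
assert unicodedata.east_asian_width('A') == 'Na'       # ASCII letter
assert unicodedata.east_asian_width('\u0410') == 'A'   # Cyrillic А
```

So a terminal's "East Asian mode" is essentially a choice of how to resolve the Ambiguous class, which is why no single context-free wcwidth can satisfy both kinds of legacy content.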
UTF-8B support for libiconv
[ This is in response to a truly ancient linux-utf8 thread ]

I wrote a patch that provides UTF-8 + binary in one codec with no hand-waving, using Markus Kuhn's brilliant proposal to encode invalid bytes 0xyz using unpaired surrogates U+DCyz. This means there need not be a text/binary distinction for UTF-8-using programs. Legal UTF-8 decodes/encodes correctly, and other bytes are handled as opaque U+DCxx on input and correctly serialized on output. So one can once again consider editing a binary format with a notepad-type editor without sacrificing internationalization support.

Markus Kuhn's description of the idea (search for "option D"):
http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html

The patch:
http://xent.com/~bsittler/libiconv-1.9.1-utf-8b.diff

Enjoy! (Not sure how/whether this fits into the official distro, but I hope it gets used.)

-ben