Re: suit file

2009-05-04 Thread Ben Wiley Sittler
If it was packaged up correctly on the sending side (using BinHex,
probably) then the bits might still be intact. I believe fontforge can
read binhex'ed files even on non-Mac OS operating systems and from any
filesystem.

Good luck!
-Ben


On 2009-05-04, Jan Willem Stumpel jstum...@planet.nl wrote:
 Ben Wiley Sittler wrote:

 It's a font suitcase, and IIRC the font data is actually in
 the resource fork. At least under Mac OS X, fontforge seems
 to be able to deal with these. If you have the file on a
 non-Mac OS machine it may well be corrupt, since non-Mac
 filesystems do not preserve the resource fork data.

 This file was sent to me by a friend, from a Mac computer, by
 e-mail, and then saved on my ext3 HD. Any danger that it was
 corrupted, or incomplete?

 Regards, Jan


Re: suit file

2009-05-03 Thread Ben Wiley Sittler
It's a font suitcase, and IIRC the font data is actually in the
resource fork. At least under Mac OS X, fontforge seems to be able
to deal with these. If you have the file on a non-Mac OS machine it
may well be corrupt, since non-Mac filesystems do not preserve the
resource fork data.

On 2009-05-03, Rich Felker dal...@aerifal.cx wrote:
 On Sun, May 03, 2009 at 08:02:40AM +0200, Jan Willem Stumpel wrote:
 I have a font for an exotic language (Javanese) that I want to
 convert to UTF-8 encoding. Problem is, the font file was made on a
 Macintosh using Fontographer, and it has a .suit file extension
 that Fontforge doesn't know how to handle.

 Anyone knows of a conversion tool under Linux that can change a
 *.suit file to ttf?

 Googling for "suit file format" turns up lots of SEO-spam sites with no
 details on what the format really looks like. I think it's just some
 sort of primitive archive format that contains the ttf (or several
 ttf's) and you may be able to search for a ttf header within it and
 then just throw away the suit crap at the beginning using dd.

 Rich
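
A rough Python 3 sketch of the carve-it-out approach Rich describes
above (illustrative only: the file names are invented, and nothing
guarantees a plain sfnt/ttf blob sits at any offset in a suitcase file):

import sys

# search a .suit dump for a plausible sfnt header and write out the tail
data = open('javanese.suit', 'rb').read()
for magic in (b'\x00\x01\x00\x00', b'true', b'OTTO'):  # TrueType / Apple / CFF
    i = data.find(magic)
    if i >= 0:
        open('carved.ttf', 'wb').write(data[i:])
        print('wrote carved.ttf from offset', i)
        break
else:
    sys.exit('no sfnt header found')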

Re: garbled file names on a linux/windows volume

2008-10-31 Thread Ben Wiley Sittler
if you need to fix a lot of these automatically from a shell script,
you might consider something like this:

python -c 'import sys, urllib; print urllib.unquote(" ".join(sys.argv[1:])).decode("utf-8").encode("iso-8859-1")' \
   '%C3%83%C2%A9' \
   '%C3%A4%C2%B8%C2%93%C3%A8%C2%BE%C2%91'

é 专辑

it works like echo, but decodes the %-escaping and one of the levels
of utf-8 encoding.
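
The same repair can be applied to the file names themselves; here is a
minimal Python 3 sketch (mine, not from the message — the helper name is
made up, and it only renames entries whose names survive the round trip
as valid UTF-8):

import os

def undo_double_utf8(name: bytes) -> bytes:
    # decode once (the outer UTF-8 layer), then reinterpret the resulting
    # text as Latin-1 to recover the original single-encoded UTF-8 bytes
    return name.decode('utf-8').encode('latin-1')

for entry in os.listdir(b'.'):
    try:
        fixed = undo_double_utf8(entry)
        fixed.decode('utf-8')   # result must itself be valid UTF-8, else skip
    except (UnicodeDecodeError, UnicodeEncodeError):
        continue                # not doubly encoded; leave the name alone
    if fixed != entry:
        os.rename(entry, fixed)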

On Fri, Oct 31, 2008 at 1:31 PM, Andries E. Brouwer
[EMAIL PROTECTED] wrote:
 On Sat, Nov 01, 2008 at 01:51:42AM +0800, Ray Chuan wrote:

 using an edonkey client, which has a function to convert file names to
 url-friendly strings (aka ed2k links), i was able to see that é
 showed up as %C3%83%C2%A9, while the more complex 专辑
 (&#19987;&#36753;) would be %C3%A4%C2%B8%C2%93%C3%A8%C2%BE%C2%91.

 You converted twice to UTF-8, so have to go back once.

 (é is U+00e9 which is 11000011 10101001 in UTF-8, but if you read
 the latter as Latin-1 and convert once more to UTF-8 you get
 11000011 10000011 11000010 10101001, that is, %C3%83%C2%A9 as you reported)
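
A quick Python 3 check of that account (my illustration, not from the
thread):

s = '\u00e9'                                    # é
once = s.encode('utf-8')                        # b'\xc3\xa9'
twice = once.decode('latin-1').encode('utf-8')  # reread as Latin-1, encode again
assert twice == b'\xc3\x83\xc2\xa9'             # i.e. %C3%83%C2%A9, as reported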


Re: Unicode, ISO/IEC 10646 Synchronization Issues for UTF-8

2007-04-27 Thread Ben Wiley Sittler

glad it was rejected. the only really sensible approach i have yet
seen is utf-8b (see my take on it here:
http://bsittler.livejournal.com/10381.html and another implementation
here: http://hyperreal.org/~est/utf-8b/ )

the utf-8b approach is superior to many others in that binary is
preserved, but it does not inject control characters. instead it is an
extension to utf-8 that allows all byte sequences, both those that are
valid utf-8 and those that are not. when converting utf-8 -> utf-16,
the bytes in invalid utf-8 sequences map to unpaired utf-16 surrogates.
the correspondence is 1-1, so data is never lost. valid paired
surrogates are unaffected (and are used for characters outside the
bmp.)
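
For what it's worth, Python 3 later shipped essentially this mapping as
the 'surrogateescape' error handler (PEP 383), so the round trip can be
sketched as follows (my example, not from the message):

data = b'valid \xc3\xa9 then invalid \xff\xfe'

text = data.decode('utf-8', 'surrogateescape')          # \xff -> U+DCFF, \xfe -> U+DCFE
assert text.encode('utf-8', 'surrogateescape') == data  # 1-1, nothing lost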

i realize i've mentioned this before, but i feel i should mention it
whenever someone mentions a non-data-preserving proposal (like
converting everything invalid to U+FFFD REPLACEMENT CHARACTER) or an
actively harmful proposal (like converting invalid bytes into U+001A
SUB which has well-defined and sometimes-destructive semantics.)

On 4/27/07, Christopher Fynn [EMAIL PROTECTED] wrote:

Rich Felker wrote:
 On Fri, Apr 27, 2007 at 05:15:16PM +0600, Christopher Fynn wrote:
 N3266 was discussed and rejected by WG2 yesterday. As you pointed out
 there are all sorts of problems with this proposal, and accepting it
 would break many existing implementations.

 That's good to hear. In followup, I think the whole idea of trying to
 standardize error handling is flawed. What you should do when
 encountering invalid data varies a lot depending on the application.
 For filenames or text file contents you probably want to avoid
 corrupting them at all costs, even if they contain illegal sequences,
 to avoid catastrophic data loss or vulnerabilities. On the other hand,
 when presenting or converting data, there are many approaches that are
 all acceptable. These include dropping the corrupt data, replacing it
 with U+FFFD, or even interpreting the individual bytes according to a
 likely legacy codepage. This last option is popular for example in IRC
 clients and works well to deal with the stragglers who refuse to
 upgrade their clients to use UTF-8. Also, some applications may wish
 to give fatal errors and refuse to process data at all unless it's
 valid to begin with.

 Rich


Yes. Someone who was there tells me the main reason it was rejected was
that it was considered out of scope for ISO 10646 or even Unicode to
dictate what a process should do in an error condition. Should it throw
an exception, etc. etc. The UTF-8 validity specification is expressed in
terms of what constitutes a valid string or substring rather than what a
process needs to do in a given condition. Neither standard wants to get
into the game of standardizing API type things like what processes
should do.

- Chris

Re: perl unicode support

2007-04-01 Thread Ben Wiley Sittler

please before embarking on such a path think about what happens when
someone else happens to use an actual character in the PUA which
collides with your escape. better to use something invalid to
represent something invalid. markus kuhn said it best, see e.g. here:

http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/kuhn-utf-8b.html

and specifically, option (d), "Emit a malformed UTF-16 sequence for
every byte in a malformed UTF-8 sequence". basically each invalid
input 0xnn byte is mapped to the unpaired surrogate 0xDCnn (which are
all in the range 0xDC80 ... 0xDCFF). on output, the reverse is done
(unpaired surrogates from that range are mapped to the corresponding
bytes.)
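
The mapping itself is two one-liners; as an illustrative Python sketch
(my function names, not from the message):

def byte_to_surrogate(b: int) -> int:
    assert 0x80 <= b <= 0xFF   # only bytes that can appear in malformed UTF-8
    return 0xDC00 | b          # 0xnn -> 0xDCnn, i.e. 0xDC80..0xDCFF

def surrogate_to_byte(cp: int) -> int:
    assert 0xDC80 <= cp <= 0xDCFF
    return cp & 0xFF           # reverse mapping, applied on output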

the particular scheme described there has a name (utf-8b) and
several implementations, and is widely applicable to situations
involving mixed utf-8 and binary data where the binary needs to be
preserved while also treating the utf-8 parts with Unicode or UCS
semantics.

-ben

On 3/31/07, Rich Felker [EMAIL PROTECTED] wrote:

On Sat, Mar 31, 2007 at 07:44:39PM -0400, Daniel B. wrote:
 Rich Felker wrote:
  Again, software which does not handle corner cases correctly is crap.

 Why are you confusing special-case with corner case?

 I never said that software shouldn't handle corner cases such as illegal
 UTF-8 sequences.

 I meant that an editor that handles illegal UTF-8 sequences other than
 by simply rejecting the edit request is a bit of a special case compared
 to general-purpose software, say an XML processor, for which some
 specification requires (or recommends?) that the processor ignore or
 reject any illegal sequences.  The software isn't failing to handle the
 corner case; it is handling it--by explicitly rejecting it.

It is a corner case! Imagine a situation like this:

1. I open a file in my text editor for editing, unaware that it
contains invalid sequences.

2. The editor either silently clobbers them, or presents some sort of
warning (which, as a newbie, I will skip past as quickly as I can) and
then clobbers them.

3. I save the file, and suddenly I've irreversibly destroyed huge
amounts of data.

It's simply not acceptable for opening a file and resaving it to not
yield exactly the same, byte-for-byte identical file, because it can
lead either to horrible data corruption or inability to edit when your
file has somehow gotten malformed data into it. If your editor
corrupts files like this, it's broken and I would never even consider
using it.

As an example of broken behavior (but different from what you're
talking about since it's not UTF-8), XEmacs converts all characters to
its own nasty mule encoding when it loads the file. It proceeds to
clobber all Unicode characters which don't also exist in legacy mule
character sets, and upon saving, the file is horribly destroyed. Yes
this situation is different, but the only difference is that UTF-8 is
a proper standard and mule is a horrible hack. The clobbering is just
as wrong either way.

(I'm hoping that XEmacs developers will fix this someday soon since I
otherwise love XEmacs, but this is pretty much a show-stopper since it
clobbers characters I actually use..)

 What I meant (given the quoted part below you replied before) was that
 if you're dealing with a file that overall isn't valid UTF-8, how would
 you know whether a particular part that looks like valid UTF-8,
 representing some characters per the UTF-8 interpretation, really
 represents those characters or is an erroneously mixed-in representation
 of other characters in some other encoding?

 Since you're talking about preserving what's there as opposed to doing
 anything more than that, I would guess your answer is that it really
 doesn't matter.  (Whether you treated 0xCF 0xBF as a correct UTF-8
 sequence and displayed the character U+03FF or, hypothetically, treated
 it as an incorrectly-inserted Latin-1 encoding of U+00CF U+00BF and
 displayed those characters, you'd still write the same bytes back out.)

Yes, that's exactly my answer. You might as well show it as the
character in case it really was supposed to be the character. Now it
sounds like we at least understand what one another are saying.

   example, if at one point you see the UTF-8-illegal byte sequence
   0x00 0xBF and assume that that 0xBF byte means character U+00BF, then
 
  This is incorrect. It means the byte 0xBF, and NOT ANY CHARACTER.

 You said you're talking about a text editor, that reads bytes, displays
 legal UTF-8 sequences as the characters they represent in UTF-8, doesn't
 reject other UTF-8-illegal bytes, and does something with those bytes.

 What does it do with such a byte?  It seems you were taking about
 mapping it to some character to display it.  Are you talking about
 something else, such as displaying the hex value of the byte?

Yes. Actually GNU Emacs displays octal instead of hex, but it's the
same idea. The pager "less" displays hex, such as <BF>, in reverse
video, and shows legal sequences that make 

Re: Non-ASCII characters in file names

2007-03-18 Thread Ben Wiley Sittler

awesome, and thank you! however, utf-8 filenames given on the command
line still do not work... they get turned into iso-8859-1, which is
then utf-8 encoded before saving (?!)

here's my (partial) utf-8 workaround for emacs so far:

(if (string-match "XEmacs\\|Lucid" emacs-version)
    nil
  (condition-case nil
      (eval
       (if (string-match "\\.\\(UTF\\|utf\\)-?8$"
                         (or (getenv "LC_CTYPE")
                             (getenv "LC_ALL")
                             (getenv "LANG")
                             "C"))
           '(progn (set-terminal-coding-system 'utf-8)
                   (set-keyboard-coding-system 'utf-8)
                   (set-default-coding-systems 'utf-8)
                   (setq file-name-coding-system 'utf-8)
                   (set-language-environment "UTF-8"))))
    ;; older emacsen signal (error "Language environment not defined: \"UTF-8\"")
    (error nil)))

On 3/17/07, Rich Felker [EMAIL PROTECTED] wrote:

On Sat, Mar 17, 2007 at 09:51:53AM -0700, Ben Wiley Sittler wrote:
 emacs seems not to handle utf-8 filenames at all, regardless of locale.

(setq file-name-coding-system 'utf-8)

~Rich


Re: Non-ASCII characters in file names

2007-03-18 Thread Ben Wiley Sittler

yeah, using the newer 'emacs-snapshot' (GNU Emacs 22.0.91.1) here on
ubuntu feisty solves most of the UTF-8 related problems in emacs,
including command line argument encoding. since i deal with some data
in non-utf-8 encodings (iso-2022, iso-2022-jp, iso-8859-x, etc.) and
interact with other X11 applications that use compound-text in their
selections, i do not think some of those settings would work for me.

i agree that looking for a particular substring in the locale name is
the wrong approach. on a linux system i should perhaps base this on
the output of the "locale charmap" command instead, but my rusty elisp
is not up to that task at the moment. fortunately the UTF-8 locales
all seem to end with .UTF-8 on this system.
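
For comparison, the reliable check is a few lines of Python (my sketch,
assuming a Unix libc; nl_langinfo(CODESET) reports the same thing the
"locale charmap" command prints):

import locale

locale.setlocale(locale.LC_CTYPE, '')   # adopt the environment's locale
codeset = locale.nl_langinfo(locale.CODESET)
is_utf8 = codeset.upper().replace('-', '') == 'UTF8'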

On 3/18/07, Rich Felker [EMAIL PROTECTED] wrote:

On Sun, Mar 18, 2007 at 08:41:48AM -0700, Ben Wiley Sittler wrote:
 awesome, and thank you! however, utf-8 filenames given on the command
 line still do not work... they get turned into iso-8859-1, which is
 then utf-8 encoded before saving (?!)

 here's my (partial) utf-8 workaround for emacs so far:

 (if (string-match "XEmacs\\|Lucid" emacs-version)
     nil
   (condition-case nil
       (eval
        (if (string-match "\\.\\(UTF\\|utf\\)-?8$"
                          (or (getenv "LC_CTYPE")
                              (getenv "LC_ALL")
                              (getenv "LANG")
                              "C"))
            '(progn (set-terminal-coding-system 'utf-8)
                    (set-keyboard-coding-system 'utf-8)
                    (set-default-coding-systems 'utf-8)
                    (setq file-name-coding-system 'utf-8)
                    (set-language-environment "UTF-8"))))
     ;; older emacsen signal (error "Language environment not defined: \"UTF-8\"")
     (error nil)))

Here are all my relevant emacs settings. They work in at least
emacs-21 and later; however, emacs-21 seems to be having trouble with
UTF-8 on the command line and I don't know any way around that.

; Force unix and utf-8
(setq inhibit-eol-conversion t)
(prefer-coding-system 'utf-8)
(setq locale-coding-system 'utf-8)
(set-terminal-coding-system 'utf-8)
(set-keyboard-coding-system 'utf-8)
(set-selection-coding-system 'utf-8)
(setq file-name-coding-system 'utf-8)
(setq coding-system-for-read 'utf-8)
(setq coding-system-for-write 'utf-8)

Note that the last two may be undesirable; they force ALL files to be
treated as UTF-8, skipping any detection. This allows me to edit files
which may have invalid sequences in them (like Kuhn's decoder test
file) or which are a mix of binary data and UTF-8.

I use the experimental unicode-2 branch of GNU emacs, and with it,
forcing UTF-8 does not corrupt non-UTF-8 files. The invalid sequences
are simply shown as octal byte codes and saved back to the file as
they were in the source. I cannot confirm that this will not corrupt
files on earlier versions of GNU emacs, however, and XEmacs ALWAYS
corrupts files visited as UTF-8 (it converts any unicode character for
which it does not have a corresponding emacs-mule character into a
replacement character) so it's entirely unsuitable for use with UTF-8
until that's fixed (still broken in latest cvs as of a few months
ago..).

BTW looking for UTF-8 in the locale string is a bad idea since UTF-8
is not necessarily a special encoding but may be the native
encoding for the selected language. nl_langinfo(CODESET) is the only
reliable determination and I doubt emacs provides any direct way of
accessing it. :(

~Rich


Re: High-Speed UTF-8 to UTF-16 Conversion

2007-03-17 Thread Ben Wiley Sittler

just a hypothesis, but i'm guessing that at the time they put this
together, both major platforms (win32 and java) dealing with DOM used
ucs-2 (and now use utf-16) internally.

even today, win32 and java mostly do not use utf-8. the only form
widely supported outside of linux and unix systems is utf-8 with a
fictitious byte order mark (obviously as a byte-oriented encoding
this is useless) which is of course incompatible with tools used on
unix and linux, and with many web browsers. Notepad uses this form,
and Java uses a bunch of incompatible utf-8 extensions in its
serializations (incorrect encoding of NUL and incorrect encoding of
plane 1 ... plane 16 using utf-8 sequences corresponding to individual
surrogate codes). unfortunately this is perpetuated in several network
protocols, and e.g. is what one does when interfacing to Oracle or
MySQL.
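
Both quirks are easy to reproduce. In this Python 3 illustration (mine,
not from the message), a plane-1 character comes out as two 3-byte
sequences, one per UTF-16 surrogate half, instead of one 4-byte
sequence, and NUL gets the overlong two-byte form:

import struct

ch = '\U0001d11e'                        # MUSICAL SYMBOL G CLEF, plane 1
assert ch.encode('utf-8') == b'\xf0\x9d\x84\x9e'   # standard UTF-8: 4 bytes

def three_byte(unit: int) -> bytes:
    # encode a 16-bit code unit as a 3-byte UTF-8-style sequence
    return bytes([0xE0 | (unit >> 12),
                  0x80 | ((unit >> 6) & 0x3F),
                  0x80 | (unit & 0x3F)])

hi, lo = struct.unpack('>HH', ch.encode('utf-16-be'))  # surrogates D834 DD1E
java_style = three_byte(hi) + three_byte(lo)
assert java_style == b'\xed\xa0\xb4\xed\xb4\x9e'       # 6 bytes, invalid UTF-8
# and NUL becomes the overlong two-byte form instead of a single 00 byte:
java_nul = bytes([0xC0, 0x80])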

even on mac os x, where it's the encoding used for the unix-type
filesystem access, it's still not the default text encoding in
TextEdit, and utf-8 text files don't work (i.e. they open as
MacRoman or whatever Mac* encoding is paired with the OS language.)
fortunately this is configurable, unfortunately changing it breaks all
sorts of other stuff (apps frequently still ship with macroman README
files, etc.)

so basically, if you want it to work i recommend switching to linux,
unix, plan 9, or similar :(

On 3/17/07, Christopher Fynn [EMAIL PROTECTED] wrote:

Colin Paul Adams wrote:

 Rich == Rich Felker [EMAIL PROTECTED] writes:

 Rich Indeed, this was what I was thinking of. Thanks for
 Rich clarifying. BTW, any idea WHY they brought the UTF-16
 Rich nonsense to DOM/DHTML/etc.?

 I don't know for certain, but I can speculate well, I think.

 DOM was a micros**t invention (and how it shows!). NT was UCS-2
 (effectively).

AFAIK Unicode was originally only planned to be a 16-bit encoding.
The Unicode Consortium and ISO 10646 then agreed to synchronize the
two standards - though originally Unicode was only going to be a 16-bit
subset of the UCS. A little after that Unicode decided to support UCS
characters beyond plane 0.

Anyway at the time NT was being designed (late eighties) Unicode was
supposed to be limited to 65536 characters and UTF-8 hadn't been
thought of, so 16-bits probably seemed like a good idea.




Re: Non-ASCII characters in file names

2007-03-17 Thread Ben Wiley Sittler

emacs seems not to handle utf-8 filenames at all, regardless of locale.

On 3/17/07, SrinTuar [EMAIL PROTECTED] wrote:

 The test suite is currently distributed as a zip file. It so happens
 that the file concerned is named using ISO-8859-1 on the distributor's
 system. On my system, doing ls from the GNOME console shows the name
 as xgespr?ch.xml. Whereas Emacs dired shows the name as
 xgespräch.xml.

Zip files treat filenames as byte arrays, so zip tends to be clumsy
when you get zipfiles created on legacy systems. It's compatible with
utf-8 at least, so zipfiles you make yourself should have no problems.
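
A sketch of how those byte-array names surface in practice (my example,
with an invented archive name; Python 3's zipfile falls back to cp437
for member names unless an entry's UTF-8 flag is set):

import zipfile

with zipfile.ZipFile('testsuite.zip') as z:       # archive name is made up
    for info in z.infolist():
        name = info.filename
        if not info.flag_bits & 0x800:            # no UTF-8 flag: cp437 was assumed
            raw = name.encode('cp437')            # recover the stored bytes
            name = raw.decode('latin-1')          # the distributor's encoding here
        print(name)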

 So I went back to LANG=en_GB.UTF-8, unzipped the distribution again,
 and re-named the file, thanks to your help.

 ls now shows the correct file name. Emacs shows
 xgespräch.xml. And the test works.

I tried emacs and saw the same problem you did.  vim seems to work
correctly with locales. Although advising a switch to vim is probably
more responsible, a quick search revealed this link:
http://linux.seindal.dk/item32.html



 Has anyone any illuminating comments to make? I'm particularly
 interested in the distribution problem.

You could have the distributor change his locale to utf-8 and rename the files
on his filesystem.



Re: High-Speed UTF-8 to UTF-16 Conversion

2007-03-16 Thread Ben Wiley Sittler

I believe it's more DHTML that is the problem.

DOMString is specified to be UTF-16. Likewise for ECMAScript strings,
IIRC, although they may still be officially UCS-2.

In practice ECMAScript specifies (and implementations provide) such
minimal Unicode support (no canonicalization or character class
primitives [combining, etc.], for instance, and no way to work with
characters rather than UCS-2 codes/surrogate halves, and no access to
codecs other than UTF-8 <-> UTF-16 [often buggy and incomplete, and
rarely able to deal with errors in any way other than throwing
exceptions], nor any access to the Unicode names database or the
Unihan database) that applications are basically on their own.
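
The surrogate-half mismatch in miniature (a Python example of mine;
ECMAScript string indexing behaves like the UTF-16 view here):

s = '\U00010000'                         # first character outside the BMP
utf16_units = len(s.encode('utf-16-be')) // 2
print(len(s), utf16_units)               # 1 character, but 2 UTF-16 code units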

On 16 Mar 2007 21:59:06 +, Colin Paul Adams
[EMAIL PROTECTED] wrote:

 Rich == Rich Felker [EMAIL PROTECTED] writes:

Rich UTF-8. There's no good reason for using UTF-16 at all; it's
Rich just a bad implementation choice. IIRC either HTML or XML
Rich (yes I know they're different but I forget which does it..)

I don't ever recall seeing this in HTML, but it certainly isn't in
XML.
The only thing XML has to say on the subject is that XML parsers must
be able to read both.
--
Colin Adams
Preston Lancashire


Re: A call for fixing aterm/rxvt/etc...

2007-02-23 Thread Ben Wiley Sittler

just two cents: i did this some years back for the links and elinks
web browsers (it's the "utf-8 i/o" option available in some versions
of each) and the results are fairly mixed -- copy-n-paste fails
horribly in an app converted in this way, and i assume the same would
be true of a terminal emulator in a window system like X11. on the
other hand, it meant i and others could use these browsers on e.g. mac
os x years before someone undertook the much more in-depth utf-8 and
unicode support now in progress for elinks.

using luit for this sounds appealing, but in my experience luit (a)
crashes frequently and (b) is easily confused by escape sequences and
has no user interface for resetting all its iso-2022 state, so in
practice it works for only a few apps.

that said, it would probably be better than the current state of affairs.

On 2/23/07, Rich Felker [EMAIL PROTECTED] wrote:

These days we have at least xterm, urxvt, mlterm, gnome-terminal, and
konsole which support utf-8 fairly well, but on the flip side there's
still a huge number of terminal emulators which do not respect the
user's encoding at all and always behave in a legacy-8bit-codepage
way.

Trying to help users in #irssi, etc. with charset issues, I've come to
believe that it's a fairly significant problem: users get frustrated
with utf-8 because the terminal emulator they want to use (which might
be chosen based on anti-bloat sentiment or, quite the opposite, on a
desire for specialized eye candy only available in one or two
programs) forces their system into a mixed-encoding scenario where
they have both utf-8 and non-utf-8 data in the filesystem and text
files.

How hard would it be to go through the available terminal emulators,
evaluate which ones lack utf-8 support, and provide at least minimal
fixes? In particular, are there any volunteers?

What I'm thinking of as a minimal fix is just putting utf-8 conversion
into the input and output layers. It would still be fine for most
users of these apps if the terminal were limited to a 256-character
subset of UCS, didn't support combining characters or CJK, etc. as
long as the data sent and received over the PTY device is valid UTF-8,
so that the (valid and correct) assumption of applications running on
the terminal that characters are encoded in the locale's encoding is
satisfied.

Perhaps this could be done via a reverse luit -- that is, a program
like luit or an extension to luit that assumes the physical terminal
is using an 8bit legacy codepage rather than UTF-8. Then these
terminals could simply be patched to run luit if the locale's encoding
is not single-byte.

Rich


Re: Proposed fix for Malayalam ( other Indic?) chars and wcwidth

2006-10-16 Thread Ben Wiley Sittler

just tried this in a few terminals, here are the results:

GNOME Terminal 2.16.1:
U+0D30 U+0D4A displayed with width 3
U+0D30 U+0D46 U+0D3E displayed with width 3
NOTE: displays very differently in each case

Konsole 1.6.5:
U+0D30 U+0D4A displayed with width 3
U+0D30 U+0D46 U+0D3E displayed with width 4
NOTE: displays very differently in each case

mlterm 2.9.3:
U+0D30 U+0D4A displayed with width 2
U+0D30 U+0D46 U+0D3E displayed with width 2
NOTE: displays identically in each case

On 10/16/06, Bruno Haible [EMAIL PROTECTED] wrote:

Hello Rich,

 These characters are combining marks that attach on both
 sides of a cluster, and have canonical equivalence to the two separate
 pieces from which they are built, but yet Markus' wcwidth
 implementation and GNU libc assign them a width of 1. It appears very
 obvious to me that there's no hope of rendering both of these parts
 using only 1 character cell on a character cell device, and even if it
 were possible, it also seems horribly wrong for canonically equivalent
 strings to have different widths.

What rendering do other terminal emulators produce for these characters,
especially the ones from GNOME, KDE, Apple, and mlterm? I cannot submit
a patch to glibc based on the data of just 1 terminal emulator.

Bruno


Re: Next Generation Console Font?

2006-08-20 Thread Ben Wiley Sittler

see mlterm, please — some of these are very useful display forms, and
already in use for character-cell terminal emulators.

as for triple-cell glyphs, see emacs w/arabic presentation forms.

On 8/20/06, Rich Felker [EMAIL PROTECTED] wrote:

On Sat, Aug 19, 2006 at 11:20:55AM -0700, Ben Wiley Sittler wrote:
 sorry, cat-typing sent that email a bit early. here's the rest:

 for indic scripts and arabic having triple-cell ligatures is really
 indispensible for readable text.

 for east asian text a ttb, rtl columnar display mode is really, really
 nice.

For a terminal? Why? Do you want to see:

 l
 s

 -
 l
 [...]

??? I suspect not. If anyone really does want this behavior, then by
all means they can make a terminal with different orientation. But
until I hear about someone really wanting this I'll assume such claims
come from faux-counter-imperial chauvinism where western academics in
ivory towers tell people in other cultures that they must preserve
their traditions for their own sake with no regard for practicality,
and end up doing nothing but _disadvantaging_ people.

 a passable job at least for CJK. how to handle single-cell vs.
 double-cell vs. triple-cell glyphs in vertical presentation is a

I've never heard of a triple-cell glyph. Certainly the standard
wcwidth (Kuhn's version) has no such thing.

 tricky problem - short runs (<= 2 cells) should probably be displayed
 as horizontal inclusions, longer runs should probably be rotated.

Nonsense. A terminal does not have the luxury to decide such things.
You're confusing terminal with word processor or maybe even with
TeX...

 why don't we have escape sequences for switching between the DBCS and
 non-DBCS cell behaviors, and for rotating the terminal display for
 vertical text vs. horizontal text?

Because it's not useful. Applications will not use it. All the
terminal emulator needs to do is:

1. display raw text in a form that's not offensive -- this is
   necessary so that terminal-unaware programs just writing to stdout
   will work.

2. provide cursor positioning functions (minimal) and (optionally)
   scrolling/insert/delete and other small optimizations.

Anything more is just pure bloat because it won't be supported by
curses and applications are written either to curses or to vt102.

 Note that mixing vertical and
 horizontal is sometimes done in the typographic world but is probably
 not needed for terminal emulators (this requires a layout engine much
 more advanced than the unicode bidi algorithm, capable of laying out

This most certainly does not belong in a terminal emulator. Apps
(such as text based web browsers) wishing to do elegant
multi-orientation formatting can do the cursor positioning and such
themselves. Users preferring a vertical orientation can configure
their terminals as such. This is a matter of user preference, not
application control, and thus there should NOT be a way for
applications to control or override it.

Rich



Re: Next Generation Console Font?

2006-08-19 Thread Ben Wiley Sittler

for displaying doublebyte-charset documents the east asian width
semantics are indispensible. there are very good reasons to have two
modes for the terminal — east asian (all but ascii and explicitly
narrow kana/hangeul/etc. as two cells) and non-east-asian (all but
kanji/hanzi/hanja, hangeul, and kana single-width). the first is
cell-compatible with the DBCS terminals (useful for viewing forms,
character-cell art, webpages, etc., including e.g. doublewidth
cyrillic characters used as graphics) and the second with non-DBCS
terminals (actual cyrillic text, for example.)
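
The property behind those two modes is Unicode's East Asian Width; a
quick look via Python's unicodedata (my illustration, not from the
message):

import unicodedata

for ch in ('a', '\u30ab', '\u0410', '\uff76'):
    print('U+%04X' % ord(ch), unicodedata.east_asian_width(ch))
# -> Na (narrow), W (wide), A (ambiguous: 1 or 2 cells by mode), H (halfwidth)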

On 8/17/06, David Starner [EMAIL PROTECTED] wrote:

On 8/17/06, Rich Felker [EMAIL PROTECTED] wrote:
 This is nothing but glibc being idiotic. Yes it's _allowed_ to do this
 according to POSIX (POSIX makes no requirements about correspondence
 of the values returned to any other standard) but it's obviously
 incorrect for the width of À to be anything but 1, even if it was
 historically displayed wide (wtf?!) on some legacy CJK terminal types.

It's not obviously incorrect; in a CJK terminal, everything but ASCII
 was double-width, which was actually a very convenient way of doing
things. Many of these fonts are still around, and I suspect that many
users still use terminals that expect everything but ASCII to be
double-width. glibc here is merely supporting the way things work.


UTF-8B support for libiconv

2006-04-02 Thread Ben Wiley Sittler
[ this is in response to a truly ancient linux-utf8 thread ]

i wrote a patch that provides UTF-8 + binary in one codec with no
hand-waving, using Markus Kuhn's brilliant proposal to encode invalid
bytes 0xyz using unpaired surrogates U+DCyz. this means there need not
be a text/binary distinction for UTF-8-using programs. legal UTF-8
decodes/encodes correctly, and other bytes are handled as opaque
U+DCxx on input and correctly serialized on output. so one can once
again consider editing a binary format with a notepad-type editor
without sacrificing internationalization support.

Markus Kuhn's description of the idea (search for "option d"):

http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html

the patch:

http://xent.com/~bsittler/libiconv-1.9.1-utf-8b.diff

enjoy! (not sure how/whether this fits into the official distro, but i
hope it gets used)

-ben
