>> For example, if you get a line containing, in hex,
>>
>> d0 b0 d0 b0 d0 b0 20 20 20 20 20 20 20 20 40
>>
>> then (assuming 8-character tabstops and -a in effect), then under
>> 8859-1 you have (to use Unicode names) LATIN CAPITAL LETTER ETH and
>> DEGREE SIGN, with the pair repeated three times, and you thus
>> convert the first two of the spaces to a tab, but under UTF-8 you
>> have three instances of CYRILLIC SMALL LETTER A and you thus convert
>> the first five of the spaces to a tab.  (Handling tabs in the input
>> makes it even more complicated.)
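
The arithmetic in that example can be checked with a short sketch.  This
is only an illustration in Python, not toybox or NetBSD code: the
function name `leading_spaces_replaced` is made up here, and it assumes
that every decoded character occupies exactly one column (no combining
or double-width characters) and that the line contains no tabs.

```python
# The octets from the quoted example.
LINE = bytes.fromhex("d0 b0 d0 b0 d0 b0 20 20 20 20 20 20 20 20 40")


def leading_spaces_replaced(line: bytes, encoding: str, tabstop: int = 8) -> int:
    """Return how many spaces of the first run of spaces a single tab
    would replace under unexpand -a, or 0 if no tab would be inserted.

    Hypothetical helper: assumes one column per character and no tabs
    in the input.
    """
    text = line.decode(encoding)
    col = text.index(" ")                 # characters before the first space
    rest = text[col:]
    run = len(rest) - len(rest.lstrip(" "))  # length of the run of spaces
    need = tabstop - col % tabstop        # spaces needed to reach the tab stop
    # -a only inserts a tab when it replaces two or more characters.
    return need if run >= need and need >= 2 else 0


if __name__ == "__main__":
    for enc in ("iso-8859-1", "utf-8"):
        print(enc, leading_spaces_replaced(LINE, enc))
```

Under iso-8859-1 the six octets before the spaces decode to six
characters, so the text ends at column 6 and two spaces reach the tab
stop; under utf-8 they decode to three characters, so five spaces do.
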
> From the NetBSD Manpage you quote later:
> "If the -a option is given, then tabs are inserted whenever they
> would compress the resultant file by replacing two or more
> characters."

> Correct me if I'm wrong, But I don't see how utf8 has anything to do
> with this?

unexpand, as defined by that manpage, is defined to operate on
characters, not bytes.  Thus, questions such as "UTF-8 or 8859-1 or
what?" are relevant, because they affect whether (for example) the
first two octets of the line I gave constitute two characters, one
character, part of one character, or what.

Interpreted as 8859-1, the string of octets I gave starts with six
nonblank characters, so, under the assumptions I described, the first
two spaces should be converted to a tab.  Under UTF-8, there are only
three characters - represented by six octets - before the string of
spaces, so the first five spaces should be replaced.  (I'd cite other
examples, but UTF-8 is the only multibyte character encoding I know
well enough to give an example of.)

> Was hoping to never have to touch utf8 while writing this.

Don't blame you.  I think UTF-8 is a major botch (variable-sized
character representations? seriously??) and have as little to do with
it as I can manage.  Unfortunately, fixing it right requires converting
"everything" from streams of octets to streams of Unicode codepoints,
which is a lot of work.  (I want to do it, someday, but it will be a
major undertaking.)

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
_______________________________________________
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net