Hey GNU wizards, today I subscribed to this mailing list for a more or less philosophical question, which is already the subject of this mail: what's the reason for suppressing short Unicode characters in printf?
But let's start with a bit of historical background on how I stumbled over the following issue. Like many Linux users I'm a terminal junkie, or rather I was a terminal junkie, until I discovered even better ways to edit and fire command lines, but that would be part of another big story. So, to help me locate Unicode characters, I wrote a little script that does its best to print any Unicode character in a table:

    #!/bin/bash
    if [ $# -lt 1 ]
    then
        echo "too few arguments. I need the table in hex form xxx, where the lower byte will be replaced."
        exit 1
    fi
    let t=0x$1
    echo TABLE $t
    let c=$(tput cols)
    let c=$c/16
    let ci=0
    for i in {0..255}
    do
        # build the \UXXXXXXXX escape for this codepoint, then let printf expand it
        C=$(/usr/bin/printf "\\\\U000%03x%02x" $t $i)
        form="%03x%02x: $C\t"
        /usr/bin/printf "$form" $t $i
        let ++ci
        if [ $ci -ge $c ]
        then
            /usr/bin/printf "\n"
            let ci=0
        fi
    done

This script might not be the best approach, but it is still quite useful. For example,

    # unicode-table-yyy 001

gives

    TABLE 1
    00100: Ā	00101: ā	00102: Ă	00103: ă	00104: Ą
    […]
    001fa: Ǻ	001fb: ǻ	001fc: Ǽ	001fd: ǽ	001fe: Ǿ	001ff: ǿ

I also use it to copy & paste in tmux terminals when I need a character from a selected range. Of course it helps to have a full Unicode font installed.

But when I want to show table #0, I get errors such as

    […]
    0003c: /usr/bin/printf: invalid universal character name \U0000003c
    0003d: /usr/bin/printf: invalid universal character name \U0000003d
    0003e: /usr/bin/printf: invalid universal character name \U0000003e
    0003f: /usr/bin/printf: invalid universal character name \U0000003f
    00040: @
    […]

even though `man -s 1 printf` tells us to use

    \UHHHHHHHH
           Unicode character with hex value HHHHHHHH (8 digits)

This escape is implemented in a rather complex way, to support some seldom-used terminals with non-UTF-8 encodings (I hope non-UTF-8 encodings are seldom these days):

    263	      else if (*p == 'u' || *p == 'U')
    264	        {
    265	          char esc_char = *p;
    266	          unsigned int uni_value;
    267	
    268	          uni_value = 0;
    269	          for (esc_length = (esc_char == 'u' ?
                                           4 : 8), ++p;
    270	               esc_length > 0;
    271	               --esc_length, ++p)
    272	            {
    273	              if (! isxdigit (to_uchar (*p)))
    274	                error (EXIT_FAILURE, 0, _("missing hexadecimal number in escape"));
    275	              uni_value = uni_value * 16 + hextobin (*p);
    276	            }
    277	
    278	          /* A universal character name shall not specify a character short
    279	             identifier in the range 00000000 through 00000020, 0000007F through
    280	             0000009F, or 0000D800 through 0000DFFF inclusive. A universal
    281	             character name shall not designate a character in the required
    282	             character set. */
    283	          if ((uni_value <= 0x9f
    284	               && uni_value != 0x24 && uni_value != 0x40 && uni_value != 0x60)
    285	              || (uni_value >= 0xd800 && uni_value <= 0xdfff))
    286	            error (EXIT_FAILURE, 0, _("invalid universal character name \\%c%0*x"),
    287	                   esc_char, (esc_char == 'u' ? 4 : 8), uni_value);
    288	
    289	          print_unicode_char (stdout, uni_value, 0);
    290	        }

    -- from coreutils-8.23 src/printf.c

Yet another story would be the implementation of `print_unicode_char`, but my case is the if clause on lines [283,285], which suppresses some Unicode values: part of table 000 and all of tables [0d8,0df]. What's the reason for this exception?

The reason against this exception is clear: when you fail for some values of a set C, anyone who uses your program with input from set C has to implement these exceptions too. Such spikes in the input set carry through the whole I/O chain: anyone who uses your `printf` program has to reimplement these exceptions to build an error-free algorithm on input characters from set C, making the complex world of programming computers even more complex than it intrinsically is anyway.

Looking forward to an interesting discussion about complexity, and with kind regards,
Ingo Krabbe

--
Liberty for the Modules! -- https://medium.com/@azerbike/i-ve-just-liberated-my-modules-9045c06be67c
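P.S. For anyone bitten by the same check, here is a minimal workaround sketch, assuming a UTF-8 locale and coreutils printf; the helper name `ucp` is my own invention, nothing standard. Codepoints up to 0x7f are plain ASCII bytes, so an octal byte escape sidesteps the \u/\U range check; everything else goes through \U as documented.

```shell
#!/bin/bash
# ucp CODEPOINT - print the character for the given Unicode codepoint.
# Codepoints <= 0x7f are emitted as a single ASCII byte via an octal
# escape, which coreutils printf accepts without any range check;
# higher codepoints use the documented \UHHHHHHHH escape.
ucp () {
    if [ "$(($1))" -le 127 ]
    then
        /usr/bin/printf "\\$(/usr/bin/printf '%03o' "$(($1))")"
    else
        /usr/bin/printf "\\U$(/usr/bin/printf '%08x' "$(($1))")"
    fi
}

ucp 0x3c    # prints '<', which \U0000003c refuses
ucp 0x100   # prints 'Ā' in a UTF-8 locale
```

This does not help with the surrogate range [0d8,0df], but surrogates are not valid Unicode scalar values in any encoding form, so rejecting those at least seems defensible.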