Hey GNU wizards, today I subscribed to this mailing list for a more or less philosophical question, which is already the subject of this mail: what's the reason for suppressing short Unicode characters in printf?
But let's start with a bit of historical background on how I stumbled over the following issue. Like many Linux users I'm a terminal junkie, or rather I was a terminal junkie, until I discovered even better ways to edit and fire command lines, but that would be part of another big story. So, to help me locate Unicode characters, I wrote a little script that does its best to print any Unicode character in a table:

    #!/bin/bash
    if [ $# -lt 1 ]
    then
        echo "too few arguments. I need the table in hex form xxx, where the lower byte will be replaced."
        exit 1
    fi
    let t=0x$1
    echo TABLE $t
    let c=$(tput cols)
    let c=$c/16
    let ci=0
    for i in {0..255}
    do
        # build the \UXXXXXXXX escape for this codepoint, then let printf expand it
        C=$(/usr/bin/printf "\\\\U000%03x%02x" $t $i)
        form="%03x%02x: $C\t"
        /usr/bin/printf "$form" $t $i
        let ++ci
        if [ $ci -ge $c ]
        then
            /usr/bin/printf "\n"
            let ci=0
        fi
    done

This script might not be the best approach, but it is still quite useful. For example,

    # unicode-table-yyy 001

gives

    TABLE 1
    00100: Ā	00101: ā	00102: Ă	00103: ă	00104: Ą
    […]
    001fa: Ǻ	001fb: ǻ	001fc: Ǽ	001fd: ǽ	001fe: Ǿ	001ff: ǿ

I also use it to copy & paste in tmux terminals when I need a character from a selected range. Of course it helps to have a full Unicode font installed.

But when I want to show table #0, I get errors such as

    […]
    0003c: /usr/bin/printf: invalid universal character name \U0000003c
    0003d: /usr/bin/printf: invalid universal character name \U0000003d
    0003e: /usr/bin/printf: invalid universal character name \U0000003e
    0003f: /usr/bin/printf: invalid universal character name \U0000003f
    00040: @
    […]

even though `man -s 1 printf` tells us to use

    \UHHHHHHHH
           Unicode character with hex value HHHHHHHH (8 digits)

This escape is implemented in a rather complex way, to support some seldom-used terminals with non-UTF-8 encodings (I hope non-UTF-8 encodings are seldom these days):

    263	      else if (*p == 'u' || *p == 'U')
    264	        {
    265	          char esc_char = *p;
    266	          unsigned int uni_value;
    267	
    268	          uni_value = 0;
    269	          for (esc_length = (esc_char == 'u' ?
                                           4 : 8), ++p;
    270	               esc_length > 0;
    271	               --esc_length, ++p)
    272	            {
    273	              if (! isxdigit (to_uchar (*p)))
    274	                error (EXIT_FAILURE, 0, _("missing hexadecimal number in escape"));
    275	              uni_value = uni_value * 16 + hextobin (*p);
    276	            }
    277	
    278	          /* A universal character name shall not specify a character short
    279	             identifier in the range 00000000 through 00000020, 0000007F through
    280	             0000009F, or 0000D800 through 0000DFFF inclusive. A universal
    281	             character name shall not designate a character in the required
    282	             character set. */
    283	          if ((uni_value <= 0x9f
    284	               && uni_value != 0x24 && uni_value != 0x40 && uni_value != 0x60)
    285	              || (uni_value >= 0xd800 && uni_value <= 0xdfff))
    286	            error (EXIT_FAILURE, 0, _("invalid universal character name \\%c%0*x"),
    287	                   esc_char, (esc_char == 'u' ? 4 : 8), uni_value);
    288	
    289	          print_unicode_char (stdout, uni_value, 0);
    290	        }

    -- from coreutils-8.23 src/printf.c

Yet another story would be the implementation of `print_unicode_char`, but my case is the if clause on lines [283,285], which suppresses some Unicode values: part of table 000 and all of tables [0d8,0df]. What's the reason for this exception?

The reason against this exception is clear: when you fail for some values of a set C, anyone who uses your program with input from set C has to implement these exceptions too. Such spikes in the input set carry through the whole I/O chain: anyone who uses your `printf` program has to reimplement these exceptions to build an error-free algorithm on input characters from set C, making the complex world of programming computers even more complex than it intrinsically is anyway.

Looking forward to an interesting discussion about complexity, and with kind regards,
Ingo Krabbe

--
Liberty for the Modules! -- https://medium.com/@azerbike/i-ve-just-liberated-my-modules-9045c06be67c
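P.S. For anyone bitten by the same check, here is a minimal workaround sketch, assuming a UTF-8 locale and coreutils printf; the helper name `ucp` is my own invention, nothing standard. Codepoints up to 0x7f are plain ASCII bytes, so an octal byte escape sidesteps the \u/\U range check; everything else goes through \U as documented.

```shell
#!/bin/bash
# ucp CODEPOINT - print the character for the given Unicode codepoint.
# Codepoints <= 0x7f are emitted as a single ASCII byte via an octal
# escape, which coreutils printf accepts without any range check;
# higher codepoints use the documented \UHHHHHHHH escape.
ucp () {
    if [ "$(($1))" -le 127 ]
    then
        /usr/bin/printf "\\$(/usr/bin/printf '%03o' "$(($1))")"
    else
        /usr/bin/printf "\\U$(/usr/bin/printf '%08x' "$(($1))")"
    fi
}

ucp 0x3c    # prints '<', which \U0000003c refuses
ucp 0x100   # prints 'Ā' in a UTF-8 locale
```

This does not help with the surrogate range [0d8,0df], but surrogates are not valid Unicode scalar values in any encoding form, so rejecting those at least seems defensible.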