Package: unicode
Version: 2.8-1.1
Severity: normal

unicode(1) documents that, by default, it produces output in the charset
nominated by the user's locale. (This is in the documentation of the "-i"
option, which can be used to specify an output charset; it specifically
says that with a "properly set up locale" the option should not be needed.)
In fact it does not reliably respect the environmental locale; whether it
uses it depends partly on which environment variables are being used to
specify it and partly on which locale is specified. (I'm not clear on what
the logic actually is.) If it doesn't use the locale then, empirically,
it outputs in UTF-8, which is not a safe default.
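
For comparison, the documented behaviour could be sketched as follows. This is only an illustration under stated assumptions, not unicode(1)'s actual code: the helper names locale_charset and encode_for_charset are hypothetical, and substituting "?" for unrepresentable characters is an assumption chosen to match the 3f byte seen in the transcripts where the locale is respected.

```python
import locale
import sys


def locale_charset():
    """Return the charset named by the environment's locale.

    Hypothetical helper: setlocale() with an empty string applies the
    usual LC_ALL, then LC_CTYPE, then LANG precedence.
    """
    try:
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # Unknown or uninstalled locale: the process stays in the C locale.
        pass
    return locale.nl_langinfo(locale.CODESET)


def encode_for_charset(text, charset):
    """Encode text, replacing characters the charset cannot represent with '?'."""
    return text.encode(charset, errors="replace")


if __name__ == "__main__":
    line = "\u2603 U+2603 SNOWMAN\n"
    # Write bytes directly so that the charset chosen above is the one used.
    sys.stdout.buffer.write(encode_for_charset(line, locale_charset()))
```

One caveat with this sketch: CPython 3.7 and later may coerce a C locale to a UTF-8 one at interpreter startup (PEP 538) unless LC_ALL is set or PYTHONCOERCECLOCALE=0 is given, so the sketch itself can emit UTF-8 under LANG=C unless that coercion is disabled.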

If no locale-relevant environment variables are set, meaning that the
environmental locale is C, then unicode(1) doesn't respect the locale,
and outputs in UTF-8:

$ env - locale charmap
ANSI_X3.4-1968
$ env - unicode --brief -x 2603 | od -tx1
0000000 e2 98 83 20 55 2b 32 36 30 33 20 53 4e 4f 57 4d
0000020 41 4e 0a
0000023

If the C locale is explicitly set in the LANG environment variable,
then unicode(1) doesn't respect the locale, and outputs in UTF-8:

$ env - LANG=C locale charmap
ANSI_X3.4-1968
$ env - LANG=C unicode --brief -x 2603 | od -tx1
0000000 e2 98 83 20 55 2b 32 36 30 33 20 53 4e 4f 57 4d
0000020 41 4e 0a
0000023

But if a Latin-1 locale is set in the LANG environment variable, then
unicode(1) does respect it:

$ env - LANG=de_DE.iso88591 locale charmap
ISO-8859-1
$ env - LANG=de_DE.iso88591 unicode --brief -x 2603 | od -tx1
0000000 3f 20 55 2b 32 36 30 33 20 53 4e 4f 57 4d 41 4e
0000020 0a
0000021

Specifying the charset aspect of a locale using the LC_CTYPE environment
variable produces the same locale-dependent results as using LANG:

$ env - LC_CTYPE=C locale charmap
ANSI_X3.4-1968
$ env - LC_CTYPE=C unicode --brief -x 2603 | od -tx1
0000000 e2 98 83 20 55 2b 32 36 30 33 20 53 4e 4f 57 4d
0000020 41 4e 0a
0000023
$ env - LC_CTYPE=de_DE.iso88591 locale charmap
ISO-8859-1
$ env - LC_CTYPE=de_DE.iso88591 unicode --brief -x 2603 | od -tx1
0000000 3f 20 55 2b 32 36 30 33 20 53 4e 4f 57 4d 41 4e
0000020 0a
0000021

If a locale is specified using the LC_ALL environment variable, however,
then it seems to always be respected:

$ env - LC_ALL=C locale charmap
ANSI_X3.4-1968
$ env - LC_ALL=C unicode --brief -x 2603 | od -tx1
0000000 3f 20 55 2b 32 36 30 33 20 53 4e 4f 57 4d 41 4e
0000020 0a
0000021
$ env - LC_ALL=de_DE.iso88591 locale charmap
ISO-8859-1
$ env - LC_ALL=de_DE.iso88591 unicode --brief -x 2603 | od -tx1
0000000 3f 20 55 2b 32 36 30 33 20 53 4e 4f 57 4d 41 4e
0000020 0a
0000021

The impact of this bug is that for many reasonable setups, with locale
correctly described in the environment and matching actual output device
capabilities, unicode(1) generates output containing spurious control
characters that not only don't display what was intended but also
sometimes screw up the display of other parts of the output. I'm
specifically seeing that result with a Latin-1 terminal emulator and
the C locale.

-zefram