On 03/13/2013 09:34 PM, Eric Blake wrote:
> On 03/13/2013 02:16 PM, Marc Grondin wrote:
>> Good Afternoon,
>
> Hello, and thanks for the report.
>
>>
>> My client was attempting to run the command : od -c on this xml file (sample
>> only)
>> ------------------------------------------------------------------------------
>> <?xml version = '1.0' encoding = 'UTF-8'?>
>> <top>
>> <x>丸</x>
>
> Here, you are representing a character in UTF-8
>
>> He was getting this output :
>> ------------------------------------------------------------------------------
>> 0000000   <   ?   x   m   l       v   e   r   s   i   o   n       =
>> 0000020   '   1   .   0   '       e   n   c   o   d   i   n   g       =
>> 0000040       '   U   T   F   -   8   '   ?   >  \n   <   t   o   p   >
>> 0000060  \n   <   x   >   �   �   �   <   /   x   >  \n
>
> and here, you were running od in a different character set:
>
>> This all based on the LANG env. He was using :
>> LANG=en_US.iso88591, instead of
>> LANG=en_US.UTF-8
>
> In ISO-88591, every byte is a character, and those particular bytes
> happen to be printable, so od was faithfully replaying the character as
> printable, only to then be shown by your UTF-8 terminal as an invalid
> UTF-8 sequence.  Mismatching character sets between your program and
> your terminal is always a recipe for confusion.
>
> However, you HAVE identified a bug, in our documentation.
>
>>
>> ------------------------------------------------------------------------------
>>
>> Question :
>> Since the output is based on the ASCII character set, should it not, in both
>> cases give a numerical output (as it did in scenario #2)
>> for a symbol outside the ascii/extended-ascii character set ?
>
> Our documentation is lying.  Here's what POSIX says about od -c:
>
> http://pubs.opengroup.org/onlinepubs/9699919799/utilities/od.html
> "Interpret bytes as characters specified by the current setting of the
> LC_CTYPE category. Certain non-graphic characters appear as C escapes:
> "NUL=\0" , "BS=\b" , "FF=\f" , "NL=\n" , "CR=\r" , "HT=\t" ; others
> appear as 3-digit octal numbers."
>
> Nothing in there restricts the output to ASCII only.  The bytes that are
> showing up as � are graphic characters in your current choice of
> LC_CTYPE, so there is no escaping performed (since escaping is permitted
> only on non-graphic characters).  If your terminal was using the same
> character set as you ran od under, you would see proper graphical
> characters in the ISO-88591 set (but then again, you wouldn't see the
> nice 丸 character that the UTF-8 was representing).
>
> Coreutils is properly obeying the locale, what is wrong is the info
> documentation which stated:
>
> `-c'
>      Output as ASCII characters or backslash escapes.

I agree. Thanks for the detailed description.

> In reality, that should state something like:
>      Output as characters in the current locale, using octal sequences
> or backslash escapes for all non-graphic bytes.

Note that we output spaces as-is, so I'd s/non-graphic/non-printable/.
Also, multibyte characters are always going to be problematic to display
in a grid like this, so we'll probably continue to do what we do now for
the UTF-8 example above and output octal and dots.
So therefore s/characters/single byte characters/.

>
> Meanwhile, if you want to guarantee ASCII-only output from od, you have
> to use a different format, such as -b or -tx1, or use LC_ALL=C on a
> system where the C locale does not treat non-ascii bytes as graphical
> characters (most glibc systems, including the one you are using, fit
> this bill).
>

cheers,
Pádraig.
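
P.S. To illustrate the per-byte rule discussed above, here is a minimal
Python sketch. It is not od's actual implementation: render_byte and the
ESCAPES table are hypothetical names, the Python codecs "ascii" and
"latin-1" stand in for the C and ISO-8859-1 locales, and str.isprintable()
is only an approximation of what the locale considers printable.

```python
# Sketch of the "od -c" per-byte rule for single byte character sets.
# Assumption: the escape list is the one quoted from POSIX above.
ESCAPES = {
    0x00: r"\0",
    0x08: r"\b",
    0x0C: r"\f",
    0x0A: r"\n",
    0x0D: r"\r",
    0x09: r"\t",
}


def render_byte(b: int, charset: str) -> str:
    """Render one byte as a C escape, a printable character, or 3-digit octal."""
    if b in ESCAPES:
        return ESCAPES[b]
    try:
        ch = bytes([b]).decode(charset)
    except UnicodeDecodeError:
        # The byte is not a character in this charset: fall back to octal.
        return f"{b:03o}"
    # Printable characters (including space) pass through; the rest go octal.
    return ch if ch.isprintable() else f"{b:03o}"


data = "丸".encode("utf-8")  # the three bytes e4 b8 b8

# ASCII-only stand-in for the C locale: every byte becomes octal.
print([render_byte(b, "ascii") for b in data])

# ISO-8859-1 stand-in: each byte is itself a printable character,
# which a UTF-8 terminal would then fail to display when sent raw.
print([render_byte(b, "latin-1") for b in data])
```

Under the ASCII stand-in the three bytes come out as 344 270 270, while
under the ISO-8859-1 stand-in they pass through as single byte characters,
which matches the two behaviours reported in this thread.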
