bug#13947: bug report for core-utils command : OD

Pádraig Brady Wed, 13 Mar 2013 14:54:02 -0700

On 03/13/2013 09:34 PM, Eric Blake wrote:
> On 03/13/2013 02:16 PM, Marc Grondin wrote:
>> Good Afternoon, 
> 
> Hello, and thanks for the report.
> 
>>
>> My client was attempting to run the command : od -c on this xml file (sample 
>> only) 
>> ------------------------------------------------------------------------------
>> <?xml version = '1.0' encoding = 'UTF-8'?>
>> <top>
>>    <x>丸</x>
> 
> Here, you are representing a character in UTF-8
> 
>> He was getting this output : 
>> ------------------------------------------------------------------------------
>> 0000000   <   ?   x   m   l       v   e   r   s   i   o   n       =    
>> 0000020   '   1   .   0   '       e   n   c   o   d   i   n   g       =
>> 0000040       '   U   T   F   -   8   '   ?   >  \n   <   t   o   p   >
>> 0000060  \n               <   x   >   �   �   �   <   /   x   >  \n    
> 
> and here, you were running od in a different character set:
> 
>> This is all based on the LANG env.  He was using : 
>> LANG=en_US.iso88591, instead of
>> LANG=en_US.UTF-8 
> 
> In ISO-8859-1, every byte is a character, and those particular bytes
> happen to be printable, so od was faithfully replaying the character as
> printable, only to then be shown by your UTF-8 terminal as an invalid
> UTF-8 sequence.  Mismatching character sets between your program and
> your terminal is always a recipe for confusion.
> 
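
To make the mismatch concrete, here is a minimal Python sketch of what
happens to the three bytes of 丸. It is only an illustration, not how od
is implemented; the variable names and the 4-column padding are
assumptions made for the example.

```python
# Bytes of the sample character as stored in the UTF-8 file.
raw = "丸".encode("utf-8")  # three bytes: e4 b8 b8

# Under an ISO-8859-1 locale, each byte is a character of its own.
as_latin1 = raw.decode("iso-8859-1")

# od -c writes each such byte through unchanged, padded into a column.
# The 3-space padding is an assumption for illustration.
od_like = b"".join(b"   " + bytes([b]) for b in raw)

# A UTF-8 terminal then decodes the isolated bytes and finds them invalid,
# so it shows a replacement character for each one.
shown = od_like.decode("utf-8", errors="replace")

print(raw.hex(" "))
print(as_latin1)
print(shown)
```

Each byte is a printable character on its own in ISO-8859-1, but none of
them is a valid UTF-8 sequence once separated by padding.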
> However, you HAVE identified a bug, in our documentation.
> 
>>
>> ------------------------------------------------------------------------------
>>
>> Question : 
>> Since the output is based on the ASCII character set, should it not, in both 
>> cases give a numerical output (as it did in scenario #2) 
>> for a symbol outside the ascii/extended-ascii character set ? 
> 
> Our documentation is lying.  Here's what POSIX says about od -c:
> 
> http://pubs.opengroup.org/onlinepubs/9699919799/utilities/od.html
> "Interpret bytes as characters specified by the current setting of the
> LC_CTYPE category. Certain non-graphic characters appear as C escapes:
> "NUL=\0" , "BS=\b" , "FF=\f" , "NL=\n" , "CR=\r" , "HT=\t" ; others
> appear as 3-digit octal numbers."
> 
> Nothing in there restricts the output to ASCII only.  The bytes that are
> showing up as � are graphic characters in your current choice of
> LC_CTYPE, so there is no escaping performed (since escaping is permitted
> only on non-graphic characters).  If your terminal was using the same
> character set as you ran od under, you would see proper graphical
> characters in the ISO-8859-1 set (but then again, you wouldn't see the
> nice 丸 character that the UTF-8 was representing).
> 
> Coreutils is properly obeying the locale; what is wrong is the info
> documentation, which stated:
> 
> `-c'
>      Output as ASCII characters or backslash escapes.


I agree. Thanks for the detailed description.

> In reality, that should state something like:

>      Output as characters in the current locale, using octal sequences
> or backslash escapes for all non-graphic bytes.

Note we output spaces, so I'd s/non-graphic/non-printable/.

Also, multi-byte characters are always going to be problematic to display
in a grid like this, so we'll probably continue to do as
we do now for the UTF-8 example above and output octal and dots.
Therefore s/characters/single byte characters/.
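
The proposed wording can be read as a per-byte rule. The following is a
rough Python model of that rule, not the coreutils code: render_byte and
od_c_model are hypothetical names, the escape table is the POSIX list
quoted above, and Python's str.isprintable() only approximates the C
library's isprint() for a given locale.

```python
# C escapes listed in the POSIX text for od -c.
C_ESCAPES = {
    0x00: r"\0", 0x08: r"\b", 0x0C: r"\f",
    0x0A: r"\n", 0x0D: r"\r", 0x09: r"\t",
}


def render_byte(byte: int, charset: str) -> str:
    """Render one byte as a single-byte character, an escape, or octal."""
    if byte in C_ESCAPES:
        return C_ESCAPES[byte]
    try:
        ch = bytes([byte]).decode(charset)
    except UnicodeDecodeError:
        # Not a character in this charset at all: fall back to octal.
        return f"{byte:03o}"
    # Printable (which includes space) is shown as-is, the rest as octal.
    return ch if ch.isprintable() else f"{byte:03o}"


def od_c_model(data: bytes, charset: str) -> list:
    """Apply render_byte to every byte; multi-byte sequences are not joined."""
    return [render_byte(b, charset) for b in data]


sample = "<x>丸</x>\n".encode("utf-8")
print(od_c_model(sample, "ascii"))       # stands in for an ASCII-only locale
print(od_c_model(sample, "iso-8859-1"))  # stands in for en_US.iso88591
```

With "ascii" the three bytes of 丸 come out as 344 270 270; with
"iso-8859-1" they come out as three printable single-byte characters.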

> 
> Meanwhile, if you want to guarantee ASCII-only output from od, you have
> to use a different format, such as -b or -tx1, or use LC_ALL=C on a
> system where the C locale does not treat non-ascii bytes as graphical
> characters (most glibc systems, including the one you are using, fit
> this bill).
> 
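As an illustration of why the numeric formats are locale-independent,
here is a small Python sketch that prints byte values the way -tx1 (hex)
or -b (octal) would, minus the offsets and column layout. dump_bytes is a
hypothetical helper for this example, not part of any od implementation.

```python
def dump_bytes(data: bytes, fmt: str = "x1") -> str:
    """Format each byte numerically: "x1" for hex, "o1" for octal."""
    if fmt == "x1":
        return " ".join(f"{b:02x}" for b in data)
    if fmt == "o1":
        return " ".join(f"{b:03o}" for b in data)
    raise ValueError(f"unsupported format: {fmt}")


raw = "丸".encode("utf-8")
print(dump_bytes(raw, "x1"))
print(dump_bytes(raw, "o1"))
```

The result contains only ASCII digits, letters and spaces, whatever the
input bytes or the locale.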

cheers,
Pádraig.
