Re: od: display verbatim utf8 characters along side dump

enh Thu, 03 Apr 2025 07:09:57 -0700

about once a year i have this thought too, usually while looking at
[possible] malware with non-latin identifiers/strings ... my idea was
always that the character should take the first slot, and then i'd use
a unicode ellipsis (or something visually distinct from the usual `.`)
to show which bytes were the non-initial bytes of that character.
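
a minimal python sketch of that layout, to make the idea concrete (not a
toybox patch; `text_column` and `marker` are made-up names, and it only
covers the simple case, ignoring the width and combining problems listed
below):

```python
def text_column(data: bytes, marker: str = "…") -> str:
    """One slot per byte: a decoded character takes the slot of its first
    byte, and each of its continuation bytes is shown as `marker`."""
    slots = []
    i = 0
    while i < len(data):
        b = data[i]
        # Sequence length implied by the UTF-8 lead byte.
        if b < 0x80:
            n = 1
        elif 0xC2 <= b <= 0xDF:
            n = 2
        elif 0xE0 <= b <= 0xEF:
            n = 3
        elif 0xF0 <= b <= 0xF4:
            n = 4
        else:
            n = 0  # stray continuation byte or invalid lead byte
        try:
            ch = data[i:i + n].decode("utf-8") if n else ""
        except UnicodeDecodeError:
            ch = ""  # truncated or malformed sequence
        if ch and ch.isprintable():
            slots.append(ch)
            slots.extend([marker] * (n - 1))
            i += n
        else:
            slots.append(".")  # same fallback as the usual dump
            i += 1
    return "".join(slots)


print(text_column("this is á 🐕".encode("utf-8")))
```

whether the result actually lines up on screen depends on the terminal's
idea of each character's width, which is exactly where the trouble starts.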

reasons why i've never sent a toybox patch include:
1. what about characters for which wcwidth() returns 2? especially if
they land on the last "slot" of the text part of the dump?
2. what about combining characters?
  a. for accents etc it's probably fine to show them as separate characters.
  b. but for things like flags ... there's no obvious good way to deal
with that that doesn't involve pretty deep understanding of how to
combine all the relevant code points.
3. what about characters for which wcwidth() returns 0?
4. even for simple cases, this would mean saving state between lines,
if a three-byte character starts two bytes from the end of a
line.
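
for point 4, a rough python sketch of the state that has to survive a line
break. `seq_len`, `text_columns`, `width`, and `pending` are hypothetical
names, and it assumes the whole input is in memory so the decoder can look
past the end of the current line; a streaming tool would need lookahead or
buffering instead:

```python
def seq_len(lead: int) -> int:
    """UTF-8 sequence length implied by a lead byte, or 0 if invalid."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def text_columns(data: bytes, width: int = 16, marker: str = "…"):
    """Yield the text column for each `width`-byte line of `data`."""
    pending = 0  # continuation slots owed by a character from the previous line
    for start in range(0, len(data), width):
        end = min(start + width, len(data))
        # Pay off continuation bytes that spilled over from the previous line.
        slots = [marker] * min(pending, end - start)
        pending -= len(slots)
        i = start + len(slots)
        while i < end:
            n = seq_len(data[i])
            try:
                ch = data[i:i + n].decode("utf-8") if n else ""
            except UnicodeDecodeError:
                ch = ""
            if ch and ch.isprintable():
                slots.append(ch)
                inline = min(n - 1, end - i - 1)  # continuation bytes on this line
                slots.extend([marker] * inline)
                pending = n - 1 - inline          # the rest land on the next line
                i += n
            else:
                slots.append(".")
                i += 1
        yield "".join(slots)


# A three-byte character (the euro sign) starting two bytes from the end of a line.
for line in text_columns(b"a" * 14 + "€".encode("utf-8") + b"b"):
    print(line)
```

the second line starts with a marker for a character that is only visible
on the first line, which is the kind of thing a reader might find confusing.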

is it still useful if you only handle the simple cases? maybe. but
this has always felt like too big a can of worms for me to send a
patch...

i'd certainly be curious to know whether there's any precedent in
other tools (including GUI ones).

On Thu, Apr 3, 2025 at 7:58 AM Pádraig Brady <p...@draigbrady.com> wrote:
>
> On 03/04/2025 07:10, Avid Seeker wrote:
> > ```
> > $ echo -n this is a 🐕 | od -cx
> > 0000000   t   h   i   s       i   s       a     360 237 220 225
> >             6874    7369    6920    2073    2061    9ff0    9590
> > 0000016
> > ```
> >
> > Can od print UTF-8 characters verbatim instead of encoding them in octal?
> >
> > I guess as its name suggests, that's not possible. But if it can do it
> > for ASCII characters what prevents it from also applying it to UTF-8
> > characters?
> >
> > If it's not possible, any suggestions or alternative tools would be
> > appreciated.
> >
> > Avid
>
>
> Well there is a bit of a layout issue with multi-byte chars.
> Which byte do you align the literal character with?
> Also if aligning with spaces there is ambiguity as to whether
> there was a space there in the input or not.
>
>    $ echo -n this is á 🐕 | od -tc -tx1
>    0000000   t   h   i   s       i   s     303 241     360 237 220 225
>             74  68  69  73  20  69  73  20  c3  a1  20  f0  9f  90  95
>
>    $ echo -n this is á 🐕 | od  -tx1z
>    0000000 74 68 69 73 20 69 73 20 c3 a1 20 f0 9f 90 95     >this is .. ....<
>
>
> Now in the first form above at least I guess there isn't much ambiguity
> with spaces, and we could continue to align multi-byte chars to the last
> nibble.
>
> thanks,
> Pádraig
>
