Some requirements that came up in a non-email discussion:
* ability to edit and save data - editing at the hex and bit level. File size
limits for editing are acceptable.
* display of bits
* right-to-left byte order
* word-size-sensitive display (e.g., the user can set the display to 70 bits wide)
* support for the parse & unparse cycle (comparing the output data from
unparse against the input data in the hex/bits display)

The rest of this email is a bunch of random pointers/ideas about binary
data display/edit. Hopefully useful, not TL;DR.

This is what some least-significant-bit-first data lines look like. They
are 70 bits wide, because that's the word size of the format. The byte
order is right-to-left.

    00 1100 0011 1000 0000 0000 0000 0000 0100 0000 0101 0000 0000 1000 0000 1110 1000 1100
    00 0000 0000 0000 0000 0001 0101 1001 1111 1110 1010 1000 0101 1011 0011 1001 1010 0010
    11 1111 1111 1000 0000 0000 1101 0110 0000 0000 0000 0000 0000 0000 0000 0000 0000 0101
    00 0000 0000 0000 0001 1000 0000 0000 0111 1111 1000 0000 0000 0000 0000 0000 0000 1101

I have highlighted the first fields of the first word, just to show how
non-byte-aligned this sort of data is.

This same kind of data is sometimes padded with extra bits, which would show
up on the left. Two extra bits is pretty common, since each "word" is then an
even 9 bytes (72 bits), which would make a hex representation potentially
useful. But I've also seen 5 bits of padding, and 75-bit words are no help.
So a user needs to be able to say how wide they want the presentation, in
bits.
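
To make that concrete, here is a minimal sketch (in Scala; not Daffodil code,
and all names are made up) of a display routine where the user chooses the
line width in bits: the raw bits are chopped into words of that width and
each word is grouped into 4-bit nibbles, with any partial group (width mod 4)
left over at the left end, as in the 70-bit lines above.

    object BitWidthDisplaySketch {
      /** Render a string of '0'/'1' characters as lines of `width` bits,
        * grouped into 4-bit nibbles, with any partial group at the left. */
      def lines(allBits: String, width: Int): Seq[String] =
        allBits.grouped(width).toSeq.map { word =>
          val lead = word.length % 4
          val head = if (lead == 0) Seq.empty[String] else Seq(word.take(lead))
          (head ++ word.drop(lead).grouped(4)).mkString(" ")
        }
    }

With width = 70 this reproduces the grouping of the four lines above; with
width = 72 (the 2-bits-of-padding case) the partial group disappears and every
line starts on a nibble boundary. This deals only with width and grouping;
putting the bytes into right-to-left order is a separate step (see the sketch
further below).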

That data format (the 4 lines of bits above) is often preceded by a
32-byte-long, big-endian, mostSignificantBitFirst header, all of which is
byte-oriented, byte-aligned data and is most easily understood by looking
at an ordinary L-to-R hex dump.

Hence, users need to be able to examine a file of this sort of data and
break it at byte 32, so that from byte 33 (base-1 numbering) onwards the
next 35 bytes (70 bits x 4 = 280 bits = 35 bytes) use the bit-oriented,
70-bit-wide display. A typical data file contains many such header+message
pairs, so one must be able to switch back and forth between presentations
of the data.
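
One way to model that (purely a sketch, with made-up names, not an existing
API) is to describe the file as a sequence of regions, each carrying its own
presentation:

    object RegionSketch {
      sealed trait Display
      case object HexLtoR extends Display                       // ordinary L-to-R hex dump
      final case class BitsRtoL(bitWidth: Int) extends Display  // bit display, bytes R-to-L

      final case class Region(byteOffset: Long, byteLength: Long, display: Display)

      // One header+message pair of the kind described above; a real file
      // would repeat this pattern many times.
      val example = Seq(
        Region(0, 32, HexLtoR),        // 32-byte byte-aligned big-endian header
        Region(32, 35, BitsRtoL(70))   // 4 words x 70 bits = 280 bits = 35 bytes
      )
    }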

You should also look at this bit-order tutorial:
https://daffodil.apache.org/tutorials/bitorder.tutorial.tdml.xml which
also discusses R-to-L byte display.
This tutorial should convince you there is no need to reorder the bits,
only the bytes. I.e., in the above 70-bit words, the first byte is
"1000 1100", regardless of whether the presentation is L-to-R or R-to-L.

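In code the rule is as simple as it sounds. A sketch (again just
illustrative, not Daffodil's implementation): chop the word's bit string
into bytes from the front (the last chunk may be partial) and reverse the
chunks, leaving the bits inside each byte alone.

    object ByteOrderSketch {
      /** Reorder a word's bits for R-to-L byte display: the first byte ends
        * up at the right-hand end of the line; bits within each byte are
        * untouched. Nibble grouping would be applied afterwards. */
      def bytesRightToLeft(wordBits: String): String =
        wordBits.grouped(8).toSeq.reverse.mkString
    }

So a 70-bit word whose first byte is "1000 1100" still shows exactly that bit
pattern; it just occupies the rightmost 8 positions of the displayed line.
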
The Daffodil CLI debugger has a "data dump" utility that creates R-to-L
dump displays like this:

fedcba9876543210   ffee ddcc bbaa 9988 7766 5544 3322 1100  87654321
cø€␀␀␀wü␚’gU€␀gä  63f8 8000 0000 77fc 1a92 6755 8000 67e4 :00000000
       ␀␀␁›¶þ␐HD                   00 0001 9bb6 fe10 4844 :00000010

That example is in
daffodil-io/src/test/scala/org/apache/daffodil/io/TestDump.scala.
The chars on the left are the iso-8859-1 code points, except that control
characters are represented by the Unicode control-pictures characters.

(Email isn't lining up these characters correctly because the
control-pictures characters (like those for NUL, DLE, SUB, etc.) are not
fixed width in this font. I don't think there is a fixed-width font in the
world with every Unicode code point in it.)
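
For reference, here is a much-simplified reconstruction of such a dump line
(the real code is in TestDump.scala and handles much more, e.g. alignment,
ragged final lines, and the C1 controls): reverse the line's bytes so the
first byte sits at the right next to the address, group the hex in 2-byte
pairs, and decode the same reversed bytes as iso-8859-1 for the text column,
substituting control-picture characters for the C0 controls.

    object RtoLDumpLineSketch {
      def line(lineBytes: Array[Byte], address: Int): String = {
        val rev = lineBytes.reverse                      // first byte goes to the right
        val hex = rev.grouped(2)
          .map(_.map(b => f"${b & 0xff}%02x").mkString)
          .mkString(" ")
        val text = rev.map { b =>
          val c = (b & 0xff).toChar                      // iso-8859-1: byte value == code point
          if (c < 0x20) (0x2400 + c).toChar else c       // U+2400... are the control pictures
        }.mkString
        f"$text  $hex :$address%08x"
      }
    }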

There are also examples there of L-to-R dumps for utf-8, utf-16, and utf-32
data. E.g., this is utf-8 with some 3-byte Kanji chars:

87654321  0011 2233 4455 6677 8899 aabb ccdd eeff  0~1~2~3~4~5~6~7~8~9~a~b~c~d~e~f~
00000000: 4461 7465 20e5 b9b4 e69c 88e6 97a5 3d32 D~a~t~e~␣~年~~~~月~~~~日~~~~=~2~
00000010: 3030 33e5 b9b4 3038 e69c 8832 37e6 97a5 0~0~3~年~~~~0~8~月~~~~2~7~日~~~~

Character sets are in general quite problematic, as some of them include
shift characters that change which character subsequent bytes correspond to.
Mojibake <https://en.wikipedia.org/wiki/Mojibake> is sometimes unavoidable.
Defaulting to just iso-8859-1 (where every byte is a valid character) is
perfectly reasonable in many situations and is probably fine for a first cut.
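
As a quick illustration of why that default is safe (e.g. in a Scala REPL):
every one of the 256 byte values decodes to exactly one iso-8859-1 character,
and the decode round-trips, so no input can fail to display. Whether the
result is meaningful text is a separate question.

    import java.nio.charset.StandardCharsets.ISO_8859_1

    val allBytes = (0 to 255).map(_.toByte).toArray
    val text     = new String(allBytes, ISO_8859_1)
    assert(text.length == 256)                                // one char per byte, always
    assert(text.getBytes(ISO_8859_1).sameElements(allBytes))  // lossless round trip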
