Some requirements that came up in a non-email discussion:

* ability to edit and save data - editing at the hex and bit level. File size limits for edit are acceptable.
* display of bits
* right-to-left byte order
* word-size-sensitive display (e.g., the user can set the display to 70 bits wide)
* support for the parse & unparse cycle (comparison of the output data from unparse to the input data, in the hex/bits display)
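For the last bullet, the round-trip comparison could start as something as simple as a first-differing-bit report. A rough Python sketch (the function name and the LSB-first bit numbering are my assumptions, not anything Daffodil provides):

```python
from typing import Optional

def first_bit_mismatch(expected: bytes, actual: bytes) -> Optional[int]:
    """Compare original input data against re-unparsed output.

    Returns the 0-based index of the first differing bit, or None if
    the two are identical. Bit numbering is least-significant-bit
    first within each byte, matching the LSB-first data discussed
    below (an assumption about the formats of interest).
    """
    n = min(len(expected), len(actual))
    for i in range(n):
        diff = expected[i] ^ actual[i]
        if diff:
            # Lowest set bit of the XOR is the first differing bit
            # under LSB-first numbering.
            return i * 8 + (diff & -diff).bit_length() - 1
    if len(expected) != len(actual):
        return n * 8  # one side is a prefix of the other
    return None

# e.g. first_bit_mismatch(b"\x00", b"\x04") -> 2
```

The returned bit index could then drive the display: jump the hex/bits view to the word containing that bit.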
The rest of this email is a bunch of random pointers/ideas about binary data display/edit. Hopefully useful, not TL;DR.

This is what some least-significant-bit-first data lines look like. They are 70 bits wide, because that's the word size of the format. The byte order is right-to-left:

00 1100 0011 1000 0000 0000 0000 0000 0100 0000 0101 0000 0000 1000 0000 1110 1000 1100
00 0000 0000 0000 0000 0001 0101 1001 1111 1110 1010 1000 0101 1011 0011 1001 1010 0010
11 1111 1111 1000 0000 0000 1101 0110 0000 0000 0000 0000 0000 0000 0000 0000 0000 0101
00 0000 0000 0000 0001 1000 0000 0000 0111 1111 1000 0000 0000 0000 0000 0000 0000 1101

I have highlighted the first fields of the first word, just to show how non-byte-aligned this sort of data is.

This same kind of data is sometimes padded with extra bits, which would show up on the left. Two bits of padding is pretty common, as then each "word" is an even 9 bytes, which would make a hex representation potentially useful. But I've also seen 5 bits of padding, and 75-bit words are no help there. So a user needs to be able to say how wide they want the presentation, in bits.

The above 4 lines of bits... that data format is often preceded by a 32-byte-long big-endian most-significant-bit-first header, all of which is byte-oriented, byte-aligned data, and is most easily understood by looking at an ordinary L-to-R hex dump. Hence, users need to be able to examine a file of this sort of data and break it at byte 32, so that from byte 33 (base-1 numbering) onwards, the next 35 bytes (70 bits x 4 = 280 bits = 35 bytes) use the bit-oriented 70-bit-wide display. A typical data file will have many such header+message pairs, suggesting one must switch back and forth between presentations of the data.

You should also look at this bit-order tutorial: https://daffodil.apache.org/tutorials/bitorder.tutorial.tdml.xml which discusses R-to-L byte display also.
This tutorial should convince you there is no need to reorder the bits, only the bytes. I.e., in the above 70-bit words, the first byte is "1000 1100", regardless of whether the presentation is L-to-R or R-to-L.

The Daffodil CLI debugger has a "data dump" utility that creates R-to-L dump displays like this:

fedcba9876543210 ffee ddcc bbaa 9988 7766 5544 3322 1100 87654321
cø€␀␀␀wü␚’gU€␀gä 63f8 8000 0000 77fc 1a92 6755 8000 67e4 :00000000
␀␀␁›¶þ␐HD 00 0001 9bb6 fe10 4844 :00000010

That example is in daffodil-io/src/test/scala/org/apache/daffodil/io/TestDump.scala. The chars on the left are iso-8859-1 code points, except for the control-pictures characters used to represent the code points that are control codes. (Email isn't lining up these characters correctly, due to the control-pictures characters (like those for NUL, DLE, SUB, etc.) not being fixed-width in this font. I don't think there is a fixed-width font in the world with every Unicode code point in it.)

There are also examples there of L-to-R dumps for utf-8, utf-16, and utf-32 data. E.g., this is utf-8 with some 3-byte Kanji chars:

87654321 0011 2233 4455 6677 8899 aabb ccdd eeff 0~1~2~3~4~5~6~7~8~9~a~b~c~d~e~f~
00000000: 4461 7465 20e5 b9b4 e69c 88e6 97a5 3d32 D~a~t~e~␣~年~~~~月~~~~日 ~~~~=~2~
00000010: 3030 33e5 b9b4 3038 e69c 8832 37e6 97a5 0~0~3~年~~~~0~8~月~~~~2~7~日 ~~~~

Character sets are in general quite problematic, as there are some that include shift chars, which change the interpretation of subsequent bytes as to what character they correspond to. Mojibake <https://en.wikipedia.org/wiki/Mojibake> is sometimes unavoidable. Defaulting to just iso-8859-1 (where every byte is a valid character) is perfectly reasonable in many situations and is probably fine for a first cut.
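If it helps, the left-hand character column of such a dump can be generated with the Unicode Control Pictures block (U+2400-U+2421). A small Python sketch (the fallback for the C1 range 0x80-0x9f, which has no control pictures, is my own choice; the Daffodil example above shows 0x80 as "€", so it evidently uses a different fallback there):

```python
def dump_char(b: int) -> str:
    """Map one byte to one printable character for a dump's character
    column: iso-8859-1, with Unicode Control Pictures for controls."""
    if b < 0x20:
        return chr(0x2400 + b)  # C0 controls: NUL -> U+2400, DLE -> U+2410, ...
    if b == 0x20:
        return "\u2420"         # visible-space symbol (a display choice)
    if b == 0x7F:
        return "\u2421"         # DEL -> U+2421
    if 0x80 <= b <= 0x9F:
        return "\u00b7"         # C1 controls have no control pictures;
                                # middle dot is my choice of placeholder
    return bytes([b]).decode("iso-8859-1")

def char_column(data: bytes) -> str:
    """Character column for one row of a dump."""
    return "".join(dump_char(b) for b in data)
```

For example, char_column(b"\x00A \x7f") yields NUL-picture, "A", visible space, DEL-picture. A real implementation would still face the alignment problem noted above, since control pictures and CJK characters are rarely fixed-width in any one font.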