On Wed, 04 Apr 2018, to...@tuxteam.de wrote:
> On Wed, Apr 04, 2018 at 08:18:23AM -0300, Henrique de Moraes Holschuh wrote:
> > On Tue, 03 Apr 2018, Michael Lange wrote:
> > > I believe (please anyone correct me if I am wrong) that "text" files
> > > won't contain any null byte; many text editors even refuse to open such a
> >
> > Depends on the encoding.  For ASCII, ISO-8859-* and UTF-8 (and any other
> > modern encoding AFAIK, other than modified UTF-8), any zero bytes map
> > one-to-one to the NUL character/code point.  I don't recall how it is on
> > other common encodings of the 80's and 90's, though.
>
> Try UTF-16, which Microsoft (and, until a couple of years ago, Apple) love
> to call "Unicode": in more "Western" contexts every second byte is NULL!
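A quick Python sketch (mine, not from the thread) makes the point concrete: the same ASCII-range text contains no zero bytes under the single-byte encodings or UTF-8, but is riddled with them under either UTF-16 byte order:

```python
# Count the zero bytes produced when the same ASCII-range text is
# encoded under several common encodings.
text = "plain Western text"

for enc in ("ascii", "iso-8859-1", "utf-8", "utf-16-le", "utf-16-be"):
    data = text.encode(enc)
    print(f"{enc:10s} -> {len(data):3d} bytes, {data.count(0):3d} of them zero")
```

For this 18-character string, the single-byte encodings and UTF-8 produce zero NUL bytes, while both UTF-16 variants produce 36 bytes of which 18 are zero: exactly "every second byte".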
Ah, yes.  I forgot about them, indeed.  UTF-16BE and UTF-16LE will have
zero bytes in the resulting byte stream.  And I suppose one could call
them "modern encodings", even if they are horrifying to use when compared
to UTF-8 (UTF-16 has byte-order issues) or UTF-32 (UTF-16 has surrogate
pairs).

> > Some even-more-modern encodings (modified UTF-8 :p) simply do NOT use
> > bytes with the value of zero when encoding characters, so NUL is encoded
> > by a different sequence, and you can safely use a byte with the value of
> > zero for some out-of-band control [...]
>
> Yes, the problem is that someone else before you could have been doing
> exactly that.

You can use modified-UTF-8 bit-packing to encode anything, and the result
will be zero-free (and it will restore the zeroes when decoded).  The
price is a size increase (it is a variant of UTF-8 that uses two bytes to
encode NUL, which would take just one byte in normal UTF-8).  There are
much better bit-packing schemes if you just need to escape zeroes ;-)

That said, it is always safe to break valid "modified UTF-8" into records
using zeroes, as long as you don't expect the result to be valid UTF-8
(it isn't, because NULs will be encoded using a non-minimal byte sequence
that *will* decode to a zero even though it is invalid) or valid modified
UTF-8 (it isn't, because 0 is not a valid encoding of NUL in modified
UTF-8).  But a lax UTF-8 or modified-UTF-8 decoder *would* parse
"modified UTF-8 with zeroes as record separators" and reconstruct the
Unicode text properly (although it would read the record separators as
NULs, so you'd get extra NULs in the resulting text).

That, of course, assumes you have Unicode text as the input (the source
encoding doesn't matter, as long as you know it), and recode it to
modified UTF-8 before you add the zeroes as end-of-record marks.  This is
not about bit-packing generic binary data.

> I'd guard against that.
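The trick above can be sketched in a few lines of Python (the helper names are mine; Python has no built-in modified-UTF-8 codec, but the only difference from UTF-8 is that NUL becomes the overlong pair 0xC0 0x80, which a plain byte substitution captures):

```python
# A minimal sketch of the modified-UTF-8 trick described above:
# NUL (U+0000) is encoded as the overlong pair 0xC0 0x80, so the
# encoded stream never contains a zero byte, leaving 0x00 free to
# serve as an out-of-band record separator.

def to_modified_utf8(s: str) -> bytes:
    # Standard UTF-8 encodes NUL as a single 0x00 byte; swap in the
    # two-byte overlong form (the size cost mentioned above).
    return s.encode("utf-8").replace(b"\x00", b"\xc0\x80")

def from_modified_utf8(b: bytes) -> str:
    # Undo the overlong pair before handing the bytes to a strict
    # UTF-8 decoder (which rightly rejects 0xC0 0x80 itself).
    return b.replace(b"\xc0\x80", b"\x00").decode("utf-8")

records = ["first\x00record", "second"]
stream = b"\x00".join(to_modified_utf8(r) for r in records)

assert 0 not in to_modified_utf8(records[0])   # encoded records are zero-free
assert [from_modified_utf8(c) for c in stream.split(b"\x00")] == records
```

The byte substitution is safe here because valid UTF-8 never emits the byte 0xC0 at all, so in the encoded stream 0xC0 0x80 can only be the NUL pair we inserted.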
> It's not exactly difficult, the traditional
> "escape" mechanism (aka character stuffing) does it pretty well...

Yes, any bitstuffing/escape-based wrapping would do.

-- 
Henrique Holschuh