On 2021-12-16, Laslo Hunhold <[email protected]> wrote:
> I know this thread is already long enough, but I took my time now to
> read deeper into the topic. Please read below, as we might come to a
> conclusion there now.

Thanks for sticking with it. I know this topic is quite pedantic and
hypothetical, but I think it is still important to consider and
understand.

> Interestingly, there was even an internal discussion on the
> gcc-bugtracker[0] about this. They were thinking about adding an
> attribute __attribute__((no_alias)) to the uint8_t typedef so it
> would explicitly lose the aliasing-exception.
>
> There's a nice rant on [1] and a nice discussion on [2] about this
> whole thing. And to be honest, at this point I still wasn't 100%
> satisfied.

Thanks for the links. The aliasing discussion in [0] is very
interesting, and I will definitely bookmark [1] to use as a reference
in the future.

> What convinced me was how they added UTF-8 literals in C11. There
> you can define explicit UTF-8 literals as u8"Hällö Wörld!" and
> they're of type char[]. So even though char * is a bit ambiguous, we
> document well that we expect a UTF-8 string. C11 goes further and
> accommodates us with ways to portably define them.

Interestingly, there is a C23 proposal[0] to introduce char8_t as a
typedef for unsigned char and to change the type (!) of UTF-8 string
literals from char[] to char8_t[] (i.e. unsigned char[]). It has not
been discussed in any meeting yet, but it will be interesting to see
what the committee thinks of it. I don't think u8 string literals are
widely used at this point, but it is strange to see a proposal break
backwards compatibility like this.

[0] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm
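If I read the proposal correctly, something like the following would
be a minimal example of the break (my own hypothetical snippet, not
taken from the proposal):

#include <stdio.h>

int
main(void)
{
	/* fine in C11/C17, where u8"..." has type char[]; under the
	 * N2653 proposal it would have type char8_t[], i.e. unsigned
	 * char[], so this initialization would need a cast */
	const char *s = u8"Hällö Wörld!";

	printf("%s\n", s);

	return 0;
}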
> To also address this point, here's what we can do to make us all
> happy:
>
> 1) Change the API to accept char*
> 2) Cast the pointers internally to (unsigned char *) for bitwise
>    modifications. We may do that as we may alias with char, unsigned
>    char and signed char.
> 3) Treat it as an invalid code point when any bit higher than the
>    9th is set. This is actually already in the implementation, as we
>    have strict ranges.
>
> Please take a look at the attached diff and let me know what you
> think. Is this portable and am I correct to assume we might even
> handle chars longer than 8 bit properly?

I agree with all of this. Your patch looks good to me.
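Just to make sure I understand points 1) to 3), here is a minimal
sketch of how I picture the internal access (hypothetical function
name, not taken from your diff):

#include <limits.h>
#include <stddef.h>

/*
 * Sketch of points 1) to 3): the public interface takes char *, the
 * bytes are read through unsigned char * (which may alias any
 * object), and anything outside the octet range 0x00..0xFF (only
 * possible when CHAR_BIT > 8) is treated as invalid.
 */
static int
is_valid_octet(const char *str, size_t i)
{
	const unsigned char *s = (const unsigned char *)str;

#if CHAR_BIT > 8
	if (s[i] > 0xFF) {
		return 0;
	}
#endif
	(void)s;

	return 1;
}

On the usual platforms with an 8-bit char the range check compiles
away entirely, which also avoids an "always false comparison"
warning.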
> There's just one open question: Do you know of a better way than to
> do
>
>	(char *)(unsigned char[]){ 0xff, 0xef, 0xa0 }
>
> to specify a literal char-array with specific bit-patterns?

I believe "\xff\xef\xa0" also works, but I am not very confident
about this; the wording of the standard is not clear to me. It says
(6.4.4.4p6):

> The hexadecimal digits that follow the backslash and the letter x
> in a hexadecimal escape sequence are taken to be part of the
> construction of a single character for an integer character
> constant or of a single wide character for a wide character
> constant. The numerical value of the hexadecimal integer so formed
> specifies the value of the desired character or wide character.

Okay, so '\xff' constructs a single character with value 255. But is
'\xff' considered an integer character constant containing a single
character? Then (6.4.4.4p10):

> An integer character constant has type int. The value of an integer
> character constant containing a single character that maps to a
> single-byte execution character is the numerical value of the
> representation of the mapped character interpreted as an integer.

Does this sentence apply? I am not sure, because later sentences
mention escape sequences explicitly, and it is not clear that 255
maps to a single-byte execution character if CHAR_MAX == 127. I am
also not sure how to parse the last part of the sentence (some
grouping parentheses would be helpful): the representation of 255 is
11111111, so what does it mean to interpret that as an integer (of
what width)?

> The value of an integer character constant containing more than one
> character (e.g., 'ab'), or containing a character or escape
> sequence that does not map to a single-byte execution character, is
> implementation-defined.

If '\xff' is considered not to map to a single-byte execution
character, then this sentence would indicate that the value is
implementation-defined.

> If an integer character constant contains a single character or
> escape sequence, its value is the one that results when an object
> with type char whose value is that of the single character or
> escape sequence is converted to type int.

What does it mean for a char to have the value of the escape
sequence, given that char may not be able to represent 255? And why
are there two sentences that specify the value of an integer
character constant containing a single character? If the first one
applies, is this one ignored?

The main thing that indicates to me that the value is defined is
example 2 in that section (6.4.4.4p13):

> Consider implementations that use two's complement representation
> for integers and eight bits for objects that have type char. In an
> implementation in which type char has the same range of values as
> signed char, the integer character constant '\xFF' has the value
> -1; if type char has the same range of values as unsigned char, the
> character constant '\xFF' has the value +255.

It mentions two's complement and 8-bit char explicitly, and says
'\xFF' has the value -1 (not "may have"). This makes me think that I
should somehow be able to justify this using the above paragraphs.

So I can't say for sure, and I haven't had much luck searching the
web for discussion about this, but I think it should be fine to use
hex escapes to construct string literals with specific bit patterns
(at the very worst, the values are implementation-defined).
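As a quick sanity check (which of course proves nothing about the
standard's intent), the two spellings can be compared directly; on
every implementation I would expect to encounter, this prints
"same bytes":

#include <stdio.h>
#include <string.h>

int
main(void)
{
	/* the string literal carries a terminating NUL byte that the
	 * compound literal does not, so only three bytes are compared */
	const char *lit = "\xff\xef\xa0";
	const char *arr = (char *)(unsigned char[]){ 0xff, 0xef, 0xa0 };

	puts(memcmp(lit, arr, 3) == 0 ? "same bytes" : "different bytes");

	return 0;
}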

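And to see example 2 from 6.4.4.4p13 in action, something like the
following should print -1 on a two's-complement implementation where
char is signed and 8 bits wide, and 255 where char is unsigned:

#include <limits.h>
#include <stdio.h>

int
main(void)
{
	/* matches example 2 in 6.4.4.4p13: -1 on signed-char
	 * implementations, 255 on unsigned-char implementations */
	printf("CHAR_MIN = %d, '\\xff' = %d\n", CHAR_MIN, '\xff');

	return 0;
}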