On Thu, 16 Dec 2021 14:01:48 -0800 Michael Forney <[email protected]> wrote:

Dear Michael,

> Thanks for sticking with it. I know this topic is quite pedantic and
> hypothetical, but I think it's still important to consider and
> understand.

yeah definitely! Most probably think that we're crazy discussing this
stuff for so long, but it's imperative to have a "stable" API before
releasing version 1.

> Thanks for the links. The aliasing discussion in [0] is very
> interesting, and I will definitely bookmark [1] to use as a reference
> in the future.

I'm glad you can make use of it!

> Interestingly, there is a C23 proposal[0] to introduce char8_t as a
> typedef for unsigned char and change the type (!) of UTF-8 string
> literals from char * to char8_t * (aka unsigned char *). It has not
> been discussed in any meeting yet, but it will be interesting to see
> what the committee thinks of it. I don't think u8 string literals are
> widely used at this point, but it's weird to see a proposal breaking
> backwards compatibility like this.
>
> [0] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm

I stumbled upon that as well.

> I agree with all of this. Your patch looks good to me.

Thanks for checking the patch! Nice to hear that you agree.

> > The hexadecimal digits that follow the backslash and the letter x
> > in a hexadecimal escape sequence are taken to be part of the
> > construction of a single character for an integer character constant
> > or of a single wide character for a wide character constant. The
> > numerical value of the hexadecimal integer so formed specifies the
> > value of the desired character or wide character.
>
> Okay, so '\xff' constructs a single character with value 255. But, is
> '\xff' considered an integer character constant containing a single
> character?
>
> Then (6.4.4.4p10):
>
> > An integer character constant has type int. The value of an integer
> > character constant containing a single character that maps to a
> > single-byte execution character is the numerical value of the
> > representation of the mapped character interpreted as an integer.
>
> Does this one apply? Not sure because later sentences mention escape
> sequences explicitly, and it's not clear if 255 maps to a single-byte
> execution character if CHAR_MAX == 127. Also, I'm not sure how to
> parse the last part of the sentence (some grouping parentheses would
> be helpful). The representation of 255 is 11111111, so what does it
> mean to interpret as an integer (of what width)?
>
> > The value of an integer character constant containing more than one
> > character (e.g., 'ab'), or containing a character or escape sequence
> > that does not map to a single-byte execution character, is
> > implementation-defined.
>
> If '\xff' is considered to not map to a single-byte execution
> character, then this would indicate that it's implementation-defined.
>
> > If an integer character constant contains
> > a single character or escape sequence, its value is the one that
> > results when an object with type char whose value is that of the
> > single character or escape sequence is converted to type int.
>
> What does it mean for a char to have value of the escape sequence,
> since char may not be able to represent 255? Why are there two
> sentences that specify the value of an integer character constant
> containing a single character? If the first one applies, is this one
> ignored?
>
> The main thing that indicates to me that it is defined is example 2 in
> that section (6.4.4.4p13):
>
> > Consider implementations that use two's complement representation
> > for integers and eight bits for objects that have type char. In an
> > implementation in which type char has the same range of values as
> > signed char, the integer character constant '\xFF' has the value
> > -1; if type char has the same range of values as unsigned char, the
> > character constant '\xFF' has the value +255.
>
> It mentions two's complement and 8-bit char explicitly, and says
> '\xFF' has the value -1 (not "may have"). This makes me think that I
> should somehow be able to justify this using the above paragraphs.
>
> So I can't say for sure, and I haven't been very lucky with searching
> the web for discussion about this, but I think it should be fine to
> use hex escapes to construct string literals with specific bit
> patterns (at the very worst it is implementation defined).

Thanks for digging through the standard! This was exactly the same
pitfall I was facing and I'm not sure, to be honest. After all, I think
just building an unsigned char-array and casting it to (char *) is
probably the safest way to go. :)

I'll push the commit and add a manpage for the UTF-8-functions. At that
point, we should be ready for a first release.

With best regards

Laslo
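
For reference, a minimal sketch of the two approaches discussed above:
a byte sequence built as an unsigned char array and viewed through a
(char *) cast, next to the same bytes written with hex escapes. The
names as_cstr, a_umlaut, literal_matches and expected_xff are made up
for illustration and are not part of any library or of the patch. The
sketch assumes 8-bit char and two's complement, as in example 2 of
6.4.4.4p13. Whether the hex-escape literal yields the same bytes is at
worst implementation-defined, so literal_matches() only reports what
the compiler at hand does.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Assumption: 8-bit char, as in example 2 of 6.4.4.4p13. */
_Static_assert(CHAR_BIT == 8, "sketch assumes 8-bit char");

/* "ä" (U+00E4) encoded as UTF-8, plus the terminating NUL byte.
 * Every value fits in unsigned char, so nothing here depends on the
 * signedness of plain char. */
static const unsigned char a_umlaut[] = { 0xC3, 0xA4, 0x00 };

/* Hypothetical helper: view an unsigned char buffer as a C string.
 * Accessing any object through a character type is permitted. */
static const char *
as_cstr(const unsigned char *bytes)
{
	assert(bytes != NULL);
	return (const char *)bytes;
}

/* Hypothetical helper: returns 1 if the hex-escape string literal has
 * the same object representation as the explicit byte array. */
static int
literal_matches(void)
{
	return memcmp("\xC3\xA4", a_umlaut, sizeof(a_umlaut)) == 0;
}

/* Hypothetical helper: the value of '\xFF' according to example 2,
 * i.e. -1 if plain char is signed and 255 if it is unsigned. */
static int
expected_xff(void)
{
	return (CHAR_MIN < 0) ? -1 : 255;
}
```

Calling strlen(as_cstr(a_umlaut)) gives 2 regardless of how the
compiler treats hex escapes, which is why the array route is the
conservative choice. Comparing '\xFF' against expected_xff() and
checking literal_matches() shows how a given implementation resolves
the paragraphs quoted above.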
