On Sat, 11 Dec 2021 15:18:56 -0800 Michael Forney <[email protected]> wrote:
Dear Michael,

> Just want to mention up front that all of below is what I believe to
> be true from my interpretation of the standard. I'm happy to be
> corrected if I am wrong about any of this.

thanks again for your detailed response!

> Neither conversion is undefined behavior, but unsigned char values
> > CHAR_MAX converted to char is implementation defined.
>
> Conversion of a negative char value to unsigned char is defined by
> C99 6.3.1.3p2:
>
> > Otherwise, if the new type is unsigned, the value is converted by
> > repeatedly adding or subtracting one more than the maximum value
> > that can be represented in the new type until the value is in the
> > range of the new type.
>
> Conversion of unsigned char values outside the range of char is
> implementation defined by C99 6.3.1.3p3:
>
> > Otherwise, the new type is signed and the value cannot be
> > represented in it; either the result is implementation-defined or
> > an implementation-defined signal is raised.

On Sat, 11 Dec 2021 15:33:12 -0800 Michael Forney <[email protected]> wrote:

> On 2021-12-11, Michael Forney <[email protected]> wrote:
> > Conversion of unsigned char values outside the range of char is
> > implementation defined by C99 6.3.1.3p3:
> >
> >> Otherwise, the new type is signed and the value cannot be
> >> represented in it; either the result is implementation-defined or
> >> an implementation-defined signal is raised.
>
> Also worth noting, this clause still remains even in the current C23
> draft, which requires two's complement. So, assuming that CHAR_MAX ==
> 127, (char)0xFD will continue to be implementation defined and might
> raise a signal. This is different from C++, which went a step further
> to define conversion between all integer types to be the unique value
> congruent modulo 2^N (where N is the number of bits of the
> destination type).

In [0] the gcc developers write in this regard: "For conversion to a
type of width N, the value is reduced modulo 2^N to be within range
of the type; no signal is raised." However, gcc seems to be a bit
pedantic when you convert a constant that lies outside the signed
range: it probably assumes you made a mistake and warns about it.

> >> > -             .arr = (uint8_t[]){ 0xFD },
> >> > +             .arr = (char[]){
> >> > +                     (unsigned char)0xFD,
> >> > +             },
> >>
> >> This cast doesn't do anything here. Both 0xFD and (unsigned
> >> char)0xFD have the same value (0xFD), which can't necessarily be
> >> represented as char. For example if CHAR_MAX is 127, this
> >> conversion is implementation defined and could raise a signal
> >> (C99 6.3.1.3p3).

Now we're getting closer: gcc doesn't warn, presumably because char
and unsigned char have the same conversion rank.

> >> I think using hex escapes in a string literal ("\xFD") has the
> >> behavior you want here. You could also create an array of unsigned
> >> char and cast to char *.
> >
> > From how I understood the standard it does make a difference.
> > "0xFD" as is is an int-literal and it prints a warning stating that
> > this cannot be cast to a (signed) char. However, it does not
> > complain with unsigned char, so I assumed that the standard somehow
> > safeguards it.
>
> I'm not sure why casting to unsigned char makes the warning go away.
> The only difference is the type of the expression (int vs unsigned
> char), but the rules in 6.3.1.3 don't care about the source type,
> only its value.
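As an aside: to make sure I understand these two rules in practice, I
condensed them into a small snippet. This is just my own sketch and
assumes that plain char is signed and 8 bits wide; the second
conversion is the implementation-defined one:

#include <stdio.h>

int
main(void)
{
	/* C99 6.3.1.3p2: well-defined on every implementation; with
	 * UCHAR_MAX == 255 this gives -3 + 256 = 253 = 0xFD */
	char c = -3;
	unsigned char u = (unsigned char)c;

	/* C99 6.3.1.3p3: implementation-defined, as 0xFD > CHAR_MAX;
	 * gcc reduces modulo 2^N per [0] and yields -3 again */
	char back = (char)u;

	printf("%d %d %d\n", (int)c, (int)u, (int)back); /* -3 253 -3 */

	return 0;
}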
> I'm not aware of any exception in the standard for unsigned char to
> char conversion (but if there is one, I'd be interested to know).
>
> > But when I got it correctly, you are saying that this only works
> > because I assume two's complement, right? So what's the portable
> > way to work with chars? :)
>
> I guess it depends specifically on what you are trying to do. If you
> want a char *, such that when it is cast to unsigned char * and
> dereferenced, you get some value 0xAB, you could write "\xAB", or
> (char *)(unsigned char[]){0xAB}. There isn't really a nice way to get
> a char such that converting to unsigned char results in some value,
> since this isn't usually what you want and can't be done in general
> (with sign-magnitude, there is no char such that converting to
> unsigned char results in 0x80).

Alright, and C99 gives the guarantee in C99 6.4.4.4p9: "The value of
an octal or hexadecimal escape sequence shall be in the range of
representable values for the type __unsigned char__ for an integer
character constant, or the unsigned type corresponding to wchar_t for
a wide character constant."

So at least for the test-cases, using hexadecimal escapes in a string
literal is probably the most elegant (see the small example further
below). This however doesn't solve the other direction (char ->
unsigned char for bit-fiddling).

> Regarding two's complement assumption, consider the UTF-8 encoding
> of α: 0xCE 0xB1 or 11001110 10110001. If you interpret that as two's
> complement, you get [-50, -79]. Converting to unsigned char will add
> 256, resulting in [0xCE, 0xB1] like you want. However, with
> sign-magnitude you get [-78, -49], converted to unsigned char is
> [0xB2, 0xCF] (and something else for one's complement). If you
> instead just interpret 11001110 10110001 as unsigned char, you get
> [0xCE, 0xB1] without depending on the signed integer representation.
> With C23, the only possible interpretation of 11001110 10110001 as
> signed char is [-50, -79], so it doesn't matter if you go through
> char or directly to unsigned char, the result is the same.
>
> Really, I think UTF-8 encoding stored in char * is kind of a lie,
> since it doesn't really make sense to talk about negative code
> units, but it is useful so that you can still use standard string
> libc functions. The string.h functions are even specified to
> interpret as unsigned char (C99 7.21.1p3):
>
> > For all functions in this subclause, each character shall be
> > interpreted as if it had the type unsigned char (and therefore
> > every possible object representation is valid and has a different
> > value).

So would you say that the only good way would be to only accept
arrays of unsigned char in the API? I think this is the logical
conclusion.

When I read more I found out that C++ introduced static_cast and
reinterpret_cast for this simple reason: Assuming some crazy
signed-int-representation we just make up in our heads (some random
permutation mapping 0..255 to -128..127), it is impossible to really
know the intent of the user passing us a (signed) char-array. Let's
say "0b01010101" means "0" in our crazy signed type: does the user
intend to convey to us a null-byte (which is simply "encoded" in the
signed type), or does he literally mean "0b01010101"? With
static_cast and reinterpret_cast you can handle both cases
separately.

One might say: 'Ah well, what does it matter?! You can rely on the
implementation and assume that the user always meant the former!'
However, this can really become a footgun if we're talking about
FFIs.
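Here is the small example I promised above, a sketch of my current
understanding using the UTF-8 encoding of α from your example.
Nothing in it should depend on the signed char representation:

#include <stdio.h>
#include <string.h>

int
main(void)
{
	/* two portable spellings for the same object representation */
	const char *a = "\xCE\xB1"; /* hex escapes (C99 6.4.4.4p9) */
	const char *b = (char *)(unsigned char[]){ 0xCE, 0xB1, 0x00 };

	/* reading the bytes back through unsigned char * does not
	 * depend on the signed integer representation */
	const unsigned char *u = (const unsigned char *)a;
	printf("%02X %02X\n", (unsigned)u[0], (unsigned)u[1]); /* CE B1 */

	/* string.h interprets each char as unsigned char anyway
	 * (C99 7.21.1p3), so the usual functions just work */
	printf("%zu %zu\n", strlen(a), strlen(b)); /* 2 2 */

	return 0;
}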
If I wrote an FFI to libgrapheme in some external language, I'd be
happier to see an explicit unsigned char array rather than some
signed-char-footgun, due to the above reasons, even if we can make it
work in some way within C.

My initial intent was to handle systems that don't have an 8-bit
integer type. This might sound crazy nowadays, but if you were really
stuck on Mars with such a machine and you really had to work with
UTF-8, you would simply read e.g. a UTF-8 encoded file and store each
octet within the low bits of e.g. a 16-bit integer. Writing would
work analogously. In stdint-lingo you would want the type
uint_least8_t, but that's what unsigned char is defined to be (an
unsigned integer type of at least 8 bits).

Two questions remain:

   1) Would you also go down the route of just demanding an array of
      unsigned integers of at least 8 bits?
   2) Would you define it as "unsigned char *" or "uint_least8_t *"?
      I'd almost favor the latter, given the entire library is
      already using the stdint-types.

With best regards

Laslo

[0]: http://gcc.gnu.org/onlinedocs/gcc/Integers-implementation.html#Integers-implementation
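P.S.: To make question 2) a bit more concrete, this is how I picture
the two candidate prototypes side by side. The function name and
parameters are of course made up for illustration and are not the
actual libgrapheme API:

#include <stddef.h>
#include <stdint.h>

/* variant a: classic libc-flavored byte interface */
size_t grapheme_next_break_a(const unsigned char *str, size_t len);

/* variant b: consistent with the stdint-types used in the rest of
 * the library; uint_least8_t is in practice unsigned char */
size_t grapheme_next_break_b(const uint_least8_t *str, size_t len);

Both declare the same contract, given that unsigned char is an
unsigned integer type of at least 8 bits; the difference is only what
the prototype communicates to the reader (and to an FFI generator).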
